Re: Resolving term vector even when not stored?
Mike Klaas [EMAIL PROTECTED] wrote on 16/03/2007 14:26:46: On 3/15/07, karl wettin [EMAIL PROTECTED] wrote: I propose a change to the current IndexReader.getTermFreqVector/s code so that it /always/ returns the vector space model of a document, even when fields are set to Field.TermVector.NO. Is that crazy? Could be really slow, but except for that.. And if it is cached then that information is known by inspecting the fields. People don't go fetching term vectors without knowing what they are doing, do they? The highlighting contrib code does this: attempt to retrieve the term vector, catch InvalidArgumentException, fall back to re-analysis of the data. This way makes more sense to me. IndexReader.getTermFreqVector() means it's there, just bring it, while the fall-back is more of a computeTermFreqVector(), which takes much more time. Users would likely prefer getting an exception from the get() (oops, term vectors were not saved..) rather than auto falling back to an expensive computation. This functionality seems proper as a utility, so it can be reused, I think perhaps in contrib? I'm not sure if that is crazy, but that is what is currently implemented. -Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
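The fall-back described here (re-analyze the field text when no term vector was stored) could be packaged as the reusable utility Doron suggests. A minimal sketch, with all names invented for illustration; whitespace tokenization stands in for the field's real Analyzer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "computeTermFreqVector" fallback discussed
// above: when no term vector was stored at index time, rebuild one by
// re-analyzing the stored field text. A real implementation would run
// the field's Analyzer; lowercased whitespace splitting stands in here.
public class TermVectorFallback {

    public static Map<String, Integer> computeTermFreqVector(String fieldText) {
        Map<String, Integer> freqs = new LinkedHashMap<String, Integer>();
        for (String token : fieldText.toLowerCase().split("\\s+")) {
            if (token.length() == 0) continue;
            Integer f = freqs.get(token);
            freqs.put(token, f == null ? 1 : f + 1);
        }
        return freqs;
    }
}
```

This is much more expensive than reading a stored vector, which is why the thread argues it should be a separately named utility rather than a silent fallback inside get().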
Re: Resolving term vector even when not stored?
On 17 Mar 2007, at 08.15, Doron Cohen wrote: Mike Klaas [EMAIL PROTECTED] wrote on 16/03/2007 14:26:46: On 3/15/07, karl wettin [EMAIL PROTECTED] wrote: I propose a change to the current IndexReader.getTermFreqVector/s code so that it /always/ returns the vector space model of a document, even when fields are set to Field.TermVector.NO. Is that crazy? Could be really slow, but except for that.. And if it is cached then that information is known by inspecting the fields. People don't go fetching term vectors without knowing what they are doing, do they? The highlighting contrib code does this: attempt to retrieve the term vector, catch InvalidArgumentException, fall back to re-analysis of the data. This way makes more sense to me. IndexReader.getTermFreqVector() means it's there, just bring it, The way I look at it, the vector space model is there all the time, and Field.TermVector.YES really means Field.TermVector.Level1Cached. Also, I would not mind a soft-referenced map in IndexReader that keeps track of all resolved term vectors. Perhaps that should be a decoration. -- karl
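Karl's soft-referenced map of resolved term vectors might look roughly like this. The class name and shape are invented for illustration; the point of SoftReference is that the garbage collector may reclaim cached vectors under memory pressure:

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (names invented) of a soft-referenced cache for
// expensively resolved term vectors: entries live as long as memory
// allows, and the JVM may clear them under pressure, which is intended.
public class SoftTermVectorCache<V> {

    private final Map<Integer, SoftReference<V>> cache =
            new HashMap<Integer, SoftReference<V>>();

    public synchronized V get(int docId) {
        SoftReference<V> ref = cache.get(docId);
        // null if the vector was never cached, or was collected
        return ref == null ? null : ref.get();
    }

    public synchronized void put(int docId, V termVector) {
        cache.put(docId, new SoftReference<V>(termVector));
    }
}
```

Implemented as a decorator around IndexReader, as Karl suggests, the cache could be added without changing the reader itself.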
Re: Indexing time taken is too long - Help Appreciated.
On 17 Mar 2007, at 06.01, Lokeya wrote: Help Appreciated. There are even more helpful people on the java-user list. You have a greater chance of getting a good answer in time there, as this forum focuses on development of the actual API rather than consumer implementations. -- karl
[jira] Created: (LUCENE-834) Payload Queries
Payload Queries --- Key: LUCENE-834 URL: https://issues.apache.org/jira/browse/LUCENE-834 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Grant Ingersoll Assigned To: Grant Ingersoll Priority: Minor Now that payloads have been implemented, it will be good to make them searchable via one or more Query mechanisms. See http://wiki.apache.org/lucene-java/Payload_Planning for some background information and https://issues.apache.org/jira/browse/LUCENE-755 for the issue that started it all. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Storing whole document in the index
Hello, It's been a while that I have been using Lucene and, as most people seemingly do, I used to save only some important fields of a document in the index. But recently I thought: why not store the whole document's bytes as an untokenized field in the index in order to ease the retrieval process? For example, serialize the PDF file into a byte[] and then save the bytes as a field in the index (some gzip and base64 encodings may be needed as glue logic). Then I can delete the original file from the system. Is there any reason against this idea? Can Lucene bear this large volume of input streamed data?
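The gzip/base64 "glue logic" mentioned here is a straightforward round trip. A sketch (the class name is invented, and java.util.Base64 is used for brevity; any base64 codec works):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of gzipping raw document bytes and base64-encoding them so
// they can be stored in an untokenized String field, plus the reverse.
public class BlobCodec {

    public static String encode(byte[] raw) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gz = new GZIPOutputStream(bos);
            gz.write(raw);
            gz.close(); // finishes the gzip stream, flushing trailer bytes
            return Base64.getEncoder().encodeToString(bos.toByteArray());
        } catch (IOException e) {
            throw new RuntimeException(e); // in-memory streams should not fail
        }
    }

    public static byte[] decode(String stored) {
        try {
            byte[] gzipped = Base64.getDecoder().decode(stored);
            GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = gz.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Note that base64 inflates the gzipped payload by about a third, so for large binaries it is worth checking whether a binary-valued stored field can hold the gzipped bytes directly.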
Monster Jobs search
Hello Peter, Now that the Monster Lucene search is live, is performance pretty good? Are you still running it on a single 8-core server? Can you give us a rough idea of the number of queries you can handle per second and the number of docs in the index? Are you using dotLucene or a Java webservice tier? How did you implement your bounding box for the searching? It sounds like you do this outside of Lucene and return a custom HitCollector. Why not use a RangeQuery or FunctionQuery for the basic bounding before sorting? Thanks, Eric
Re: Storing whole document in the index
Please ask this type of question on the user mailing list; you will get much better responses. The dev list is for developers of Lucene. To answer your question, yes you can do this. You may find the FieldSelector API additions and Lazy Field Loading to be helpful performance-wise, as well. -Grant On Mar 17, 2007, at 8:36 AM, jafarim wrote: Hello, It's been a while that I have been using Lucene and, as most people seemingly do, I used to save only some important fields of a document in the index. But recently I thought: why not store the whole document's bytes as an untokenized field in the index in order to ease the retrieval process? For example, serialize the PDF file into a byte[] and then save the bytes as a field in the index (some gzip and base64 encodings may be needed as glue logic). Then I can delete the original file from the system. Is there any reason against this idea? Can Lucene bear this large volume of input streamed data? -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
Commit and Review (was Is Lucene Java trunk still stable for production code?)
Hoss wrote: (or in short: we're moving more towards a *true* commit and review model) I'm curious as to what you think the practical implications are for committers of this model? Do you imagine a change in the workflow whereby we commit and then review, or do we stick to the patch approach as committers (contributors will always submit patches)? It has always been a gray area, where we all kind of know what we can commit w/o creating patches for versus what we should put up patches for. Just curious. I'm working on the payloads stuff and I know that as long as it compiles, it isn't going to break anything, so in some sense I could commit b/c I know it would make it easier for Michael B. and others to update and review w/o going through the patch process. On the other hand, the patch approach makes you take one extra good look at what you are doing. What do others think? -Grant
Re: [jira] Resolved: (LUCENE-829) StandardBenchmarker#makeDocument does not explicitly close opened files
On 16 Mar 2007, at 02.23, Doron Cohen (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doron Cohen resolved LUCENE-829. Resolution: Fixed Fix Version/s: 2.2 Lucene Fields: [Patch Available] (was: [New]) Committed the fix for this. There were actually two more cases like this. Also found this in ReutersQueryMaker: private void prepareQueries() throws Exception { // analyzer (default is standard analyzer) Analyzer anlzr = (Analyzer) Class.forName(config.get("analyzer", "org.apache.lucene.analysis.StandardAnalyzer")).newInstance(); It should be "org.apache.lucene.analysis.standard.StandardAnalyzer")).newInstance(); -- karl
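The bug is easy to miss because the default class name is just a string that is only resolved at runtime. A stripped-down sketch of the Class.forName pattern that ReutersQueryMaker uses (names simplified and invented; not the actual benchmark code):

```java
// Minimal sketch of reflective instantiation with a string default:
// a typo like a missing ".standard" package segment compiles fine and
// only fails at runtime with a ClassNotFoundException.
public class ReflectiveFactory {

    public static Object newInstance(String configuredClassName, String defaultClassName) {
        String name = configuredClassName != null ? configuredClassName : defaultClassName;
        try {
            return Class.forName(name).newInstance();
        } catch (Exception e) {
            throw new RuntimeException("cannot instantiate " + name, e);
        }
    }
}
```

Because nothing checks the default until a run actually falls back to it, a wrong default can sit unnoticed until someone omits the config setting.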
[jira] Updated: (LUCENE-550) InstantiatedIndex - faster but memory consuming index
[ https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-550: --- Attachment: HitCollectionBench.jpg A graph showing performance of hit collection using InstantiatedIndex, RAMDirectory and FSDirectory. In essence, there is no great win in pure search time when there are more than 7000 documents. However, retrieving documents is still not associated with any cost whatsoever, so in a 25 sized index that uses Lucene for persistence of fields, I still see a boost of 6-10x or so compared to RAMDirectory.

documents in corpus / queries per second:

[EMAIL PROTECTED]
 250 37530,00   500 29610,00   750 22612,50  1000 19267,50  1250 16027,50
1500 14737,50  1750 13230,00  2000 12322,50  2250 11482,50  2500 10125,00
2750  9802,50  3000  8508,25  3250  8469,80  3500  7788,61  3750  5207,29
4000  5484,52  4250  4912,50  4500  4420,58  4750  4006,49  5000  4357,50
5250  3886,67  5500  3573,93  5750  3236,76  6000  3602,10  6250  3420,00
6500  3075,00  6750  2805,00  7000  2680,98  7250  2908,55  7500  2769,46
7750  2644,86  8000  2496,25  8250  2377,50  8500  2578,71  8750  2390,11
9000  2160,00  9250  2037,96  9500  1872,19  9750  2041,38  10000 1959,12
Created 10000 documents

[EMAIL PROTECTED]
 250  4845,00   500  3986,01   750  4330,67  1000  4682,82  1250  4148,78
1500  4847,65  1750  4535,23  2000  4192,50  2250  4203,30  2500  3695,65
2750  3742,50  3000  3485,76  3250  3470,76  3500  3525,00  3750  2877,61
4000  3221,78  4250  2983,51  4500  2982,02  4750  2724,55  5000  3092,86
5250  2646,18  5500  2940,00  5750  2709,58  6000  2423,30  6250  2602,50
6500  2305,39  6750  2462,57  7000  1815,00  7250  2431,42  7500  2171,74
7750  2297,90  8000  2134,30  8250  2308,85  8500  2038,98  8750  2231,65
9000  2097,90  9250  2041,38  9500  1819,77  9750  2102,24  10000 1876,87
Created 10000 documents

[EMAIL PROTECTED]
 250  3448,28   500  2422,50   750  2677,50  1000  2607,39  1250  2241,92
1500  2486,27  1750  2472,53  2000  1733,52  2250  2325,00  2500  2194,21
2750  1969,55  3000  2125,75  3250  2009,00  3500  1473,08  3750  1858,14
4000  1925,57  4250  1671,66  4500  1786,25  4750  1694,15  5000  1217,63
5250  1595,11  5500  1745,75  5750  1526,18  6000  1431,78  6250  1524,66
6500  1648,35  6750  1544,23  7000  1428,22  7250  1487,29  7500  1494,02
7750  1106,13  8000  1455,00  8250  1284,86  8500  1182,63  8750  1292,33
9000  1399,70  9250  1000,00  9500  1291,04  9750  1359,56  10000 1194,62
Created 10000 documents

InstantiatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: https://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 2.0.0 Reporter: Karl Wettin Assigned To: Karl Wettin Attachments: HitCollectionBench.jpg, lucene-550.jpg, test-reports.zip, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2 A non-file-centric, all-in-memory index. Consumes some 2x the memory of a RAMDirectory (in a term-saturated index) but is between 3x-60x faster depending on application and how one counts. Average query is about 8x faster. IndexWriter and IndexModifier have been realized in InterfaceIndexWriter and InterfaceIndexModifier. InstantiatedIndex is wrapped in a new top-layer index facade (class Index) that comes with factory methods for writers, readers and searchers for unison index handling. There are decorators with notification handling that can be used for automatically synchronizing searchers on updates, etc. Index also comes with an FS/RAMDirectory implementation.
[jira] Updated: (LUCENE-550) InstantiatedIndex - faster but memory consuming index
[ https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-550: --- Attachment: HitCollectionBench.jpg Made the graph more readable. InstantiatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: https://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 2.0.0 Reporter: Karl Wettin Assigned To: Karl Wettin Attachments: HitCollectionBench.jpg, lucene-550.jpg, test-reports.zip, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2 A non-file-centric, all-in-memory index. Consumes some 2x the memory of a RAMDirectory (in a term-saturated index) but is between 3x-60x faster depending on application and how one counts. Average query is about 8x faster. IndexWriter and IndexModifier have been realized in InterfaceIndexWriter and InterfaceIndexModifier. InstantiatedIndex is wrapped in a new top-layer index facade (class Index) that comes with factory methods for writers, readers and searchers for unison index handling. There are decorators with notification handling that can be used for automatically synchronizing searchers on updates, etc. Index also comes with an FS/RAMDirectory implementation.
[jira] Updated: (LUCENE-550) InstantiatedIndex - faster but memory consuming index
[ https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wettin updated LUCENE-550: --- Attachment: (was: HitCollectionBench.jpg) InstantiatedIndex - faster but memory consuming index - Key: LUCENE-550 URL: https://issues.apache.org/jira/browse/LUCENE-550 Project: Lucene - Java Issue Type: New Feature Components: Store Affects Versions: 2.0.0 Reporter: Karl Wettin Assigned To: Karl Wettin Attachments: HitCollectionBench.jpg, lucene-550.jpg, test-reports.zip, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2, trunk.diff.bz2 A non-file-centric, all-in-memory index. Consumes some 2x the memory of a RAMDirectory (in a term-saturated index) but is between 3x-60x faster depending on application and how one counts. Average query is about 8x faster. IndexWriter and IndexModifier have been realized in InterfaceIndexWriter and InterfaceIndexModifier. InstantiatedIndex is wrapped in a new top-layer index facade (class Index) that comes with factory methods for writers, readers and searchers for unison index handling. There are decorators with notification handling that can be used for automatically synchronizing searchers on updates, etc. Index also comes with an FS/RAMDirectory implementation.
Re: Commit and Review (was Is Lucene Java trunk still stable for production code?)
And by break, I mean all tests pass with the possible exception of those related to the new functionality. Also, the example I gave about payloads is hypothetical. I'm still going to submit a patch. On Mar 17, 2007, at 12:02 PM, Grant Ingersoll wrote: Hoss wrote: (or in short: we're moving more towards a *true* commit and review model) I'm curious as to what you think the practical implications are for committers of this model? Do you imagine a change in the workflow whereby we commit and then review, or do we stick to the patch approach as committers (contributors will always submit patches)? It has always been a gray area, where we all kind of know what we can commit w/o creating patches for versus what we should put up patches for. Just curious. I'm working on the payloads stuff and I know that as long as it compiles, it isn't going to break anything, so in some sense I could commit b/c I know it would make it easier for Michael B. and others to update and review w/o going through the patch process. On the other hand, the patch approach makes you take one extra good look at what you are doing. What do others think? -Grant
[jira] Updated: (LUCENE-834) Payload Queries
[ https://issues.apache.org/jira/browse/LUCENE-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-834: --- Attachment: boosting.term.query.patch First draft of a BoostingTermQuery, which is based on the SpanTermQuery and can be used for boosting the score of a term based on what is in the payload (for things like weighting terms higher according to their font size or part of speech). A couple of classes that were previously package level are now public and have been marked as public and for derivational purposes only. See the CHANGES.xml for some more details. I believe all tests still pass. Payload Queries --- Key: LUCENE-834 URL: https://issues.apache.org/jira/browse/LUCENE-834 Project: Lucene - Java Issue Type: New Feature Components: Search Reporter: Grant Ingersoll Assigned To: Grant Ingersoll Priority: Minor Attachments: boosting.term.query.patch Now that payloads have been implemented, it will be good to make them searchable via one or more Query mechanisms. See http://wiki.apache.org/lucene-java/Payload_Planning for some background information and https://issues.apache.org/jira/browse/LUCENE-755 for the issue that started it all.
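A payload is just a byte[], so a query such as BoostingTermQuery has to agree with the indexing side on an encoding. One possible convention, assumed here purely for illustration and not necessarily what the patch implements, is a 4-byte big-endian float used as a per-position score multiplier:

```java
// Hypothetical payload encoding for a per-term boost: 4 big-endian
// bytes holding an IEEE-754 float. The indexing side writes the
// payload; the query side decodes it and multiplies it into the score.
public class PayloadBoost {

    public static byte[] encode(float boost) {
        int bits = Float.floatToIntBits(boost);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8), (byte) bits
        };
    }

    public static float decode(byte[] payload) {
        int bits = ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
                 | ((payload[2] & 0xFF) << 8) | (payload[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    public static float score(float rawTermScore, byte[] payload) {
        return rawTermScore * decode(payload);
    }
}
```

With a scheme like this, an analyzer could attach encode(2.0f) to tokens tagged as, say, headline text or a particular part of speech, and the query would score those positions twice as high.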
Re: Commit and Review (was Is Lucene Java trunk still stable for production code?)
Personally, I really like having a Jira issue number associated with every commit ... and if there is a Jira issue open, attaching a patch is trivial -- and having patches in Jira makes it a little easier for people who really, Really, REALLY can't upgrade their version of Lucene for some strange reason to still have an easy reference point for when a feature was added (and they can try to backport it). I think the Commit and Review mentality is much more about how timid a committer needs to be about big changes that *might* have consequences or change APIs ... there are a lot of patches that don't break any existing unit tests, but add some new public methods whose existence may be questionable; there are patches that change code and fix existing unit tests so that they still pass -- and these fixes might make the tests logical, but raise the possibility that people were depending on the old behavior; then there are patches that change the way something works internally, whose behavior was previously undefined; ... all of these cases are things that a committer who feels they understand the changes should be able to go ahead and apply under the Commit and Review model, because if there are any consequences, they can always be rolled back before the next release -- this is a luxury we haven't had in the past, because so many people expected the trunk to be stable, and that they could always roll forward using new methods and depending on new behavior without risk that they would go away in a future release.
Ultimately big changes should always be discussed before they are committed -- but where now we tend to have issues open for a really long time, with lots of iterations of code patches before anything is ever committed, I suspect we may eventually reach the point where issues are opened to talk about *concepts* and hypothetical patches are attached showing off ideas, and as people say "X makes sense, Y I'm not so sure about", X gets committed and discussion continues about Y ... but if the game plan changes and X becomes silly, we yank it back out. We won't get from here to there overnight ... it's a delicate dance between the frequency of major releases, the mindsets of committers, and the comfort level with doing patch releases to get bug fixes out, because you know the trunk has a lot of X-style things in it that aren't really stable. : Hoss wrote: : (or in short: we're moving more towards a *true* commit and review : model) : : I'm curious as to what you think the practical implications are : for committers of this model? Do you imagine a change in the : workflow whereby we commit and then review, or do we stick to the : patch approach as committers (contributors will always submit : patches)? It has always been a gray area, where we all kind of know : what we can commit w/o creating patches for versus what we should put : up patches for. Just curious. I'm working on the payloads stuff and : I know that as long as it compiles, it isn't going to break anything, : so in some sense I could commit b/c I know it would make it easier : for Michael B. and others to update and review w/o going through the : patch process. On the other hand, the patch approach makes you take : one extra good look at what you are doing. : : What do others think? : : -Grant -Hoss
Re: [jira] Updated: (LUCENE-834) Payload Queries
Grant Ingersoll (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-834: --- Attachment: boosting.term.query.patch First draft at a BoostingTermQuery, which is based on the SpanTermQuery and can be used for boosting the score of a term based on what is in the payload (for things like weighting terms higher according to their font size or part of speech). A couple of classes that were previously package level are now public and have been marked as Public and for derivational purposes only. See the CHANGES.xml for some more details. I believe all tests still pass. Grant, This is great stuff! I know quite a few projects that will love this - specifically to boost terms differently based on a POS tag. We had a discussion recently in Nutch about changing the way typical Nutch queries are translated into Lucene queries, and performance implications there. If you're looking for a challenge ;) could you perhaps take a look at this discussion and see if you could contribute something? ;) http://www.nabble.com/Performance-optimization-for-Nutch-index---query-tf3276316.html Thanks in advance! -- Best regards, Andrzej Bialecki ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Updated: (LUCENE-834) Payload Queries
On Mar 17, 2007, at 6:02 PM, Andrzej Bialecki wrote: Grant Ingersoll (JIRA) wrote: Grant, This is great stuff! I know quite a few projects that will love this - specifically to boost terms differently based on a POS tag. Michael B. did a great job on implementing the underlying storage mechanisms, so most kudos should go to him. I/we hope to add several other types of Queries (see http://wiki.apache.org/lucene-java/Payload_Planning and add your own thoughts) POS, font weights, information from NLP applications, XPath, cross-references. It's all good! I am planning to have a few slides in my ApacheCon talk come May on the subject. We had a discussion recently in Nutch about changing the way typical Nutch queries are translated into Lucene queries, and performance implications there. If you're looking for a challenge ;) could you perhaps take a look at this discussion and see if you could contribute something? ;) You know, the Nutch Dev mailing list was my last holdout for subscriptions to the Lucene mailing lists! :-) I can barely keep up with Lucene Java! I will try to have a read soon, but can't promise I can add anything meaningful. Cheers, Grant
Re: [jira] Resolved: (LUCENE-829) StandardBenchmarker#makeDocument does not explicitly close opened files
karl wettin [EMAIL PROTECTED] wrote on 17/03/2007 09:43:45: Also found this in ReutersQueryMaker: private void prepareQueries() throws Exception { // analyzer (default is standard analyzer) Analyzer anlzr = (Analyzer) Class.forName(config.get("analyzer", "org.apache.lucene.analysis.StandardAnalyzer")).newInstance(); It should be "org.apache.lucene.analysis.standard.StandardAnalyzer")).newInstance(); Nice catch, I fixed that (and the 3 more like this). Thanks Karl! - Doron