[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1575:
---
Attachment: LUCENE-1575.5.patch

Fixed TestFieldNormModifier and TestLengthNormModifier. All tests pass now (including contrib).

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** This will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, using the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assess whether Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null (i.e., not set)? Is that even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()?

Open issues:
* The name for Collector. TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch, (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
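For readers following the thread: the decoupled API proposed above (collect taking only a doc id, plus setScorer) can be sketched with stub types. These are illustrative stand-ins written for this note, not the real org.apache.lucene.search classes, and the final 2.9 signatures may differ:

```java
// Stub stand-ins for the proposed API; not the real Lucene types.
interface Scorer {
    float score();
}

abstract class Collector {
    // Called once per segment before collection begins; docBase is the
    // offset to add to segment-relative doc ids.
    public abstract void setNextReader(int docBase);
    // Gives the collector on-demand access to scores.
    public abstract void setScorer(Scorer scorer);
    // Receives only the doc id; no score parameter anymore.
    public abstract void collect(int doc);
}

// A collector that only counts hits never has to compute a score at all,
// which is the point of the decoupling.
class CountingCollector extends Collector {
    int totalHits;
    int docBase;

    public void setNextReader(int docBase) { this.docBase = docBase; }
    public void setScorer(Scorer scorer) { /* score not needed */ }
    public void collect(int doc) { totalHits++; }
}
```

With this shape, a scoring collector calls scorer.score() inside collect() only when it actually needs the value.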
Re: Problem using Lucene RangeQuery
Lucene stores and searches STRINGS, so the range [0..2] may return 0, 1, 101, ... 109, 11, 110, ... 119, 12, ..., 2. Prefix and normalize your numbers, like: 001, 002, ... 011, 012, 013, etc. If you'll have bigger numbers, put more 0's. All of this and much more is documented on the wiki, javadocs and so on; please read them first.

On Thu, Apr 2, 2009 at 05:40, mitu2009 musicfrea...@gmail.com wrote:

I'm using RangeQuery to get all the documents which have amount between, say, 0 to 2. When I execute the query, Lucene gives me documents which have amount greater than 2 also... What am I missing here? Here is my code:

Term lowerTerm = new Term("amount", minAmount);
Term upperTerm = new Term("amount", maxAmount);
RangeQuery amountQuery = new RangeQuery(lowerTerm, upperTerm, true);
finalQuery.Add(amountQuery, BooleanClause.Occur.MUST);

and here is what goes into my index:

doc.Add(new Field("amount", amount.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));

Thanks.
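The zero-padding advice above can be shown in a few lines. The width (5 digits) is an arbitrary choice for this sketch; pick one wide enough for your largest amount, and apply it both at index time and when building the query terms:

```java
// Zero-pad numbers to a fixed width so lexicographic (string) term order
// matches numeric order, which is what RangeQuery relies on.
class PadExample {
    static String pad(long amount) {
        return String.format("%05d", amount); // 42 -> "00042"
    }
}
```

Unpadded, "101" sorts before "2" as a string; once both the indexed values and the query bounds go through pad(), the range [00000..00002] matches only the intended documents.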
Re: Future projects
On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

Now that LUCENE-1516 is close to being committed perhaps we can figure out the priority of other issues:

1. Searchable IndexWriter RAM buffer

I think first priority is to get a good assessment of the performance of the current implementation (from LUCENE-1516). My initial tests are very promising: with a writer updating (replacing random docs) at 50 docs/second on a full (3.2 M docs) Wikipedia index, I was able to reopen the reader once per second and do a large (500K results) search that sorts by date. The reopen time was typically ~40 msec, and search time typically ~35 msec (though there were random spikes up to ~340 msec). Though, these results were on an SSD (Intel X25M 160 GB).

We need more datapoints on the current approach, but this looks likely to be good enough for starters. And since we can get it into 2.9, hopefully it'll get some early usage and people will report back to help us assess whether further performance improvements are necessary. If they do turn out to be necessary, I think before your step 1, we should write small segments into a RAMDirectory instead of the real directory. That's simpler than truly searching IndexWriter's in-memory postings data.

2. Finish up benchmarking and perhaps implement passing filters to the SegmentReader level

What is "passing filters to the SegmentReader level"? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when a merge is committed" problem for this... Thinking more on this... I think one possible solution may be to somehow expose IndexWriter's internal docID remapping code. IndexWriter does delete by docID internally, and whenever a merge is committed we stop-the-world (sync on IW) and go remap those docIDs.

If we somehow allowed the user to register a callback that we could call when this remapping occurs, then the user's code could carry the docIDs without them becoming stale. Or maybe we could make a class PendingDocIDs, which you'd ask the reader to give you, that holds docIDs and remaps them after each merge. The problem is, IW internally always logically switches to the current reader for any further docID deletion, but the user's code may continue to use an old reader. So simply exposing this remapping won't fix it... we'd need to somehow track the genealogy (quite a bit more complex).

With 1) I'm interested in how we will lock a section of the bytes for use by a given reader?

We would not actually lock them, but we need to set aside the bytes such that, for example, if the postings grow, TermDocs iteration does not progress beyond its limits.

Are there any modifications that are needed of the RAM buffer format? How would the term table be stored? We would not be using the current hash method?

I think the realtime reader'd just store the maxDocID it's allowed to search, and we would likely keep using the RAM format now used.

Mike
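The PendingDocIDs idea floated above could look something like the sketch below. The class name comes from the mail, but everything else here - the int[]-map shape of the remap callback, -1 meaning "deleted by the merge" - is a guess for illustration, not actual Lucene API:

```java
import java.util.Arrays;

// Hypothetical sketch only: holds doc ids on behalf of a user and remaps
// them when a merge commits. oldToNew[old] gives the doc's new id, or -1
// if the merge dropped it (the doc had been deleted).
class PendingDocIDs {
    private int[] docIDs;

    PendingDocIDs(int[] docIDs) {
        this.docIDs = docIDs.clone();
    }

    int[] current() {
        return docIDs.clone();
    }

    // Would be invoked by a (hypothetical) merge-commit callback inside
    // IndexWriter, while the world is stopped.
    void remap(int[] oldToNew) {
        docIDs = Arrays.stream(docIDs)
                       .map(d -> oldToNew[d])
                       .filter(d -> d >= 0)
                       .toArray();
    }
}
```

As the mail points out, this only helps while the user searches the current reader; ids held against an old reader would additionally need the genealogy tracking described above.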
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694917#action_12694917 ]

Michael McCandless commented on LUCENE-1313:

Jason, your last patch looks like it's taking the "flush first to RAM Dir" approach I just described as the next step (on the java-dev thread), right? Which is great! So this has no external dependencies, right? And it simply layers on top of LUCENE-1516. I'd be very interested to compare (benchmark) this approach vs solely LUCENE-1516.

Could we change this class so that instead of taking a Transaction object, holding adds and deletes, it simply mirrors IndexWriter's API? Ie, I'd like to decouple the performance optimization of "let's flush small segments through a RAMDir first" from the transactional semantics of "I process a transaction atomically, and lock out other threads' transactions". Ie, the transactional restriction could/should layer on top of this performance optimization for near-realtime search?

Realtime Search
---
Key: LUCENE-1313
URL: https://issues.apache.org/jira/browse/LUCENE-1313
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch

Realtime search with transactional semantics. Possible future directions:
* Optimistic concurrency
* Replication

Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694927#action_12694927 ]

Shai Erera commented on LUCENE-1575:

I thought that ant test runs all tests. Thanks for the education.

The reason is that TimeLimitedCollector now extends Collector, which does not extend HitCollector. Therefore the method attempts to return an invalid type. I'm not sure how to fix it, because I cannot change the 2.4 test code, since Collector is not there. So the only reasonable solution I see here is to:
* Change TimeLimitedCollector to extend HitCollector, document that in 3.0 it will change to extend Collector, and that in the meantime you can use HitCollectorWrapper if you want.
* Comment out all the Collector-related methods, including the new ctor, with a TODO to reinstate them in 3.0.
* Fix TestTimeLimitedCollector to wrap it with a HCW, as well as using only HitCollector as the wrapped collector.

Other solutions which I don't like are:
* Deprecate TLC and create a new NewTimeLimitedCollector - I hate the name :)
* Have Collector extend HitCollector - I hate to even consider that.

What do you think?
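The HitCollectorWrapper mentioned in the first bullet is essentially an adapter. A self-contained sketch with stub types (these are stand-ins written for this note, not the real Lucene classes; setNextReader is omitted for brevity):

```java
// Stub stand-ins for the real org.apache.lucene.search types.
abstract class HitCollector {
    public abstract void collect(int doc, float score);
}

interface Scorer {
    float score();
}

abstract class Collector {
    public abstract void setScorer(Scorer scorer);
    public abstract void collect(int doc);
}

// Adapts a deprecated HitCollector to the new Collector API by fetching
// the score from the Scorer for every collected doc.
final class HitCollectorWrapper extends Collector {
    private final HitCollector delegate;
    private Scorer scorer;

    HitCollectorWrapper(HitCollector delegate) {
        this.delegate = delegate;
    }

    public void setScorer(Scorer scorer) {
        this.scorer = scorer;
    }

    public void collect(int doc) {
        delegate.collect(doc, scorer.score());
    }
}
```

Note the cost this implies: the wrapper always calls scorer.score(), even if the wrapped collector ignores the score, which is one reason native Collector implementations are preferred.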
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694938#action_12694938 ]

Michael McCandless commented on LUCENE-1575:

bq. I thought that ant test runs all tests. Thanks for the education.

Probably, it should. I'll raise this on java-dev.

bq. Change TimeLimitedCollector to extend HitCollector, document that in 3.0 it will change to extend Collector and that in the meantime use HitCollectorWrapper if you want.

I think I like this solution best (though this is very much a "lesser of all evils" situation).

<lament> Ahh, the contortions we must go through because of Lucene's success. Marvin over on Lucy can happily make major changes without batting an eye. The sad reality is that the ongoing growth rate of a thing is inversely proportional to its popularity. </lament>
ant test should include test-tag
I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon...

Mike
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12694919#action_12694919 ]

Michael McCandless commented on LUCENE-1575:

Could you also run ant test-tag (which tests JAR-drop-in back-compatibility)? EG I'm getting this compilation error:

{code}
[javac] /lucene/src/lucene.collection/tags/lucene_2_4_back_compat_tests_20090320/src/test/org/apache/lucene/search/TestTimeLimitedCollector.java:136: incompatible types
[javac] found   : org.apache.lucene.search.TimeLimitedCollector
[javac] required: org.apache.lucene.search.HitCollector
[javac]         return res;
[javac]                ^
{code}
[jira] Updated: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1516:
---
Attachment: LUCENE-1516.patch

Added another test case to TestIndexWriterReader, stress testing adding/deleting docs while constantly opening a near real-time reader.

Integrate IndexReader with IndexWriter
---
Key: LUCENE-1516
URL: https://issues.apache.org/jira/browse/LUCENE-1516
Project: Lucene - Java
Issue Type: Improvement
Affects Versions: 2.4
Reporter: Jason Rutherglen
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png
Original Estimate: 672h
Remaining Estimate: 672h

The current problem is that an IndexReader and IndexWriter cannot be open at the same time and perform updates, as they both require a write lock to the index. While methods such as IW.deleteDocuments enable deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capabilities of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates.

This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW in part by sharing the write lock. All methods of IR will be usable including reopen and clone.
Re: ant test should include test-tag
I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike
Re: ant test should include test-tag
Shai Erera wrote:

I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike

As long as I still have a target that will test without back compat tests.

--
- Mark
http://www.lucidimagination.com
Re: ant test should include test-tag
OK I'll add a test-core-contrib target.

Mike

On Thu, Apr 2, 2009 at 6:45 AM, Mark Miller markrmil...@gmail.com wrote:

Shai Erera wrote:

I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike

As long as I still have a target that will test without back compat tests.

--
- Mark
http://www.lucidimagination.com
Re: ant test should include test-tag
Wouldn't hurt I suppose - but test-core and test-contrib are probably sufficient. I wasn't very clear with that comment. I was just saying, as long as I can still run the tests a bit quicker than running through everything twice - which is already available. I should have just said +1. On the other hand, test-core-contrib doesn't hurt anything.

Michael McCandless wrote:

OK I'll add a test-core-contrib target.

Mike

On Thu, Apr 2, 2009 at 6:45 AM, Mark Miller markrmil...@gmail.com wrote:

Shai Erera wrote:

I definitely agree. It would have saved me another patch submission in 1575 :)

On Thu, Apr 2, 2009 at 12:44 PM, Michael McCandless luc...@mikemccandless.com wrote:

I think back-compat tests (ant test-tag) should run when you run ant test. Any objections? If not I'll commit soon... Mike

As long as I still have a target that will test without back compat tests.

--
- Mark
http://www.lucidimagination.com
[jira] Created: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9

TrieRange currently has the following problems:
- To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
- TrieUtils currently creates, by default, a helper field containing the lower-precision terms to enable sorting (limitation of one term/document for sorting).
- trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved.

This issue should improve this:
- trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused via the Token API; additional String[] arrays for the encoded result are not created - instead the TokenStream enumerates the trie values.
- Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on, and customize everything.

The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils-use only, that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this can be used (custom parser/iterator implementation for FieldCache).
I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
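The precision-splitting that trieCodeLong/Int() performs can be sketched in plain Java. This is a toy illustration with invented names (TriePrefixSketch is not the actual TrieUtils API): each precision level shifts away low-order bits, producing one term per level - exactly the sequence a trie-encoding TokenStream would enumerate.

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrefixSketch {
    // Enumerate prefix-encoded forms of a value, dropping `step` low bits
    // per precision level, the way trie encoding produces one term per level.
    public static List<String> prefixTerms(long value, int step) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += step) {
            // Tag each term with its shift so lower-precision terms
            // occupy their own range of the term dictionary.
            terms.add(shift + ":" + Long.toHexString(value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String t : prefixTerms(0x1234L, 8)) {
            System.out.println(t);
        }
    }
}
```

A range query can then match long runs of values with a handful of low-precision terms, which is the point of the encoding.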
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.6.patch

Changes:
# TimeLimitedCollector, TestTimeLimitedCollector and CHANGES.
# I also fixed a bug in TestTermScorer, which was discovered by the test-tag task, and has existed since LUCENE-1483 and propagated into HitCollectorWrapper as well: docBase was set to -1 by default, relying on setNextReader to be called. However, if it's not called (as in TestTermScorer, or if someone called Scorer.score(Collector)), all document Ids are shifted backwards by 1. The test had a bug which asserted on the unshifted doc Id, and after I fixed the Ids to shift, it failed. Anyway, the test now works correctly, as well as HCW.
# I checked all other Collector implementations and changed the default base to 0, except in some test cases where -1 had a meaning.

All tests (contrib, core and tags) pass.

Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also, this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of
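The proposed decoupling of score from collect() can be illustrated with a toy sketch. The class names below are invented for illustration and are not Lucene's actual classes: collect() receives only the doc id, and a collector that needs the score pulls it from a Scorer set up front via setScorer().

```java
// Toy illustration of the proposed Collector API: collect(int doc)
// without a score, plus setScorer() for implementations that need one.
public class CollectorSketch {
    interface Scorer { float score(); }

    static abstract class Collector {
        abstract void setScorer(Scorer scorer);
        abstract void collect(int doc);
        abstract void setNextReader(int docBase);
    }

    // A collector that counts hits and tracks the best score; it calls
    // scorer.score() only because it actually needs the value.
    static class CountingCollector extends Collector {
        Scorer scorer;
        int docBase = 0;   // default base of 0, not -1
        int totalHits = 0;
        float maxScore = Float.NEGATIVE_INFINITY;

        void setScorer(Scorer scorer) { this.scorer = scorer; }
        void setNextReader(int docBase) { this.docBase = docBase; }
        void collect(int doc) {
            totalHits++;
            maxScore = Math.max(maxScore, scorer.score());
        }
    }

    public static void main(String[] args) {
        CountingCollector c = new CountingCollector();
        c.setScorer(() -> 0.5f);
        c.collect(3);
        c.collect(7);
        System.out.println(c.totalHits + " hits, max=" + c.maxScore);
    }
}
```

A collector that only counts hits would simply never touch the Scorer, which is the payoff of the decoupling.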
[jira] Updated: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-1582: -- Description:

TrieRange currently has the following problems:
- To add a field that uses trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem that it uses a fixed field configuration.
- TrieUtils currently creates, by default, a helper field containing the lower-precision terms to enable sorting (limitation of one term/document for sorting).
- trieCodeLong/Int() unnecessarily creates String[] and char[] arrays, which is heavy on GC if you index a lot of numeric values. A lot of char[]-to-String copying is also involved.

This issue should improve this:
- trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused via the Token API; additional String[] arrays for the encoded result are not created - instead the TokenStream enumerates the trie values.
- Trie fields can be added to Documents during indexing using the standard API: new Field(name, TokenStream, ...), so no extra util method is needed. By using token filters, one could also add payloads and so on, and customize everything.

The drawback is: sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms as soon as a lower-precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils-use only, that uses an unchecked Exception in the Parser to stop iteration. With LUCENE-831, a more generic API for this can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, until FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it.
was: TrieRange has currently the following problem: - To add a field, that uses a trie encoding, you can manually add each term to the index or use a helper method from TrieUtils. The helper method has the problem, that it uses a fixed field configuration - TrieUtils currently creates per default a helper field containing the lower precision terms to enable sorting (limitation of one term/document for sorting) - trieCodeLong/Int() creates unnecessarily arrays of String and char[] arrays that is heavy for GC, if you index lot of numeric values. Also a lot of char[] to String copying is involved. This issue should improve this: - trieCodeLong/Int() returns a TokenStream. During encoding, all char[] arrays are reused by Token API, additional STRing[] arrays for the encoded result are not created, instead the TokenStream enumerates the trie values. - Documents can be added to Documents during indexing using the standard API: new Field(name,TokenStream,...), so no extra util method needed. By using token filters, one could also add payload and so and customize everything. The drawback is: Sorting would not work anymore. To enable sorting, a (sub-)issue can extend the FieldCache to stop iterating the terms, as soon as a lower precision one is enumerated by TermEnum. I will create a hack patch for TrieUtils-use only, that uses a non-checked Exceptionin the Parser to stop iteration. With LUCENE-831, a more generic API for this type can be used (custom parser/iterator implementation for FieldCache). I will attach the field cache patch (with the temporary solution, util FieldCache is reimplemented) as a separate patch file, or maybe open another issue for it. 
Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9
Atomic optimize() + commit()
Hi,

I've run into a problem in my code when I upgraded to 2.4. I am not sure if it is a real problem, but I thought I'd let you know anyway. The following is the background of how I ran into the issue, but I think the discussion does not necessarily involve my use of Lucene.

I have a class which wraps all Lucene-related operations, i.e., addDocument, deleteDocument, search and optimize (those are the important ones for this email). It keeps an IndexWriter open, through which it does the add/delete/optimize operations, and periodically opens an IndexReader for the search operations using the reopen() API. The application performs index operations (add, delete, update) from multiple threads, and there's a manager which, after the last operation has been processed, calls commit, which does writer.commit(). I also check from time to time if the index needs to be optimized, and optimize if needed (the criteria for when to do it are irrelevant now).

I also have a unit test which does several add/update/delete operations, calls optimize and checks the number of deleted documents. It expects to find 0, since optimize has been called, and after I upgraded to 2.4 it failed. Now ... with the move to 2.4, I discovered that optimize() does not commit automatically and I have to call commit. It's a good place to say that when I was on 2.3 I used the default autoCommit=true; with the move to 2.4 that default changed, and being a good citizen, I also changed my code to call commit when I want, and not use any deprecated ctors or rely on internal Lucene logic. I can only guess that that's why at the end of the test I still see numDeletedDocs != 0 (since optimize does not commit by default).

So I went ahead and fixed my optimize() method to do: (1) writer.optimize() (2) writer.commit(). But then I thought - is this fix correct? Is it the right approach?
Suppose that at the same time optimize was running, or just between (1) and (2) there was a context switch, and a thread added documents to the index. Upon calling commit(), the newly added documents are also committed, without the caller intending to do so. In my scenario this will probably not be too catastrophic, but I can imagine scenarios in which someone, in addition to indexing, updates a DB and has a virtual atomic commit, which commits the changes to the index as well as the DB, all the while locking any update operations. Suddenly that someone's code breaks.

There are a couple of ways I can solve it, like for example synchronizing the optimize + commit on a lock which all indexing threads will also synchronize on (allowing all of them to index concurrently, but blocking all of them while optimize is running), but that will hold up all my indexing threads. Or, I can just not call commit at the end, relying on the workers manager to commit at the next batch indexing work. However, during that time the readers will search on an unoptimized index, with deletes, while they could search on a freshly optimized index with no deletes (and fewer segments).

The problem with those solutions is that they are not intuitive. To start with, the Lucene documentation itself is wrong - IndexWriter.commit()'s javadoc says: "Commits all pending updates (added & deleted documents)" - optimize is not mentioned (shouldn't this be fixed anyway?). Also, notice that the problem stems from the fact that the optimize operation may be called by another thread, not knowing there are update operations running. Lucene documents that you can call addDocument while optimize() is running, so there's no need to protect against that. Suddenly, we're requiring every search application developer to disregard the documentation and think to himself "do I want to allow optimize() to run concurrently with add/deletes?". I'm not saying that it's wrong, but if we're OK with it, we should document it.
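The lock-based workaround described above can be sketched with a ReadWriteLock. This is a generic concurrency pattern, not a Lucene API; the writer calls are represented by stand-in fields. Indexing threads share the read lock and run concurrently, while optimize+commit takes the write lock and excludes them, so commit() cannot pick up half-finished adds.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the "synchronize optimize + commit" workaround: many indexing
// threads proceed concurrently (read lock); optimize+commit excludes them
// all (write lock) for the duration of both calls.
public class AtomicOptimizeSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private int docs = 0;          // stand-in for the wrapped IndexWriter
    private boolean optimized = false;

    public void addDocument() {
        lock.readLock().lock();
        try {
            docs++;                // writer.addDocument(doc) would go here
        } finally {
            lock.readLock().unlock();
        }
    }

    public void optimizeAndCommit() {
        lock.writeLock().lock();
        try {
            optimized = true;      // writer.optimize(); writer.commit();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public int docCount() { return docs; }
    public boolean isOptimized() { return optimized; }

    public static void main(String[] args) {
        AtomicOptimizeSketch s = new AtomicOptimizeSketch();
        s.addDocument();
        s.optimizeAndCommit();
        System.out.println(s.docCount() + " docs, optimized=" + s.isOptimized());
    }
}
```

The cost, as noted above, is that all indexing threads stall while optimize runs, which is exactly why this workaround is unsatisfying.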
I wonder though if there isn't room to introduce an atomic optimize() + commit() in Lucene. The incentive is that optimize is not the same as add/delete. Add/delete are operations I may want to hide from my users, because they change the state of the index (i.e., how many searchable documents there are). Optimize just reorganizes the index, and is supposed to improve performance. When I call optimize, don't I want it to be committed? Will I ever want to hold that commit off (taking out edge cases)? I assume that 99.9% of the time that's what we expect from it.

Now, just adding a call to commit() at the end of optimize() will not solve it, because that's the same as calling commit outside optimize(). We need optimize's commit to commit only its changes, and if there are updates pending commit - not touch them.

BTW, I've scanned through the documentation and haven't found any mention of such a thing; however, I may still have missed it. So if there is already a solution to this, or such an atomic optimize+commit, I apologize in advance for forcing you to read such a long email (for those of you who made it this far) and
[jira] Created: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards
SpanOrQuery skipTo() doesn't always move forwards - Key: LUCENE-1583 URL: https://issues.apache.org/jira/browse/LUCENE-1583 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 2.4.1, 2.4, 2.3.2, 2.3.1, 2.3, 2.2, 2.1, 2.0.0, 1.9 Reporter: Moti Nisenson Priority: Minor

In SpanOrQuery the skipTo() method is improperly implemented if the target doc is less than or equal to the current doc, since skipTo() may not be called for any of the clauses' spans:

public boolean skipTo(int target) throws IOException {
  if (queue == null) {
    return initSpanQueue(target);
  }
  while (queue.size() != 0 && top().doc() < target) {
    if (top().skipTo(target)) {
      queue.adjustTop();
    } else {
      queue.pop();
    }
  }
  return queue.size() != 0;
}

This violates the correct behavior (as described in the Spans interface documentation) that skipTo() should always move forwards; in other words, the correct implementation would be:

public boolean skipTo(int target) throws IOException {
  if (queue == null) {
    return initSpanQueue(target);
  }
  boolean skipCalled = false;
  while (queue.size() != 0 && top().doc() < target) {
    if (top().skipTo(target)) {
      queue.adjustTop();
    } else {
      queue.pop();
    }
    skipCalled = true;
  }
  if (skipCalled) {
    return queue.size() != 0;
  }
  return next();
}
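The contract and the fix can be exercised outside Lucene with a toy disjunction iterator (the classes below are invented for illustration, not Lucene's Spans API): when no sub-iterator needed to skip because the target is at or before the current doc, the fixed skipTo() falls through to next(), so the merged iterator still moves forward.

```java
import java.util.PriorityQueue;

// Toy model of the fixed SpanOrQuery.skipTo(): iterators over sorted doc id
// arrays, merged through a priority queue ordered by current doc.
public class SkipToSketch {
    static class Docs {
        final int[] ids; int pos = -1;
        Docs(int... ids) { this.ids = ids; }
        int doc() { return ids[pos]; }
        boolean next() { return ++pos < ids.length; }
        boolean skipTo(int target) {  // always advances at least once
            do { if (!next()) return false; } while (doc() < target);
            return true;
        }
    }

    static class OrDocs {
        final PriorityQueue<Docs> queue =
            new PriorityQueue<>((a, b) -> Integer.compare(a.doc(), b.doc()));
        OrDocs(Docs... subs) {
            for (Docs d : subs) if (d.next()) queue.add(d);
        }
        int doc() { return queue.peek().doc(); }
        boolean next() {
            Docs top = queue.poll();
            if (top == null) return false;
            if (top.next()) queue.add(top);
            return !queue.isEmpty();
        }
        boolean skipTo(int target) {
            boolean skipCalled = false;
            while (!queue.isEmpty() && queue.peek().doc() < target) {
                Docs top = queue.poll();
                if (top.skipTo(target)) queue.add(top);
                skipCalled = true;
            }
            if (skipCalled) return !queue.isEmpty();
            return next();   // the fix: skipTo always moves forwards
        }
    }

    public static void main(String[] args) {
        OrDocs or = new OrDocs(new Docs(1, 5), new Docs(2, 5));
        System.out.println(or.skipTo(5) + " doc=" + or.doc());
    }
}
```

Without the `return next()` fall-through, a second skipTo(5) would return without advancing, and a caller looping on skipTo could spin forever.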
[jira] Updated: (LUCENE-1583) SpanOrQuery skipTo() doesn't always move forwards
[ https://issues.apache.org/jira/browse/LUCENE-1583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1583: --- Fix Version/s: 2.9

LUCENE-1327 was a similar issue.

SpanOrQuery skipTo() doesn't always move forwards - Key: LUCENE-1583 URL: https://issues.apache.org/jira/browse/LUCENE-1583 Project: Lucene - Java Issue Type: Bug Components: Search Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.4, 2.4.1 Reporter: Moti Nisenson Priority: Minor Fix For: 2.9
Re: ant test should include test-tag
OK, I just left that new one off. So you have to run ant test-core test-contrib.

Mike

On Thu, Apr 2, 2009 at 7:21 AM, Mark Miller markrmil...@gmail.com wrote: Wouldn't hurt I suppose - but test-core and test-contrib are probably sufficient. I wasn't very clear with that comment. I was just saying, as long as I can still run the tests a bit quicker than running through everything twice - which is already available. I should have just said +1. On the other hand, test-core-contrib doesn't hurt anything.
[jira] Commented: (LUCENE-1582) Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values
[ https://issues.apache.org/jira/browse/LUCENE-1582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695016#action_12695016 ] Michael McCandless commented on LUCENE-1582: This sounds like a great improvement!

Make TrieRange completely independent from Document/Field with TokenStream of prefix encoded values --- Key: LUCENE-1582 URL: https://issues.apache.org/jira/browse/LUCENE-1582 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.9 Reporter: Uwe Schindler Assignee: Uwe Schindler Fix For: 2.9
Re: Atomic optimize() + commit()
With ConcurrentMergeScheduler, IndexWriter has gained a lot of concurrency, such that an optimize (or normal BG merge) could be running at the same time as deletes/adds. I think this is a good thing and we should keep improving it (there are still places that block, eg while a flush is running a merge cannot commit).

But, there are clearly cases where you want to explicitly prevent concurrent operations (like your class that wraps IndexWriter/Reader). The current patch on LUCENE-1313 has something similar, except in that case the atomic operation is do adds, do deletes, open new near-realtime reader. Grant also proposed generalizing IndexAccessor (in LUCENE-1516).

However: I think all such logic should live above IndexWriter/IndexReader. IndexWriter should try to be as concurrent as possible, and if apps need further atomicity of certain groups of operations, it should be done outside of Lucene's core. Of course, if IndexWriter doesn't expose enough APIs to enable such atomicity, we should fix that.

I definitely agree we should fix commit's javadocs to include other changes, like optimize() calls, addIndexes, etc. -- I'll do that.

Mike

On Thu, Apr 2, 2009 at 8:22 AM, Shai Erera ser...@gmail.com wrote:
> [...]
Re: Future projects
Michael: I love your suggestion on 3)! This really opens doors for flexible indexing. -John

On Thu, Apr 2, 2009 at 1:40 AM, Michael McCandless luc...@mikemccandless.com wrote: On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote: Now that LUCENE-1516 is close to being committed perhaps we can figure out the priority of other issues:

1. Searchable IndexWriter RAM buffer

I think first priority is to get a good assessment of the performance of the current implementation (from LUCENE-1516). My initial tests are very promising: with a writer updating (replacing random docs) at 50 docs/second on a full (3.2 M) Wikipedia index, I was able to reopen the reader once per second and do a large (> 500K results) search that sorts by date. The reopen time was typically ~40 msec, and search time typically ~35 msec (though there were random spikes up to ~340 msec). Though, these results were on an SSD (Intel X25M 160 GB).

We need more datapoints on the current approach, but this looks likely to be good enough for starters. And since we can get it into 2.9, hopefully it'll get some early usage and people will report back to help us assess whether further performance improvements are necessary. If they do turn out to be necessary, I think before your step 1, we should write small segments into a RAMDirectory instead of the real directory. That's simpler than truly searching IndexWriter's in-memory postings data.

2. Finish up benchmarking and perhaps implement passing filters to the SegmentReader level

What is passing filters to the SegmentReader level? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when a merge is committed" problem for this... Thinking more on this... I think one possible solution may be to somehow expose IndexWriter's internal docID remapping code.
IndexWriter does delete by docID internally, and whenever a merge is committed we stop-the-world (sync on IW) and go remap those docIDs. If we somehow allowed the user to register a callback that we could call when this remapping occurs, then the user's code could carry the docIDs without them becoming stale. Or maybe we could make a class PendingDocIDs, which you'd ask the reader to give you, that holds docIDs and remaps them after each merge. The problem is, IW internally always logically switches to the current reader for any further docID deletion, but the user's code may continue to use an old reader. So simply exposing this remapping won't fix it... we'd need to somehow track the genealogy (quite a bit more complex).

With 1) I'm interested in how we will lock a section of the bytes for use by a given reader?

We would not actually lock them, but we need to set aside the bytes such that, for example, if the postings grow, TermDocs iteration does not progress beyond its limits.

Are there any modifications that are needed of the RAM buffer format? How would the term table be stored? We would not be using the current hash method?

I think the realtime reader'd just store the maxDocID it's allowed to search, and we would likely keep using the RAM format now used.

Mike
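The PendingDocIDs idea floated above could look roughly like this - purely hypothetical, since no such class or callback exists in Lucene: a holder whose ids the writer would remap after each committed merge, so the user's ids never go stale.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the proposed PendingDocIDs: user code holds doc
// ids through this object, and the writer (in this idea) calls remap()
// with the old-docID -> new-docID mapping whenever a committed merge
// shuffles doc ids.
public class PendingDocIDs {
    private final int[] docIDs;

    public PendingDocIDs(int... docIDs) {
        this.docIDs = docIDs.clone();
    }

    // Would be invoked by the writer after each merge commit.
    public void remap(IntUnaryOperator oldToNew) {
        for (int i = 0; i < docIDs.length; i++) {
            docIDs[i] = oldToNew.applyAsInt(docIDs[i]);
        }
    }

    public int[] current() { return docIDs.clone(); }

    public static void main(String[] args) {
        PendingDocIDs p = new PendingDocIDs(3, 7);
        p.remap(d -> d - 1);   // e.g. one doc before both was merged away
        System.out.println(p.current()[0] + "," + p.current()[1]);
    }
}
```

As the message notes, this alone doesn't solve the problem: the ids are remapped relative to the writer's current reader, while the user may still be searching an old one.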
Re: Future projects
4) An additional possible contrib module is caching the results of TermQueries. Looking at the TermQuery code, would we need to cache the entire docs and freqs as arrays, which would be a memory hog?
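The memory-hog concern above is easy to quantify: caching a term's postings as parallel `int[]` docs / `int[]` freqs arrays costs 8 bytes per posting. A back-of-the-envelope sketch (the 3.2M figure is the Wikipedia index size mentioned earlier in the thread; the helper is illustrative, not a proposed API):

```java
// Back-of-the-envelope cost of caching a TermQuery's postings as parallel
// int[] docs / int[] freqs arrays: 8 bytes per posting. For a common term
// matching most of a large index, this adds up fast.
public class PostingsCacheCost {
    static long cachedBytes(long numPostings) {
        return numPostings * (4 /* doc int */ + 4 /* freq int */);
    }

    public static void main(String[] args) {
        long postings = 3_200_000;  // a term matching every doc of a 3.2M-doc index
        long bytes = cachedBytes(postings);
        System.out.println(bytes / (1024 * 1024) + " MB"); // 24 MB for ONE term
    }
}
```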
Re: Future projects
I'm interested in merging cached bitsets and field caches. While this may be something related to LUCENE-831, in Bobo there are custom field caches which we want to merge in RAM (rather than reload from the reader using TermEnum + TermDocs). This could somehow lead to delete by doc id.

Tracking the genealogy of segments is something we can provide as a callback from IndexWriter? Or could we add a method to IndexCommit or SegmentReader that returns the segments it originated from?
Re: Future projects
What is passing filters to the SegmentReader level? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

The patch I was thinking of is LUCENE-1536. I wasn't sure what the next steps are for it, i.e. the JumpScorer, Scorer.skipToButNotNext, or simply implementing a committable version of LUCENE-1536?
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695098#action_12695098 ] Michael McCandless commented on LUCENE-1575: Super, all tests pass for me too...

Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assess whether a Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()?

Open issues:
* The name for Collector.
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch.
I will split it into 2: (1) a code patch and (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?).

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
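The proposed decoupling can be sketched in a few dozen lines. This is a minimal, self-contained illustration of the *shape* of the proposal, not the eventual Lucene code: `collect(int doc)` takes a docID relative to the current reader, the score comes from a separately-set Scorer (pulled only if the collector wants it), and `topDocs(start, howMany)` pages through the hits. The `Scorer` here is a one-method stub standing in for `org.apache.lucene.search.Scorer`.

```java
// Self-contained sketch of the proposed Collector API: score decoupled from
// collect() via setScorer(), plus the paging topDocs(start, howMany).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class CollectorSketch {
    interface Scorer { float score(); }  // stub for the real Scorer

    abstract static class Collector {
        abstract void setScorer(Scorer scorer);
        abstract void collect(int doc);       // doc is relative to current reader
        abstract void setNextReader(int docBase);
    }

    static class ScoreDoc {
        final int doc; final float score;
        ScoreDoc(int doc, float score) { this.doc = doc; this.score = score; }
    }

    static class TopDocsCollector extends Collector {
        private final PriorityQueue<ScoreDoc> pq =
            new PriorityQueue<>(Comparator.comparingDouble((ScoreDoc sd) -> sd.score));
        private final int numHits;
        private int totalHits, docBase;
        private Scorer scorer;

        TopDocsCollector(int numHits) { this.numHits = numHits; }

        void setScorer(Scorer scorer) { this.scorer = scorer; }
        void setNextReader(int docBase) { this.docBase = docBase; }

        void collect(int doc) {
            totalHits++;
            float score = scorer.score();     // pulled only because we need it
            pq.add(new ScoreDoc(docBase + doc, score));
            if (pq.size() > numHits) pq.poll();  // drop the current lowest score
        }

        int getTotalHits() { return totalHits; }

        // Paging convenience: allocate only the requested window.
        ScoreDoc[] topDocs(int start, int howMany) {
            List<ScoreDoc> all = new ArrayList<>(pq);
            all.sort((a, b) -> Float.compare(b.score, a.score)); // best first
            int end = Math.min(start + howMany, all.size());
            if (start >= end) return new ScoreDoc[0];
            return all.subList(start, end).toArray(new ScoreDoc[0]);
        }
    }

    public static void main(String[] args) {
        TopDocsCollector c = new TopDocsCollector(3);
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f};
        final float[] current = new float[1];
        c.setScorer(() -> current[0]);
        c.setNextReader(0);
        for (int doc = 0; doc < scores.length; doc++) {
            current[0] = scores[doc];
            c.collect(doc);
        }
        ScoreDoc[] top = c.topDocs(0, 2);
        System.out.println(c.getTotalHits() + " hits; best doc=" + top[0].doc);
    }
}
```

A collector that never calls `scorer.score()` pays nothing for scoring, which is the point of the decoupling.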
Re: Future projects
On Thu, Apr 2, 2009 at 2:07 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

I'm interested in merging cached bitsets and field caches. While this may be something related to LUCENE-831, in Bobo there are custom field caches which we want to merge in RAM (rather than reload from the reader using TermEnum + TermDocs). This could somehow lead to delete by doc id.

What does Bobo use the cached bitsets for?

Merging FieldCache in RAM is also interesting for near-realtime search, once we have column stride fields. Ie, they should behave like deleted docs: there's no reason to go through disk when merging them -- just carry them straight to the merged reader. Only on commit do they need to go to disk. Hmm, in fact we could do this today, too, eg with norms, as a future optimization if needed. And that optimization applies to flushing as well (ie, when flushing a new segment, since we know we will open a reader, we could NOT flush the norms, and instead put them into the reader, and only on eventual commit, flush to disk).

Tracking the genealogy of segments is something we can provide as a callback from IndexWriter? Or could we add a method to IndexCommit or SegmentReader that returns the segments it originated from?

Well, the problem with my idea (callback from IW when docs shift) is that internally IW always uses the latest reader to get any new docIDs. Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event). But if you have a reader, perhaps oldish by now, we'd need to give you a way to map across N generations of docID shifts (which'd require the genealogy tracking). Alas, I think it will quickly get hairy.

Mike
Re: Future projects
I'm not sure how big a win this'd be, since the OS will cache those in RAM, and the CPU cost there (to pull from the OS's cache and reprocess) is maybe not high.

Optimizing search is interesting, because it's the wicked slow queries that you need to make faster, even when it's at the expense of wicked fast queries. If you make a wicked fast query 3X slower (eg 1 ms -> 3 ms), it's almost harmless in nearly all apps. So this makes things like PFOR (and LUCENE-1458, to enable pluggable codecs for postings) important, since it addresses the very large queries. In fact, for very large postings we should do PFOR minus the exceptions, ie, do a simple N-bit encode, even if it wastes some bits.

Mike
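"PFOR minus the exceptions" means: pick the bit width that fits the largest value in the block and pack every value with a plain N-bit encode, wasting some bits on the small values but keeping decode branch-free. A minimal self-contained sketch of that packing (illustrative only; not Lucene's actual codec, and it assumes all values fit in the chosen width):

```java
// Simple N-bit packing of small ints into long words: the "PFOR minus the
// exceptions" idea. Values are assumed to fit in `bits` bits.
public class NBitPack {
    static long[] pack(int[] values, int bits) {
        long[] out = new long[((values.length * bits) + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
            out[word] |= ((long) values[i]) << shift;
            if (shift + bits > 64)                 // value straddles a word boundary
                out[word + 1] |= ((long) values[i]) >>> (64 - shift);
        }
        return out;
    }

    static int unpack(long[] packed, int bits, int i) {
        long bitPos = (long) i * bits;
        int word = (int) (bitPos >>> 6), shift = (int) (bitPos & 63);
        long v = packed[word] >>> shift;
        if (shift + bits > 64) v |= packed[word + 1] << (64 - shift);
        return (int) (v & ((1L << bits) - 1));
    }

    public static void main(String[] args) {
        int[] gaps = {1, 7, 3, 120, 5, 64};   // e.g. docID deltas
        int bits = 7;                          // max value 120 fits in 7 bits
        long[] packed = pack(gaps, bits);
        for (int i = 0; i < gaps.length; i++)
            assert unpack(packed, bits, i) == gaps[i];
        System.out.println("packed " + gaps.length + " ints into " + packed.length + " longs");
    }
}
```

Real PFOR would instead pick a width covering, say, 90% of the values and patch the outliers as exceptions; skipping the exceptions trades space for a simpler, faster decode loop.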
Re: Future projects
On Thu, Apr 2, 2009 at 2:29 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What is passing filters to the SegmentReader level? EG as of LUCENE-1483, we now ask a Filter for its DocIdSet once per SegmentReader.

The patch I was thinking of is LUCENE-1536. I wasn't sure what the next steps are for it, i.e. the JumpScorer, Scorer.skipToButNotNext, or simply implementing a committable version of LUCENE-1536?

Ahh OK. We should pursue this one -- many filters are cached, or would otherwise be able to expose a random-access API. For such filters, it'd also make sense to pre-multiply the deleted docs, to save doing that multiply for every query that uses the filter. We'd need some sort of caching / segment wrapper class to manage that, maybe? But we should first do the Filter/Query unification, and Filter as a clause on BooleanQuery, and then re-assess the performance difference.

Mike
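"Pre-multiply the deleted docs" amounts to AND-NOT-ing the segment's deleted docs into the cached filter bits once, so every query reusing the cached filter skips that per-hit check. A tiny sketch, with `java.util.BitSet` standing in for Lucene's bit sets (the helper name is illustrative):

```java
// Pre-multiplying a cached, random-access filter by a segment's deleted docs:
// done once at cache time instead of per query. java.util.BitSet stands in
// for Lucene's OpenBitSet/BitVector.
import java.util.BitSet;

public class PremultipliedFilter {
    static BitSet premultiply(BitSet cachedFilterBits, BitSet deletedDocs) {
        BitSet result = (BitSet) cachedFilterBits.clone();
        result.andNot(deletedDocs);   // clear any doc that is deleted
        return result;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(0); filter.set(3); filter.set(7);
        BitSet deleted = new BitSet();
        deleted.set(3);
        System.out.println(premultiply(filter, deleted)); // {0, 7}
    }
}
```

The wrapper class Mike mentions would own this pre-multiplied copy per segment and invalidate it when the segment's deletes change.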
Re: Future projects
What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets, rather than relying exclusively on bitsets to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading and intersection would be too costly. Instead it iterates over in-memory custom field caches while hit collecting. Because we're also doing realtime search, making the loading more efficient via the in-memory field cache merging is interesting.

True, we do the in-memory merging with deleted docs; norms would be good as well. As a first step, how should we expose the segments a segment has originated from? I would like to get this implemented for 2.9 as a building block that perhaps we can write other things on. Column stride fields still requires some encoding, and merging field caches in RAM would presumably be faster?

Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event).

Couldn't each SegmentReader keep a docmap and the names of the segments it originated from? However the name is not enough of a unique key, as there are the deleted docs that change. It seems like we need a unique id for each segment reader, where the id is assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, but a docmap and the parent readers' IDs? It would be assumed the user would be holding the readers somewhere? Perhaps all this can be achieved with a callback in IW, and all this logic could be kept somewhat internal to Lucene?
[jira] Commented: (LUCENE-1574) PooledSegmentReader, pools SegmentReader underlying byte arrays
[ https://issues.apache.org/jira/browse/LUCENE-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695130#action_12695130 ] Jason Rutherglen commented on LUCENE-1574: --

True, the pool would hold onto spares, but they would expire. It's mostly useful for the large on-disk segments, as those byte arrays (for BitVectors) are large, and because there are more docs in them they would get hit with deletes more often, and so they'd be reused fairly often.

I'm not knowledgeable enough to say whether the transactional data structure will be fast enough. We had been using http://fastutil.dsi.unimi.it/docs/it/unimi/dsi/fastutil/ints/IntRBTreeSet.html in Zoie for deleted docs and it's way slow. Binary search of an int array is faster, albeit not fast enough. The multi-dimensional array thing isn't fast enough (for searching) as we implemented this in Bobo. It's implemented in Bobo because we have a multi-value field cache (which is quite large, because for each doc we're storing potentially 64 or more values in an in-place bitset) and a single massive array kills the GC. In some cases this is faster than a single large array because of the way Java (or the OS?) transfers memory around through the CPU cache.

PooledSegmentReader, pools SegmentReader underlying byte arrays --- Key: LUCENE-1574 URL: https://issues.apache.org/jira/browse/LUCENE-1574 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Original Estimate: 168h Remaining Estimate: 168h

PooledSegmentReader pools the underlying byte arrays of deleted docs and norms for realtime search. It is designed for use with IndexReader.clone, which can create many copies of byte arrays, which are of the same length for a given segment. When pooled they can be reused, which could save on memory. Do we want to benchmark the memory usage comparison of PooledSegmentReader vs GC?
Many times GC is enough for these smaller objects.
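The comment above compares deleted-docs representations: a sorted `int[]` checked by binary search (O(log n) per lookup) versus a bit-per-doc vector (O(1) per lookup, which is why Lucene's BitVector wins on hot paths). A self-contained sketch of the two membership tests, using `java.util.BitSet` in place of BitVector:

```java
// Two "is this doc deleted?" representations from the comment: sorted int[]
// with binary search vs a bit-per-doc vector. Both give the same answer; the
// bit vector is O(1) per lookup.
import java.util.Arrays;
import java.util.BitSet;

public class DeletedDocsLookup {
    // Sorted-int[] representation: O(log n) membership test.
    static boolean deletedViaSearch(int[] sortedDeletes, int doc) {
        return Arrays.binarySearch(sortedDeletes, doc) >= 0;
    }

    // Bit-per-doc representation: O(1) membership test.
    static boolean deletedViaBits(BitSet deletedBits, int doc) {
        return deletedBits.get(doc);
    }

    public static void main(String[] args) {
        int[] sortedDeletes = {3, 17, 17_000, 250_000};
        BitSet bits = new BitSet();
        for (int d : sortedDeletes) bits.set(d);

        for (int doc : new int[] {3, 4, 250_000}) {
            boolean del = deletedViaSearch(sortedDeletes, doc);
            assert del == deletedViaBits(bits, doc);
            System.out.println("doc " + doc + " deleted=" + del);
        }
    }
}
```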
Re: Future projects
On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets rather than relying exclusively on bitsets to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading and intersection would be too costly. Instead it iterates over in-memory custom field caches while hit collecting. Because we're also doing realtime search, making the loading more efficient via the in-memory field cache merging is interesting.

OK. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

True, we do the in-memory merging with deleted docs; norms would be good as well.

Yes, and eventually column stride fields.

As a first step, how should we expose the segments a segment has originated from?

I'm not sure; it's quite messy. Each segment must track what other segment it got merged to, and must hold a copy of its deletes as of the time it was merged. And each segment must know what other segments it got merged with.

Is this really a serious problem in your realtime search? Eg, from John's numbers in using payloads to read in the docID -> UID mapping, it seems like you could make a Query that, when given a reader, would go and do Approach 2 to perform the deletes (if indeed you are needing to delete thousands of docs with each update). What sort of docs/sec rates are you needing to handle?

I would like to get this implemented for 2.9 as a building block that perhaps we can write other things on.

I think that's optimistic. It's still at the hairy, can't-see-a-clean-way-to-do-it phase. Plus I'd like to understand that all other options have been exhausted first.
Especially once we have column stride fields and they are merged in RAM, you'll be handed a reader pre-warmed and you can then jump through those arrays to find docs to delete.

Column stride fields still requires some encoding, and merging field caches in RAM would presumably be faster?

Yes, potentially much faster. There's no sense in writing through to disk until commit is called.

Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event).

Couldn't each SegmentReader keep a docmap and the names of the segments it originated from? However the name is not enough of a unique key, as there's the deleted docs that change? It seems like we need a unique id for each segment reader, where the id is assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, but a docmap and the parent readers' IDs? It would be assumed the user would be holding the readers somewhere? Perhaps all this can be achieved with a callback in IW, and all this logic could be kept somewhat internal to Lucene?

The docMap is a costly way to store it, since it consumes 32 bits per doc (vs storing a copy of the deleted docs). But, then, docMap gives you random-access on the map. What if, prior to merging or committing merged deletes, there were a callback to force the app to materialize any privately buffered deletes? And then the app is not allowed to use those readers for further deletes? Still kinda messy. I think I need to understand better why delete by Query isn't viable in your situation...

Mike
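The 32-bits-per-doc trade-off above, in concrete numbers (the 3.2M figure is the Wikipedia index from earlier in the thread; the helpers are illustrative): a random-access docMap costs an `int` per doc, while a snapshot of the deleted docs costs one bit per doc, at the price of counting preceding deletes to recover the mapping.

```java
// Memory cost of the two ways to remember a merge's renumbering: a full
// random-access docMap (int per doc) vs a copy of the deleted docs (bit per doc).
public class DocMapCost {
    static long docMapBytes(long maxDoc)      { return maxDoc * 4; }       // 32 bits/doc
    static long deletedCopyBytes(long maxDoc) { return (maxDoc + 7) / 8; } // 1 bit/doc

    public static void main(String[] args) {
        long maxDoc = 3_200_000;
        System.out.println("docMap: " + docMapBytes(maxDoc) / (1024 * 1024) + " MB, "
            + "deleted-docs copy: " + deletedCopyBytes(maxDoc) / 1024 + " KB");
    }
}
```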
Re: Future projects
I think I need to understand better why delete by Query isn't viable in your situation...

The delete by query is a separate problem which I haven't fully explored yet. Tracking the segment genealogy is really an interim step for merging field caches before column stride fields gets implemented. Actually CSF cannot be used with Bobo's field caches anyways, which means we'd need a way to find out about the segment parents.

Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

We need to see how Bobo integrates with LUCENE-1483. It seems like we've been talking about CSF for 2 years and there isn't a patch for it? If I had more time I'd take a look. What is the status of it? I'll write a patch that implements a callback for the segment merging such that the user can decide what information they want to record about the merged SRs (I'm pretty sure there isn't a way to do this with MergePolicy?)
Lucene filter
How do you create a Lucene Filter to check if a field has a value? It is part of a ChainedFilter that I am creating. -- View this message in context: http://www.nabble.com/Lucene-filter-tp22858220p22858220.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
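One Lucene 2.4-era answer is to extend Filter and override getDocIdSet(IndexReader), walking every term of the target field via TermEnum and setting a bit for each doc returned by TermDocs. The sketch below models that logic over a toy postings map rather than a real IndexReader, so it is self-contained; all names here are illustrative assumptions, not Lucene API:

```java
// Self-contained model (no Lucene dependency) of the logic a custom Filter's
// getDocIdSet(IndexReader) would run: walk every term of the target field and
// mark each doc that has at least one value. All names are illustrative.
import java.util.BitSet;
import java.util.List;
import java.util.Map;

public class FieldHasValueDemo {

    // postings: field -> term -> docIDs containing that term,
    // standing in for TermEnum over one field plus TermDocs per term.
    static BitSet docsWithField(Map<String, Map<String, List<Integer>>> postings,
                                String field, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        Map<String, List<Integer>> terms = postings.get(field);
        if (terms != null) {
            for (List<Integer> docs : terms.values()) {
                for (int doc : docs) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> postings = Map.of(
            "color", Map.of("red", List.of(0, 2), "blue", List.of(3)));
        BitSet bits = docsWithField(postings, "color", 5);
        // docs 0, 2, 3 have a value for "color"; docs 1 and 4 do not
        assert bits.get(0) && bits.get(2) && bits.get(3);
        assert !bits.get(1) && !bits.get(4);
    }
}
```

In real Lucene 2.4 code the resulting bits would typically be an OpenBitSet, which can be returned directly as the DocIdSet.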
Re: Future projects
Just to clarify, Approach 1 and Approach 2 are both currently performing OK for us. -John

On Thu, Apr 2, 2009 at 2:41 PM, Michael McCandless luc...@mikemccandless.com wrote:

On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

What does Bobo use the cached bitsets for?

Bobo is a faceting engine that uses custom field caches and sometimes cached bitsets, rather than relying exclusively on bitsets, to calculate facets. It is useful where many facets (50+) need to be calculated and bitset caching, loading, and intersection would be too costly. Instead it iterates over in-memory custom field caches while hit collecting. Because we're also doing realtime search, making the loading more efficient via in-memory field cache merging is interesting.

OK. Does it operate at the segment level? Seems like that'd give you good enough realtime performance (though merging in RAM will definitely be faster).

True, we do the in-memory merging with deleted docs; norms would be good as well.

Yes, and eventually column stride fields.

As a first step, how should we expose the segments a segment has originated from?

I'm not sure; it's quite messy. Each segment must track what other segment it got merged to, and must hold a copy of its deletes as of the time it was merged. And each segment must know what other segments it got merged with. Is this really a serious problem in your realtime search? Eg, from John's numbers in using payloads to read in the docID -> UID mapping, it seems like you could make a Query that, when given a reader, would go and do Approach 2 to perform the deletes (if indeed you are needing to delete thousands of docs with each update). What sort of docs/sec rates are you needing to handle?

I would like to get this implemented for 2.9 as a building block that perhaps we can write other things on.

I think that's optimistic. It's still at the hairy-can't-see-a-clean-way-to-do-it phase. Plus, I'd like to understand that all other options have been exhausted first. Especially once we have column stride fields and they are merged in RAM, you'll be handed a reader pre-warmed and you can then jump through those arrays to find docs to delete.

Column stride fields still require some encoding, and merging field caches in RAM would presumably be faster?

Yes, potentially much faster. There's no sense in writing through to disk until commit is called. Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where each generation is a renumbering event).

Couldn't each SegmentReader keep a docMap and the names of the segments it originated from? However, the name is not enough of a unique key, as the deleted docs change. It seems like we need a unique ID for each SegmentReader, where the ID is assigned to cloned readers (which normally have the same segment name as the original SR). The ID could be a stamp (perhaps only given to read-only readers?). That way the SegmentReader.getMergedFrom method does not need to return the actual readers, but a docMap and the parent readers' IDs? It would be assumed the user would be holding the readers somewhere? Perhaps all this can be achieved with a callback in IW, and all this logic could be kept somewhat internal to Lucene?

The docMap is a costly way to store it, since it consumes 32 bits per doc (vs storing a copy of the deleted docs). But then docMap gives you random access on the map. What if, prior to merging or committing merged deletes, there were a callback to force the app to materialize any privately buffered deletes? And then the app is not allowed to use those readers for further deletes? Still kinda messy. I think I need to understand better why delete-by-Query isn't viable in your situation... Mike
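The docMap idea debated above can be sketched concretely. The following is a hypothetical, self-contained model (not Lucene code): each merged-away segment keeps an int[] mapping its old docIDs to their slots in the merged segment (-1 for docs deleted before the merge), and successive renumbering generations (gen X -> X+1 -> X+2) compose. It also makes the cost tradeoff visible: the map costs 32 bits per doc, versus one bit per doc for a copy of the deleted docs, but gives random access.

```java
// Hypothetical sketch (not Lucene API) of a per-segment docMap: for each old
// docID, the new docID in the merged segment, or -1 if it was deleted.
// Costs 32 bits/doc (vs 1 bit/doc for a deleted-docs copy), but is O(1) to query.
import java.util.BitSet;

public class DocMapDemo {

    // Build a docMap for one source segment. 'base' is the number of docs
    // contributed by segments that come before this one in the merge.
    static int[] buildDocMap(int maxDoc, BitSet deleted, int base) {
        int[] docMap = new int[maxDoc];
        int newDoc = base;
        for (int oldDoc = 0; oldDoc < maxDoc; oldDoc++) {
            docMap[oldDoc] = deleted.get(oldDoc) ? -1 : newDoc++;
        }
        return docMap;
    }

    // Compose two generations of renumbering: gen X -> X+1 -> X+2.
    static int[] compose(int[] first, int[] second) {
        int[] out = new int[first.length];
        for (int d = 0; d < first.length; d++) {
            out[d] = first[d] == -1 ? -1 : second[first[d]];
        }
        return out;
    }

    public static void main(String[] args) {
        // A 5-doc segment with doc 2 deleted, merged after a 10-doc segment.
        BitSet deleted = new BitSet();
        deleted.set(2);
        int[] gen1 = buildDocMap(5, deleted, 10);
        assert gen1[2] == -1;   // deleted doc has no new slot
        assert gen1[3] == 12;   // 10 (base) + 3 - 1 deleted doc before it

        // A second, deletion-free renumbering leaves the mapping unchanged.
        int[] gen2 = buildDocMap(14, new BitSet(), 0);
        assert compose(gen1, gen2)[3] == 12;
    }
}
```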
[jira] Created: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
Callback for intercepting merging segments in IndexWriter - Key: LUCENE-1584 URL: https://issues.apache.org/jira/browse/LUCENE-1584 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1516) Integrate IndexReader with IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695185#action_12695185 ] Jason Rutherglen commented on LUCENE-1516: -- In ReaderPool.get(SegmentInfo info, boolean doOpenStores, int readBufferSize) the readBufferSize needs to be passed into SegmentReader.get Integrate IndexReader with IndexWriter --- Key: LUCENE-1516 URL: https://issues.apache.org/jira/browse/LUCENE-1516 Project: Lucene - Java Issue Type: Improvement Affects Versions: 2.4 Reporter: Jason Rutherglen Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, LUCENE-1516.patch, magnetic.png, ssd.png, ssd2.png Original Estimate: 672h Remaining Estimate: 672h The current problem is an IndexReader and IndexWriter cannot be open at the same time and perform updates as they both require a write lock to the index. While methods such as IW.deleteDocuments enables deleting from IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from IW. This limits the capabilities of performing updates to the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates. This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. 
The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with IW, in part by sharing the write lock. All methods of IR will be usable, including reopen and clone.
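The getReader semantics described above, point-in-time readers over the writer's flushed state, can be modeled with a toy example. This is an illustrative sketch, not the actual patch; all class and method names below are assumptions, with a list of strings standing in for the index:

```java
// Toy model (not Lucene code) of the IndexWriter.getReader idea: the reader
// returned reflects the writer's flushed state, and a fresh getReader call is
// needed to see later updates. All names are illustrative.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NearRealtimeDemo {

    static class ToyWriter {
        private final List<String> buffered = new ArrayList<>();
        private final List<String> flushed = new ArrayList<>();

        void addDocument(String doc) {
            buffered.add(doc);
        }

        // Flushes buffered docs and returns a point-in-time snapshot,
        // standing in for IndexWriter.getReader.
        List<String> getReader() {
            flushed.addAll(buffered);
            buffered.clear();
            return Collections.unmodifiableList(new ArrayList<>(flushed));
        }
    }

    public static void main(String[] args) {
        ToyWriter w = new ToyWriter();
        w.addDocument("doc1");
        List<String> r1 = w.getReader();
        w.addDocument("doc2");
        assert r1.size() == 1;              // r1 is a point-in-time view
        assert w.getReader().size() == 2;   // a reopened reader sees doc2
    }
}
```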
[jira] Updated: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1584: - Attachment: LUCENE-1584.patch Patch is combined with LUCENE-1516. IndexWriter has a setSegmentMergerCallback method that is called in IW.mergeMiddle, where the readers being merged and the newly merged reader are passed to the SMC.mergedSegments method. I think we need to expose the SegmentReader segment name somehow, either via IndexReader.getSegmentName or an interface on top of SegmentReader? Callback for intercepting merging segments in IndexWriter - Key: LUCENE-1584 URL: https://issues.apache.org/jira/browse/LUCENE-1584 Project: Lucene - Java Issue Type: Improvement Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1584.patch Original Estimate: 96h Remaining Estimate: 96h For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment.
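A rough shape of the callback this patch describes, as a self-contained sketch. The interface and names below are modeled on the setSegmentMergerCallback / SMC.mergedSegments description and are assumptions, with segment names standing in for the readers the real patch passes:

```java
// Minimal sketch of a segment-merge callback, modeled on the description
// above (setSegmentMergerCallback / mergedSegments). Names are assumptions;
// segment-name strings stand in for the IndexReaders the patch passes.
import java.util.ArrayList;
import java.util.List;

public class MergeCallbackDemo {

    // The callback the application registers: told which source segments were
    // merged into which new segment, so it can merge per-segment field caches
    // or bitsets instead of rebuilding them from scratch.
    interface SegmentMergeCallback {
        void mergedSegments(List<String> sources, String merged);
    }

    // Toy "writer" that performs a merge and notifies the callback,
    // standing in for the hook in IndexWriter.mergeMiddle.
    static class ToyWriter {
        private SegmentMergeCallback callback;

        void setSegmentMergerCallback(SegmentMergeCallback cb) {
            this.callback = cb;
        }

        String merge(List<String> sources) {
            String merged = "_" + sources.size() + "m"; // fake merged name
            if (callback != null) {
                callback.mergedSegments(sources, merged);
            }
            return merged;
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        ToyWriter w = new ToyWriter();
        w.setSegmentMergerCallback((sources, merged) ->
            seen.add(sources + " -> " + merged));
        w.merge(List.of("_0", "_1"));
        assert seen.size() == 1;  // app was told about the merge
    }
}
```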
IndexWriter.addIndexesNoOptimize(IndexReader[] readers)
This seems like something that's tenable? It would be useful for merging RAM indexes to disk, where if a Directory is passed instead, the directory may be changed.