[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695674#action_12695674 ]

Shai Erera commented on LUCENE-1575:

Mike - about your comments on the new Searcher and Searchable search(Weight, Filter, Collector): I think the best (if not the only) option currently is to remove them from the interface (comment them out, I mean) with a TODO to add them in 3.0.

I tried to just comment the method out in Searchable and add an empty impl in Searcher which throws UOE (UnsupportedOperationException). However, that caused a problem in MultiSearcher, ParallelMultiSearcher and RemoteSearchable:
* RemoteSearchable impls Searchable - commenting out the new impl method with a TODO for 3.0 will be fine, but
* MS and PMS accept Searchable in their ctor and use them in search(W, F, C), which they inherit from Searcher (they MUST override it because Searcher's throws UOE). However, they call searchable.search, which accepts just a HC (HitCollector), and we can't wrap a Collector with a HC.

Previously, MS and PMS implemented the HC version by always wrapping with a MRHC (MultiReaderHitCollector). I think we should just pass in the given HC to the Searchable.search method, and rely on its wrapping by a HCW (HitCollectorWrapper) later on. In 3.0 we'll delete it entirely and use the Collector implementation.

Do you see any other way?

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].

We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to
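The proposal in the issue description to decouple the score from collect(), together with the HitCollectorWrapper idea, could look roughly like the sketch below. All types here (ScorerSketch, CollectorSketch, HitCollectorSketch, HitCollectorWrapperSketch, CountingCollectorSketch) are hypothetical stand-ins written for illustration; they are not Lucene's classes, and setNextReader is simplified to take only a doc base.

```java
// Hypothetical stand-ins for illustration only; these are NOT Lucene's real classes.
interface ScorerSketch {
    float score();
}

/** New-style collector: gets segment-relative ("unbased") doc ids and pulls the score on demand. */
abstract class CollectorSketch {
    abstract void setScorer(ScorerSketch scorer);
    abstract void setNextReader(int docBase); // simplified: only the doc base is passed
    abstract void collect(int doc);
}

/** Old-style collector: receives an absolute doc id together with its score. */
abstract class HitCollectorSketch {
    abstract void collect(int doc, float score);
}

/** Adapts an old-style collector to the new API, as proposed for the deprecated methods. */
final class HitCollectorWrapperSketch extends CollectorSketch {
    private final HitCollectorSketch delegate;
    private ScorerSketch scorer;
    private int docBase;

    HitCollectorWrapperSketch(HitCollectorSketch delegate) {
        this.delegate = delegate;
    }

    @Override void setScorer(ScorerSketch scorer) { this.scorer = scorer; }

    @Override void setNextReader(int docBase) { this.docBase = docBase; }

    @Override void collect(int doc) {
        // The old API needs both values, so rebase the id and fetch the score here.
        delegate.collect(docBase + doc, scorer.score());
    }
}

/** A collector that never needs the score, so scorer.score() is never called. */
final class CountingCollectorSketch extends CollectorSketch {
    int count;

    @Override void setScorer(ScorerSketch scorer) { /* score not needed */ }

    @Override void setNextReader(int docBase) { /* absolute ids not needed */ }

    @Override void collect(int doc) { count++; }
}
```

In this arrangement only collectors that ask for the score pay for computing it, which is the motivation given for the setScorer(Scorer) proposal.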
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1575:
---
Attachment: sortCollate5.py
            sortBench5.py

I'm attaching the Python scripts I use to run the tests. You also need this small mod:

{code}
Index: contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java
===================================================================
--- contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java (revision 761709)
+++ contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java (working copy)
@@ -63,6 +63,9 @@
     super(runData);
   }
 
+  // nocommit
+  static boolean first = true;
+
   public int doLogic() throws Exception {
     int res = 0;
     boolean closeReader = false;
@@ -101,6 +104,11 @@
         } else {
           hits = searcher.search(q, numHits);
         }
+// nocommit
+if (first) {
+  System.out.println("NUMHITS=" + hits.totalHits);
+  first = false;
+}
         //System.out.println("q=" + q + ":" + hits.totalHits + " total hits");
         if (withTraverse()) {
{code}

All the Python scripts do is write an alg, run it, gather the results, and collate them at the end. You run sortBench5.py once on trunk and once in a checkout with this patch, each time in the contrib/benchmark directory. It saves a pickle file (results.pk), which sortCollate5.py then loads (you'll have to edit the hardwired paths in sortCollate5.py).
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695700#action_12695700 ]

Michael McCandless commented on LUCENE-1575:

{quote}
BTW, I can change FieldValueHitQueue like I changed TopFieldCollector by introducing a factory create() method which will return a OneComparatorFieldValueHitQueue or a MultiComparatorsFieldValueHitQueue. Today, FVHQ.lessThan checks the numComparators in each call, which is redundant.
{quote}

Seems good, unless the extra subclassing (and the additions of super.XXX()) is somehow causing our performance loss.

bq. Also the class isn't final and I'm not sure if we want to change it.

Yes let's make it final. We need to eek...
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695701#action_12695701 ]

Michael McCandless commented on LUCENE-1575:

{quote}
How problematic is this break in back-compat, given it will be documented in CHANGES?
* Have search(W, F, C) on Searchable? I don't think it will have such a great impact, as I don't believe too many actually implement Searchable.
* Have search(W, F, C) on Searcher as abstract? I know you offered, Mike, to create an empty impl which throws UOE, but I'm not sure what's worse: having a compilation error or a UOE at runtime (which can happen at the customer's). After all, all the search methods call this one eventually, and if you did extend Searcher (rather than IndexSearcher), you'll get a UOE on every search.
{quote}

OK let's add both and call it out in CHANGES.txt?
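The trade-off weighed in this exchange (a compile error from an abstract method versus an UnsupportedOperationException at runtime from an empty default) can be illustrated with a small sketch. The class names below are hypothetical and the method signature is simplified to a single String argument; this is not the Searcher API itself.

```java
/** Hypothetical base class that gains a new method with a throwing default. */
abstract class SearcherWithDefault {
    // Option 1: old subclasses keep compiling, but the call fails when it is made.
    void search(String query) {
        throw new UnsupportedOperationException("search(query) is not implemented");
    }
}

/** Hypothetical base class that gains a new abstract method. */
abstract class SearcherWithAbstract {
    // Option 2: a subclass that does not implement this no longer compiles.
    abstract void search(String query);
}

/** Written against the old API: compiles against option 1, breaks at runtime. */
final class LegacySearcher extends SearcherWithDefault {
}

/** Has to implement search(...) before it compiles against option 2. */
final class UpdatedSearcher extends SearcherWithAbstract {
    int calls;

    @Override void search(String query) {
        calls++;
    }
}
```

With option 1 the failure surfaces only when LegacySearcher.search(...) is actually invoked, possibly in production; with option 2 the author of the subclass finds out at build time.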
Re: Future projects
On Fri, Apr 3, 2009 at 5:32 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

meaning in Bobo you'd like to manage your own memory resident field caches, and merge them whenever IW has merged a segment? Seems like you don't need genealogy for that.

Agreed, there is no need for full genealogy.

OK

CSF isn't really designed yet. How come it can't be used with Bobo's field caches?

I guess CSF should be able to support it, makes sense. As long as the container is flexible with the encoding (I need to look into this more on the Bobo side).

Well as CSF unfolds let's take Bobo's usage into account.

Lucene's internal field cache usage is now entirely at the segment level (ie, Lucene core should never request full field cache array at the MultiSegmentReader level). I think Bobo must have to do the same, if it handles near realtime updates, to get adequate performance.

Bobo needs to migrate to this model, I don't think we've done that yet.

Hmm OK. That means reopen is very costly?

EG how come Bobo made its own field cache impl? Just because uninversion is too slow?

It could be integrated once LUCENE-831 is completed. I think the current model of a weak reference and the inability to unload if needed is a concern. I don't think it's because of uninversion.

Ahh OK. We know about that one (for LUCENE-831): the app should have control over the caching policy.

Mike

- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
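The two points raised in this message, keeping field caches per segment so a reopen only loads new segments, and letting the application decide when entries are unloaded, can be sketched as follows. This is a minimal illustration under assumptions: PerSegmentCacheSketch, its loader function and the segment names are hypothetical, and it does not reflect Bobo's or Lucene's actual cache implementation.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical per-segment cache; segment names and the loader are illustrative only. */
final class PerSegmentCacheSketch {
    private final Map<String, int[]> bySegment = new HashMap<>();
    private final Function<String, int[]> loader;
    int loads; // how many times a segment had to be loaded (e.g. uninverted)

    PerSegmentCacheSketch(Function<String, int[]> loader) {
        this.loader = loader;
    }

    /** Called after a reopen with the names of the segments the new reader sees. */
    void reopen(Collection<String> liveSegments) {
        // Explicit eviction: entries for segments that were merged away are dropped now,
        // instead of waiting for a weak reference to be cleared.
        bySegment.keySet().retainAll(liveSegments);
        for (String segment : liveSegments) {
            if (!bySegment.containsKey(segment)) {
                bySegment.put(segment, loader.apply(segment));
                loads++;
            }
        }
    }

    /** Returns the cached values for a segment, or null if the segment is not cached. */
    int[] values(String segment) {
        return bySegment.get(segment);
    }
}
```

Because entries are keyed by segment, a reopen that adds one small segment triggers one load rather than rebuilding an array for the whole index.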
Re: Future projects
On Fri, Apr 3, 2009 at 5:42 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

I think the realtime reader'd just store the maxDocID it's allowed to search, and we would likely keep using the RAM format now used.

Sounds pretty good. Are there any other gotchas in the design?

Yes: the flushing process becomes challenging. When we flush, we must forcefully cut over any open readers searching that RAM buffer to the now-on-disk segment. Such a swap-out is tricky because there could be [many threads of] searches in flight, iterating through postings, etc.

Which means the RAM buffer would have to become an independent entity that is not re-used after flushing, but instead sticks around until GC determines all outstanding readers have switched to the on-disk segment.

I would rather not go here unless it's clear the current near real-time performance is too limiting. But in my simplistic test the current performance looks good. And, if the performance does turn out to be lacking, the next step (before searching IW's RAM buffer) is to flush the new tiny segments through a RAMDir first.

Mike
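The idea that a realtime reader just stores the maxDocID it is allowed to search can be sketched with an append-only buffer. This is a minimal illustration that assumes a single writer thread and a fixed capacity; RamBufferSketch and SnapshotReader are hypothetical names and the buffer holds plain ints rather than IndexWriter's real RAM format.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical append-only in-memory buffer; assumes exactly one writer thread. */
final class RamBufferSketch {
    private final int[] values;
    private final AtomicInteger docCount = new AtomicInteger();

    RamBufferSketch(int capacity) {
        values = new int[capacity];
    }

    /** Writer side: store the value first, then publish the new document count. */
    void add(int value) {
        int doc = docCount.get();
        values[doc] = value;
        docCount.set(doc + 1); // volatile write: the stored value is visible to later readers
    }

    /** Reader side: a snapshot only remembers the max docID it may search. */
    SnapshotReader openReader() {
        return new SnapshotReader(docCount.get());
    }

    final class SnapshotReader {
        private final int maxDoc;

        SnapshotReader(int maxDoc) {
            this.maxDoc = maxDoc;
        }

        int maxDoc() {
            return maxDoc;
        }

        /** Visits only documents below the snapshot's maxDoc, ignoring later additions. */
        int sum() {
            int total = 0;
            for (int doc = 0; doc < maxDoc; doc++) {
                total += values[doc];
            }
            return total;
        }
    }
}
```

The sketch also hints at the flushing difficulty described above: a SnapshotReader keeps a reference to the buffer, so the buffer cannot be recycled while any such reader is still in use.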
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695711#action_12695711 ]

Shai Erera commented on LUCENE-1575:

There are no super.XXX calls. The two FVHQ implementations just implement lessThan according to whether it's the single-comparator or the multi-comparator case. This removes the check of numComparators == 1.

On Sat, Apr 4, 2009 at 12:53 PM, Michael McCandless (JIRA)

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].

We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
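The topDocs(start, howMany) paging method from the issue description, which allocates a result array of the requested size only, could work along the lines of the sketch below. It is an illustration under assumptions: TopDocsPagingSketch is a hypothetical name, java.util.PriorityQueue of plain scores stands in for Lucene's PriorityQueue of ScoreDoc entries, and the call drains the queue.

```java
import java.util.PriorityQueue;

/** Hypothetical paging helper; a min-heap of scores stands in for the hit queue. */
final class TopDocsPagingSketch {

    /** Returns hits [start, start + howMany) in descending score order; drains the queue. */
    static float[] topDocs(PriorityQueue<Float> minHeap, int start, int howMany) {
        int size = minHeap.size();
        if (start < 0 || start >= size || howMany <= 0) {
            return new float[0];
        }
        howMany = Math.min(size - start, howMany);
        float[] page = new float[howMany]; // only the requested size is allocated

        // Discard the lowest-scoring hits that rank below the requested page.
        for (int i = size - start - howMany; i > 0; i--) {
            minHeap.poll();
        }
        // The heap yields ascending scores, so fill the page from the back.
        for (int i = howMany - 1; i >= 0; i--) {
            page[i] = minHeap.poll();
        }
        return page;
    }
}
```

For a queue holding the scores 1 through 5, asking for start=1 and howMany=2 skips the best hit and returns the second and third best, without ever allocating an array for all five.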
Re: Future projects
On Fri, Apr 3, 2009 at 7:11 PM, Michael Busch busch...@gmail.com wrote:

Yeah me too. I think eventually we want this to be a Codec, but we probably don't want to wait until all the flexible indexing work is done. So maybe we should just not worry too much about a perfectly integrated API yet and release it as experimental API with 2.9 and 3.0, just like we did with the initial payloads implementation. Then when we hopefully get flexible indexing and codecs into 3.1 we can rework CSF and integrate it as a Codec.

+1

LUCENE-1458 shouldn't block 831/1231.

As I recently mentioned on 1231 I'm looking into changing the Document and Field APIs. I've some rough prototype. I think we should also try to get it in before 2.9? On the other hand I don't want to block the 2.9 release with too much stuff.

That'd be great -- I'd say post the rough prototype and let's iterate?

Mike
Re: Future projects
On Fri, Apr 3, 2009 at 8:01 PM, Jason Rutherglen jason.rutherg...@gmail.com wrote:

I looked at the IndexWriter code with regard to creating a realtime reader. With the many flexible indexing classes, I'm unsure how one would get a frozenish IndexInput of the byte slices, given the byte slices are attached to different threads?

The realtime reader would have to have sub-readers per thread, and an aggregate reader that joins them by interleaving the docIDs.

When flushing we create such a beast, but it's not general purpose (ie it does not implement the IndexReader API; it only implements enough to write the postings).

BTW there are benefits to not reusing the RAM buffer, outside of faster near real-time search: it would allow flushing to be done in the BG. Ie, flush could start, and we'd immediately switch to a new RAM buffer, thus allowing adds/deletes in other threads to run. Currently they are all blocked ("stop the world") during flush, though it's not clear on a fast IO device (SSD) how big a deal this "stop the world" really is to indexing throughput.

But still it's a complex change.

Mike
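An aggregate reader that joins per-thread sub-readers by interleaving their docIDs amounts to a k-way merge of ascending streams. The sketch below shows that merge on plain int arrays; InterleavingReaderSketch is a hypothetical name, and real sub-readers would expose postings iterators rather than arrays.

```java
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical k-way merge of per-thread docID lists; arrays stand in for postings. */
final class InterleavingReaderSketch {

    /** Merges per-thread ascending docID lists into one ascending array. */
    static int[] interleave(List<int[]> perThreadDocIds) {
        // Heap entries are {docID, sourceIndex, positionInSource}, ordered by docID.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        int total = 0;
        for (int source = 0; source < perThreadDocIds.size(); source++) {
            int[] ids = perThreadDocIds.get(source);
            total += ids.length;
            if (ids.length > 0) {
                heap.add(new int[] { ids[0], source, 0 });
            }
        }

        int[] merged = new int[total];
        int out = 0;
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            merged[out++] = top[0];
            int[] sourceIds = perThreadDocIds.get(top[1]);
            int next = top[2] + 1;
            if (next < sourceIds.length) {
                // Advance the source that just produced the smallest docID.
                heap.add(new int[] { sourceIds[next], top[1], next });
            }
        }
        return merged;
    }
}
```

Each step costs O(log k) for k threads, so the aggregate view adds a per-document overhead that a single flushed segment would not have.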
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695723#action_12695723 ]

Shai Erera commented on LUCENE-1575:

bq. OK let's add both and call it out in CHANGES.txt?

Great. So I'll leave them as they are in the latest patch and add a note to CHANGES.

bq. Yes let's make it final. We need to eek...

This isn't necessary after all, since the class is now abstract, with a private ctor and two private final internal classes, which will be the concrete objects returned by create().

Before submitting the next patch version, I'd like to verify whether super.collect() in TFC is the cause of the perf. degradation. We should also perf. test sorting w/o score tracking and note if there is any improvement over trunk. I'm downloading the latest enwiki xml (20090306), so I hope that sometime tomorrow it will finish the download, extract, indexing and search tests.

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].

We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, as well as be overridden.
** Introduce a new topDocs(start, howMany) method which will be used as a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector, and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises a few questions:
* What if during collect() Scorer is null? (i.e., not set) - is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?

Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method). There might even be a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?)
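The topDocs(start, howMany) paging method proposed in the issue description could behave roughly as sketched below. This is not Lucene code: PagingCollector and Hit are hypothetical names, java.util.PriorityQueue stands in for Lucene's own PriorityQueue, and returning an empty array for an out-of-range request is an assumption. The point it illustrates is that only an array of the requested page size is allocated.

```java
import java.util.PriorityQueue;

/** Hypothetical stand-in for Lucene's ScoreDoc: a doc id plus its score. */
final class Hit {
    final int doc;
    final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
}

/** Hypothetical collector that keeps the numHits best hits in a min-heap. */
final class PagingCollector {
    private final int numHits;
    private int totalHits;
    // Min-heap: the weakest hit sits at the head so it can be evicted cheaply.
    // On equal scores the higher doc id counts as the weaker hit.
    private final PriorityQueue<Hit> pq = new PriorityQueue<Hit>(
        (a, b) -> a.score != b.score ? Float.compare(a.score, b.score)
                                     : Integer.compare(b.doc, a.doc));

    PagingCollector(int numHits) { this.numHits = numHits; }

    void collect(int doc, float score) {
        totalHits++;
        pq.add(new Hit(doc, score));
        if (pq.size() > numHits) pq.poll(); // drop the weakest hit
    }

    int getTotalHits() { return totalHits; }

    /**
     * Returns the hits ranked [start, start + howMany) in best-first order.
     * Allocates only the requested page. Drains the queue, so call it once.
     */
    Hit[] topDocs(int start, int howMany) {
        int size = pq.size();
        if (start < 0 || start >= size || howMany <= 0) return new Hit[0];
        int count = Math.min(howMany, size - start);
        Hit[] page = new Hit[count];
        // Discard the hits ranked after the requested page (they pop first).
        for (int i = size - start - count; i > 0; i--) pq.poll();
        // The next pops are the page itself, weakest first: fill back to front.
        for (int i = count - 1; i >= 0; i--) page[i] = pq.poll();
        return page;
    }
}
```

A results page for "hits 3 and 4 of the top 5" would then be requested as topDocs(2, 2).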
[jira] Commented: (LUCENE-1231) Column-stride fields (aka per-document Payloads)
[ https://issues.apache.org/jira/browse/LUCENE-1231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695726#action_12695726 ]

Michael McCandless commented on LUCENE-1231:

{quote}
Eventually we need more flexibility to utilize the flexible indexing chain anyway. We need to store which codec to use for a field. Then we could also just make a new codec for column-stride fields and maybe then we do not have to introduce a new Field API.
{quote}

By creating a custom indexing chain you could actually write CSF, today. But the lack of extensibility of Field needs to be addressed: you need some way to store something arbitrary and opaque into a field such that your indexing chain could pick it up and act. And FieldInfos also needs to store this opaque thing for me API.

One of the big changes in LUCENE-1458 is to strongly separate different fields on the read APIs. EG there is a separate FieldsEnum from TermsEnum, meaning you first seek to the field you want, then request a TermsEnum from that, which can iterate through the terms only for that field. It's the codec's job to return the right TermsEnum for a given field.

Not to delay 2.9 further, but... I also wonder, if Lucene had NumericField (say), how it would simplify things here. EG, today, if I have a field "weight" that is a float, I'm going to have to set something to tell the CSF (man, the similarity of that to CFS is going to cause problems!) writer to cast-it-and-save-it-as-float-array to disk; I'm going to have to tell the TrieRangeUtil to do the same, etc. It'd be much better if that field stored a float (not String), and if it defaulted naturally to using these two special indexers...

{quote}
DataIn(Out)put would implement the different read and write methods, whereas IndexIn(Out)put would only implement methods like close(), seek(), getFilePointer(), length(), flush(), etc.
{quote}

What is the fastest way in Java to slurp in a bunch of bytes as an int[], short[], float[], etc? Seems that we need to answer that first and then work out how to fix our store APIs to enable that. (Maybe it's IntBuffer wrapping ByteBuffer, instead of an int[]?)

{quote}
The danger here compared to the current payloads API would be that the user might read too few or too many bytes of a CSF, which would result in an undefined and possibly hard to debug behavior.
{quote}

I think it's better to have good performance with added risk of danger, than forced handholding always.

{quote}
The SafeAccessor would count for you the number of read bytes and throw exceptions if you don't consume the number of bytes you should consume.
{quote}

I generally prefer liberal use of asserts to trip bugs like this, instead of explicit strongly divorced code paths / classes / modes etc., containing real if statements at production runtime.

Column-stride fields (aka per-document Payloads)
Key: LUCENE-1231
URL: https://issues.apache.org/jira/browse/LUCENE-1231
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
Fix For: 3.0

This new feature has been proposed and discussed here: http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results

Currently it is possible in Lucene to store data as stored fields or as payloads. Stored fields provide good performance if you want to load all fields for one document, because this is a sequential I/O operation. If you however want to load the data from one field for a large number of documents, then stored fields perform quite badly, because lots of I/O seeks might have to be performed.

A better way to do this is using payloads. By creating a special posting list that has one posting with payload for each document you can simulate a column-stride field. The performance is significantly better compared to stored fields, however still not optimal. The reason is that for each document the freq value, which is in this particular case always 1, has to be decoded, and also one position value, which is always 0, has to be loaded.

As a solution we want to add real column-stride fields to Lucene. A possible format for the new data structure could look like this (CSD stands for column-stride data; once we decide on a final name for this feature we can change this):

CSDList -- FixedLengthList | VariableLengthList, SkipList
FixedLengthList -- Payload^SegSize
VariableLengthList -- DocDelta, PayloadLength?, Payload
Payload -- Byte^PayloadLength
PayloadLength -- VInt
SkipList -- see frq.file

We distinguish here between the fixed length and the variable length cases. To allow flexibility, Lucene could automatically pick the right
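On the question raised in the comment above about reading a run of bytes as an int[], the IntBuffer-over-ByteBuffer idea can be tried with the standard java.nio classes. The sketch below only shows what the two options look like; IntSlurp and its method names are hypothetical, big-endian byte order is an assumption, and it makes no claim about which variant is fastest (the thread leaves that open).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

/** Hypothetical helper comparing a buffer view with a bulk copy into int[]. */
final class IntSlurp {
    /** Views the bytes as big-endian ints; no per-element decoding loop in user code. */
    static IntBuffer asInts(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).asIntBuffer();
    }

    /** Bulk-copies the view into a plain int[] when an array is required. */
    static int[] toIntArray(byte[] bytes) {
        IntBuffer view = asInts(bytes);
        int[] out = new int[view.remaining()];
        view.get(out);
        return out;
    }
}
```

The same pattern applies to short[] and float[] via asShortBuffer() and asFloatBuffer().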
[jira] Commented: (LUCENE-1584) Callback for intercepting merging segments in IndexWriter
[ https://issues.apache.org/jira/browse/LUCENE-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695742#action_12695742 ]

Michael McCandless commented on LUCENE-1584:

I'd like to step back and understand the wider use case / context that's driving this need (to know precisely when segments got merged). EG if we fix Lucene's field cache, and Lucene's near real-time search manages CSF's efficiently in memory, does that address the use case behind this?

It's possible that we should simply make SegmentInfo(s) public, so that MergePolicy/Scheduler can be fully created external to Lucene, and track all specifics of why/when merges are happening. But those APIs have a high surface area, and we do make changes over time.

Callback for intercepting merging segments in IndexWriter
-
Key: LUCENE-1584
URL: https://issues.apache.org/jira/browse/LUCENE-1584
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.4.1
Reporter: Jason Rutherglen
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1584.patch
Original Estimate: 96h
Remaining Estimate: 96h

For things like merging field caches or bitsets, it's useful to know which segments were merged to create a new segment.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695743#action_12695743 ]

Michael McCandless commented on LUCENE-1575:

{quote}
There are no super.XXX calls. The two FVHQ implementations just implement lessThan according to whether it's a single comparator or multi case. This removes the check of numComparators == 1.
{quote}

Excellent!

bq. Before submitting the next patch version, I'd like to verify if super.collect() in TFC is the cause of the perf. degradation

I'll run this and post back.
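The FVHQ change discussed above (an abstract class with a private ctor, two private final nested classes, and a static create() that picks one of them) follows a common factory shape. The sketch below illustrates that shape only; HitQueueSketch, One and Multi are hypothetical names, and java.util.Comparator stands in for Lucene's field comparators. Choosing the subclass once in create() is what removes the per-call numComparators == 1 check from lessThan.

```java
import java.util.Comparator;
import java.util.List;

/** Hypothetical analogue of FVHQ: abstract, private ctor, static factory. */
abstract class HitQueueSketch<T> {
    private HitQueueSketch() {}

    /** True if a ranks below b. */
    abstract boolean lessThan(T a, T b);

    /** Picks the concrete implementation once, based on the comparator count. */
    static <T> HitQueueSketch<T> create(List<Comparator<T>> comparators) {
        if (comparators.isEmpty()) {
            throw new IllegalArgumentException("need at least one comparator");
        }
        if (comparators.size() == 1) {
            return new One<T>(comparators.get(0));
        }
        return new Multi<T>(comparators);
    }

    /** Single-comparator case: no loop and no count check per call. */
    private static final class One<T> extends HitQueueSketch<T> {
        private final Comparator<T> comparator;
        One(Comparator<T> comparator) { this.comparator = comparator; }
        @Override boolean lessThan(T a, T b) { return comparator.compare(a, b) < 0; }
    }

    /** Multi-comparator case: the first non-zero comparison decides. */
    private static final class Multi<T> extends HitQueueSketch<T> {
        private final List<Comparator<T>> comparators;
        Multi(List<Comparator<T>> comparators) { this.comparators = comparators; }
        @Override boolean lessThan(T a, T b) {
            for (Comparator<T> c : comparators) {
                int cmp = c.compare(a, b);
                if (cmp != 0) return cmp < 0;
            }
            return false; // equal under every comparator
        }
    }
}
```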
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695745#action_12695745 ]

Michael McCandless commented on LUCENE-1575:

Odd -- inlining super.collect into OCSC, and making OCSC final, did not alter the numbers much (I re-ran trunk baseline to confirm it's close to prior trunk baseline):

||query||sort||hits||qps||qpsnew||pctg||
|147|score| 6953|3635.8|3650.1| 0.4%|
|147|title| 6953|2915.7|2297.6|-21.2%|
|147|doc| 6953|3265.6|2665.8|-18.4%|
|text|score| 157101| 208.5| 202.9| -2.7%|
|text|title| 157101| 97.0| 85.4|-12.0%|
|text|doc| 157101| 174.3| 125.0|-28.3%|
|1|score| 565452| 58.2| 56.6| -2.7%|
|1|title| 565452| 44.6| 34.6|-22.4%|
|1|doc| 565452| 49.2| 35.2|-28.5%|
|1 OR 2|score| 784928| 14.1| 13.7| -2.8%|
|1 OR 2|title| 784928| 12.6| 11.5| -8.7%|
|1 OR 2|doc| 784928| 13.0| 11.9| -8.5%|
|1 AND 2|score| 333153| 15.6| 15.5| -0.6%|
|1 AND 2|title| 333153| 14.8| 13.7| -7.4%|
|1 AND 2|doc| 333153| 15.2| 14.2| -6.6%|
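The pctg column in the benchmark table above appears to be the relative change of qpsnew against the trunk qps, rounded to one decimal place. That formula is an inference from the numbers, not something the comment states, and QpsDelta is a hypothetical name. A small sketch for re-deriving a row:

```java
/** Hypothetical helper for re-deriving the pctg column of the benchmark table. */
final class QpsDelta {
    /** Percentage change of qpsNew relative to the qps baseline, rounded to one decimal. */
    static double pctChange(double qps, double qpsNew) {
        double pct = (qpsNew - qps) / qps * 100.0;
        return Math.round(pct * 10.0) / 10.0;
    }
}
```

For the 147/title row, (2297.6 - 2915.7) / 2915.7 is about -0.212, so pctChange(2915.7, 2297.6) gives -21.2, matching the table.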
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695746#action_12695746 ]

Michael McCandless commented on LUCENE-1575:

bq. We should also perf. test sorting w/o score tracking and note if there is any improvement over trunk.

Let's wait a bit until we sort things out (eg, w/ current patch, TermScorer will still compute its score even if I don't need it).
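The concern above is that a collector which does not need scores should not pay for computing them. Under the setScorer/collect(doc) proposal from the issue description, the score would only be computed when the collector asks for it. The sketch below illustrates that intent with hypothetical types (ScorerSketch, CollectorSketch and the two collectors are not Lucene classes) and counts how often score() is actually invoked.

```java
/** Hypothetical minimal scorer: computes the current document's score on demand. */
interface ScorerSketch {
    float score();
}

/** Hypothetical minimal collector following the setScorer/collect(doc) proposal. */
abstract class CollectorSketch {
    protected ScorerSketch scorer;
    void setScorer(ScorerSketch scorer) { this.scorer = scorer; }
    abstract void collect(int doc);
}

/** Counts hits only; never asks for a score. */
final class CountingCollector extends CollectorSketch {
    int count;
    @Override void collect(int doc) { count++; }
}

/** Tracks the best score, so it has to call scorer.score() for every hit. */
final class MaxScoreCollector extends CollectorSketch {
    float max = Float.NEGATIVE_INFINITY;
    @Override void collect(int doc) { max = Math.max(max, scorer.score()); }
}

/** Drives a collector over fixed scores and reports the number of score() calls. */
final class LazyScoreDemo {
    static int run(CollectorSketch collector, float[] scores) {
        int[] calls = {0};   // how many times score() was invoked
        int[] current = {0}; // the doc the toy scorer is positioned on
        collector.setScorer(() -> { calls[0]++; return scores[current[0]]; });
        for (int doc = 0; doc < scores.length; doc++) {
            current[0] = doc;
            collector.collect(doc);
        }
        return calls[0];
    }
}
```

With three hits, the counting collector triggers zero score() calls and the max-score collector triggers three.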
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695750#action_12695750 ]

Michael McCandless commented on LUCENE-1575:

Shai, can you post your latest patch, where TermScorer itself is passed down to the collector?
[jira] Created: (LUCENE-1586) add IndexReader.getUniqueTermCount
add IndexReader.getUniqueTermCount
--
Key: LUCENE-1586
URL: https://issues.apache.org/jira/browse/LUCENE-1586
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9

Simple API to return number of unique terms (across all fields). Spinoff from here: http://www.lucidimagination.com/search/document/536b22e017be3e27/term_limit
[jira] Updated: (LUCENE-1586) add IndexReader.getUniqueTermCount
[ https://issues.apache.org/jira/browse/LUCENE-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1586:
---
Attachment: LUCENE-1586.patch

Attached patch. I plan to commit in a day or two...
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1575:
---
Attachment: LUCENE-1575.7.patch

- Changed TermScorer.score() method to not call Similarity.decodeNorm. If we can change Scorer.similarity to be protected, we can give up the getSimilarity() call in score(). Also changed TermScorer.score(Collector) to set 'this' as the collector's scorer.
- Deprecated TimeLimitedCollector, created new TimeLimitingCollector, renamed TestTimeLimitedCollector to TestTimeLimitingCollector and used the new TimeLimitingCollector.
- Changed FVHQ to have a static create which returns One/MultiComparatorFieldValueHitQueue version.
- Changed TopFieldCollector setNextReader versions to not call pq.size() but rather use numHits.
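Two pieces of this design fit together: a scorer that sets itself as the collector's scorer before iterating, and a HitCollectorWrapper that adapts an old-style collect(doc, score) collector to the new API. The sketch below shows that interplay with hypothetical stand-ins (OldHitCollector, NewCollector, ScoreSource, WrapperSketch and ToyScorer are not Lucene classes), and passing a plain int docBase to setNextReader is a simplification made here.

```java
/** Hypothetical stand-in for the deprecated API: the score is pushed with every hit. */
abstract class OldHitCollector {
    abstract void collect(int doc, float score);
}

/** Hypothetical minimal scorer interface. */
interface ScoreSource {
    float score();
}

/** Hypothetical stand-in for the new API: the score is pulled on demand. */
abstract class NewCollector {
    abstract void setScorer(ScoreSource scorer);
    abstract void setNextReader(int docBase);
    abstract void collect(int doc);
}

/** Adapts an OldHitCollector to NewCollector: re-bases doc ids and pulls the score. */
final class WrapperSketch extends NewCollector {
    private final OldHitCollector delegate;
    private ScoreSource scorer;
    private int docBase;

    WrapperSketch(OldHitCollector delegate) { this.delegate = delegate; }

    @Override void setScorer(ScoreSource scorer) { this.scorer = scorer; }
    @Override void setNextReader(int docBase) { this.docBase = docBase; }
    @Override void collect(int doc) { delegate.collect(docBase + doc, scorer.score()); }
}

/** Toy scorer that hands itself to the collector before iterating its docs. */
final class ToyScorer implements ScoreSource {
    private final float[] scores;
    private int doc = -1;

    ToyScorer(float[] scores) { this.scores = scores; }

    @Override public float score() { return scores[doc]; }

    void score(NewCollector collector) {
        collector.setScorer(this); // the scorer passes itself down
        for (doc = 0; doc < scores.length; doc++) collector.collect(doc);
    }
}
```

Because the wrapper always calls scorer.score(), old-style collectors keep their behavior while new collectors are free to skip scoring.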
[jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695782#action_12695782 ]

Michael McCandless commented on LUCENE-1575:

OK thanks. Numbers w/ new patch:

||query||sort||hits||qps||qpsnew||pctg||
|147|score|6953|3635.8|3704.1|1.9%|
|147|title|6953|2915.7|2262.9|-22.4%|
|147|doc|6953|3265.6|2655.1|-18.7%|
|text|score|157101|208.5|199.9|-4.1%|
|text|title|157101|97.0|87.1|-10.2%|
|text|doc|157101|174.3|134.6|-22.8%|
|1|score|565452|58.2|56.5|-2.9%|
|1|title|565452|44.6|35.3|-20.9%|
|1|doc|565452|49.2|38.0|-22.8%|
|1 OR 2|score|784928|14.1|13.8|-2.1%|
|1 OR 2|title|784928|12.6|11.6|-7.9%|
|1 OR 2|doc|784928|13.0|11.9|-8.5%|
|1 AND 2|score|333153|15.6|15.4|-1.3%|
|1 AND 2|title|333153|14.8|13.7|-7.4%|
|1 AND 2|doc|333153|15.2|14.2|-6.6%|

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py

This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].

We have agreed to do the following refactoring:
* Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations.
* Deprecate HitCollector in favor of the new Collector.
* Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector.
** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector.
** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector.
** It will remove any instanceof checks that currently exist in IndexSearcher code.
* Create a new (abstract) TopDocsCollector, which will:
** Leave collect and setNextReader unimplemented.
** Introduce protected members PriorityQueue and totalHits.
** Introduce a single protected constructor which accepts a PriorityQueue.
** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-is by extending classes, or overridden.
** Introduce a new topDocs(start, howMany) method, to be used as a convenience method when implementing a search application that allows paging through search results. It will also attempt to improve memory allocation by allocating a ScoreDoc[] of the requested size only.
* Change TopScoreDocCollector to extend TopDocsCollector and use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final.
* Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany).
* Change TopFieldDocCollector (deprecated) to extend TopDocsCollector instead of TopScoreDocCollector. Implement topDocs(start, howMany).
* Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead.

Additionally, the following proposal was made w.r.t. decoupling score from collect():
* Change collect to accept only a doc Id (unbased).
* Introduce a setScorer(Scorer) method.
* If during collect the implementation needs the score, it can call scorer.score().

If we do this, then we need to review all places in the code where collect(doc, score) is called, and check whether a Scorer can be passed. This also raises a few questions:
* What if during collect() Scorer is null (i.e., not set)? Is it even possible?
* I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't that mean the score is always needed in collect()?

Open issues:
* The name for Collector
* TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Collector ResultsCollector. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output.
* Decoupling score from collect().

I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use
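The proposal above to decouple the score from collect() can be sketched as follows. All types here are hypothetical stand-ins rather than Lucene's classes: ScoreSource plays the role of Scorer, SketchCollector the role of the proposed Collector, and ArrayScorer is a toy search loop. Throwing when no scorer was set is just one possible answer to the "Scorer is null" question, not something the issue decides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's Scorer: exposes only the current doc's score.
interface ScoreSource {
    float score();
}

// Hypothetical collector base class following the proposal:
// collect receives only the doc id, the score is pulled on demand.
abstract class SketchCollector {
    protected ScoreSource scorer;
    void setScorer(ScoreSource scorer) { this.scorer = scorer; }
    abstract void collect(int doc);
}

// Never asks for the score, so no scoring work is triggered.
final class CountingCollector extends SketchCollector {
    int count;
    @Override void collect(int doc) { count++; }
}

// Needs the score: keeps only docs scoring above zero.
final class PositiveScoreCollector extends SketchCollector {
    final List<Integer> docs = new ArrayList<>();
    @Override void collect(int doc) {
        if (scorer == null) throw new IllegalStateException("setScorer was not called");
        if (scorer.score() > 0.0f) docs.add(doc);
    }
}

// Toy driver playing the role of the search loop; counts score() calls.
final class ArrayScorer implements ScoreSource {
    private final float[] scores;
    private int current = -1;
    int scoreCalls;

    ArrayScorer(float[] scores) { this.scores = scores; }

    @Override public float score() {
        scoreCalls++;
        return scores[current];
    }

    void run(SketchCollector collector) {
        collector.setScorer(this);           // set once, before any collect call
        for (current = 0; current < scores.length; current++) {
            collector.collect(current);
        }
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 0.0f, 1.2f, -0.1f};
        ArrayScorer counting = new ArrayScorer(scores);
        CountingCollector counter = new CountingCollector();
        counting.run(counter);
        System.out.println(counter.count + " docs, " + counting.scoreCalls + " score() calls");
    }
}
```

A collector that ignores scores, such as the counting one, never triggers a score() call in this model, which is the saving the proposal is after.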
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1575:
---
Attachment: LUCENE-1575.patch

Attached patch; only differences are:
* Under contrib/benchmark I made changes so you can specify non-scoring field sorting
* Fixed the rename of TestTimeLimitedCollector -- Limiting to be patch-friendly

OK I ran performance with score tracking disabled during field sorted search:

||query||sort||hits||qps||qpsnew||pctg||
|147|title|6953|2915.7|4043.3|38.7%|
|147|doc|6953|3265.6|4840.1|48.2%|
|text|title|157101|97.0|128.0|32.0%|
|text|doc|157101|174.3|273.2|56.7%|
|1|title|565452|44.6|60.2|35.0%|
|1|doc|565452|49.2|75.3|53.0%|
|1 OR 2|title|784928|12.6|14.8|17.5%|
|1 OR 2|doc|784928|13.0|15.2|16.9%|
|1 AND 2|title|333153|14.8|17.9|20.9%|
|1 AND 2|doc|333153|15.2|18.9|24.3%|

Very nice speedups! We just have to figure out why the score-tracking variant got slower...

Refactoring Lucene collectors (HitCollector and extensions)
---
Key: LUCENE-1575
URL: https://issues.apache.org/jira/browse/LUCENE-1575
Project: Lucene - Java
Issue Type: Improvement
Components: Search
Reporter: Shai Erera
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py
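For readers checking the benchmark tables above: the pctg column appears to be the relative change of qpsnew over qps. That reading is an assumption, but it matches the rows I checked. A minimal sketch (QpsChange and pctChange are hypothetical names, not part of the benchmark scripts):

```java
import java.util.Locale;

final class QpsChange {
    // Relative change of qpsNew versus qps, in percent.
    static double pctChange(double qps, double qpsNew) {
        return (qpsNew - qps) / qps * 100.0;
    }

    public static void main(String[] args) {
        // Row "147 / title" with score tracking disabled: 2915.7 -> 4043.3 qps.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctChange(2915.7, 4043.3)));
        // Same row with score tracking enabled: 2915.7 -> 2262.9 qps.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", pctChange(2915.7, 2262.9)));
    }
}
```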
[jira] Commented: (LUCENE-1586) add IndexReader.getUniqueTermCount
[ https://issues.apache.org/jira/browse/LUCENE-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695787#action_12695787 ]

Uwe Schindler commented on LUCENE-1586:
---
Hi Mike, why not just use getSequentialSubReaders() in the default implementation and recursively sum up all term counts? getSequentialSubReaders is part of the IndexReader API, so it is also available in the abstract class. SegmentReader can override the method and return its real count. If getSequentialSubReaders() returns null, throw the UOE.

add IndexReader.getUniqueTermCount
--
Key: LUCENE-1586
URL: https://issues.apache.org/jira/browse/LUCENE-1586
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9
Attachments: LUCENE-1586.patch

Simple API to return number of unique terms (across all fields). Spinoff from here: http://www.lucidimagination.com/search/document/536b22e017be3e27/term_limit

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1586) add IndexReader.getUniqueTermCount
[ https://issues.apache.org/jira/browse/LUCENE-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695788#action_12695788 ]

Uwe Schindler commented on LUCENE-1586:
---
Sorry, that cannot work, because the segments can share the same terms, so the sum is always greater than or equal to the real unique term count.
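The objection is easy to see on a toy model. In the sketch below each segment is modelled as a plain set of term strings; this is an assumption for illustration only and does not use the IndexReader API, and TermCountSketch with its methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TermCountSketch {
    // Sum of per-segment unique term counts (the withdrawn suggestion).
    static int sumOfSegmentCounts(List<Set<String>> segments) {
        int sum = 0;
        for (Set<String> segment : segments) sum += segment.size();
        return sum;
    }

    // Unique term count across all segments: the size of the union.
    static int uniqueAcrossSegments(List<Set<String>> segments) {
        Set<String> union = new HashSet<String>();
        for (Set<String> segment : segments) union.addAll(segment);
        return union.size();
    }

    public static void main(String[] args) {
        List<Set<String>> segments = new ArrayList<Set<String>>();
        segments.add(new HashSet<String>(Arrays.asList("apache", "lucene", "search")));
        segments.add(new HashSet<String>(Arrays.asList("lucene", "index")));
        System.out.println(sumOfSegmentCounts(segments));   // prints 5
        System.out.println(uniqueAcrossSegments(segments)); // prints 4
    }
}
```

The term "lucene" occurs in both segments, so the sum counts it twice. The two numbers agree only when no term is shared between segments.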