[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.9.patch bq. Did you add a test case verifying maxScore is correct (so that the Float.NaN issue would trip the test)? I added to following tests: * testSortWithoutScoreTracking - asserts that ScoreDoc.score is set to Float.NaN as well as maxScore. * testSortWithScoreNoMaxScoreTracking - asserts that ScoreDoc.score is not Float.NaN, but maxScore is. * testSortWithScoreAndMaxScoreTracking - asserts that both ScoreDoc.score and maxScore are not set to NaN. * testSortWithScoreAndMaxScoreTrackingNoResults - asserts that in case of a maxScore tracking collector with 0 results, maxScore is set to Float.NaN, rather than NEG_INF. bq. HitCollector isn't deprecated Somehow when I applied your patch, this change wasn't taken in. Anyway, I noticed its javadocs also referenced MRHC, so I fixed it also. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.9.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: (was: LUCENE-1575.9.patch) Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.9.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1575: --- Attachment: LUCENE-1575.patch Attached new patch: * Changed members methods in TopFieldCollector from protected to package-private. * Tweaked javadocs, CHANGES.txt * Removed some dead code, nocommits * Re-added TestTimeLimitedCollector Besides the java ghosts, for which we will close our eyes and hope they disappear, I think this is ready to go in! I'll way a few days and then commit. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. -
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1575: --- Attachment: LUCENE-1575.patch New patch which just fixes contrib/spatial's cutover to the new API to further cutover to the new new API. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail:
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.8.patch Added JustCompileSearch, JustCompileSearchFunction and JustCompileSearchSpans that extend/implement all abstract classes/interfaces in o.a.l.s, o.a.l.s.s and o.a.l.s.f. Those are not unit tests per-sei, however if anyone will change the interfaces/abstract classes in a way that it breaks back-compat, we'll know it right away. I think that in general this is something good to have for Lucene overall, however I only took care of the search.* packages in this patch. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.8.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, PerfTest.java, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1575: --- Attachment: sortCollate5.py sortBench5.py I'm attaching the Python scripts I use to run the tests. You also need this small mod: {code} Index: contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java === --- contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java (revision 761709) +++ contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask/tasks/ReadTask.java (working copy) @@ -63,6 +63,9 @@ super(runData); } + // nocommit + static boolean first = true; + public int doLogic() throws Exception { int res = 0; boolean closeReader = false; @@ -101,6 +104,11 @@ } else { hits = searcher.search(q, numHits); } +// nocommit +if (first) { + System.out.println(NUMHITS= + hits.totalHits); + first = false; +} //System.out.println(q= + q + : + hits.totalHits + total hits); if (withTraverse()) { {code} All the python scripts do is write an alg, run it, gather the results, and collate in the end. You run sortBench5.py once on trunk and once in a checkout with this patch, each time in the contrib/benchmark directory. It saves a pickle file (results.pk) which sortCollate5.py then loads (you'll have to edit the hardwired paths in sortCollate5.py). Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.7.patch - Changed TermScorer.score() method to not call Similarity.decodeNorm. If we can change Scorer.similarity to be protected, we can give up getSimilarity() call in score(). Also changed TermScorer.score(Collector) to set 'this' as the collector's scorer. - Deprecated TimeLimitedCollector, created new TimeLimitingCollector, renamed TestTimeLimitedCollector to TestTimeLimitingCollector and used the new TimeLimitingCollector. - Changed FVHQ to have a static create which returns One/MultiComparatorFieldValueHitQueue version. - Changed TopFieldCollector setNextReader versions to not call pq.size() but rather use numHits. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1575: --- Attachment: LUCENE-1575.patch Attached patch; only differences are: * Under contrib/benchmark I made changes so you can specify non-scoring field sorting * Fixed the rename of TestTimeLimitedCollector -- Limiting to be patch-friendly OK I ran performance with score tracking disabled during field sorted search: ||query||sort||hits||qps||qpsnew||pctg|| |147|title| 6953|2915.7|4043.3| 38.7%| |147|doc| 6953|3265.6|4840.1| 48.2%| |text|title| 157101| 97.0| 128.0| 32.0%| |text|doc| 157101| 174.3| 273.2| 56.7%| |1|title| 565452| 44.6| 60.2| 35.0%| |1|doc| 565452| 49.2| 75.3| 53.0%| |1 OR 2|title| 784928| 12.6| 14.8| 17.5%| |1 OR 2|doc| 784928| 13.0| 15.2| 16.9%| |1 AND 2|title| 333153| 14.8| 17.9| 20.9%| |1 AND 2|doc| 333153| 15.2| 18.9| 24.3%| Very nice speedups! We just have to figure out why the score-tracking variant got slower... Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless updated LUCENE-1575: --- Attachment: LUCENE-1575.patch OK, I attached a new patch with some minor changes: * Beefed up javadocs in Collector.java; fixed other javadocs warnings. Tweaked CHANGES.txt. * Renamed PositiveOnlyScoresCollector -- PositiveScoresOnlyCollector And also came across these questions/issues: * TopFieldCollector's updateBottom add methods take score, and are passed score from the non-scoring collectors, but shouldn't? * TermScorer need not override score(HitCollector hc) (super does the same thing). * The changes to TermScorer make me a bit nervous. EG, the new InternalScorer: will it hurt performance? Also this part: {code} +// Set the Scorer doc and score before calling collect in case it will be +// used in collect() +s.d = doc; +s.score = score; +c.collect(doc); // collect score {code} is spooky: I don't like how we worry that one may call scorer.doc() (I don't like the ambiguity in the API -- we both pass doc and fear you may call scorer.doc()). Not sure how to resolve it. * Hmm -- we added a new abstract method to src/java/org/apache/lucene/search/Searcher.java (that accepts Collector). Should that method be concrete (and throw UOE), for back compat? * We've also added a method to the Searchable interface, which is a break in back-compat. But my feeling is we should allow this break (but Shai can you add another Note at the top of CHANGES.txt, calling this out?). Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.5.patch Fixed TestFieldNormModifier and TestLengthNormModifier. All tests pass now (including contrib) Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.6.patch Changes: # TimeLimitedCollector, TestTimeLimitedCollector and CHANGES. # I also fixed a bug in TestTermScorer, which was discovered by the test-tag task, and existed since 1483 and propagated into HitCollectorWrapper as well: docBase was set to -1 by default, relying on setNextReader to be called. However if it's not called (as in TestTermScorer, or if someone called Scorer.score(Collector)), all document Ids are shifted backwards by 1. The test had a bug which asserted on the unshifted doc Id, and after I fixed the Ids to shift, it failed. Anyway, the test now works correctly, as well as HCW. # I checked all other Collector implementations and changed the default base to 0, unless in some test cases, where -1 had a meaning. All tests (contrib, core and tags) pass. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Assignee: Michael McCandless Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, LUCENE-1575.6.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.patch Eventually I decided to include just one patch file (instead of code and test) since it was simpler after all. Please be sure to review the following: # Collector class and documentation. # New TopDocsCollector class. # TopFieldCollector refactoring. # Methods deprecation. # New TestTopDocsCollector as well as test cases in TestSort. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.1.patch oops :) leftovers from when it extended MultiReaderHitCollector (now called Collector) Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.2.patch Thanks Mike. I ran the javadocs task and found other mentions of MultiReaderHitCollector as well as fixed some more javadocs. BTW, the javadoc Ant task outputs many errors on missing files/names, but that something for another issue. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1575: --- Attachment: LUCENE-1575.3.patch Includes the latest comments from Mike. Refactoring Lucene collectors (HitCollector and extensions) --- Key: LUCENE-1575 URL: https://issues.apache.org/jira/browse/LUCENE-1575 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, LUCENE-1575.3.patch, LUCENE-1575.patch This issue is a result of a recent discussion we've had on the mailing list. You can read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. We have agreed to do the following refactoring: * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base class for all Collector implementations. * Deprecate HitCollector in favor of the new Collector. * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that accept HitCollector. ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in IndexSearcher, wrapping the given HitCollector. ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we remove HitCollector. ** It will remove any instanceof checks that currently exist in IndexSearcher code. * Create a new (abstract) TopDocsCollector, which will: ** Leave collect and setNextReader unimplemented. ** Introduce protected members PriorityQueue and totalHits. ** Introduce a single protected constructor which accepts a PriorityQueue. ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can be used as-are by extending classes, as well as be overridden. ** Introduce a new topDocs(start, howMany) method which will be used a convenience method when implementing a search application which allows paging through search results. It will also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested size only. * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits() implementations as they are from TopDocsCollector. The class will also be made final. * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also implement topDocs(start, howMany). * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector. Implement topDocs(start, howMany) * Review other places where HitCollector is used, such as in Scorer, deprecate those places and use Collector instead. Additionally, the following proposal was made w.r.t. decoupling score from collect(): * Change collect to accecpt only a doc Id (unbased). * Introduce a setScorer(Scorer) method. * If during collect the implementation needs the score, it can call scorer.score(). If we do this, then we need to review all places in the code where collect(doc, score) is called, and assert whether Scorer can be passed. Also this raises few questions: * What if during collect() Scorer is null? (i.e., not set) - is it even possible? * I noticed that many (if not all) of the collect() implementations discard the document if its score is not greater than 0. Doesn't it mean that score is needed in collect() always? Open issues: * The name for Collector * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think TopDocsCollector makes sense, because of its TopDocs output. * Decoupling score from collect(). I will post a patch a bit later, as this is expected to be a very large patch. I will split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector, as well as testing the new topDocs(start, howMany) method. There might be even a 3rd patch which handles the setScorer thing in Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org