[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335541921 ## File path: lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java ## @@ -253,6 +263,123 @@ public void testDocFiltering() throws Exception { iw.close(); } + /** + * Test that the correct amount of documents are collected if using a collector that also rejects documents. + */ + public void testCollectorThatRejects() throws Exception { +// use synonym analyzer to have multiple paths to same suggested document. This mock adds "dog" as synonym for "dogs" +Analyzer analyzer = new MockSynonymAnalyzer(); +RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwcWithSuggestField(analyzer, "suggest_field")); +List expectedResults = new ArrayList(); + +for (int docCount = 10; docCount > 0; docCount--) { + Document document = new Document(); + String value = "ab" + docCount + " dogs"; + document.add(new SuggestField("suggest_field", value, docCount)); + expectedResults.add(new Entry(value, docCount)); + iw.addDocument(document); +} + +if (rarely()) { + iw.commit(); +} + +DirectoryReader reader = iw.getReader(); +SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader); + +PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new Term("suggest_field", "ab")); +int topN = 5; + +// use a TopSuggestDocsCollector that rejects results with duplicate docIds +TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, false) { + + private Set seenDocIds = new HashSet<>(); + + @Override + public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException { + int globalDocId = docID + docBase; + boolean collected = false; + if (seenDocIds.contains(globalDocId) == false) { + super.collect(docID, key, context, score); + seenDocIds.add(globalDocId); + collected = true; + } + return collected; + } + + @Override + protected boolean canReject() { +return true; + } +}; + +indexSearcher.suggest(query, collector); +TopSuggestDocs suggestions = collector.get(); +assertSuggestions(suggestions, expectedResults.subList(0, topN).toArray(new Entry[0])); +assertTrue(suggestions.isComplete()); Review comment: I extended the existing test to the case where try getting the top 10. In this case the queue would have a max depth of 15, the reject count it 9. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335502868 ## File path: lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java ## @@ -253,6 +263,123 @@ public void testDocFiltering() throws Exception { iw.close(); } + /** + * Test that the correct amount of documents are collected if using a collector that also rejects documents. + */ + public void testCollectorThatRejects() throws Exception { +// use synonym analyzer to have multiple paths to same suggested document. This mock adds "dog" as synonym for "dogs" +Analyzer analyzer = new MockSynonymAnalyzer(); +RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwcWithSuggestField(analyzer, "suggest_field")); +List expectedResults = new ArrayList(); + +for (int docCount = 10; docCount > 0; docCount--) { + Document document = new Document(); + String value = "ab" + docCount + " dogs"; + document.add(new SuggestField("suggest_field", value, docCount)); + expectedResults.add(new Entry(value, docCount)); + iw.addDocument(document); +} + +if (rarely()) { + iw.commit(); +} + +DirectoryReader reader = iw.getReader(); +SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader); + +PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new Term("suggest_field", "ab")); +int topN = 5; + +// use a TopSuggestDocsCollector that rejects results with duplicate docIds +TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, false) { + + private Set seenDocIds = new HashSet<>(); + + @Override + public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException { + int globalDocId = docID + docBase; + boolean collected = false; + if (seenDocIds.contains(globalDocId) == false) { + super.collect(docID, key, context, score); + seenDocIds.add(globalDocId); + collected = true; + } + return collected; + } + + @Override + protected boolean canReject() { +return true; + } +}; + +indexSearcher.suggest(query, collector); +TopSuggestDocs suggestions = collector.get(); +assertSuggestions(suggestions, expectedResults.subList(0, topN).toArray(new Entry[0])); +assertTrue(suggestions.isComplete()); Review comment: This will happen if the estimated queue size in `NRTSuggester#lookup` is not large enough to accout for all rejected documents, e.g. when in this particular test we try to get the top 5 of only 5 documents. In that case the queue size heuristic in `NRTSuggester#getMaxTopNSearcherQueueSize` will only size to queue to 7 (topN + numDocs/2), which is less than the number of topN + rejections, so the TopResults returned will have the `isComplete` flag set. I can add that case to the existing test if this helps. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335389135 ## File path: lucene/core/src/java/org/apache/lucene/util/fst/Util.java ## @@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc node, T startOutput, boolean allowEmptyStri continue; } -if (results.size() == topN-1 && maxQueueDepth == topN) { - // Last path -- don't bother w/ queue anymore: - queue = null; Review comment: As far as I understand this optimization assumes we surely accept (and collect) the path later in L516s acceptResult(), which always seems to be the case for collectors that don't reject, but if the collector that is eventually called via NRTSuggesters acceptResult() chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me. > Have you run the suggest benchmarks to see if removing this opto hurt performance? No, where are they and how can I run them? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335389135 ## File path: lucene/core/src/java/org/apache/lucene/util/fst/Util.java ## @@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc node, T startOutput, boolean allowEmptyStri continue; } -if (results.size() == topN-1 && maxQueueDepth == topN) { - // Last path -- don't bother w/ queue anymore: - queue = null; Review comment: As far as I understand this optimization assumes we surely accept (and collect) the path later in L516s acceptResult(), which always seems to be the case for collectors that don't reject, but if the collector that is eventually called via NRTSuggesters acceptResult() chooses to reject this option, we were losing expected results. This surfaced in the prefix completion tests I added. @jimczi might be able to explain this a bit better than me. > Have you run the suggest benchmarks to see if removing this opto hurt performance? No, where are they and how can I run them? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335372402 ## File path: lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java ## @@ -253,6 +263,123 @@ public void testDocFiltering() throws Exception { iw.close(); } + /** + * Test that the correct amount of documents are collected if using a collector that also rejects documents. + */ + public void testCollectorThatRejects() throws Exception { +// use synonym analyzer to have multiple paths to same suggested document. This mock adds "dog" as synonym for "dogs" +Analyzer analyzer = new MockSynonymAnalyzer(); +RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwcWithSuggestField(analyzer, "suggest_field")); +List expectedResults = new ArrayList(); + +for (int docCount = 10; docCount > 0; docCount--) { + Document document = new Document(); + String value = "ab" + docCount + " dogs"; + document.add(new SuggestField("suggest_field", value, docCount)); + expectedResults.add(new Entry(value, docCount)); + iw.addDocument(document); +} + +if (rarely()) { + iw.commit(); +} + +DirectoryReader reader = iw.getReader(); +SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader); + +PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new Term("suggest_field", "ab")); +int topN = 5; + +// use a TopSuggestDocsCollector that rejects results with duplicate docIds +TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, false) { + + private Set seenDocIds = new HashSet<>(); + + @Override + public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException { + int globalDocId = docID + docBase; + boolean collected = false; + if (seenDocIds.contains(globalDocId) == false) { Review comment: The collector is called multiple times with the same docID because of the MockSynonymAnalyzer used in the test setup which adds "dog" for "dogs", so each document has two completion paths. This collector is meant to de-duplicate this. I added a note explaining this. This is a simplified version of the behaviour we observe in https://github.com/elastic/elasticsearch/issues/46445. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r332913649 ## File path: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java ## @@ -105,6 +105,10 @@ public TopSuggestDocs(TotalHits totalHits, SuggestScoreDoc[] scoreDocs, boolean this.isComplete = isComplete; } + public TopSuggestDocs(TotalHits totalHits, SuggestScoreDoc[] scoreDocs) { Review comment: I'm not sure about how bwc is handled in Lucene, especially if we intend to backport this to maybe 8.x, so I don't know if we need a ctor without the isComplete flag? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r332908048 ## File path: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java ## @@ -116,19 +123,28 @@ public TopSuggestDocs(TotalHits totalHits, SuggestScoreDoc[] scoreDocs) { */ public static TopSuggestDocs merge(int topN, TopSuggestDocs[] shardHits) { SuggestScoreDocPriorityQueue priorityQueue = new SuggestScoreDocPriorityQueue(topN); +boolean allComplete = true; for (TopSuggestDocs shardHit : shardHits) { for (SuggestScoreDoc scoreDoc : shardHit.scoreLookupDocs()) { if (scoreDoc == priorityQueue.insertWithOverflow(scoreDoc)) { break; } } + allComplete = allComplete && shardHit.isComplete; } SuggestScoreDoc[] topNResults = priorityQueue.getResults(); if (topNResults.length > 0) { - return new TopSuggestDocs(new TotalHits(topNResults.length, TotalHits.Relation.EQUAL_TO), topNResults); + return new TopSuggestDocs(new TotalHits(topNResults.length, TotalHits.Relation.EQUAL_TO), topNResults, + allComplete); } else { return TopSuggestDocs.EMPTY; } } + /** + * returns true if we exhausted all possibilities to collect results Review comment: Sounds good. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r332905842 ## File path: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java ## @@ -201,4 +205,27 @@ public void collect(int doc) throws IOException { public ScoreMode scoreMode() { return ScoreMode.COMPLETE; } + + /** + * returns true if the collector clearly exhausted all possibilities to collect results + */ + boolean isComplete() { Review comment: Not necessary, but since we need the underlying flag and its currently private, I thought its okay to have at least a package private getter. Or does it bloat the class too much? Happy to remove either way, wdyt? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection URL: https://github.com/apache/lucene-solr/pull/913#discussion_r330672796 ## File path: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java ## @@ -283,17 +299,25 @@ public int compare(Pair o1, Pair o2) { * * If a filter is applied, the queue size is increased by * half the number of live documents. + * + * If the collector can reject documents upon collecting, the queue size is + * increased by half the number of live documents again. + * * * The maximum queue size is {@link #MAX_TOP_N_QUEUE_SIZE} */ - private int getMaxTopNSearcherQueueSize(int topN, int numDocs, double liveDocsRatio, boolean filterEnabled) { + private int getMaxTopNSearcherQueueSize(int topN, int numDocs, double liveDocsRatio, boolean filterEnabled, Review comment: I considered this as well first, but then moved to the two independent flags. Will revert that part. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org