jimczi commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335431950
 
 

 ##########
 File path: 
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java
 ##########
 @@ -253,6 +258,61 @@ public void testDocFiltering() throws Exception {
     iw.close();
   }
 
 +  /**
 +   * Test that the correct number of documents is collected when using a collector that also rejects documents.
 +   */
 +  public void testCollectorThatRejects() throws Exception {
 +    // use a synonym analyzer to have multiple paths to the same suggested document; this mock adds "dog" as a synonym for "dogs"
 +    Analyzer analyzer = new MockSynonymAnalyzer();
 +    RandomIndexWriter iw = new RandomIndexWriter(random(), dir, iwcWithSuggestField(analyzer, "suggest_field"));
+    List<Entry> expectedResults = new ArrayList<Entry>();
+
+    for (int docCount = 10; docCount > 0; docCount--) {
+      Document document = new Document();
+      String value = "ab" + docCount + " dogs";
+      document.add(new SuggestField("suggest_field", value, docCount));
+      expectedResults.add(new Entry(value, docCount));
+      iw.addDocument(document);
+    }
+
+    if (rarely()) {
+      iw.commit();
+    }
+
+    DirectoryReader reader = iw.getReader();
+    SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader);
+
 +    PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new Term("suggest_field", "ab"));
+    int topN = 5;
+
+    // use a TopSuggestDocsCollector that rejects results with duplicate docIds
 +    TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, false) {
+
+      private Set<Integer> seenDocIds = new HashSet<>();
+
 +      @Override
 +      public boolean collect(int docID, CharSequence key, CharSequence context, float score) throws IOException {
 +        int globalDocId = docID + docBase;
 +        boolean collected = false;
 +        if (seenDocIds.contains(globalDocId) == false) {
 +          super.collect(docID, key, context, score);
 +          seenDocIds.add(globalDocId);
 +          collected = true;
 +        }
 +        return collected;
 +      }
+    };
+
+    indexSearcher.suggest(query, collector);
 +    assertSuggestions(collector.get(), expectedResults.subList(0, topN).toArray(new Entry[0]));
+
+    // TODO expecting true here, why false?
 
 Review comment:
   I'll open an issue. I also wonder whether we should rely on the fact that the 
top suggest collector will also early terminate, so whenever we expect rejection 
(because of deleted docs, or because we deduplicate suggestions per doc) we could 
set the queue size to its maximum value (5000). Currently we have different 
heuristics that try to pick a sensible value automatically, but there is no 
guarantee of admissibility. For instance, if we want to deduplicate by document 
id we should ensure that the queue size is greater than 
`topN*maxAnalyzedValuesPerDoc`, and we'd need to compute this value at index 
time.
   I may be completely off, but it would be interesting to see the effect of 
setting the queue size to its maximum value on all searches. This way the 
admissibility is easier to reason about and we don't need to correlate it with 
the choice made by the heuristic.
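
   To make the sizing argument concrete, here is a minimal sketch (hypothetical helper, not Lucene API) of the arithmetic described above: for deduplication by doc id to be admissible, the queue must hold at least `topN * maxAnalyzedValuesPerDoc` entries, while the suggester queue is capped at the maximum value of 5000 mentioned above. If the cap kicks in, admissibility is no longer guaranteed.

```java
class QueueSizing {
  // Maximum suggester queue size mentioned in the discussion above.
  static final int MAX_QUEUE_SIZE = 5000;

  /**
   * Hypothetical helper: the queue size needed so that dedup-by-docId never
   * rejects a candidate that could still make the top-N, capped at the max.
   * maxAnalyzedValuesPerDoc would have to be computed at index time.
   */
  static int admissibleQueueSize(int topN, int maxAnalyzedValuesPerDoc) {
    long required = (long) topN * maxAnalyzedValuesPerDoc; // avoid int overflow
    return (int) Math.min(required, MAX_QUEUE_SIZE);
  }
}
```

   For the test above (topN = 5, two analyzed paths per doc via the synonym), this would require a queue of at least 10 entries, well under the cap; large topN values quickly hit the 5000 ceiling, which is why always using the maximum is easier to reason about.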

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org