[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-16 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335541921
 
 

 ##
 File path: 
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java
 ##
 @@ -253,6 +263,123 @@ public void testDocFiltering() throws Exception {
 iw.close();
   }
 
+  /**
+   * Test that the correct amount of documents are collected if using a 
collector that also rejects documents.
+   */
+  public void testCollectorThatRejects() throws Exception {
+// use synonym analyzer to have multiple paths to same suggested document. 
This mock adds "dog" as synonym for "dogs"
+Analyzer analyzer = new MockSynonymAnalyzer();
+RandomIndexWriter iw = new RandomIndexWriter(random(), dir, 
iwcWithSuggestField(analyzer, "suggest_field"));
+List expectedResults = new ArrayList();
+
+for (int docCount = 10; docCount > 0; docCount--) {
+  Document document = new Document();
+  String value = "ab" + docCount + " dogs";
+  document.add(new SuggestField("suggest_field", value, docCount));
+  expectedResults.add(new Entry(value, docCount));
+  iw.addDocument(document);
+}
+
+if (rarely()) {
+  iw.commit();
+}
+
+DirectoryReader reader = iw.getReader();
+SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader);
+
+PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new 
Term("suggest_field", "ab"));
+int topN = 5;
+
+// use a TopSuggestDocsCollector that rejects results with duplicate docIds
+TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, 
false) {
+
+  private Set seenDocIds = new HashSet<>();
+
+  @Override
+  public boolean collect(int docID, CharSequence key, CharSequence 
context, float score) throws IOException {
+  int globalDocId = docID + docBase;
+  boolean collected = false;
+  if (seenDocIds.contains(globalDocId) == false) {
+  super.collect(docID, key, context, score);
+  seenDocIds.add(globalDocId);
+  collected = true;
+  }
+  return collected;
+  }
+
+  @Override
+  protected boolean canReject() {
+return true;
+  }
+};
+
+indexSearcher.suggest(query, collector);
+TopSuggestDocs suggestions = collector.get();
+assertSuggestions(suggestions, expectedResults.subList(0, 
topN).toArray(new Entry[0]));
+assertTrue(suggestions.isComplete());
 
 Review comment:
   I extended the existing test to the case where try getting the top 10. In 
this case the queue would have a max depth of 15, the reject count it 9.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-16 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335502868
 
 

 ##
 File path: 
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java
 ##
 @@ -253,6 +263,123 @@ public void testDocFiltering() throws Exception {
 iw.close();
   }
 
+  /**
+   * Test that the correct amount of documents are collected if using a 
collector that also rejects documents.
+   */
+  public void testCollectorThatRejects() throws Exception {
+// use synonym analyzer to have multiple paths to same suggested document. 
This mock adds "dog" as synonym for "dogs"
+Analyzer analyzer = new MockSynonymAnalyzer();
+RandomIndexWriter iw = new RandomIndexWriter(random(), dir, 
iwcWithSuggestField(analyzer, "suggest_field"));
+List expectedResults = new ArrayList();
+
+for (int docCount = 10; docCount > 0; docCount--) {
+  Document document = new Document();
+  String value = "ab" + docCount + " dogs";
+  document.add(new SuggestField("suggest_field", value, docCount));
+  expectedResults.add(new Entry(value, docCount));
+  iw.addDocument(document);
+}
+
+if (rarely()) {
+  iw.commit();
+}
+
+DirectoryReader reader = iw.getReader();
+SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader);
+
+PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new 
Term("suggest_field", "ab"));
+int topN = 5;
+
+// use a TopSuggestDocsCollector that rejects results with duplicate docIds
+TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, 
false) {
+
+  private Set seenDocIds = new HashSet<>();
+
+  @Override
+  public boolean collect(int docID, CharSequence key, CharSequence 
context, float score) throws IOException {
+  int globalDocId = docID + docBase;
+  boolean collected = false;
+  if (seenDocIds.contains(globalDocId) == false) {
+  super.collect(docID, key, context, score);
+  seenDocIds.add(globalDocId);
+  collected = true;
+  }
+  return collected;
+  }
+
+  @Override
+  protected boolean canReject() {
+return true;
+  }
+};
+
+indexSearcher.suggest(query, collector);
+TopSuggestDocs suggestions = collector.get();
+assertSuggestions(suggestions, expectedResults.subList(0, 
topN).toArray(new Entry[0]));
+assertTrue(suggestions.isComplete());
 
 Review comment:
   This will happen if the estimated queue size in `NRTSuggester#lookup` is not 
large enough to accout for all rejected documents, e.g. when in this particular 
test we try to get the top 5 of only 5 documents. In that case the queue size 
heuristic in `NRTSuggester#getMaxTopNSearcherQueueSize` will only size to queue 
to 7 (topN + numDocs/2), which is less than the number of topN + rejections, so 
the TopResults returned will have the `isComplete` flag set.
   I can add that case to the existing test if this helps.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-16 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335389135
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/Util.java
 ##
 @@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc node, T startOutput, 
boolean allowEmptyStri
   continue;
 }
 
-if (results.size() == topN-1 && maxQueueDepth == topN) {
-  // Last path -- don't bother w/ queue anymore:
-  queue = null;
 
 Review comment:
   As far as I understand this optimization assumes we surely accept (and 
collect) the path later in L516s acceptResult(), which always seems to be the 
case for collectors that don't reject, but if the collector that is eventually 
called via NRTSuggesters acceptResult() chooses to reject this option, we were 
losing expected results. This surfaced in the prefix completion tests I added. 
@jimczi might be able to explain this a bit better than me.
   
   > Have you run the suggest benchmarks to see if removing this opto hurt 
performance?
   
   No, where are they and how can I run them?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-16 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335389135
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/util/fst/Util.java
 ##
 @@ -460,11 +460,6 @@ public void addStartPaths(FST.Arc node, T startOutput, 
boolean allowEmptyStri
   continue;
 }
 
-if (results.size() == topN-1 && maxQueueDepth == topN) {
-  // Last path -- don't bother w/ queue anymore:
-  queue = null;
 
 Review comment:
   As far as I understand this optimization assumes we surely accept (and 
collect) the path later in L516s acceptResult(), which always seems to be the 
case for collectors that don't reject, but if the collector that is eventually 
called via NRTSuggesters acceptResult() chooses to reject this option, we were 
losing expected results. This surfaced in the prefix completion tests I added. 
@jimczi might be able to explain this a bit better than me.
   
   > Have you run the suggest benchmarks to see if removing this opto hurt 
performance?
   No, where are they and how can I run them?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-16 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r335372402
 
 

 ##
 File path: 
lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestPrefixCompletionQuery.java
 ##
 @@ -253,6 +263,123 @@ public void testDocFiltering() throws Exception {
 iw.close();
   }
 
+  /**
+   * Test that the correct amount of documents are collected if using a 
collector that also rejects documents.
+   */
+  public void testCollectorThatRejects() throws Exception {
+// use synonym analyzer to have multiple paths to same suggested document. 
This mock adds "dog" as synonym for "dogs"
+Analyzer analyzer = new MockSynonymAnalyzer();
+RandomIndexWriter iw = new RandomIndexWriter(random(), dir, 
iwcWithSuggestField(analyzer, "suggest_field"));
+List expectedResults = new ArrayList();
+
+for (int docCount = 10; docCount > 0; docCount--) {
+  Document document = new Document();
+  String value = "ab" + docCount + " dogs";
+  document.add(new SuggestField("suggest_field", value, docCount));
+  expectedResults.add(new Entry(value, docCount));
+  iw.addDocument(document);
+}
+
+if (rarely()) {
+  iw.commit();
+}
+
+DirectoryReader reader = iw.getReader();
+SuggestIndexSearcher indexSearcher = new SuggestIndexSearcher(reader);
+
+PrefixCompletionQuery query = new PrefixCompletionQuery(analyzer, new 
Term("suggest_field", "ab"));
+int topN = 5;
+
+// use a TopSuggestDocsCollector that rejects results with duplicate docIds
+TopSuggestDocsCollector collector = new TopSuggestDocsCollector(topN, 
false) {
+
+  private Set seenDocIds = new HashSet<>();
+
+  @Override
+  public boolean collect(int docID, CharSequence key, CharSequence 
context, float score) throws IOException {
+  int globalDocId = docID + docBase;
+  boolean collected = false;
+  if (seenDocIds.contains(globalDocId) == false) {
 
 Review comment:
   The collector is called multiple times with the same docID because of the 
MockSynonymAnalyzer used in the test setup which adds "dog" for "dogs", so each 
document has two completion paths. This collector is meant to de-duplicate 
this. I added a note explaining this. This is a simplified version of the 
behaviour we observe in https://github.com/elastic/elasticsearch/issues/46445.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-09 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r332913649
 
 

 ##
 File path: 
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java
 ##
 @@ -105,6 +105,10 @@ public TopSuggestDocs(TotalHits totalHits, 
SuggestScoreDoc[] scoreDocs, boolean
 this.isComplete = isComplete;
   }
 
+  public TopSuggestDocs(TotalHits totalHits, SuggestScoreDoc[] scoreDocs) {
 
 Review comment:
   I'm not sure about how bwc is handled in Lucene, especially if we intend to 
backport this to maybe 8.x, so I don't know if we need a ctor without the 
isComplete flag? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-09 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r332908048
 
 

 ##
 File path: 
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocs.java
 ##
 @@ -116,19 +123,28 @@ public TopSuggestDocs(TotalHits totalHits, 
SuggestScoreDoc[] scoreDocs) {
*/
   public static TopSuggestDocs merge(int topN, TopSuggestDocs[] shardHits) {
 SuggestScoreDocPriorityQueue priorityQueue = new 
SuggestScoreDocPriorityQueue(topN);
+boolean allComplete = true;
 for (TopSuggestDocs shardHit : shardHits) {
   for (SuggestScoreDoc scoreDoc : shardHit.scoreLookupDocs()) {
 if (scoreDoc == priorityQueue.insertWithOverflow(scoreDoc)) {
   break;
 }
   }
+  allComplete = allComplete && shardHit.isComplete;
 }
 SuggestScoreDoc[] topNResults = priorityQueue.getResults();
 if (topNResults.length > 0) {
-  return new TopSuggestDocs(new TotalHits(topNResults.length, 
TotalHits.Relation.EQUAL_TO), topNResults);
+  return new TopSuggestDocs(new TotalHits(topNResults.length, 
TotalHits.Relation.EQUAL_TO), topNResults,
+  allComplete);
 } else {
   return TopSuggestDocs.EMPTY;
 }
   }
 
+  /**
+   * returns true if we exhausted all possibilities to collect results
 
 Review comment:
   Sounds good.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-09 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r332905842
 
 

 ##
 File path: 
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java
 ##
 @@ -201,4 +205,27 @@ public void collect(int doc) throws IOException {
   public ScoreMode scoreMode() {
 return ScoreMode.COMPLETE;
   }
+
+  /**
+   * returns true if the collector clearly exhausted all possibilities to 
collect results
+   */
+  boolean isComplete() {
 
 Review comment:
   Not necessary, but since we need the underlying flag and its currently 
private, I thought its okay to have at least a package private getter. Or does 
it bloat the class too much? Happy to remove either way, wdyt?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cbuescher commented on a change in pull request #913: LUCENE-8995: TopSuggestDocsCollector#collect should be able to signal rejection

2019-10-02 Thread GitBox
cbuescher commented on a change in pull request #913: LUCENE-8995: 
TopSuggestDocsCollector#collect should be able to signal rejection
URL: https://github.com/apache/lucene-solr/pull/913#discussion_r330672796
 
 

 ##
 File path: 
lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggester.java
 ##
 @@ -283,17 +299,25 @@ public int compare(Pair o1, Pair o2) {
* 
* If a filter is applied, the queue size is increased by
* half the number of live documents.
+   *
+   * If the collector can reject documents upon collecting, the queue size is
+   * increased by half the number of live documents again.
+   *
* 
* The maximum queue size is {@link #MAX_TOP_N_QUEUE_SIZE}
*/
-  private int getMaxTopNSearcherQueueSize(int topN, int numDocs, double 
liveDocsRatio, boolean filterEnabled) {
+  private int getMaxTopNSearcherQueueSize(int topN, int numDocs, double 
liveDocsRatio, boolean filterEnabled,
 
 Review comment:
   I considered this as well first, but then moved to the two independent 
flags. Will revert that part.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org