Re: [PR] SOLR-17447 : Support to early terminate a search based on maxHits per collector. [solr]

via GitHub Thu, 30 Jan 2025 10:30:27 -0800


gus-asf commented on code in PR #2960:
URL: https://github.com/apache/solr/pull/2960#discussion_r1935871739



##########
solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java:
##########
@@ -29,11 +30,15 @@
  */
 public class EarlyTerminatingCollector extends FilterCollector {
 
+  private final int chunkSize = 100; // Check across threads only at a chunk 
size

Review Comment:
   How was this chosen? Should it be tuneable via env/sysprop?



##########
solr/solr-ref-guide/modules/query-guide/pages/common-query-parameters.adoc:
##########
@@ -399,6 +399,23 @@ For example, setting `cpuAllowed=500` gives a limit of at 
most 500 ms of CPU tim
 
 All other considerations regarding partial results listed for the 
`timeAllowed` parameter apply here, too.
 
+
+== maxHits Parameter
+
+[%autowidth,frame=none]
+|===
+|Optional |Default: `false`
+|===
+
+This parameter specifies the max number of hits a searcher will iterate 
through before early terminating the search.
+The count is per shard and across all threads involved in case of 
multi-threaded search. This parameter works
+in conjunction with other parameters that could early terminate a search, ex: 
_timeAllowed_ etc. In case the search
+was early terminated due to it exceeding maxHits a _terminatedEarly_ header in 
the response will be set along with
+_partialResults_ to indicate the same. Note that the _partialResults_ flag 
could be set in the absence of the _maxHits_
+parameter due to other limits like _timeAllowed_ or _cpuAllowed_.
+Note : the hits counted will need not be exactly equal to the maxHits 
provided, however it will be within range of this value.
+
+

Review Comment:
   I have a similar concern as Houston regarding helping the user understand 
when to use this, here's my attempt to rewrite your text based on what I've 
gathered... of course if I misunderstood something or I'm too wordy we should 
fix that ;) As a side note I think we're supposed to write these as "ventilated 
prose" which is one sentence per line, no line breaks in a sentence.
   ```
   == maxHits Parameter
   
   [%autowidth,frame=none]
   |===
   |Optional |Default: `false`
   |===
   
   This parameter specifies the max number of hits a searcher will iterate 
through.
   Searchers will arbitrarily ignore any number of additional hits beyond this 
value.
   Practically speaking, the count is per shard and across all threads involved 
in case of multi-threaded search.
   The intention of this feature is to favor speed over perfect relevancy & 
recall.
   The trade-off is that if one shard contains many relevant hits and another 
contains a few less relevant hits the less relevant hits from the second shard 
may get returned instead of the more relevant hits that were clustered in the 
first shard.
   
   This parameter works in conjunction with other parameters that could early 
terminate a search, ex: _timeAllowed_ etc.
   In case the search was early terminated due to it exceeding maxHits a 
_terminatedEarly_ header in the response will be set along with 
_partialResults_ to indicate the same.
   The _partialResults_ flag could be set in the absence of the _maxHits_ 
parameter due to other limits like _timeAllowed_ or _cpuAllowed_.
   ```
   I have to admit I don't understand this bit:
   
   ```
   NOTE: the hits counted will need not be exactly equal to the maxHits 
provided, however it will be within range of this value.
   ```



##########
solr/core/src/java/org/apache/solr/search/QueryCommand.java:
##########
@@ -194,7 +195,8 @@ public QueryCommand setNeedDocSet(boolean needDocSet) {
   }
 
   public boolean getTerminateEarly() {
-    return (flags & SolrIndexSearcher.TERMINATE_EARLY) != 0;
+    return (flags & SolrIndexSearcher.TERMINATE_EARLY) != 0

Review Comment:
   As I noted in another comment I think this and the Lucene feature are 
significantly different. I don't like conflating them. 



##########
solr/core/src/java/org/apache/solr/search/MultiThreadedSearcher.java:
##########
@@ -190,10 +194,16 @@ public ScoreMode scoreMode() {
   static class SearchResult {
     final ScoreMode scoreMode;
     private final Object[] result;
+    final boolean isTerminatedEarly;
 
     public SearchResult(ScoreMode scoreMode, Object[] result) {
+      this(scoreMode, result, false);

Review Comment:
   My IDE claims this unused. Since this is a package private inner class we 
probably don't need to maintain this constructor for the API.



##########
solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java:
##########
@@ -61,11 +69,25 @@ public LeafCollector getLeafCollector(LeafReaderContext 
context) throws IOExcept
       public void collect(int doc) throws IOException {
         super.collect(doc);
         numCollected++;
-        if (maxDocsToCollect <= numCollected) {
+        terminatedEarly = maxDocsToCollect <= numCollected;
+        if (numCollected % chunkSize == 0) {
+          pendingDocsToCollect.add(chunkSize);
+          final long overallCollectedDocCount = 
pendingDocsToCollect.intValue();
+          terminatedEarly = overallCollectedDocCount >= maxDocsToCollect;

Review Comment:
   I think the code would be more legible if you just threw in two cases rather 
than tracking it with a boolean that is overwritten. This initially looked like 
a logic error. I expected to an `|=` until I stopped and thought about it for a 
while to convince myself that the second case can't be false after the first is 
true. Seems better to me if future maintainers don't have to sort that out for 
themselves.



##########
solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java:
##########
@@ -43,11 +48,14 @@ public class EarlyTerminatingCollector extends 
FilterCollector {
    * @param maxDocsToCollect - the maximum number of documents to Collect
    */
   public EarlyTerminatingCollector(Collector delegate, int maxDocsToCollect) {
-    super(delegate);
-    assert 0 < maxDocsToCollect;

Review Comment:
   Why are the asserts removed?



##########
solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java:
##########
@@ -61,11 +69,25 @@ public LeafCollector getLeafCollector(LeafReaderContext 
context) throws IOExcept
       public void collect(int doc) throws IOException {
         super.collect(doc);
         numCollected++;
-        if (maxDocsToCollect <= numCollected) {
+        terminatedEarly = maxDocsToCollect <= numCollected;
+        if (numCollected % chunkSize == 0) {
+          pendingDocsToCollect.add(chunkSize);
+          final long overallCollectedDocCount = 
pendingDocsToCollect.intValue();
+          terminatedEarly = overallCollectedDocCount >= maxDocsToCollect;
+        }
+        if (terminatedEarly) {
           throw new EarlyTerminatingCollectorException(
-              numCollected, prevReaderCumulativeSize + (doc + 1));
+              maxDocsToCollect, prevReaderCumulativeSize + (doc + 1));
         }
       }
     };
   }
+
+  public boolean isTerminatedEarly() {

Review Comment:
   My IDE says this method is unused... do we use it somewhere I've missed?



##########
solr/core/src/java/org/apache/solr/search/EarlyTerminatingCollector.java:
##########
@@ -29,11 +30,15 @@
  */
 public class EarlyTerminatingCollector extends FilterCollector {
 
+  private final int chunkSize = 100; // Check across threads only at a chunk 
size
+
   private final int maxDocsToCollect;
 
   private int numCollected = 0;
   private int prevReaderCumulativeSize = 0;
   private int currentReaderSize = 0;
+  private final LongAdder pendingDocsToCollect;

Review Comment:
   LongAdder is a neat class I wasn't aware of, but my read of the javadocs for 
LongAdder is that it's for rarely read values like statistics/metrics (frequent 
update, infrequent read)... the way you are using it you read from it 
immediately after every add. Maybe just AtomicLong? Happy to hear arguments to 
the contrary, but everyone knows what an AtomicLong is so slightly simpler to 
have that (and perhaps slightly less memory usage according to the javadocs).



##########
solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java:
##########
@@ -295,7 +295,7 @@ private Collector buildAndRunCollectorChain(
 
     final boolean terminateEarly = cmd.getTerminateEarly();
     if (terminateEarly) {
-      collector = new EarlyTerminatingCollector(collector, cmd.getLen());
+      collector = new EarlyTerminatingCollector(collector, 
cmd.getMaxHitsTerminateEarly());

Review Comment:
   How is it we are replacing getLen() here... how does this not break 
SOLR-3240 
(https://github.com/apache/solr/commit/7141cb35a761eea1f0e6f3ad6abf84a4c972fedc#diff-99978700f1c69d6a5f6c2190f89c98cfe20c441161db5a183ec002e15cb1be28)



##########
solr/core/src/java/org/apache/solr/search/SolrIndexSearcher.java:
##########
@@ -329,12 +329,12 @@ private Collector buildAndRunCollectorChain(
       if (collector instanceof DelegatingCollector) {
         ((DelegatingCollector) collector).complete();
       }
-      throw etce;
+      qr.setPartialResults(true);

Review Comment:
   Does not throwing here effect SOLR-3240, (the original reason for ETCE) ?



##########
solr/core/src/java/org/apache/solr/search/QueryCommand.java:
##########
@@ -194,7 +195,8 @@ public QueryCommand setNeedDocSet(boolean needDocSet) {
   }
 
   public boolean getTerminateEarly() {
-    return (flags & SolrIndexSearcher.TERMINATE_EARLY) != 0;
+    return (flags & SolrIndexSearcher.TERMINATE_EARLY) != 0

Review Comment:
   And after looking deeper, there's something funny going on in flags. There 
are two flags: 
   ```
     static final int SEGMENT_TERMINATE_EARLY = 0x08;
     public static final int TERMINATE_EARLY = 0x04;
   ```
   yet I see this code (ide says it's unused) in QueryCommand: 
   ```
     public QueryCommand setTerminateEarly(boolean segmentTerminateEarly) {
       if (segmentTerminateEarly) {
         return setFlags(SolrIndexSearcher.TERMINATE_EARLY);
       } else {
         return clearFlags(SolrIndexSearcher.TERMINATE_EARLY);
       }
     }
   
   ```
   as well as:
   ```
     public QueryCommand setSegmentTerminateEarly(boolean 
segmentSegmentTerminateEarly) {
       if (segmentSegmentTerminateEarly) {
         return setFlags(SolrIndexSearcher.SEGMENT_TERMINATE_EARLY);
       } else {
         return clearFlags(SolrIndexSearcher.SEGMENT_TERMINATE_EARLY);
       }
     }
   ```
   So there's something we should maybe understand and maybe clean up here...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] SOLR-17447 : Support to early terminate a search based on maxHits per collector. [solr]

Reply via email to