Re: [PR] Consistent KNN query results with multiple leafs [lucene]

via GitHub Tue, 11 Feb 2025 13:32:21 -0800


benwtrent commented on code in PR #14191:
URL: https://github.com/apache/lucene/pull/14191#discussion_r1939664144



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -19,11 +19,7 @@
 import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
 
 import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Comparator;
-import java.util.List;
-import java.util.Objects;
+import java.util.*;

Review Comment:
   lets not use `*`



##########
lucene/core/src/java/org/apache/lucene/search/knn/TopKnnCollectorManager.java:
##########
@@ -55,8 +57,13 @@ public KnnCollector newCollector(
     if (globalScoreQueue == null) {
       return new TopKnnCollector(k, visitedLimit, searchStrategy);
     } else {
-      return new MultiLeafKnnCollector(
-          k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, 
searchStrategy));
+      if (freeze.getAndSet(false)) {
+        return new MultiLeafKnnCollector(
+            k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, 
searchStrategy), false);
+      } else {

Review Comment:
   I don't think you need to do this "freeze" thing. The only issue was sharing 
information between threads. A single executor continuing to share information 
is completely ok.



##########
lucene/core/src/java/org/apache/lucene/search/knn/TopKnnCollectorManager.java:
##########
@@ -55,8 +57,13 @@ public KnnCollector newCollector(
     if (globalScoreQueue == null) {
       return new TopKnnCollector(k, visitedLimit, searchStrategy);
     } else {
-      return new MultiLeafKnnCollector(
-          k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, 
searchStrategy));
+      if (freeze.getAndSet(false)) {
+        return new MultiLeafKnnCollector(
+            k, globalScoreQueue, new TopKnnCollector(k, visitedLimit, 
searchStrategy), false);
+      } else {

Review Comment:
   For this change, we simply make `MultiLeafKnnCollector` not thread safe at 
all. We can continue to have this `globalScoreQueue` that is unique per thread 
and shared between all the collectors, but all the code assumes its within a 
single thread. 
   
   This will likely speed things up significantly.



##########
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##########
@@ -88,11 +91,64 @@ public Query rewrite(IndexSearcher indexSearcher) throws 
IOException {
             getKnnCollectorManager(k, indexSearcher), 
indexSearcher.getTimeout());
     TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
     List<LeafReaderContext> leafReaderContexts = reader.leaves();
+
     List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
-    for (LeafReaderContext context : leafReaderContexts) {
-      tasks.add(() -> searchLeaf(context, filterWeight, knnCollectorManager));
+
+    TopDocs[] perLeafResults;
+    if (leafReaderContexts.size() > 1) {
+      if (true) {
+        /* sort LRCs by segment size */
+        List<LeafReaderContext> sortedLeafReaderContexts = 
leafReaderContexts.stream()
+                .sorted(Comparator.comparingInt(o -> 
o.reader().numDocs())).toList();
+        int noLRCs = sortedLeafReaderContexts.size();

Review Comment:
   I think this is OK for a POC, but we can easily sort actual number of 
vectors for real data.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Consistent KNN query results with multiple leafs [lucene]

Reply via email to