stefanvodita commented on code in PR #13542:
URL: https://github.com/apache/lucene/pull/13542#discussion_r1665869563
##########
lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java:
##########
@@ -328,42 +336,65 @@ protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
   /** Static method to segregate LeafReaderContexts amongst multiple slices */
   public static LeafSlice[] slices(
       List<LeafReaderContext> leaves, int maxDocsPerSlice, int maxSegmentsPerSlice) {
+
+    // TODO this is a temporary hack to force testing against multiple leaf reader context slices.
+    // It must be reverted before merging.
+    maxDocsPerSlice = 1;
+    maxSegmentsPerSlice = 1;
+    // end hack
+
     // Make a copy so we can sort:
     List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
     // Sort by maxDoc, descending:
-    Collections.sort(
-        sortedLeaves, Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
+    sortedLeaves.sort(Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));
-    final List<List<LeafReaderContext>> groupedLeaves = new ArrayList<>();
-    long docSum = 0;
-    List<LeafReaderContext> group = null;
+    final List<List<LeafReaderContextPartition>> groupedLeafPartitions = new ArrayList<>();
+    int currentSliceNumDocs = 0;
+    List<LeafReaderContextPartition> group = null;
     for (LeafReaderContext ctx : sortedLeaves) {
       if (ctx.reader().maxDoc() > maxDocsPerSlice) {
         assert group == null;
-        groupedLeaves.add(Collections.singletonList(ctx));
+        // if the segment does not fit in a single slice, we split it in multiple partitions of
Review Comment:
Thank you for moving forward on this issue @javanna!
I had a different strategy in mind for slicing the index. With the current implementation, we deduce the number of slices from a given per-slice doc count. What if the number of slices were given instead? Each segment would not be divided into n partitions; rather, a slice could straddle a few segments. I think that gives us better control over the concurrency and leaves us with fewer slices per segment for the same level of concurrency, which is better because it reduces the fixed cost of search that we pay per slice per segment (see the sketch below). Are there challenges that make that hard to implement?
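To make the idea concrete, here is a rough sketch of what I have in mind. This is purely illustrative: `slicesByCount` and the greedy assignment are my own strawman, not existing Lucene API, and it glosses over building the actual `LeafSlice` instances:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.lucene.index.LeafReaderContext;

public final class SliceByCountSketch {

  /** Distribute leaves over a fixed number of slices; a slice may straddle several segments. */
  public static List<List<LeafReaderContext>> slicesByCount(
      List<LeafReaderContext> leaves, int numSlices) {
    // Make a copy and sort by maxDoc, descending, as IndexSearcher#slices already does:
    List<LeafReaderContext> sortedLeaves = new ArrayList<>(leaves);
    sortedLeaves.sort(
        Collections.reverseOrder(Comparator.comparingInt(l -> l.reader().maxDoc())));

    List<List<LeafReaderContext>> slices = new ArrayList<>(numSlices);
    long[] docCounts = new long[numSlices];
    for (int i = 0; i < numSlices; i++) {
      slices.add(new ArrayList<>());
    }
    // Greedy bin packing: assign each segment, largest first, to the slice
    // that currently holds the fewest docs.
    for (LeafReaderContext ctx : sortedLeaves) {
      int smallest = 0;
      for (int i = 1; i < numSlices; i++) {
        if (docCounts[i] < docCounts[smallest]) {
          smallest = i;
        }
      }
      slices.get(smallest).add(ctx);
      docCounts[smallest] += ctx.reader().maxDoc();
    }
    return slices;
  }
}
```

With a greedy assignment like this, the number of slices is capped up front and each segment lands in exactly one slice; intra-segment partitions would only be needed when a single segment is so large that it dominates the rest of the index.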
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]