[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

GitBox Wed, 06 Feb 2019 22:34:21 -0800

iverase commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254560658


 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
 ##########
 @@ -194,40 +169,28 @@ protected BKDWriter(int maxDoc, Directory tempDir, 
String tempFileNamePrefix, in
     minPackedValue = new byte[packedIndexBytesLength];
     maxPackedValue = new byte[packedIndexBytesLength];
 
-    // If we may have more than 1+Integer.MAX_VALUE values, then we must 
encode ords with long (8 bytes), else we can use int (4 bytes).
-    this.longOrds = longOrds;
-
-    this.singleValuePerDoc = singleValuePerDoc;
+    // dimensional values (numDims * bytesPerDim) + docID (int)
+    bytesPerDoc = packedBytesLength + Integer.BYTES;
 
-    // dimensional values (numDims * bytesPerDim) + ord (int or long) + docID 
(int)
-    if (singleValuePerDoc) {
-      // Lucene only supports up to 2.1 docs, so we better not need longOrds 
in this case:
-      assert longOrds == false;
-      bytesPerDoc = packedBytesLength + Integer.BYTES;
-    } else if (longOrds) {
-      bytesPerDoc = packedBytesLength + Long.BYTES + Integer.BYTES;
-    } else {
-      bytesPerDoc = packedBytesLength + Integer.BYTES + Integer.BYTES;
-    }
 
     // As we recurse, we compute temporary partitions of the data, halving the
     // number of points at each recursion.  Once there are few enough points,
     // we can switch to sorting in heap instead of offline (on disk).  At any
     // time in the recursion, we hold the number of points at that level, plus
     // all recursive halves (i.e. 16 + 8 + 4 + 2) so the memory usage is 2X
     // what that level would consume, so we multiply by 0.5 to convert from
-    // bytes to points here.  Each dimension has its own sorted partition, so
-    // we must divide by numDims as wel.
+    // bytes to points here.  In addition the radix partitioning may sort on 
memory
+    // double of this size so we multiply by another 0.5.
 
-    maxPointsSortInHeap = (int) (0.5 * (maxMBSortInHeap * 1024 * 1024) / 
(bytesPerDoc * numDataDims));
+    maxPointsSortInHeap = (int) (0.25 * (maxMBSortInHeap * 1024 * 1024) / 
(bytesPerDoc));
 
 Review comment:
   This is due to the fact that BKDRadixSelector can build a buffer up to `2 * 
maxPointsSortInHeap`. Thinking about it, it might not be ideal(we are not using 
all available heap) but I have in mind a follow up issue where this will be 
handled better so I will keep it like that.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[GitHub] iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

Reply via email to