[GitHub] jpountz commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

GitBox Wed, 06 Feb 2019 06:22:37 -0800

jpountz commented on a change in pull request #556: LUCENE-8673: Use radix 
partitioning when merging dimensional points
URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254284219


 ##########
 File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
 ##########
 @@ -194,40 +169,28 @@ protected BKDWriter(int maxDoc, Directory tempDir, 
String tempFileNamePrefix, in
     minPackedValue = new byte[packedIndexBytesLength];
     maxPackedValue = new byte[packedIndexBytesLength];
 
-    // If we may have more than 1+Integer.MAX_VALUE values, then we must 
encode ords with long (8 bytes), else we can use int (4 bytes).
-    this.longOrds = longOrds;
-
-    this.singleValuePerDoc = singleValuePerDoc;
+    // dimensional values (numDims * bytesPerDim) + docID (int)
+    bytesPerDoc = packedBytesLength + Integer.BYTES;
 
-    // dimensional values (numDims * bytesPerDim) + ord (int or long) + docID 
(int)
-    if (singleValuePerDoc) {
-      // Lucene only supports up to 2.1 docs, so we better not need longOrds 
in this case:
-      assert longOrds == false;
-      bytesPerDoc = packedBytesLength + Integer.BYTES;
-    } else if (longOrds) {
-      bytesPerDoc = packedBytesLength + Long.BYTES + Integer.BYTES;
-    } else {
-      bytesPerDoc = packedBytesLength + Integer.BYTES + Integer.BYTES;
-    }
 
     // As we recurse, we compute temporary partitions of the data, halving the
     // number of points at each recursion.  Once there are few enough points,
     // we can switch to sorting in heap instead of offline (on disk).  At any
     // time in the recursion, we hold the number of points at that level, plus
     // all recursive halves (i.e. 16 + 8 + 4 + 2) so the memory usage is 2X
     // what that level would consume, so we multiply by 0.5 to convert from
-    // bytes to points here.  Each dimension has its own sorted partition, so
-    // we must divide by numDims as wel.
+    // bytes to points here.  In addition the radix partitioning may sort on 
memory
+    // double of this size so we multiply by another 0.5.
 
-    maxPointsSortInHeap = (int) (0.5 * (maxMBSortInHeap * 1024 * 1024) / 
(bytesPerDoc * numDataDims));
+    maxPointsSortInHeap = (int) (0.25 * (maxMBSortInHeap * 1024 * 1024) / 
(bytesPerDoc));
 
 Review comment:
   I understand the first 0.5 factor due to how recursion levels add up, but 
not the second one.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] jpountz commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points

Reply via email to