jpountz commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254284219
########## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ########## @@ -194,40 +169,28 @@ protected BKDWriter(int maxDoc, Directory tempDir, String tempFileNamePrefix, in minPackedValue = new byte[packedIndexBytesLength]; maxPackedValue = new byte[packedIndexBytesLength]; - // If we may have more than 1+Integer.MAX_VALUE values, then we must encode ords with long (8 bytes), else we can use int (4 bytes). - this.longOrds = longOrds; - - this.singleValuePerDoc = singleValuePerDoc; + // dimensional values (numDims * bytesPerDim) + docID (int) + bytesPerDoc = packedBytesLength + Integer.BYTES; - // dimensional values (numDims * bytesPerDim) + ord (int or long) + docID (int) - if (singleValuePerDoc) { - // Lucene only supports up to 2.1 docs, so we better not need longOrds in this case: - assert longOrds == false; - bytesPerDoc = packedBytesLength + Integer.BYTES; - } else if (longOrds) { - bytesPerDoc = packedBytesLength + Long.BYTES + Integer.BYTES; - } else { - bytesPerDoc = packedBytesLength + Integer.BYTES + Integer.BYTES; - } // As we recurse, we compute temporary partitions of the data, halving the // number of points at each recursion. Once there are few enough points, // we can switch to sorting in heap instead of offline (on disk). At any // time in the recursion, we hold the number of points at that level, plus // all recursive halves (i.e. 16 + 8 + 4 + 2) so the memory usage is 2X // what that level would consume, so we multiply by 0.5 to convert from - // bytes to points here. Each dimension has its own sorted partition, so - // we must divide by numDims as wel. + // bytes to points here. In addition the radix partitioning may sort on memory + // double of this size so we multiply by another 0.5. - maxPointsSortInHeap = (int) (0.5 * (maxMBSortInHeap * 1024 * 1024) / (bytesPerDoc * numDataDims)); + maxPointsSortInHeap = (int) (0.25 * (maxMBSortInHeap * 1024 * 1024) / (bytesPerDoc)); Review comment: I understand the first 0.5 factor due to how recursion levels add up, but not the second one. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org