iverase commented on a change in pull request #556: LUCENE-8673: Use radix partitioning when merging dimensional points URL: https://github.com/apache/lucene-solr/pull/556#discussion_r254560658
########## File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java ########## @@ -194,40 +169,28 @@ protected BKDWriter(int maxDoc, Directory tempDir, String tempFileNamePrefix, in minPackedValue = new byte[packedIndexBytesLength]; maxPackedValue = new byte[packedIndexBytesLength]; - // If we may have more than 1+Integer.MAX_VALUE values, then we must encode ords with long (8 bytes), else we can use int (4 bytes). - this.longOrds = longOrds; - - this.singleValuePerDoc = singleValuePerDoc; + // dimensional values (numDims * bytesPerDim) + docID (int) + bytesPerDoc = packedBytesLength + Integer.BYTES; - // dimensional values (numDims * bytesPerDim) + ord (int or long) + docID (int) - if (singleValuePerDoc) { - // Lucene only supports up to 2.1 docs, so we better not need longOrds in this case: - assert longOrds == false; - bytesPerDoc = packedBytesLength + Integer.BYTES; - } else if (longOrds) { - bytesPerDoc = packedBytesLength + Long.BYTES + Integer.BYTES; - } else { - bytesPerDoc = packedBytesLength + Integer.BYTES + Integer.BYTES; - } // As we recurse, we compute temporary partitions of the data, halving the // number of points at each recursion. Once there are few enough points, // we can switch to sorting in heap instead of offline (on disk). At any // time in the recursion, we hold the number of points at that level, plus // all recursive halves (i.e. 16 + 8 + 4 + 2) so the memory usage is 2X // what that level would consume, so we multiply by 0.5 to convert from - // bytes to points here. Each dimension has its own sorted partition, so - // we must divide by numDims as wel. + // bytes to points here. In addition the radix partitioning may sort on memory + // double of this size so we multiply by another 0.5. - maxPointsSortInHeap = (int) (0.5 * (maxMBSortInHeap * 1024 * 1024) / (bytesPerDoc * numDataDims)); + maxPointsSortInHeap = (int) (0.25 * (maxMBSortInHeap * 1024 * 1024) / (bytesPerDoc)); Review comment: This is due to the fact that BKDRadixSelector can build a buffer up to `2 * maxPointsSortInHeap`. Thinking about it, it might not be ideal(we are not using all available heap) but I have in mind a follow up issue where this will be handled better so I will keep it like that. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org