Re: [PR] Encode dense blocks of postings as bit sets. [lucene]

via GitHub Tue, 14 Jan 2025 07:43:25 -0800


jpountz commented on code in PR #14133:
URL: https://github.com/apache/lucene/pull/14133#discussion_r1915083479



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java:
##########
@@ -572,7 +597,36 @@ public int freq() throws IOException {
     }
 
     private void refillFullBlock() throws IOException {
-      forDeltaUtil.decodeAndPrefixSum(docInUtil, prevDocID, docBuffer);
+      int bitsPerValue = docIn.readByte();
+      if (bitsPerValue > 0) {
+        forDeltaUtil.decodeAndPrefixSum(bitsPerValue, docInUtil, prevDocID, docBuffer);
+        encoding = DeltaEncoding.PACKED;
+      } else if (bitsPerValue == 0) {
+        // dense block: 128 one bits
+        docBitSet.set(0, BLOCK_SIZE);
+        docBitSetBase = prevDocID + 1;
+        docCumulativeWordPopCounts[0] = Long.SIZE;
+        docCumulativeWordPopCounts[1] = 2 * Long.SIZE;
+        encoding = DeltaEncoding.UNARY;
+      } else {
+        assert level0LastDocID != NO_MORE_DOCS;
+        // block is encoded as a bit set
+        docBitSetBase = prevDocID + 1;
+        int numLongs = -bitsPerValue;
+        docIn.readLongs(docBitSet.getBits(), 0, numLongs);
+        // Note: this for loop auto-vectorizes
+        for (int i = 0; i < numLongs - 1; ++i) {
+          docCumulativeWordPopCounts[i] = Long.bitCount(docBitSet.getBits()[i]);
+        }
+        for (int i = 1; i < numLongs - 1; ++i) {
+          docCumulativeWordPopCounts[i] += docCumulativeWordPopCounts[i - 1];
+        }
+        docCumulativeWordPopCounts[numLongs - 1] = BLOCK_SIZE;
+        assert docCumulativeWordPopCounts[numLongs - 2]

Review Comment:
   We only use the bit set encoding for "full" blocks. Tail blocks, which may 
have fewer than 128 doc IDs to record, keep using the current encoding, which 
stores deltas using group-varint; they never use a bit set.
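
To make the distinction concrete, here is a minimal Java sketch of how a reader could tell the cases apart. It follows the header-byte convention visible in the hunk above (positive: packed deltas, zero: 128 consecutive doc IDs, negative: a bit set spanning `-bitsPerValue` longs). The class, enum, and method names are hypothetical, and detecting a tail block from the number of remaining docs is an assumption made for illustration, not something taken from the PR.

```java
/** Illustrative only: names below are hypothetical and not part of Lucene. */
final class BlockEncodingSketch {
  static final int BLOCK_SIZE = 128;

  enum Kind { PACKED_DELTAS, ALL_CONSECUTIVE, BIT_SET, GROUP_VARINT_TAIL }

  /**
   * Classifies the next block. Assumption: a block is a tail block when fewer
   * than BLOCK_SIZE docs remain, in which case the header byte is not consulted.
   */
  static Kind kindOf(int remainingDocs, byte header) {
    if (remainingDocs < BLOCK_SIZE) {
      return Kind.GROUP_VARINT_TAIL; // tail blocks never use a bit set
    }
    if (header > 0) {
      return Kind.PACKED_DELTAS; // header is the number of bits per delta
    } else if (header == 0) {
      return Kind.ALL_CONSECUTIVE; // 128 doc IDs in a row
    } else {
      return Kind.BIT_SET; // -header is the number of longs in the bit set
    }
  }

  /** Number of 64-bit words to read for a bit-set block. */
  static int numLongs(byte header) {
    return -header;
  }

  public static void main(String[] args) {
    System.out.println(kindOf(128, (byte) 7));
    System.out.println(kindOf(128, (byte) 0));
    System.out.println(kindOf(128, (byte) -3) + " with " + numLongs((byte) -3) + " longs");
    System.out.println(kindOf(50, (byte) -3));
  }
}
```

In this sketch a negative header only ever selects the bit set path when a full block of 128 doc IDs is available, which mirrors the statement that tail blocks never use a bit set.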



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java:
##########
@@ -572,7 +597,36 @@ public int freq() throws IOException {
     }
 
     private void refillFullBlock() throws IOException {
-      forDeltaUtil.decodeAndPrefixSum(docInUtil, prevDocID, docBuffer);
+      int bitsPerValue = docIn.readByte();
+      if (bitsPerValue > 0) {
+        forDeltaUtil.decodeAndPrefixSum(bitsPerValue, docInUtil, prevDocID, docBuffer);
+        encoding = DeltaEncoding.PACKED;
+      } else if (bitsPerValue == 0) {
+        // dense block: 128 one bits
+        docBitSet.set(0, BLOCK_SIZE);
+        docBitSetBase = prevDocID + 1;
+        docCumulativeWordPopCounts[0] = Long.SIZE;
+        docCumulativeWordPopCounts[1] = 2 * Long.SIZE;
+        encoding = DeltaEncoding.UNARY;
+      } else {
+        assert level0LastDocID != NO_MORE_DOCS;
+        // block is encoded as a bit set
+        docBitSetBase = prevDocID + 1;
+        int numLongs = -bitsPerValue;
+        docIn.readLongs(docBitSet.getBits(), 0, numLongs);
+        // Note: this for loop auto-vectorizes
+        for (int i = 0; i < numLongs - 1; ++i) {
+          docCumulativeWordPopCounts[i] = Long.bitCount(docBitSet.getBits()[i]);
+        }
+        for (int i = 1; i < numLongs - 1; ++i) {

Review Comment:
   Indeed. :) I added a comment to make it clearer.
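
For readers following the two loops in the hunk above, here is a small standalone Java sketch of the same idea: count the set bits of each 64-bit word, then turn those counts into a running total. The class and method names are hypothetical. Unlike the PR, this version computes every entry with the same loops rather than assigning the last one directly, and the `rank` helper is an assumption about how such counts could be used; the quoted hunk does not show how the reader consumes them.

```java
/** Illustrative only: a plain-Java sketch, not the Lucene implementation. */
final class CumulativePopCountSketch {

  /** Returns c where c[i] is the number of set bits in words[0..i], inclusive. */
  static int[] cumulativePopCounts(long[] words) {
    int[] counts = new int[words.length];
    // First pass: independent per-word pop counts (a simple, branch-free loop).
    for (int i = 0; i < words.length; ++i) {
      counts[i] = Long.bitCount(words[i]);
    }
    // Second pass: prefix sum, each entry depends on the previous one.
    for (int i = 1; i < words.length; ++i) {
      counts[i] += counts[i - 1];
    }
    return counts;
  }

  /** Number of set bits strictly below position {@code bit} (hypothetical helper). */
  static int rank(long[] words, int[] cumulative, int bit) {
    int word = bit >>> 6; // index of the 64-bit word holding this bit
    int before = word == 0 ? 0 : cumulative[word - 1];
    long lowerBitsMask = (1L << (bit & 63)) - 1; // bits below `bit` within its word
    return before + Long.bitCount(words[word] & lowerBitsMask);
  }

  public static void main(String[] args) {
    long[] words = {0b1011L, 0b1L};
    int[] cumulative = cumulativePopCounts(words);
    System.out.println(java.util.Arrays.toString(cumulative));
    System.out.println(rank(words, cumulative, 65));
  }
}
```

Splitting the work into two passes keeps the first loop free of cross-iteration dependencies, which is the property the "auto-vectorizes" note in the hunk relies on.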



##########
lucene/core/src/java/org/apache/lucene/codecs/lucene101/Lucene101PostingsReader.java:
##########
@@ -572,7 +597,36 @@ public int freq() throws IOException {
     }
 
     private void refillFullBlock() throws IOException {
-      forDeltaUtil.decodeAndPrefixSum(docInUtil, prevDocID, docBuffer);
+      int bitsPerValue = docIn.readByte();
+      if (bitsPerValue > 0) {
+        forDeltaUtil.decodeAndPrefixSum(bitsPerValue, docInUtil, prevDocID, docBuffer);
+        encoding = DeltaEncoding.PACKED;
+      } else if (bitsPerValue == 0) {
+        // dense block: 128 one bits

Review Comment:
   I'm not sure what is confusing: `docBitSet.set(0, BLOCK_SIZE)` sets 
BLOCK_SIZE bits to `true`. I refactored a bit; hopefully it is clearer.
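
As a quick illustration of the dense case, here is a sketch that uses `java.util.BitSet` as a stand-in for the bit set in the PR; like the call quoted above, its `set(fromIndex, toIndex)` sets the half-open range. The class name and the `docIdAt` helper are hypothetical. With all 128 bits set, both words are all ones, so the cumulative pop counts are `Long.SIZE` and `2 * Long.SIZE`, and the doc ID at a given index is simply the base plus that index.

```java
import java.util.BitSet;

/** Illustrative only: java.util.BitSet stands in for the PR's bit set. */
final class DenseBlockSketch {
  static final int BLOCK_SIZE = 128;

  /** Builds the bit set of a dense block: bits 0 (inclusive) to 128 (exclusive) set. */
  static BitSet denseBlock() {
    BitSet bits = new BitSet(BLOCK_SIZE);
    bits.set(0, BLOCK_SIZE);
    return bits;
  }

  /** Doc ID at position {@code index} of a dense block whose base is prevDocID + 1. */
  static int docIdAt(int prevDocID, int index) {
    return prevDocID + 1 + index;
  }

  public static void main(String[] args) {
    BitSet bits = denseBlock();
    long[] words = bits.toLongArray();
    System.out.println("set bits: " + bits.cardinality());
    System.out.println("first word pop count: " + Long.bitCount(words[0]));
    System.out.println("last doc ID after prevDocID=41: " + docIdAt(41, BLOCK_SIZE - 1));
  }
}
```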


