Re: [PR] GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) [parquet-java]

via GitHub Tue, 28 Apr 2026 11:08:12 -0700


Fokko commented on code in PR #3523:
URL: https://github.com/apache/parquet-java/pull/3523#discussion_r3156243951



##########
parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java:
##########
@@ -90,17 +95,24 @@ private void readNext() throws IOException {
       case PACKED:
         int numGroups = header >>> 1;
         currentCount = numGroups * 8;
+        currentBufferLength = currentCount;
         LOG.debug("reading {} values BIT PACKED", currentCount);
-        currentBuffer = new int[currentCount]; // TODO: reuse a buffer
-        byte[] bytes = new byte[numGroups * bitWidth];
+        if (packedValuesBuffer.length < currentCount) {
+          packedValuesBuffer = new int[currentCount];
+        }
+        currentBuffer = packedValuesBuffer;
+        int bytesRequired = numGroups * bitWidth;
+        if (packedBytesBuffer.length < bytesRequired) {
+          packedBytesBuffer = new byte[bytesRequired];
+        }
         // At the end of the file RLE data though, there might not be that 
many bytes left.
         int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
         bytesToRead = Math.min(bytesToRead, in.available());
-        new DataInputStream(in).readFully(bytes, 0, bytesToRead);
+        new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);

Review Comment:
   I think it is on purpose to not close this resource, should we add a comment 
to it?



##########
parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java:
##########
@@ -90,17 +95,24 @@ private void readNext() throws IOException {
       case PACKED:
         int numGroups = header >>> 1;
         currentCount = numGroups * 8;
+        currentBufferLength = currentCount;
         LOG.debug("reading {} values BIT PACKED", currentCount);
-        currentBuffer = new int[currentCount]; // TODO: reuse a buffer
-        byte[] bytes = new byte[numGroups * bitWidth];
+        if (packedValuesBuffer.length < currentCount) {
+          packedValuesBuffer = new int[currentCount];
+        }
+        currentBuffer = packedValuesBuffer;
+        int bytesRequired = numGroups * bitWidth;
+        if (packedBytesBuffer.length < bytesRequired) {
+          packedBytesBuffer = new byte[bytesRequired];
+        }
         // At the end of the file RLE data though, there might not be that 
many bytes left.
         int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
         bytesToRead = Math.min(bytesToRead, in.available());
-        new DataInputStream(in).readFully(bytes, 0, bytesToRead);
+        new DataInputStream(in).readFully(packedBytesBuffer, 0, bytesToRead);
         for (int valueIndex = 0, byteIndex = 0;
             valueIndex < currentCount;
             valueIndex += 8, byteIndex += bitWidth) {
-          packer.unpack8Values(bytes, byteIndex, currentBuffer, valueIndex);
+          packer.unpack8Values(packedBytesBuffer, byteIndex, currentBuffer, 
valueIndex);

Review Comment:
   I noticed that `unpack8Values(final byte[] input, ...)` is deprecated, 
should we switch to the `ByteBuffer` overload while we're reworking this?
   
   ```
   diff --git 
a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
 
b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
   index da9db00c7..14ec1a87d 100644
   --- 
a/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
   +++ 
b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java
   @@ -21,6 +21,7 @@ package org.apache.parquet.column.values.rle;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;
   +import java.nio.ByteBuffer;
    import org.apache.parquet.Preconditions;
    import org.apache.parquet.bytes.BytesUtils;
    import org.apache.parquet.column.values.bitpacking.BytePacker;
   @@ -52,7 +53,7 @@ public class RunLengthBitPackingHybridDecoder {
    
      // Reusable buffers to avoid per-run allocation in PACKED mode
      private int[] packedValuesBuffer = new int[0];
   -  private byte[] packedBytesBuffer = new byte[0];
   +  private ByteBuffer packedBytesBuffer = ByteBuffer.allocate(0);
    
      public RunLengthBitPackingHybridDecoder(int bitWidth, InputStream in) {
        LOG.debug("decoding bitWidth {}", bitWidth);
   @@ -102,13 +103,13 @@ public class RunLengthBitPackingHybridDecoder {
            }
            currentBuffer = packedValuesBuffer;
            int bytesRequired = numGroups * bitWidth;
   -        if (packedBytesBuffer.length < bytesRequired) {
   -          packedBytesBuffer = new byte[bytesRequired];
   +        if (packedBytesBuffer.capacity() < bytesRequired) {
   +          packedBytesBuffer = ByteBuffer.allocate(bytesRequired);
            }
            // At the end of the file RLE data though, there might not be that 
many bytes left.
            int bytesToRead = (int) Math.ceil(currentCount * bitWidth / 8.0);
            bytesToRead = Math.min(bytesToRead, in.available());
   -        new DataInputStream(in).readFully(packedBytesBuffer, 0, 
bytesToRead);
   +        new DataInputStream(in).readFully(packedBytesBuffer.array(), 0, 
bytesToRead);
            for (int valueIndex = 0, byteIndex = 0;
                valueIndex < currentCount;
                valueIndex += 8, byteIndex += bitWidth) {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] GH-3522: Reuse intermediate buffers in RunLengthBitPackingHybridDecoder PACKED path (~22% throughput on dictionary-id decode) [parquet-java]

Reply via email to