arouel opened a new issue, #3466:
URL: https://github.com/apache/parquet-java/issues/3466

   ### Describe the enhancement requested
   
   `RunLengthBitPackingHybridDecoder.readNext()` allocates a new `int[]` and 
`byte[]` on every PACKED-mode call. In workloads that decode many bit-packed 
runs (definition levels, repetition levels, RLE-encoded integers), these 
allocations dominate the read-side allocation profile. The upstream code even 
acknowledges this with a `// TODO: reuse a buffer` comment.
   
   ### Problem 1: per-call buffer allocation
   
   [Lines 94–95](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java#L94-L95) allocate fresh arrays on every PACKED-mode `readNext()`:
   
   ```java
   currentBuffer = new int[currentCount]; // TODO: reuse a buffer
   byte[] bytes = new byte[numGroups * bitWidth];
   ```
   
   `currentCount` is always `numGroups * 8`, and `numGroups` is typically small 
(1–16 groups = 8–128 values per run). These allocations are individually modest 
but occur thousands of times per column chunk — once per bit-packed run. In a 
180M-row merge with multiple integer/boolean columns, the cumulative allocation 
is substantial.
   
   Since `currentCount` varies between runs (different `numGroups` values), the 
fix keeps the existing field-level `int[]`, adds a field-level `byte[]`, and 
grows either one only when the next run needs more capacity than it has.
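
A minimal sketch of the grow-only buffer idea described above. The class `PackedRunBuffers`, its fields, and the `clearTail` helper are hypothetical names used for illustration; they are not part of parquet-java, and the actual fix may be structured differently.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of parquet-java.
final class PackedRunBuffers {
  int[] values = new int[0];   // decoded values, reused across runs
  byte[] packed = new byte[0]; // raw bit-packed bytes, reused across runs
  int allocations = 0;         // counts array allocations, for demonstration only

  // Grows each buffer only when the next run needs more room than it has.
  void ensureCapacity(int numGroups, int bitWidth) {
    int valueCount = numGroups * 8;       // 8 values per bit-packed group
    int byteCount = numGroups * bitWidth; // bitWidth bytes per group
    if (values.length < valueCount) {
      values = new int[valueCount];
      allocations++;
    }
    if (packed.length < byteCount) {
      packed = new byte[byteCount];
      allocations++;
    }
  }

  // Zeroes packed[from, to) so bytes left over from an earlier run are not
  // decoded if fewer than `to` bytes were actually read for this run.
  void clearTail(int from, int to) {
    Arrays.fill(packed, from, to, (byte) 0);
  }
}
```

One consequence of this design is that a reused array can be longer than the current run. Any code that derives a read position from the array length, or that relies on a freshly allocated array being zero-filled, would presumably need to use the run's own value count and clear stale bytes instead.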
   
   ### Problem 2: per-call DataInputStream wrapping
   
   [Line 98](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java#L98) creates a `new DataInputStream(in)` on every PACKED-mode call:
   
   ```java
   new DataInputStream(in).readFully(bytes, 0, bytesToRead);
   ```
   
   This allocates a `DataInputStream` wrapper object per call just to access 
`readFully()`. A private `readFully()` method on the decoder itself eliminates 
this allocation and the virtual dispatch through the wrapper.
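
A sketch of what such a helper could look like, assuming the decoder holds a plain `java.io.InputStream`. The class name `StreamReads` and the static signature are hypothetical; in the decoder this would be a private method working on its own stream field.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for a private helper on the decoder.
final class StreamReads {
  // Reads exactly len bytes into buf starting at off, looping because
  // InputStream.read may return fewer bytes than requested.
  static void readFully(InputStream in, byte[] buf, int off, int len) throws IOException {
    int done = 0;
    while (done < len) {
      int n = in.read(buf, off + done, len - done);
      if (n < 0) {
        throw new EOFException("stream ended after " + done + " of " + len + " bytes");
      }
      done += n;
    }
  }
}
```

This follows the same contract as `DataInputStream.readFully` (fill the requested range or throw `EOFException`) without creating a wrapper object per call.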
   
   
   ### Component(s)
   
   Core


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

