arouel opened a new pull request, #3467:
URL: https://github.com/apache/parquet-java/pull/3467

   ### Rationale for this change
   
   `RunLengthBitPackingHybridDecoder.readNext()` allocates a new `int[]` and 
`byte[]` on every PACKED-mode call. The existing code even acknowledges this 
with a `// TODO: reuse a buffer` comment at line 94. In workloads that decode 
many bit-packed runs (definition levels, repetition levels, RLE-encoded 
integers), these allocations become a significant source of GC pressure.
   
   There are two issues:
   
   1. **Per-call buffer allocation.** `currentBuffer = new int[currentCount]` 
and `byte[] bytes = new byte[numGroups * bitWidth]` are allocated fresh on 
every PACKED-mode `readNext()`. The individual allocations are modest (8–128 
ints per run) but occur thousands of times per column chunk. In a custom 
JFR-profiled benchmark merging 60 Parquet files (180M rows, 10 columns), these 
two sites were the **number 2 and number 6 allocation hotspots** respectively: 
2,402 out of 12,711 total allocation samples (18.9%), accounting for ~7.2 GB of 
allocation per operation.
   
   2. **Per-call `DataInputStream` wrapping.** `new 
DataInputStream(in).readFully(bytes, 0, bytesToRead)` creates a wrapper object 
on every PACKED-mode call just to access `readFully()`.
   
   ### What changes are included in this PR?
   
   Three changes to `RunLengthBitPackingHybridDecoder`:
   
   - **`currentBuffer` promoted from local to field**, reused across 
`readNext()` calls with a grow-only strategy, only reallocated when the next 
run requires a larger buffer than currently held.
   - **`byte[] bytes` promoted to a field `packedBytes`**, same grow-only reuse 
strategy.
   - **`new DataInputStream(in).readFully(...)` replaced with a private 
`readFully()` method** that reads directly from the underlying `InputStream`, 
eliminating the per-call wrapper allocation and virtual dispatch.
   
   **Custom JFR-profiled benchmark results** (merge of 60 Parquet files, 180M 
rows, 10 columns, JDK 26, macOS aarch64):
   
   | Metric | Before | After | Change |
   |---|---|---|---|
   | Throughput | 44.2 s/op | 42.5 s/op | **-3.6%** |
   | Allocation rate | 908.9 MB/s | 793.9 MB/s | **-12.7%** |
   | Allocation per op | 42.1 GB | 35.4 GB | **-15.8%** |
   | RLE decoder alloc samples | 2,402 (18.9%) | 146 (1.2%) | **-93.9%** |
   | RLE decoder alloc bytes | ~7,245 MB | ~417 MB | **-94.2%** |
   
   ### Are these changes tested?
   
   Yes. Existing tests for `RunLengthBitPackingHybridDecoder` (e.g. 
`TestRunLengthBitPackingHybridEncoder`) cover round-trip correctness for both 
RLE and PACKED modes across various bit widths. No new tests are needed since 
the change is purely a performance optimization. The decoded values are 
identical, only the buffer lifecycle changes from allocate-per-call to 
reuse-and-grow.
   
   ### Are there any user-facing changes?
   
   No. This is a transparent performance improvement internal to the RLE 
decoder. Decoded values are identical. No API changes, no configuration 
changes, no behavioral changes.
   
   
   
   Closes #3466
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to