arouel opened a new pull request, #3467: URL: https://github.com/apache/parquet-java/pull/3467
### Rationale for this change `RunLengthBitPackingHybridDecoder.readNext()` allocates a new `int[]` and `byte[]` on every PACKED-mode call. The existing code even acknowledges this with a `// TODO: reuse a buffer` comment at line 94. In workloads that decode many bit-packed runs (definition levels, repetition levels, RLE-encoded integers), these allocations become a significant source of GC pressure. There are two issues: 1. **Per-call buffer allocation.** `currentBuffer = new int[currentCount]` and `byte[] bytes = new byte[numGroups * bitWidth]` are allocated fresh on every PACKED-mode `readNext()`. The individual allocations are modest (8–128 ints per run) but occur thousands of times per column chunk. In a custom JFR-profiled benchmark merging 60 Parquet files (180M rows, 10 columns), these two sites were the **number 2 and number 6 allocation hotspots** respectively: 2,402 out of 12,711 total allocation samples (18.9%), accounting for ~7.2 GB of allocation per operation. 2. **Per-call `DataInputStream` wrapping.** `new DataInputStream(in).readFully(bytes, 0, bytesToRead)` creates a wrapper object on every PACKED-mode call just to access `readFully()`. ### What changes are included in this PR? Three changes to `RunLengthBitPackingHybridDecoder`: - **`currentBuffer` promoted from local to field**, reused across `readNext()` calls with a grow-only strategy, only reallocated when the next run requires a larger buffer than currently held. - **`byte[] bytes` promoted to a field `packedBytes`**, same grow-only reuse strategy. - **`new DataInputStream(in).readFully(...)` replaced with a private `readFully()` method** that reads directly from the underlying `InputStream`, eliminating the per-call wrapper allocation and virtual dispatch. **Custom JFR-profiled benchmark results** (merge of 60 Parquet files, 180M rows, 10 columns, JDK 26, macOS aarch64): | Metric | Before | After | Change | |---|---|---|---| | Throughput | 44.2 s/op | 42.5 s/op | **-3.6%** | | Allocation rate | 908.9 MB/s | 793.9 MB/s | **-12.7%** | | Allocation per op | 42.1 GB | 35.4 GB | **-15.8%** | | RLE decoder alloc samples | 2,402 (18.9%) | 146 (1.2%) | **-93.9%** | | RLE decoder alloc bytes | ~7,245 MB | ~417 MB | **-94.2%** | ### Are these changes tested? Yes. Existing tests for `RunLengthBitPackingHybridDecoder` (e.g. `TestRunLengthBitPackingHybridEncoder`) cover round-trip correctness for both RLE and PACKED modes across various bit widths. No new tests are needed since the change is purely a performance optimization. The decoded values are identical, only the buffer lifecycle changes from allocate-per-call to reuse-and-grow. ### Are there any user-facing changes? No. This is a transparent performance improvement internal to the RLE decoder. Decoded values are identical. No API changes, no configuration changes, no behavioral changes. Closes #3466 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
