arouel opened a new issue, #3466: URL: https://github.com/apache/parquet-java/issues/3466
### Describe the enhancement requested

`RunLengthBitPackingHybridDecoder.readNext()` allocates a new `int[]` and `byte[]` on every PACKED-mode call. In workloads that decode many bit-packed runs (definition levels, repetition levels, RLE-encoded integers), these allocations dominate the read-side allocation profile. The upstream code even acknowledges this with a `// TODO: reuse a buffer` comment.

### Problem 1: per-call buffer allocation

[Lines 94–95](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java#L94-L95) allocate fresh arrays on every PACKED-mode `readNext()`:

```java
currentBuffer = new int[currentCount]; // TODO: reuse a buffer
byte[] bytes = new byte[numGroups * bitWidth];
```

`currentCount` is always `numGroups * 8`, and `numGroups` is typically small (1–16 groups = 8–128 values per run). These allocations are individually modest but occur thousands of times per column chunk, once per bit-packed run. In a 180M-row merge with multiple integer/boolean columns, the cumulative allocation is substantial.

Since `currentCount` varies between runs (different `numGroups` values), the fix keeps the existing field-level `int[]`, adds a field-level `byte[]`, and grows each only when the next run requires a larger buffer.

### Problem 2: per-call DataInputStream wrapping

[Line 98](https://github.com/apache/parquet-java/blob/4c8f4d4b/parquet-column/src/main/java/org/apache/parquet/column/values/rle/RunLengthBitPackingHybridDecoder.java#L98) creates a `new DataInputStream(in)` on every PACKED-mode call:

```java
new DataInputStream(in).readFully(bytes, 0, bytesToRead);
```

This allocates a `DataInputStream` wrapper object per call just to access `readFully()`. A private `readFully()` method on the decoder itself eliminates this allocation and the virtual dispatch through the wrapper.

### Component(s)

Core
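For illustration, the two changes described under Problem 1 and Problem 2 could be sketched roughly as below. This is a minimal standalone sketch, not the actual parquet-java patch: the class name `PackedRunBuffers` and the methods `valuesFor` and `bytesFor` are hypothetical names invented here, and the real decoder would presumably keep the buffers as private fields of `RunLengthBitPackingHybridDecoder` itself.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical helper showing grow-only buffer reuse and a wrapper-free
 * readFully. All names here are illustrative assumptions.
 */
final class PackedRunBuffers {
  // Reused across runs; replaced only when a run needs more room.
  private int[] values = new int[0];
  private byte[] bytes = new byte[0];

  /** Returns an int[] with at least {@code count} slots, reallocating only if too small. */
  int[] valuesFor(int count) {
    if (values.length < count) {
      values = new int[count];
    }
    // The array may be longer than count and may hold stale values from an
    // earlier run, so the caller must track the logical length separately.
    return values;
  }

  /** Returns a byte[] with at least {@code length} slots, reallocating only if too small. */
  byte[] bytesFor(int length) {
    if (bytes.length < length) {
      bytes = new byte[length];
    }
    // Same caveat as above: contents beyond what the caller writes are stale.
    return bytes;
  }

  /**
   * Reads exactly {@code len} bytes into {@code dst} starting at {@code off},
   * looping over InputStream.read instead of allocating a DataInputStream.
   */
  static void readFully(InputStream in, byte[] dst, int off, int len) throws IOException {
    int done = 0;
    while (done < len) {
      int n = in.read(dst, off + done, len - done);
      if (n < 0) {
        throw new EOFException("stream ended after " + done + " of " + len + " bytes");
      }
      done += n;
    }
  }
}
```

One consequence of reuse worth noting: code that previously relied on `currentBuffer.length` or on a freshly zeroed `byte[]` would need to use the run's logical count instead, since a reused buffer can be larger than the current run and can carry data left over from a previous one.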
