iemejia opened a new pull request, #3510:
URL: https://github.com/apache/parquet-java/pull/3510
### Rationale for this change
`BinaryPlainValuesReader.readBytes()` is the hot-path decoder for `BINARY`
(and `STRING`) columns using `PLAIN` encoding. The current implementation
funnels every length read through `BytesUtils.readIntLittleEndian(InputStream)`
and every value slice through `ByteBufferInputStream.slice(int)`:
```java
public Binary readBytes() {
  try {
    int length = BytesUtils.readIntLittleEndian(in);
    return Binary.fromConstantByteBuffer(in.slice(length));
  } catch (IOException | RuntimeException e) {
    throw new ParquetDecodingException("could not read bytes at offset " + in.position(), e);
  }
}
```
This incurs two costs per value:
1. `BytesUtils.readIntLittleEndian(InputStream)` calls `in.read()` four
times, each with full `IOException` plumbing and a virtual dispatch into the
`ByteBufferInputStream` subclass (`SingleBufferInputStream` or
`MultiBufferInputStream`).
2. `in.slice(length)` adds another virtual dispatch on `ByteBufferInputStream`
for every value.
If the page is materialised as a `MultiBufferInputStream` the cost is even
higher because each slice may have to walk a buffer list.
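To make the first cost concrete, here is a minimal sketch (plain JDK types, not the Parquet code itself) contrasting the two ways of reading the 4-byte little-endian length prefix; the stream path is roughly what `BytesUtils.readIntLittleEndian(InputStream)` does, minus its end-of-stream checks:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative only: compares the stream-based and buffer-based ways of
// decoding a 4-byte little-endian length prefix.
public class LengthPrefix {

  // Stream path: four virtual read() calls, each carrying IOException
  // plumbing, with the int assembled by hand from the four bytes.
  static int readIntLittleEndian(InputStream in) throws IOException {
    int b0 = in.read();
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  }

  // Buffer path: a single bounds-checked getInt() that the JIT can compile
  // to a little-endian load intrinsic; no IOException involved.
  // The buffer is assumed to have been set to LITTLE_ENDIAN once, up front.
  static int readIntFromBuffer(ByteBuffer buf) {
    return buf.getInt();
  }
}
```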
### What changes are included in this PR?
Replace the `ByteBufferInputStream` field with a single `ByteBuffer` set up
once in `initFromPage`. The length prefix is then a single
`ByteBuffer.getInt()` (one bounds check, JIT-friendly little-endian intrinsic,
no `IOException` plumbing) and each value slice is a direct
`ByteBuffer.slice()` instead of a virtual `ByteBufferInputStream.slice(int)`.
Trade-off: when the input is a `MultiBufferInputStream` the upfront
`stream.slice(available)` call may consolidate the page into a single fresh
`ByteBuffer`. This is one allocation per page in exchange for inlined per-value
reads — a clear win whenever the page contains more than a handful of values.
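The shape of the new per-value read can be sketched as follows. This is an illustrative version using plain JDK types, not the exact patch: in the real reader the page buffer is obtained from `stream.slice(available)` in `initFromPage` and the returned view is wrapped via `Binary.fromConstantByteBuffer`:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch of the ByteBuffer-based reader described above.
public class PlainBinaryReaderSketch {
  private final ByteBuffer buffer; // whole page, positioned at the first value

  PlainBinaryReaderSketch(ByteBuffer page) {
    // One-time setup per page: fix little-endian order once so every
    // getInt() below is a single bounds-checked intrinsic read.
    this.buffer = page.order(ByteOrder.LITTLE_ENDIAN);
  }

  ByteBuffer readValueSlice() {
    int length = buffer.getInt();      // length prefix: one getInt, no IOException
    ByteBuffer value = buffer.slice(); // zero-copy view starting at current position
    value.limit(length);               // restrict the view to this value's bytes
    buffer.position(buffer.position() + length); // advance past the value
    return value;
  }
}
```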
### Benchmark
`BinaryEncodingBenchmark.decodePlain` (100k values per invocation, JDK 18,
JMH `-wi 5 -i 10 -f 3`, 30 samples per row):
| cardinality | stringLength | Before (ops/s) | After (ops/s) | Δ |
|------------:|-------------:|---------------:|--------------:|-------------------:|
| HIGH | 10 | 23.11M | 27.13M | **+17.4% (1.17x)** |
| HIGH | 100 | 20.52M | 22.20M | **+8.2% (1.08x)** |
| HIGH | 1000 | 7.07M | 7.68M | **+8.6% (1.09x)** |
| LOW | 10 | 22.89M | 26.46M | **+15.6% (1.16x)** |
| LOW | 100 | 20.35M | 22.16M | **+8.9% (1.09x)** |
| LOW | 1000 | 6.28M | 7.50M | **+19.4% (1.19x)** |
Per-op allocation is unchanged (~88 B/op = the returned `Binary` + the
per-value `ByteBuffer` slice).
The improvement is largest at small string lengths because the per-value
fixed cost (length read + slice) dominates more there; at 1000-byte values the
cost is increasingly dominated by the value-bytes work downstream rather than
the length read itself, but the gain is still ~9–19% even there.
### Are these changes tested?
Yes. All 573 `parquet-column` tests pass. No new test was added because the
returned `Binary` values are byte-identical to before (covered by existing
`BinaryPlainValuesReader` round-trip tests).
### Are there any user-facing changes?
No. Only an internal reader optimization. No public API, file format, or
configuration change.
### Closes #3509
Part of a small series of focused performance PRs from work in
[parquet-perf](https://github.com/iemejia/parquet-perf). Previous: #3494,
#3496, #3500, #3504, #3506.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]