iemejia opened a new pull request, #3506:
URL: https://github.com/apache/parquet-java/pull/3506
### Rationale for this change
`ByteStreamSplitValuesReader` is the symmetric reader for
`BYTE_STREAM_SPLIT`-encoded `FLOAT`, `DOUBLE`, `INT32`, and `INT64` columns. On
`initFromPage` it eagerly transposes the entire page from stream-split layout
(`elementSizeInBytes` separate streams of `valuesCount` bytes each) back to
interleaved layout. The current loop is:
```java
private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
  byte[] decoded = new byte[encoded.limit()];
  int destByteIndex = 0;
  for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
    for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
      decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
    }
  }
  return decoded;
}
```
Two issues on the hot path:
1. Every read goes through `ByteBuffer.get(int)` (per-call bounds checks +
virtual dispatch through `HeapByteBuffer`/`DirectByteBuffer`).
2. The inner stream offset (`stream * valuesCount`) is recomputed on every
iteration even though it depends only on the outer loop.
For a 100k-value `FLOAT` page that means 400k `ByteBuffer.get(int)` calls; for
`DOUBLE`/`INT64` pages, 800k.
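To make the layout concrete, here is a tiny standalone illustration (not Parquet code; class and method names are made up) of the stream-split transpose that `decodeData` undoes:

```java
// Illustration only: round-trips a small interleaved buffer through
// stream-split layout and back.
public class StreamSplitDemo {
  // Encode: interleaved -> elementSize streams of valuesCount bytes each.
  static byte[] split(byte[] interleaved, int elementSize) {
    int valuesCount = interleaved.length / elementSize;
    byte[] out = new byte[interleaved.length];
    for (int v = 0; v < valuesCount; ++v) {
      for (int s = 0; s < elementSize; ++s) {
        out[s * valuesCount + v] = interleaved[v * elementSize + s];
      }
    }
    return out;
  }

  // Decode: streams -> interleaved (same loop shape as decodeData above).
  static byte[] merge(byte[] encoded, int elementSize) {
    int valuesCount = encoded.length / elementSize;
    byte[] out = new byte[encoded.length];
    int dest = 0;
    for (int v = 0; v < valuesCount; ++v) {
      for (int s = 0; s < elementSize; ++s) {
        out[dest++] = encoded[s * valuesCount + v];
      }
    }
    return out;
  }

  public static void main(String[] args) {
    byte[] interleaved = {1, 2, 3, 4, 5, 6, 7, 8}; // two 4-byte values
    byte[] encoded = split(interleaved, 4);
    // Stream 0 holds the first byte of each value, stream 1 the second, etc.
    System.out.println(java.util.Arrays.toString(encoded)); // [1, 5, 2, 6, 3, 7, 4, 8]
    System.out.println(java.util.Arrays.equals(interleaved, merge(encoded, 4))); // true
  }
}
```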
### What changes are included in this PR?
Rewrite `decodeData` in three steps:
1. **Drop down to a `byte[]` view** of the encoded buffer. When
`encoded.hasArray()` is true (the typical case), use the backing array directly
with the correct base offset; otherwise copy once with a single `get(byte[])`
call. Eliminates the per-byte `ByteBuffer.get(int)` bounds check and virtual
dispatch.
2. **Specialize loops for the common element sizes (4 and 8)**. Hoist all
`stream * valuesCount` offsets into local ints (`s0..s3` for floats/ints,
`s0..s7` for doubles/longs) and write each output slot exactly once in a single
sequential pass. Reads come from `elementSizeInBytes` concurrent sequential
streams, which modern hardware prefetchers handle well.
3. **Generic fallback** for arbitrary element sizes (`FIXED_LEN_BYTE_ARRAY`
of any width).
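The three steps above can be sketched as follows (a standalone class with a hypothetical signature; in the PR the method lives inside `ByteStreamSplitValuesReader` and may differ in detail; the size-8 specialization follows the same pattern and is omitted for brevity):

```java
import java.nio.ByteBuffer;

public class DecodeSketch {
  static byte[] decodeData(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    final byte[] src;
    final int base;
    if (encoded.hasArray()) {
      // Step 1: read through the backing array, skipping per-byte
      // ByteBuffer bounds checks and virtual dispatch.
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      // Direct buffer: one bulk copy instead of N single-byte gets.
      src = new byte[encoded.remaining()];
      encoded.duplicate().get(src);
      base = 0;
    }
    byte[] decoded = new byte[elementSizeInBytes * valuesCount];
    if (elementSizeInBytes == 4) {
      // Step 2: hoist the four stream offsets out of the loop and fill
      // the output in one sequential pass.
      final int s0 = base;
      final int s1 = base + valuesCount;
      final int s2 = base + 2 * valuesCount;
      final int s3 = base + 3 * valuesCount;
      for (int v = 0, d = 0; v < valuesCount; ++v) {
        decoded[d++] = src[s0 + v];
        decoded[d++] = src[s1 + v];
        decoded[d++] = src[s2 + v];
        decoded[d++] = src[s3 + v];
      }
    } else {
      // Step 3: generic fallback for arbitrary element widths.
      for (int v = 0, d = 0; v < valuesCount; ++v) {
        for (int s = 0; s < elementSizeInBytes; ++s) {
          decoded[d++] = src[base + s * valuesCount + v];
        }
      }
    }
    return decoded;
  }
}
```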
### Benchmark
New `ByteStreamSplitDecodingBenchmark` (100k values per invocation, JDK 18,
JMH `-wi 5 -i 10 -f 3`, 30 samples per row):
| Type | Before | After | Δ |
|--------|--------:|---------:|---------------:|
| Float | 47.80M | 162.29M | **+240% (3.40x)** |
| Double | 26.32M | 66.00M | **+151% (2.51x)** |
| Int | 47.07M | 162.18M | **+245% (3.45x)** |
| Long | 26.80M | 66.00M | **+146% (2.46x)** |
Decoded output is byte-identical to before; per-op heap allocation is
unchanged.
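The table's numbers come from `ByteStreamSplitDecodingBenchmark` under JMH; as a rough stdlib-only stand-in (all names here are hypothetical, and a timing loop like this is no substitute for JMH), the two strategies and the byte-identical-output claim can be sanity-checked like so:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class DecodeSanityCheck {
  // Baseline: per-byte ByteBuffer.get(int), as in the old loop.
  static byte[] decodeViaByteBuffer(ByteBuffer encoded, int valuesCount, int elementSize) {
    byte[] decoded = new byte[elementSize * valuesCount];
    int d = 0;
    for (int v = 0; v < valuesCount; ++v) {
      for (int s = 0; s < elementSize; ++s) {
        decoded[d++] = encoded.get(v + s * valuesCount);
      }
    }
    return decoded;
  }

  // Optimized shape: single bulk view, then plain array indexing.
  static byte[] decodeViaArray(ByteBuffer encoded, int valuesCount, int elementSize) {
    byte[] src;
    int base;
    if (encoded.hasArray()) {
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      src = new byte[encoded.remaining()];
      encoded.duplicate().get(src);
      base = 0;
    }
    byte[] decoded = new byte[elementSize * valuesCount];
    int d = 0;
    for (int v = 0; v < valuesCount; ++v) {
      for (int s = 0; s < elementSize; ++s) {
        decoded[d++] = src[base + s * valuesCount + v];
      }
    }
    return decoded;
  }

  public static void main(String[] args) {
    int valuesCount = 100_000, elementSize = 4;
    byte[] raw = new byte[valuesCount * elementSize];
    for (int i = 0; i < raw.length; ++i) raw[i] = (byte) i;
    ByteBuffer encoded = ByteBuffer.wrap(raw);
    // Outputs must be byte-identical regardless of strategy.
    System.out.println(Arrays.equals(
        decodeViaByteBuffer(encoded, valuesCount, elementSize),
        decodeViaArray(encoded, valuesCount, elementSize))); // true
    long t0 = System.nanoTime();
    for (int i = 0; i < 50; ++i) decodeViaByteBuffer(encoded, valuesCount, elementSize);
    long t1 = System.nanoTime();
    for (int i = 0; i < 50; ++i) decodeViaArray(encoded, valuesCount, elementSize);
    long t2 = System.nanoTime();
    System.out.printf("ByteBuffer: %d us, array: %d us%n",
        (t1 - t0) / 1_000, (t2 - t1) / 1_000);
  }
}
```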
### Are these changes tested?
Yes. All 573 `parquet-column` tests pass; 51 BSS-specific tests pass (`mvn
test -pl parquet-column -Dtest='*ByteStreamSplit*'`). No new test was added
because the decoded bytes are unchanged (covered by existing round-trip and
`ByteStreamSplitValuesReaderTest` tests).
### Are there any user-facing changes?
No. Only an internal reader optimization. No public API, file format, or
configuration change.
### Closes #3505
Symmetric companion to #3504 (writer-side BSS optimization). Part of a small
series of focused performance PRs from work in
[parquet-perf](https://github.com/iemejia/parquet-perf). Previous: #3494,
#3496, #3500, #3504.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]