iemejia opened a new issue, #3505:
URL: https://github.com/apache/parquet-java/issues/3505
### Describe the enhancement requested
`ByteStreamSplitValuesReader` is the symmetric reader for
`BYTE_STREAM_SPLIT`-encoded `FLOAT`, `DOUBLE`, `INT32`, and `INT64` columns. On
`initFromPage` it eagerly transposes the entire page from stream-split layout
(`elementSizeInBytes` separate streams of `valuesCount` bytes each) back to
interleaved layout (`valuesCount` elements of `elementSizeInBytes` bytes each).
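To make the two layouts concrete, here is a minimal sketch that splits two little-endian `FLOAT` values into four byte streams. It is an illustration only, not code from parquet-java: the class name `BssLayoutDemo` and the helper `split` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class BssLayoutDemo {

  // Hypothetical helper: interleaved layout -> stream-split layout.
  // Byte k of value i lands at index k * valuesCount + i.
  public static byte[] split(byte[] interleaved, int elementSizeInBytes) {
    int valuesCount = interleaved.length / elementSizeInBytes;
    byte[] encoded = new byte[interleaved.length];
    for (int i = 0; i < valuesCount; i++) {
      for (int stream = 0; stream < elementSizeInBytes; stream++) {
        encoded[stream * valuesCount + i] = interleaved[i * elementSizeInBytes + stream];
      }
    }
    return encoded;
  }

  public static void main(String[] args) {
    // 1.0f = 0x3F800000 and 2.0f = 0x40000000, stored little-endian.
    byte[] interleaved = ByteBuffer.allocate(8)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putFloat(1.0f)
        .putFloat(2.0f)
        .array();
    byte[] encoded = split(interleaved, 4);
    System.out.println(Arrays.toString(interleaved)); // [0, 0, -128, 63, 0, 0, 0, 64]
    System.out.println(Arrays.toString(encoded));     // [0, 0, 0, 0, -128, 0, 63, 64]
  }
}
```

The reader has to undo this transposition, so every decoded byte comes from a different stream than its neighbour.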
The current loop is:
```java
private byte[] decodeData(ByteBuffer encoded, int valuesCount) {
  byte[] decoded = new byte[encoded.limit()];
  int destByteIndex = 0;
  for (int srcValueIndex = 0; srcValueIndex < valuesCount; ++srcValueIndex) {
    for (int stream = 0; stream < elementSizeInBytes; ++stream, ++destByteIndex) {
      decoded[destByteIndex] = encoded.get(srcValueIndex + stream * valuesCount);
    }
  }
  return decoded;
}
```
Two issues on the hot path:
1. Every read goes through `ByteBuffer.get(int)`, which does per-call bounds
checks and dispatches through `HeapByteBuffer`/`DirectByteBuffer` virtual
methods.
2. The stream offset (`stream * valuesCount`) is recomputed for every byte,
even though it does not depend on `srcValueIndex` and takes only
`elementSizeInBytes` distinct values per page.
For a 100k-value `FLOAT` or `INT32` page that is 400k `ByteBuffer.get(int)`
calls; for a `DOUBLE` or `INT64` page it is 800k.
JMH (new `ByteStreamSplitDecodingBenchmark`, 100k values per invocation, JDK
18, `-wi 5 -i 10 -f 3`, 30 samples) on master:
| Type | ops/s |
|--------|--------:|
| Float | 47.80M |
| Double | 26.32M |
| Int | 47.07M |
| Long | 26.80M |
### Proposal
Restructure `decodeData` in `ByteStreamSplitValuesReader`:
1. **Drop down to a `byte[]` view** of the encoded buffer. When
`encoded.hasArray()` is true (the typical case), use the backing array directly
with the correct base offset; otherwise copy once with a single `get(byte[])`
call. This eliminates the per-byte `ByteBuffer.get(int)` bounds check and
virtual dispatch.
2. **Specialize loops for the common element sizes (4 and 8)**. Hoist all
`stream * valuesCount` offsets out of the inner loop into local ints (`s0..s3`
for floats/ints, `s0..s7` for doubles/longs), and write each output slot
exactly once in a single sequential pass. The reads come from
`elementSizeInBytes` concurrent sequential streams, which modern hardware
prefetchers handle well (typically 8–16 tracked streams per core).
3. **Generic fallback** for arbitrary element sizes (`FIXED_LEN_BYTE_ARRAY`
of any width).
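The three steps above could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the actual patch: the class name `BssDecodeSketch` is hypothetical, `elementSizeInBytes` is passed as a parameter (it is a field in the reader) so the example stands alone, the encoded bytes are assumed to start at the buffer's current position, and the output is sized from `valuesCount * elementSizeInBytes` rather than `encoded.limit()`.

```java
import java.nio.ByteBuffer;

public class BssDecodeSketch {

  // Hypothetical stand-in for the private decodeData helper.
  public static byte[] decodeData(ByteBuffer encoded, int valuesCount, int elementSizeInBytes) {
    final int totalBytes = valuesCount * elementSizeInBytes;
    final byte[] src;
    final int base;
    if (encoded.hasArray()) {
      // Heap buffer: read the backing array in place, no copy.
      src = encoded.array();
      base = encoded.arrayOffset() + encoded.position();
    } else {
      // Direct or read-only buffer: one bulk copy via a duplicate,
      // so the caller's position is left untouched.
      src = new byte[totalBytes];
      encoded.duplicate().get(src);
      base = 0;
    }

    final byte[] decoded = new byte[totalBytes];
    if (elementSizeInBytes == 4) {
      // Stream start offsets hoisted out of the loop.
      final int s0 = base, s1 = s0 + valuesCount, s2 = s1 + valuesCount, s3 = s2 + valuesCount;
      for (int i = 0, d = 0; i < valuesCount; ++i, d += 4) {
        decoded[d] = src[s0 + i];
        decoded[d + 1] = src[s1 + i];
        decoded[d + 2] = src[s2 + i];
        decoded[d + 3] = src[s3 + i];
      }
    } else if (elementSizeInBytes == 8) {
      final int s0 = base, s1 = s0 + valuesCount, s2 = s1 + valuesCount, s3 = s2 + valuesCount;
      final int s4 = s3 + valuesCount, s5 = s4 + valuesCount, s6 = s5 + valuesCount, s7 = s6 + valuesCount;
      for (int i = 0, d = 0; i < valuesCount; ++i, d += 8) {
        decoded[d] = src[s0 + i];
        decoded[d + 1] = src[s1 + i];
        decoded[d + 2] = src[s2 + i];
        decoded[d + 3] = src[s3 + i];
        decoded[d + 4] = src[s4 + i];
        decoded[d + 5] = src[s5 + i];
        decoded[d + 6] = src[s6 + i];
        decoded[d + 7] = src[s7 + i];
      }
    } else {
      // Generic fallback: one pass per stream, offset computed once per stream.
      for (int stream = 0; stream < elementSizeInBytes; ++stream) {
        final int srcOffset = base + stream * valuesCount;
        for (int i = 0, d = stream; i < valuesCount; ++i, d += elementSizeInBytes) {
          decoded[d] = src[srcOffset + i];
        }
      }
    }
    return decoded;
  }
}
```

In this sketch the 4- and 8-byte paths write `decoded` strictly sequentially, while the fallback trades that for strided writes to keep the code short; a real implementation might choose differently.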
Expected speedup (same JMH config):
| Type | Before | After | Δ |
|--------|--------:|---------:|---------------:|
| Float | 47.80M | 162.29M | **+240% (3.4x)** |
| Double | 26.32M | 66.00M | **+151% (2.5x)** |
| Int | 47.07M | 162.18M | **+245% (3.4x)** |
| Long | 26.80M | 66.00M | **+146% (2.5x)** |
### Scope
- Single file change to
`parquet-column/src/main/java/org/apache/parquet/column/values/bytestreamsplit/ByteStreamSplitValuesReader.java`.
- No public-API change; only the `private decodeData` helper is rewritten.
- All 573 `parquet-column` tests pass; 51 BSS-specific tests pass.
### Relation
Symmetric companion to #3504 (writer-side BSS optimization). Part of a small
series of focused performance PRs from work in
[parquet-perf](https://github.com/iemejia/parquet-perf). Previous: #3494,
#3496, #3500, #3504.