[PR] [SPARK-56897][SQL] Reduce per-value allocations in DELTA_BYTE_ARRAY Parquet decoder [spark]

via GitHub Sat, 16 May 2026 17:42:06 -0700


iemejia opened a new pull request, #55924:
URL: https://github.com/apache/spark/pull/55924


   ### What changes were proposed in this pull request?
   
   Reduces per-value heap allocations in `VectorizedDeltaByteArrayReader` (the 
Parquet DELTA_BYTE_ARRAY vectorized decoder) by replacing `ByteBuffer`-based 
state tracking with a reusable `byte[]` buffer.
   
   Key changes:
   
   1. **Replace `ByteBuffer previous` with `byte[] prevBuf` + `int prevLen`**: 
The DELTA_BYTE_ARRAY encoding exploits prefix sharing between consecutive 
values. The old code materialized each decoded value as a `ByteBuffer` obtained 
from `arrayData.getByteBuffer()` (which allocates ~48 bytes per value via 
`ByteBuffer.wrap()`). The new approach uses a reusable `byte[]` buffer where 
the prefix bytes from the previous iteration are already in place at the start 
-- only the suffix portion needs to be written.
   
   2. **Add `getSuffixInto()` to `VectorizedDeltaLengthByteArrayReader`**: New 
method that reads suffix bytes directly into a caller-supplied `byte[]` via 
`InputStream.read()`, avoiding the `ByteBuffer` allocation that `getBytes()` / 
`in.slice()` performs. Also adds `getSuffixLength()` for cases where only the 
length is needed.
   
   3. **Rewrite `skipBinary()` to use `prevBuf` directly**: The old 
implementation was particularly wasteful -- it maintained two 
`WritableColumnVector` instances (`binaryValVector` and `tempBinaryValVector`), 
reset one per skipped value, and swapped them after each iteration. The new 
implementation simply assembles into `prevBuf` without touching any column 
vectors.
   
   4. **Remove `tempBinaryValVector` field**: No longer needed after the 
`skipBinary` rewrite.
   
   ### Why are the changes needed?
   
   The DELTA_BYTE_ARRAY decoder allocated 2 `ByteBuffer` objects per decoded 
value (one from `in.slice()` for the suffix, one from 
`arrayData.getByteBuffer()` for the previous-value reference). For a typical 
4096-value page, this produced ~8K short-lived objects (~384KB) that stress the 
young-generation GC.
   
   The `skipBinary` path was even worse: it reset and rebuilt column vectors 
per skipped value, involving `reserve()`, `Arrays.copyOf()`, and `putArray()` 
overhead just to maintain the previous-value state.
   
   After this change, the hot path performs zero per-value heap allocations 
(the `prevBuf` array is reused across all values and only grows if a value 
exceeds its current capacity).
   
   Normalized benchmark improvements (using Variant reads as cross-hardware 
baseline):
   - `readBinary`: 1.1-1.3x faster across overlap shapes
   - `skipBinary`: 1.5-1.9x faster (largest gains from eliminating column 
vector reset/swap)
   - `readBinary(len)` single-value: ~1.2x faster
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   - Existing tests: `ParquetDeltaByteArrayEncodingSuite` (14 tests), 
`ParquetDeltaLengthByteArrayEncodingSuite`, `ParquetEncodingSuite` -- all pass.
   - Benchmark: `VectorizedDeltaReaderBenchmark` Group C (DELTA_BYTE_ARRAY) 
with four overlap shapes (no overlap, half overlap, full overlap, len=64).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode with Claude Opus (claude-opus-4.6)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-56897][SQL] Reduce per-value allocations in DELTA_BYTE_ARRAY Parquet decoder [spark]

Reply via email to