[PR] [SPARK-56907][SQL] Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY Parquet vectorized reader [spark]

via GitHub Sun, 17 May 2026 02:11:22 -0700


iemejia opened a new pull request, #55932:
URL: https://github.com/apache/spark/pull/55932


   ### What changes were proposed in this pull request?
   
   This PR reduces object allocation in the DELTA_LENGTH_BYTE_ARRAY vectorized 
Parquet reader (`VectorizedDeltaLengthByteArrayReader`) by applying three 
targeted changes:
   
   **readBinary**: Replace the per-value `in.slice(length)` call (one `ByteBuffer` 
allocation per value) with a single bulk `in.slice(totalDataLen)` that covers 
the entire batch. Individual values are then written to the column vector via 
`putByteArray` from the shared backing array. For a batch of N values, this 
eliminates N-1 `ByteBuffer` allocations.
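   
   The `readBinary` change can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `ByteArraySink` is a hypothetical stand-in for the column vector's `putByteArray`, the `slice` helper only models `in.slice(length)` on a plain `ByteBuffer`, and the input is assumed to be heap-backed so that `array()` is available.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a column vector's putByteArray(rowId, bytes, offset, length).
interface ByteArraySink {
  void putByteArray(int rowId, byte[] value, int offset, int length);
}

final class BulkReadSketch {
  // Models in.slice(length): returns a view of the next `length` bytes and
  // advances the input. Each call allocates one new ByteBuffer object.
  static ByteBuffer slice(ByteBuffer in, int length) {
    ByteBuffer view = in.duplicate();
    view.limit(view.position() + length);
    in.position(in.position() + length);
    return view;
  }

  // Per-value approach: one slice (one ByteBuffer allocation) per value.
  static void readPerValue(ByteBuffer in, int[] lengths, ByteArraySink sink) {
    for (int i = 0; i < lengths.length; i++) {
      ByteBuffer v = slice(in, lengths[i]);
      sink.putByteArray(i, v.array(), v.arrayOffset() + v.position(), lengths[i]);
    }
  }

  // Bulk approach: sum the lengths, slice once, then walk the shared backing array.
  // Assumes a heap-backed buffer (hasArray() is true).
  static void readBulk(ByteBuffer in, int[] lengths, ByteArraySink sink) {
    int totalDataLen = 0;
    for (int len : lengths) {
      totalDataLen += len;
    }
    ByteBuffer all = slice(in, totalDataLen);
    byte[] backing = all.array();
    int offset = all.arrayOffset() + all.position();
    for (int i = 0; i < lengths.length; i++) {
      sink.putByteArray(i, backing, offset, lengths[i]);
      offset += lengths[i];
    }
  }

  public static void main(String[] args) {
    byte[] data = "abcdef".getBytes(StandardCharsets.UTF_8);
    int[] lengths = {1, 3, 0, 2};
    List<String> out = new ArrayList<>();
    readBulk(ByteBuffer.wrap(data), lengths,
        (row, b, off, len) -> out.add(new String(b, off, len, StandardCharsets.UTF_8)));
    System.out.println(out);
  }
}
```

   Both methods hand identical bytes to the sink. The bulk version creates one view per batch instead of one per value.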
   
   **skipBinary**: Replace the per-value skip loop (N separate `in.skip()` 
calls) with a single bulk skip by summing all value lengths upfront.
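   
   The `skipBinary` change follows the same pattern. The sketch below is an illustration under stated assumptions, not the PR's implementation: a plain `ByteBuffer` stands in for the reader's input stream, advancing its position models `in.skip()`, and the `lengths`, `start`, and `count` parameters are hypothetical names for the decoded value lengths and the range to skip.

```java
import java.nio.ByteBuffer;

final class BulkSkipSketch {
  // Per-value approach: one position update (modelling one in.skip() call) per value.
  static void skipPerValue(ByteBuffer in, int[] lengths, int start, int count) {
    for (int i = 0; i < count; i++) {
      in.position(in.position() + lengths[start + i]);
    }
  }

  // Bulk approach: sum the lengths of the skipped values, then advance once.
  static void skipBulk(ByteBuffer in, int[] lengths, int start, int count) {
    long total = 0;
    for (int i = 0; i < count; i++) {
      total += lengths[start + i];
    }
    // toIntExact throws if the combined offset does not fit in an int.
    in.position(Math.toIntExact(in.position() + total));
  }

  public static void main(String[] args) {
    ByteBuffer in = ByteBuffer.wrap(new byte[6]);
    skipBulk(in, new int[] {1, 3, 0, 2}, 1, 3);
    System.out.println(in.position());
  }
}
```

   Both methods leave the input at the same position. The bulk version touches the stream once regardless of how many values are skipped.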
   
   **readGeoData**: Remove the `ByteBuffer.wrap()` + `ByteBufferOutputWriter` 
indirection per value and call `putByteArray` directly from the converter 
output array.
   
   ### Why are the changes needed?
   
   The DELTA_LENGTH_BYTE_ARRAY encoding is used for binary/string columns in 
Parquet v2 pages. In the current vectorized reader, `readBinary` allocates one 
`ByteBuffer` per value via `in.slice(length)`, and `skipBinary` performs a 
separate stream skip per value. For large batches (e.g. 1M values per page), 
this creates significant allocation pressure and per-call overhead.
   
   Micro-benchmarks on `VectorizedDeltaReaderBenchmark` Group D show:
   
   | Benchmark | Before (ms) | After (ms) | Speedup |
   |---|---|---|---|
   | readBinary, payloadLen=8 | 12 | 10 | **1.2x** |
   | readBinary, payloadLen=32 | 16 | 14 | **1.1x** |
   | readBinary, payloadLen=128 | 13 | 12 | **1.1x** |
   | readBinary, payloadLen=512 | 32 | 32 | ~1.0x |
   | skipBinary (all sizes) | 7 | 5 | **1.4x** |
   
   The `readBinary` speedup is larger for small payloads, where allocation cost 
dominates, and disappears at payloadLen=512. `skipBinary` shows a consistent 
1.4x improvement across all payload sizes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests: `ParquetDeltaLengthByteArrayEncodingSuite` (14 tests 
including serialization, random strings, empty strings, skip interleaving, and 
geo types) and `ParquetEncodingSuite` all pass.
   
   Benchmarks: `VectorizedDeltaReaderBenchmark` Group D 
(DELTA_LENGTH_BYTE_ARRAY) run locally on JDK 17.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode with Claude claude-opus-4.6


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

