iemejia opened a new pull request, #55919:
URL: https://github.com/apache/spark/pull/55919

   ### What changes were proposed in this pull request?
   
   Replace per-element lambda dispatch in `readIntegers`/`readLongs` with bulk 
paths that compute prefix sums in-place over the unpacked delta buffer and 
write via `putInts`/`putLongs` (backed by `System.arraycopy` on-heap).
   
   Three changes in this PR:
   
   1. **Bulk read for INT32/INT64**: `readBulkIntegers` and `readBulkLongs` 
replace the generic `readValues()` lambda-per-value path. A single 
`loadMiniBlockBulk` method handles block/mini-block loading and prefix-sum 
computation, then delegates the type-specific write to a `BulkWriter` callback 
(called once per mini-block, not per value).
   
   2. **Zero-allocation unsigned long encoding**: Replace `new 
BigInteger(Long.toUnsignedString(v)).toByteArray()` (3 allocations per value: 
String + BigInteger + byte[]) with `ByteBuffer.putLong` into a reusable scratch 
buffer. The shared utility `encodeUnsignedLongBigEndian` is extracted into 
`VectorizedReaderBase` and applied to all call sites 
(`VectorizedDeltaBinaryPackedReader`, `UnsignedLongUpdater`, 
`ParquetDictionary`).
   
   3. **Benchmark fix**: Add `unsignedLongVec.reset()` before 
`readUnsignedLongs` to prevent unbounded `arrayData()` growth across benchmark 
iterations (OOM).
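
The bulk path in item 1 can be illustrated with a minimal sketch. It is not the code from this PR: the class name, the `decodeMiniBlock` helper, and the shape of the `BulkWriter` interface are assumptions made for illustration. It relies only on the DELTA_BINARY_PACKED rule that each stored value is the actual delta minus the block's minimum delta, so the prefix sum adds `minDelta` back at every step.

```java
import java.util.Arrays;

// Illustrative sketch only; names and signatures are hypothetical, not the PR's.
final class DeltaPrefixSumSketch {

    /** Hypothetical stand-in for a per-mini-block write callback. */
    interface BulkWriter {
        void write(long[] values, int offset, int count);
    }

    /**
     * Converts the unpacked deltas of one mini-block into absolute values in place,
     * then hands the whole range to the writer in a single call.
     * Returns the last decoded value so the next mini-block can continue from it.
     */
    static long decodeMiniBlock(long[] deltas, int count, long minDelta,
                                long lastValue, BulkWriter writer) {
        long running = lastValue;
        for (int i = 0; i < count; i++) {
            running += minDelta + deltas[i]; // stored delta is relative to minDelta
            deltas[i] = running;             // overwrite the delta with the value
        }
        writer.write(deltas, 0, count);      // one callback per mini-block
        return running;
    }

    public static void main(String[] args) {
        long[] out = new long[4];
        long[] unpacked = {0, 1, 0, 2};
        // The writer here is a plain array copy, mimicking an on-heap bulk put.
        decodeMiniBlock(unpacked, 4, 1L, 10L,
            (v, off, n) -> System.arraycopy(v, off, out, 0, n));
        System.out.println(Arrays.toString(out)); // prints [11, 13, 14, 17]
    }
}
```

The loop body contains no virtual or lambda call, which is the property the bulk path is after; the callback cost is paid once per mini-block.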
   
   ### Why are the changes needed?
   
   The DELTA_BINARY_PACKED decoder was 2-5x slower than PLAIN encoding for 
INT32/INT64 reads due to per-element lambda dispatch and lack of bulk vector 
writes. The `readUnsignedLongs` path allocated 3 objects per value (12,288 
allocations per 4096-row batch) due to `BigInteger(Long.toUnsignedString(v))`.
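
To make the allocation difference concrete, here is a hedged sketch of the two encodings side by side. The class and method names are hypothetical, and the fixed 9-byte layout (one zero sign byte followed by the 8 value bytes) is an assumption: the description does not say whether `encodeUnsignedLongBigEndian` trims leading bytes. Both forms decode to the same non-negative `BigInteger`.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Illustrative sketch only; not the PR's encodeUnsignedLongBigEndian.
final class UnsignedLongEncodingSketch {

    // Reusable scratch: scratch[0] is a sign byte that stays 0, bytes 1..8 hold the value.
    private final byte[] scratch = new byte[9];
    private final ByteBuffer buffer = ByteBuffer.wrap(scratch); // big-endian by default

    /** Allocation-free: overwrites and returns the shared scratch array. */
    byte[] encode(long v) {
        buffer.putLong(1, v); // absolute put at index 1, leaves scratch[0] == 0
        return scratch;
    }

    /** The replaced approach: allocates a String, a BigInteger, and a byte[] per value. */
    static byte[] encodeViaBigInteger(long v) {
        return new BigInteger(Long.toUnsignedString(v)).toByteArray();
    }

    public static void main(String[] args) {
        UnsignedLongEncodingSketch enc = new UnsignedLongEncodingSketch();
        long v = -1L; // the largest unsigned 64-bit value
        BigInteger fromScratch = new BigInteger(enc.encode(v));
        BigInteger fromOldPath = new BigInteger(encodeViaBigInteger(v));
        System.out.println(fromScratch.equals(fromOldPath)); // prints true
    }
}
```

Because `encode` returns the shared scratch array, a caller has to copy the bytes out (for example into the column vector) before encoding the next value.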
   
   Benchmark results on the same machine (AMD EPYC 9V45, OpenJDK 25.0.3+9-LTS):
   
   | Benchmark | Baseline (M/s) | After (M/s) | Speedup |
   |---|---|---|---|
   | INT32 readIntegers, monotonic | 644 | 873 | **1.4x** |
   | INT32 readIntegers, small-delta | 466 | 553 | **1.2x** |
   | INT32 readIntegers, wide random | 357 | 417 | **1.2x** |
   | INT64 readLongs, constant | 316 | 879 | **2.8x** |
   | INT64 readLongs, monotonic | 252 | 951 | **3.8x** |
   | INT64 readLongs, small-delta | 216 | 587 | **2.7x** |
   | INT64 readLongs, wide random | 163 | 313 | **1.9x** |
   | readUnsignedLongs | 9.2 | 66 | **7.2x** |
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a performance improvement to internal Parquet decoding. No API 
or behavior changes.
   
   ### How was this patch tested?
   
   - Existing unit tests: `ParquetDeltaEncodingInteger` (13 tests), 
`ParquetDeltaEncodingLong` (13 tests), `ParquetDeltaByteArrayEncodingSuite`, 
`ParquetDeltaLengthByteArrayEncodingSuite`, `ParquetVectorizedSuite` (25 
tests), `ParquetIOSuite` (unsigned Parquet logical types test) -- all pass.
   - Benchmark: `VectorizedDeltaReaderBenchmark` run before and after on the 
same machine with changes stashed/unstashed for fair comparison.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode (Claude claude-opus-4.6)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

