iemejia opened a new issue, #56011:
URL: https://github.com/apache/spark/issues/56011

   ## Overview
   
   This is an umbrella issue tracking a series of performance improvements to 
the Parquet vectorized reader in Spark SQL. The changes target allocation 
reduction, bulk-read optimizations, and JIT-friendly code patterns across 
multiple encoding paths.
   
   All PRs are independent and can be reviewed and merged in any order. 
Together they yield throughput gains of roughly 1.1x to 7.2x in the benchmarks 
below, depending on the encoding and data shape, with no user-facing 
behavioral changes.
   
   ## Pull Requests
   
   ### 1. DELTA_BINARY_PACKED bulk read optimization
   **PR:** #55919 
([SPARK-56892](https://issues.apache.org/jira/browse/SPARK-56892))
   
   Replaces per-element lambda dispatch in `readIntegers`/`readLongs` with bulk 
paths that compute prefix sums in-place and write via `putInts`/`putLongs`. 
Also eliminates 3 allocations per value in `readUnsignedLongs` by replacing 
`new BigInteger(Long.toUnsignedString(v))` with a reusable `ByteBuffer`.
   
   | Type | Speedup |
   |------|---------|
   | INT32 (monotonic) | 1.4x |
   | INT64 (monotonic) | 3.8x |
   | readUnsignedLongs | 7.2x |
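
To make the two ideas above concrete, here is a minimal sketch, not the code
from the PR. The class and method names (`DeltaBulkSketch`, `prefixSumInPlace`,
`toUnsignedBigInteger`) are hypothetical, and the sketch assumes the deltas for
a block have already been unpacked into a `long[]` with the block's minimum
delta added back.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration; names do not come from the Spark code base.
final class DeltaBulkSketch {
  // Reused for every conversion. ByteBuffer is big-endian by default, which is
  // the byte order BigInteger expects for a magnitude array.
  private final ByteBuffer scratch = ByteBuffer.allocate(Long.BYTES);

  /**
   * Turns a block of deltas into absolute values in place, with no per-element
   * callback. `previous` is the last value decoded before this block.
   * Returns the last absolute value so the next block can continue from it.
   */
  static long prefixSumInPlace(long[] deltas, int count, long previous) {
    long running = previous;
    for (int i = 0; i < count; i++) {
      running += deltas[i]; // plain wrapping addition
      deltas[i] = running;
    }
    return running;
  }

  /** Interprets `v` as an unsigned 64-bit value without building a String. */
  BigInteger toUnsignedBigInteger(long v) {
    scratch.putLong(0, v); // absolute put, position is not changed
    return new BigInteger(1, scratch.array()); // signum 1 = non-negative
  }

  public static void main(String[] args) {
    long[] block = {1, 2, 3};
    prefixSumInPlace(block, block.length, 10);
    System.out.println(Arrays.toString(block));
    System.out.println(new DeltaBulkSketch().toUnsignedBigInteger(-1L));
  }
}
```

After the prefix sum, the whole array can be handed to a single bulk write
such as `putLongs` instead of one call per element.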
   
   ---
   
   ### 2. Dictionary decoding hasNull fast path + per-class updater overrides
   **PR:** #55920 
([SPARK-56893](https://issues.apache.org/jira/browse/SPARK-56893))
   
   Adds a `hasNull()` fast path that skips per-element null checks when the 
column has no nulls (common case). Per-class `decodeDictionaryIds` overrides 
give C2 monomorphic call sites, enabling full inlining of type-specific decode 
expressions.
   
   | Scenario | Speedup |
   |----------|---------|
   | No nulls (avg across 6 updaters) | 1.24x |
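
The shape of the fast path can be sketched as follows. This is an illustration
under assumptions, not the PR's code: `DictionaryDecodeSketch` and
`decodeLongs` are hypothetical names, and plain arrays stand in for Spark's
column vectors and dictionary.

```java
import java.util.Arrays;

// Hypothetical illustration; plain arrays stand in for column vectors.
final class DictionaryDecodeSketch {
  /**
   * Resolves dictionary ids into values. When the batch has no nulls, the
   * loop body is a single unconditional lookup, which is easier for the JIT
   * to unroll and vectorize than a loop with a per-element null check.
   */
  static void decodeLongs(int[] ids, boolean[] isNull, boolean hasNull,
                          long[] dictionary, long[] out) {
    if (!hasNull) {
      // Fast path: isNull is never consulted.
      for (int i = 0; i < ids.length; i++) {
        out[i] = dictionary[ids[i]];
      }
      return;
    }
    // Slow path: skip slots that are marked null.
    for (int i = 0; i < ids.length; i++) {
      if (!isNull[i]) {
        out[i] = dictionary[ids[i]];
      }
    }
  }

  public static void main(String[] args) {
    long[] dictionary = {100L, 200L, 300L};
    long[] out = new long[3];
    decodeLongs(new int[] {2, 0, 1}, null, false, dictionary, out);
    System.out.println(Arrays.toString(out));
  }
}
```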
   
   ---
   
   ### 3. Vectorized BYTE_STREAM_SPLIT reader
   **PR:** #55921 
([SPARK-56894](https://issues.apache.org/jira/browse/SPARK-56894))
   
   Adds a new `VectorizedByteStreamSplitValuesReader` that decodes BSS-encoded 
pages (FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY) using batch 
byte-gathering instead of falling back to parquet-mr per-value reads.
   
   | Type | Speedup vs parquet-mr |
   |------|-----------------------|
   | INT32 | 4.5x |
   | INT64 | 2.8x |
   | FLOAT | 4.2x |
   | DOUBLE | 2.8x |
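
For readers unfamiliar with the encoding: BYTE_STREAM_SPLIT stores byte 0 of
every value, then byte 1 of every value, and so on. A rough sketch of batch
byte-gathering for FLOAT follows. The class `ByteStreamSplitSketch` is
hypothetical and assumes the whole page is available as a `byte[]`; the reader
in the PR may be organized differently.

```java
import java.util.Arrays;

// Hypothetical illustration of BYTE_STREAM_SPLIT decoding for 4-byte floats.
final class ByteStreamSplitSketch {
  /**
   * `page` holds 4 streams of `count` bytes each. Stream k contains byte k
   * (little-endian order) of every value, so value i is rebuilt from
   * page[i], page[count + i], page[2 * count + i], and page[3 * count + i].
   */
  static void decodeFloats(byte[] page, int count, float[] out) {
    for (int i = 0; i < count; i++) {
      int bits = (page[i] & 0xFF)
          | ((page[count + i] & 0xFF) << 8)
          | ((page[2 * count + i] & 0xFF) << 16)
          | ((page[3 * count + i] & 0xFF) << 24);
      out[i] = Float.intBitsToFloat(bits);
    }
  }

  public static void main(String[] args) {
    // 1.0f is 0x3F800000 and 2.0f is 0x40000000, split into four byte streams.
    byte[] page = {0, 0, 0, 0, (byte) 0x80, 0, 0x3F, 0x40};
    float[] out = new float[2];
    decodeFloats(page, 2, out);
    System.out.println(Arrays.toString(out));
  }
}
```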
   
   ---
   
   ### 4. Batch ByteBuffer slice in RLE PACKED decode
   **PR:** #55922 
([SPARK-56895](https://issues.apache.org/jira/browse/SPARK-56895))
   
   Replaces per-group `in.slice(bitWidth)` (one `ByteBuffer` allocation per 8 
values) with a single bulk slice for the entire PACKED run. Eliminates ~128K 
short-lived ByteBuffer allocations per 1M-value page.
   
   | bitWidth | Speedup (readIntegers) |
   |----------|------------------------|
   | 4 | 2.1x |
   | 8 | 2.4x |
   | 12 | 1.6x |
   | 20 | 1.4x |
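
A simplified sketch of the idea: a PACKED run of `groups` groups occupies
`groups * bitWidth` bytes, so it can be sliced once and unpacked from that one
view. `PackedRunSketch` and `unpackRun` are hypothetical names, a plain
`java.nio.ByteBuffer` stands in for the reader's input stream, and the generic
bit loop below is only for illustration (a real reader would typically use
specialized unpackers per bit width).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical illustration; not the decoder from the PR.
final class PackedRunSketch {
  /**
   * Unpacks `groups` groups of 8 values, each `bitWidth` bits wide
   * (0 to 32), packed starting at the least significant bit of each byte.
   * Takes one slice for the whole run and advances `in` past it.
   */
  static int[] unpackRun(ByteBuffer in, int bitWidth, int groups) {
    int runBytes = groups * bitWidth;
    int oldLimit = in.limit();
    in.limit(in.position() + runBytes);
    ByteBuffer run = in.slice(); // the only buffer object created per run
    in.limit(oldLimit);
    in.position(in.position() + runBytes);

    int[] out = new int[groups * 8];
    long mask = (1L << bitWidth) - 1;
    long acc = 0;      // bits read from the run but not yet consumed
    int bitsInAcc = 0; // number of valid bits in acc
    int pos = 0;       // next byte to read from the slice
    for (int i = 0; i < out.length; i++) {
      while (bitsInAcc < bitWidth) {
        acc |= (run.get(pos++) & 0xFFL) << bitsInAcc;
        bitsInAcc += 8;
      }
      out[i] = (int) (acc & mask);
      acc >>>= bitWidth;
      bitsInAcc -= bitWidth;
    }
    return out;
  }

  public static void main(String[] args) {
    // The values 0..7 at bit width 3, as in the Parquet format documentation.
    ByteBuffer in = ByteBuffer.wrap(new byte[] {(byte) 0x88, (byte) 0xC6, (byte) 0xFA});
    System.out.println(Arrays.toString(unpackRun(in, 3, 1)));
  }
}
```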
   
   ---
   
   ### 5. Bulk read paths for timestamp/date Parquet vector updaters
   **PR:** #55923 
([SPARK-56896](https://issues.apache.org/jira/browse/SPARK-56896))
   
   Replaces per-element `readValue` loops with two-pass bulk read + in-place 
conversion for five updaters (`LongAsMicrosUpdater`, `LongAsNanosUpdater`, 
`LongAsMicrosRebaseUpdater`, `DateToTimestampNTZUpdater`, 
`DateToTimestampNTZWithRebaseUpdater`). Avoids per-element virtual dispatch 
through `VectorizedValuesReader`.
   
   | Updater | Speedup |
   |---------|---------|
   | LongAsMicrosUpdater | 2.9x |
   | DateToTimestampNTZUpdater | 1.2x |
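
The two-pass pattern can be sketched as below. This is an illustration only:
`BulkTimestampSketch` and `readMillisAsMicros` are hypothetical names, a
little-endian `ByteBuffer` of PLAIN-encoded INT64 values stands in for the
values reader, and the sketch assumes the conversion is milliseconds to
microseconds.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration of "bulk read, then convert in place".
final class BulkTimestampSketch {
  /** Reads `count` millisecond values into out[offset..] as microseconds. */
  static void readMillisAsMicros(ByteBuffer page, int count, long[] out, int offset) {
    // Pass 1: one bulk copy of the raw longs. The duplicate keeps the
    // caller's byte order setting untouched.
    page.duplicate().order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(out, offset, count);
    page.position(page.position() + count * Long.BYTES);

    // Pass 2: convert in place; a tight loop with no per-element dispatch.
    for (int i = offset; i < offset + count; i++) {
      out[i] = Math.multiplyExact(out[i], 1000L); // throws on overflow
    }
  }

  public static void main(String[] args) {
    ByteBuffer page = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
    page.putLong(1L).putLong(2L);
    page.flip();
    long[] out = new long[2];
    readMillisAsMicros(page, 2, out, 0);
    System.out.println(Arrays.toString(out));
  }
}
```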
   
   ---
   
   ### 6. Reduce per-value allocations in DELTA_BYTE_ARRAY decoder
   **PR:** #55924 
([SPARK-56897](https://issues.apache.org/jira/browse/SPARK-56897))
   
   Replaces `ByteBuffer`-based state tracking with a reusable `byte[]` buffer, 
eliminating 2 ByteBuffer allocations per decoded value (~8K objects per 
4096-value page). Also rewrites `skipBinary` to avoid column vector reset/swap 
overhead.
   
   | Operation | Speedup |
   |-----------|---------|
   | readBinary | 1.1-1.3x |
   | skipBinary | 1.5-1.9x |
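
In DELTA_BYTE_ARRAY, each value is the first `prefixLen` bytes of the previous
value followed by a suffix. A minimal sketch of tracking that state in a
reusable `byte[]` follows. `DeltaByteArraySketch` and its methods are
hypothetical, and the sketch assumes prefix lengths and suffix bytes have
already been decoded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not the decoder from the PR.
final class DeltaByteArraySketch {
  private byte[] previous = new byte[16]; // reused for every value
  private int previousLen = 0;

  /**
   * Rebuilds the next value inside the reusable buffer and returns its length.
   * The shared prefix is already in place from the previous value, so only
   * the suffix bytes are copied.
   */
  int next(int prefixLen, byte[] suffixes, int suffixOffset, int suffixLen) {
    if (prefixLen > previousLen) {
      throw new IllegalArgumentException("prefix longer than previous value");
    }
    int len = prefixLen + suffixLen;
    if (len > previous.length) {
      // Grow geometrically; copyOf preserves the prefix bytes.
      previous = Arrays.copyOf(previous, Math.max(len, previous.length * 2));
    }
    System.arraycopy(suffixes, suffixOffset, previous, prefixLen, suffixLen);
    previousLen = len;
    return len;
  }

  /** The reusable buffer; only the first `next(...)` bytes are valid. */
  byte[] buffer() {
    return previous;
  }

  public static void main(String[] args) {
    byte[] suffixes = "applely".getBytes(StandardCharsets.US_ASCII);
    DeltaByteArraySketch decoder = new DeltaByteArraySketch();
    decoder.next(0, suffixes, 0, 5);           // "apple"
    int len = decoder.next(3, suffixes, 5, 2); // "app" + "ly"
    System.out.println(new String(decoder.buffer(), 0, len, StandardCharsets.US_ASCII));
  }
}
```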
   
   ---
   
   ### 7. Reduce per-value allocation in DELTA_LENGTH_BYTE_ARRAY decoder
   **PR:** #55932 
([SPARK-56907](https://issues.apache.org/jira/browse/SPARK-56907))
   
   Replaces per-value `in.slice(length)` with a single bulk slice for the 
entire batch. Replaces per-value skip loop with a single bulk skip.
   
   | Operation | Speedup |
   |-----------|---------|
   | readBinary (small payloads) | 1.2x |
   | skipBinary | 1.4x |
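
A sketch of the bulk approach, under assumptions: because the lengths are
known up front, the payload bytes for a batch are one contiguous region, so
one slice plus per-value offsets is enough. `DeltaLengthSketch`, `sliceBatch`,
and `skipBatch` are hypothetical names, and a plain `java.nio.ByteBuffer`
stands in for the reader's input stream.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not the decoder from the PR.
final class DeltaLengthSketch {
  /**
   * Returns one slice covering `count` payloads starting at lengths[start],
   * and fills offsetsOut[i] with the start of payload i inside that slice.
   */
  static ByteBuffer sliceBatch(ByteBuffer in, int[] lengths, int start, int count,
                               int[] offsetsOut) {
    int total = 0;
    for (int i = 0; i < count; i++) {
      offsetsOut[i] = total;
      total += lengths[start + i];
    }
    int oldLimit = in.limit();
    in.limit(in.position() + total);
    ByteBuffer batch = in.slice(); // one buffer object for the whole batch
    in.limit(oldLimit);
    in.position(in.position() + total);
    return batch;
  }

  /** Skips `count` payloads with a single position update. */
  static void skipBatch(ByteBuffer in, int[] lengths, int start, int count) {
    int total = 0;
    for (int i = 0; i < count; i++) {
      total += lengths[start + i];
    }
    in.position(in.position() + total);
  }

  public static void main(String[] args) {
    ByteBuffer in = ByteBuffer.wrap(new byte[] {'a', 'b', 'c', 'd', 'e', 'f'});
    int[] lengths = {2, 3, 1};
    int[] offsets = new int[2];
    ByteBuffer batch = sliceBatch(in, lengths, 0, 2, offsets);
    System.out.println(batch.remaining() + " payload bytes, " + in.remaining() + " left");
  }
}
```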
   
   ---
   
   ## Common Themes
   
   - **Allocation reduction**: Replace per-value `ByteBuffer.slice()` / 
`ByteBuffer.wrap()` with bulk reads into reusable buffers
   - **Bulk vectorized reads**: Replace per-element virtual dispatch with 
single batch calls backed by `System.arraycopy`
   - **JIT-friendly patterns**: Per-class method overrides for monomorphic call 
sites; avoiding megamorphic profile pollution from shared helpers
   
   ## Benchmarking
   
   All benchmarks were run on an AMD EPYC 9V45 with OpenJDK 17 and 25, 
comparing upstream `master` against the patched version on the same machine 
with identical JVM flags.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

