[PR] [SPARK-56894][SQL] Add vectorized Parquet BYTE_STREAM_SPLIT reader [spark]

via GitHub Sat, 16 May 2026 15:32:33 -0700


iemejia opened a new pull request, #55921:
URL: https://github.com/apache/spark/pull/55921


   ### What changes were proposed in this pull request?
   
   This PR adds a vectorized reader for the Parquet `BYTE_STREAM_SPLIT` 
encoding (`VectorizedByteStreamSplitValuesReader`), wired into 
`VectorizedColumnReader.getValuesReader()`.
   
   **BYTE_STREAM_SPLIT** splits N fixed-width values (W bytes each) into W 
byte streams of N bytes each: stream k holds byte k of every value, and the 
streams are stored back to back. Decoding gathers the original bytes back: 
`value[i] = {stream[0][i], stream[1][i], ..., stream[W-1][i]}`. This encoding 
is particularly effective for time-series and scientific data where adjacent 
values share high-order bytes, because grouping those similar bytes into one 
stream helps the page compressor.
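
As a rough illustration of the layout described above, the following sketch scatters little-endian values into W streams and gathers them back. It is not the code from this PR; the class name `ByteStreamSplitSketch` and its `encode` / `decode` methods are hypothetical names chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical illustration of the BYTE_STREAM_SPLIT layout, not the PR's reader.
final class ByteStreamSplitSketch {

    // Scatter: byte k of value i is written to stream k at position i.
    // The W streams (n bytes each) are stored back to back.
    static byte[] encode(byte[] plain, int width) {
        int n = plain.length / width;
        byte[] out = new byte[plain.length];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < width; k++) {
                out[k * n + i] = plain[i * width + k];
            }
        }
        return out;
    }

    // Gather: value i is rebuilt from position i of each of the W streams.
    static byte[] decode(byte[] encoded, int width) {
        int n = encoded.length / width;
        byte[] out = new byte[encoded.length];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < width; k++) {
                out[i * width + k] = encoded[k * n + i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        float[] values = {1.0f, 1.5f, 2.0f};
        ByteBuffer buf = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : values) {
            buf.putFloat(v);
        }
        byte[] plain = buf.array();
        byte[] roundTripped = decode(encode(plain, 4), 4);
        System.out.println(Arrays.equals(plain, roundTripped));
    }
}
```

In the encoded form, the high-order bytes of all three floats sit next to each other in the last stream, which is where the similarity between adjacent values becomes visible to a compressor.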
   
   The new reader:
   - Loads the entire encoded page into a `byte[]` via `initFromPage`
   - Uses direct per-element `assembleInt` / `assembleLong` helpers for byte 
gathering
   - Implements all batch read methods (`readIntegers`, `readLongs`, 
`readFloats`, `readDoubles`, `readBinary`) and skip methods
   - Supports FLOAT (W=4), DOUBLE (W=8), INT32 (W=4), INT64 (W=8), and 
FIXED_LEN_BYTE_ARRAY (W=type length)
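
The reader design listed above might look roughly like the sketch below: the page is held as one `byte[]`, each value is assembled by picking one byte per stream, and skipping only advances a position. This is a simplified stand-in under stated assumptions, not the actual `VectorizedByteStreamSplitValuesReader`; the class `BssPageSketch`, its constructor, and the array-based method signatures are hypothetical (the real reader writes into Spark column vectors).

```java
// Hypothetical, simplified stand-in for a BYTE_STREAM_SPLIT page reader.
final class BssPageSketch {
    private final byte[] page; // whole encoded page: W streams of n bytes each
    private final int n;       // number of values in the page
    private int pos = 0;       // index of the next value to decode

    BssPageSketch(byte[] page, int width) {
        this.page = page;
        this.n = page.length / width;
    }

    // Gather one byte from each of the 4 streams, least significant byte first.
    private int assembleInt(int i) {
        return (page[i] & 0xFF)
            | ((page[n + i] & 0xFF) << 8)
            | ((page[2 * n + i] & 0xFF) << 16)
            | ((page[3 * n + i] & 0xFF) << 24);
    }

    // Same idea for 8 streams.
    private long assembleLong(int i) {
        long v = 0L;
        for (int k = 0; k < 8; k++) {
            v |= (page[k * n + i] & 0xFFL) << (8 * k);
        }
        return v;
    }

    void readIntegers(int total, int[] out, int offset) {
        for (int j = 0; j < total; j++) {
            out[offset + j] = assembleInt(pos + j);
        }
        pos += total;
    }

    void readLongs(int total, long[] out, int offset) {
        for (int j = 0; j < total; j++) {
            out[offset + j] = assembleLong(pos + j);
        }
        pos += total;
    }

    // Floats reuse the int path and reinterpret the bits.
    void readFloats(int total, float[] out, int offset) {
        for (int j = 0; j < total; j++) {
            out[offset + j] = Float.intBitsToFloat(assembleInt(pos + j));
        }
        pos += total;
    }

    // Skipping needs no decoding: every stream is indexed by value position.
    void skip(int total) {
        pos += total;
    }
}
```

Because each stream is addressed by value index, a skip is a constant-time position bump, which is one reason holding the full page in memory is convenient for this encoding.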
   
   The `VectorizedColumnReader` change is a single `case BYTE_STREAM_SPLIT ->` 
block (12 lines) that resolves the type width from the column descriptor and 
yields the new reader.
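
The width resolution could be sketched as follows, based only on the supported types listed in this description. The enum `PhysicalType` and the method `widthOf` are hypothetical stand-ins; the real change reads the type and type length from the Parquet column descriptor, and its exact code is not reproduced here.

```java
// Hypothetical sketch of mapping a physical type to its BYTE_STREAM_SPLIT width.
final class TypeWidthSketch {
    // Stand-in for the Parquet physical type; not the parquet-mr class.
    enum PhysicalType { INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY, BINARY }

    // typeLength is only meaningful for FIXED_LEN_BYTE_ARRAY.
    static int widthOf(PhysicalType type, int typeLength) {
        return switch (type) {
            case INT32, FLOAT -> 4;
            case INT64, DOUBLE -> 8;
            case FIXED_LEN_BYTE_ARRAY -> typeLength;
            default -> throw new UnsupportedOperationException(
                "BYTE_STREAM_SPLIT is not supported for " + type);
        };
    }
}
```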
   
   ### Why are the changes needed?
   
   Before this PR, Spark fell back to parquet-mr's per-value 
`ByteStreamSplitValuesReader` for BSS-encoded columns. On the benchmark below, 
the new vectorized batch reader is **2.8-4.5x faster** by throughput: roughly 
4.2-4.5x for the 32-bit types and about 2.8x for the 64-bit types:
   
   ```
   OpenJDK 64-Bit Server VM 17.0.19+10 on Linux 7.0.0-1004-azure
   AMD EPYC 9V45 96-Core Processor
   
   BYTE_STREAM_SPLIT INT32:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Spark vectorized readIntegers                         1              1           0       1103.4           0.9       1.0X
   parquet-mr readInteger (per-value)                    4              4           0        247.6           4.0       0.2X
   
   BYTE_STREAM_SPLIT INT64:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Spark vectorized readLongs                            2              3           0        428.1           2.3       1.0X
   parquet-mr readLong (per-value)                       7              7           0        151.4           6.6       0.4X
   
   BYTE_STREAM_SPLIT FLOAT:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Spark vectorized readFloats                           1              1           0       1053.1           0.9       1.0X
   parquet-mr readFloat (per-value)                      4              4           0        251.5           4.0       0.2X
   
   BYTE_STREAM_SPLIT DOUBLE:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   ------------------------------------------------------------------------------------------------------------------------
   Spark vectorized readDoubles                          2              3           0        426.9           2.3       1.0X
   parquet-mr readDouble (per-value)                     7              7           0        151.1           6.6       0.4X
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is an internal performance optimization. BSS-encoded Parquet 
columns that were already readable via the parquet-mr fallback are now decoded 
faster through the vectorized path. No API, configuration, or behavioral 
changes.
   
   ### How was this patch tested?
   
   - **31 unit tests** across 5 test suites in 
`ParquetByteStreamSplitEncodingSuite.scala`:
     - Abstract base `ParquetByteStreamSplitEncodingSuite[T]` with 7 shared 
test cases (roundtrip, nulls, skip, large batches, special values, sequential 
reads, mixed skip-read)
     - Concrete suites for Int, Long, Float, Double (Float/Double override 
`assertEqual` for bitwise NaN-safe comparison)
     - Standalone FLBA suite with 3 tests
   - **Benchmark** in `VectorizedByteStreamSplitReaderBenchmark.scala` 
comparing against parquet-mr per-value readers
   - All 260 existing + new Parquet tests pass on JDK 17
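
The bitwise NaN-safe comparison mentioned for the Float/Double suites matters because `==` treats NaN as unequal to itself and treats `0.0` and `-0.0` as equal, so a value comparison cannot confirm that a decoder reproduced the exact bits. A minimal sketch of such a check is shown below; the class `BitwiseEqualSketch` and its `sameBits` methods are hypothetical, and the suite's actual `assertEqual` override (written in Scala) may differ.

```java
// Hypothetical helper showing a bit-exact comparison for floating-point values.
final class BitwiseEqualSketch {
    // Compares raw IEEE 754 bit patterns, so NaN matches an identical NaN
    // and 0.0f is distinguished from -0.0f.
    static boolean sameBits(float a, float b) {
        return Float.floatToRawIntBits(a) == Float.floatToRawIntBits(b);
    }

    static boolean sameBits(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }
}
```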
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: OpenCode (Claude claude-opus-4.6)

