[PR] [SPARK-56801][SQL] Add bulk read+widen path for INT32 to Double Parquet vector updater [spark]

via GitHub Mon, 11 May 2026 02:36:31 -0700


LuciferYang opened a new pull request, #55795:
URL: https://github.com/apache/spark/pull/55795


   ### What changes were proposed in this pull request?
   
   Follow-up to SPARK-56791, applying the same bulk read+widen pattern to 
`IntegerToDoubleUpdater` (INT32 -> Double widening on the Parquet vectorized 
read path).
   
   A new `default void readIntegersAsDoubles(int total, WritableColumnVector c, 
int rowId)` is added to `VectorizedValuesReader`. The default falls back to the 
legacy per-row loop so non-PLAIN readers continue to work unchanged. 
`VectorizedPlainValuesReader` overrides it with a single `getBuffer(total * 4)` 
and a tight in-method `c.putDouble(rowId + i, buffer.getInt())` loop. 
`IntegerToDoubleUpdater.readValues` is now a one-line delegation. The 
int-to-double widening is Java's widening primitive conversion and is lossless 
because every INT32 value fits exactly in a double's 53-bit significand.
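   
   The default-plus-override shape described above can be sketched roughly as 
follows. This is a simplified model, not the code in this PR: a `double[]` 
stands in for `WritableColumnVector`, and `ValuesReader`, `PlainValuesReader`, 
`encode`, and the body of `getBuffer` are hypothetical stand-ins that assume a 
little-endian page of 4-byte integers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Simplified stand-ins, NOT Spark's real classes: a double[] plays the role of
// WritableColumnVector, and the page is a little-endian byte array of INT32s.
final class BulkWidenSketch {

  interface ValuesReader {
    int readInteger();

    // Fallback: one readInteger() call per row, so any reader works unchanged.
    default void readIntegersAsDoubles(int total, double[] c, int rowId) {
      for (int i = 0; i < total; i++) {
        c[rowId + i] = readInteger();
      }
    }
  }

  static final class PlainValuesReader implements ValuesReader {
    private final ByteBuffer in;

    PlainValuesReader(byte[] page) {
      this.in = ByteBuffer.wrap(page);
    }

    // Hypothetical model of getBuffer(length): a fresh little-endian view over
    // the next `length` bytes, after which the input position is advanced.
    private ByteBuffer getBuffer(int length) {
      ByteBuffer view = in.slice().order(ByteOrder.LITTLE_ENDIAN);
      view.limit(length);
      in.position(in.position() + length);
      return view;
    }

    @Override
    public int readInteger() {
      return getBuffer(4).getInt();
    }

    // Bulk override: one view for the whole batch, then a tight widening loop.
    @Override
    public void readIntegersAsDoubles(int total, double[] c, int rowId) {
      ByteBuffer buffer = getBuffer(total * 4);
      for (int i = 0; i < total; i++) {
        c[rowId + i] = buffer.getInt(); // implicit int -> double widening
      }
    }
  }

  // Helper for the sketch: encodes ints as a little-endian page.
  static byte[] encode(int... values) {
    ByteBuffer b = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
    for (int v : values) {
      b.putInt(v);
    }
    return b.array();
  }

  public static void main(String[] args) {
    int[] values = {Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE};
    double[] out = new double[values.length];
    new PlainValuesReader(encode(values)).readIntegersAsDoubles(values.length, out, 0);
    for (int i = 0; i < values.length; i++) {
      // Casting back recovers the original int, so nothing was lost.
      if ((int) out[i] != values[i]) {
        throw new AssertionError("lossy at index " + i);
      }
    }
    System.out.println("widened " + values.length + " values without loss");
  }
}
```

   A reader that only implements `readInteger()` still gets a working 
`readIntegersAsDoubles` through the default method, which is why the fallback 
keeps other encodings unchanged in this model.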
   
   This is one sub-task under the umbrella for the type-converting Parquet 
vector updaters. Sibling updaters (`FloatToDouble`, `DowncastLong`, 
`DateToTimestampNTZ`) and other reader implementations (e.g. 
`VectorizedDeltaBinaryPackedReader`) will follow as separate sub-tasks.
   
   ### Why are the changes needed?
   
   The legacy per-row path pays a `ByteBuffer` slice/orient allocation inside 
`getBuffer(4)` for every element, which dominates the cost of 
`IntegerToDoubleUpdater.readValues`. Collapsing those N allocations into one 
per batch yields a 3.32x speedup in the local JDK 17 run below, and a sizeable 
gain is expected on every supported JDK.
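   
   To make the "N allocations into one" point concrete, here is a small 
counting sketch. It is a hypothetical model rather than Spark code: 
`getBuffer` is assumed to create one new sliced, little-endian view per call, 
and the `views` counter exists only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical model, not Spark code: counts how many buffer views each
// strategy creates while decoding the same page of little-endian INT32s.
final class SliceCountSketch {
  static int views = 0;

  // Assumed behaviour: every call creates a new view and advances the input.
  static ByteBuffer getBuffer(ByteBuffer in, int length) {
    views++;
    ByteBuffer view = in.slice().order(ByteOrder.LITTLE_ENDIAN);
    view.limit(length);
    in.position(in.position() + length);
    return view;
  }

  // Legacy shape: one 4-byte view per element.
  static int perRow(byte[] page, double[] out) {
    views = 0;
    ByteBuffer in = ByteBuffer.wrap(page);
    for (int i = 0; i < out.length; i++) {
      out[i] = getBuffer(in, 4).getInt();
    }
    return views;
  }

  // Bulk shape: a single view sized for the whole batch.
  static int bulk(byte[] page, double[] out) {
    views = 0;
    ByteBuffer in = ByteBuffer.wrap(page);
    ByteBuffer buffer = getBuffer(in, out.length * 4);
    for (int i = 0; i < out.length; i++) {
      out[i] = buffer.getInt();
    }
    return views;
  }

  public static void main(String[] args) {
    int n = 4096;
    byte[] page = new byte[n * 4]; // all zeros is enough for counting
    System.out.println("per-row views: " + perRow(page, new double[n]));
    System.out.println("bulk views:    " + bulk(page, new double[n]));
  }
}
```

   In this model the per-row strategy creates one view per element while the 
bulk strategy creates exactly one, independent of batch size. The sketch says 
nothing about wall-clock cost, which is what the benchmark measures.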
   
   Local A/B on the existing `ParquetVectorUpdaterBenchmark` (Mac, OpenJDK 
17.0.18):
   
   | | Rate | Per Row |
   |---|---:|---:|
   | Baseline | 822.6 M/s | 1.2 ns |
   | After    | 2735.7 M/s | 0.4 ns |
   | Speedup  | **3.32x** | -67% |
   
   GHA benchmark results across JDK 17/21/25 will be regenerated via the `Run 
benchmarks` workflow on the fork. Based on the SPARK-56791 cross-JDK pattern, 
JDK 21 and 25 are expected to show a larger speedup.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Unit tests in `ParquetVectorUpdaterSuite` extended to cover 
`IntegerToDoubleUpdater`:
   - Boundary batch lengths (0, 1, 7, 8, 9, 17, 1024, 4097), reusing the 
existing `signedSampleValues` generator, which mixes signs, zero, and MIN/MAX 
boundaries.
   - The singular `readValue` path (the def-level-decoder's run-of-1 path is 
independent of `readValues`).
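   
   A rough, self-contained sketch of the boundary-length check listed above. 
It does not use the real suite: `sampleValues` is a hypothetical stand-in for 
`signedSampleValues` (whose actual contents may differ), and `bulkWiden` is an 
illustrative decoder over a little-endian buffer rather than the updater 
itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: mirrors the shape of the unit test, not its actual code.
final class BoundaryLengthSketch {

  // Hypothetical generator mixing sign, zero, and MIN/MAX boundaries.
  static int[] sampleValues(int n) {
    int[] seeds = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE, 42, -42};
    int[] out = new int[n];
    for (int i = 0; i < n; i++) {
      out[i] = seeds[i % seeds.length];
    }
    return out;
  }

  // Encodes the values as little-endian INT32s, then bulk-reads and widens them.
  static double[] bulkWiden(int[] values) {
    ByteBuffer page = ByteBuffer.allocate(values.length * 4).order(ByteOrder.LITTLE_ENDIAN);
    for (int v : values) {
      page.putInt(v);
    }
    page.flip();
    double[] out = new double[values.length];
    for (int i = 0; i < out.length; i++) {
      out[i] = page.getInt();
    }
    return out;
  }

  // True when every widened element equals its source int exactly.
  static boolean check(int n) {
    int[] values = sampleValues(n);
    double[] widened = bulkWiden(values);
    if (widened.length != n) {
      return false;
    }
    for (int i = 0; i < n; i++) {
      if ((long) widened[i] != (long) values[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    int[] lengths = {0, 1, 7, 8, 9, 17, 1024, 4097};
    for (int n : lengths) {
      if (!check(n)) {
        throw new AssertionError("mismatch at batch length " + n);
      }
    }
    System.out.println("all boundary lengths passed");
  }
}
```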
   
   A new end-to-end test in `ParquetIOSuite` writes an INT32 Parquet file and 
reads it back as `DoubleType`, exercising both REQUIRED columns (no def-levels) 
and OPTIONAL columns with interleaved nulls so that `readValue` and 
`readValues` are both invoked. A shared `widenSampleAt` helper is lifted to the 
class level and reused by the INT32 -> Long sibling test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

