iemejia opened a new pull request, #55920:
URL: https://github.com/apache/spark/pull/55920
### What changes were proposed in this pull request?
This PR adds two optimizations to the Parquet vectorized dictionary decode
path (`ParquetVectorUpdater.decodeDictionaryIds`):
1. **`hasNull()` fast path**: A new `static decodeBatch` helper on
`ParquetVectorUpdater` splits decoding into two loops — when `values.hasNull()`
is false, the per-element `isNullAt(i)` check is skipped entirely.
2. **Per-class `decodeDictionaryIds` overrides** in six hot-path updaters
(`IntegerUpdater`, `IntegerToLongUpdater`, `LongUpdater`, `FloatUpdater`,
`FloatToDoubleUpdater`, `DoubleUpdater`): each override is a one-line
delegation to `decodeBatch(... this)`. Although the logic is identical to the
default method, each per-class override gives the C2 JIT compiler its own
bytecode — and therefore a monomorphic call site for `decodeSingleDictionaryId`
— enabling full inlining of the type-specific decode expression. The default
interface method's bytecode is shared by all ~30 implementors, producing a
megamorphic profile that prevents inlining.
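The two-loop split in item 1 can be sketched as follows. This is a minimal illustration, not the PR's code: `DecodeBatchSketch`, `IdVector`, and the cut-down `Updater` interface are hypothetical stand-ins, and the null mask is folded into the id vector for brevity, whereas the PR checks `values.hasNull()` on the output vector.

```java
/** Simplified stand-ins for illustration; not Spark's actual classes or signatures. */
final class DecodeBatchSketch {

    /** Hypothetical vector of dictionary ids with a null mask and a null counter. */
    static final class IdVector {
        final int[] ids;
        private final boolean[] nulls;
        private int numNulls;

        IdVector(int[] ids) {
            this.ids = ids;
            this.nulls = new boolean[ids.length];
        }

        void putNull(int i) {
            if (!nulls[i]) {
                nulls[i] = true;
                numNulls++;
            }
        }

        // O(1): compares a counter, never scans the mask.
        boolean hasNull() { return numNulls > 0; }

        boolean isNullAt(int i) { return nulls[i]; }
    }

    /** Hypothetical, cut-down updater interface. */
    interface Updater {
        void decodeSingleDictionaryId(int i, long[] out, IdVector ids, long[] dictionary);

        default void decodeDictionaryIds(int total, int offset, long[] out,
                                         IdVector ids, long[] dictionary) {
            decodeBatch(total, offset, out, ids, dictionary, this);
        }

        static void decodeBatch(int total, int offset, long[] out,
                                IdVector ids, long[] dictionary, Updater updater) {
            int end = offset + total;
            if (!ids.hasNull()) {
                // Fast path: no per-element null check.
                for (int i = offset; i < end; i++) {
                    updater.decodeSingleDictionaryId(i, out, ids, dictionary);
                }
            } else {
                // Null-aware path: skip null slots, leaving out[i] untouched.
                for (int i = offset; i < end; i++) {
                    if (!ids.isNullAt(i)) {
                        updater.decodeSingleDictionaryId(i, out, ids, dictionary);
                    }
                }
            }
        }
    }

    /** Hypothetical updater that copies the dictionary value for each id. */
    static final class LongUpdater implements Updater {
        @Override
        public void decodeSingleDictionaryId(int i, long[] out, IdVector ids, long[] dictionary) {
            out[i] = dictionary[ids.ids[i]];
        }
    }

    public static void main(String[] args) {
        IdVector ids = new IdVector(new int[] {2, 0, 1});
        long[] out = new long[3];
        new LongUpdater().decodeDictionaryIds(3, 0, out, ids, new long[] {10L, 20L, 30L});
        System.out.println(java.util.Arrays.toString(out)); // prints [30, 10, 20]
    }
}
```

Both loops produce the same result for a vector without nulls; the split only removes the per-element branch from the common case.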
Class hierarchy of the change:
```
ParquetVectorUpdater (interface)
├── default decodeDictionaryIds(...) → delegates to static decodeBatch(... this)
├── static decodeBatch(...)          → hasNull() branch + two loops
│                                      calling updater.decodeSingleDictionaryId()
│
└── Concrete updaters in ParquetVectorUpdaterFactory:
    ├── IntegerUpdater       @Override decodeDictionaryIds → decodeBatch(... this)
    ├── IntegerToLongUpdater @Override decodeDictionaryIds → decodeBatch(... this)
    ├── LongUpdater          @Override decodeDictionaryIds → decodeBatch(... this)
    ├── FloatUpdater         @Override decodeDictionaryIds → decodeBatch(... this)
    ├── FloatToDoubleUpdater @Override decodeDictionaryIds → decodeBatch(... this)
    └── DoubleUpdater        @Override decodeDictionaryIds → decodeBatch(... this)
```
### Why are the changes needed?
The default `decodeDictionaryIds` method has two performance issues:
- **Unconditional `isNullAt` check**: Even when the column has no nulls
(common case), every element pays for an `isNullAt(i)` call.
`WritableColumnVector.hasNull()` is an O(1) flag check that allows skipping the
per-element null check entirely.
- **Megamorphic dispatch**: A Java interface default method compiles to a
single bytecode body shared by all implementors, so C2 profiles one
`decodeSingleDictionaryId` call site across ~30 updater types. That profile is
megamorphic, which prevents inlining. Per-class overrides give each class its
own bytecode and its own C2 profile, so the call site becomes monomorphic, can
be devirtualized, and the decode expression can be fully inlined.
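The difference between a shared default body and a per-class override can be made visible with reflection. The sketch below uses hypothetical, simplified types (`OverrideSketch`, a cut-down `Updater`, and a made-up `RareUpdater` that keeps the default); it only shows where each method body is declared and does not demonstrate or measure JIT behavior.

```java
import java.lang.reflect.Method;

/** Simplified stand-ins for illustration; not Spark's actual classes or signatures. */
final class OverrideSketch {

    interface Updater {
        long decodeSingleDictionaryId(int id, long[] dictionary);

        // One body, shared by every implementor that does not override it.
        default void decodeDictionaryIds(int[] ids, long[] dictionary, long[] out) {
            decodeBatch(ids, dictionary, out, this);
        }

        static void decodeBatch(int[] ids, long[] dictionary, long[] out, Updater updater) {
            for (int i = 0; i < ids.length; i++) {
                out[i] = updater.decodeSingleDictionaryId(ids[i], dictionary);
            }
        }
    }

    /** Hot-path updater: same logic as the default, but declared in this class. */
    static final class LongUpdater implements Updater {
        @Override
        public void decodeDictionaryIds(int[] ids, long[] dictionary, long[] out) {
            Updater.decodeBatch(ids, dictionary, out, this);
        }

        @Override
        public long decodeSingleDictionaryId(int id, long[] dictionary) {
            return dictionary[id];
        }
    }

    /** Hypothetical cold-path updater that keeps the shared default. */
    static final class RareUpdater implements Updater {
        @Override
        public long decodeSingleDictionaryId(int id, long[] dictionary) {
            return -dictionary[id];
        }
    }

    /** Returns the simple name of the type that declares decodeDictionaryIds for the given class. */
    static String declaringType(Class<? extends Updater> type) {
        try {
            Method m = type.getMethod("decodeDictionaryIds", int[].class, long[].class, long[].class);
            return m.getDeclaringClass().getSimpleName();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(declaringType(LongUpdater.class)); // prints LongUpdater
        System.out.println(declaringType(RareUpdater.class)); // prints Updater
    }
}
```

Note that a static interface method is not inherited by implementing classes, so the override has to call it as `Updater.decodeBatch(...)`.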
Benchmark results on AMD EPYC 9V45 (1M rows, dictionary size 100; rate in M/s,
higher is better):
| Scenario | Upstream | Optimized | Speedup |
|---|---|---|---|
| No nulls (avg across 6 updaters) | ~332 M/s | ~412 M/s | **1.24x** |
| 10% nulls | ~284 M/s | ~277 M/s | ~1.0x (neutral) |
| 50% nulls | ~180 M/s | ~181 M/s | ~1.0x (neutral) |
The no-nulls case is the common production path and shows a clear
improvement. With nulls present the `isNullAt` check dominates regardless, so
performance is neutral.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Existing tests: `ParquetVectorUpdaterSuite` (30 tests),
`ParquetQuerySuite`, `ParquetIOSuite`, `ParquetEncodingSuite`,
`VectorizedRleValuesReaderSuite`, `ParquetSchemaSuite` (243 tests total) — all
pass.
- New benchmark: `ParquetDictionaryDecodeBenchmark`, with a global pre-warm
that interleaves both `hasNull()` branches (no-null and 50%-null) across all 6
updater types before measurement, to avoid C2 uncommon-trap bias. It has three
benchmark groups: no nulls, 10% nulls, 50% nulls.
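The interleaved pre-warm described above might look roughly like this. It is a sketch under assumptions: `PreWarmSketch`, `preWarm`, and the `decodeOnce` callback are hypothetical names, and the actual benchmark's structure is not shown in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical warm-up driver; not the code in ParquetDictionaryDecodeBenchmark. */
final class PreWarmSketch {

    /**
     * Runs every updater through both hasNull() branches in each round, so the
     * JIT sees both branches for all updater types before any timing starts.
     *
     * @param updaters   names of the updater types to warm up
     * @param rounds     number of warm-up rounds
     * @param decodeOnce callback (updater name, null fraction) that decodes one batch
     */
    static void preWarm(List<String> updaters, int rounds, BiConsumer<String, Double> decodeOnce) {
        for (int r = 0; r < rounds; r++) {
            for (String updater : updaters) {
                decodeOnce.accept(updater, 0.0); // no-null branch
                decodeOnce.accept(updater, 0.5); // null-checking branch
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        preWarm(Arrays.asList("IntegerUpdater", "LongUpdater"), 1,
                (updater, nullFraction) -> log.add(updater + "@" + nullFraction));
        System.out.println(log);
    }
}
```

Warming up only one branch risks the compiler treating the other as never taken, which would skew whichever benchmark group runs first.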
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenCode (Claude claude-opus-4.6)