yaooqinn opened a new pull request, #12092:
URL: https://github.com/apache/gluten/pull/12092
### What changes were proposed in this pull request?
This PR fills the long-standing TODO in
`ColumnarCachedBatchSerializer.buildFilter`
by adding native-side min/max partition statistics to the columnar
`InMemoryRelation`
cache, enabling partition-level pruning when filtering over a cached
`DataFrame`.
The Velox backend now computes per-batch min/max/null-count stats during
`serializeWithStats` (cpp side, Arrow C-Data export-free) and embeds them
in the cached batch envelope. On the read path, the JVM serializer extends
`SimpleMetricsCachedBatchSerializer` and routes batches with stats through
the inherited `buildFilter` for partition pruning; batches without stats
(legacy v1 binary, or the SQLConf gate disabled) lazy-pass-through unchanged.
**Type matrix covered (vanilla Spark `ColumnStats.scala` parity):**
- Integer family: TINYINT / SMALLINT / INT / BIGINT
- Date / Timestamp / TimestampNTZ
- YearMonth / DayTime intervals
- Decimal (short P<19 / long P>=19 / HUGEINT)
- String (JVM marshal; cpp VARCHAR scan deferred — see notes)
- Boolean
- Float / Double (NaN-poison guarded)
VARCHAR cpp-side scan is intentionally deferred (`supported=0` from cpp,
no JVM crash path) due to Velox StringView lifetime considerations across
RowVector boundaries — tracked for a follow-up PR.
### Why are the changes needed?
Partition stats let `InMemoryRelation` skip whole cached partitions that
cannot satisfy a filter predicate, materially speeding up repeated point
lookups and selective filters over a cached DataFrame — a common pattern
in iterative SQL workloads (notebooks, ML feature engineering, BI tools
with cached intermediate results). Vanilla Spark has had this since
SPARK-32274 (3.1, ~5 years); Gluten's columnar cache previously
extended `CachedBatchSerializer` directly, leaving the `buildFilter` TODO
empty and skipping pruning entirely.
### Does this PR introduce _any_ user-facing change?
Yes — a new SQLConf:
| Key | Default | Meaning |
|---|---|---|
| `spark.gluten.sql.columnar.tableCache.partitionStats.enabled` | `false` |
Enable native partition stats computation + read-side pruning |
Default-off ship for the first release; flip to default-on after community
feedback on the included benchmark + a release of soak time. Gate is
double-checked on both the write path (skip stats compute) and read path
(via the inherited `buildFilter`).
### How was this patch tested?
**Unit / integration tests:**
- 19 cpp gtest cases (Velox `velox_operators_test`) — per-type min/max scan
+ framing + carry-overflow + NaN poison guard + truncate semantics
- 8 JVM Scala suites, 37 cases — Kryo wire format / stats blob marshal /
buildFilter wrapper / per-arm prune semantics (BIGINT / Date / Timestamp /
String / NaN sentinel / numCols cap)
- 1 Velox e2e suite (`ColumnarCachedBatchE2ESuite`) — cache + equality
filter end-to-end, plan shape verification, `numOutputRows` prune
evidence
**Benchmark**
(`backends-velox/benchmarks/ColumnarTableCachePartitionStatsBenchmark-results.txt`):
100M-row range partitioned by `c2` into 32 in-memory partitions,
`groupBy + sum/count/avg` follow-up to make pruned-partition savings
visible. Driver: `local[1]`, AMD EPYC 7763.
| Case | partitionStats off | partitionStats on | Speedup |
|---|---|---|---|
| cache build (write path) | 126,425 ms | 131,431 ms | **1.0x** (no
measurable overhead) |
| filter+agg, high selectivity (`c2 < 1000`, ~0.001%) | 4,431 ms | 1,744 ms
| **2.5x** |
| filter+agg, low selectivity (`c2 < 50000000`, ~50%) | 5,332 ms | 3,392 ms
| **1.6x** |
| filter+agg, point lookup (`c2 = 50000000`, 1 row) | 4,343 ms | 1,686 ms |
**2.6x** |
Write path shows no measurable overhead — stats computation runs in cpp
inside the existing batch scan loop.
### Notes on commit history
This PR retains the atomic per-slice commit history (~58 commits) rather
than squashing, to make per-feature `git bisect` and incremental review
practical. Each commit is self-contained: a feature slice, its test, and
any fixes amended into the slice.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]