[PR] [VL] Add min/max partition stats to columnar InMemoryRelation cache for partition pruning [gluten]

via GitHub Wed, 13 May 2026 23:21:22 -0700


yaooqinn opened a new pull request, #12092:
URL: https://github.com/apache/gluten/pull/12092


   ### What changes were proposed in this pull request?
   
   This PR fills the long-standing TODO in 
`ColumnarCachedBatchSerializer.buildFilter`
   by adding native-side min/max partition statistics to the columnar 
`InMemoryRelation`
   cache, enabling partition-level pruning when filtering over a cached 
`DataFrame`.
   
   The Velox backend now computes per-batch min/max/null-count stats during
   `serializeWithStats` (cpp side, Arrow C-Data export-free) and embeds them
   in the cached batch envelope. On the read path, the JVM serializer extends
   `SimpleMetricsCachedBatchSerializer` and routes batches with stats through
   the inherited `buildFilter` for partition pruning; batches without stats
   (legacy v1 binary, or the SQLConf gate disabled) lazy-pass-through unchanged.
   
   **Type matrix covered (vanilla Spark `ColumnStats.scala` parity):**
   - Integer family: TINYINT / SMALLINT / INT / BIGINT
   - Date / Timestamp / TimestampNTZ
   - YearMonth / DayTime intervals
   - Decimal (short P<19 / long P>=19 / HUGEINT)
   - String (JVM marshal; cpp VARCHAR scan deferred — see notes)
   - Boolean
   - Float / Double (NaN-poison guarded)
   
   VARCHAR cpp-side scan is intentionally deferred (`supported=0` from cpp,
   no JVM crash path) due to Velox StringView lifetime considerations across
   RowVector boundaries — tracked for a follow-up PR.
   
   ### Why are the changes needed?
   
   Partition stats let `InMemoryRelation` skip whole cached partitions that
   cannot satisfy a filter predicate, materially speeding up repeated point
   lookups and selective filters over a cached DataFrame — a common pattern
   in iterative SQL workloads (notebooks, ML feature engineering, BI tools
   with cached intermediate results). Vanilla Spark has had this since
   SPARK-32274 (3.1, ~5 years); Gluten's columnar cache previously
   extended `CachedBatchSerializer` directly, leaving the `buildFilter` TODO
   empty and skipping pruning entirely.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes — a new SQLConf:
   
   | Key | Default | Meaning |
   |---|---|---|
   | `spark.gluten.sql.columnar.tableCache.partitionStats.enabled` | `false` | 
Enable native partition stats computation + read-side pruning |
   
   Default-off ship for the first release; flip to default-on after community
   feedback on the included benchmark + a release of soak time. Gate is
   double-checked on both the write path (skip stats compute) and read path
   (via the inherited `buildFilter`).
   
   ### How was this patch tested?
   
   **Unit / integration tests:**
   - 19 cpp gtest cases (Velox `velox_operators_test`) — per-type min/max scan
     + framing + carry-overflow + NaN poison guard + truncate semantics
   - 8 JVM Scala suites, 37 cases — Kryo wire format / stats blob marshal /
     buildFilter wrapper / per-arm prune semantics (BIGINT / Date / Timestamp /
     String / NaN sentinel / numCols cap)
   - 1 Velox e2e suite (`ColumnarCachedBatchE2ESuite`) — cache + equality
     filter end-to-end, plan shape verification, `numOutputRows` prune
     evidence
   
   **Benchmark** 
(`backends-velox/benchmarks/ColumnarTableCachePartitionStatsBenchmark-results.txt`):
   
   100M-row range partitioned by `c2` into 32 in-memory partitions,
   `groupBy + sum/count/avg` follow-up to make pruned-partition savings
   visible. Driver: `local[1]`, AMD EPYC 7763.
   
   | Case | partitionStats off | partitionStats on | Speedup |
   |---|---|---|---|
   | cache build (write path) | 126,425 ms | 131,431 ms | **1.0x** (no 
measurable overhead) |
   | filter+agg, high selectivity (`c2 < 1000`, ~0.001%) | 4,431 ms | 1,744 ms 
| **2.5x** |
   | filter+agg, low selectivity (`c2 < 50000000`, ~50%) | 5,332 ms | 3,392 ms 
| **1.6x** |
   | filter+agg, point lookup (`c2 = 50000000`, 1 row) | 4,343 ms | 1,686 ms | 
**2.6x** |
   
   Write path shows no measurable overhead — stats computation runs in cpp
   inside the existing batch scan loop.
   
   ### Notes on commit history
   
   This PR retains the atomic per-slice commit history (~58 commits) rather
   than squashing, to make per-feature `git bisect` and incremental review
   practical. Each commit is self-contained: a feature slice, its test, and
   any fixes amended into the slice.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [VL] Add min/max partition stats to columnar InMemoryRelation cache for partition pruning [gluten]

Reply via email to