jackylee-ch opened a new pull request, #12073:
URL: https://github.com/apache/gluten/pull/12073
## What changes are proposed in this pull request?
This PR adds batch-level statistics collection and filter pushdown for Velox columnar cached batches, enabling Spark to skip cached batches that cannot satisfy query predicates (similar to Parquet row-group pruning, but for the in-memory cache).
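To illustrate the pruning idea, here is a minimal C++ sketch of how min/max bounds can prove that a batch holds no matching row. The `IntBounds` struct and the `mightMatch*` helpers are hypothetical names for illustration only; in the PR the actual decision is made by Spark's SimpleMetricsCachedBatch filter evaluation, not by code like this.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-batch bounds for one integer column (illustrative only).
struct IntBounds {
  int64_t lower;  // smallest non-null value in the batch
  int64_t upper;  // largest non-null value in the batch
};

// Each helper returns true when the batch MIGHT contain a matching row.
// A false result means the bounds prove no row can match, so the batch
// can be skipped without being deserialized.

// Predicate: col > v
inline bool mightMatchGreaterThan(const IntBounds& b, int64_t v) {
  assert(b.lower <= b.upper);
  return b.upper > v;
}

// Predicate: col = v
inline bool mightMatchEquals(const IntBounds& b, int64_t v) {
  assert(b.lower <= b.upper);
  return b.lower <= v && v <= b.upper;
}

// Predicate: col BETWEEN lo AND hi
inline bool mightMatchBetween(const IntBounds& b, int64_t lo, int64_t hi) {
  assert(b.lower <= b.upper);
  // The ranges [lower, upper] and [lo, hi] must overlap.
  return b.lower <= hi && lo <= b.upper;
}
```

For a batch with bounds [10, 20], the predicate `col > 20` cannot match and the batch is skipped, while `col BETWEEN 20 AND 30` still has to read it.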
Architecture:
- C++ (BatchStatsCollector): During columnar batch serialization, computes per-column min/max bounds, null counts, row counts, and byte sizes. Stats are appended to the serialized payload in a compact binary wire format.
- Scala (ColumnarCachedBatchSerializer): Decodes the wire-format stats and integrates with Spark's SimpleMetricsCachedBatch filter evaluation to skip batches whose bounds prove the predicate cannot match.
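The collection step on the C++ side can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's BatchStatsCollector: it uses `std::optional` (C++17) in place of Velox vectors, handles a single 64-bit integer column, and counts only non-null value bytes toward the size. The `ColumnStats` struct and `collectStats` function are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stats record; fields mirror the quantities listed above.
struct ColumnStats {
  bool hasBounds = false;  // false when the column has no non-null value
  int64_t lower = 0;
  int64_t upper = 0;
  int32_t nullCount = 0;
  int32_t rowCount = 0;
  int64_t sizeInBytes = 0;  // assumption: payload bytes of non-null values only
};

// Single pass over one column; std::nullopt stands in for a SQL NULL.
inline ColumnStats collectStats(
    const std::vector<std::optional<int64_t>>& column) {
  ColumnStats s;
  for (const auto& v : column) {
    ++s.rowCount;
    if (!v.has_value()) {
      ++s.nullCount;
      continue;
    }
    if (!s.hasBounds) {
      // First non-null value initializes both bounds.
      s.lower = *v;
      s.upper = *v;
      s.hasBounds = true;
    } else {
      if (*v < s.lower) s.lower = *v;
      if (*v > s.upper) s.upper = *v;
    }
    s.sizeInBytes += static_cast<int64_t>(sizeof(int64_t));
  }
  assert(s.nullCount <= s.rowCount);
  return s;
}
```

An all-null column leaves `hasBounds` false, which is the "absent bounds" case the design decisions below have to handle.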
Supported bound types: Boolean, Byte, Short, Int, Long, Float, Double, Date, Timestamp, String, and Decimal(precision<=18).
Key design decisions:
- Wire format v1: per-column tag(1B) + hasBounds(1B) + bounds(variable) + nullCount(4B) + rowCount(4B) + sizeInBytes(8B)
- Tautological bounds (type extremes) are used for unknown/absent bounds to avoid the three-valued-logic (3VL) null-skip correctness bug, where null bounds cause Spark's predicate to evaluate to null → coerced to false → batch incorrectly skipped
- Float/Double columns containing NaN degrade to pass-through (no finite tautological pair exists due to NaN ordering)
- String bounds capped at 64 KiB to prevent metadata bloat
- Tag/dataType compatibility validation to reject corrupt payloads gracefully
- Backward compatible: unknown tags fall through to pass-through filtering
- Controlled by config spark.gluten.sql.columnar.tableCacheFilterEnabled (default: true)
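The v1 record layout and the tautological-bounds fallback can be sketched for a single Int column as below. Several details are assumptions not stated in this description: the tag value `kTagInt`, little-endian byte order, 4-byte lower and upper bounds for Int, and omitting the bounds bytes when hasBounds is 0. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

constexpr uint8_t kTagInt = 3;  // hypothetical tag value, not from the PR

struct IntColumnStats {
  bool hasBounds;
  int32_t lower;
  int32_t upper;
  int32_t nullCount;
  int32_t rowCount;
  int64_t sizeInBytes;
};

// Append the low n bytes of v, least significant first (assumed byte order).
inline void putLE(std::vector<uint8_t>& out, uint64_t v, int n) {
  for (int i = 0; i < n; ++i) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
}

// Read n bytes starting at pos, least significant first, and advance pos.
inline uint64_t getLE(const std::vector<uint8_t>& in, size_t& pos, int n) {
  uint64_t v = 0;
  for (int i = 0; i < n; ++i) v |= static_cast<uint64_t>(in[pos++]) << (8 * i);
  return v;
}

// tag(1B) + hasBounds(1B) + bounds(0 or 8B) + nullCount(4B) + rowCount(4B)
// + sizeInBytes(8B)
inline std::vector<uint8_t> encode(const IntColumnStats& s) {
  std::vector<uint8_t> out;
  out.push_back(kTagInt);
  out.push_back(s.hasBounds ? 1 : 0);
  if (s.hasBounds) {
    putLE(out, static_cast<uint32_t>(s.lower), 4);
    putLE(out, static_cast<uint32_t>(s.upper), 4);
  }
  putLE(out, static_cast<uint32_t>(s.nullCount), 4);
  putLE(out, static_cast<uint32_t>(s.rowCount), 4);
  putLE(out, static_cast<uint64_t>(s.sizeInBytes), 8);
  return out;
}

// Returns std::nullopt for an unknown tag or a truncated record.
inline std::optional<IntColumnStats> decode(const std::vector<uint8_t>& in) {
  if (in.size() < 2 || in[0] != kTagInt) return std::nullopt;
  const bool hasBounds = in[1] != 0;
  const size_t expected = 2 + (hasBounds ? 8 : 0) + 16;
  if (in.size() < expected) return std::nullopt;
  size_t pos = 2;
  IntColumnStats s{};
  s.hasBounds = hasBounds;
  // Tautological fallback: the type extremes make "lower <= v <= upper"
  // true for every v, so a missing bound never evaluates to null.
  s.lower = std::numeric_limits<int32_t>::min();
  s.upper = std::numeric_limits<int32_t>::max();
  if (hasBounds) {
    s.lower = static_cast<int32_t>(static_cast<uint32_t>(getLE(in, pos, 4)));
    s.upper = static_cast<int32_t>(static_cast<uint32_t>(getLE(in, pos, 4)));
  }
  s.nullCount = static_cast<int32_t>(static_cast<uint32_t>(getLE(in, pos, 4)));
  s.rowCount = static_cast<int32_t>(static_cast<uint32_t>(getLE(in, pos, 4)));
  s.sizeInBytes = static_cast<int64_t>(getLE(in, pos, 8));
  assert(pos == expected);
  return s;
}
```

Under these assumptions a record with bounds occupies 26 bytes and one without occupies 18. Substituting the type extremes when bounds are absent keeps the range check trivially true, so the batch is read rather than skipped.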
## How was this patch tested?
1. Unit tests (ColumnarCachedBatchSerializerSuite): 38 tests covering wire format round-trip for all supported types, NaN poisoning, inverted bounds rejection, tag/schema mismatch detection, tautological bounds fallback, truncated payload handling, negative counter rejection, and Decimal bounds (with/without bounds, precision>18 rejection).
2. E2E tests (VeloxColumnarCacheSuite): Integration tests that cache real data, run filter queries, and validate correctness via checkAnswer for Int, String, Timestamp, and Decimal predicates including >, BETWEEN, =, and IS NULL.
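Three of the unit-test scenarios above (NaN poisoning, inverted bounds rejection, negative counter rejection) can be pictured with a small validation sketch for a Double column. The `StatsVerdict` enum, the `DoubleColumnStats` struct, and both functions are hypothetical. The real checks live in the Scala ColumnarCachedBatchSerializer and may differ, in particular in how a rejected payload is treated; here any non-usable verdict simply means the batch is not skipped.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

enum class StatsVerdict {
  Usable,       // bounds can be used for pruning
  PassThrough,  // stats are well-formed but bounds must be ignored
  Reject        // stats are corrupt
};

// Hypothetical decoded stats for one Double column.
struct DoubleColumnStats {
  double lower;
  double upper;
  int32_t nullCount;
  int32_t rowCount;
};

inline StatsVerdict validate(const DoubleColumnStats& s) {
  // Negative counters cannot come from a well-formed payload.
  if (s.nullCount < 0 || s.rowCount < 0) return StatsVerdict::Reject;
  // NaN poisons the range: comparisons against NaN are always false,
  // so the bounds cannot prove anything about the batch.
  if (std::isnan(s.lower) || std::isnan(s.upper)) return StatsVerdict::PassThrough;
  // Inverted bounds describe an empty range and indicate corruption.
  if (s.lower > s.upper) return StatsVerdict::Reject;
  return StatsVerdict::Usable;
}

// Predicate: col > v. Returns true only when the batch can safely be skipped.
inline bool maySkipGreaterThan(const DoubleColumnStats& s, double v) {
  if (validate(s) != StatsVerdict::Usable) return false;  // never skip on doubt
  assert(!std::isnan(s.lower) && !std::isnan(s.upper));
  return s.upper <= v;
}
```

The design point is that every doubtful case falls on the side of reading the batch: skipping is an optimization, so a wrong "keep" only costs time while a wrong "skip" loses rows.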
## Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]