jackylee-ch opened a new pull request, #12073:
URL: https://github.com/apache/gluten/pull/12073

   ## What changes are proposed in this pull request?
   
     This PR adds batch-level statistics collection and filter pushdown for Velox columnar cached batches, enabling Spark to skip cached batches that cannot satisfy query predicates (similar to Parquet row-group pruning, but for the in-memory cache).
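As a rough illustration of the pruning idea described above (not the code in this PR), the sketch below decides whether a cached batch can be skipped from its min/max bounds alone. The `BatchBounds` struct and the helper names are hypothetical and cover only a single integer column.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-batch summary for one integer column.
struct BatchBounds {
  int64_t min;
  int64_t max;
};

// Could any row in the batch satisfy `col > literal`?
bool mightMatchGreaterThan(const BatchBounds& b, int64_t literal) {
  return b.max > literal;
}

// Could any row in the batch satisfy `lo <= col <= hi`?
bool mightMatchBetween(const BatchBounds& b, int64_t lo, int64_t hi) {
  return b.max >= lo && b.min <= hi;
}

// Indices of batches that still have to be scanned for `col > literal`.
std::vector<std::size_t> batchesToScan(const std::vector<BatchBounds>& batches,
                                       int64_t literal) {
  std::vector<std::size_t> keep;
  for (std::size_t i = 0; i < batches.size(); ++i) {
    if (mightMatchGreaterThan(batches[i], literal)) keep.push_back(i);
  }
  return keep;
}
```

A batch is skipped only when its bounds prove that no row can match; a batch that is kept may still contain zero matching rows, so the filter itself is re-applied to every batch that is scanned.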
   
     Architecture:
   
     - C++ (BatchStatsCollector): During columnar batch serialization, computes per-column min/max bounds, null counts, row counts, and byte sizes. Stats are appended to the serialized payload in a compact binary wire format.
     - Scala (ColumnarCachedBatchSerializer): Decodes the wire-format stats and integrates with Spark's SimpleMetricsCachedBatch filter evaluation to skip batches whose bounds prove the predicate cannot match.
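The collection step on the C++ side can be pictured with the minimal sketch below. It is an assumption-laden illustration, not the actual BatchStatsCollector: the `ColumnStats` struct, the `collectStats` function, and the values-plus-validity-vector column layout are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stats record for one 64-bit integer column.
struct ColumnStats {
  bool hasBounds = false;  // false when every row is null (or the batch is empty)
  int64_t min = 0;
  int64_t max = 0;
  int32_t nullCount = 0;
  int32_t rowCount = 0;
};

// Single pass over the column: `valid[i] == false` marks row i as null.
ColumnStats collectStats(const std::vector<int64_t>& values,
                         const std::vector<bool>& valid) {
  assert(values.size() == valid.size());
  ColumnStats s;
  for (std::size_t i = 0; i < values.size(); ++i) {
    ++s.rowCount;
    if (!valid[i]) {
      ++s.nullCount;  // nulls never contribute to min/max
      continue;
    }
    if (!s.hasBounds) {
      s.min = s.max = values[i];
      s.hasBounds = true;
    } else {
      if (values[i] < s.min) s.min = values[i];
      if (values[i] > s.max) s.max = values[i];
    }
  }
  return s;
}
```

Tracking `hasBounds` separately from `min`/`max` lets an all-null column be reported as "no bounds" rather than as a misleading zero range.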
   
     Supported bound types: Boolean, Byte, Short, Int, Long, Float, Double, Date, Timestamp, String, and Decimal(precision<=18).
   
     Key design decisions:
     - Wire format v1: per-column tag(1B) + hasBounds(1B) + bounds(variable) + nullCount(4B) + rowCount(4B) + sizeInBytes(8B)
     - Tautological bounds (type extremes) for unknown/absent bounds, to avoid the three-valued-logic (3VL) null-skip correctness bug (where null bounds cause Spark's predicate to evaluate to null → coerced to false → batch incorrectly skipped)
     - Float/Double with NaN degrade to pass-through (no finite tautological pair exists due to NaN ordering)
     - String bounds capped at 64 KiB to prevent metadata bloat
     - Tag/dataType compatibility validation to reject corrupt payloads gracefully
     - Backward compatible: unknown tags fall through to pass-through filtering
     - Controlled by config spark.gluten.sql.columnar.tableCacheFilterEnabled (default: true)
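To make the wire format v1 layout listed above more concrete, here is a hedged sketch of encoding and decoding the stats of a single Int column. Several details are assumptions that the description does not state: the tag value `kTagInt`, the use of host byte order, and bounds stored as lower then upper 4-byte integers. All names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

constexpr uint8_t kTagInt = 3;  // hypothetical tag value for Int columns

struct IntColumnStats {
  bool hasBounds;
  int32_t lower, upper;
  int32_t nullCount, rowCount;
  int64_t sizeInBytes;
};

// Appends the raw bytes of `v` (host byte order, an assumption of this sketch).
template <typename T>
void putRaw(std::vector<uint8_t>& out, T v) {
  uint8_t buf[sizeof(T)];
  std::memcpy(buf, &v, sizeof(T));
  out.insert(out.end(), buf, buf + sizeof(T));
}

// Reads a T at `pos`; returns false instead of reading past the end.
template <typename T>
bool readRaw(const std::vector<uint8_t>& in, std::size_t& pos, T& v) {
  if (in.size() - pos < sizeof(T)) return false;
  std::memcpy(&v, in.data() + pos, sizeof(T));
  pos += sizeof(T);
  return true;
}

std::vector<uint8_t> encodeIntStats(const IntColumnStats& s) {
  std::vector<uint8_t> out;
  putRaw<uint8_t>(out, kTagInt);              // tag(1B)
  putRaw<uint8_t>(out, s.hasBounds ? 1 : 0);  // hasBounds(1B)
  if (s.hasBounds) {                          // bounds(variable)
    putRaw(out, s.lower);
    putRaw(out, s.upper);
  }
  putRaw(out, s.nullCount);                   // nullCount(4B)
  putRaw(out, s.rowCount);                    // rowCount(4B)
  putRaw(out, s.sizeInBytes);                 // sizeInBytes(8B)
  return out;
}

// Returns false for unknown tags, truncated input, inverted bounds, or
// negative counters; a caller would then fall back to not skipping the batch.
bool decodeIntStats(const std::vector<uint8_t>& in, IntColumnStats& s) {
  std::size_t pos = 0;
  uint8_t tag = 0, has = 0;
  if (!readRaw(in, pos, tag) || tag != kTagInt) return false;
  if (!readRaw(in, pos, has) || has > 1) return false;
  s.hasBounds = (has == 1);
  if (s.hasBounds) {
    if (!readRaw(in, pos, s.lower) || !readRaw(in, pos, s.upper)) return false;
    if (s.lower > s.upper) return false;  // inverted bounds
  } else {
    // Absent bounds become type extremes, so any range check passes.
    s.lower = std::numeric_limits<int32_t>::min();
    s.upper = std::numeric_limits<int32_t>::max();
  }
  if (!readRaw(in, pos, s.nullCount) || !readRaw(in, pos, s.rowCount) ||
      !readRaw(in, pos, s.sizeInBytes)) return false;
  return s.nullCount >= 0 && s.rowCount >= 0 && s.sizeInBytes >= 0;
}
```

Under these assumptions an Int column with bounds occupies 26 bytes (1 + 1 + 8 + 4 + 4 + 8) and one without bounds occupies 18. A production format would also pin the byte order explicitly rather than rely on the host.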
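The tautological-bounds and NaN decisions above can be illustrated with a small model of three-valued logic. This is a sketch of the reasoning only, not Spark's or Gluten's evaluation code; the `Tri` enum and all function names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Three-valued truth value, as in SQL comparisons involving NULL.
enum class Tri { False, True, Unknown };

// Models the pruning check `upper > literal` when the upper bound may be NULL.
Tri upperGreaterThan(bool hasUpper, int upper, int literal) {
  if (!hasUpper) return Tri::Unknown;  // comparing against NULL yields NULL
  return upper > literal ? Tri::True : Tri::False;
}

// A filter keeps a batch only on a definite True; Unknown is coerced to
// false, which is how a NULL bound leads to an incorrectly skipped batch.
bool keepBatch(Tri t) { return t == Tri::True; }

// Type extreme used in place of a missing upper bound.
int tautologicalUpper() { return std::numeric_limits<int>::max(); }

struct DoubleBounds {
  bool usable;  // false means pass-through: never skip on this column
  double lower;
  double upper;
};

// Any NaN in the column makes the bounds unusable instead of guessing a pair.
DoubleBounds boundsFor(const std::vector<double>& values) {
  DoubleBounds b{false, 0.0, 0.0};
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (std::isnan(values[i])) return DoubleBounds{false, 0.0, 0.0};
    if (!b.usable) {
      b = DoubleBounds{true, values[i], values[i]};
    } else {
      if (values[i] < b.lower) b.lower = values[i];
      if (values[i] > b.upper) b.upper = values[i];
    }
  }
  return b;
}

bool mightMatchGreaterThan(const DoubleBounds& b, double literal) {
  if (!b.usable) return true;  // pass-through
  return b.upper > literal;
}
```

With a NULL upper bound, `keepBatch` returns false even though the batch may contain matching rows; substituting the type extreme gives a bound that passes for any literal below the maximum. For Float/Double the sketch simply marks the bounds unusable when a NaN is present, mirroring the pass-through behaviour described above.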
   
   ## How was this patch tested?
   
     1. Unit tests (ColumnarCachedBatchSerializerSuite): 38 tests covering wire format round-trip for all supported types, NaN poisoning, inverted bounds rejection, tag/schema mismatch detection, tautological bounds fallback, truncated payload handling, negative counter rejection, and Decimal bounds (with/without bounds, precision>18 rejection).
     2. E2E tests (VeloxColumnarCacheSuite): Integration tests that cache real data, run filter queries, and validate correctness via checkAnswer for Int, String, Timestamp, and Decimal predicates including >, BETWEEN, =, and IS NULL.
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
     Generated-by: Claude Code (Claude Opus 4.7)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

