yaooqinn opened a new pull request, #12112:
URL: https://github.com/apache/gluten/pull/12112

   ### What changes were proposed in this pull request?
   
   Skip min/max stats for non-binary-collation `StringType` columns in the 
Velox cache path, and write a permissive sentinel bound on the deserialize side 
as a fallback for any column whose `supported` flag is 0.
   
   Add a new shim API, `SparkShims.isBinaryCollationString`. It defaults to
   `true` on Spark 3.x shims (which have no collation concept) and is overridden
   on Spark 4.0 / 4.1 to check `collationId == UTF8_BINARY_COLLATION_ID`.
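   
   A minimal sketch of the shim shape described above, not the actual Gluten
   code. `StringColumn`, `Shims`, `Spark3Shims`, `Spark4Shims`, `collectMinMax`,
   and the numeric collation ids are hypothetical stand-ins so that the example
   runs without Spark on the classpath.
   
   ```scala
   object ShimSketch {
     // Hypothetical stand-in for a StringType column; only the collation id matters here.
     final case class StringColumn(name: String, collationId: Int)

     // Placeholder ids for illustration only; real code would use Spark's own constants.
     val Utf8BinaryCollationId = 0
     val Utf8LcaseCollationId = 1

     trait Shims {
       // Default: treat every string column as binary-collated (Spark 3.x has no collations).
       def isBinaryCollationString(col: StringColumn): Boolean = true
     }

     // Spark 3.x keeps the default.
     object Spark3Shims extends Shims

     // Spark 4.x inspects the column's collation id.
     object Spark4Shims extends Shims {
       override def isBinaryCollationString(col: StringColumn): Boolean =
         col.collationId == Utf8BinaryCollationId
     }

     // Min/max stats are collected only when the column's ordering matches byte order.
     def collectMinMax(shims: Shims, col: StringColumn): Boolean =
       shims.isBinaryCollationString(col)
   }
   ```
   
   With this shape, a `UTF8_LCASE`-style column is skipped only under the 4.x
   shim, while 3.x behavior is unchanged.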
   
   ### Why are the changes needed?
   
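   On Spark 4.x with a non-binary collation, Velox's `scanMinMax<StringView>`
   computes min/max with an unsigned byte-order compare, while Spark's filter
   compare is collation-aware. The two orderings disagree (for example, `'XYZ'`
   sorts before `'abc'` in byte order but after it under `UTF8_LCASE`), so
   stats-based pruning can silently drop matching rows.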
   
   Repro:
   ```scala
   spark.sql("CREATE TABLE t(s STRING COLLATE UTF8_LCASE) USING parquet")
   spark.sql("INSERT INTO t VALUES ('abc'), ('XYZ')")
   spark.sql("CACHE TABLE t")
   spark.sql("SELECT * FROM t WHERE s = 'ABC'").show()
   // Before: 0 rows (wrong). After: 1 row.
   ```
   
   Vanilla Spark's `StringColumnStats` is collation-aware, so this is 
Gluten-specific.
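   
   To make the mismatch concrete, here is a small self-contained sketch using
   the two values from the repro. It is an illustration under stated
   assumptions, not the Gluten or Velox code path: `byteCompare`, `lcaseCompare`,
   `batchMayMatch`, and `matchingRows` are hypothetical helpers, lowercasing is
   a simplified stand-in for `UTF8_LCASE`, and the pruning check is assumed to
   be `lower <= literal <= upper` evaluated with the collation-aware compare.
   
   ```scala
   import java.nio.charset.StandardCharsets.UTF_8
   import java.util.Locale

   object CollationPruningSketch {
     // Unsigned byte-order comparison of the UTF-8 encodings.
     def byteCompare(a: String, b: String): Int = {
       val x = a.getBytes(UTF_8)
       val y = b.getBytes(UTF_8)
       x.zip(y)
         .map { case (p, q) => (p & 0xff) - (q & 0xff) }
         .find(_ != 0)
         .getOrElse(x.length - y.length)
     }

     // Simplified stand-in for a case-insensitive collation: compare lowercased strings.
     def lcaseCompare(a: String, b: String): Int =
       byteCompare(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT))

     val rows = Seq("abc", "XYZ")

     // Bounds as a byte-order min/max scan would produce them.
     val byteOrder: Ordering[String] =
       Ordering.fromLessThan[String]((a, b) => byteCompare(a, b) < 0)
     val lower: String = rows.min(byteOrder)
     val upper: String = rows.max(byteOrder)

     // Assumed pruning check for `s = literal`: keep the batch only if
     // lower <= literal <= upper under the collation-aware comparison.
     def batchMayMatch(literal: String): Boolean =
       lcaseCompare(lower, literal) <= 0 && lcaseCompare(literal, upper) <= 0

     // Ground truth: rows that actually satisfy the collation-aware equality.
     def matchingRows(literal: String): Seq[String] =
       rows.filter(r => lcaseCompare(r, literal) == 0)
   }
   ```
   
   In this sketch `batchMayMatch("ABC")` is false even though
   `matchingRows("ABC")` contains `abc`, which mirrors the 0-row result in the
   repro.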
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, this is a correctness fix. No new config is added.
   
   ### How was this patch tested?
   
   - New `ColumnarCachedBatchDeserializeStatsSentinelSuite` (5 cases: EqualTo /
     In / IsNotNull / StartsWith / LessThan) passes on spark-3.3 / 3.4 / 3.5 /
     4.0 / 4.1.
   - `BuildFilterPruneSuite` regression passes on spark-3.5.
   - Cross-profile build 5/5 SUCCESS.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Opus 4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

