yaooqinn commented on PR #12112:
URL: https://github.com/apache/gluten/pull/12112#issuecomment-4489683082

   Thanks @zhli1142015 for the thorough review — both points are valid and 
addressed in rev 3.2 (`48c8bc749d`).
   
   **C1 (test coverage)**: The earlier sentinel-suite only exercised the 
deserialize hot path. Added 8 wrapper-behavior tests 
(`ColumnarCachedBatchBuildFilterPruneSuite` W1–W8) plus end-to-end UTF8_LCASE / 
UNICODE_CI cases in `ColumnarCachedBatchE2ESuite`. W8 is an explicit 
anti-regression that bypasses the wrapper to prove a UTF8_LCASE bound check 
would have wrongly pruned the batch.
   
   **C2 (sentinel correctness)**: You are right — `0xFF * 256B` is not a 
universal upper bound under non-binary collations. `PhysicalStringType.ordering 
= CollationFactory.fetchCollation(id).comparator` 
(`PhysicalDataType.scala:334`) means `<=` on collation-aware StringType is 
governed by the collation’s comparator, not by raw byte order, so any fixed 
sentinel can falsify the bound check.
   
   Rev 3.2 abandons the sentinel approach entirely. Reader-side mechanism is 
now a `splitConjunctivePredicates`-based wrapper around 
`SimpleMetricsCachedBatchSerializer.buildFilter`: any AND-conjunct referencing 
a non-binary collation StringType attribute is dropped before partition-stats 
evaluation; binary attributes still prune. Conjuncts that cannot be split (Or, 
etc.) referencing such attributes conservatively keep the batch. 
Writer/wire-format unchanged.
   
   Verified locally:
   - `mvn clean install` + suites on `-Pspark-3.5/scala-2.12`, 
`-Pspark-4.0/scala-2.13`, `-Pspark-4.1/scala-2.13`
   - spark-4.0 / 4.1: 42/42 PASS
   - spark-3.5: 32 PASS + 10 cleanly canceled (W1–W8 + 2 E2E collation cases 
guarded by `assume(isCollationAware)`, since CollationFactory does not exist on 
3.5)
   
   PR description updated to reflect rev 3.2. PTAL.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to