brijrajk commented on PR #12151:
URL: https://github.com/apache/gluten/pull/12151#issuecomment-4817526089

   ### Latest push (`8a1ef81f6`) — fix `SPARK-54336`
   
   Pushed a single focused commit on top of the q19 golden refresh.
   
   The earlier revision wrapped only the **outer** `might_contain` in 
`VeloxBloomFilterMightContain` whenever its value was not a plain column, while 
leaving the **inner** `bloom_filter_agg` vanilla. For the upstream 
`BloomFilterAggregateQuerySuite."SPARK-54336"` query — `might_contain((SELECT 
bloom_filter_agg(col) FROM t), 0L)`, where the value is a **literal** — the 
inner vanilla aggregate has no Substrait mapping, so it runs in the JVM and 
emits version=0 bytes that the Velox might-contain (version=1) cannot 
deserialize:
   
   ```
   Error Code: INVALID_ARGUMENT  Reason: (1 vs. 0)
   Expression: kBloomFilterV1 == version   Function: merge
   ```
   
   **Fix:** keep the producer and consumer on the same byte format.
   - column-valued `might_contain` → rewrite **both** the inner aggregate and 
the outer might-contain to Velox (the GLUTEN-12013 whole-stage-fallback case, 
unchanged);
   - non-column (literal) value → leave **both** vanilla, which also preserves 
vanilla's `NULL`-on-empty-input semantics.
   
   **Correction to an earlier note of mine on DPP / TPC-DS.** This rule is 
registered via `injectOptimizerRule`, which lands in Spark's **Operator 
Optimization** batch — *before* `InjectRuntimeFilter` and 
`MergeScalarSubqueries`. It therefore never observes DPP/runtime-filter 
`might_contain` expressions (`xxhash64(col)` values) or 
`ScalarSubqueryReference` nodes; those are produced downstream and are 
unaffected by this rule. In the failing CI run the DPP `might_contain` shows up 
as plain vanilla `BloomFilterMightContain` in the logs (`Not supported to map 
... might_contain(Subquery, xxhash64(...))`), confirming the rule does not 
touch it. The TPC-DS plan-stability failures were the pre-existing stale 
golden-file issue, not this rule's DPP handling.
   
   **Verified locally (Velox backend, Spark 4.0):**
   - `GlutenBloomFilterAggregateQuerySuiteCGOff` — `SPARK-54336` ✅ and 
`might_contain on bloom_filter_agg with empty input` ✅ (`SPARK-54336` 
previously crashed with `kBloomFilterV1 == version`).
   - `GlutenBloomFilterFallbackSuite` — all pass, including a new regression 
test mirroring the SPARK-54336 query.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to