brijrajk commented on PR #12151: URL: https://github.com/apache/gluten/pull/12151#issuecomment-4817526089
### Latest push (`8a1ef81f6`) — fix `SPARK-54336` Pushed a single focused commit on top of the q19 golden refresh. The earlier revision wrapped only the **outer** `might_contain` in `VeloxBloomFilterMightContain` whenever its value was not a plain column, while leaving the **inner** `bloom_filter_agg` vanilla. For the upstream `BloomFilterAggregateQuerySuite."SPARK-54336"` query — `might_contain((SELECT bloom_filter_agg(col) FROM t), 0L)`, where the value is a **literal** — the inner vanilla aggregate has no Substrait mapping, so it runs in the JVM and emits version=0 bytes that the Velox might-contain (version=1) cannot deserialize: ``` Error Code: INVALID_ARGUMENT Reason: (1 vs. 0) Expression: kBloomFilterV1 == version Function: merge ``` **Fix:** keep the producer and consumer on the same byte format. - column-valued `might_contain` → rewrite **both** the inner aggregate and the outer might-contain to Velox (the GLUTEN-12013 whole-stage-fallback case, unchanged); - non-column (literal) value → leave **both** vanilla, which also preserves vanilla's `NULL`-on-empty-input semantics. **Correction to an earlier note of mine on DPP / TPC-DS.** This rule is registered via `injectOptimizerRule`, which lands in Spark's **Operator Optimization** batch — *before* `InjectRuntimeFilter` and `MergeScalarSubqueries`. It therefore never observes DPP/runtime-filter `might_contain` expressions (`xxhash64(col)` values) or `ScalarSubqueryReference` nodes; those are produced downstream and are unaffected by this rule. In the failing CI run the DPP `might_contain` shows up as plain vanilla `BloomFilterMightContain` in the logs (`Not supported to map ... might_contain(Subquery, xxhash64(...))`), confirming the rule does not touch it. The TPC-DS plan-stability failures were the pre-existing stale golden-file issue, not this rule's DPP handling. **Verified locally (Velox backend, Spark 4.0):** - `GlutenBloomFilterAggregateQuerySuiteCGOff` — `SPARK-54336` ✅ and `might_contain on bloom_filter_agg with empty input` ✅ (`SPARK-54336` previously crashed with `kBloomFilterV1 == version`). - `GlutenBloomFilterFallbackSuite` — all pass, including a new regression test mirroring the SPARK-54336 query. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
