andygrove opened a new issue, #4395:
URL: https://github.com/apache/datafusion-comet/issues/4395

   ## Background
   
   [#4374](https://github.com/apache/datafusion-comet/pull/4374) fixes a 
correctness bug (#4242) by including `originalPlan.canonicalized` in 
`CometBroadcastExchangeExec.doCanonicalize`, so that two semantically distinct 
broadcasts (count+1 vs count-1 in correlated-IN count-bug decorrelation) no 
longer collide and get merged by AQE's `stageCache` / `CometReuseSubquery`.
   
   The trade-off is that q83 loses two `ReusedSubquery` collapses on the three 
DPP subqueries against `date_dim`:
   
   - `approved-plans-v1_4-spark3_5/q83/extended.txt`: deleted (falls through to 
the base file at 124 ops with 0 `ReusedSubquery`).
   - `approved-plans-v1_4-spark4_0/q83.ansi/extended.txt`: 98 ops → 124 ops; 2 
`ReusedSubquery` → 0.
   
   This is a perf regression (broadcasting `date_dim` three times instead of 
once) but not a correctness regression. Tracking the recovery separately so the 
correctness fix in #4374 isn't blocked.
   
   ## Root cause
   
   Spark's `Canonicalize` strips alias names (rewrites attributes to `none#N`) 
but preserves expression structure. That makes `originalPlan.canonicalized` the 
right discriminator for count-bug (the `Add(count, 1)` vs `Subtract(count, 1)` 
node inside `HashAggregateExec.resultExpressions` survives canonicalization).
   
   But `QueryPlan.normalizeExprIds` is **context-dependent**: exprIds are 
assigned by plan traversal order. For q83, each DPP subquery is wrapped in its 
own `AdaptiveSparkPlanExec`, and the three structurally-equivalent build plans 
end up with non-equal canonical forms because exprIds embedded in their 
expressions are numbered against different surrounding contexts. The old 
(buggy) form happened to collapse q83's DPPs because it only keyed on 
`child.canonicalized` — the inner Comet operator tree canonicalizes from 
scratch with stable internal exprIds, so the three equivalent child plans were 
byte-equal, and `originalPlan` wasn't in the key at all.
   
   ## What was tried and why it didn't work
   
   While investigating review feedback on #4374, I tried two narrower keys:
   
   1. **`originalPlan.canonicalized.output`** — fails. Spark canonicalization 
wipes alias names, so count+1 and count-1 broadcasts have output `[none#0L, 
none#1, none#2]` after canonicalization. The `CometAggregateSuite` count-bug 
regression test reproduces (Spark returns 2 rows, Comet returns 1).
   
   2. **Flat expression set: `canonicalOriginal.collect { case p => 
p.expressions }.flatten.toSet`** — discriminates count-bug correctly, but 
false-negatives on equivalent broadcasts. Sampling 5 distinct invocations of 
equivalent count+1 broadcasts in the same test produced two different sets: 
`(none#53L + 1) AS #2L` vs `(none#136L + 1) AS #2L`. Same problem would hit q83.
   
   The fundamental issue is that exprId assignment depends on traversal 
context, so structurally equivalent plans in different surrounding contexts get 
different canonical forms.
   
   ## Possible paths forward
   
   - **Custom exprId-insensitive canonicalization** for 
`CometBroadcastExchangeExec` that re-numbers exprIds within just the 
canonicalized originalPlan, independent of the surrounding plan context. 
Non-trivial but most principled.
   - **Unwrap `AdaptiveSparkPlanExec`** before canonicalization (would need to 
verify this is the actual carrier of the differing exprIds in q83).
   - **Targeted matching on the count-bug pattern only** — e.g. include 
originalPlan in the key only when it embeds Spark-side projections above Comet 
aggregates. Brittle and pattern-specific.
   
   ## Related
   
   - PR #4374 (the correctness fix this issue trades off against)
   - Issue #4242 (the count-bug root cause)
   - Review thread: 
https://github.com/apache/datafusion-comet/pull/4374#discussion_r3284477020
   - Pre-existing in `CometBroadcastExchangeExec`: `equals`/`hashCode` 
previously ignored `mode` and `output` (now fixed in #4374), but Spark's 
`BroadcastExchangeExec.doCanonicalize` calls `mode.canonicalized` while ours 
passed bare `mode` (also fixed in #4374). Mentioning for context — these are no 
longer outstanding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to