andygrove opened a new pull request, #3982:
URL: https://github.com/apache/datafusion-comet/pull/3982

   ## Which issue does this PR close?
   
   Related to #3949.
   
   ## Rationale for this change
   
   Comet's shuffle-support predicates (\`nativeShuffleSupported\`, 
\`columnarShuffleSupported\` in \`CometShuffleExchangeExec\`) are called at 
both initial planning and AQE stage-prep. Between those passes, AQE wraps 
already-materialized child stages in \`ShuffleQueryStageExec\` — a 
\`LeafExecNode\` whose \`children\` is \`Seq.empty\`. A naive re-evaluation of 
the support check can therefore flip the decision.
   
   Concrete example (and the suspected trigger for #3949): 
\`stageContainsDPPScan\` walks \`s.child.exists(...)\` looking for a 
\`FileSourceScanExec\` with a \`PlanExpression\` partition filter. When the DPP 
subtree sits under a materialized \`ShuffleQueryStageExec\`, \`.exists\` stops 
at the wrapper and the DPP scan becomes invisible. The same shuffle that 
correctly fell back to Spark at initial planning is then converted to Comet at 
stage prep — producing plan-shape inconsistencies across the two passes.
   
   This PR takes the same approach \`CometNativeScan.isSupported\` already uses 
for scans: if the node already carries any Comet fallback tag from a prior rule 
pass, preserve that decision rather than re-deriving it.
   
   ## What changes are included in this PR?
   
   - \`nativeShuffleSupported\` and \`columnarShuffleSupported\` short-circuit 
to \`false\` when \`hasExplainInfo(s)\` is true.
   - New \`CometShuffleFallbackStickinessSuite\`:
     - direct: tags a synthetic shuffle and asserts both predicates return 
\`false\`.
     - end-to-end: plans a DPP query, runs the support check once to establish 
the fallback tag, then swaps the shuffle's child for an opaque \`LeafExecNode\` 
that hides the DPP subtree (mimicking a materialized stage), and asserts the 
second call still falls back.
   
   Existing \`DPP fallback\` and \`DPP fallback avoids inefficient Comet 
shuffle (#3874)\` tests in \`CometExecSuite\` continue to pass.
   
   ## How are these changes tested?
   
   - New stickiness suite (two tests) passes.
   - Existing DPP tests pass.
   - I have **not** been able to reproduce the full #3949 \`ColumnarToRowExec\` 
canonicalization assertion with small synthetic DPP queries, so this PR is 
framed as a correctness fix for the decision-stability invariant — it may or 
may not close #3949 on its own.
   
   Draft until we either reproduce the full crash end-to-end, or confirm the 
fix resolves the awslabs benchmark failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to