cloud-fan opened a new pull request, #55719:
URL: https://github.com/apache/spark/pull/55719

   ### What changes were proposed in this pull request?
   
   Followup to SPARK-56482 (#55425). Two groups of changes to `UnionExec`'s 
whole-stage codegen path.
   
   **Code cleanup:**
   
   - Hoist `metricTerm("numOutputRows")` to `doProduce` and store it on the 
instance. `doConsume` runs once per child during emission, so the previous code 
registered the same metric N times in `references[]` for an N-child Union; it is 
now registered once.
   - Drop the dead `assert` in `perChildProjections` and the duplicate 
`allChildOutputDataTypesMatch` lazy val. The dataType comparison now has a 
single source of truth in the `type-mismatch` branch of the gate.
   - Inline the one-shot `hasAnyPartitionIndexDependentDescendant` lazy val.
   - Drop the unreachable `case other` in the `UnionPartition` match and 
replace with `asInstanceOf`. `unionedInputRDD` is built as `new UnionRDD(...)` 
two lines up, and `getPartitions` only ever returns `UnionPartition[_]`.
   - Factor out an `isPlainUnion` helper, used by both the gate and `doExecute`, 
so the invariant "codegen path matches `sparkContext.union` semantics" lives in 
one place.
   - Hoist the child-local index into a `childLocalIdx` local at helper entry. 
References emitted by `RangeExec`/`SampleExec` now read a plain int instead of 
re-evaluating `((int[]) refs[K])[partitionIndex]` on every use.
   - Drop the `try/finally` around codegen state restoration. Codegen failure 
aborts the whole stage, so the restoration is unreachable.
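
The metric-hoisting item above can be illustrated with a toy model. This is a minimal sketch, not Spark code: `ToyCodegenContext`, `add_reference`, `emit_per_child`, and `emit_hoisted` are hypothetical names invented for illustration, and the only assumption carried over from the description is that each registration appends one entry to a `references` array.

```python
class ToyCodegenContext:
    """Hypothetical stand-in for a codegen context holding a references array."""

    def __init__(self):
        self.references = []

    def add_reference(self, obj):
        # Append the object and return the slot index generated code would read.
        self.references.append(obj)
        return len(self.references) - 1


def emit_per_child(ctx, metric, num_children):
    # Old shape: each per-child consume call registers the metric again.
    return [ctx.add_reference(metric) for _ in range(num_children)]


def emit_hoisted(ctx, metric, num_children):
    # New shape: register once up front, then reuse that slot for every child.
    slot = ctx.add_reference(metric)
    return [slot for _ in range(num_children)]


metric = object()
before, after = ToyCodegenContext(), ToyCodegenContext()
slots_before = emit_per_child(before, metric, 3)
slots_after = emit_hoisted(after, metric, 3)
print(len(before.references), len(after.references))  # prints "3 1"
```

In this model, a 3-child union leaves three identical entries in the old shape and a single shared entry in the hoisted shape, which is the N-to-1 reduction the item describes.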
   
   **Gate narrowing:**
   
   - Narrow `hasPartitionIndexDependentCodegen` to exclude `InputFileName`, 
`InputFileBlockStart`, and `InputFileBlockLength`. These are `Nondeterministic` 
but read from `InputFileBlockHolder` (a per-task thread-local) and do not embed 
`partitionIndex`, so they are safe under fusion. Queries like `SELECT 
input_file_name() FROM a UNION ALL SELECT input_file_name() FROM b` now fuse.
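
The narrowed gate can be sketched as a predicate over a toy expression tree. This is a hedged illustration only: `ToyExpr`, `PARTITION_INDEX_FREE`, and `depends_on_partition_index` are hypothetical names, the real `hasPartitionIndexDependentCodegen` may be structured differently, and the sketch assumes the gate otherwise treats a nondeterministic expression as partition-index-dependent. `Rand` is used as an illustrative example of such an expression.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExpr:
    """Hypothetical expression node: a name, a nondeterminism flag, children."""
    name: str
    nondeterministic: bool = False
    children: list = field(default_factory=list)


# Expressions the description names as nondeterministic but partition-index-free.
PARTITION_INDEX_FREE = {
    "InputFileName",
    "InputFileBlockStart",
    "InputFileBlockLength",
}


def depends_on_partition_index(expr):
    # Assumption: any other nondeterministic expression blocks fusion.
    if expr.nondeterministic and expr.name not in PARTITION_INDEX_FREE:
        return True
    # Otherwise the answer depends on the subtree.
    return any(depends_on_partition_index(c) for c in expr.children)


file_name = ToyExpr("InputFileName", nondeterministic=True)
rand_alias = ToyExpr("Alias", children=[ToyExpr("Rand", nondeterministic=True)])
print(depends_on_partition_index(file_name))   # prints "False"
print(depends_on_partition_index(rand_alias))  # prints "True"
```

Under these assumptions, a child projecting only `input_file_name()` passes the gate, while a child containing a partition-index-seeded expression anywhere in its tree still falls back.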
   
   ### Why are the changes needed?
   
   The cleanups remove accidental complexity in the fused code path: an N-fold 
metric reference, two duplicated dataType comparisons, an unreachable defensive 
guard, a per-iteration array deref, and a `try/finally` that protects against 
an unreachable case. The gate narrowing turns a missed optimization (file-scan 
unions) into a fused plan.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. `spark.sql.codegen.wholeStage.union.enabled` remains off by default; 
when on, the new behavior fuses additional plans (file-scan unions with 
`input_file_name()`) that the previous gate over-rejected.
   
   ### How was this patch tested?
   
   `UnionCodegenSuite`, `UnionCodegenAnsiSuite`, `UnionCodegenAqeSuite`, and 
the relevant `SQLMetricsSuite` test all pass. Two tests added:
   
   - `partitioning-aware union falls back to non-codegen` — covers a 
`supportCodegenFailureReason` branch that lacked explicit coverage.
   - `input_file_name child fuses (Nondeterministic but partition-index-free)` 
— validates the gate narrowing.
   
   The `columnar` fallback branch is not covered by a new test: reliably 
constructing a plan where `Union.supportsColumnar` is true via the user-facing 
API turned out to be brittle, since `ApplyColumnarRulesAndInsertTransitions` 
aggressively rebalances columnar/row transitions.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

