cloud-fan commented on code in PR #56056:
URL: https://github.com/apache/spark/pull/56056#discussion_r3288991100
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala:
##########
@@ -153,9 +153,29 @@ object DetectAmbiguousSelfJoin extends Rule[LogicalPlan] {
}
condition.toSeq.flatMap(getAmbiguousAttrs)
- case _ => ambiguousColRefs.toSeq.map { ref =>
- colRefAttrs.find(attr => toColumnReference(attr) == ref).get
- }
+ case _ =>
+ // SPARK-52498: For a Project on top of a self-join with a foldable
join condition
+ // (e.g., df.join(df, df("col") === 0).select(df("col"))), the
column references
+ // in the select are not ambiguous because the foldable condition
means it doesn't
+ // matter which side the column comes from.
+ val isProjectOverFoldableSelfJoin = plan match {
+ case Project(_, Join(
+ LogicalPlanWithDatasetId(_, leftId),
+ LogicalPlanWithDatasetId(_, rightId),
+ _, Some(condition), _)) if leftId == rightId =>
+ condition.collectFirst {
+ case Equality(_, b) if b.foldable => true
+ case Equality(a, _) if a.foldable => true
+ }.isDefined
+ case _ => false
+ }
+ if (isProjectOverFoldableSelfJoin) {
+ Nil
+ } else {
+ ambiguousColRefs.toSeq.map { ref =>
+ colRefAttrs.find(attr => toColumnReference(attr) == ref).get
+ }
+ }
Review Comment:
I think the premise of this exemption needs reconsidering — the case it
allows is genuinely ambiguous.
**The justification in the comment is inaccurate.** "The foldable condition
means it doesn't matter which side the column comes from" is not true. The
condition `df("col") === 0` resolves to one side only (LEFT, since `df("col")`
carries LEFT's exprId after `ResolveDeduplicateRelations`). The join is
effectively:
```sql
SELECT * FROM df L JOIN df R ON L.col = 0
```
That filters `L.col = 0` but leaves `R.col` unconstrained. For `df = [0, 1,
2]`:
| L.col | R.col |
|---|---|
| 0 | 0 |
| 0 | 1 |
| 0 | 2 |
`select df("col"))` → LEFT.col → `[0, 0, 0]`. Selecting the right side
(hypothetically) would give `[0, 1, 2]`. **The two sides do not agree.** (Note
`JoinWith.resolveSelfJoinCondition` only rewrites trivially-true `EqualTo(a, b)
if a.sameRef(b)` — `df("col") === 0` is not rewritten.)
So the column reference *is* ambiguous in exactly the way this rule exists
to flag: the user wrote `df("col")` and got LEFT, but a reasonable reader of
`df.join(df, df("col") === 0).select(df("col"))` might expect the result to be
"col from the join result" — which isn't a single well-defined thing.
**Internal consistency with the existing tests.** `SPARK-28344: fail
ambiguous self join - column ref in Project` in `DataFrameSelfJoinSuite` (line
148) already establishes that `self_join.select(df("col"))` is ambiguous and
must throw. The PR carves out an exemption when the join condition happens to
contain a foldable equality — but the foldable equality does no semantic work
here (it doesn't make L.col == R.col, doesn't make `df("col")` resolve any
differently). It's just the discriminator that separates the new test case from
the four existing tests that broke with the broader fix mentioned in the PR
description. Without a principled reason, this looks like reverse-engineering
the rule to fit one test.
**Implication for the motivation.** The PR description says the single-pass
resolver handles this "correctly." If we agree the case is ambiguous, then
single-pass is the one with the bug — the right direction is to align
single-pass with `DetectAmbiguousSelfJoin` (preserve the error), not to weaken
`DetectAmbiguousSelfJoin`.
Would you mind reconsidering the direction? Happy to discuss if there's a
use case I'm missing — but as-is I don't think we should land this exemption.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]