[jira] [Created] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

Vincent Doba (Jira) Tue, 28 Sep 2021 01:21:05 -0700

Vincent Doba created SPARK-36874:
------------------------------------

             Summary: Ambiguous Self-Join detected only on right dataframe
                 Key: SPARK-36874
                 URL: https://issues.apache.org/jira/browse/SPARK-36874
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.2
            Reporter: Vincent Doba



When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicit._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
        at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
        at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
        at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
        at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
        at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
        at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

Reply via email to