[ https://issues.apache.org/jira/browse/SPARK-29682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965216#comment-16965216 ]
sandeshyapuram commented on SPARK-29682:
----------------------------------------

[~imback82] & [~cloud_fan] For now I have worked around this by renaming every column in the DataFrames before performing the joins, and that works. Let me know if you have a better workaround.

> Failure when resolving conflicting references in Join
> ------------------------------------------------------
>
>                 Key: SPARK-29682
>                 URL: https://issues.apache.org/jira/browse/SPARK-29682
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, Spark Shell, Spark Submit
>    Affects Versions: 2.4.3
>            Reporter: sandeshyapuram
>            Priority: Major
>
> When I try to self-join a parent DataFrame (parentDf) with multiple child DataFrames (childDf1, childDf2, ...), where the child DataFrames are derived from a cube or rollup and then filtered on the grouping columns, I get an error:
> {{Failure when resolving conflicting references in Join:}}
> followed by a long error message that is quite unreadable. On the other hand, if I replace the cube or rollup with a plain groupBy, it works without issues.
>
> *Sample code:*
> {code:java}
> val numsDF = sc.parallelize(Seq(1, 2, 3, 4, 5, 6)).toDF("nums")
> val cubeDF = numsDF
>   .cube("nums")
>   .agg(
>     max(lit(0)).as("agcol"),
>     grouping_id().as("gid")
>   )
>
> val group0 = cubeDF.filter(col("gid") <=> lit(0))
> val group1 = cubeDF.filter(col("gid") <=> lit(1))
> cubeDF.printSchema
> group0.printSchema
> group1.printSchema
> // Recreating cubeDF
> cubeDF.select("nums").distinct
>   .join(group0, Seq("nums"), "inner")
>   .join(group1, Seq("nums"), "inner")
>   .show
> {code}
> *Sample output:*
> {code:java}
> numsDF: org.apache.spark.sql.DataFrame = [nums: int]
> cubeDF: org.apache.spark.sql.DataFrame = [nums: int, agcol: int ... 1 more field]
> group0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
> group1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [nums: int, agcol: int ... 1 more field]
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> root
>  |-- nums: integer (nullable = true)
>  |-- agcol: integer (nullable = true)
>  |-- gid: integer (nullable = false)
> org.apache.spark.sql.AnalysisException:
> Failure when resolving conflicting references in Join:
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> :     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
> :        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
> :           +- Project [nums#212, nums#212 AS nums#219]
> :              +- Project [value#210 AS nums#212]
> :                 +- SerializeFromObject [input[0, int, false] AS value#210]
> :                    +- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>    +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
>       +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
>          +- Project [nums#212, nums#212 AS nums#219]
>             +- Project [value#210 AS nums#212]
>                +- SerializeFromObject [input[0, int, false] AS value#210]
>                   +- ExternalRDD [obj#209]
> Conflicting attributes: nums#220
> ;;
> 'Join Inner
> :- Deduplicate [nums#220]
> :  +- Project [nums#220]
> :     +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
> :        +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
> :           +- Project [nums#212, nums#212 AS nums#219]
> :              +- Project [value#210 AS nums#212]
> :                 +- SerializeFromObject [input[0, int, false] AS value#210]
> :                    +- ExternalRDD [obj#209]
> +- Filter (gid#217 <=> 0)
>    +- Aggregate [nums#220, spark_grouping_id#218], [nums#220, max(0) AS agcol#216, spark_grouping_id#218 AS gid#217]
>       +- Expand [List(nums#212, nums#219, 0), List(nums#212, null, 1)], [nums#212, nums#220, spark_grouping_id#218]
>          +- Project [nums#212, nums#212 AS nums#219]
>             +- Project [value#210 AS nums#212]
>                +- SerializeFromObject [input[0, int, false] AS value#210]
>                   +- ExternalRDD [obj#209]
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:96)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:335)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:85)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:85)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:96)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:109)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:106)
>   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:202)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:106)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:68)
>   at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:66)
>   at org.apache.spark.sql.Dataset.join(Dataset.scala:939)
>   ... 46 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
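The column-renaming workaround mentioned in the comment can be sketched roughly as follows. This is a minimal illustration, not the reporter's exact code: it assumes the `cubeDF` from the sample code is in scope, and the `withPrefix` helper is a hypothetical name introduced here for clarity.

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Hypothetical helper: prefix every column name so the filtered child
// DataFrames no longer expose the same attribute names as the parent.
def withPrefix(df: DataFrame, prefix: String): DataFrame =
  df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, s"$prefix$c"))

val g0 = withPrefix(cubeDF.filter(col("gid") <=> lit(0)), "g0_")
val g1 = withPrefix(cubeDF.filter(col("gid") <=> lit(1)), "g1_")

// Join on explicit column expressions instead of a shared column name.
cubeDF.select("nums").distinct
  .join(g0, col("nums") === col("g0_nums"), "inner")
  .join(g1, col("nums") === col("g1_nums"), "inner")
  .show()
{code}

Because every column on the right-hand side of each join carries a distinct prefix, the analyzer no longer sees the same attribute (e.g. {{nums#220}}) on both sides of the self-join.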