[ https://issues.apache.org/jira/browse/SPARK-42132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-42132:
-----------------------------------
    Labels: correctness pull-request-available  (was: correctness)

> DeduplicateRelations rule breaks plan when co-grouping the same DataFrame
> -------------------------------------------------------------------------
>
>                 Key: SPARK-42132
>                 URL: https://issues.apache.org/jira/browse/SPARK-42132
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>   Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.3.1, 3.2.3, 3.4.0, 3.5.0
>            Reporter: Enrico Minack
>            Priority: Major
>              Labels: correctness, pull-request-available
>             Fix For: 3.5.0
>
>
> Co-grouping two DataFrames that share references breaks on the
> DeduplicateRelations rule:
> {code:java}
> val df = spark.range(3)
> val left_grouped_df = df.groupBy("id").as[Long, Long]
> val right_grouped_df = df.groupBy("id").as[Long, Long]
> val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
>   case (key, left, right) => left
> }
> cogroup_df.explain()
> {code}
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [input[0, bigint, false] AS value#12L]
>    +- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], [id#13L], [id#13L], obj#11: bigint
>       :- !Sort [id#13L ASC NULLS FIRST], false, 0
>       :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=16]
>       :     +- Range (0, 3, step=1, splits=16)
>       +- Sort [id#13L ASC NULLS FIRST], false, 0
>          +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, [plan_id=17]
>             +- Range (0, 3, step=1, splits=16)
> {code}
> The DataFrame cannot be computed:
> {code:java}
> cogroup_df.show()
> {code}
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
> {code}
> The rule replaces `id#0L` on the right side with `id#13L` while replacing all
> occurrences in `CoGroup`. Some occurrences of `id#0L` in `CoGroup` refer to
> the left side and should not be replaced. Further, `id#0L` of the right
> deserializer is not replaced.
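
The failure mode described above can be sketched in plain Scala without Spark. This is only a toy model under stated assumptions: `Attr`, `CoGroup`, `rewriteAsReported` and `rewriteRightOnly` are hypothetical names invented for illustration, not Catalyst classes and not the fix that was merged. It shows why a rename applied to every grouping reference, while skipping the right deserializer, leaves references that no longer resolve against their own child's output.

```scala
// Toy model only: these types are hypothetical and are NOT Spark's Catalyst classes.
object DedupSketch {
  // An attribute reference such as id#0L: a name plus an expression ID.
  final case class Attr(name: String, exprId: Long) {
    override def toString: String = s"$name#${exprId}L"
  }

  // Simplified co-group node. Each side's grouping and deserializer references
  // must be found in that side's child output.
  final case class CoGroup(
      leftGroup: Attr,
      rightGroup: Attr,
      leftDeser: Attr,
      rightDeser: Attr,
      leftOutput: Set[Attr],
      rightOutput: Set[Attr]) {
    // References that cannot be resolved against their own side's output.
    def missing: Seq[Attr] =
      Seq(leftGroup, leftDeser).filterNot(leftOutput) ++
        Seq(rightGroup, rightDeser).filterNot(rightOutput)
  }

  // Mimics the reported behaviour: the right child is renamed, the rename is
  // applied to the grouping references of BOTH sides, and the right
  // deserializer is left untouched.
  def rewriteAsReported(p: CoGroup, from: Attr, to: Attr): CoGroup = {
    def sub(a: Attr): Attr = if (a == from) to else a
    p.copy(
      leftGroup = sub(p.leftGroup),
      rightGroup = sub(p.rightGroup),
      rightOutput = p.rightOutput.map(sub))
  }

  // Side-aware alternative (an assumption about what a correct rewrite needs):
  // only references that belong to the right side follow the rename.
  def rewriteRightOnly(p: CoGroup, from: Attr, to: Attr): CoGroup = {
    def sub(a: Attr): Attr = if (a == from) to else a
    p.copy(
      rightGroup = sub(p.rightGroup),
      rightDeser = sub(p.rightDeser),
      rightOutput = p.rightOutput.map(sub))
  }

  val id0: Attr = Attr("id", 0L)
  val id13: Attr = Attr("id", 13L)

  // Both sides of the co-group start out referring to the same id#0L.
  val selfCoGroup: CoGroup = CoGroup(id0, id0, id0, id0, Set(id0), Set(id0))

  def main(args: Array[String]): Unit = {
    println(rewriteAsReported(selfCoGroup, id0, id13).missing)
    println(rewriteRightOnly(selfCoGroup, id0, id13).missing)
  }
}
```

In this model the first rewrite leaves two dangling references: the left grouping attribute now points at `id#13L`, which the left child does not produce, and the right deserializer still points at `id#0L`. The side-aware rewrite leaves none. The real rule operates on full logical plans, so this sketch only mirrors the shape of the problem.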