[
https://issues.apache.org/jira/browse/SPARK-51262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081235#comment-18081235
]
Shrirang Mhalgi edited comment on SPARK-51262 at 5/15/26 4:47 PM:
------------------------------------------------------------------
I was able to reproduce this on the master branch. The issue occurs when
{{exceptAll}} (which uses the {{RewriteExceptAll}} optimizer rule) runs on a
DataFrame produced by {{dropDuplicates(subset)}}. The root cause is an
attribute reference mismatch in the optimized plan.
Working on a fix.
Repro (Scala):
> val df1 = spark.createDataFrame(Seq((1, "a", 100), (1, "a", 200), (2, "b", 300)))
>   .toDF("id", "name", "value")
> val df2 = spark.createDataFrame(Seq((1, "a", 100))).toDF("id", "name", "value")
> df1.dropDuplicates("id", "name").exceptAll(df2).count()
> // Throws: INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND
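One way to look for the mismatched attribute is to print the analyzed and optimized plans for the failing query and compare the attribute IDs by eye. The following is a minimal sketch, assuming a local Spark session and the same data as the repro; the object name {{InspectExceptAllPlan}} is illustrative only, and it has not been confirmed that the mismatch is visible in the printed plans.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative helper (hypothetical name): prints the logical plans for the
// failing query so attribute references (e.g. value#12) can be compared.
object InspectExceptAllPlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-51262-plan-inspection")
      .getOrCreate()

    // Same data as the repro.
    val df1 = spark
      .createDataFrame(Seq((1, "a", 100), (1, "a", 200), (2, "b", 300)))
      .toDF("id", "name", "value")
    val df2 = spark
      .createDataFrame(Seq((1, "a", 100)))
      .toDF("id", "name", "value")

    // Build the query without running an action.
    val result = df1.dropDuplicates("id", "name").exceptAll(df2)

    // Plan after analysis, before any optimizer rule has run.
    println("== Analyzed plan ==")
    println(result.queryExecution.analyzed.treeString)

    // Plan after optimizer rules (including RewriteExceptAll) have run.
    println("== Optimized plan ==")
    println(result.queryExecution.optimizedPlan.treeString)

    spark.stop()
  }
}
```

No action is called on the DataFrame here, so the plans are only built and printed, not executed.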
was (Author: JIRAUSER313104):
I was able to reproduce this on master (branch-4.0). The issue occurs when
{{exceptAll}} (which uses the {{RewriteExceptAll}} optimizer rule) runs on a
DataFrame produced by {{dropDuplicates(subset)}}. The root cause is an
attribute reference mismatch in the optimized plan.
Working on a fix.
Repro (Scala):
> val df1 = spark.createDataFrame(Seq((1, "a", 100), (1, "a", 200), (2, "b", 300)))
>   .toDF("id", "name", "value")
> val df2 = spark.createDataFrame(Seq((1, "a", 100))).toDF("id", "name", "value")
> df1.dropDuplicates("id", "name").exceptAll(df2).count()
> // Throws: INTERNAL_ERROR_ATTRIBUTE_NOT_FOUND
> exceptAll not working with drop_duplicates using subset
> -------------------------------------------------------
>
> Key: SPARK-51262
> URL: https://issues.apache.org/jira/browse/SPARK-51262
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.5.0, 3.5.3
> Reporter: Nicolau Balbino
> Priority: Minor
> Labels: SQL, pull-request-available
>
> When using drop_duplicates with a subset and then calling the exceptAll method,
> any subsequent action (isEmpty, show, collect, count) raises a Py4J error.
> Searching the web, this issue appears to be related to
> [https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-39612],
> which is marked as resolved.
> I tested locally with version 3.5.3 and also on AWS Glue 5.0, which uses 3.5.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)