[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793889#comment-17793889 ]

Dongjoon Hyun commented on SPARK-45580:
---------------------------------------

I raised this issue to the blocker for Apache Spark 3.3.4.

> Subquery changes the output schema of the outer query
> -----------------------------------------------------
>
>                 Key: SPARK-45580
>                 URL: https://issues.apache.org/jira/browse/SPARK-45580
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.3, 3.4.1, 3.5.0
>            Reporter: Bruce Robbins
>            Priority: Blocker
>              Labels: correctness, pull-request-available
>
> A query can have an incorrect output schema because of a subquery.
> Assume this data:
> {noformat}
> create or replace temp view t1(a) as values (1), (2), (3), (7);
> create or replace temp view t2(c1) as values (1), (2), (3);
> create or replace temp view t3(col1) as values (3), (9);
> cache table t1;
> cache table t2;
> cache table t3;
> {noformat}
> When run in {{spark-sql}}, the following query has a superfluous boolean
> column:
> {noformat}
> select *
> from t1
> where exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> 1	false
> 2	false
> 3	true
> {noformat}
> The result should be:
> {noformat}
> 1
> 2
> 3
> {noformat}
> When executed via the {{Dataset}} API, you don't see the incorrect result,
> because the Dataset API truncates the right-side of the rows based on the
> analyzed plan's schema (it's the optimized plan's schema that goes wrong).
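
A toy model of the truncation described above, in plain Python rather than Spark. The row values are copied from the spark-sql output in the report; `ANALYZED_SCHEMA`, `OPTIMIZED_ROWS`, and `truncate_to_schema` are hypothetical names used only for illustration. It sketches why trimming each row to the width of the analyzed schema would hide the extra boolean column, and it is not a description of how Spark implements this.

```python
# Toy model only: plain Python, not Spark code.

# Schema the analyzer reports for `select * from t1 where exists (...)`:
# a single column, a.
ANALYZED_SCHEMA = ["a"]

# Rows as printed by spark-sql in the report: the optimized plan
# produced a second, unwanted boolean column.
OPTIMIZED_ROWS = [(1, False), (2, False), (3, True)]


def truncate_to_schema(rows, schema):
    """Keep only the leading len(schema) fields of each row."""
    width = len(schema)
    return [row[:width] for row in rows]


# After truncation the extra right-hand column is gone, so the
# result looks correct even though the optimized plan is not.
dataset_view = truncate_to_schema(OPTIMIZED_ROWS, ANALYZED_SCHEMA)
print(dataset_view)
```

In this model the truncated rows match the expected result (1, 2, 3), which is consistent with the report's observation that the wrong schema is masked rather than fixed on that code path.
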
> However, even with the {{Dataset}} API, this query goes wrong:
> {noformat}
> select (
>   select *
>   from t1
>   where exists (
>     select c1
>     from t2
>     where a = c1
>     or a in (select col1 from t3)
>   )
>   limit 1
> )
> from range(1);
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
> 	at scala.Predef$.assert(Predef.scala:279)
> 	at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
> 	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
> 	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
> 	at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
> ...
> {noformat}
> Other queries that have the wrong schema:
> {noformat}
> select *
> from t1
> where a in (
>   select c1
>   from t2
>   where a in (select col1 from t3)
> );
> {noformat}
> and
> {noformat}
> select *
> from t1
> where not exists (
>   select c1
>   from t2
>   where a = c1
>   or a in (select col1 from t3)
> );
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org