Bruce Robbins created SPARK-45580:
-------------------------------------

             Summary: RewritePredicateSubquery unexpectedly changes the output schema of certain queries
                 Key: SPARK-45580
                 URL: https://issues.apache.org/jira/browse/SPARK-45580
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0, 3.4.1, 3.3.3
            Reporter: Bruce Robbins

A query can have an incorrect output schema because of a subquery.

Assume this data:
{noformat}
create or replace temp view t1(a) as values (1), (2), (3), (7);
create or replace temp view t2(c1) as values (1), (2), (3);
create or replace temp view t3(col1) as values (3), (9);
cache table t1;
cache table t2;
cache table t3;
{noformat}
When run in {{spark-sql}}, the following query has a superfluous boolean column:
{noformat}
select *
from t1
where exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);

1 false
2 false
3 true
{noformat}
The result should be:
{noformat}
1
2
3
{noformat}
When executed via the {{Dataset}} API, you don't see this result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong).

However, even with the {{Dataset}} API, this query goes wrong:
{noformat}
select (
  select *
  from t1
  where exists (
    select c1
    from t2
    where a = c1
    or a in (select col1 from t3)
  )
  limit 1
)
from range(1);

java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
	at scala.Predef$.assert(Predef.scala:279)
	at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275)
	at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576)
	at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:933)
...
{noformat}
Other queries that have the wrong schema:
{noformat}
select *
from t1
where a in (
  select c1
  from t2
  where a in (select col1 from t3)
);
{noformat}
and
{noformat}
select *
from t1
where not exists (
  select c1
  from t2
  where a = c1
  or a in (select col1 from t3)
);
{noformat}
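
For readers who want to check the expected rows of the first query by hand, here is a minimal plain-Python sketch of its semantics. It does not use Spark; the helper names {{in_t3}} and {{exists_match}} are hypothetical and exist only for this illustration. It also rests on one assumption that the report does not state: that the extra boolean in the {{spark-sql}} output is the per-row result of the inner {{a in (select col1 from t3)}} test. The reported values are consistent with that reading, but it is an inference rather than a confirmed diagnosis.

```python
# Toy model of the three temp views from the report (plain Python, no Spark).
t1 = [1, 2, 3, 7]   # column a
t2 = [1, 2, 3]      # column c1
t3 = [3, 9]         # column col1


def in_t3(a):
    """Models the inner predicate: a in (select col1 from t3)."""
    return a in t3


def exists_match(a):
    """Models: exists (select c1 from t2 where a = c1 or a in (select col1 from t3))."""
    return any(a == c1 or in_t3(a) for c1 in t2)


# Expected result: a single column, matching the schema of t1.
expected = [(a,) for a in t1 if exists_match(a)]

# Shape reported from spark-sql: the same rows plus one extra boolean.
# ASSUMPTION: the extra value is the inner IN-subquery result for that row.
observed_shape = [(a, in_t3(a)) for a in t1 if exists_match(a)]

print(expected)
print(observed_shape)
```

Under that assumption the second list has the same shape and values as the {{spark-sql}} output quoted in the report, while the first list is the one-column result the query should return.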