[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17802355#comment-17802355 ]
Maytas Monsereenusorn commented on SPARK-30421:
-----------------------------------------------

Not sure if this is the right place to ask this, but it seems there are certain cases where the column will not be available for filtering. This is also a regression from 2.1 to post-2.1. Example query:

{code:java}
SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2'
{code}

This works fine in 2.1 for the reasons mentioned in this thread (due to _ResolveMissingReferences_). However, after 2.1 the plan changed and a SubqueryAlias was added. This seems to prevent ResolveMissingReferences from modifying the Project to add the Y column reference.

Post Spark 2.1 (e.g. Spark 3.3):
{code:java}
spark-sql-3.3> SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
Error in query: Column 'Y' does not exist. Did you mean one of the following? [__auto_generated_subquery_name.x]; line 1 pos 60;
'Project [*]
+- 'Filter ('Y = 2)
   +- SubqueryAlias __auto_generated_subquery_name
      +- Project [x#30]
         +- SubqueryAlias __auto_generated_subquery_name
            +- Project [1 AS x#30, 2 AS Y#31]
               +- OneRowRelation
{code}

Spark 2.1:
{code:java}
spark-sql> SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
1
Time taken: 2.725 seconds, Fetched 1 row(s)
spark-sql> EXPLAIN EXTENDED SELECT * FROM (SELECT x FROM (SELECT 1 AS x, 2 AS Y)) WHERE Y = '2';
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('Y = 2)
   +- 'Project ['x]
      +- Project [1 AS x#4, 2 AS Y#5]
         +- OneRowRelation$

== Analyzed Logical Plan ==
x: int
Project [x#4]
+- Project [x#4]
   +- Filter (cast(Y#5 as bigint) = cast(2 as bigint))
      +- Project [x#4, Y#5]
         +- Project [1 AS x#4, 2 AS Y#5]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [1 AS x#4]
+- OneRowRelation$

== Physical Plan ==
*Project [1 AS x#4]
+- Scan OneRowRelation[]
Time taken: 0.813 seconds, Fetched 1 row(s)
{code}

- Do we care that this is a regression, i.e. that the query used to work in 2.1 but now breaks in later versions?
- Do we care that filtering on a non-existent column (as long as the column exists in the original table) only works in certain cases but not all (e.g. not when there is a SubqueryAlias)?

> Dropped columns still available for filtering
> ---------------------------------------------
>
>                 Key: SPARK-30421
>                 URL: https://issues.apache.org/jira/browse/SPARK-30421
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.4
>            Reporter: Tobias Hermann
>            Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar"
> existed.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)