Matthew Scruggs created SPARK-18065:
---------------------------------------
             Summary: Spark 2 allows filter/where on columns not in current schema
                 Key: SPARK-18065
                 URL: https://issues.apache.org/jira/browse/SPARK-18065
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.0.1, 2.0.0
            Reporter: Matthew Scruggs


I noticed that in Spark 2 (unlike 1.6) it's possible to filter/where a DataFrame on a column that it previously had, but that is no longer in its schema due to a select() operation.

In Spark 1.6.2, in spark-shell, an exception is thrown when attempting to filter/where on the selected-out column:

{code:title=Spark 1.6.2}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+

scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)

scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id];
{code}

However, in Spark 2.0.0 and 2.0.1, the same code succeeds and appears to filter the data as if the column were still present:

{code:title=Spark 2.0.1}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]

scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+

scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]

scala> df2.printSchema()
root
 |-- id: integer (nullable = false)

scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)