[
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthew Scruggs updated SPARK-18065:
------------------------------------
Description:
I noticed that in Spark 2 (unlike 1.6), it's possible to use filter/where on a
DataFrame referencing a column it previously had, but which was removed from its
schema by an earlier select() operation.
In Spark 1.6.2, in spark-shell, we see that an exception is thrown when
attempting to filter/where using the selected-out column:
{code:title=Spark 1.6.2}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), (2, "two")))).selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+
scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]
scala> df2.printSchema()
root
|-- id: integer (nullable = false)
scala> df2.where("word = 'one'").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input columns: [id];
{code}
However, in Spark 2.0.0 and 2.0.1, the same filter/where succeeds (no
AnalysisException) and filters the data as if the column were still present:
{code:title=Spark 2.0.1}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df1 = sc.parallelize(Seq((1, "one"), (2, "two"))).toDF().selectExpr("_1 as id", "_2 as word")
df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> df1.show()
+---+----+
| id|word|
+---+----+
|  1| one|
|  2| two|
+---+----+
scala> val df2 = df1.select("id")
df2: org.apache.spark.sql.DataFrame = [id: int]
scala> df2.printSchema()
root
|-- id: integer (nullable = false)
scala> df2.where("word = 'one'").show()
+---+
| id|
+---+
|  1|
+---+
{code}
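Until this is resolved, downstream code can defend itself by validating referenced column names against df.columns before calling filter/where. A minimal plain-Scala sketch (no Spark dependency; requireColumns is a hypothetical helper, and the error wording merely mirrors the 1.6.2 message — it is not Spark API):

```scala
// Hypothetical guard: fail fast when a predicate references columns that are
// not in the DataFrame's current schema, mimicking the 1.6.2 behavior.
def requireColumns(schemaColumns: Seq[String], referenced: Seq[String]): Unit = {
  val missing = referenced.filterNot(schemaColumns.contains)
  // Predef.require throws IllegalArgumentException when the condition is false
  require(
    missing.isEmpty,
    s"cannot resolve '${missing.mkString(", ")}' given input columns: [${schemaColumns.mkString(", ")}]"
  )
}

// Against the example above, df2.columns is Array("id"), so
// requireColumns(df2.columns, Seq("word")) would throw before the silent filter runs.
requireColumns(Seq("id", "word"), Seq("word")) // ok: "word" is in the schema
```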
> Spark 2 allows filter/where on columns not in current schema
> ------------------------------------------------------------
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0, 2.0.1
> Reporter: Matthew Scruggs
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)