Re: "where" clause able to access fields not in its schema
It seems that we are using the function incorrectly:

    val a = Seq((1,10),(2,20)).toDF("foo","bar")
    val b = a.select($"foo")
    val c = b.where(b("bar") === 20)
    c.show

    Exception in thread "main" org.apache.spark.sql.AnalysisException:
    Cannot resolve column name "bar" among (foo);

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
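The difference between the two failures can be pictured with a toy sketch (hypothetical names, not Spark's actual implementation): an apply-style lookup like b("bar") is checked eagerly against that DataFrame's own schema, so it fails immediately, whereas a bare $"bar" is only checked later when the whole plan is analyzed.

```scala
// Toy sketch, not Spark internals: apply-style lookup (like b("bar"))
// checks the frame's own schema eagerly, so it fails right away with a
// "Cannot resolve column name" style error.
case class Frame(schema: Set[String]) {
  def apply(name: String): String =
    if (schema.contains(name)) name
    else throw new IllegalArgumentException(
      s"""Cannot resolve column name "$name" among (${schema.mkString(", ")})""")
}

val b = Frame(Set("foo"))

val eagerFails =
  try { b("bar"); false }
  catch { case _: IllegalArgumentException => true }

println(eagerFails)  // true: rejected before any query plan is built
```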
Re: "where" clause able to access fields not in its schema
Yeah, the filter gets in front of the select after analysis:

    scala> b.where($"bar" === 20).explain(true)

    == Parsed Logical Plan ==
    'Filter ('bar = 20)
    +- AnalysisBarrier
          +- Project [foo#6]
             +- Project [_1#3 AS foo#6, _2#4 AS bar#7]
                +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#3, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#4]
                   +- ExternalRDD [obj#2]

    == Analyzed Logical Plan ==
    foo: int
    Project [foo#6]
    +- Filter (bar#7 = 20)
       +- Project [foo#6, bar#7]
          +- Project [_1#3 AS foo#6, _2#4 AS bar#7]
             +- SerializeFromObject [assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1 AS _1#3, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#4]
                +- ExternalRDD [obj#2]

    == Optimized Logical Plan ==
    Project [_1#3 AS foo#6]
    +- Filter (_2#4 = 20)
       +- SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#4]
          +- ExternalRDD [obj#2]

    == Physical Plan ==
    *(1) Project [_1#3 AS foo#6]
    +- *(1) Filter (_2#4 = 20)
       +- *(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1 AS _1#3, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#4]
          +- Scan ExternalRDDScan[obj#2]

On Wed, Feb 13, 2019 at 8:04 PM Yeikel wrote:

> This is indeed strange. To add to the question, I can see that if I use a
> filter I get an exception (as expected), so I am not sure what's the
> difference between the where clause and filter: [...]

--
Sent from my iPhone
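The analyzed plan shows the Filter on bar#7 being resolved beneath the Project [foo#6]. A toy model of that behavior (hypothetical code, not the real analyzer) looks like this:

```scala
// Toy model, not the real analyzer: a Project remembers its child's rows,
// and a Filter placed on top of a Project is resolved against the child's
// full set of columns, with the projection applied afterwards -- mirroring
// Project [foo#6] sitting above Filter (bar#7 = 20) in the analyzed plan.
case class Relation(rows: Seq[Map[String, Int]])
case class Project(cols: Seq[String], child: Relation)

def whereOnProject(p: Project, col: String, v: Int): Seq[Map[String, Int]] = {
  val childCols = p.child.rows.headOption.map(_.keySet).getOrElse(Set.empty[String])
  require(childCols.contains(col), s"cannot resolve '$col'")
  p.child.rows
    .filter(row => row(col) == v)                                  // filter on the child's columns
    .map(row => row.filter { case (k, _) => p.cols.contains(k) })  // then project
}

val a = Relation(Seq(Map("foo" -> 1, "bar" -> 10), Map("foo" -> 2, "bar" -> 20)))
val b = Project(Seq("foo"), a)

val result = whereOnProject(b, "bar", 20)
println(result)  // List(Map(foo -> 2))
```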
Re: "where" clause able to access fields not in its schema
This is indeed strange. To add to the question, I can see that if I use a filter I get an exception (as expected), so I am not sure what's the difference between the where clause and filter:

    b.filter(s => {
      val bar: String = s.getAs("bar")
      bar.equals("20")
    }).show

    java.lang.IllegalArgumentException: Field "bar" does not exist.
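The contrast can be sketched with a toy model (hypothetical code, not Spark's): the Column-based where is resolved at analysis time against the whole plan, but a typed filter function runs per row on data that has already been projected, so a lookup of a dropped field fails at runtime.

```scala
// Toy contrast, not Spark's code: rows have already been projected down to
// "foo" by the time a typed filter function sees them, so a getAs-style
// lookup of the dropped "bar" field fails at runtime.
val rows = Seq(Map("foo" -> 1, "bar" -> 10), Map("foo" -> 2, "bar" -> 20))
val projected = rows.map(_.filter { case (k, _) => k == "foo" })  // like select($"foo")

def getAs(row: Map[String, Int], field: String): Int =
  row.getOrElse(field,
    throw new IllegalArgumentException(s"""Field "$field" does not exist."""))

val typedFilterFails =
  try { projected.filter(r => getAs(r, "bar") == 20); false }
  catch { case _: IllegalArgumentException => true }

println(typedFilterFails)  // true
```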
"where" clause able to access fields not in its schema
I don't know if this is a bug or a feature, but it's a bit counter-intuitive when reading code. The "b" dataframe does not have field "bar" in its schema, but is still able to filter on that field.

    scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar")
    a: org.apache.spark.sql.DataFrame = [foo: int, bar: int]

    scala> a.show
    +---+---+
    |foo|bar|
    +---+---+
    |  1| 10|
    |  2| 20|
    +---+---+

    scala> val b = a.select($"foo")
    b: org.apache.spark.sql.DataFrame = [foo: int]

    scala> b.schema
    res3: org.apache.spark.sql.types.StructType = StructType(StructField(foo,IntegerType,false))

    scala> b.select($"bar").show
    org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given input columns: [foo];;
    [...snip...]

    scala> b.where($"bar" === 20).show
    +---+
    |foo|
    +---+
    |  2|
    +---+