It seems that we are using the function incorrectly.
val a = Seq((1,10),(2,20)).toDF("foo","bar")
val b = a.select($"foo")
val c = b.where(b("bar") === 20)
c.show
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot
resolve column name "bar" among
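The distinction is in when the column gets resolved: b("bar") looks the column up in b's schema eagerly, at call time, so it throws; $"bar" is just an unresolved column name that the analyzer resolves later, and the analyzer is allowed to pull a missing attribute up from the child plan. A minimal sketch of the two call styles (assuming a SparkSession named spark and the same a/b as above):

```scala
import spark.implicits._

val a = Seq((1, 10), (2, 20)).toDF("foo", "bar")
val b = a.select($"foo")

// Eager resolution: b("bar") is looked up in b's schema immediately,
// and b only has "foo" -> AnalysisException at this line
// val c = b.where(b("bar") === 20)

// Lazy resolution: $"bar" is an unresolved attribute; the analyzer
// resolves it against b's child (which still has "bar"), so this works
b.where($"bar" === 20).show()
```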
Yeah, the filter gets pushed in front of the select during analysis:
scala> b.where($"bar" === 20).explain(true)
== Parsed Logical Plan ==
'Filter ('bar = 20)
+- AnalysisBarrier
+- Project [foo#6]
+- Project [_1#3 AS foo#6, _2#4 AS bar#7]
+- SerializeFromObject
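After analysis the plan is rewritten so that bar survives long enough for the filter and is projected away again at the end. Roughly, the analyzed query is equivalent to this (a sketch of the rewrite, not the literal plan):

```scala
// What the analyzer effectively turns b.where($"bar" === 20) into:
val equivalent = a
  .where($"bar" === 20)   // filter while "bar" is still in scope
  .select($"foo")         // then drop "bar" again
```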
This is indeed strange. To add to the question, I can see that if I use a typed filter (with a lambda over Row) I get an exception (as expected), so I am not sure what the difference is between the where clause and filter:
b.filter { s =>
  // "bar" is an Int in the original data, so read it as Int
  val bar: Int = s.getAs[Int]("bar")
  bar == 20
}.show
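The difference is that the typed filter takes a lambda evaluated against each Row of b's actual schema at runtime, so there is no analyzer step that could resolve the missing column. If you need the typed filter, apply it before the select, while "bar" still exists (a sketch using the same a as above):

```scala
// Typed filter on the full DataFrame, then project: no missing-column problem
a.filter { row => row.getAs[Int]("bar") == 20 }
  .select($"foo")
  .show()
```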
I don't know if this is a bug or a feature, but it's a bit counter-intuitive when reading code. The b DataFrame does not have the field "bar" in its schema, yet it is still able to filter on that field.
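You can confirm that b really has dropped the column by inspecting its schema, and still watch the where succeed (assuming the same a/b as above):

```scala
b.columns.foreach(println)     // prints only: foo
b.where($"bar" === 20).show()  // yet this resolves "bar" and works
```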
scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar")
a: org.apache.spark.sql.DataFrame = [foo: int, bar: int]