let me try that again. i left some crap at the bottom of my previous email as i was editing it. sorry about that. here it goes:
it is because you use Dataset[X] but the actual computations are still done in Dataset[Row] (so DataFrame). well... the actual computations are done in RDD[InternalRow], with spark's internal types used to represent String, Map, Seq, structs, etc.

so for example if you do:

scala> val x: Dataset[(String, String)] = ...
scala> val f: ((String, String)) => Boolean = _._2 != null
scala> x.filter(f)

in this case you are using a lambda function for the filter. this is a black-box operation to spark (spark cannot see what is inside the function). so spark will now convert the internal representation it is actually using (something like an InternalRow of size 2 holding two objects of type UTF8String) into a Tuple2[String, String], and then call your function f on it. so for this very simple null comparison you are doing a relatively expensive conversion.

now compare this to a DataFrame that holds 2 columns of type String:

scala> val x: DataFrame = ...
x: org.apache.spark.sql.DataFrame = [x: string, y: string]
scala> x.filter($"y" isNotNull)

spark will parse your expression, and since it understands what you are trying to do, it can apply the logic directly on the InternalRow, which avoids the conversion. this will be faster. of course you pay a price for this: you are forced to use a much more constrained framework to express what you want to do, which can lead to some hair pulling at times.

On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan <gshy2...@gmail.com> wrote:

> Hi Spark-users,
> I came across a few sources which mentioned DataFrame can be more
> efficient than Dataset. I can understand this is true because Dataset
> allows functional transformation which Catalyst cannot look into and hence
> cannot optimize well. But can DataFrame be more efficient than Dataset even
> if we only use the relational transformation on dataset? If so, can anyone
> give some explanation why it is so? Any benchmark comparing dataset vs.
> dataframe? Thank you!
>
> Shiyuan
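
p.s. if you want to see the difference described above for yourself, here is a minimal sketch that prints the physical plan for both filters. it is not a benchmark. the object name, the sample data and the local master setting are made up for illustration, and it assumes spark-sql is on the classpath. the exact plan text varies between spark versions, so no expected output is shown.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// hypothetical name, only for this illustration
object FilterPlans {
  def main(args: Array[String]): Unit = {
    // assumption: a throwaway local session is fine for inspecting plans
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-plans")
      .getOrCreate()
    import spark.implicits._

    // made-up sample data; the second row has a null in the second field
    val data: Seq[(String, String)] = Seq(("a", "b"), ("c", null))
    val ds: Dataset[(String, String)] = data.toDS()

    // typed filter: an opaque function over Tuple2[String, String],
    // so each row has to be turned into a tuple before f is called
    val f: ((String, String)) => Boolean = _._2 != null
    ds.filter(f).explain()

    // untyped filter: a Column expression that catalyst can inspect
    // and evaluate against the internal row directly
    val df: DataFrame = ds.toDF("x", "y")
    df.filter($"y".isNotNull).explain()

    spark.stop()
  }
}
```

comparing the two printed plans should show whether your spark version treats the typed filter as a call to an opaque function while the column version appears as a plain isnotnull predicate.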