let me try that again. i left some crap at the bottom of my previous email
as i was editing it. sorry about that. here it goes:

it is because you use Dataset[X] but the actual computations are still done
in Dataset[Row] (so DataFrame). well... the actual computations are done in
RDD[InternalRow] with spark's internal types to represent String, Map, Seq,
structs, etc.

so for example if you do:
scala> val x: Dataset[(String, String)] = ...
scala> val f: (String, String) => Boolean = _._2 != null
scala> x.filter(f)

in this case you are using a lambda function for the filter. this is a
black-box operation to spark (spark cannot see what is inside the
function). so spark will now convert the internal representation it is
actually using (something like an InternalRow of size 2 with inside of it
two objects of type UTF8String) into a Tuple2[String, String], and then
call your function f on it. so for this very simply null comparison you are
doing a relatively expensive conversion.

now compare this to if you have a DataFrame that holds 2 columns of type
String.
scala> val x: DataFrame = ...
x: org.apache.spark.sql.DataFrame = [x: string, y: string]
scala> x.filter($"y" isNotNull)

spark will parse your expression, and since it has an understanding of what
you are trying to do, it can apply the logic directly on the InternalRow,
which avoids the conversion. this will be faster. of course you pay the
price for this in that you are forced to use a much more constrained
framework to express what you want to do, which can lead to some hair
pulling at times.

On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan <gshy2...@gmail.com> wrote:

> Hi Spark-users,
>     I came across a few sources which mentioned DataFrame can be more
> efficient than Dataset.  I can understand this is true because Dataset
> allows functional transformation which Catalyst cannot look into and hence
> cannot optimize well. But can DataFrame be more efficient than Dataset even
> if we only use the relational transformation on dataset? If so, can anyone
> give some explanation why  it is so? Any benchmark comparing dataset vs.
> dataframe?   Thank you!
>
> Shiyuan
>

Reply via email to