Thank you for the detailed explanation!  You point out two reasons why
Dataset is not as efficient as DataFrame:
1) Spark cannot look into a lambda, and therefore cannot optimize it.
2) A type conversion occurs under the hood, e.g. from a user type X to the
internal row format.
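
For concreteness, here is a minimal sketch of the two filter styles (using
the Person dataset defined below; the $"age" column syntax relies on the
spark.implicits._ import, which spark-shell provides automatically):

// Typed lambda: a black box to Catalyst, and each row must be
// deserialized into a Person object before the predicate runs.
ds.filter(p => p.age < 20)

// Column expression: parsed into a Catalyst expression tree that the
// optimizer can analyze and run directly on the internal row format.
ds.filter($"age" < 20)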

Just to check my understanding: some Dataset methods can also take a SQL
expression string instead of a lambda function. In that case, does the
type conversion still happen under the hood, so that Dataset is still not
as efficient as DataFrame? Here is the code:

// Define a Dataset and a DataFrame with the same content; one is stored
// as a Dataset[Person], the other as a Dataset[Row].
scala> case class Person(name: String, age: Long)
scala> val ds = Seq(Person("A",32), Person("B", 18)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
scala> val df = Seq(Person("A",32), Person("B", 18)).toDF
df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]

// Which filter is more efficient? Both use a SQL expression string.
scala> df.filter("age < 20")
res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
string, age: bigint]

scala> ds.filter("age < 20")
res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
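
One way to check this empirically is to compare the physical plans with
explain(); if the expression-string filter on ds produces the same plan as
the one on df, with no extra DeserializeToObject/SerializeFromObject steps,
it should not pay the object-conversion cost:

scala> df.filter("age < 20").explain()
scala> ds.filter("age < 20").explain()

// For contrast, a typed lambda filter typically adds deserialize/serialize
// steps to the plan:
scala> ds.filter(_.age < 20).explain()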

On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers <ko...@tresata.com> wrote:

> how would you use only relational transformations on dataset?
>
> On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan <gshy2...@gmail.com> wrote:
>
>> Hi Spark-users,
>>     I came across a few sources which mentioned DataFrame can be more
>> efficient than Dataset.  I can understand this is true because Dataset
>> allows functional transformation which Catalyst cannot look into and hence
>> cannot optimize well. But can DataFrame be more efficient than Dataset even
>> if we only use the relational transformation on dataset? If so, can anyone
>> give some explanation why  it is so? Any benchmark comparing dataset vs.
>> dataframe?   Thank you!
>>
>> Shiyuan
>>
>
>
