Re: Why dataframe can be more efficient than dataset?

Koert Kuipers Sun, 09 Apr 2017 11:18:07 -0700

in this case there is no difference in performance. both will do the
operation directly on the internal representation of the data (so the
InternalRow).


also it is worth pointing out that switching back and forth between
Dataset[X] and DataFrame is free.

On Sun, Apr 9, 2017 at 1:28 PM, Shiyuan <gshy2...@gmail.com> wrote:

> Thank you for the detailed explanation!  You point out two reasons why
> Dataset is not as efficeint as dataframe:
> 1). Spark cannot look into lambda and therefore cannot optimize.
> 2). The  type conversion  occurs under the hood, eg. from X to internal
> row.
>
> Just to check my understanding,  some method of Dataset can also take sql
> expression string  instead of lambda function, in this case, Is it  the
> type conversion still happens under the hood and therefore Dataset is still
> not as efficient as DataFrame.  Here is the code,
>
> //define a dataset and a dataframe, same content, but one is stored as
> Dataset<Person>, the other is Dataset<Row>
> scala> case class Person(name: String, age: Long)
> scala> val ds = Seq(Person("A",32), Person("B", 18)).toDS
> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
> scala> val df = Seq(Person("A",32), Person("B", 18)).toDF
> df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
>
> //Which filtering is more efficient? both use sql expression string.
> scala> df.filter("age < 20")
> res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
> string, age: bigint]
>
> scala> ds.filter("age < 20")
> res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>
>
>
>
>
>
>
>
> On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> how would you use only relational transformations on dataset?
>>
>> On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>
>>> Hi Spark-users,
>>>     I came across a few sources which mentioned DataFrame can be more
>>> efficient than Dataset.  I can understand this is true because Dataset
>>> allows functional transformation which Catalyst cannot look into and hence
>>> cannot optimize well. But can DataFrame be more efficient than Dataset even
>>> if we only use the relational transformation on dataset? If so, can anyone
>>> give some explanation why  it is so? Any benchmark comparing dataset vs.
>>> dataframe?   Thank you!
>>>
>>> Shiyuan
>>>
>>
>>
>

Re: Why dataframe can be more efficient than dataset?

Reply via email to