There is a JIRA and a prototype that analyze the JVM bytecode of the
closures (the black box) and convert them into Catalyst expressions.

https://issues.apache.org/jira/browse/SPARK-14083

This could potentially address the issue discussed here.
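
To illustrate the idea (a hypothetical sketch, not the actual SPARK-14083
implementation; it assumes a Dataset[Person] named ds with an "age" field):

import org.apache.spark.sql.functions.col

// Today the lambda below is a black box to Catalyst:
ds.filter(p => p.age < 20)

// The bytecode analysis would conceptually rewrite it into an equivalent
// Catalyst expression, which Catalyst can optimize (pushdown, codegen):
ds.filter(col("age") < 20)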


Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Sun, Apr 9, 2017 at 11:17 AM, Koert Kuipers <ko...@tresata.com> wrote:

> in this case there is no difference in performance. both will do the
> operation directly on the internal representation of the data (so the
> InternalRow).
>
> also it is worth pointing out that switching back and forth between
> Dataset[X] and DataFrame is free.
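>
> for example (a minimal sketch, using the ds and df defined in the quoted
> mail below, with spark.implicits._ in scope as in the shell): the
> conversions below only change the static type; no data is copied or
> re-encoded:
>
> val asDF = ds.toDF()       // Dataset[Person] -> DataFrame (= Dataset[Row])
> val asDS = df.as[Person]   // DataFrame -> Dataset[Person]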
>
> On Sun, Apr 9, 2017 at 1:28 PM, Shiyuan <gshy2...@gmail.com> wrote:
>
>> Thank you for the detailed explanation!  You point out two reasons why
>> Dataset is not as efficient as DataFrame:
>> 1). Spark cannot look into a lambda and therefore cannot optimize it.
>> 2). The type conversion occurs under the hood, e.g. from X to an internal
>> row (as sketched below).
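>>
>> To make (1) and (2) concrete (a sketch, using the ds defined below): a
>> lambda filter deserializes each internal row into a Person before the
>> predicate runs, while a Column predicate operates on the internal
>> representation directly:
>>
>> ds.filter(p => p.age < 20)   // opaque lambda; rows deserialized to Person
>> ds.filter($"age" < 20)       // Catalyst expression; no deserialization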
>>
>> Just to check my understanding: some methods of Dataset can also take a
>> SQL expression string instead of a lambda function. In that case, does
>> the type conversion still happen under the hood, leaving Dataset still
>> less efficient than DataFrame?  Here is the code:
>>
>> // Define a Dataset and a DataFrame with the same content; one is typed
>> // as Dataset[Person], the other as Dataset[Row] (i.e. a DataFrame).
>> scala> case class Person(name: String, age: Long)
>> scala> val ds = Seq(Person("A",32), Person("B", 18)).toDS
>> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>> scala> val df = Seq(Person("A",32), Person("B", 18)).toDF
>> df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
>>
>> // Which filter is more efficient? Both use a SQL expression string.
>> scala> df.filter("age < 20")
>> res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
>> string, age: bigint]
>>
>> scala> ds.filter("age < 20")
>> res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
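>>
>> One way to check (a sketch; the plan output varies by Spark version) is
>> to compare the physical plans, which should match if the string
>> predicate is parsed into the same Catalyst expression in both cases:
>>
>> scala> df.filter("age < 20").explain()
>> scala> ds.filter("age < 20").explain()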
>>
>> On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> how would you use only relational transformations on dataset?
>>>
>>> On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>>
>>>> Hi Spark-users,
>>>>     I came across a few sources which mention that DataFrame can be
>>>> more efficient than Dataset. I can understand this is true because
>>>> Dataset allows functional transformations, which Catalyst cannot look
>>>> into and hence cannot optimize well. But can DataFrame be more
>>>> efficient than Dataset even if we use only relational transformations
>>>> on the Dataset? If so, can anyone explain why? Is there any benchmark
>>>> comparing Dataset vs. DataFrame?  Thank you!
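>>>>
>>>> By "relational transformations" I mean Column-based operations only,
>>>> e.g. (a sketch, assuming a case class Person(name: String, age: Long),
>>>> a Dataset[Person] named ds, and spark.implicits._ in scope):
>>>>
>>>> ds.filter($"age" < 20)        // relational filter, no lambda
>>>> ds.select($"name")            // relational projection
>>>> ds.groupBy($"name").count()   // relational aggregation
>>>>
>>>> as opposed to functional ones like ds.map(p => p.age + 1), whose
>>>> lambda Catalyst cannot inspect.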
>>>>
>>>> Shiyuan
>>>>
>>>
>>>
>>
>
