There is a JIRA and a prototype that analyze the JVM bytecode of the
black-box closures and convert them into Catalyst expressions:

https://issues.apache.org/jira/browse/SPARK-14083

This could potentially address the issue discussed here.
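
To illustrate the idea (the snippet below only uses the standard Dataset
API, it is not code from the prototype): the first filter is an opaque
Scala closure, so Catalyst has to deserialize every InternalRow into a
Person before calling it, while the second is a Column expression the
optimizer can analyze and push down. The prototype aims to recover the
second form automatically from the bytecode of the first.

import spark.implicits._   // already in scope in spark-shell

case class Person(name: String, age: Long)
val ds = Seq(Person("A", 32), Person("B", 18)).toDS

ds.filter(p => p.age < 20)   // black-box closure, opaque to Catalyst
ds.filter($"age" < 20)       // Catalyst expression, visible to the optimizer
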
Sincerely,

DB Tsai
----------------------------------------------------------
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0


On Sun, Apr 9, 2017 at 11:17 AM, Koert Kuipers <ko...@tresata.com> wrote:
> in this case there is no difference in performance. both will do the
> operation directly on the internal representation of the data (so the
> InternalRow).
>
> also it is worth pointing out that switching back and forth between
> Dataset[X] and DataFrame is free.
>
> On Sun, Apr 9, 2017 at 1:28 PM, Shiyuan <gshy2...@gmail.com> wrote:
>
>> Thank you for the detailed explanation! You point out two reasons why
>> Dataset is not as efficient as DataFrame:
>> 1) Spark cannot look into the lambda and therefore cannot optimize it.
>> 2) The type conversion happens under the hood, e.g. from X to the
>> internal row.
>>
>> Just to check my understanding: some methods of Dataset can also take a
>> SQL expression string instead of a lambda function. In that case, does
>> the type conversion still happen under the hood, so that Dataset is
>> still not as efficient as DataFrame? Here is the code:
>>
>> // define a dataset and a dataframe with the same content, but one is
>> // stored as Dataset[Person], the other as Dataset[Row]
>> scala> case class Person(name: String, age: Long)
>> scala> val ds = Seq(Person("A", 32), Person("B", 18)).toDS
>> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>> scala> val df = Seq(Person("A", 32), Person("B", 18)).toDF
>> df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
>>
>> // Which filtering is more efficient? Both use a SQL expression string.
>> scala> df.filter("age < 20")
>> res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
>> string, age: bigint]
>>
>> scala> ds.filter("age < 20")
>> res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>>
>> On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> how would you use only relational transformations on a Dataset?
>>>
>>> On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan <gshy2...@gmail.com> wrote:
>>>
>>>> Hi Spark users,
>>>> I came across a few sources which mentioned that DataFrame can be
>>>> more efficient than Dataset. I can understand this is true because
>>>> Dataset allows functional transformations which Catalyst cannot look
>>>> into and hence cannot optimize well. But can DataFrame be more
>>>> efficient than Dataset even if we only use relational transformations
>>>> on the Dataset? If so, can anyone explain why? Is there any benchmark
>>>> comparing Dataset vs. DataFrame? Thank you!
>>>>
>>>> Shiyuan
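
A minimal sketch of the points in this thread, assuming a spark-shell
session (so spark and its implicits are available); it is not a benchmark,
just a way to compare the plans:

import spark.implicits._

case class Person(name: String, age: Long)
val ds = Seq(Person("A", 32), Person("B", 18)).toDS

// Switching between the typed and untyped views is essentially free:
val df  = ds.toDF          // Dataset[Person] -> DataFrame (i.e. Dataset[Row])
val ds2 = df.as[Person]    // DataFrame -> Dataset[Person]

// These filters all become Catalyst expressions evaluated directly on the
// internal representation; their physical plans should match:
df.filter("age < 20").explain()
ds.filter("age < 20").explain()
ds.filter($"age" < 20).explain()

// The typed lambda is the case Catalyst cannot look into: each row is
// deserialized into a Person before the function runs.
ds.filter((p: Person) => p.age < 20).explain()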