Hi Arun,

This documentation may be helpful:
The 2.0-preview Scala doc for the Dataset class:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Dataset

Note that the Dataset API has changed completely since 1.6. In 2.0 there is
no separate DataFrame class; rather, DataFrame is a type alias, defined here:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.package@DataFrame=org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

    type DataFrame = Dataset[Row]

Unlike in 1.6, a DataFrame is now just a specific Dataset[T] with T = Row, so
a DataFrame shares the same methods as Dataset. As mentioned earlier, this
unification is only available in Scala and Java. A couple of quick sketches
after the quoted thread below make this concrete.

Xinh

On Tue, Jun 14, 2016 at 10:45 AM, Michael Armbrust <mich...@databricks.com>
wrote:

>> 1) What does this really mean to an Application developer?
>
> It means there are fewer concepts to learn.
>
>> 2) Why was this unification needed in Spark 2.0?
>
> To simplify the API and reduce the number of concepts that need to be
> learned. We only didn't do it in 1.6 because we didn't want to break
> binary compatibility in a minor release.
>
>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>
> There is no DataFrame class; all methods are still available, except those
> that returned an RDD (now you can call df.rdd.map if that is still what
> you want).
>
>> 4) Will compile-time safety be there for DataFrames too?
>
> Slide 7
>
>> 5) Is the Python API supported for Datasets in 2.0?
>
> Slide 10
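To make the unification concrete, here is a minimal, untested sketch against
the 2.0-preview API. The Person case class, the column names, and the local
master setting are all made up for illustration, not from the docs above:

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder()
      .appName("unified-api")
      .master("local[*]")   // local mode just for the example
      .getOrCreate()
    import spark.implicits._

    // A DataFrame is just Dataset[Row]: untyped rows.
    val df = Seq(Person("Ann", 32), Person("Bob", 25)).toDF()

    // The same data viewed as a typed Dataset[Person].
    val ds: Dataset[Person] = df.as[Person]

    // Both sides expose the same Dataset methods.
    df.filter($"age" > 30).show()
    ds.filter(_.age > 30).show()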
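On point 3, since the RDD-returning methods are gone, you drop to the
underlying RDD explicitly. Roughly (again a sketch, reusing the df above):

    // map over the rows via the explicit .rdd handle
    val names = df.rdd.map(row => row.getAs[String]("name"))
    names.collect().foreach(println)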
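And on point 4, the practical difference in checking, as I understand it
(illustrative only; "aeg" is a deliberate typo, and ds/df come from the
sketch above):

    ds.map(_.age + 1)     // typed: a bad field name would not compile
    df.select($"aeg")     // untyped: compiles, fails only at runtime
    // ds.map(_.aeg + 1)  // does not compile: aeg is not a member of Person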