Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Xinh Huynh
Hi Arun,

This documentation may be helpful:

The 2.0-preview Scala doc for the Dataset class:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Dataset
Note that the Dataset API has completely changed from 1.6.

In 2.0, there is no separate DataFrame class. Rather, DataFrame is a type
alias, defined here:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.package@DataFrame=org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
"type DataFrame = Dataset

[Row

]"
Unlike in 1.6, a DataFrame is now simply a Dataset[T] where T = Row, so a
DataFrame shares the same methods as any other Dataset.
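
A minimal sketch of what this looks like in a 2.0 spark-shell (using the
shell's built-in SparkSession, named "spark"; the Person case class is just
a made-up example):

    import org.apache.spark.sql.{DataFrame, Dataset, Row}
    import spark.implicits._

    case class Person(name: String, age: Long)

    val ds: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

    // toDF() just re-types the same data as Dataset[Row]; there is no
    // separate DataFrame class behind it
    val df: DataFrame = ds.toDF()
    val sameThing: Dataset[Row] = df   // compiles, since DataFrame = Dataset[Row]

    df.filter("age > 26").show()       // same methods either way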

As mentioned earlier, this unification is only available in Scala and Java.

Xinh

On Tue, Jun 14, 2016 at 10:45 AM, Michael Armbrust wrote:

>> 1) What does this really mean to an application developer?
>>
>
> It means there are fewer concepts to learn.
>
>
>> 2) Why was this unification needed in Spark 2.0?
>>
>
> To simplify the API and reduce the number of concepts that needed to be
> learned.  We didn't do it in 1.6 only because we didn't want to break
> binary compatibility in a minor release.
>
>
>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>>
>
> There is no DataFrame class; all methods are still available, except those
> that returned an RDD (now you can call df.rdd.map if that is still what you
> want).
>
>
>> 4) Will compile-time safety be there for DataFrames too?
>>
>
> Slide 7
>
>
>> 5) Is the Python API supported for Datasets in 2.0?
>>
>
> Slide 10
>


Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Michael Armbrust
>
> 1) What does this really mean to an application developer?
>

It means there are fewer concepts to learn.


> 2) Why was this unification needed in Spark 2.0?
>

To simplify the API and reduce the number of concepts that needed to be
learned.  We didn't do it in 1.6 only because we didn't want to break
binary compatibility in a minor release.


> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>

There is no DataFrame class; all methods are still available, except those
that returned an RDD (now you can call df.rdd.map if that is still what you
want).
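
To make that concrete, a small sketch of the change, assuming a 2.0
spark-shell with its built-in SparkSession named "spark":

    import spark.implicits._

    val df = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

    // 1.6: df.map(...) returned an RDD; 2.0: map stays in the Dataset world
    val upper = df.map(row => row.getString(0).toUpperCase)   // Dataset[String]

    // to get an RDD back, drop down explicitly via .rdd
    val names = df.rdd.map(row => row.getString(0))           // RDD[String]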


> 4) Will compile-time safety be there for DataFrames too?
>

Slide 7
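
For readers without the slides: in 2.0, compile-time safety comes with the
typed Dataset[T] API; a DataFrame, being a Dataset[Row], stays untyped. A
minimal sketch, assuming a 2.0 spark-shell and a made-up Person case class:

    import spark.implicits._

    case class Person(name: String, age: Long)

    val ds = Seq(Person("Ann", 30)).toDS()   // Dataset[Person] -- typed
    val df = ds.toDF()                       // Dataset[Row]    -- untyped

    ds.map(p => p.age + 1)   // typed: a typo such as p.agee is a compile error
    df.select("age")         // untyped: df.select("agee") would compile,
                             // but fail at runtime with an AnalysisException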


> 5) Is the Python API supported for Datasets in 2.0?
>

Slide 10


Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Arun Patel
Can anyone answer these questions, please?



On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel wrote:

> Thanks Michael.
>
> I already went through these slides and could not find answers to these
> specific questions.
>
> I created a Dataset and converted it to a DataFrame in both 1.6 and 2.0, and
> I don't see any difference between the two.  That is why I got confused and
> asked these questions about the unification.
>
> I would appreciate it if you could answer these specific questions.  Thank
> you very much!
>
> On Mon, Jun 13, 2016 at 2:55 PM, Michael Armbrust wrote:
>
>> Here's a talk I gave on the topic:
>>
>> https://www.youtube.com/watch?v=i7l3JQRx7Qw
>>
>> http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust
>>
>> On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel wrote:
>>
>>> In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply
>>> an alias for a Dataset of type Row. I have a few questions.
>>>
>>> 1) What does this really mean to an application developer?
>>> 2) Why was this unification needed in Spark 2.0?
>>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>>> 4) Will compile-time safety be there for DataFrames too?
>>> 5) Is the Python API supported for Datasets in 2.0?
>>>
>>> Thanks
>>> Arun
>>>
>>
>>
>


Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
Thanks Michael.

I already went through these slides and could not find answers to these
specific questions.

I created a Dataset and converted it to a DataFrame in both 1.6 and 2.0, and
I don't see any difference between the two.  That is why I got confused and
asked these questions about the unification.

I would appreciate it if you could answer these specific questions.  Thank you very much!

On Mon, Jun 13, 2016 at 2:55 PM, Michael Armbrust wrote:

> Here's a talk I gave on the topic:
>
> https://www.youtube.com/watch?v=i7l3JQRx7Qw
>
> http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust
>
> On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel wrote:
>
>> In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply an
>> alias for a Dataset of type Row. I have a few questions.
>>
>> 1) What does this really mean to an application developer?
>> 2) Why was this unification needed in Spark 2.0?
>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>> 4) Will compile-time safety be there for DataFrames too?
>> 5) Is the Python API supported for Datasets in 2.0?
>>
>> Thanks
>> Arun
>>
>
>


Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Michael Armbrust
Here's a talk I gave on the topic:

https://www.youtube.com/watch?v=i7l3JQRx7Qw
http://www.slideshare.net/SparkSummit/structuring-spark-dataframes-datasets-and-streaming-by-michael-armbrust

On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel wrote:

> In Spark 2.0, DataFrames and Datasets are unified: DataFrame is simply an
> alias for a Dataset of type Row. I have a few questions.
>
> 1) What does this really mean to an application developer?
> 2) Why was this unification needed in Spark 2.0?
> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
> 4) Will compile-time safety be there for DataFrames too?
> 5) Is the Python API supported for Datasets in 2.0?
>
> Thanks
> Arun
>