Hi Arun,

This documentation may be helpful:
The 2.0-preview Scala doc for the Dataset class:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Dataset

Note that the Dataset API has changed completely since 1.6. In 2.0 there is
no separate DataFrame class; rather, DataFrame is a type alias, defined here:
http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.package@DataFrame=org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]

    type DataFrame = Dataset[Row]

Unlike in 1.6, a DataFrame is now just a specific Dataset[T] with T = Row, so
a DataFrame shares the same methods as Dataset. As mentioned earlier, this
unification is only available in Scala and Java. A couple of quick sketches
after the quoted thread below make this concrete.

Xinh

On Tue, Jun 14, 2016 at 10:45 AM, Michael Armbrust <mich...@databricks.com>
wrote:

>> 1) What does this really mean to an Application developer?
>
> It means there are fewer concepts to learn.
>
>> 2) Why was this unification needed in Spark 2.0?
>
> To simplify the API and reduce the number of concepts that need to be
> learned. We only didn't do it in 1.6 because we didn't want to break
> binary compatibility in a minor release.
>
>> 3) What changes can be observed in Spark 2.0 vs Spark 1.6?
>
> There is no DataFrame class; all methods are still available, except those
> that returned an RDD (now you can call df.rdd.map if that is still what
> you want).
>
>> 4) Will compile-time safety be there for DataFrames too?
>
> Slide 7
>
>> 5) Is the Python API supported for Datasets in 2.0?
>
> Slide 10
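To make the unification concrete, here is a minimal, untested sketch against
the 2.0-preview API. The Person case class, the column names, and the local
master setting are all made up for illustration, not from the docs above:

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder()
      .appName("unified-api")
      .master("local[*]")   // local mode just for the example
      .getOrCreate()
    import spark.implicits._

    // A DataFrame is just Dataset[Row]: untyped rows.
    val df = Seq(Person("Ann", 32), Person("Bob", 25)).toDF()

    // The same data viewed as a typed Dataset[Person].
    val ds: Dataset[Person] = df.as[Person]

    // Both sides expose the same Dataset methods.
    df.filter($"age" > 30).show()
    ds.filter(_.age > 30).show()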
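On point 3, since the RDD-returning methods are gone, you drop to the
underlying RDD explicitly. Roughly (again a sketch, reusing the df above):

    // map over the rows via the explicit .rdd handle
    val names = df.rdd.map(row => row.getAs[String]("name"))
    names.collect().foreach(println)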
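And on point 4, the practical difference in checking, as I understand it
(illustrative only; "aeg" is a deliberate typo, and ds/df come from the
sketch above):

    ds.map(_.age + 1)     // typed: a bad field name would not compile
    df.select($"aeg")     // untyped: compiles, fails only at runtime
    // ds.map(_.aeg + 1)  // does not compile: aeg is not a member of Person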