Re: renaming SchemaRDD -> DataFrame

Michael Malak Tue, 27 Jan 2015 10:02:09 -0800

I personally have no preference DataFrame vs. DataTable, but only wish to lay 
out the history and etymology simply because I'm into that sort of thing.


"Frame" comes from Marvin Minsky's 1970's AI construct: "slots" and the data 
that go in them. The S programming language (precursor to R) adopted this 
terminology in 1991. R of course became popular with the rise of Data Science 
around 2012.
http://www.google.com/trends/explore#q=%22data%20science%22%2C%20%22r%20programming%22&cmpt=q&tz=

"DataFrame" would carry the implication that it comes along with its own 
metadata, whereas "DataTable" might carry the implication that metadata is 
stored in a central metadata repository.

"DataFrame" is thus technically more correct for SchemaRDD, but is a less 
familiar (and thus less accessible) term for those not immersed in data science 
or AI and thus may have narrower appeal.
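
For anyone less familiar with the term, here is a minimal sketch of two ideas 
from this thread: a frame that carries its own metadata, and the SchemaRDD type 
alias suggested in the original proposal quoted below. It assumes toy stand-ins 
only: the DataFrame class, its columns/rows fields, and countRows are 
hypothetical and are not Spark's classes or API.

```scala
// Toy stand-ins, NOT the Spark classes: just enough to illustrate the ideas.
object AliasSketch {
  // Hypothetical minimal "frame": the column names (metadata) travel with the rows.
  final case class DataFrame(columns: Seq[String], rows: Seq[Seq[Any]]) {
    // Similar in spirit to the proposed DataFrame.rdd: expose the underlying rows.
    def rdd: Seq[Seq[Any]] = rows
  }

  // The backward-compatibility idea: the old name becomes an alias for the new type.
  type SchemaRDD = DataFrame

  // Code written against the old type name still compiles unchanged.
  def countRows(data: SchemaRDD): Int = data.rdd.size

  def main(args: Array[String]): Unit = {
    val df = DataFrame(Seq("name", "age"), Seq(Seq("Ann", 31), Seq("Bob", 45)))
    println(countRows(df))
  }
}
```

Note that a Scala type alias only covers the type name; whether a real 
migration would also need to alias a companion object is outside what this 
sketch shows.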


----- Original Message -----
From: Evan R. Sparks <evan.spa...@gmail.com>
To: Matei Zaharia <matei.zaha...@gmail.com>
Cc: Koert Kuipers <ko...@tresata.com>; Michael Malak <michaelma...@yahoo.com>; 
Patrick Wendell <pwend...@gmail.com>; Reynold Xin <r...@databricks.com>; 
"dev@spark.apache.org" <dev@spark.apache.org>
Sent: Tuesday, January 27, 2015 9:55 AM
Subject: Re: renaming SchemaRDD -> DataFrame

I'm +1 on this, although a little worried about unknowingly introducing
SparkSQL dependencies every time someone wants to use this. It would be
great if the interface can be abstract and the implementation (in this
case, SparkSQL backend) could be swapped out.

One alternative suggestion on the name - why not call it DataTable?
DataFrame seems like a name carried over from pandas (and by extension, R),
and it's never been obvious to me what a "Frame" is.



On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> (Actually when we designed Spark SQL we thought of giving it another name,
> like Spark Schema, but we decided to stick with SQL since that was the most
> obvious use case to many users.)
>
> Matei
>
> > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >
> > While it might be possible to move this concept to Spark Core long-term,
> > supporting structured data efficiently does require quite a bit of the
> > infrastructure in Spark SQL, such as query planning and columnar storage.
> > The intent of Spark SQL though is to be more than a SQL server -- it's
> > meant to be a library for manipulating structured data. Since this is
> > possible to build over the core API, it's pretty natural to organize it
> > that way, same as Spark Streaming is a library.
> >
> > Matei
> >
> >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> "The context is that SchemaRDD is becoming a common data format used for
> >> bringing data into Spark from external systems, and used for various
> >> components of Spark, e.g. MLlib's new pipeline API."
> >>
> >> i agree. this to me also implies it belongs in spark core, not sql
> >>
> >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> michaelma...@yahoo.com.invalid> wrote:
> >>
> >>> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> >>> Area Spark Meetup YouTube video contained a wealth of background
> >>> information on this idea (mostly from Patrick and Reynold :-).
> >>>
> >>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>>
> >>> ________________________________
> >>> From: Patrick Wendell <pwend...@gmail.com>
> >>> To: Reynold Xin <r...@databricks.com>
> >>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >>> Sent: Monday, January 26, 2015 4:01 PM
> >>> Subject: Re: renaming SchemaRDD -> DataFrame
> >>>
> >>>
> >>> One thing potentially not clear from this e-mail, there will be a 1:1
> >>> correspondence where you can get an RDD to/from a DataFrame.
> >>>
> >>>
> >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> >>>> Hi,
> >>>>
> >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> >>>> get the community's opinion.
> >>>>
> >>>> The context is that SchemaRDD is becoming a common data format used for
> >>>> bringing data into Spark from external systems, and used for various
> >>>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
> >>>> and more users to be programming directly against the SchemaRDD API
> >>>> rather than the core RDD API. SchemaRDD, through its less commonly used
> >>>> DSL originally designed for writing test cases, has always had a
> >>>> data-frame-like API. In 1.3, we are redesigning the API to make it
> >>>> usable for end users.
> >>>>
> >>>>
> >>>> There are two motivations for the renaming:
> >>>>
> >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>>
> >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> >>>> though it would contain some RDD functions like map, flatMap, etc.), and
> >>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> >>>> Instead, DataFrame.rdd will return the underlying RDD for all RDD
> >>>> methods.
> >>>>
> >>>>
> >>>> My understanding is that very few users program directly against the
> >>>> SchemaRDD API at the moment, because it is not well documented. However,
> >>>> to maintain backward compatibility, we can create a type alias for
> >>>> DataFrame that is still named SchemaRDD. This will maintain source
> >>>> compatibility for Scala. That said, we will have to update all existing
> >>>> materials to use DataFrame rather than SchemaRDD.
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: dev-h...@spark.apache.org

