hey matei, i think that stuff such as SchemaRDD, columnar storage, and perhaps also query planning can be re-used by many systems that do analysis on structured data. i can imagine pandas-like systems, but also datalog or scalding-like ones (we use the latter at tresata, and i might rebase it on SchemaRDD at some point). SchemaRDD should become the interface for all of these, and the columnar storage abstractions should be shared between them as well.

currently the sql tie-in goes way beyond the (perhaps unfortunate) naming convention. for example, a core part of the SchemaRDD abstraction is Row, which is org.apache.spark.sql.catalyst.expressions.Row, forcing anyone who wants to build on top of SchemaRDD to dig into catalyst, the SQL parser and optimizer (if i understand it correctly; i have not used catalyst, but it looks neat). i should not need to pull in a SQL parser just to use structured data in, say, a pandas-like framework.
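to make that concrete, here is a minimal sketch of what building on SchemaRDD looks like today (against the 1.2 layout as i understand it; the column index and field type are made up for illustration):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SchemaRDD
    // the Row you map over comes from sql internals, not core:
    import org.apache.spark.sql.catalyst.expressions.Row

    // a pandas-like projection: no SQL anywhere in sight
    def names(people: SchemaRDD): RDD[String] =
      people.map((row: Row) => row.getString(0)) // assumes column 0 is a string

no sql gets written, but the catalyst dependency comes along anyway.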
best, koert

On Mon, Jan 26, 2015 at 8:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> While it might be possible to move this concept to Spark Core long-term,
> supporting structured data efficiently does require quite a bit of the
> infrastructure in Spark SQL, such as query planning and columnar storage.
> The intent of Spark SQL though is to be more than a SQL server -- it's
> meant to be a library for manipulating structured data. Since this is
> possible to build over the core API, it's pretty natural to organize it
> that way, same as Spark Streaming is a library.
>
> Matei
>
> > On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
> >
> > "The context is that SchemaRDD is becoming a common data format used for
> > bringing data into Spark from external systems, and used for various
> > components of Spark, e.g. MLlib's new pipeline API."
> >
> > i agree. this to me also implies it belongs in spark core, not sql
> >
> > On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak
> > <michaelma...@yahoo.com.invalid> wrote:
> >
> >> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area
> >> Spark Meetup YouTube video contained a wealth of background information on
> >> this idea (mostly from Patrick and Reynold :-).
> >>
> >> https://www.youtube.com/watch?v=YWppYPWznSQ
> >>
> >> ________________________________
> >> From: Patrick Wendell <pwend...@gmail.com>
> >> To: Reynold Xin <r...@databricks.com>
> >> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >> Sent: Monday, January 26, 2015 4:01 PM
> >> Subject: Re: renaming SchemaRDD -> DataFrame
> >>
> >> One thing potentially not clear from this e-mail: there will be a 1:1
> >> correspondence where you can get an RDD to/from a DataFrame.
> >>
> >> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
> >>> Hi,
> >>>
> >>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
> >>> get the community's opinion.
> >>>
> >>> The context is that SchemaRDD is becoming a common data format used for
> >>> bringing data into Spark from external systems, and used for various
> >>> components of Spark, e.g. MLlib's new pipeline API. We also expect more
> >>> and more users to be programming directly against the SchemaRDD API
> >>> rather than the core RDD API. SchemaRDD, through its less commonly used
> >>> DSL originally designed for writing test cases, has always had a
> >>> data-frame-like API. In 1.3, we are redesigning the API to make it
> >>> usable for end users.
> >>>
> >>> There are two motivations for the renaming:
> >>>
> >>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
> >>>
> >>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> >>> though it would contain some RDD functions like map, flatMap, etc.), and
> >>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> >>> Instead, DataFrame.rdd will return the underlying RDD for all RDD
> >>> methods.
> >>>
> >>> My understanding is that very few users program directly against the
> >>> SchemaRDD API at the moment, because it is not well documented. However,
> >>> to maintain backward compatibility, we can create a type alias SchemaRDD
> >>> that points to DataFrame. This will maintain source compatibility for
> >>> Scala. That said, we will have to update all existing materials to use
> >>> DataFrame rather than SchemaRDD.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org