Reynold,
But with a type alias we will have the same problem, right?
If the methods don't take a SchemaRDD anymore, we will have to change
our code to migrate from SchemaRDD to DataFrame, unless we have an
implicit conversion between DataFrame and SchemaRDD.
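Something like this, purely as a sketch of what I mean (assuming the two
stayed distinct types; the conversion body is hypothetical):

  import scala.language.implicitConversions

  implicit def dataFrameToSchemaRDD(df: DataFrame): SchemaRDD =
    ???  // however a DataFrame would be wrapped back into a SchemaRDD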



2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

> Dirceu,
>
> That is not possible because one cannot overload return types.
>
> SQLContext.parquetFile (and many other methods) needs to return some type,
> and that type cannot be both SchemaRDD and DataFrame.
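>
> To illustrate (a sketch): these two overloads could not coexist, because
> they differ only in return type, and that does not compile:
>
>   def parquetFile(path: String): SchemaRDD = ???
>   def parquetFile(path: String): DataFrame = ???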
>
> In 1.3, we will create a type alias for DataFrame called SchemaRDD to not
> break source compatibility for Scala.
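>
> Roughly like this (a sketch; the exact placement and annotations are not
> final):
>
>   package object sql {
>     type SchemaRDD = DataFrame
>   }
>
> Since SchemaRDD would then name the same type as DataFrame, existing Scala
> code that refers to SchemaRDD keeps compiling.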
>
>
> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
>
>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>> release 1.5 (+/- 1), for example, with the new code added to DataFrame?
>> That way we wouldn't impact existing code for the next few releases.
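>>
>> For example (illustrative only, not actual Spark code):
>>
>>   @deprecated("Use DataFrame instead", "1.3.0")
>>   class SchemaRDD(val underlying: DataFrame)  // old name kept around, warns on use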
>>
>>
>>
>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:
>>
>> > I want to address the issue that Matei raised about the heavy lifting
>> > required for full SQL support. It is amazing that even after 30 years of
>> > research there is not a single good open source columnar database like
>> > Vertica. There is a column store option in MySQL, but it is not nearly as
>> > sophisticated as Vertica or MonetDB. But there's a true need for such a
>> > system. I wonder why that is, and it's high time to change it.
>> > On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>> >
>> > > Both SchemaRDD and DataFrame sound fine to me, though I like the former
>> > > slightly better because it's more descriptive.
>> > >
>> > > Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
>> > > be clearer from a user-facing perspective to at least choose a package
>> > > name for it that omits "sql".
>> > >
>> > > I would also be in favor of adding a separate Spark Schema module for
>> > > Spark SQL to rely on, but I imagine that might be too large a change at
>> > > this point?
>> > >
>> > > -Sandy
>> > >
>> > > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> > > wrote:
>> > >
>> > > > (Actually when we designed Spark SQL we thought of giving it another
>> > > > name, like Spark Schema, but we decided to stick with SQL since that
>> > > > was the most obvious use case to many users.)
>> > > >
>> > > > Matei
>> > > >
>> > > > > On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > While it might be possible to move this concept to Spark Core
>> > > > > long-term, supporting structured data efficiently does require quite
>> > > > > a bit of the infrastructure in Spark SQL, such as query planning and
>> > > > > columnar storage. The intent of Spark SQL, though, is to be more than
>> > > > > a SQL server -- it's meant to be a library for manipulating
>> > > > > structured data. Since this is possible to build over the core API,
>> > > > > it's pretty natural to organize it that way, just as Spark Streaming
>> > > > > is a library.
>> > > > >
>> > > > > Matei
>> > > > >
>> > > > >> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> "The context is that SchemaRDD is becoming a common data format
>> used
>> > > for
>> > > > >> bringing data into Spark from external systems, and used for
>> various
>> > > > >> components of Spark, e.g. MLlib's new pipeline API."
>> > > > >>
>> > > > >> i agree. this to me also implies it belongs in spark core, not
>> sql
>> > > > >>
>> > > > >> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> > > > >> michaelma...@yahoo.com.invalid> wrote:
>> > > > >>
>> > > > >>> And on the off chance that anyone hasn't seen it yet, the Jan. 13
>> > > > >>> Bay Area Spark Meetup video on YouTube contained a wealth of
>> > > > >>> background information on this idea (mostly from Patrick and
>> > > > >>> Reynold :-).
>> > > > >>>
>> > > > >>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> > > > >>>
>> > > > >>> ________________________________
>> > > > >>> From: Patrick Wendell <pwend...@gmail.com>
>> > > > >>> To: Reynold Xin <r...@databricks.com>
>> > > > >>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>> > > > >>> Sent: Monday, January 26, 2015 4:01 PM
>> > > > >>> Subject: Re: renaming SchemaRDD -> DataFrame
>> > > > >>>
>> > > > >>>
>> > > > >>> One thing potentially not clear from this e-mail: there will be a
>> > > > >>> 1:1 correspondence where you can get an RDD to/from a DataFrame.
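>> > > > >>>
>> > > > >>> A sketch of what that round trip could look like (the
>> > > > >>> createDataFrame name here is an assumption about the 1.3 API):
>> > > > >>>
>> > > > >>>   val rows: RDD[Row] = df.rdd                           // DataFrame -> RDD
>> > > > >>>   val df2: DataFrame = sqlContext.createDataFrame(rows, df.schema)  // RDD -> DataFrame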
>> > > > >>>
>> > > > >>>
>> > > > >>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com>
>> > > > >>> wrote:
>> > > > >>>> Hi,
>> > > > >>>>
>> > > > >>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
>> > > > >>>> wanted to get the community's opinion.
>> > > > >>>>
>> > > > >>>> The context is that SchemaRDD is becoming a common data format used
>> > > > >>>> for bringing data into Spark from external systems, and used for
>> > > > >>>> various components of Spark, e.g. MLlib's new pipeline API. We also
>> > > > >>>> expect more and more users to be programming directly against the
>> > > > >>>> SchemaRDD API rather than the core RDD API. SchemaRDD, through its
>> > > > >>>> less commonly used DSL originally designed for writing test cases,
>> > > > >>>> has always had a data-frame-like API. In 1.3, we are redesigning
>> > > > >>>> the API to make it usable for end users.
>> > > > >>>>
>> > > > >>>> There are two motivations for the renaming:
>> > > > >>>>
>> > > > >>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>> > > > >>>>
>> > > > >>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore
>> > > > >>>> (even though it would contain some RDD functions like map, flatMap,
>> > > > >>>> etc.), and calling it Schema*RDD* while it is not an RDD is highly
>> > > > >>>> confusing. Instead, DataFrame.rdd will return the underlying RDD
>> > > > >>>> for all RDD methods.
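>> > > > >>>>
>> > > > >>>> For example, as a sketch of the intended usage:
>> > > > >>>>
>> > > > >>>>   val df: DataFrame = sqlContext.parquetFile("people.parquet")
>> > > > >>>>   val names: RDD[String] = df.rdd.map(_.getString(0))  // drop down to the RDD API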
>> > > > >>>>
>> > > > >>>> My understanding is that very few users program directly against
>> > > > >>>> the SchemaRDD API at the moment, because it is not well documented.
>> > > > >>>> However, to maintain backward compatibility, we can create a type
>> > > > >>>> alias for DataFrame that is still named SchemaRDD. This will
>> > > > >>>> maintain source compatibility for Scala. That said, we will have to
>> > > > >>>> update all existing materials to use DataFrame rather than
>> > > > >>>> SchemaRDD.
>> > > > >>>
>> > > > >>>
>> > > > >>>
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>
>
>
