Re: renaming SchemaRDD -> DataFrame

Evan Chan Wed, 28 Jan 2015 16:42:42 -0800

I believe that most DataFrame implementations out there, like Pandas,
support the idea of missing values / NA, and some support the idea of
Not Meaningful as well.


Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.
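
To make the distinction concrete, here is a minimal sketch of the kind of
model I have in mind. It is plain Scala and does not use Spark's Row at all;
the names Cell, Value, NA, NotMeaningful and MissingValues are hypothetical
and only illustrate the idea of telling "unknown" apart from "does not apply".

```scala
// Hypothetical model of a cell that may be missing; none of these names
// come from Spark SQL.
sealed trait Cell[+A]
final case class Value[A](get: A) extends Cell[A]
case object NA extends Cell[Nothing]            // a value exists but is unknown
case object NotMeaningful extends Cell[Nothing] // no value applies here

object MissingValues {
  // Mean over present values only; NA and NotMeaningful cells are skipped.
  // Returns None when no cell carries a value.
  def mean(cells: Seq[Cell[Double]]): Option[Double] = {
    val present = cells.collect { case Value(v) => v }
    if (present.isEmpty) None else Some(present.sum / present.size)
  }
}
```

An aggregation such as MissingValues.mean can then skip both kinds of
missing cell, while other code can still treat them differently.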

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:
> It shouldn't change the data source api at all because data sources create
> RDD[Row], and that gets converted into a DataFrame automatically (previously
> to SchemaRDD).
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
> One thing that will break the data source API in 1.3 is the location of
> types. Types were previously defined in sql.catalyst.types, and have now
> moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all
> public APIs have first-class classes/objects defined in sql directly.
>
>
>
> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>
>> Hey guys,
>>
>> How does this impact the data sources API?  I was planning on using
>> this for a project.
>>
>> +1 that many things from spark-sql / DataFrame are universally
>> desirable and useful.
>>
>> By the way, one thing that prevents the columnar compression stuff in
>> Spark SQL from being more useful is, at least from previous talks with
>> Reynold and Michael et al., that the format was not designed for
>> persistence.
>>
>> I have a new project that aims to change that.  It is a
>> zero-serialisation, high performance binary vector library, designed
>> from the outset to be persistent-storage friendly.  Maybe one day
>> it can replace the Spark SQL columnar compression.
>>
>> Michael told me this would be a lot of work, and recreates parts of
>> Parquet, but I think it's worth it.  LMK if you'd like more details.
>>
>> -Evan
>>
>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:
>> > Alright I have merged the patch (https://github.com/apache/spark/pull/4173)
>> > since I don't see any strong opinions against it (as a matter of fact
>> > most were for it). We can still change it if somebody lays out a strong
>> > argument.
>> >
>> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > <matei.zaha...@gmail.com>
>> > wrote:
>> >
>> >> The type alias means your methods can specify either type and they will
>> >> work. It's just another name for the same type. But Scaladocs and such
>> >> will show DataFrame as the type.
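
The alias behaviour described above can be sketched in plain Scala. This is
only an illustration under stated assumptions: the DataFrame class here is a
hypothetical stand-in, and Compat, columnCount, load and AliasDemo are made-up
names, not Spark APIs.

```scala
// Hypothetical stand-in for the real class; only the alias mechanism matters.
class DataFrame(val columns: Seq[String])

object Compat {
  // The alias: SchemaRDD is just another name for DataFrame.
  type SchemaRDD = DataFrame

  // Existing user code written against the old name...
  def columnCount(data: SchemaRDD): Int = data.columns.size

  // ...and a library method that now declares DataFrame as its return type.
  def load(): DataFrame = new DataFrame(Seq("id", "name"))
}

object AliasDemo {
  import Compat._
  // A DataFrame can be assigned to a SchemaRDD and passed where one is
  // expected, with no conversion, because both names denote the same type.
  val old: SchemaRDD = load()
  val n: Int = columnCount(load())
}
```

Because the two names refer to one type, no implicit conversion is involved;
code using either name compiles against the same methods.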
>> >>
>> >> Matei
>> >>
>> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> dirceu.semigh...@gmail.com> wrote:
>> >> >
>> >> > Reynold,
>> >> > But with a type alias we will have the same problem, right?
>> >> > If the methods don't receive SchemaRDD anymore, we will have to change
>> >> > our code to migrate from SchemaRDD to DataFrame, unless we have an
>> >> > implicit conversion between DataFrame and SchemaRDD.
>> >> >
>> >> >
>> >> >
>> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:
>> >> >
>> >> >> Dirceu,
>> >> >>
>> >> >> That is not possible because one cannot overload return types.
>> >> >>
>> >> >> SQLContext.parquetFile (and many other methods) needs to return some
>> >> >> type, and that type cannot be both SchemaRDD and DataFrame.
>> >> >>
>> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
>> >> >> to not break source compatibility for Scala.
>> >> >>
>> >> >>
>> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> >> dirceu.semigh...@gmail.com> wrote:
>> >> >>
>> >> >>> Can't SchemaRDD remain the same, but deprecated, and be removed in
>> >> >>> release 1.5 (+/- 1), for example, with the new code being added to
>> >> >>> DataFrame? That way, we don't impact existing code for the next few
>> >> >>> releases.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:
>> >> >>>
>> >> >>>> I want to address the issue that Matei raised about the heavy
>> >> >>>> lifting required for full SQL support. It is amazing that even
>> >> >>>> after 30 years of research there is not a single good open source
>> >> >>>> columnar database like Vertica. There is a column store option in
>> >> >>>> MySQL, but it is not nearly as sophisticated as Vertica or MonetDB.
>> >> >>>> But there's a true need for such a system. I wonder why that is,
>> >> >>>> and it's high time to change that.
>> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com>
>> >> wrote:
>> >> >>>>
>> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like the
>> >> >>>>> former slightly better because it's more descriptive.
>> >> >>>>>
>> >> >>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers, it
>> >> >>>>> would be clearer from a user-facing perspective to at least choose
>> >> >>>>> a package name for it that omits "sql".
>> >> >>>>>
>> >> >>>>> I would also be in favor of adding a separate Spark Schema module
>> >> >>>>> for Spark SQL to rely on, but I imagine that might be too large a
>> >> >>>>> change at this point?
>> >> >>>>>
>> >> >>>>> -Sandy
>> >> >>>>>
>> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >> >>> matei.zaha...@gmail.com>
>> >> >>>>> wrote:
>> >> >>>>>
>> >> >>>>>> (Actually when we designed Spark SQL we thought of giving it
>> >> >>>>>> another name, like Spark Schema, but we decided to stick with SQL
>> >> >>>>>> since that was the most obvious use case to many users.)
>> >> >>>>>>
>> >> >>>>>> Matei
>> >> >>>>>>
>> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >> >>> matei.zaha...@gmail.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>
>> >> >>>>>>> While it might be possible to move this concept to Spark Core
>> >> >>>>>>> long-term, supporting structured data efficiently does require
>> >> >>>>>>> quite a bit of the infrastructure in Spark SQL, such as query
>> >> >>>>>>> planning and columnar storage. The intent of Spark SQL though is
>> >> >>>>>>> to be more than a SQL server -- it's meant to be a library for
>> >> >>>>>>> manipulating structured data. Since this is possible to build
>> >> >>>>>>> over the core API, it's pretty natural to organize it that way,
>> >> >>>>>>> same as Spark Streaming is a library.
>> >> >>>>>>>
>> >> >>>>>>> Matei
>> >> >>>>>>>
>> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com>
>> >> >>>> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data format
>> >> >>>>>>>> used for bringing data into Spark from external systems, and used
>> >> >>>>>>>> for various components of Spark, e.g. MLlib's new pipeline API."
>> >> >>>>>>>>
>> >> >>>>>>>> i agree. this to me also implies it belongs in spark core, not sql
>> >> >>>>>>>>
>> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>> >> >>>>>>>> michaelma...@yahoo.com.invalid> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>>> And on the off chance that anyone hasn't seen it yet, the Jan. 13
>> >> >>>>>>>>> Bay Area Spark Meetup YouTube video contained a wealth of
>> >> >>>>>>>>> background information on this idea (mostly from Patrick and
>> >> >>>>>>>>> Reynold :-).
>> >> >>>>>>>>>
>> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>> >> >>>>>>>>>
>> >> >>>>>>>>> ________________________________
>> >> >>>>>>>>> From: Patrick Wendell <pwend...@gmail.com>
>> >> >>>>>>>>> To: Reynold Xin <r...@databricks.com>
>> >> >>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> One thing potentially not clear from this e-mail: there will be
>> >> >>>>>>>>> a 1:1 correspondence where you can get an RDD to/from a DataFrame.
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>> >> >>> r...@databricks.com>
>> >> >>>>>> wrote:
>> >> >>>>>>>>>> Hi,
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3, and
>> >> >>>>>>>>>> wanted to get the community's opinion.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data format
>> >> >>>>>>>>>> used for bringing data into Spark from external systems, and
>> >> >>>>>>>>>> used for various components of Spark, e.g. MLlib's new pipeline
>> >> >>>>>>>>>> API. We also expect more and more users to be programming
>> >> >>>>>>>>>> directly against the SchemaRDD API rather than the core RDD
>> >> >>>>>>>>>> API. SchemaRDD, through its less commonly used DSL originally
>> >> >>>>>>>>>> designed for writing test cases, has always had a
>> >> >>>>>>>>>> data-frame-like API. In 1.3, we are redesigning the API to make
>> >> >>>>>>>>>> it usable for end users.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> There are two motivations for the renaming:
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>> >> >>>>>>>>>> anymore (even though it would contain some RDD functions like
>> >> >>>>>>>>>> map, flatMap, etc), and calling it Schema*RDD* while it is not
>> >> >>>>>>>>>> an RDD is highly confusing. Instead, DataFrame.rdd will return
>> >> >>>>>>>>>> the underlying RDD for all RDD methods.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> My understanding is that very few users program directly
>> >> >>>>>>>>>> against the SchemaRDD API at the moment, because it is not well
>> >> >>>>>>>>>> documented. However, to maintain backward compatibility, we can
>> >> >>>>>>>>>> create a type alias for DataFrame that is still named
>> >> >>>>>>>>>> SchemaRDD. This will maintain source compatibility for Scala.
>> >> >>>>>>>>>> That said, we will have to update all existing materials to use
>> >> >>>>>>>>>> DataFrame rather than SchemaRDD.
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>
>> >> >>>> ---------------------------------------------------------------------
>> >> >>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> >> >>>>>>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>
>> >> >>>>
>> >> >>>
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>> >>
>
>

