Re: renaming SchemaRDD -> DataFrame

Michael Armbrust Wed, 28 Jan 2015 17:56:53 -0800

In particular the performance tricks are in SpecificMutableRow.
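
To make the boxing concern in the quoted message below concrete, here is a minimal sketch of the general trick of keeping a primitive value unboxed and tracking SQL "null" with a separate flag. It is an illustration only: the names NullableIntSlot and NullableIntSlotDemo are hypothetical, and this is not Spark's actual SpecificMutableRow code.

```java
// Hypothetical illustration; these names are not part of Spark's API.
// The int is stored unboxed, and a boolean flag records whether the
// slot currently holds SQL NULL, so no java.lang.Integer is allocated.
final class NullableIntSlot {
    private int value;             // primitive storage, never boxed
    private boolean isNull = true; // a fresh slot starts out as NULL

    boolean isNull() {
        return isNull;
    }

    int get() {
        if (isNull) {
            throw new IllegalStateException("slot is null");
        }
        return value;
    }

    // Writing a value clears the null flag and reuses the same object.
    void set(int v) {
        value = v;
        isNull = false;
    }

    // Marking the slot NULL leaves the old primitive in place but hidden.
    void setNull() {
        isNull = true;
    }
}

final class NullableIntSlotDemo {
    public static void main(String[] args) {
        NullableIntSlot slot = new NullableIntSlot();
        System.out.println(slot.isNull());
        slot.set(42);
        System.out.println(slot.get());
        slot.setNull();
        System.out.println(slot.isNull());
    }
}
```

Because the slot is mutable, the same object can be reused for every row, which is the kind of saving a mutable row design aims for.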

On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan <velvia.git...@gmail.com> wrote:


> Yeah, it's "null".   I was worried you couldn't represent it in Row
> because of primitive types like Int (unless you box the Int, which
> would be a performance hit).  Anyways, I'll take another look at the
> Row API again  :-p
>
> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:
> > Isn't that just "null" in SQL?
> >
> > On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com>
> wrote:
> >>
> >> I believe that most DataFrame implementations out there, like Pandas,
> >> support the idea of missing values / NA, and some support the idea of
> >> Not Meaningful as well.
> >>
> >> Does Row support anything like that?  That is important for certain
> >> applications.  I thought that Row worked by being a mutable object,
> >> but haven't looked into the details in a while.
> >>
> >> -Evan
> >>
> >> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com>
> wrote:
> >> > It shouldn't change the data source api at all because data sources
> >> > create
> >> > RDD[Row], and that gets converted into a DataFrame automatically
> >> > (previously
> >> > to SchemaRDD).
> >> >
> >> >
> >> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >> >
> >> > One thing that will break the data source API in 1.3 is the location
> of
> >> > types. Types were previously defined in sql.catalyst.types, and now
> >> > moved to
> >> > sql.types. After 1.3, sql.catalyst is hidden from users, and all
> public
> >> > APIs
> >> > have first class classes/objects defined in sql directly.
> >> >
> >> >
> >> >
> >> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Hey guys,
> >> >>
> >> >> How does this impact the data sources API?  I was planning on using
> >> >> this for a project.
> >> >>
> >> >> +1 that many things from spark-sql / DataFrame are universally
> >> >> desirable and useful.
> >> >>
> >> >> By the way, one thing that prevents the columnar compression stuff in
> >> >> Spark SQL from being more useful is, at least from previous talks
> with
> >> >> Reynold and Michael et al., that the format was not designed for
> >> >> persistence.
> >> >>
> >> >> I have a new project that aims to change that.  It is a
> >> >> zero-serialisation, high performance binary vector library, designed
> >> >> from the outset to be persistent-storage friendly.  Maybe one day
> >> >> it can replace the Spark SQL columnar compression.
> >> >>
> >> >> Michael told me this would be a lot of work, and recreates parts of
> >> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >> >>
> >> >> -Evan
> >> >>
> >> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com>
> >> >> wrote:
> >> >> > Alright I have merged the patch (
> >> >> > https://github.com/apache/spark/pull/4173
> >> >> > ) since I don't see any strong opinions against it (as a matter of
> >> >> > fact
> >> >> > most were for it). We can still change it if somebody lays out a
> >> >> > strong
> >> >> > argument.
> >> >> >
> >> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> >> > <matei.zaha...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> The type alias means your methods can specify either type and they
> >> >> >> will
> >> >> >> work. It's just another name for the same type. But Scaladocs and
> >> >> >> such
> >> >> >> will
> >> >> >> show DataFrame as the type.
> >> >> >>
> >> >> >> Matei
> >> >> >>
> >> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >> >
> >> >> >> > Reynold,
> >> >> >> > But with type alias we will have the same problem, right?
> >> >> >> > If the methods don't receive SchemaRDD anymore, we will have to
> >> >> >> > change our code to migrate from SchemaRDD to DataFrame, unless we
> >> >> >> > have an implicit conversion between DataFrame and SchemaRDD.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:
> >> >> >> >
> >> >> >> >> Dirceu,
> >> >> >> >>
> >> >> >> >> That is not possible because one cannot overload return types.
> >> >> >> >>
> >> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> >> >> >> >> some
> >> >> >> type,
> >> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >> >> >>
> >> >> >> >> In 1.3, we will create a type alias for DataFrame called
> >> >> >> >> SchemaRDD
> >> >> >> >> to
> >> >> >> not
> >> >> >> >> break source compatibility for Scala.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> >> >> >> >>> removed in release 1.5 (+/- 1), for example, and the new code be
> >> >> >> >>> added to DataFrame? With this, we don't impact existing code for
> >> >> >> >>> the next few releases.
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <
> kushal.da...@gmail.com>:
> >> >> >> >>>
> >> >> >> >>>> I want to address the issue that Matei raised about the heavy
> >> >> >> >>>> lifting
> >> >> >> >>>> required for a full SQL support. It is amazing that even
> after
> >> >> >> >>>> 30
> >> >> >> years
> >> >> >> >>> of
> >> >> >> >>>> research there is not a single good open source columnar
> >> >> >> >>>> database
> >> >> >> >>>> like
> >> >> >> >>>> Vertica. There is a column store option in MySQL, but it is
> not
> >> >> >> >>>> nearly
> >> >> >> >>> as
> >> >> >> >>>> sophisticated as Vertica or MonetDB. But there's a true need
> >> >> >> >>>> for
> >> >> >> >>>> such
> >> >> >> a
> >> >> >> >>>> system. I wonder why that is, and it's high time to change that.
> >> >> >> >>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <
> sandy.r...@cloudera.com>
> >> >> >> wrote:
> >> >> >> >>>>
> >> >> >> >>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
> >> >> >> >>>>> the
> >> >> >> >>> former
> >> >> >> >>>>> slightly better because it's more descriptive.
> >> >> >> >>>>>
> >> >> >> >>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers, it
> >> >> >> >>>>> would be clearer from a user-facing perspective to at least
> >> >> >> >>>>> choose a package name for it that omits "sql".
> >> >> >> >>>>>
> >> >> >> >>>>> I would also be in favor of adding a separate Spark Schema
> >> >> >> >>>>> module
> >> >> >> >>>>> for
> >> >> >> >>>> Spark
> >> >> >> >>>>> SQL to rely on, but I imagine that might be too large a
> change
> >> >> >> >>>>> at
> >> >> >> this
> >> >> >> >>>>> point?
> >> >> >> >>>>>
> >> >> >> >>>>> -Sandy
> >> >> >> >>>>>
> >> >> >> >>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >> >> >> >>> matei.zaha...@gmail.com>
> >> >> >> >>>>> wrote:
> >> >> >> >>>>>
> >> >> >> >>>>>> (Actually when we designed Spark SQL we thought of giving
> it
> >> >> >> >>>>>> another
> >> >> >> >>>>> name,
> >> >> >> >>>>>> like Spark Schema, but we decided to stick with SQL since
> >> >> >> >>>>>> that
> >> >> >> >>>>>> was
> >> >> >> >>> the
> >> >> >> >>>>> most
> >> >> >> >>>>>> obvious use case to many users.)
> >> >> >> >>>>>>
> >> >> >> >>>>>> Matei
> >> >> >> >>>>>>
> >> >> >> >>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
> >> >> >> >>> matei.zaha...@gmail.com>
> >> >> >> >>>>>> wrote:
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> While it might be possible to move this concept to Spark
> >> >> >> >>>>>>> Core
> >> >> >> >>>>> long-term,
> >> >> >> >>>>>> supporting structured data efficiently does require quite a
> >> >> >> >>>>>> bit
> >> >> >> >>>>>> of
> >> >> >> >>> the
> >> >> >> >>>>>> infrastructure in Spark SQL, such as query planning and
> >> >> >> >>>>>> columnar
> >> >> >> >>>> storage.
> >> >> >> >>>>>> The intent of Spark SQL though is to be more than a SQL
> >> >> >> >>>>>> server
> >> >> >> >>>>>> --
> >> >> >> >>> it's
> >> >> >> >>>>>> meant to be a library for manipulating structured data.
> Since
> >> >> >> >>>>>> this
> >> >> >> >>> is
> >> >> >> >>>>>> possible to build over the core API, it's pretty natural to
> >> >> >> >>> organize it
> >> >> >> >>>>>> that way, same as Spark Streaming is a library.
> >> >> >> >>>>>>>
> >> >> >> >>>>>>> Matei
> >> >> >> >>>>>>>
> >> >> >> >>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers
> >> >> >> >>>>>>>> <ko...@tresata.com>
> >> >> >> >>>> wrote:
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> "The context is that SchemaRDD is becoming a common data
> >> >> >> >>>>>>>> format
> >> >> >> >>> used
> >> >> >> >>>>> for
> >> >> >> >>>>>>>> bringing data into Spark from external systems, and used
> >> >> >> >>>>>>>> for
> >> >> >> >>> various
> >> >> >> >>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> i agree. this to me also implies it belongs in spark
> core,
> >> >> >> >>>>>>>> not
> >> >> >> >>> sql
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
> >> >> >> >>>>>>>> michaelma...@yahoo.com.invalid> wrote:
> >> >> >> >>>>>>>>
> >> >> >> >>>>>>>>> And in the off chance that anyone hasn't seen it yet,
> the
> >> >> >> >>>>>>>>> Jan.
> >> >> >> >>> 13
> >> >> >> >>>> Bay
> >> >> >> >>>>>> Area
> >> >> >> >>>>>>>>> Spark Meetup YouTube contained a wealth of background
> >> >> >> >>> information
> >> >> >> >>>> on
> >> >> >> >>>>>> this
> >> >> >> >>>>>>>>> idea (mostly from Patrick and Reynold :-).
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> ________________________________
> >> >> >> >>>>>>>>> From: Patrick Wendell <pwend...@gmail.com>
> >> >> >> >>>>>>>>> To: Reynold Xin <r...@databricks.com>
> >> >> >> >>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> >> >> >> >>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
> >> >> >> >>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> One thing potentially not clear from this e-mail, there
> >> >> >> >>>>>>>>> will
> >> >> >> >>>>>>>>> be
> >> >> >> >>> a
> >> >> >> >>>> 1:1
> >> >> >> >>>>>>>>> correspondence where you can get an RDD to/from a
> >> >> >> >>>>>>>>> DataFrame.
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
> >> >> >> >>> r...@databricks.com>
> >> >> >> >>>>>> wrote:
> >> >> >> >>>>>>>>>> Hi,
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
> >> >> >> >>>>>>>>>> 1.3,
> >> >> >> >>>>>>>>>> and
> >> >> >> >>>>> wanted
> >> >> >> >>>>>> to
> >> >> >> >>>>>>>>>> get the community's opinion.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> The context is that SchemaRDD is becoming a common data
> >> >> >> >>>>>>>>>> format
> >> >> >> >>>> used
> >> >> >> >>>>>> for
> >> >> >> >>>>>>>>>> bringing data into Spark from external systems, and
> used
> >> >> >> >>>>>>>>>> for
> >> >> >> >>>> various
> >> >> >> >>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API. We
> >> >> >> >>>>>>>>>> also
> >> >> >> >>> expect
> >> >> >> >>>>>> more
> >> >> >> >>>>>>>>> and
> >> >> >> >>>>>>>>>> more users to be programming directly against SchemaRDD
> >> >> >> >>>>>>>>>> API
> >> >> >> >>> rather
> >> >> >> >>>>>> than
> >> >> >> >>>>>>>>> the
> >> >> >> >>>>>>>>>> core RDD API. SchemaRDD, through its less commonly used
> >> >> >> >>>>>>>>>> DSL
> >> >> >> >>>>> originally
> >> >> >> >>>>>>>>>> designed for writing test cases, always has the
> >> >> >> >>>>>>>>>> data-frame
> >> >> >> >>>>>>>>>> like
> >> >> >> >>>> API.
> >> >> >> >>>>>> In
> >> >> >> >>>>>>>>>> 1.3, we are redesigning the API to make the API usable
> >> >> >> >>>>>>>>>> for
> >> >> >> >>>>>>>>>> end
> >> >> >> >>>>> users.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> There are two motivations for the renaming:
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
> >> >> >> >>> SchemaRDD.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
> RDD
> >> >> >> >>> anymore
> >> >> >> >>>>>> (even
> >> >> >> >>>>>>>>>> though it would contain some RDD functions like map,
> >> >> >> >>>>>>>>>> flatMap,
> >> >> >> >>>> etc),
> >> >> >> >>>>>> and
> >> >> >> >>>>>>>>>> calling it Schema*RDD* while it is not an RDD is highly confusing.
> >> >> >> >>>>>>>>>> Instead, DataFrame.rdd will return the underlying RDD for all RDD
> >> >> >> >>>>>>>>>> methods.
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>>
> >> >> >> >>>>>>>>>> My understanding is that very few users program directly against
> >> >> >> >>>>>>>>>> the SchemaRDD API at the moment, because it is not well documented.
> >> >> >> >>>>>>>>>> However, to maintain backward compatibility, we can create a type
> >> >> >> >>>>>>>>>> alias for DataFrame that is still named SchemaRDD. This will
> >> >> >> >>>>>>>>>> maintain source compatibility for Scala. That said, we will have to
> >> >> >> >>>>>>>>>> update all existing materials to use DataFrame rather than SchemaRDD.
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>
> ---------------------------------------------------------------------
> >> >> >> >>>>>>>>> To unsubscribe, e-mail:
> dev-unsubscr...@spark.apache.org
> >> >> >> >>>>>>>>> For additional commands, e-mail:
> dev-h...@spark.apache.org
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>>>
> >> >> >> >>>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>>>>
> >> >> >> >>>>>>
> >> >> >> >>>>>
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >
> >> >
> >
> >
>
>
>
