+1.... having proper NA support is much cleaner than using null, at least the Java null.
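The NA-vs-null distinction discussed in this thread can be sketched in a few lines of plain Python (hypothetical helper names, not any Spark or pandas API): SQL's NULL yields three-valued logic, whereas an R-style NA is a first-class "missing" value that keeps its slot in the data, and R's NULL drops the element entirely.

```python
# Sketch of the semantics under discussion (toy code, hypothetical names).

NULL = None  # stands in for SQL NULL / Java null: "no value here"

def sql_eq(a, b):
    """SQL three-valued equality: any comparison involving NULL is unknown."""
    if a is NULL or b is NULL:
        return None          # unknown -- neither True nor False
    return a == b

class NA:
    """R-style NA: a first-class 'missing' marker that lives inside data."""
    def __repr__(self):
        return "NA"

na = NA()
column_with_na = [1, na, 3]   # NA keeps its slot in the column...
column_after_null = [1, 3]    # ...while R's NULL removes the element.

assert sql_eq(NULL, NULL) is None   # NULL == NULL is unknown, not True
assert sql_eq(2, 2) is True
assert len(column_with_na) == 3
assert len(column_after_null) == 2
```

This is why mapping NA straight onto Java null can lose information: null alone cannot distinguish "missing" from "not meaningful", and it collapses three-valued comparisons into exceptions or plain false.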
On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null".

See, e.g., http://www.r-bloggers.com/r-na-vs-null/

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:

Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

I believe that most DataFrame implementations out there, like Pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well.

Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but I haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:

It shouldn't change the data source API at all, because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously, into a SchemaRDD).

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types and have now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first-class classes/objects defined in sql directly.

On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:

Hey guys,

How does this impact the data sources API? I was planning on using this for a project.

+1 that many things from Spark SQL / DataFrame are universally desirable and useful.
By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is that, at least from previous talks with Reynold and Michael et al., the format was not designed for persistence.

I have a new project that aims to change that. It is a zero-serialisation, high-performance binary vector library, designed from the outset to be persistent-storage friendly. Maybe one day it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of Parquet, but I think it's worth it. LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:

Alright, I have merged the patch (https://github.com/apache/spark/pull/4173) since I don't see any strong opinions against it (as a matter of fact, most were for it). We can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.

Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Reynold,
But with the type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame.
Unless we have an implicit conversion between DataFrame and SchemaRDD.

2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

Dirceu,

That is not possible because one cannot overload return types.

SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.

On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? With this, we don't impact existing code for the next few releases.

2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open-source columnar database like Vertica. There is a column-store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB.
But there's a true need for such a system. I wonder why that is, and it's high time to change it.

On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:

Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

(Actually, when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei
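As an editorial aside on the column-store discussion above: the scan advantage of a columnar layout can be illustrated with plain Python lists (toy data, no real storage engine) -- each field is stored contiguously, so aggregating one column touches only that one array.

```python
# Row-oriented vs column-oriented layout of the same toy table.
rows = [("a", 1), ("b", 2), ("c", 3)]     # row store: one tuple per record
columns = {"name": ["a", "b", "c"],       # column store: one array per field
           "value": [1, 2, 3]}

# Summing one field: the row store must walk every record...
row_sum = sum(r[1] for r in rows)
# ...while the column store scans a single contiguous list, which is also
# far friendlier to compression (e.g. RLE, delta encoding).
col_sum = sum(columns["value"])

assert row_sum == col_sum == 6
```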
On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.

Matei

On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:

"The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API."

i agree.
this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup video on YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).

https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <pwend...@gmail.com>
To: Reynold Xin <r...@databricks.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame

One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:

Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.
The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had the data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users.

There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc.), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
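Motivation 2 -- a DataFrame that is no longer an RDD but still carries RDD-like methods and exposes the underlying RDD via .rdd -- can be sketched with hypothetical minimal classes (illustrative only, not the real Spark API):

```python
# Toy sketch: DataFrame delegates map/flatMap to an underlying RDD
# without subclassing RDD (hypothetical classes, illustrative only).

class RDD:
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return RDD(f(x) for x in self.data)
    def flatMap(self, f):
        return RDD(y for x in self.data for y in f(x))

class DataFrame:                      # note: NOT a subclass of RDD
    def __init__(self, rdd):
        self.rdd = rdd                # DataFrame.rdd exposes the RDD
    def map(self, f):
        return self.rdd.map(f)
    def flatMap(self, f):
        return self.rdd.flatMap(f)

df = DataFrame(RDD([1, 2, 3]))
assert not isinstance(df, RDD)                      # not an RDD anymore
assert df.map(lambda x: x * 2).data == [2, 4, 6]    # but still maps
assert df.flatMap(lambda x: [x, x]).data == [1, 1, 2, 2, 3, 3]
```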
My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.
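The type-alias plan (in Scala, roughly `type SchemaRDD = DataFrame`) has a close analogue in Python name binding. A sketch with hypothetical classes shows why an alias needs no conversion, implicit or otherwise: both names denote the very same type.

```python
# Toy sketch of a type alias (hypothetical classes, not the Spark API).

class DataFrame:
    def __init__(self, rows):
        self.rows = rows

SchemaRDD = DataFrame        # the alias: a second name for the SAME class

def count_rows(df):
    # Accepts values built under either name -- they are one type.
    return len(df.rows)

assert SchemaRDD is DataFrame                 # identical, not a subclass
assert isinstance(SchemaRDD([1, 2]), DataFrame)
assert count_rows(SchemaRDD([1, 2])) == 2
assert count_rows(DataFrame([1, 2, 3])) == 3
```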
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org