thanks matei, it's good to know i can create them like that. reynold, yeah, somehow the words SQL get me going :) sorry... agreed that you need new transformations to preserve the schema info. i misunderstood and thought i had to implement the whole bunch, but that is clearly not necessary, as matei indicated.
alright, i am clearly being slow/dense here, but now it makes sense to me...

On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin <r...@databricks.com> wrote:

> Koert,
>
> Don't get too hung up on the name SQL. This is exactly what you want: a
> collection of record-like objects with field names and runtime types.
>
> Almost all of the 40 methods are transformations for structured data,
> such as aggregation on a field, or filtering on a field. If all you have
> is the old RDD-style map/flatMap, then any transformation would lose the
> schema information, making the extra schema information useless.

On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:

> so i understand the success of spark.sql. besides the fact that anything
> with the word SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for
> a generic RDD that holds record-like objects, with field names and
> runtime types. after all, that is a successful generic abstraction used
> in many structured data tools.
>
> but to me that abstraction is as simple as:
>
>   trait SchemaRDD extends RDD[Row] {
>     def schema: StructType
>   }
>
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of
> columns). so that could be DataFrame.
>
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
>
> what i do not understand is why it has to be much more complex than
> this. if i look at DataFrame it has so much additional stuff that has
> (in my eyes) nothing to do with generic structured data analysis.
>
> for example, to implement DataFrame i need to implement about 40
> additional methods!? and for some of them the SQL-ness is obviously
> leaking into the abstraction. for example, why would i care about:
>
>   def registerTempTable(tableName: String): Unit
>
> best, koert
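For concreteness, the contract sketched above is small enough to exercise end to end. The following is a sketch only: the CsvSchemaRDD name, the naive comma splitting, and the all-strings schema are illustrative assumptions, not anything Spark ships (imports assume the 1.3-era org.apache.spark.sql.types layout).

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // the proposed contract, exactly as written above
    trait SchemaRDD extends RDD[Row] {
      def schema: StructType
    }

    // a hypothetical csv loader against that contract: wraps an existing
    // RDD[Row] one-to-one and carries the schema alongside it
    class CsvSchemaRDD(prev: RDD[Row], val schema: StructType)
      extends RDD[Row](prev) with SchemaRDD {
      override def compute(split: Partition, context: TaskContext): Iterator[Row] =
        prev.iterator(split, context)
      override protected def getPartitions: Array[Partition] = prev.partitions
    }

    object CsvSchemaRDD {
      // no quoting/escaping, every column typed as string -- a sketch only
      def load(sc: SparkContext, path: String, header: Seq[String]): CsvSchemaRDD = {
        val schema = StructType(header.map(StructField(_, StringType, nullable = true)))
        val rows = sc.textFile(path).map(line => Row.fromSeq(line.split(",", -1).toSeq))
        new CsvSchemaRDD(rows, schema)
      }
    }

With a trait like this, a downstream user only needs rows plus a schema; everything else (csv quoting, type inference) can live in the loader.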
On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.git...@gmail.com> wrote:

> It is true that you can persist SchemaRDDs / DataFrames to disk via
> Parquet, but a lot of time and efficiency is lost. The in-memory
> columnar cached representation is completely different from the Parquet
> file format, and I believe there has to be a translation into a Row
> (because ultimately Spark SQL traverses Rows -- even the
> InMemoryColumnarTableScan has to convert the columns into Rows for
> row-based processing). On the other hand, traditional data frames
> process in a columnar fashion. Columnar storage is good, but nowhere
> near as good as columnar processing.
>
> Another issue, which I don't know has been solved yet, is that it is
> difficult for Tachyon to efficiently cache Parquet files without
> understanding the file format itself.
>
> I gave a talk at last year's Spark Summit on this topic.
>
> I'm working on efforts to change this, however. Shoot me an email at
> velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Yes, when a DataFrame is cached in memory, it's stored in an efficient
> columnar format. And you can also easily persist it on disk using
> Parquet, which is also columnar.
>
> Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:

> to me the word DataFrame does come with certain expectations. one of
> them is that the data is stored columnar. R's data.frame internally uses
> a list of sequences, i think, but since lists can have labels it's more
> like a SortedMap[String, Array[_]]. this makes certain operations very
> cheap (such as adding a column).
>
> in Spark the closest thing would be a data structure where, per
> Partition, the data is also stored columnar. does Spark SQL already use
> something like that? Evan mentioned "Spark SQL columnar compression",
> which sounds like it. where can i find that?
>
> thanks
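As a rough model of the representation described above (and only a model; Spark SQL's actual in-memory columnar cache works differently), a per-partition column store can be a sorted map from column name to array, which makes adding or projecting columns cheap because no row is ever traversed:

    import scala.collection.immutable.SortedMap

    // a toy column-oriented partition in the spirit of R's data.frame:
    // column name -> column values. illustrative only.
    case class ColumnarPartition(columns: SortedMap[String, Array[Any]]) {
      def numRows: Int = columns.headOption.map(_._2.length).getOrElse(0)

      // adding a column allocates only the new array; existing columns
      // are shared, and no row is rewritten
      def withColumn(name: String, values: Array[Any]): ColumnarPartition = {
        require(columns.isEmpty || values.length == numRows, "column length mismatch")
        ColumnarPartition(columns + (name -> values))
      }

      // projecting a subset of columns is equally cheap: no row traversal
      def select(names: String*): ColumnarPartition =
        ColumnarPartition(SortedMap(names.map(n => n -> columns(n)): _*))
    }

    object ColumnarDemo {
      def main(args: Array[String]): Unit = {
        val part = ColumnarPartition(SortedMap(
          "id"   -> Array[Any](1, 2, 3),
          "name" -> Array[Any]("a", "b", "c")))
        // derive and append a column for the cost of one new array
        val lens = part.columns("name").map(v => v.asInstanceOf[String].length: Any)
        val projected = part.withColumn("name_len", lens).select("id", "name_len")
        println(projected.columns.keys.mkString(", "))
      }
    }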
On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> +1... having proper NA support is much cleaner than using null, at
> least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> You've got to be a little bit careful here. "NA" in systems like R or
> pandas may have special meaning that is distinct from "null".
>
> See, e.g. http://www.r-bloggers.com/r-na-vs-null/

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:

> Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> I believe that most DataFrame implementations out there, like Pandas,
> support the idea of missing values / NA, and some support the idea of
> Not Meaningful as well.
>
> Does Row support anything like that? That is important for certain
> applications. I thought that Row worked by being a mutable object, but
> haven't looked into the details in a while.
>
> -Evan
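For what it's worth, the distinction being circled here can be made concrete in a few lines of Scala. This is a standalone illustration of NA-vs-null semantics, not Spark's Row API:

    // standalone sketch: modeling a missing numeric cell with an explicit
    // ADT instead of Java null, so "missing" and "not meaningful" can be
    // told apart, as R/pandas semantics are described above
    object NaDemo {
      sealed trait Cell
      case class Value(v: Double) extends Cell
      case object NA extends Cell            // missing; propagates unless removed
      case object NotMeaningful extends Cell // defined to carry no information

      // mirrors R's mean(x, na.rm = ...): with naRm = false a single NA
      // makes the whole aggregate undefined. (SQL's AVG, by contrast,
      // silently skips NULLs -- part of why NA and null are not the same.)
      def mean(cells: Seq[Cell], naRm: Boolean): Option[Double] = {
        if (!naRm && cells.contains(NA)) None
        else {
          val vs = cells.collect { case Value(v) => v }
          if (vs.isEmpty) None else Some(vs.sum / vs.size)
        }
      }

      def main(args: Array[String]): Unit = {
        val xs = Seq(Value(1.0), NA, Value(3.0))
        println(mean(xs, naRm = false)) // None: NA propagates
        println(mean(xs, naRm = true))  // Some(2.0): NA skipped
      }
    }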
On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:

> It shouldn't change the data source API at all, because data sources
> create RDD[Row], and that gets converted into a DataFrame automatically
> (previously to SchemaRDD).
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
> One thing that will break the data source API in 1.3 is the location of
> types. Types were previously defined in sql.catalyst.types, and have now
> moved to sql.types. After 1.3, sql.catalyst is hidden from users, and
> all public APIs have first-class classes/objects defined in sql
> directly.

On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> Hey guys,
>
> How does this impact the data sources API? I was planning on using this
> for a project.
>
> +1 that many things from spark-sql / DataFrame are universally
> desirable and useful.
>
> By the way, one thing that prevents the columnar compression stuff in
> Spark SQL from being more useful is, at least from previous talks with
> Reynold and Michael et al., that the format was not designed for
> persistence.
>
> I have a new project that aims to change that. It is a
> zero-serialisation, high-performance binary vector library, designed
> from the outset to be persistent-storage friendly. Maybe one day it can
> replace the Spark SQL columnar compression.
>
> Michael told me this would be a lot of work, and recreates parts of
> Parquet, but I think it's worth it. LMK if you'd like more details.
>
> -Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:

> Alright, I have merged the patch
> (https://github.com/apache/spark/pull/4173) since I don't see any
> strong opinions against it (as a matter of fact, most were for it). We
> can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> The type alias means your methods can specify either type and they will
> work. It's just another name for the same type. But Scaladocs and such
> will show DataFrame as the type.
>
> Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

> Reynold,
> But with a type alias we will have the same problem, right?
> If the methods don't receive SchemaRDD anymore, we will have to change
> our code to migrate from SchemaRDD to DataFrame. Unless we have an
> implicit conversion between DataFrame and SchemaRDD.

2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

> Dirceu,
>
> That is not possible because one cannot overload return types.
>
> SQLContext.parquetFile (and many other methods) needs to return some
> type, and that type cannot be both SchemaRDD and DataFrame.
>
> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> not break source compatibility for Scala.
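An alias of the kind described here is a one-line construct in Scala. A toy model (hypothetical names, not Spark's real classes) shows both why old code keeps compiling and why overloading on return type alone could never work:

    object AliasDemo {
      // toy stand-in with a hypothetical field; not Spark's DataFrame
      class DataFrame(val columns: Seq[String])

      // the alias: SchemaRDD and DataFrame are now the *same* type, so any
      // method typed against one accepts the other; Scaladoc shows DataFrame
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame

      // old code written against SchemaRDD keeps compiling, and values
      // typed as DataFrame flow into it unchanged
      def describe(df: SchemaRDD): String = df.columns.mkString(", ")

      def main(args: Array[String]): Unit = {
        val df = new DataFrame(Seq("id", "name"))
        println(describe(df))
        // by contrast, overloading on return type alone is impossible:
        //   def parquetFile(path: String): SchemaRDD = ...
        //   def parquetFile(path: String): DataFrame = ...  // same signature
      }
    }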
On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

> Can't the SchemaRDD remain the same, but deprecated, and be removed in
> release 1.5 (+/- 1), for example, with the new code being added to
> DataFrame? With this, we don't impact existing code for the next few
> releases.

2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

> I want to address the issue that Matei raised about the heavy lifting
> required for full SQL support. It is amazing that even after 30 years of
> research there is not a single good open source columnar database like
> Vertica. There is a column store option in MySQL, but it is not nearly
> as sophisticated as Vertica or MonetDB. But there's a true need for such
> a system. I wonder why that is, and it's high time to change it.

On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:

> Both SchemaRDD and DataFrame sound fine to me, though I like the former
> slightly better because it's more descriptive.
>
> Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
> be clearer from a user-facing perspective to at least choose a package
> name for it that omits "sql".
>
> I would also be in favor of adding a separate Spark Schema module for
> Spark SQL to rely on, but I imagine that might be too large a change at
> this point?
>
> -Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> (Actually when we designed Spark SQL we thought of giving it another
> name, like Spark Schema, but we decided to stick with SQL since that was
> the most obvious use case to many users.)
>
> Matei

On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> While it might be possible to move this concept to Spark Core
> long-term, supporting structured data efficiently does require quite a
> bit of the infrastructure in Spark SQL, such as query planning and
> columnar storage. The intent of Spark SQL, though, is to be more than a
> SQL server -- it's meant to be a library for manipulating structured
> data. Since this is possible to build over the core API, it's pretty
> natural to organize it that way, same as Spark Streaming is a library.
>
> Matei
On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:

> "The context is that SchemaRDD is becoming a common data format used
> for bringing data into Spark from external systems, and used for
> various components of Spark, e.g. MLlib's new pipeline API."
>
> i agree. this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> Area Spark Meetup YouTube video contained a wealth of background
> information on this idea (mostly from Patrick and Reynold :-).
>
> https://www.youtube.com/watch?v=YWppYPWznSQ
>
> ________________________________
> From: Patrick Wendell <pwend...@gmail.com>
> To: Reynold Xin <r...@databricks.com>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> Sent: Monday, January 26, 2015 4:01 PM
> Subject: Re: renaming SchemaRDD -> DataFrame
>
> One thing potentially not clear from this e-mail: there will be a 1:1
> correspondence where you can get an RDD to/from a DataFrame.
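That 1:1 correspondence is easy to see in the 1.3-style API. A minimal round trip, sketched here assuming a local SparkContext and the createDataFrame/.rdd methods that shipped with the rename:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object RoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("roundtrip"))
        val sqlContext = new SQLContext(sc)

        // RDD -> DataFrame: pair an RDD[Row] with an explicit schema
        val rows = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
        val schema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = true)))
        val df = sqlContext.createDataFrame(rows, schema)

        // DataFrame -> RDD: .rdd exposes the underlying RDD[Row] again,
        // at the cost of dropping back to untyped, row-at-a-time access
        val back = df.rdd.map(r => (r.getInt(0), r.getString(1)))
        back.collect().foreach(println)

        sc.stop()
      }
    }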
On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:

> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> to get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more
> and more users to be programming directly against the SchemaRDD API
> rather than the core RDD API. SchemaRDD, through its less commonly used
> DSL originally designed for writing test cases, has always had a
> data-frame-like API. In 1.3, we are redesigning the API to make it
> usable for end users.
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc.),
> and calling it Schema*RDD* while it is not an RDD is highly confusing.
> Instead, DataFrame.rdd will return the underlying RDD for all RDD
> methods.
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because it is not well documented.
> However, to maintain backward compatibility, we can create a type alias
> for DataFrame that is still named SchemaRDD. This will maintain source
> compatibility for Scala.
That said, we will have to update all >> existing >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> materials to >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> use >> > >>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>> DataFrame rather than SchemaRDD. >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>> >> > --------------------------------------------------------------------- >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: >> > >>> >> > >>> dev-unsubscr...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> For additional commands, e-mail: >> > >>>>>>>>>>>>>>>>>>> dev-h...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>> >> > --------------------------------------------------------------------- >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: >> > >>> >> > >>> dev-unsubscr...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> For additional commands, e-mail: >> > >>>>>>>>>>>>>>>>>>> dev-h...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > >>>>>> >> > --------------------------------------------------------------------- >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> To unsubscribe, e-mail: >> dev-unsubscr...@spark.apache.org >> > >>>>>>>>>>>>>>>> For additional commands, e-mail: >> > >>> >> > >>> dev-h...@spark.apache.org >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>> >> --------------------------------------------------------------------- >> > >>>>>>>>>> >> > >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> > >>>>>>>>>> For additional commands, e-mail: dev-h...@spark.apache.org >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>> >> > >>>> >> > >>> >> --------------------------------------------------------------------- >> > >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> > >>> For additional commands, e-mail: dev-h...@spark.apache.org >> > >>> >> > >>> >> > > >> > > >> > > --------------------------------------------------------------------- >> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> > > For additional commands, e-mail: dev-h...@spark.apache.org >> > > >> > >> > >