Re: renaming SchemaRDD -> DataFrame

Evan Chan Sun, 01 Feb 2015 00:32:46 -0800

It is true that you can persist SchemaRdds / DataFrames to disk via
Parquet, but a lot of time and inefficiencies is lost.   The in-memory
columnar cached representation is completely different from the
Parquet file format, and I believe there has to be a translation into
a Row (because ultimately Spark SQL traverses Row's -- even the
InMemoryColumnarTableScan has to then convert the columns into Rows
for row-based processing).   On the other hand, traditional data
frames process in a columnar fashion.   Columnar storage is good, but
nowhere near as good as columnar processing.


Another issue, which I don't know if it is solved yet, but it is
difficult for Tachyon to efficiently cache Parquet files without
understanding the file format itself.

I gave a talk at last year's Spark Summit on this topic.

I'm working on efforts to change this, however.  Shoot me an email at
velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <[email protected]> wrote:
> Yes, when a DataFrame is cached in memory, it's stored in an efficient
> columnar format. And you can also easily persist it on disk using Parquet,
> which is also columnar.
>
> Cheng
>
>
> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>>
>> to me the word DataFrame does come with certain expectations. one of them
>> is that the data is stored columnar. in R data.frame internally uses a
>> list
>> of sequences i think, but since lists can have labels its more like a
>> SortedMap[String, Array[_]]. this makes certain operations very cheap
>> (such
>> as adding a column).
>>
>> in Spark the closest thing would be a data structure where per Partition
>> the data is also stored columnar. does spark SQL already use something
>> like
>> that? Evan mentioned "Spark SQL columnar compression", which sounds like
>> it. where can i find that?
>>
>> thanks
>>
>> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <[email protected]>
>> wrote:
>>
>>> +1.... having proper NA support is much cleaner than using null, at
>>> least the Java null.
>>>
>>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <[email protected]>
>>> wrote:
>>>>
>>>> You've got to be a little bit careful here. "NA" in systems like R or
>>>
>>> pandas
>>>>
>>>> may have special meaning that is distinct from "null".
>>>>
>>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>>>>
>>>>
>>>>
>>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <[email protected]>
>>>
>>> wrote:
>>>>>
>>>>> Isn't that just "null" in SQL?
>>>>>
>>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I believe that most DataFrame implementations out there, like Pandas,
>>>>>> supports the idea of missing values / NA, and some support the idea of
>>>>>> Not Meaningful as well.
>>>>>>
>>>>>> Does Row support anything like that?  That is important for certain
>>>>>> applications.  I thought that Row worked by being a mutable object,
>>>>>> but haven't looked into the details in a while.
>>>>>>
>>>>>> -Evan
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <[email protected]>
>>>>>> wrote:
>>>>>>>
>>>>>>> It shouldn't change the data source api at all because data sources
>>>>>>
>>>>>> create
>>>>>>>
>>>>>>> RDD[Row], and that gets converted into a DataFrame automatically
>>>>>>
>>>>>> (previously
>>>>>>>
>>>>>>> to SchemaRDD).
>>>>>>>
>>>>>>>
>>>>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>>>>>
>>>>>>> One thing that will break the data source API in 1.3 is the location
>>>>>>> of
>>>>>>> types. Types were previously defined in sql.catalyst.types, and now
>>>>>>
>>>>>> moved to
>>>>>>>
>>>>>>> sql.types. After 1.3, sql.catalyst is hidden from users, and all
>>>>>>> public
>>>>>>
>>>>>> APIs
>>>>>>>
>>>>>>> have first class classes/objects defined in sql directly.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <[email protected]
>>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hey guys,
>>>>>>>>
>>>>>>>> How does this impact the data sources API?  I was planning on using
>>>>>>>> this for a project.
>>>>>>>>
>>>>>>>> +1 that many things from spark-sql / DataFrame is universally
>>>>>>>> desirable and useful.
>>>>>>>>
>>>>>>>> By the way, one thing that prevents the columnar compression stuff
>>>
>>> in
>>>>>>>>
>>>>>>>> Spark SQL from being more useful is, at least from previous talks
>>>>>>>> with
>>>>>>>> Reynold and Michael et al., that the format was not designed for
>>>>>>>> persistence.
>>>>>>>>
>>>>>>>> I have a new project that aims to change that.  It is a
>>>>>>>> zero-serialisation, high performance binary vector library,
>>>
>>> designed
>>>>>>>>
>>>>>>>> from the outset to be a persistent storage friendly.  May be one
>>>
>>> day
>>>>>>>>
>>>>>>>> it can replace the Spark SQL columnar compression.
>>>>>>>>
>>>>>>>> Michael told me this would be a lot of work, and recreates parts of
>>>>>>>> Parquet, but I think it's worth it.  LMK if you'd like more
>>>
>>> details.
>>>>>>>>
>>>>>>>> -Evan
>>>>>>>>
>>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <[email protected]>
>>>>>>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Alright I have merged the patch (
>>>>>>>>> https://github.com/apache/spark/pull/4173
>>>>>>>>> ) since I don't see any strong opinions against it (as a matter
>>>
>>> of
>>>>>>
>>>>>> fact
>>>>>>>>>
>>>>>>>>> most were for it). We can still change it if somebody lays out a
>>>>>>
>>>>>> strong
>>>>>>>>>
>>>>>>>>> argument.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>>>>>>>>> <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The type alias means your methods can specify either type and
>>>
>>> they
>>>>>>
>>>>>> will
>>>>>>>>>>
>>>>>>>>>> work. It's just another name for the same type. But Scaladocs
>>>
>>> and
>>>>>>
>>>>>> such
>>>>>>>>>>
>>>>>>>>>> will
>>>>>>>>>> show DataFrame as the type.
>>>>>>>>>>
>>>>>>>>>> Matei
>>>>>>>>>>
>>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>>>>>>>>>>
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Reynold,
>>>>>>>>>>> But with type alias we will have the same problem, right?
>>>>>>>>>>> If the methods doesn't receive schemardd anymore, we will have
>>>>>>>>>>> to
>>>>>>>>>>> change
>>>>>>>>>>> our code to migrade from schema to dataframe. Unless we have
>>>
>>> an
>>>>>>>>>>>
>>>>>>>>>>> implicit
>>>>>>>>>>> conversion between DataFrame and SchemaRDD
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <[email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>> Dirceu,
>>>>>>>>>>>>
>>>>>>>>>>>> That is not possible because one cannot overload return
>>>
>>> types.
>>>>>>>>>>>>
>>>>>>>>>>>> SQLContext.parquetFile (and many other methods) needs to
>>>
>>> return
>>>>>>
>>>>>> some
>>>>>>>>>>
>>>>>>>>>> type,
>>>>>>>>>>>>
>>>>>>>>>>>> and that type cannot be both SchemaRDD and DataFrame.
>>>>>>>>>>>>
>>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called
>>>>>>>>>>>> SchemaRDD
>>>>>>>>>>>> to
>>>>>>>>>>
>>>>>>>>>> not
>>>>>>>>>>>>
>>>>>>>>>>>> break source compatibility for Scala.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and be
>>>>>>
>>>>>> removed
>>>>>>>>>>>>>
>>>>>>>>>>>>> in
>>>>>>>>>>
>>>>>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>>> release 1.5(+/- 1)  for example, and the new code been added
>>>>>>>>>>>>> to
>>>>>>>>>>
>>>>>>>>>> DataFrame?
>>>>>>>>>>>>>
>>>>>>>>>>>>> With this, we don't impact in existing code for the next few
>>>>>>>>>>>>> releases.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta
>>>>>>>>>>>>> <[email protected]>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I want to address the issue that Matei raised about the
>>>
>>> heavy
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> lifting
>>>>>>>>>>>>>> required for a full SQL support. It is amazing that even
>>>>>>>>>>>>>> after
>>>>>>
>>>>>> 30
>>>>>>>>>>
>>>>>>>>>> years
>>>>>>>>>>>>>
>>>>>>>>>>>>> of
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> research there is not a single good open source columnar
>>>>>>
>>>>>> database
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> like
>>>>>>>>>>>>>> Vertica. There is a column store option in MySQL, but it is
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>> nearly
>>>>>>>>>>>>>
>>>>>>>>>>>>> as
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> sophisticated as Vertica or MonetDB. But there's a true
>>>
>>> need
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>> such
>>>>>>>>>>
>>>>>>>>>> a
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> system. I wonder why so and it's high time to change that.
>>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza"
>>>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I
>>>
>>> like
>>>>>>
>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>>> former
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> slightly better because it's more descriptive.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Even if SchemaRDD's needs to rely on Spark SQL under the
>>>>>>
>>>>>> covers,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>
>>>>>>>>>>>>> would
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> be more clear from a user-facing perspective to at least
>>>>>>
>>>>>> choose a
>>>>>>>>>>>>>
>>>>>>>>>>>>> package
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> name for it that omits "sql".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark Schema
>>>>>>
>>>>>> module
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Spark
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> SQL to rely on, but I imagine that might be too large a
>>>>>>>>>>>>>>> change
>>>>>>
>>>>>> at
>>>>>>>>>>
>>>>>>>>>> this
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> point?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Sandy
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>>>>>>>>>>>>>
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (Actually when we designed Spark SQL we thought of giving
>>>>>>>>>>>>>>>> it
>>>>>>>>>>>>>>>> another
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> name,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> like Spark Schema, but we decided to stick with SQL since
>>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> was
>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> most
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> obvious use case to many users.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>>>>>>>>>>>>>
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While it might be possible to move this concept to Spark
>>>>>>>>>>>>>>>>> Core
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> long-term,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> supporting structured data efficiently does require
>>>
>>> quite a
>>>>>>
>>>>>> bit
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> infrastructure in Spark SQL, such as query planning and
>>>>>>
>>>>>> columnar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> storage.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The intent of Spark SQL though is to be more than a SQL
>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>
>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> meant to be a library for manipulating structured data.
>>>>>>>>>>>>>>>> Since
>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> possible to build over the core API, it's pretty natural
>>>
>>> to
>>>>>>>>>>>>>
>>>>>>>>>>>>> organize it
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> that way, same as Spark Streaming is a library.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <
>>>>>>
>>>>>> [email protected]>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common
>>>
>>> data
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> format
>>>>>>>>>>>>>
>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> bringing data into Spark from external systems, and
>>>
>>> used
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>
>>>>>>>>>>>>> various
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API."
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> i agree. this to me also implies it belongs in spark
>>>>>>>>>>>>>>>>>> core,
>>>>>>
>>>>>> not
>>>>>>>>>>>>>
>>>>>>>>>>>>> sql
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> And in the off chance that anyone hasn't seen it yet,
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> Jan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 13
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bay
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Area
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Spark Meetup YouTube contained a wealth of background
>>>>>>>>>>>>>
>>>>>>>>>>>>> information
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> on
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> this
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> idea (mostly from Patrick and Reynold :-).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>>> From: Patrick Wendell <[email protected]>
>>>>>>>>>>>>>>>>>>> To: Reynold Xin <[email protected]>
>>>>>>>>>>>>>>>>>>> Cc: "[email protected]" <[email protected]>
>>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail,
>>>
>>> there
>>>>>>
>>>>>> will
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> be
>>>>>>>>>>>>>
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1:1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> correspondence where you can get an RDD to/from a
>>>>>>
>>>>>> DataFrame.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <
>>>>>>>>>>>>>
>>>>>>>>>>>>> [email protected]>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in
>>>>>>>>>>>>>>>>>>>> 1.3,
>>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wanted
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> get the community's opinion.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common
>>>
>>> data
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> format
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> bringing data into Spark from external systems, and
>>>>>>>>>>>>>>>>>>>> used
>>>>>>
>>>>>> for
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> various
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> components of Spark, e.g. MLlib's new pipeline API.
>>>
>>> We
>>>>>>
>>>>>> also
>>>>>>>>>>>>>
>>>>>>>>>>>>> expect
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> more users to be programming directly against
>>>
>>> SchemaRDD
>>>>>>
>>>>>> API
>>>>>>>>>>>>>
>>>>>>>>>>>>> rather
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> core RDD API. SchemaRDD, through its less commonly
>>>
>>> used
>>>>>>
>>>>>> DSL
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> originally
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> designed for writing test cases, always has the
>>>>>>>>>>>>>>>>>>>> data-frame
>>>>>>>>>>>>>>>>>>>> like
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> API.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1.3, we are redesigning the API to make the API
>>>
>>> usable
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> end
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> users.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name
>>>
>>> than
>>>>>>>>>>>>>
>>>>>>>>>>>>> SchemaRDD.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an
>>>>>>>>>>>>>>>>>>>> RDD
>>>>>>>>>>>>>
>>>>>>>>>>>>> anymore
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (even
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> though it would contain some RDD functions like map,
>>>>>>>>>>>>>>>>>>>> flatMap,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> etc),
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> calling it Schema*RDD* while it is not an RDD is
>>>
>>> highly
>>>>>>>>>>>>>
>>>>>>>>>>>>> confusing.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Instead.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all
>>>>>>>>>>>>>>>>>>>> RDD
>>>>>>>>>>>>>
>>>>>>>>>>>>> methods.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My understanding is that very few users program
>>>>>>>>>>>>>>>>>>>> directly
>>>>>>>>>>>>>
>>>>>>>>>>>>> against
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> SchemaRDD API at the moment, because they are not
>>>
>>> well
>>>>>>>>>>>>>
>>>>>>>>>>>>> documented.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> However,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> oo maintain backward compatibility, we can create a
>>>>>>>>>>>>>>>>>>>> type
>>>>>>>>>>>>>>>>>>>> alias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> DataFrame
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> that is still named SchemaRDD. This will maintain
>>>>>>>>>>>>>>>>>>>> source
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> compatibility
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Scala. That said, we will have to update all existing
>>>>>>>>>>>>>
>>>>>>>>>>>>> materials to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> DataFrame rather than SchemaRDD.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>>>
>>> [email protected]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail:
>>>
>>> [email protected]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>>>>>>>> For additional commands, e-mail:
>>>
>>> [email protected]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>> ---------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: renaming SchemaRDD -> DataFrame

Reply via email to