thanks matei, it's good to know i can create them like that. reynold, yeah, somehow the words SQL get me going :) sorry... agreed that you need new transformations to preserve the schema info. i misunderstood and thought i had to implement the whole bunch, but that is clearly not necessary, as matei indicated.
alright, i am clearly being slow/dense here, but now it makes sense to me...

On Tue, Feb 10, 2015 at 2:58 PM, Reynold Xin <r...@databricks.com> wrote:

> Koert,
>
> Don't get too hung up on the name SQL. This is exactly what you want: a
> collection of record-like objects with field names and runtime types.
>
> Almost all of the 40 methods are transformations for structured data,
> such as aggregation on a field, or filtering on a field. If all you have
> is the old RDD-style map/flatMap, then any transformation would lose the
> schema information, making the extra schema information useless.

On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:

> so i understand the success of spark.sql. besides the fact that anything
> with the word SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for
> a generic RDD that holds record-like objects, with field names and
> runtime types. after all, that is a successful generic abstraction used
> in many structured data tools.
>
> but to me that abstraction is as simple as:
>
>   trait SchemaRDD extends RDD[Row] {
>     def schema: StructType
>   }
>
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of
> columns). so that could be DataFrame.
>
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
>
> what i do not understand is why it has to be much more complex than
> this. if i look at DataFrame it has so much additional stuff that has
> (in my eyes) nothing to do with generic structured data analysis.
>
> for example, to implement DataFrame i need to implement about 40
> additional methods!? and for some of them the SQL-ness is obviously
> leaking into the abstraction. for example, why would i care about:
>
>   def registerTempTable(tableName: String): Unit
>
> best, koert
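For concreteness, the contract sketched above is small enough to exercise end to end. The following is a sketch only: the CsvSchemaRDD name, the naive comma splitting, and the all-strings schema are illustrative assumptions, not anything Spark ships (imports assume the 1.3-era org.apache.spark.sql.types layout).

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // the proposed contract, exactly as written above
    trait SchemaRDD extends RDD[Row] {
      def schema: StructType
    }

    // a hypothetical csv loader against that contract: wraps an existing
    // RDD[Row] one-to-one and carries the schema alongside it
    class CsvSchemaRDD(prev: RDD[Row], val schema: StructType)
      extends RDD[Row](prev) with SchemaRDD {
      override def compute(split: Partition, context: TaskContext): Iterator[Row] =
        prev.iterator(split, context)
      override protected def getPartitions: Array[Partition] = prev.partitions
    }

    object CsvSchemaRDD {
      // no quoting/escaping, every column typed as string -- a sketch only
      def load(sc: SparkContext, path: String, header: Seq[String]): CsvSchemaRDD = {
        val schema = StructType(header.map(StructField(_, StringType, nullable = true)))
        val rows = sc.textFile(path).map(line => Row.fromSeq(line.split(",", -1).toSeq))
        new CsvSchemaRDD(rows, schema)
      }
    }

With a trait like this, a downstream user only needs rows plus a schema; everything else (csv quoting, type inference) can live in the loader.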
On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.git...@gmail.com> wrote:

> It is true that you can persist SchemaRDDs / DataFrames to disk via
> Parquet, but a lot of time and efficiency is lost. The in-memory
> columnar cached representation is completely different from the Parquet
> file format, and I believe there has to be a translation into a Row
> (because ultimately Spark SQL traverses Rows -- even the
> InMemoryColumnarTableScan has to convert the columns into Rows for
> row-based processing). On the other hand, traditional data frames
> process in a columnar fashion. Columnar storage is good, but nowhere
> near as good as columnar processing.
>
> Another issue, which I don't know has been solved yet, is that it is
> difficult for Tachyon to efficiently cache Parquet files without
> understanding the file format itself.
>
> I gave a talk at last year's Spark Summit on this topic.
>
> I'm working on efforts to change this, however. Shoot me an email at
> velvia at gmail if you're interested in joining forces.

On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> Yes, when a DataFrame is cached in memory, it's stored in an efficient
> columnar format. And you can also easily persist it on disk using
> Parquet, which is also columnar.
>
> Cheng

On 1/29/15 1:24 PM, Koert Kuipers wrote:

> to me the word DataFrame does come with certain expectations. one of
> them is that the data is stored columnar. R's data.frame internally uses
> a list of sequences, i think, but since lists can have labels it's more
> like a SortedMap[String, Array[_]]. this makes certain operations very
> cheap (such as adding a column).
>
> in Spark the closest thing would be a data structure where, per
> Partition, the data is also stored columnar. does Spark SQL already use
> something like that? Evan mentioned "Spark SQL columnar compression",
> which sounds like it. where can i find that?
>
> thanks
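As a rough model of the representation described above (and only a model; Spark SQL's actual in-memory columnar cache works differently), a per-partition column store can be a sorted map from column name to array, which makes adding or projecting columns cheap because no row is ever traversed:

    import scala.collection.immutable.SortedMap

    // a toy column-oriented partition in the spirit of R's data.frame:
    // column name -> column values. illustrative only.
    case class ColumnarPartition(columns: SortedMap[String, Array[Any]]) {
      def numRows: Int = columns.headOption.map(_._2.length).getOrElse(0)

      // adding a column allocates only the new array; existing columns
      // are shared, and no row is rewritten
      def withColumn(name: String, values: Array[Any]): ColumnarPartition = {
        require(columns.isEmpty || values.length == numRows, "column length mismatch")
        ColumnarPartition(columns + (name -> values))
      }

      // projecting a subset of columns is equally cheap: no row traversal
      def select(names: String*): ColumnarPartition =
        ColumnarPartition(SortedMap(names.map(n => n -> columns(n)): _*))
    }

    object ColumnarDemo {
      def main(args: Array[String]): Unit = {
        val part = ColumnarPartition(SortedMap(
          "id"   -> Array[Any](1, 2, 3),
          "name" -> Array[Any]("a", "b", "c")))
        // derive and append a column for the cost of one new array
        val lens = part.columns("name").map(v => v.asInstanceOf[String].length: Any)
        val projected = part.withColumn("name_len", lens).select("id", "name_len")
        println(projected.columns.keys.mkString(", "))
      }
    }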
On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> +1... having proper NA support is much cleaner than using null, at
> least the Java null.

On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

> You've got to be a little bit careful here. "NA" in systems like R or
> pandas may have special meaning that is distinct from "null".
>
> See, e.g. http://www.r-bloggers.com/r-na-vs-null/

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:

> Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> I believe that most DataFrame implementations out there, like Pandas,
> support the idea of missing values / NA, and some support the idea of
> Not Meaningful as well.
>
> Does Row support anything like that? That is important for certain
> applications. I thought that Row worked by being a mutable object, but
> haven't looked into the details in a while.
>
> -Evan
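For what it's worth, the distinction being circled here can be made concrete in a few lines of Scala. This is a standalone illustration of NA-vs-null semantics, not Spark's Row API:

    // standalone sketch: modeling a missing numeric cell with an explicit
    // ADT instead of Java null, so "missing" and "not meaningful" can be
    // told apart, as R/pandas semantics are described above
    object NaDemo {
      sealed trait Cell
      case class Value(v: Double) extends Cell
      case object NA extends Cell            // missing; propagates unless removed
      case object NotMeaningful extends Cell // defined to carry no information

      // mirrors R's mean(x, na.rm = ...): with naRm = false a single NA
      // makes the whole aggregate undefined. (SQL's AVG, by contrast,
      // silently skips NULLs -- part of why NA and null are not the same.)
      def mean(cells: Seq[Cell], naRm: Boolean): Option[Double] = {
        if (!naRm && cells.contains(NA)) None
        else {
          val vs = cells.collect { case Value(v) => v }
          if (vs.isEmpty) None else Some(vs.sum / vs.size)
        }
      }

      def main(args: Array[String]): Unit = {
        val xs = Seq(Value(1.0), NA, Value(3.0))
        println(mean(xs, naRm = false)) // None: NA propagates
        println(mean(xs, naRm = true))  // Some(2.0): NA skipped
      }
    }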
On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:

> It shouldn't change the data source API at all, because data sources
> create RDD[Row], and that gets converted into a DataFrame automatically
> (previously to SchemaRDD).
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
> One thing that will break the data source API in 1.3 is the location of
> types. Types were previously defined in sql.catalyst.types, and have now
> moved to sql.types. After 1.3, sql.catalyst is hidden from users, and
> all public APIs have first-class classes/objects defined in sql
> directly.

On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:

> Hey guys,
>
> How does this impact the data sources API? I was planning on using this
> for a project.
>
> +1 that many things from spark-sql / DataFrame are universally
> desirable and useful.
>
> By the way, one thing that prevents the columnar compression stuff in
> Spark SQL from being more useful is, at least from previous talks with
> Reynold and Michael et al., that the format was not designed for
> persistence.
>
> I have a new project that aims to change that. It is a
> zero-serialisation, high-performance binary vector library, designed
> from the outset to be persistent-storage friendly. Maybe one day it can
> replace the Spark SQL columnar compression.
>
> Michael told me this would be a lot of work, and recreates parts of
> Parquet, but I think it's worth it. LMK if you'd like more details.
>
> -Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:

> Alright, I have merged the patch
> (https://github.com/apache/spark/pull/4173) since I don't see any
> strong opinions against it (as a matter of fact, most were for it). We
> can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> The type alias means your methods can specify either type and they will
> work. It's just another name for the same type. But Scaladocs and such
> will show DataFrame as the type.
>
> Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

> Reynold,
> But with a type alias we will have the same problem, right?
> If the methods don't receive SchemaRDD anymore, we will have to change
> our code to migrate from SchemaRDD to DataFrame. Unless we have an
> implicit conversion between DataFrame and SchemaRDD.

2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

> Dirceu,
>
> That is not possible because one cannot overload return types.
>
> SQLContext.parquetFile (and many other methods) needs to return some
> type, and that type cannot be both SchemaRDD and DataFrame.
>
> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> not break source compatibility for Scala.
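An alias of the kind described here is a one-line construct in Scala. A toy model (hypothetical names, not Spark's real classes) shows both why old code keeps compiling and why overloading on return type alone could never work:

    object AliasDemo {
      // toy stand-in with a hypothetical field; not Spark's DataFrame
      class DataFrame(val columns: Seq[String])

      // the alias: SchemaRDD and DataFrame are now the *same* type, so any
      // method typed against one accepts the other; Scaladoc shows DataFrame
      @deprecated("use DataFrame", "1.3.0")
      type SchemaRDD = DataFrame

      // old code written against SchemaRDD keeps compiling, and values
      // typed as DataFrame flow into it unchanged
      def describe(df: SchemaRDD): String = df.columns.mkString(", ")

      def main(args: Array[String]): Unit = {
        val df = new DataFrame(Seq("id", "name"))
        println(describe(df))
        // by contrast, overloading on return type alone is impossible:
        //   def parquetFile(path: String): SchemaRDD = ...
        //   def parquetFile(path: String): DataFrame = ...  // same signature
      }
    }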
On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

> Can't the SchemaRDD remain the same, but deprecated, and be removed in
> release 1.5 (+/- 1), for example, with the new code being added to
> DataFrame? With this, we don't impact existing code for the next few
> releases.

2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

> I want to address the issue that Matei raised about the heavy lifting
> required for full SQL support. It is amazing that even after 30 years of
> research there is not a single good open source columnar database like
> Vertica. There is a column store option in MySQL, but it is not nearly
> as sophisticated as Vertica or MonetDB. But there's a true need for such
> a system. I wonder why that is, and it's high time to change it.

On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:

> Both SchemaRDD and DataFrame sound fine to me, though I like the former
> slightly better because it's more descriptive.
>
> Even if SchemaRDD needs to rely on Spark SQL under the covers, it would
> be clearer from a user-facing perspective to at least choose a package
> name for it that omits "sql".
>
> I would also be in favor of adding a separate Spark Schema module for
> Spark SQL to rely on, but I imagine that might be too large a change at
> this point?
>
> -Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> (Actually when we designed Spark SQL we thought of giving it another
> name, like Spark Schema, but we decided to stick with SQL since that was
> the most obvious use case to many users.)
>
> Matei

On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> While it might be possible to move this concept to Spark Core
> long-term, supporting structured data efficiently does require quite a
> bit of the infrastructure in Spark SQL, such as query planning and
> columnar storage. The intent of Spark SQL, though, is to be more than a
> SQL server -- it's meant to be a library for manipulating structured
> data. Since this is possible to build over the core API, it's pretty
> natural to organize it that way, same as Spark Streaming is a library.
>
> Matei
On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:

> "The context is that SchemaRDD is becoming a common data format used
> for bringing data into Spark from external systems, and used for
> various components of Spark, e.g. MLlib's new pipeline API."
>
> i agree. this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

> And on the off chance that anyone hasn't seen it yet, the Jan. 13 Bay
> Area Spark Meetup YouTube video contained a wealth of background
> information on this idea (mostly from Patrick and Reynold :-).
>
> https://www.youtube.com/watch?v=YWppYPWznSQ
>
> ________________________________
> From: Patrick Wendell <pwend...@gmail.com>
> To: Reynold Xin <r...@databricks.com>
> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
> Sent: Monday, January 26, 2015 4:01 PM
> Subject: Re: renaming SchemaRDD -> DataFrame
>
> One thing potentially not clear from this e-mail: there will be a 1:1
> correspondence where you can get an RDD to/from a DataFrame.
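That 1:1 correspondence is easy to see in the 1.3-style API. A minimal round trip, sketched here assuming a local SparkContext and the createDataFrame/.rdd methods that shipped with the rename:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object RoundTrip {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("roundtrip"))
        val sqlContext = new SQLContext(sc)

        // RDD -> DataFrame: pair an RDD[Row] with an explicit schema
        val rows = sc.parallelize(Seq(Row(1, "a"), Row(2, "b")))
        val schema = StructType(Seq(
          StructField("id", IntegerType, nullable = false),
          StructField("name", StringType, nullable = true)))
        val df = sqlContext.createDataFrame(rows, schema)

        // DataFrame -> RDD: .rdd exposes the underlying RDD[Row] again,
        // at the cost of dropping back to untyped, row-at-a-time access
        val back = df.rdd.map(r => (r.getInt(0), r.getString(1)))
        back.collect().foreach(println)

        sc.stop()
      }
    }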
On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:

> Hi,
>
> We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted
> to get the community's opinion.
>
> The context is that SchemaRDD is becoming a common data format used for
> bringing data into Spark from external systems, and used for various
> components of Spark, e.g. MLlib's new pipeline API. We also expect more
> and more users to be programming directly against the SchemaRDD API
> rather than the core RDD API. SchemaRDD, through its less commonly used
> DSL originally designed for writing test cases, has always had a
> data-frame-like API. In 1.3, we are redesigning the API to make it
> usable for end users.
>
> There are two motivations for the renaming:
>
> 1. DataFrame seems to be a more self-evident name than SchemaRDD.
>
> 2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
> though it would contain some RDD functions like map, flatMap, etc.),
> and calling it Schema*RDD* while it is not an RDD is highly confusing.
> Instead, DataFrame.rdd will return the underlying RDD for all RDD
> methods.
>
> My understanding is that very few users program directly against the
> SchemaRDD API at the moment, because it is not well documented.
> However, to maintain backward compatibility, we can create a type alias
> for DataFrame that is still named SchemaRDD. This will maintain source
> compatibility for Scala.
That said, we will have to update all >> existing >> > >>>>>>>>>>>>> >> > >>>>>>>>>>>>> materials to >> > >>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>> use >> > >>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>> DataFrame rather than SchemaRDD. >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>> >> > --------------------------------------------------------------------- >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: >> > >>> >> > >>> dev-unsubscr...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> For additional commands, e-mail: >> > >>>>>>>>>>>>>>>>>>> dev-h...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>> >> > >>>>>> >> > --------------------------------------------------------------------- >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> To unsubscribe, e-mail: >> > >>> >> > >>> dev-unsubscr...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> For additional commands, e-mail: >> > >>>>>>>>>>>>>>>>>>> dev-h...@spark.apache.org >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>> >> > >>>>>> >> > --------------------------------------------------------------------- >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> To unsubscribe, e-mail: >> dev-unsubscr...@spark.apache.org >> > >>>>>>>>>>>>>>>> For additional commands, e-mail: >> > >>> >> > >>> dev-h...@spark.apache.org >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>> >> > >>>>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>> >> --------------------------------------------------------------------- >> > >>>>>>>>>> >> > >>>>>>>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> > >>>>>>>>>> For additional commands, e-mail: dev-h...@spark.apache.org >> > >>>>>>>>>> >> > >>>>>>>>>> >> > >>>>>>> >> > >>>> >> > >>> >> --------------------------------------------------------------------- >> > >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> > >>> For additional commands, e-mail: dev-h...@spark.apache.org >> > >>> >> > >>> >> > > >> > > >> > > --------------------------------------------------------------------- >> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> > > For additional commands, e-mail: dev-h...@spark.apache.org >> > > >> > >> > >