Koert,

Don't get too hung up on the name SQL. This is exactly what you want: a collection of record-like objects with field names and runtime types.

Almost all of the 40 methods are transformations for structured data, such as aggregating on a field or filtering on a field. If all you have is the old RDD-style map/flatMap, then any transformation loses the schema information, making the extra schema information useless.
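For example (a rough sketch against the 1.3 API; assume df is a DataFrame with columns "dept" and "amount"):

  import org.apache.spark.sql.functions.sum

  // schema-aware transformations keep field names and runtime types:
  val totals = df.filter(df("amount") > 0).groupBy("dept").agg(sum("amount"))

  // dropping to RDD-style map discards the schema immediately:
  val doubled = df.rdd.map(row => row.getInt(1) * 2)  // just an RDD[Int]
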
On Tue, Feb 10, 2015 at 11:47 AM, Koert Kuipers <ko...@tresata.com> wrote:

> so i understand the success of spark.sql. besides the fact that anything
> with the word SQL in its name will have thousands of developers running
> towards it because of the familiarity, there is also a genuine need for a
> generic RDD that holds record-like objects, with field names and runtime
> types. after all, that is a successful generic abstraction used in many
> structured data tools.
>
> but to me that abstraction is as simple as:
>
> trait SchemaRDD extends RDD[Row] {
>   def schema: StructType
> }
>
> and perhaps another abstraction to indicate it intends to be column
> oriented (with a few methods to efficiently extract a subset of columns).
> so that could be DataFrame.
>
> such simple contracts would allow many people to write loaders for this
> (say from csv) and whatnot.
>
> what i do not understand is why it has to be much more complex than this.
> but if i look at DataFrame it has so much additional stuff, that has (in
> my eyes) nothing to do with generic structured data analysis.
>
> for example, to implement DataFrame i need to implement about 40
> additional methods!? and for some, the SQLness is obviously leaking into
> the abstraction. for example, why would i care about:
>
> def registerTempTable(tableName: String): Unit
>
> best, koert
>
> On Sun, Feb 1, 2015 at 3:31 AM, Evan Chan <velvia.git...@gmail.com> wrote:
>
>> It is true that you can persist SchemaRDDs / DataFrames to disk via
>> Parquet, but a lot of time and efficiency is lost. The in-memory
>> columnar cached representation is completely different from the Parquet
>> file format, and I believe there has to be a translation into a Row
>> (because ultimately Spark SQL traverses Rows -- even the
>> InMemoryColumnarTableScan has to then convert the columns into Rows for
>> row-based processing). On the other hand, traditional data frames
>> process in a columnar fashion. Columnar storage is good, but nowhere
>> near as good as columnar processing.
>>
>> Another issue, which I don't know if it is solved yet: it is difficult
>> for Tachyon to efficiently cache Parquet files without understanding
>> the file format itself.
>>
>> I gave a talk at last year's Spark Summit on this topic.
>>
>> I'm working on efforts to change this, however. Shoot me an email at
>> velvia at gmail if you're interested in joining forces.
>>
>> On Thu, Jan 29, 2015 at 1:59 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> Yes, when a DataFrame is cached in memory, it's stored in an efficient
>>> columnar format. And you can also easily persist it on disk using
>>> Parquet, which is also columnar.
>>>
>>> Cheng
>>>
>>> On 1/29/15 1:24 PM, Koert Kuipers wrote:
>>>
>>>> to me the word DataFrame does come with certain expectations. one of
>>>> them is that the data is stored columnar. in R, data.frame internally
>>>> uses a list of sequences i think, but since lists can have labels it's
>>>> more like a SortedMap[String, Array[_]]. this makes certain operations
>>>> very cheap (such as adding a column).
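>>>>
>>>> roughly this shape, i think (just a sketch, not R's actual internals):
>>>>
>>>>   import scala.collection.SortedMap
>>>>
>>>>   // labeled columns; adding one copies no existing data
>>>>   case class Frame(columns: SortedMap[String, Array[Any]]) {
>>>>     def withColumn(name: String, values: Array[Any]): Frame =
>>>>       Frame(columns + (name -> values))
>>>>   }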
>>>>
>>>> in Spark the closest thing would be a data structure where, per
>>>> Partition, the data is also stored columnar. does spark SQL already
>>>> use something like that? Evan mentioned "Spark SQL columnar
>>>> compression", which sounds like it. where can i find that?
>>>>
>>>> thanks
>>>>
>>>> On Thu, Jan 29, 2015 at 2:32 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>>>
>>>>> +1... having proper NA support is much cleaner than using null, at
>>>>> least the Java null.
>>>>>
>>>>> On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:
>>>>>
>>>>>> You've got to be a little bit careful here. "NA" in systems like R
>>>>>> or pandas may have special meaning that is distinct from "null".
>>>>>>
>>>>>> See, e.g. http://www.r-bloggers.com/r-na-vs-null/
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>
>>>>>>> Isn't that just "null" in SQL?
>>>>>>>
>>>>>>> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I believe that most DataFrame implementations out there, like
>>>>>>>> Pandas, support the idea of missing values / NA, and some support
>>>>>>>> the idea of Not Meaningful as well.
>>>>>>>>
>>>>>>>> Does Row support anything like that? That is important for
>>>>>>>> certain applications. I thought that Row worked by being a
>>>>>>>> mutable object, but haven't looked into the details in a while.
>>>>>>>>
>>>>>>>> -Evan
>>>>>>>>
>>>>>>>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> It shouldn't change the data source api at all, because data
>>>>>>>>> sources create RDD[Row], and that gets converted into a
>>>>>>>>> DataFrame automatically (previously to SchemaRDD).
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>>>>>>>
>>>>>>>>> One thing that will break the data source API in 1.3 is the
>>>>>>>>> location of types. Types were previously defined in
>>>>>>>>> sql.catalyst.types, and are now moved to sql.types. After 1.3,
>>>>>>>>> sql.catalyst is hidden from users, and all public APIs have
>>>>>>>>> first-class classes/objects defined in sql directly.
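>>>>>>>>>
>>>>>>>>> A minimal relation then looks roughly like this (a sketch against
>>>>>>>>> the 1.3 interfaces; the class name is made up):
>>>>>>>>>
>>>>>>>>>   import org.apache.spark.rdd.RDD
>>>>>>>>>   import org.apache.spark.sql.{Row, SQLContext}
>>>>>>>>>   import org.apache.spark.sql.sources.{BaseRelation, TableScan}
>>>>>>>>>   import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>>>>>>>>>
>>>>>>>>>   // a one-column relation; the engine wraps buildScan's
>>>>>>>>>   // RDD[Row] in a DataFrame for you
>>>>>>>>>   class OneColumnRelation(val sqlContext: SQLContext)
>>>>>>>>>       extends BaseRelation with TableScan {
>>>>>>>>>     override def schema =
>>>>>>>>>       StructType(StructField("value", IntegerType) :: Nil)
>>>>>>>>>     override def buildScan(): RDD[Row] =
>>>>>>>>>       sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
>>>>>>>>>   }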
>>>>>>>>>
>>>>>>>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey guys,
>>>>>>>>>>
>>>>>>>>>> How does this impact the data sources API? I was planning on
>>>>>>>>>> using this for a project.
>>>>>>>>>>
>>>>>>>>>> +1 that many things from spark-sql / DataFrame are universally
>>>>>>>>>> desirable and useful.
>>>>>>>>>>
>>>>>>>>>> By the way, one thing that prevents the columnar compression
>>>>>>>>>> stuff in Spark SQL from being more useful is, at least from
>>>>>>>>>> previous talks with Reynold and Michael et al., that the format
>>>>>>>>>> was not designed for persistence.
>>>>>>>>>>
>>>>>>>>>> I have a new project that aims to change that. It is a
>>>>>>>>>> zero-serialisation, high-performance binary vector library,
>>>>>>>>>> designed from the outset to be persistent-storage friendly.
>>>>>>>>>> Maybe one day it can replace the Spark SQL columnar compression.
>>>>>>>>>>
>>>>>>>>>> Michael told me this would be a lot of work, and recreates
>>>>>>>>>> parts of Parquet, but I think it's worth it. LMK if you'd like
>>>>>>>>>> more details.
>>>>>>>>>>
>>>>>>>>>> -Evan
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Alright, I have merged the patch
>>>>>>>>>>> (https://github.com/apache/spark/pull/4173) since I don't see
>>>>>>>>>>> any strong opinions against it (as a matter of fact most were
>>>>>>>>>>> for it). We can still change it if somebody lays out a strong
>>>>>>>>>>> argument.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The type alias means your methods can specify either type and
>>>>>>>>>>>> they will work. It's just another name for the same type. But
>>>>>>>>>>>> Scaladocs and such will show DataFrame as the type.
>>>>>>>>>>>>
>>>>>>>>>>>> Matei
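>>>>>>>>>>>>
>>>>>>>>>>>> Roughly (the alias itself is what 1.3 will ship; the train
>>>>>>>>>>>> method is made up for illustration):
>>>>>>>>>>>>
>>>>>>>>>>>>   type SchemaRDD = DataFrame  // one type, two names
>>>>>>>>>>>>
>>>>>>>>>>>>   // old signatures keep compiling, and accept DataFrames:
>>>>>>>>>>>>   def train(data: SchemaRDD): Unit = ???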
>>>>>>>>>>>>
>>>>>>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Reynold,
>>>>>>>>>>>>> But with a type alias we will have the same problem, right?
>>>>>>>>>>>>> If the methods don't receive SchemaRDD anymore, we will have
>>>>>>>>>>>>> to change our code to migrate from SchemaRDD to DataFrame,
>>>>>>>>>>>>> unless we have an implicit conversion between DataFrame and
>>>>>>>>>>>>> SchemaRDD.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dirceu,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That is not possible because one cannot overload return
>>>>>>>>>>>>>> types. SQLContext.parquetFile (and many other methods)
>>>>>>>>>>>>>> needs to return some type, and that type cannot be both
>>>>>>>>>>>>>> SchemaRDD and DataFrame.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In 1.3, we will create a type alias for DataFrame called
>>>>>>>>>>>>>> SchemaRDD to not break source compatibility for Scala.
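>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For illustration only (overloads cannot differ in return
>>>>>>>>>>>>>> type alone, so this pair won't compile):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   def parquetFile(path: String): SchemaRDD = ???
>>>>>>>>>>>>>>   def parquetFile(path: String): DataFrame = ???  // rejected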
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can't the SchemaRDD remain the same, but deprecated, and
>>>>>>>>>>>>>>> be removed in release 1.5 (+/- 1) for example, and the new
>>>>>>>>>>>>>>> code be added to DataFrame?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this, we don't impact existing code for the next few
>>>>>>>>>>>>>>> releases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I want to address the issue that Matei raised about the
>>>>>>>>>>>>>>>> heavy lifting required for full SQL support. It is
>>>>>>>>>>>>>>>> amazing that even after 30 years of research there is not
>>>>>>>>>>>>>>>> a single good open source columnar database like Vertica.
>>>>>>>>>>>>>>>> There is a column store option in MySQL, but it is not
>>>>>>>>>>>>>>>> nearly as sophisticated as Vertica or MonetDB. But
>>>>>>>>>>>>>>>> there's a true need for such a system. I wonder why, and
>>>>>>>>>>>>>>>> it's high time to change that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I
>>>>>>>>>>>>>>>>> like the former slightly better because it's more
>>>>>>>>>>>>>>>>> descriptive.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Even if SchemaRDD needs to rely on Spark SQL under the
>>>>>>>>>>>>>>>>> covers, it would be clearer from a user-facing
>>>>>>>>>>>>>>>>> perspective to at least choose a package name for it
>>>>>>>>>>>>>>>>> that omits "sql".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would also be in favor of adding a separate Spark
>>>>>>>>>>>>>>>>> Schema module for Spark SQL to rely on, but I imagine
>>>>>>>>>>>>>>>>> that might be too large a change at this point?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Sandy
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (Actually when we designed Spark SQL we thought of
>>>>>>>>>>>>>>>>>> giving it another name, like Spark Schema, but we
>>>>>>>>>>>>>>>>>> decided to stick with SQL since that was the most
>>>>>>>>>>>>>>>>>> obvious use case to many users.)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While it might be possible to move this concept to
>>>>>>>>>>>>>>>>>>> Spark Core long-term, supporting structured data
>>>>>>>>>>>>>>>>>>> efficiently does require quite a bit of the
>>>>>>>>>>>>>>>>>>> infrastructure in Spark SQL, such as query planning
>>>>>>>>>>>>>>>>>>> and columnar storage. The intent of Spark SQL, though,
>>>>>>>>>>>>>>>>>>> is to be more than a SQL server -- it's meant to be a
>>>>>>>>>>>>>>>>>>> library for manipulating structured data. Since this
>>>>>>>>>>>>>>>>>>> is possible to build over the core API, it's pretty
>>>>>>>>>>>>>>>>>>> natural to organize it that way, same as Spark
>>>>>>>>>>>>>>>>>>> Streaming is a library.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common
>>>>>>>>>>>>>>>>>>>> data format used for bringing data into Spark from
>>>>>>>>>>>>>>>>>>>> external systems, and used for various components of
>>>>>>>>>>>>>>>>>>>> Spark, e.g. MLlib's new pipeline API."
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> i agree. this to me also implies it belongs in spark
>>>>>>>>>>>>>>>>>>>> core, not sql.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> And in the off chance that anyone hasn't seen it
>>>>>>>>>>>>>>>>>>>>> yet, the Jan. 13 Bay Area Spark Meetup YouTube
>>>>>>>>>>>>>>>>>>>>> contained a wealth of background information on this
>>>>>>>>>>>>>>>>>>>>> idea (mostly from Patrick and Reynold :-).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>>>>> From: Patrick Wendell <pwend...@gmail.com>
>>>>>>>>>>>>>>>>>>>>> To: Reynold Xin <r...@databricks.com>
>>>>>>>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>>>>>>>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail:
>>>>>>>>>>>>>>>>>>>>> there will be a 1:1 correspondence where you can get
>>>>>>>>>>>>>>>>>>>>> an RDD to/from a DataFrame.
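>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Roughly (a sketch of how the round trip is expected
>>>>>>>>>>>>>>>>>>>>> to look; createDataFrame, df and sqlContext are
>>>>>>>>>>>>>>>>>>>>> assumed):
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>   val rows: RDD[Row] = df.rdd  // DataFrame -> RDD
>>>>>>>>>>>>>>>>>>>>>   val df2: DataFrame =
>>>>>>>>>>>>>>>>>>>>>     sqlContext.createDataFrame(rows, df.schema)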
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>>>>>>>> in 1.3, and wanted to get the community's opinion.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common
>>>>>>>>>>>>>>>>>>>>>> data format used for bringing data into Spark from
>>>>>>>>>>>>>>>>>>>>>> external systems, and used for various components
>>>>>>>>>>>>>>>>>>>>>> of Spark, e.g. MLlib's new pipeline API. We also
>>>>>>>>>>>>>>>>>>>>>> expect more and more users to be programming
>>>>>>>>>>>>>>>>>>>>>> directly against the SchemaRDD API rather than the
>>>>>>>>>>>>>>>>>>>>>> core RDD API. SchemaRDD, through its less commonly
>>>>>>>>>>>>>>>>>>>>>> used DSL originally designed for writing test
>>>>>>>>>>>>>>>>>>>>>> cases, has always had the data-frame-like API. In
>>>>>>>>>>>>>>>>>>>>>> 1.3, we are redesigning the API to make it usable
>>>>>>>>>>>>>>>>>>>>>> for end users.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name
>>>>>>>>>>>>>>>>>>>>>> than SchemaRDD.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be
>>>>>>>>>>>>>>>>>>>>>> an RDD anymore (even though it would contain some
>>>>>>>>>>>>>>>>>>>>>> RDD functions like map, flatMap, etc), and calling
>>>>>>>>>>>>>>>>>>>>>> it Schema*RDD* while it is not an RDD is highly
>>>>>>>>>>>>>>>>>>>>>> confusing. Instead, DataFrame.rdd will return the
>>>>>>>>>>>>>>>>>>>>>> underlying RDD for all RDD methods.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> My understanding is that very few users program
>>>>>>>>>>>>>>>>>>>>>> directly against the SchemaRDD API at the moment,
>>>>>>>>>>>>>>>>>>>>>> because it is not well documented. However, to
>>>>>>>>>>>>>>>>>>>>>> maintain backward compatibility, we can create a
>>>>>>>>>>>>>>>>>>>>>> type alias for DataFrame that is still named
>>>>>>>>>>>>>>>>>>>>>> SchemaRDD. This will maintain source compatibility
>>>>>>>>>>>>>>>>>>>>>> for Scala. That said, we will have to update all
>>>>>>>>>>>>>>>>>>>>>> existing materials to use DataFrame rather than
>>>>>>>>>>>>>>>>>>>>>> SchemaRDD.