+1.... having proper NA support is much cleaner than using null, at least the Java null.
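The NA-vs-null distinction discussed in this thread can be sketched in a few lines of plain Python (hypothetical helper names, not any Spark or pandas API): SQL's NULL yields three-valued logic, whereas an R-style NA is a first-class "missing" value that keeps its slot in the data, and R's NULL drops the element entirely.

```python
# Sketch of the semantics under discussion (toy code, hypothetical names).

NULL = None  # stands in for SQL NULL / Java null: "no value here"

def sql_eq(a, b):
    """SQL three-valued equality: any comparison involving NULL is unknown."""
    if a is NULL or b is NULL:
        return None          # unknown -- neither True nor False
    return a == b

class NA:
    """R-style NA: a first-class 'missing' marker that lives inside data."""
    def __repr__(self):
        return "NA"

na = NA()
column_with_na = [1, na, 3]   # NA keeps its slot in the column...
column_after_null = [1, 3]    # ...while R's NULL removes the element.

assert sql_eq(NULL, NULL) is None   # NULL == NULL is unknown, not True
assert sql_eq(2, 2) is True
assert len(column_with_na) == 3
assert len(column_after_null) == 2
```

This is why mapping NA straight onto Java null can lose information: null alone cannot distinguish "missing" from "not meaningful", and it collapses three-valued comparisons into exceptions or plain false.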
On Wed, Jan 28, 2015 at 6:10 PM, Evan R. Sparks <evan.spa...@gmail.com> wrote:

You've got to be a little bit careful here. "NA" in systems like R or pandas may have special meaning that is distinct from "null".

See, e.g., http://www.r-bloggers.com/r-na-vs-null/

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:

Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:

I believe that most DataFrame implementations out there, like Pandas, support the idea of missing values / NA, and some support the idea of Not Meaningful as well.

Does Row support anything like that? That is important for certain applications. I thought that Row worked by being a mutable object, but I haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:

It shouldn't change the data source API at all, because data sources create RDD[Row], and that gets converted into a DataFrame automatically (previously, into a SchemaRDD).

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location of types. Types were previously defined in sql.catalyst.types and have now moved to sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs have first-class classes/objects defined in sql directly.

On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:

Hey guys,

How does this impact the data sources API? I was planning on using this for a project.

+1 that many things from Spark SQL / DataFrame are universally desirable and useful.
By the way, one thing that prevents the columnar compression stuff in Spark SQL from being more useful is that, at least from previous talks with Reynold and Michael et al., the format was not designed for persistence.

I have a new project that aims to change that. It is a zero-serialisation, high-performance binary vector library, designed from the outset to be persistent-storage friendly. Maybe one day it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of Parquet, but I think it's worth it. LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:

Alright, I have merged the patch (https://github.com/apache/spark/pull/4173) since I don't see any strong opinions against it (as a matter of fact, most were for it). We can still change it if somebody lays out a strong argument.

On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

The type alias means your methods can specify either type and they will work. It's just another name for the same type. But Scaladocs and such will show DataFrame as the type.

Matei

On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Reynold,
But with the type alias we will have the same problem, right? If the methods don't receive SchemaRDD anymore, we will have to change our code to migrate from SchemaRDD to DataFrame.
Unless we have an implicit conversion between DataFrame and SchemaRDD.

2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:

Dirceu,

That is not possible because one cannot overload return types.

SQLContext.parquetFile (and many other methods) needs to return some type, and that type cannot be both SchemaRDD and DataFrame.

In 1.3, we will create a type alias for DataFrame called SchemaRDD to not break source compatibility for Scala.

On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Can't the SchemaRDD remain the same, but deprecated, and be removed in release 1.5 (+/- 1), for example, with the new code added to DataFrame? With this, we don't impact existing code for the next few releases.

2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:

I want to address the issue that Matei raised about the heavy lifting required for full SQL support. It is amazing that even after 30 years of research there is not a single good open-source columnar database like Vertica. There is a column-store option in MySQL, but it is not nearly as sophisticated as Vertica or MonetDB.
But there's a true need for such a system. I wonder why that is, and it's high time to change it.

On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:

Both SchemaRDD and DataFrame sound fine to me, though I like the former slightly better because it's more descriptive.

Even if SchemaRDD needs to rely on Spark SQL under the covers, it would be clearer from a user-facing perspective to at least choose a package name for it that omits "sql".

I would also be in favor of adding a separate Spark Schema module for Spark SQL to rely on, but I imagine that might be too large a change at this point?

-Sandy

On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

(Actually, when we designed Spark SQL we thought of giving it another name, like Spark Schema, but we decided to stick with SQL since that was the most obvious use case to many users.)

Matei
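As an editorial aside on the column-store discussion above: the scan advantage of a columnar layout can be illustrated with plain Python lists (toy data, no real storage engine) -- each field is stored contiguously, so aggregating one column touches only that one array.

```python
# Row-oriented vs column-oriented layout of the same toy table.
rows = [("a", 1), ("b", 2), ("c", 3)]     # row store: one tuple per record
columns = {"name": ["a", "b", "c"],       # column store: one array per field
           "value": [1, 2, 3]}

# Summing one field: the row store must walk every record...
row_sum = sum(r[1] for r in rows)
# ...while the column store scans a single contiguous list, which is also
# far friendlier to compression (e.g. RLE, delta encoding).
col_sum = sum(columns["value"])

assert row_sum == col_sum == 6
```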
On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

While it might be possible to move this concept to Spark Core long-term, supporting structured data efficiently does require quite a bit of the infrastructure in Spark SQL, such as query planning and columnar storage. The intent of Spark SQL, though, is to be more than a SQL server -- it's meant to be a library for manipulating structured data. Since this is possible to build over the core API, it's pretty natural to organize it that way, same as Spark Streaming is a library.

Matei

On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:

"The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API."

i agree.
this to me also implies it belongs in spark core, not sql

On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:

And in the off chance that anyone hasn't seen it yet, the Jan. 13 Bay Area Spark Meetup video on YouTube contained a wealth of background information on this idea (mostly from Patrick and Reynold :-).

https://www.youtube.com/watch?v=YWppYPWznSQ

________________________________
From: Patrick Wendell <pwend...@gmail.com>
To: Reynold Xin <r...@databricks.com>
Cc: "dev@spark.apache.org" <dev@spark.apache.org>
Sent: Monday, January 26, 2015 4:01 PM
Subject: Re: renaming SchemaRDD -> DataFrame

One thing potentially not clear from this e-mail: there will be a 1:1 correspondence where you can get an RDD to/from a DataFrame.

On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:

Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.
The context is that SchemaRDD is becoming a common data format used for bringing data into Spark from external systems, and used for various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to be programming directly against the SchemaRDD API rather than the core RDD API. SchemaRDD, through its less commonly used DSL originally designed for writing test cases, has always had the data-frame-like API. In 1.3, we are redesigning the API to make it usable for end users.

There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc.), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.
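Motivation 2 -- a DataFrame that is no longer an RDD but still carries RDD-like methods and exposes the underlying RDD via .rdd -- can be sketched with hypothetical minimal classes (illustrative only, not the real Spark API):

```python
# Toy sketch: DataFrame delegates map/flatMap to an underlying RDD
# without subclassing RDD (hypothetical classes, illustrative only).

class RDD:
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        return RDD(f(x) for x in self.data)
    def flatMap(self, f):
        return RDD(y for x in self.data for y in f(x))

class DataFrame:                      # note: NOT a subclass of RDD
    def __init__(self, rdd):
        self.rdd = rdd                # DataFrame.rdd exposes the RDD
    def map(self, f):
        return self.rdd.map(f)
    def flatMap(self, f):
        return self.rdd.flatMap(f)

df = DataFrame(RDD([1, 2, 3]))
assert not isinstance(df, RDD)                      # not an RDD anymore
assert df.map(lambda x: x * 2).data == [2, 4, 6]    # but still maps
assert df.flatMap(lambda x: [x, x]).data == [1, 1, 2, 2, 3, 3]
```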
My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.
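The type-alias plan (in Scala, roughly `type SchemaRDD = DataFrame`) has a close analogue in Python name binding. A sketch with hypothetical classes shows why an alias needs no conversion, implicit or otherwise: both names denote the very same type.

```python
# Toy sketch of a type alias (hypothetical classes, not the Spark API).

class DataFrame:
    def __init__(self, rows):
        self.rows = rows

SchemaRDD = DataFrame        # the alias: a second name for the SAME class

def count_rows(df):
    # Accepts values built under either name -- they are one type.
    return len(df.rows)

assert SchemaRDD is DataFrame                 # identical, not a subclass
assert isinstance(SchemaRDD([1, 2]), DataFrame)
assert count_rows(SchemaRDD([1, 2])) == 2
assert count_rows(DataFrame([1, 2, 3])) == 3
```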
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org