Yeah, it's "null". I was worried you couldn't represent it in Row because of primitive types like Int (unless you box the Int, which would be a performance hit). Anyway, I'll take another look at the Row API :-p
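For concreteness, here is a minimal sketch of what that looks like against the 1.3-era org.apache.spark.sql.Row API: a missing value is a plain null slot in the Row, and the primitive accessors have unboxed return types as long as you check isNullAt first. (The column values here are made up for illustration.)

    import org.apache.spark.sql.Row

    // A row whose second column (an Int) is missing: the slot is just null.
    val row = Row("alice", null)

    // getInt has a primitive Int return type, so no box is created on read;
    // guard it with isNullAt, since a null slot has no meaningful Int value.
    val age: Option[Int] = if (row.isNullAt(1)) None else Some(row.getInt(1))

    println(age)  // None -- the value was missing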
On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin <r...@databricks.com> wrote:
> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>> I believe that most DataFrame implementations out there, like Pandas,
>> support the idea of missing values / NA, and some support the idea of
>> Not Meaningful as well.
>>
>> Does Row support anything like that? That is important for certain
>> applications. I thought that Row worked by being a mutable object,
>> but I haven't looked at the details in a while.
>>
>> -Evan
>>
>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin <r...@databricks.com> wrote:
>>> It shouldn't change the data source API at all, because data sources
>>> create RDD[Row], and that gets converted into a DataFrame automatically
>>> (previously into a SchemaRDD).
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>>>
>>> One thing that will break the data source API in 1.3 is the location of
>>> types. Types were previously defined in sql.catalyst.types and have now
>>> moved to sql.types. After 1.3, sql.catalyst is hidden from users, and
>>> all public APIs have first-class classes/objects defined directly in sql.
>>>
>>> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan <velvia.git...@gmail.com> wrote:
>>>> Hey guys,
>>>>
>>>> How does this impact the data sources API? I was planning on using
>>>> it for a project.
>>>>
>>>> +1 that many things from spark-sql / DataFrame are universally
>>>> desirable and useful.
>>>>
>>>> By the way, one thing that prevents the columnar compression stuff in
>>>> Spark SQL from being more useful is, at least from previous talks with
>>>> Reynold and Michael et al., that the format was not designed for
>>>> persistence.
>>>>
>>>> I have a new project that aims to change that. It is a
>>>> zero-serialization, high-performance binary vector library, designed
>>>> from the outset to be persistent-storage friendly. Maybe one day it
>>>> can replace the Spark SQL columnar compression.
>>>>
>>>> Michael told me this would be a lot of work, and recreates parts of
>>>> Parquet, but I think it's worth it. LMK if you'd like more details.
>>>>
>>>> -Evan
>>>>
>>>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>> Alright, I have merged the patch
>>>>> (https://github.com/apache/spark/pull/4173) since I don't see any
>>>>> strong opinions against it (as a matter of fact, most were for it).
>>>>> We can still change it if somebody lays out a strong argument.
>>>>>
>>>>> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>> The type alias means your methods can specify either type and they
>>>>>> will work. It's just another name for the same type. But Scaladocs
>>>>>> and such will show DataFrame as the type.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>>> On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>>>>>>>
>>>>>>> Reynold,
>>>>>>> But with a type alias we will have the same problem, right?
>>>>>>> If the methods don't receive SchemaRDD anymore, we will have to
>>>>>>> change our code to migrate from SchemaRDD to DataFrame, unless we
>>>>>>> have an implicit conversion between DataFrame and SchemaRDD.
>>>>>>>
>>>>>>> 2015-01-27 17:18 GMT-02:00 Reynold Xin <r...@databricks.com>:
>>>>>>>
>>>>>>>> Dirceu,
>>>>>>>>
>>>>>>>> That is not possible, because one cannot overload a method on its
>>>>>>>> return type alone.
>>>>>>>>
>>>>>>>> SQLContext.parquetFile (and many other methods) needs to return
>>>>>>>> some type, and that type cannot be both SchemaRDD and DataFrame.
>>>>>>>>
>>>>>>>> In 1.3, we will create a type alias for DataFrame called SchemaRDD
>>>>>>>> so as not to break source compatibility for Scala.
>>>>>>>>
>>>>>>>> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Can't SchemaRDD remain the same, but deprecated, and be removed
>>>>>>>>> in release 1.5 (+/- 1) for example, with the new code added to
>>>>>>>>> DataFrame? That way we wouldn't impact existing code for the
>>>>>>>>> next few releases.
>>>>>>>>>
>>>>>>>>> 2015-01-27 0:02 GMT-02:00 Kushal Datta <kushal.da...@gmail.com>:
>>>>>>>>>
>>>>>>>>>> I want to address the issue that Matei raised about the heavy
>>>>>>>>>> lifting required for full SQL support. It is amazing that even
>>>>>>>>>> after 30 years of research there is not a single good open
>>>>>>>>>> source columnar database like Vertica. There is a column store
>>>>>>>>>> option in MySQL, but it is not nearly as sophisticated as
>>>>>>>>>> Vertica or MonetDB. Yet there is a true need for such a system.
>>>>>>>>>> I wonder why that is, and it's high time to change it.
>>>>>>>>>>
>>>>>>>>>> On Jan 26, 2015 5:47 PM, "Sandy Ryza" <sandy.r...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Both SchemaRDD and DataFrame sound fine to me, though I like
>>>>>>>>>>> the former slightly better because it's more descriptive.
>>>>>>>>>>>
>>>>>>>>>>> Even if SchemaRDD needs to rely on Spark SQL under the covers,
>>>>>>>>>>> it would be clearer from a user-facing perspective to at least
>>>>>>>>>>> choose a package name for it that omits "sql".
>>>>>>>>>>>
>>>>>>>>>>> I would also be in favor of adding a separate Spark Schema
>>>>>>>>>>> module for Spark SQL to rely on, but I imagine that might be
>>>>>>>>>>> too large a change at this point?
>>>>>>>>>>>
>>>>>>>>>>> -Sandy
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> (Actually, when we designed Spark SQL we thought of giving it
>>>>>>>>>>>> another name, like Spark Schema, but we decided to stick with
>>>>>>>>>>>> SQL since that was the most obvious use case to many users.)
>>>>>>>>>>>>
>>>>>>>>>>>> Matei
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> While it might be possible to move this concept to Spark Core
>>>>>>>>>>>>> long-term, supporting structured data efficiently does
>>>>>>>>>>>>> require quite a bit of the infrastructure in Spark SQL, such
>>>>>>>>>>>>> as query planning and columnar storage. The intent of Spark
>>>>>>>>>>>>> SQL, though, is to be more than a SQL server -- it's meant to
>>>>>>>>>>>>> be a library for manipulating structured data. Since this is
>>>>>>>>>>>>> possible to build over the core API, it's pretty natural to
>>>>>>>>>>>>> organize it that way, same as Spark Streaming is a library.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Matei
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jan 26, 2015, at 4:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> "The context is that SchemaRDD is becoming a common data
>>>>>>>>>>>>>> format used for bringing data into Spark from external
>>>>>>>>>>>>>> systems, and used for various components of Spark, e.g.
>>>>>>>>>>>>>> MLlib's new pipeline API."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree. To me this also implies it belongs in Spark core,
>>>>>>>>>>>>>> not sql.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 6:11 PM, Michael Malak <michaelma...@yahoo.com.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And on the off chance that anyone hasn't seen it yet, the
>>>>>>>>>>>>>>> Jan. 13 Bay Area Spark Meetup video on YouTube contained a
>>>>>>>>>>>>>>> wealth of background information on this idea (mostly from
>>>>>>>>>>>>>>> Patrick and Reynold :-).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://www.youtube.com/watch?v=YWppYPWznSQ
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>> From: Patrick Wendell <pwend...@gmail.com>
>>>>>>>>>>>>>>> To: Reynold Xin <r...@databricks.com>
>>>>>>>>>>>>>>> Cc: "dev@spark.apache.org" <dev@spark.apache.org>
>>>>>>>>>>>>>>> Sent: Monday, January 26, 2015 4:01 PM
>>>>>>>>>>>>>>> Subject: Re: renaming SchemaRDD -> DataFrame
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One thing potentially not clear from this e-mail: there
>>>>>>>>>>>>>>> will be a 1:1 correspondence where you can get an RDD
>>>>>>>>>>>>>>> to/from a DataFrame.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jan 26, 2015 at 2:18 PM, Reynold Xin <r...@databricks.com> wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We are considering renaming SchemaRDD -> DataFrame in 1.3,
>>>>>>>>>>>>>>>> and wanted to get the community's opinion.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The context is that SchemaRDD is becoming a common data
>>>>>>>>>>>>>>>> format used for bringing data into Spark from external
>>>>>>>>>>>>>>>> systems, and used for various components of Spark, e.g.
>>>>>>>>>>>>>>>> MLlib's new pipeline API. We also expect more and more
>>>>>>>>>>>>>>>> users to be programming directly against the SchemaRDD API
>>>>>>>>>>>>>>>> rather than the core RDD API. SchemaRDD, through its less
>>>>>>>>>>>>>>>> commonly used DSL originally designed for writing test
>>>>>>>>>>>>>>>> cases, has always had a data-frame-like API. In 1.3, we
>>>>>>>>>>>>>>>> are redesigning the API to make it usable for end users.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are two motivations for the renaming:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. DataFrame seems to be a more self-evident name than
>>>>>>>>>>>>>>>> SchemaRDD.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. SchemaRDD/DataFrame is actually not going to be an RDD
>>>>>>>>>>>>>>>> anymore (even though it will contain some RDD functions
>>>>>>>>>>>>>>>> like map, flatMap, etc.), and calling it Schema*RDD* while
>>>>>>>>>>>>>>>> it is not an RDD is highly confusing. Instead,
>>>>>>>>>>>>>>>> DataFrame.rdd will return the underlying RDD for all RDD
>>>>>>>>>>>>>>>> methods.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My understanding is that very few users program directly
>>>>>>>>>>>>>>>> against the SchemaRDD API at the moment, because it is not
>>>>>>>>>>>>>>>> well documented. However, to maintain backward
>>>>>>>>>>>>>>>> compatibility, we can create a type alias for DataFrame
>>>>>>>>>>>>>>>> that is still named SchemaRDD. This will maintain source
>>>>>>>>>>>>>>>> compatibility for Scala. That said, we will have to update
>>>>>>>>>>>>>>>> all existing materials to use DataFrame rather than
>>>>>>>>>>>>>>>> SchemaRDD.
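To make the type-alias mechanics discussed above concrete, here is a minimal, self-contained Scala sketch of the idea; AliasDemo, the stand-in DataFrame class, and describe are hypothetical names for illustration, not Spark's actual code:

    object AliasDemo {
      // Hypothetical stand-in for the real DataFrame class.
      class DataFrame {
        def count(): Long = 42L
      }

      // The compatibility trick: SchemaRDD is just another name for
      // DataFrame -- one class, two names, no implicit conversion needed.
      type SchemaRDD = DataFrame

      // Code written against the old name still compiles and accepts
      // values of the new type, because both names denote the same class.
      def describe(data: SchemaRDD): Long = data.count()

      def main(args: Array[String]): Unit = {
        println(describe(new DataFrame))  // prints 42
      }
    }

This is also why Scaladoc shows DataFrame: after the rename there is only one real type, and the alias is merely a second label for it.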
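And to illustrate Reynold's data source point from earlier in the thread: a source implements a relation that produces an RDD[Row] (with missing values as plain nulls), and Spark converts that into a DataFrame. A rough sketch against the 1.3-era interfaces, assuming the new org.apache.spark.sql.types location for the type objects; ToyRelation and its columns are made up:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // A toy relation: two columns, the second nullable.
    class ToyRelation(override val sqlContext: SQLContext)
        extends BaseRelation with TableScan {

      // Note: the type objects now come from sql.types, not sql.catalyst.types.
      override def schema: StructType = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType, nullable = true)))

      // The whole contract is to hand back an RDD[Row]; the renaming does
      // not touch this, since Spark wraps the result into a DataFrame.
      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, null)))
    }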