union of SchemaRDDs
I would like to combine 2 parquet tables I have created. I tried:

    sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))

but that just returns RDD[Row]. How do I combine them to get a SchemaRDD?

thanks
Daniel
Re: union of SchemaRDDs
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results.

Matei

On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote:
> I would like to combine 2 parquet tables I have created. I tried:
> sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))
> but that just returns RDD[Row]. How do I combine them to get a SchemaRDD?
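For reference, a minimal sketch of the suggestion above, assuming the Spark 1.1-era API (SQLContext.parquetFile and SchemaRDD.unionAll). The object name UnionParquet, the helper unionParquet, and the use of command-line arguments for the two paths are illustrative choices, not something from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SchemaRDD}

object UnionParquet {
  // unionAll is declared on SchemaRDD itself, so the static type of the
  // result is SchemaRDD rather than the plain RDD[Row] that sc.union gives.
  def unionParquet(sqx: SQLContext, fileA: String, fileB: String): SchemaRDD =
    sqx.parquetFile(fileA).unionAll(sqx.parquetFile(fileB))

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; args(0) and args(1) are paths to parquet files.
    val sc = new SparkContext(
      new SparkConf().setAppName("union-parquet").setMaster("local[*]"))
    val sqx = new SQLContext(sc)

    val combined = unionParquet(sqx, args(0), args(1))
    combined.printSchema() // the schema is still available on the result

    sc.stop()
  }
}
```

Like SQL's UNION ALL, this keeps duplicate rows; the sketch assumes both files have the same schema.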
Re: union of SchemaRDDs
Thanks Matei. What does unionAll do if the input RDD schemas are not 100% compatible? Does it take the union of the columns and generalize the types?

thanks
Daniel

On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
> Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results.
Re: union of SchemaRDDs
It does generalize types, but it seems to do so only on the intersection of the columns. There might be a way to get the union of the columns too using HiveQL. Types generalize upward, with string being the most general.

Matei

On Nov 1, 2014, at 6:22 PM, Daniel Mahler dmah...@gmail.com wrote:
> Thanks Matei. What does unionAll do if the input RDD schemas are not 100% compatible? Does it take the union of the columns and generalize the types?
Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?
> * unionAll preserves duplicates v/s union, which does not

This is true. If you want to eliminate duplicate items, you should follow the union with a distinct().

> * SQL union and unionAll result in the same output format, i.e. another SQL, v/s different RDD types here.
> * Understand the existing union contract issue. This may be a class hierarchy discussion for SchemaRDD, UnionRDD etc. ?

This is unfortunately going to be a limitation of the query DSL, since it extends standard RDDs. It is not possible for us to return specialized types from functions that are already defined in RDD (such as union), because the base RDD class has a very opaque notion of schema, and at this point the API for RDDs is very fixed. If you use SQL, however, you will always get back SchemaRDDs.
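The limitation described above can be illustrated with a toy model in plain Scala. This is only a sketch: BaseRDD, WithSchema and Row here are simplified stand-ins invented for illustration, not Spark's actual classes, and the real hierarchy is more involved.

```scala
object UnionSketch {
  // Stand-in for the generic RDD: it knows nothing about columns.
  class BaseRDD[T](val items: Seq[T]) {
    // Declared once on the base class and typed to return the base class.
    // `other` may be any BaseRDD[T], so there is no schema to carry over.
    def union(other: BaseRDD[T]): BaseRDD[T] =
      new BaseRDD(items ++ other.items)
  }

  final case class Row(values: Seq[Any])

  // Stand-in for a schema-aware RDD of rows.
  class WithSchema(rows: Seq[Row], val columns: Seq[String])
      extends BaseRDD[Row](rows) {
    // A separately named method can require a schema-aware argument
    // and return the specialized type.
    def unionAll(other: WithSchema): WithSchema =
      new WithSchema(items ++ other.items, columns)
  }

  val a = new WithSchema(Seq(Row(Seq("x", 1))), Seq("name", "age"))
  val b = new WithSchema(Seq(Row(Seq("y", 2))), Seq("name", "age"))

  // Inherited union: the static type is BaseRDD[Row]; `columns` is not reachable.
  val plain: BaseRDD[Row] = a.union(b)

  // Schema-aware unionAll: `columns` is still available on the result.
  val typed: WithSchema = a.unionAll(b)
}
```

In this model, plain.columns does not compile while typed.columns does. The same pattern shows up in the thread: orderBy is available after unionAll but not after union.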
Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?
Hi,

I am trying SparkSQL based on the example in the docs ...

    val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
    val olderThanTeans = people.where('age > 19)
    val youngerThanTeans = people.where('age < 13)
    val nonTeans = youngerThanTeans.union(olderThanTeans)

I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on the third. nonTeans is a UnionRDD, which does not support orderBy. This seems different from the SQL behavior, where the union of 2 SQL results is itself a SQL result with the same functionality ...

It is not clear why the union of 2 SchemaRDDs does not produce a SchemaRDD.

Thanks,