Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-31 Thread Michael Armbrust

 * unionAll preserves duplicates, whereas union does not


This is true. If you want to eliminate duplicate items, you should follow
the union with a distinct().
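
As a minimal sketch of that suggestion on plain RDDs: the local-mode setup,
the sample data, and the names UnionThenDistinct and unionThenDistinct are
illustrative assumptions, not something from this thread. It assumes Spark
is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionThenDistinct {
  // Hypothetical helper: RDD.union keeps duplicates, so distinct() is applied afterwards.
  def unionThenDistinct(a: RDD[Int], b: RDD[Int]): RDD[Int] =
    a.union(b).distinct()

  def main(args: Array[String]): Unit = {
    // Local mode is assumed here purely so the sketch can run on one machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("union-then-distinct")
    val sc = new SparkContext(conf)
    try {
      val a = sc.parallelize(Seq(1, 2, 3))
      val b = sc.parallelize(Seq(3, 4))

      // Plain union: the value 3 appears twice.
      println(a.union(b).collect().sorted.mkString(", "))              // 1, 2, 3, 3, 4
      // Union followed by distinct: each value appears once.
      println(unionThenDistinct(a, b).collect().sorted.mkString(", ")) // 1, 2, 3, 4
    } finally {
      sc.stop()
    }
  }
}
```

Note that distinct() involves a shuffle, so it is only worth adding when
duplicates actually need to be removed.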


 * SQL union and unionAll produce the same output format (i.e. another SQL
 result), whereas here they produce different RDD types.

 * Understand the existing union contract issue. Could this be a class
 hierarchy discussion for SchemaRDD, UnionRDD, etc.?


This is unfortunately going to be a limitation of the query DSL, since it
extends standard RDDs. It is not possible for us to return specialized
types from functions that are already defined in RDD (such as union), as the
base RDD class has a very opaque notion of schema, and at this point the
API for RDDs is very fixed. If you use SQL, however, you will always get
back SchemaRDDs.
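
The typing constraint described here can be illustrated with a toy model in
plain Scala. This is only a sketch: ToyRDD, ToySchemaRDD and ToyUnionDemo
are hypothetical stand-ins invented for illustration, not Spark's actual
classes, and the toy unionAll is merely named after the operation mentioned
earlier in this thread.

```scala
// Toy model only: ToyRDD and ToySchemaRDD are hypothetical stand-ins,
// not Spark's actual RDD and SchemaRDD classes.
class ToyRDD[T](val rows: Seq[T]) {
  // Declared once on the base class. The argument may be any ToyRDD, possibly
  // one with no schema at all, so the declared result is the base type.
  def union(other: ToyRDD[T]): ToyRDD[T] = new ToyRDD(rows ++ other.rows)
}

class ToySchemaRDD(data: Seq[Map[String, Any]], val schema: Seq[String])
    extends ToyRDD[Map[String, Any]](data) {
  // A separately named, schema-aware operation can return the specialized type.
  def unionAll(other: ToySchemaRDD): ToySchemaRDD =
    new ToySchemaRDD(rows ++ other.rows, schema)
}

object ToyUnionDemo {
  def main(args: Array[String]): Unit = {
    val schema = Seq("name", "age")
    val young = new ToySchemaRDD(Seq(Map("name" -> "Ann", "age" -> 12)), schema)
    val old = new ToySchemaRDD(Seq(Map("name" -> "Bob", "age" -> 30)), schema)

    // The inherited method hands back the base type: the schema is gone.
    val viaUnion: ToyRDD[Map[String, Any]] = young.union(old)
    // The schema-aware method keeps the specialized type and its schema.
    val viaUnionAll: ToySchemaRDD = young.unionAll(old)

    println(viaUnion.isInstanceOf[ToySchemaRDD]) // false
    println(viaUnionAll.schema.mkString(", "))   // name, age
  }
}
```

In the toy model, anything that needs the schema (an orderBy on a column,
for example) could only be offered on the result of unionAll, which mirrors
the behaviour reported in the original question below.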


Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi,

I am trying Spark SQL based on the example in the docs ...



val people =
  sc.textFile("/data/spark/examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))


val olderThanTeans = people.where('age > 19)
val youngerThanTeans = people.where('age < 13)
val nonTeans = youngerThanTeans.union(olderThanTeans)

I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on
the third. nonTeans is a UnionRDD, which does not support orderBy. This
seems different from the SQL behavior, where the union of 2 SQL results is
itself a SQL result with the same functionality ...

It is not clear why the union of 2 SchemaRDDs does not produce a SchemaRDD.


Thanks,