Hi Ian,

If I understand what you're after, you might find "zip" useful. From the docs:
Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* and the *same number of elements in each partition* (e.g. one was made through a map on the other).

Here's a toy example:

    >> val rdd1 = sc.parallelize(Array("name1", "name2", "name3"), 3)
    >> val rdd2 = sc.parallelize(Array("sign1", "sign2", "sign3"), 3)

    >> rdd1.collect()
    Array[String] = Array(name1, name2, name3)

    >> rdd2.collect()
    Array[String] = Array(sign1, sign2, sign3)

    >> rdd1.zip(rdd2).collect()
    Array[(String, String)] = Array((name1,sign1), (name2,sign2), (name3,sign3))

In your case, you might have the first two RDDs calculated from some common raw data through a map.

-- Jeremy

---------------------
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab

On Apr 19, 2014, at 12:59 AM, Ian Ferreira <ianferre...@hotmail.com> wrote:

> This may seem contrived but, suppose I wanted to create a collection of
> "single column" RDD's that contain calculated values, so I want to cache
> these to avoid re-calc.
>
> i.e.
>
> rdd1 = {Names]
> rdd2 = {Star Sign}
> rdd3 = {Age}
>
> Then I want to create a new virtual RDD that is a collection of these RDD's
> to create a "multi-column" RDD
>
> rddA = {Names, Age}
> rddB = {Names, Star Sign}
>
> I saw that rdd.union() merges rows, but anything that can combine columns?
>
> Cheers
> - Ian
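
P.S. To make the zip suggestion concrete for the Names / Star Sign / Age case, here is a minimal sketch of how it might look. Everything in it beyond zip, map, cache and parallelize is an assumption for illustration: the raw records, the object name ZipColumns, the app name, and the local[2] master are all made up, and your real data source will differ.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipColumns {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration.
    val conf = new SparkConf().setAppName("zip-columns").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical raw records: (name, star sign, age).
    val raw = sc.parallelize(Seq(
      ("name1", "sign1", 31),
      ("name2", "sign2", 42),
      ("name3", "sign3", 27)
    ), 3).cache()

    // "Single column" RDDs, each derived from the same parent by a map,
    // so they share the partition count and per-partition element counts
    // that zip assumes. Cached to avoid recomputation.
    val names = raw.map(_._1).cache()
    val signs = raw.map(_._2).cache()
    val ages  = raw.map(_._3).cache()

    // "Multi-column" RDDs built by pairing columns element by element.
    val rddA = names.zip(ages)   // RDD[(String, Int)]
    val rddB = names.zip(signs)  // RDD[(String, String)]

    rddA.collect().foreach(println)
    rddB.collect().foreach(println)

    sc.stop()
  }
}
```

The columns line up here only because each one comes from the same parent through a map. If one column were filtered or repartitioned on its own, the per-partition counts would no longer match and zip would not be safe to use; joining on an explicit key would be the more robust choice in that situation.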