Thanks Mayur - based on the doc-comments in the source, it looks like this will work for this case. I will confirm.
----
the dreamers of the day are dangerous men, for they may act their dream
with open eyes, and make it possible

On Fri, Mar 7, 2014 at 2:21 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> How about PartitionerAwareUnionRDD?
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Thu, Mar 6, 2014 at 9:42 AM, Evan Chan <e...@ooyala.com> wrote:
>
> > I would love to hear the answer to this as well.
> >
> > On Thu, Mar 6, 2014 at 4:09 AM, Manoj Awasthi <awasthi.ma...@gmail.com> wrote:
> > > Hi All,
> > >
> > > I have a three-machine cluster. I have two RDDs, each consisting of (K,V)
> > > pairs. The RDDs have just three keys: 'a', 'b' and 'c'.
> > >
> > > // list1 - List(('a',1), ('b',2), ....
> > > val rdd1 = sc.parallelize(list1).groupByKey(new HashPartitioner(3))
> > >
> > > // list2 - List(('a',2), ('b',7), ....
> > > val rdd2 = sc.parallelize(list2).groupByKey(new HashPartitioner(3))
> > >
> > > By using a HashPartitioner with 3 partitions, I can ensure that each of the
> > > keys ('a', 'b' and 'c') in each RDD gets partitioned onto a different
> > > machine in the cluster (based on the hashCode).
> > >
> > > The problem is that I cannot deterministically make the same allocation for
> > > the second RDD (all 'a's from rdd2 going to the same machine that the 'a's
> > > from the first RDD went to).
> > >
> > > Is there a way to achieve this?
> > >
> > > Manoj
> >
> > --
> > --
> > Evan Chan
> > Staff Engineer
> > e...@ooyala.com |
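For the archives, a minimal sketch of the shared-partitioner approach discussed above. Variable names and sample data are illustrative, not from the original thread; the key idea is to reuse one HashPartitioner instance (or at least partitioners that are equal) for both RDDs, so every key maps to the same partition index in each RDD:

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

object CopartitionSketch {
  def main(args: Array[String]): Unit = {
    // Local 3-core context purely for illustration.
    val sc = new SparkContext(
      new SparkConf().setAppName("copartition-sketch").setMaster("local[3]"))

    // Share a single partitioner across both RDDs.
    val part = new HashPartitioner(3)

    val list1 = List(('a', 1), ('b', 2), ('c', 3))
    val list2 = List(('a', 2), ('b', 7), ('c', 9))

    val rdd1 = sc.parallelize(list1).groupByKey(part)
    val rdd2 = sc.parallelize(list2).groupByKey(part)

    // Because rdd1 and rdd2 have equal partitioners, each key (e.g. 'a')
    // sits at the same partition index in both, so cogroup/join between
    // them is a narrow dependency and needs no shuffle.
    val joined = rdd1.cogroup(rdd2)

    joined.collect().foreach(println)
    sc.stop()
  }
}
```

Note the caveat: an equal partitioner fixes the partition *index* for each key, not the physical machine; Spark's scheduler then uses locality preferences to place corresponding partitions together where it can. PartitionerAwareUnionRDD, which Mayur mentions, is what `RDD.union` uses internally when all inputs share a partitioner, so the union preserves that partitioning instead of discarding it.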