I believe reduceByKeyLocally was introduced for this purpose.
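For reference, reduceByKeyLocally merges the values for each key and returns the result to the driver as a Map, so no shuffled RDD is produced. A minimal sketch in the shell (the sample data is mine, not from the thread):

// Per-key reduction whose result comes back as a local Map on the driver.
val pairs = sc.parallelize(Seq((0, 1), (0, 1), (1, 1), (1, 1)), 2)
val counts: scala.collection.Map[Int, Int] = pairs.reduceByKeyLocally(_ + _)
// counts: Map(0 -> 2, 1 -> 2)

Note that this collects the merged map to the driver, so it only suits result sets that fit in driver memory.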
On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi Rajat,
>
> My quick test has shown that groupBy will preserve the partitions:
>
> scala> sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex {
>          case (idx, iter) =>
>            val s = iter.toSeq
>            println(idx + " with " + s.size + " elements: " + s)
>            s.toIterator
>        }.groupBy(_._1).mapPartitionsWithIndex {
>          case (idx, iter) =>
>            val s = iter.toSeq
>            println(idx + " with " + s.size + " elements: " + s)
>            s.toIterator
>        }.collect
>
> 1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
> 0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))
>
> 0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1))))
> 1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1))))
>
> Am I missing anything?
>
> Regards,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ | http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
> On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar <rajatkumar10...@gmail.com> wrote:
> > Hi,
> >
> > I have a JavaPairRDD<K,V> rdd1. I want to group rdd1 by key, but preserve
> > the partitions of the original RDD to avoid a shuffle, since I know all
> > records with the same key are already in the same partition.
> >
> > The PairRDD is constructed with a Kafka streaming low-level consumer,
> > which places all records with the same key in the same partition. Can I
> > group them together while avoiding a shuffle?
> >
> > Thanks

--
Best Regards,
Ayan Guha
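PS: For the original question (grouping co-partitioned keys without a shuffle), if every occurrence of a key really is confined to a single partition, the grouping can also be done entirely inside each partition with mapPartitions. A minimal sketch, using an illustrative pairs RDD rather than the Kafka-backed one from the thread:

// Group values per key within each partition; no shuffle stage is created.
// Correct only if a key never appears in more than one partition.
val pairs = sc.parallelize(Seq((0, 1), (0, 1), (1, 1), (1, 1)), 2)
val grouped = pairs.mapPartitions({ iter =>
  iter.toList.groupBy(_._1).iterator.map { case (k, kvs) => (k, kvs.map(_._2)) }
}, preservesPartitioning = true)  // keys are unchanged, so keep any partitioner
// grouped.collect => Array((0,List(1, 1)), (1,List(1, 1)))

Unlike groupBy, this never builds a shuffled stage, but it silently produces split groups if the co-location assumption is ever violated.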