I believe reduceByKeyLocally was introduced for this purpose.
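
It reduces the values for each key and returns the result as a plain Map
on the driver rather than as an RDD, so no shuffled output is produced.
A minimal sketch (the names and data here are just for illustration):

  // reduceByKeyLocally merges values per key within each partition and
  // returns a scala.collection.Map to the driver; nothing is shuffled.
  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), 2)
  val merged: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
  // merged: Map(a -> 3, b -> 3)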

On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi Rajat,
>
> My quick test has shown that groupBy will preserve the partitions:
>
> scala> sc.parallelize(Seq(0,0,0,0,1,1,1,1), 2).map((_, 1))
>   .mapPartitionsWithIndex { case (idx, iter) =>
>     val s = iter.toSeq; println(idx + " with " + s.size + " elements: " + s); s.toIterator }
>   .groupBy(_._1)
>   .mapPartitionsWithIndex { case (idx, iter) =>
>     val s = iter.toSeq; println(idx + " with " + s.size + " elements: " + s); s.toIterator }
>   .collect
>
> 1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
> 0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))
>
> 0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1))))
> 1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1))))
>
> Am I missing anything?
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar <rajatkumar10...@gmail.com>
> wrote:
> > Hi
> >
> > I have a JavaPairRDD<K,V> rdd1. I want to group rdd1 by key but preserve
> > the partitioning of the original RDD and avoid a shuffle, since I know
> > all records with the same key are already in the same partition.
> >
> > The PairRDD is constructed using a Kafka streaming low-level consumer,
> > which puts all records with the same key in the same partition. Can I
> > group them together while avoiding a shuffle?
> >
> > Thanks
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

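Regarding the original question: if every occurrence of a key really is
confined to a single partition, one shuffle-free option is to group the
records inside each partition with mapPartitions. A rough sketch, with a
made-up rdd1 standing in for the actual PairRDD:

  // Sketch only: groups per partition, so it is correct only when each
  // key occurs in exactly one partition, as described in the question.
  val rdd1 = sc.parallelize(Seq((0, "a"), (0, "b"), (1, "c"), (1, "d")), 2)
  val grouped = rdd1.mapPartitions({ iter =>
    iter.toSeq.groupBy(_._1).iterator.map { case (k, vs) => (k, vs.map(_._2)) }
  }, preservesPartitioning = true)
  // grouped is an RDD[(Int, Seq[String])] built without any shuffle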

-- 
Best Regards,
Ayan Guha
