Hi,

I have created a JIRA for this feature: https://issues.apache.org/jira/browse/SPARK-12524. Please vote for it if you find it necessary. I would like to implement this feature.
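In the meantime, a possible workaround is to group within each partition via mapPartitions. Here is a minimal sketch, assuming all records with the same key already live in the same partition and each partition's groups fit in memory (groupLocally is an illustrative name, not an existing API):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    // Group values by key within each partition, without a shuffle.
    // Only correct when same-key records are co-partitioned.
    def groupLocally(rdd: RDD[(String, Int)]): RDD[(String, Iterable[Int])] =
      rdd.mapPartitions({ iter =>
        val groups = mutable.Map.empty[String, mutable.ArrayBuffer[Int]]
        for ((k, v) <- iter)
          groups.getOrElseUpdate(k, mutable.ArrayBuffer.empty[Int]) += v
        // Each partition's groups are materialized in memory here,
        // much like groupByKey buffers values per key.
        groups.iterator.map { case (k, vs) => (k, vs: Iterable[Int]) }
      }, preservesPartitioning = true)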
Thanks
Shushant

On Wed, Dec 2, 2015 at 1:14 PM, Rajat Kumar <rajatkumar10...@gmail.com> wrote:

> What if I don't need to use an aggregate function, only groupByKeyLocally()
> and then a map transformation?
>
> Will reduceByKeyLocally help here? Or is there any workaround, given that
> groupByKey is not local and is global across all partitions?
>
> Thanks
>
> On Tue, Dec 1, 2015 at 5:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> I believe reduceByKeyLocally was introduced for this purpose.
>>
>> On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>>> Hi Rajat,
>>>
>>> My quick test has shown that groupBy will preserve the partitions:
>>>
>>> scala> sc.parallelize(Seq(0, 0, 0, 0, 1, 1, 1, 1), 2).
>>>          map((_, 1)).
>>>          mapPartitionsWithIndex { case (idx, iter) =>
>>>            val s = iter.toSeq
>>>            println(idx + " with " + s.size + " elements: " + s)
>>>            s.toIterator
>>>          }.
>>>          groupBy(_._1).
>>>          mapPartitionsWithIndex { case (idx, iter) =>
>>>            val s = iter.toSeq
>>>            println(idx + " with " + s.size + " elements: " + s)
>>>            s.toIterator
>>>          }.collect
>>>
>>> 1 with 4 elements: Stream((1,1), (1,1), (1,1), (1,1))
>>> 0 with 4 elements: Stream((0,1), (0,1), (0,1), (0,1))
>>>
>>> 0 with 1 elements: Stream((0,CompactBuffer((0,1), (0,1), (0,1), (0,1))))
>>> 1 with 1 elements: Stream((1,CompactBuffer((1,1), (1,1), (1,1), (1,1))))
>>>
>>> Am I missing anything?
>>>
>>> Regards,
>>> Jacek
>>>
>>> --
>>> Jacek Laskowski | https://medium.com/@jaceklaskowski/ | http://blog.jaceklaskowski.pl
>>> Mastering Spark: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
>>> Follow me at https://twitter.com/jaceklaskowski
>>> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>>>
>>> On Tue, Dec 1, 2015 at 2:46 AM, Rajat Kumar <rajatkumar10...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a JavaPairRDD<K,V> rdd1. I want to group rdd1 by key but preserve
>>>> the partitions of the original RDD to avoid a shuffle, since I know all
>>>> records with the same key are already in the same partition.
>>>>
>>>> The pair RDD is constructed using a Kafka streaming low-level consumer,
>>>> which places all records with the same key in the same partition. Can I
>>>> group them together while avoiding a shuffle?
>>>>
>>>> Thanks
>>
>> --
>> Best Regards,
>> Ayan Guha
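P.S. One note on the reduceByKeyLocally suggestion above: as far as I can tell, it reduces values per key but returns a Map on the driver rather than an RDD, so it does not cover the groupByKey-without-shuffle case. A quick illustration (made-up values):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), 2)

    // Merges values per key inside each partition, then combines the
    // per-partition maps on the driver -- the result is not an RDD.
    val merged: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
    // merged: Map(a -> 3, b -> 3)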