We use partitionBy a lot to keep multiple datasets co-partitioned before caching. It works well.
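A minimal, self-contained sketch of that pattern (assuming Spark on the classpath; local mode, the keys, and the partition count are made up for illustration):

```scala
import org.apache.spark.{SparkContext, HashPartitioner}

// Local mode just for illustration; in a real job the context already exists.
val sc = new SparkContext("local[2]", "copartition-demo")

// One shared partitioner object: pair RDDs partitioned by the SAME partitioner
// are co-partitioned, so joins between them need no extra shuffle.
val part = new HashPartitioner(4)

val a = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k3", 3)))
  .partitionBy(part).cache()
val b = sc.parallelize(Seq(("k1", "x"), ("k2", "y")))
  .partitionBy(part).cache()

val joined = a.join(b)          // narrow dependency: no shuffle
val result = joined.collect().toMap
```

Because `a` and `b` report the same partitioner, the join is planned as a narrow dependency; only keys present on both sides survive, here `k1` and `k2`.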
On Sat, Nov 16, 2013 at 5:10 AM, guojc <guoj...@gmail.com> wrote:

> After looking at the API more carefully, I just found that I had overlooked
> the partitionBy function on PairRDDFunctions. It's the function I need.
> Sorry for the confusion.
>
> Best Regards,
> Jiacheng Guo
>
> On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen <c...@adatao.com> wrote:
>
>> Jiacheng, if you're OK with using the Shark layer above Spark (and I
>> think for many use cases the answer would be "yes"), then you can take
>> advantage of Shark's co-partitioning. Or do something like
>> https://github.com/amplab/shark/pull/100/commits
>>
>> Sent while mobile. Pls excuse typos etc.
>>
>> On Nov 16, 2013 2:48 AM, "guojc" <guoj...@gmail.com> wrote:
>>
>>> Hi Meisam,
>>> What I want to achieve here is a bit tricky. Basically, I'm trying to
>>> implement a per-split semi-join in an iterative algorithm with Spark.
>>> It's a very efficient join strategy for highly imbalanced data sets and
>>> provides a huge gain over a normal join in that situation.
>>>
>>> Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
>>> loaded directly from HDFS, so neither of them has a partitioner. X is a
>>> large, complicated struct that contains a set of join keys of type Y.
>>> First, for each partition of a, I extract the join keys Y from every
>>> instance of X in that partition, building a new RDD of keys and
>>> partition IDs, c: RDD[(Y, PartitionID)]. I join c with b on Y, then
>>> construct d: RDD[(PartitionID, Map[Y, Z])] by grouping on PartitionID
>>> and building a map from Y to Z. Next I repartition a by its partition
>>> IDs, which gives e: RDD[(PartitionID, X)]. As d and e have the same
>>> partitioner and the same keys, they can be joined very efficiently.
>>>
>>> The key ability I want here is to cache RDD c with the same partitioner
>>> as RDD b, and to cache e.
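The c/d/e pipeline described above can be sketched as follows. This is a hedged illustration, not the poster's code: the type `X`, the sample data, and the local context are made up, and `mapPartitionsWithIndex` is used to tag records with their partition id.

```scala
import org.apache.spark.{SparkContext, HashPartitioner}

// Concrete stand-ins for the poster's types: X is a record carrying several
// join keys Y (String here); b maps Y to Z (Int here). Names are illustrative.
case class X(id: Int, keys: Seq[String])

val sc = new SparkContext("local[2]", "per-split-semijoin")

val a = sc.parallelize(Seq(X(1, Seq("k1", "k2")), X(2, Seq("k3"))), 2)
val b = sc.parallelize(Seq(("k1", 10), ("k2", 20), ("k3", 30), ("k9", 99)))

val part = new HashPartitioner(a.partitions.length)

// c: the join keys found in each partition of `a`, tagged with partition id.
val c = a.mapPartitionsWithIndex { (pid, iter) =>
  iter.flatMap(x => x.keys.map(y => (y, pid)))
}

// d: (PartitionID, Map[Y, Z]); only the Z values some partition actually needs.
val d = c.join(b)
  .map { case (y, (pid, z)) => (pid, (y, z)) }
  .groupByKey(part)
  .mapValues(_.toMap)

// e: `a` keyed by its own partition id, co-partitioned with `d`.
val e = a.mapPartitionsWithIndex { (pid, iter) => iter.map(x => (pid, x)) }
  .partitionBy(part)

// Co-partitioned join: each X meets exactly the Map[Y, Z] for its partition.
val joined = e.join(d).collect()
```

Only the `c.join(b)` step and the `groupByKey` shuffle the small key/value data; the large `a` side moves once at most, which is the point of the per-split semi-join.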
>>> Later joins with b and d will then be efficient, because the value of b
>>> will be updated from time to time and d's content will change
>>> accordingly. It would also be nice to be able to repartition a by its
>>> original partition ID without actually shuffling across the network.
>>>
>>> You can refer to
>>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
>>> for the per-split semi-join's details.
>>>
>>> Best Regards,
>>> Jiacheng Guo
>>>
>>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>
>>>> Hi guojc,
>>>>
>>>> It is not clear to me what problem you are trying to solve. What do
>>>> you want to do with the result of your
>>>> groupByKey(myPartitioner).flatMapValues(x => x)? Do you want to use it
>>>> in a join? Do you want to save it to your file system? Or do you want
>>>> to do something else with it?
>>>>
>>>> Thanks,
>>>> Meisam
>>>>
>>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <guoj...@gmail.com> wrote:
>>>> > Hi Meisam,
>>>> > Thank you for the response. I know each RDD has a partitioner. What I
>>>> > want to achieve here is to re-partition a piece of data according to
>>>> > my custom partitioner. Currently I do that with
>>>> > groupByKey(myPartitioner).flatMapValues(x => x), but I'm a bit
>>>> > worried whether this creates an additional temporary object
>>>> > collection, as the result is first made into a Seq and then a
>>>> > collection of tuples. Any suggestion?
>>>> >
>>>> > Best Regards,
>>>> > Jiacheng Guo
>>>> >
>>>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>> >>
>>>> >> Hi Jiacheng,
>>>> >>
>>>> >> Each RDD has a partitioner. You can define your own partitioner if
>>>> >> the default partitioner does not suit your purpose. You can take a
>>>> >> look at
>>>> >> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf.
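The two routes discussed above, the `groupByKey(myPartitioner).flatMapValues(x => x)` workaround and `partitionBy` (which the thread converges on), can be compared directly. A sketch, with made-up data; the point is that `partitionBy` moves each tuple straight to its target partition without building per-key collections first:

```scala
import org.apache.spark.{SparkContext, HashPartitioner}

val sc = new SparkContext("local[2]", "repartition-demo")
val part = new HashPartitioner(4)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Workaround: groups every key's values into one collection, then flattens
// it back out. Correct, but materializes an intermediate Seq per key.
val viaGroup = pairs.groupByKey(part).flatMapValues(x => x)

// Direct route: each tuple is shuffled straight to its target partition,
// with no intermediate per-key collection.
val viaPartitionBy = pairs.partitionBy(part)
```

Both results carry the same partitioner and the same tuples, so downstream co-partitioned joins behave identically; only the intermediate memory footprint differs.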
>>>> >>
>>>> >> Thanks,
>>>> >> Meisam
>>>> >>
>>>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <guoj...@gmail.com> wrote:
>>>> >> > Hi,
>>>> >> > I'm wondering whether a Spark RDD can have a partitionedByKey
>>>> >> > function? The use of this function would be to have an RDD
>>>> >> > distributed according to a certain partitioner and cached, so
>>>> >> > that further joins between RDDs with the same partitioner get a
>>>> >> > great speed-up. Currently, we only have a groupByKey function,
>>>> >> > which generates a Seq of the desired type and is not very
>>>> >> > convenient.
>>>> >> >
>>>> >> > Btw, sorry for the last empty-body email. I mistakenly hit the
>>>> >> > send shortcut.
>>>> >> >
>>>> >> > Best Regards,
>>>> >> > Jiacheng Guo
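For reference, defining a custom partitioner to feed into `partitionBy(...).cache()`, as suggested in the thread, looks roughly like this. A sketch only: `ModPartitioner` is a made-up name, and its modulo rule is essentially what Spark's built-in `HashPartitioner` already does.

```scala
import org.apache.spark.Partitioner

// A minimal custom Partitioner: places each key by its non-negative hashCode
// modulo the partition count.
class ModPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = {
    val m = key.hashCode % parts
    if (m < 0) m + parts else m   // hashCode can be negative in Scala/Java
  }
  // Spark compares partitioners for equality to decide whether a join can
  // skip the shuffle, so equals and hashCode must be meaningful.
  override def equals(other: Any): Boolean = other match {
    case p: ModPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}
```

Something like `pairs.partitionBy(new ModPartitioner(8)).cache()` then gives the cached, deterministically placed RDD that the original question asks for under the name partitionedByKey.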