In fact, co-partitioning was one of the main reasons we started using Spark; in MapReduce it's a giant pain to implement.
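The partitionBy-before-caching pattern discussed below is worth spelling out. Spark can't be imported in a standalone snippet, so this is a minimal plain-Scala sketch that simulates the effect with collections: two datasets assigned partitions by the same hash rule can be joined partition-by-partition with no shuffle. The datasets, key type, and partition count are made up for illustration; in real Spark code this corresponds to `a.partitionBy(p).cache()` and `b.partitionBy(p).cache()` with the same Partitioner instance `p`.

```scala
// HashPartitioner-style assignment: key -> partition index,
// using the same non-negative-mod rule Spark's HashPartitioner uses.
def partitionFor(key: Int, numPartitions: Int): Int = {
  val mod = key.hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

val numPartitions = 4

// Two toy datasets keyed the same way (hypothetical contents).
val a = Seq(1 -> "a1", 2 -> "a2", 3 -> "a3")
val b = Seq(1 -> "b1", 2 -> "b2", 3 -> "b3")

// "Partition" both sides with the same partitioner.
def partitioned[V](data: Seq[(Int, V)]): Map[Int, Seq[(Int, V)]] =
  data.groupBy { case (k, _) => partitionFor(k, numPartitions) }

val aParts = partitioned(a)
val bParts = partitioned(b)

// Because both sides used the same partitioner, a given key lives in the
// same partition index on both sides, so the join is narrow: each
// partition of a only meets the matching partition of b, no shuffle.
val joined: Seq[(Int, (String, String))] =
  (0 until numPartitions).flatMap { p =>
    val left  = aParts.getOrElse(p, Seq.empty).toMap
    val right = bParts.getOrElse(p, Seq.empty).toMap
    left.keySet.intersect(right.keySet).toSeq.map(k => k -> (left(k), right(k)))
  }
```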
On Sat, Nov 16, 2013 at 3:05 PM, Koert Kuipers <ko...@tresata.com> wrote:
> we use partitionBy a lot to keep multiple datasets co-partitioned before caching.
> it works well.
>
> On Sat, Nov 16, 2013 at 5:10 AM, guojc <guoj...@gmail.com> wrote:
>> After looking at the API more carefully, I just found that I had overlooked the partitionBy function on PairRDDFunctions. It's the function I need. Sorry for the confusion.
>>
>> Best Regards,
>> Jiacheng Guo
>>
>> On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen <c...@adatao.com> wrote:
>>> Jiacheng, if you're OK with using the Shark layer above Spark (and I think for many use cases the answer would be "yes"), then you can take advantage of Shark's co-partitioning. Or do something like https://github.com/amplab/shark/pull/100/commits
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>>
>>> On Nov 16, 2013 2:48 AM, "guojc" <guoj...@gmail.com> wrote:
>>>> Hi Meisam,
>>>> What I want to achieve here is a bit tricky. Basically, I'm trying to implement a per-split semi-join in an iterative algorithm with Spark. It's a very efficient join strategy for highly imbalanced data sets and provides a huge gain over a normal join in that situation.
>>>>
>>>> Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both loaded directly from HDFS, so neither of them has a partitioner (their partitioner is None). X is a large, complicated struct containing a set of join keys of type Y. First, for each partition of a, I extract the join keys Y from every instance of X in that partition and construct a set of (join key, partition ID) pairs. Now I have a new RDD c: RDD[(Y, PartitionID)]. I join it with b on Y and then construct an RDD d: RDD[(PartitionID, Map[Y, Z])] by grouping on PartitionID and building a map from Y to Z. As for a, I want to repartition it according to its partition ID, so that it becomes an RDD e: RDD[(PartitionID, X)].
>>>> Since d and e will have the same partitioner and the same key, they will be joined very efficiently.
>>>>
>>>> The key ability I want here is to cache RDD c with the same partitioner as RDD b, and to cache e. Later joins with b and d will then be efficient, because the value of b will be updated from time to time and d's content will change accordingly. It would also be nice to be able to repartition a by its original partition ID without actually shuffling across the network.
>>>>
>>>> You can refer to http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf for the per-split semi-join's details.
>>>>
>>>> Best Regards,
>>>> Jiacheng Guo
>>>>
>>>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>>> Hi guojc,
>>>>>
>>>>> It is not clear to me what problem you are trying to solve. What do you want to do with the result of your groupByKey(myPartitioner).flatMapValues(x => x)? Do you want to use it in a join? Do you want to save it to your file system? Or do you want to do something else with it?
>>>>>
>>>>> Thanks,
>>>>> Meisam
>>>>>
>>>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <guoj...@gmail.com> wrote:
>>>>> > Hi Meisam,
>>>>> > Thank you for the response. I know each RDD has a partitioner. What I want to achieve here is to re-partition a piece of data according to my custom partitioner. Currently I do that by groupByKey(myPartitioner).flatMapValues(x => x), but I'm a bit worried whether this will create additional temporary object collections, as the result is first made into a Seq and then into a collection of tuples. Any suggestion?
>>>>> >
>>>>> > Best Regards,
>>>>> > Jiacheng Guo
>>>>> >
>>>>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>>> >> Hi Jiacheng,
>>>>> >>
>>>>> >> Each RDD has a partitioner.
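Jiacheng's per-split semi-join steps (build c from a's partitions, join c with b, group into d) can be traced end-to-end with plain Scala collections. The sketch below simulates only the data flow, with made-up toy data and key/value types; it is not Spark code. In Spark, c would typically be built with mapPartitionsWithIndex, the c-b join with join, and d with a groupByKey on the partition ID.

```scala
// a: partitions of "X" records, each record carrying a set of join keys Y
// (here Y = String, Z = Int, and the partition contents are illustrative).
case class X(keys: Set[String])
val aPartitions: Map[Int, Seq[X]] = Map(
  0 -> Seq(X(Set("k1", "k2"))),
  1 -> Seq(X(Set("k2", "k3")))
)

// b: the (Y, Z) side of the join.
val b: Map[String, Int] = Map("k1" -> 10, "k2" -> 20, "k3" -> 30)

// Step 1: c = (join key, partition ID), extracted per partition of a.
val c: Seq[(String, Int)] =
  aPartitions.toSeq.flatMap { case (pid, xs) =>
    xs.flatMap(_.keys).distinct.map(y => (y, pid))
  }

// Step 2: join c with b on Y, giving (Y, (PartitionID, Z)).
val cJoinB: Seq[(String, (Int, Int))] =
  c.collect { case (y, pid) if b.contains(y) => (y, (pid, b(y))) }

// Step 3: d = group by partition ID into (PartitionID, Map[Y, Z]).
val d: Map[Int, Map[String, Int]] =
  cJoinB
    .groupBy { case (_, (pid, _)) => pid }
    .map { case (pid, rows) =>
      pid -> rows.map { case (y, (_, z)) => y -> z }.toMap
    }
// Each partition of a now receives only the (Y, Z) pairs it references.
```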
>>>>> >> You can define your own partitioner if the default partitioner does not suit your purpose.
>>>>> >> You can take a look at http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf.
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Meisam
>>>>> >>
>>>>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <guoj...@gmail.com> wrote:
>>>>> >> > Hi,
>>>>> >> > I'm wondering whether a Spark RDD could have a partitionedByKey function? The use of this function would be to distribute an RDD according to a certain partitioner and cache it. Further joins between RDDs with the same partitioner would then get a great speedup. Currently, we only have a groupByKey function, which generates a Seq of the desired type and is not very convenient.
>>>>> >> >
>>>>> >> > Btw, sorry for the last empty-body email. I mistakenly hit the send shortcut.
>>>>> >> >
>>>>> >> > Best Regards,
>>>>> >> > Jiacheng Guo
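For reference, defining a custom partitioner in Spark means extending org.apache.spark.Partitioner, whose contract is just numPartitions and getPartition(key: Any): Int. The sketch below mirrors that contract with a local trait so it stays self-contained without a Spark dependency; MappedPartitioner and its key-to-partition assignment are hypothetical names invented for this example. Re-partitioning with rdd.partitionBy(p) against such a partitioner avoids the temporary Seq that the groupByKey(myPartitioner).flatMapValues(x => x) workaround materializes.

```scala
// Local stand-in for Spark's Partitioner contract
// (in real code: extend org.apache.spark.Partitioner instead).
trait SimplePartitioner {
  def numPartitions: Int
  def getPartition(key: Any): Int  // must return a value in [0, numPartitions)
}

// A custom partitioner that routes keys by a caller-supplied mapping --
// e.g. the original partition IDs from the per-split semi-join above.
class MappedPartitioner(assignment: Map[Int, Int], val numPartitions: Int)
    extends SimplePartitioner {
  def getPartition(key: Any): Int = key match {
    case k: Int => assignment.getOrElse(k, 0) // unknown keys fall back to 0
    case _      => 0
  }
}

// Keys 1 -> partition 0; keys 2 and 3 -> partition 1.
val p = new MappedPartitioner(Map(1 -> 0, 2 -> 1, 3 -> 1), numPartitions = 2)
```

With a real Spark Partitioner, `rdd.partitionBy(p).cache()` would then keep the data resident with that layout, so later joins against RDDs sharing `p` need no shuffle.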