in fact, co-partitioning was one of the main reasons we started using Spark;
in MapReduce it's a giant pain to implement.


On Sat, Nov 16, 2013 at 3:05 PM, Koert Kuipers <ko...@tresata.com> wrote:

> we use partitionBy a lot to keep multiple datasets co-partitioned before
> caching.
> it works well.
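>
> for example, something like this (leftPairs/rightPairs are made-up
> placeholders; partitionBy, cache and join are the actual Spark calls):
>
>   import org.apache.spark.HashPartitioner
>
>   val p = new HashPartitioner(128)
>   // both sides get the same partitioner and stay in memory
>   val left  = leftPairs.partitionBy(p).cache()    // RDD[(K, V)]
>   val right = rightPairs.partitionBy(p).cache()   // RDD[(K, W)]
>   // identical partitioners on both sides, so the join needs no shuffle
>   val joined = left.join(right)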
>
>
> On Sat, Nov 16, 2013 at 5:10 AM, guojc <guoj...@gmail.com> wrote:
>
>> After looking at the API more carefully, I just found I had overlooked the
>> partitionBy function on PairRDDFunctions. It's the function I need. Sorry
>> for the confusion.
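>>
>> For the record, the difference (pairs being a hypothetical RDD[(K, V)] and
>> myPartitioner my custom partitioner):
>>
>>   // what I was doing: builds an intermediate Seq per key, then flattens
>>   val viaGroup = pairs.groupByKey(myPartitioner).flatMapValues(x => x)
>>
>>   // partitionBy: just moves each record to its target partition
>>   val direct = pairs.partitionBy(myPartitioner)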
>>
>> Best Regards,
>> Jiacheng Guo
>>
>>
>>> On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen <c...@adatao.com> wrote:
>>
>>> Jiacheng, if you're OK with using the Shark layer above Spark (and I
>>> think for many use cases the answer would be "yes"), then you can take
>>> advantage of Shark's co-partitioning. Or do something like
>>> https://github.com/amplab/shark/pull/100/commits
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Nov 16, 2013 2:48 AM, "guojc" <guoj...@gmail.com> wrote:
>>>
>>>> Hi Meisam,
>>>>      What I want to achieve here is a bit tricky. Basically, I'm trying
>>>> to implement PerSplit SemiJoin in an iterative algorithm with Spark. It
>>>> is a very efficient join strategy for highly imbalanced data sets and
>>>> provides a huge gain over a normal join in that situation.
>>>>
>>>>      Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)],
>>>> both loaded directly from HDFS, so neither of them has a partitioner.
>>>> X is a large, complicated struct that contains a set of join keys of
>>>> type Y. First, for each partition of a, I extract the join key Y from
>>>> every instance of X in that partition and construct a hash set of join
>>>> keys paired with the partition ID. Now I have a new RDD
>>>> c: RDD[(Y, PartitionID)]. I join it with b on Y and then construct an
>>>> RDD d: RDD[(PartitionID, Map[Y, Z])] by grouping on PartitionID and
>>>> building a map from Y to Z. As for a, I want to re-key it by its
>>>> partition ID, so it becomes an RDD e: RDD[(PartitionID, X)]. Since d
>>>> and e will have the same partitioner and the same keys, they will be
>>>> joined very efficiently.
>>>>
>>>>     The key ability I want here is to cache RDD c with the same
>>>> partitioner as RDD b, and to cache e, so that later joins with b and d
>>>> will be efficient, because the values of b will be updated from time to
>>>> time and d's content will change accordingly. It would also be nice to
>>>> be able to repartition a by its original partition ID without actually
>>>> shuffling across the network.
>>>>
>>>> You can refer to
>>>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
>>>> for PerSplit SemiJoin's details.
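>>>>
>>>> Roughly, with made-up toy types (records of a are lines of text whose
>>>> tokens are the join keys; sc is a live SparkContext), the flow looks
>>>> like this:
>>>>
>>>>   import org.apache.spark.HashPartitioner
>>>>
>>>>   val a = sc.textFile("hdfs:///data/a")
>>>>   val b = sc.textFile("hdfs:///data/b")
>>>>             .map { line => val Array(y, z) = line.split("\t", 2); (y, z) }
>>>>   val part = new HashPartitioner(a.partitions.length)
>>>>
>>>>   // c: (join key Y, originating partition ID of a);
>>>>   // distinct mirrors the per-partition hash set of keys
>>>>   val c = a.mapPartitionsWithIndex { (pid, iter) =>
>>>>     iter.flatMap(line => line.split(" ").distinct.map(y => (y, pid)))
>>>>   }
>>>>
>>>>   // d: (partition ID, Map[Y, Z]) -- matched pairs grouped back to the
>>>>   // partition of a they came from
>>>>   val d = c.join(b)
>>>>            .map { case (y, (pid, z)) => (pid, (y, z)) }
>>>>            .groupByKey(part)
>>>>            .mapValues(_.toMap)
>>>>
>>>>   // e: a keyed by its own partition ID, co-partitioned with d, cached
>>>>   val e = a.mapPartitionsWithIndex((pid, iter) => iter.map(x => (pid, x)))
>>>>            .partitionBy(part)
>>>>            .cache()
>>>>
>>>>   // d and e share part, so this join does not shuffle e
>>>>   val joined = e.join(d)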
>>>>
>>>> Best Regards,
>>>> Jiacheng Guo
>>>>
>>>>
>>>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>>
>>>>> Hi guojc,
>>>>>
>>>>> It is not clear to me what problem you are trying to solve. What do
>>>>> you want to do with the result of your
>>>>> groupByKey(myPartitioner).flatMapValues( x=>x)? Do you want to use it
>>>>> in a join? Do you want to save it to your file system? Or do you want
>>>>> to do something else with it?
>>>>>
>>>>> Thanks,
>>>>> Meisam
>>>>>
>>>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <guoj...@gmail.com> wrote:
>>>>> > Hi Meisam,
>>>>> >     Thank you for the response. I know each RDD has a partitioner.
>>>>> > What I want to achieve here is to re-partition a piece of data
>>>>> > according to my custom partitioner. Currently I do that by
>>>>> > groupByKey(myPartitioner).flatMapValues(x => x), but I'm a bit
>>>>> > worried whether this will create additional temporary object
>>>>> > collections, as the result is first made into a Seq and then into a
>>>>> > collection of tuples. Any suggestions?
>>>>> >
>>>>> > Best Regards,
>>>>> > Jiacheng Guo
>>>>> >
>>>>> >
>>>>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>>> >>
>>>>> >> Hi Jiacheng,
>>>>> >>
>>>>> >> Each RDD has a partitioner. You can define your own partitioner if
>>>>> >> the default partitioner does not suit your purpose.
>>>>> >> You can take a look at this:
>>>>> >> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
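>>>>> >>
>>>>> >> A bare-bones custom partitioner looks something like this
>>>>> >> (illustrative sketch only):
>>>>> >>
>>>>> >>   import org.apache.spark.Partitioner
>>>>> >>
>>>>> >>   class ModPartitioner(parts: Int) extends Partitioner {
>>>>> >>     def numPartitions: Int = parts
>>>>> >>     def getPartition(key: Any): Int = {
>>>>> >>       val mod = key.hashCode % parts
>>>>> >>       if (mod < 0) mod + parts else mod  // keep the index non-negative
>>>>> >>     }
>>>>> >>     // equality matters: Spark treats two RDDs as co-partitioned only
>>>>> >>     // when their partitioners compare equal
>>>>> >>     override def equals(other: Any): Boolean = other match {
>>>>> >>       case m: ModPartitioner => m.numPartitions == parts
>>>>> >>       case _                 => false
>>>>> >>     }
>>>>> >>   }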
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Meisam
>>>>> >>
>>>>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <guoj...@gmail.com> wrote:
>>>>> >> > Hi,
>>>>> >> >   I'm wondering whether a Spark RDD can have a partitionedByKey
>>>>> >> > function. The use of this function would be to have an RDD
>>>>> >> > distributed according to a certain partitioner and cached; further
>>>>> >> > joins between RDDs with the same partitioner would then get a great
>>>>> >> > speed-up. Currently, we only have groupByKey, which generates a Seq
>>>>> >> > of the desired type, which is not very convenient.
>>>>> >> >
>>>>> >> > Btw, sorry for the last empty-body email; I mistakenly hit the send
>>>>> >> > shortcut.
>>>>> >> >
>>>>> >> >
>>>>> >> > Best Regards,
>>>>> >> > Jiacheng Guo
>>>>> >
>>>>> >
>>>>>
>>>>
>>>>
>>
>
