We use partitionBy a lot to keep multiple datasets co-partitioned before
caching. It works well.
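
For example, here's a minimal sketch of the pattern (the paths, the
64-partition count, and the parse helpers are all placeholders):

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(64)

    // parseUser/parseEvent yield (key, value) pairs, so partitionBy is
    // available. Both RDDs end up with the identical partitioner, and a
    // later users.join(events) shuffles neither side again.
    val users  = sc.textFile("hdfs://.../users").map(parseUser).partitionBy(part).cache()
    val events = sc.textFile("hdfs://.../events").map(parseEvent).partitionBy(part).cache()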


On Sat, Nov 16, 2013 at 5:10 AM, guojc <guoj...@gmail.com> wrote:

> After looking at the API more carefully, I just found that I had overlooked
> the partitionBy function on PairRDDFunctions. It's the function I need.
> Sorry for the confusion.
>
> Best Regards,
> Jiacheng Guo
>
>
> On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen <c...@adatao.com> wrote:
>
>> Jiacheng, if you're OK with using the Shark layer above Spark (and I
>> think for many use cases the answer would be "yes"), then you can take
>> advantage of Shark's co-partitioning. Or do something like
>> https://github.com/amplab/shark/pull/100/commits
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 2:48 AM, "guojc" <guoj...@gmail.com> wrote:
>>
>>> Hi Meisam,
>>>      What I want to achieve here is a bit tricky. Basically, I'm trying to
>>> implement per-split semi-join in an iterative algorithm with Spark. It's a
>>> very efficient join strategy for highly imbalanced data sets and provides
>>> a huge gain over a normal join in that situation.
>>>
>>>      Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
>>> loaded directly from HDFS, so both of them will have a partitioner of
>>> None. X is a large, complicated struct containing a set of join keys of
>>> type Y. First, for each partition of a, I extract the join keys Y from
>>> every instance of X in that partition and construct a hash set of (join
>>> key Y, partition ID) pairs. Now I have a new RDD c: RDD[(Y, PartitionID)].
>>> I join it with b on Y and then construct an RDD
>>> d: RDD[(PartitionID, Map[Y, Z])] by grouping on PartitionID and building a
>>> map from Y to Z. For each partition of a, I want to repartition it
>>> according to its partition ID, so it becomes an RDD
>>> e: RDD[(PartitionID, X)]. As d and e will have the same partitioner and
>>> the same key, they will be joined very efficiently.
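>>>
>>> A rough sketch of the construction, just to make the types concrete
>>> (extractKeys stands in for my real key-extraction logic; untested):
>>>
>>>     import org.apache.spark.Partitioner
>>>
>>>     // Send each record keyed by a partition ID to exactly that partition.
>>>     val n = a.partitions.length
>>>     val byPid = new Partitioner {
>>>       def numPartitions = n
>>>       def getPartition(key: Any) = key.asInstanceOf[Int]
>>>     }
>>>
>>>     // c: for each partition of a, the set of join keys it references.
>>>     val c = a.mapPartitionsWithIndex { (pid, xs) =>
>>>       xs.flatMap(extractKeys).map(y => (y, pid))
>>>     }.distinct()
>>>
>>>     // d: ship only the matching (Y, Z) pairs to each referencing partition.
>>>     val d = c.join(b)
>>>              .map { case (y, (pid, z)) => (pid, (y, z)) }
>>>              .groupByKey(byPid)
>>>              .mapValues(_.toMap)
>>>
>>>     // e: a's records keyed by their original partition ID.
>>>     val e = a.mapPartitionsWithIndex((pid, xs) => xs.map(x => (pid, x)))
>>>              .partitionBy(byPid)
>>>
>>>     val joined = e.join(d)  // same partitioner on both sides: no extra shuffle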
>>>
>>>     The key ability I want here is to cache RDD c with the same
>>> partitioner as RDD b, and to cache e. Then the later joins with b and d
>>> will be efficient, because the value of b will be updated from time to
>>> time and d's content will change accordingly. It would also be nice to be
>>> able to repartition a by its original partition ID without actually
>>> shuffling across the network.
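>>>
>>> For the caching part, something like this is what I have in mind (again
>>> just a sketch):
>>>
>>>     import org.apache.spark.HashPartitioner
>>>
>>>     // Reuse b's partitioner when it has one; otherwise fall back to
>>>     // hashing over the same number of partitions.
>>>     val bPart = b.partitioner.getOrElse(new HashPartitioner(b.partitions.length))
>>>     val cCached = c.partitionBy(bPart).cache()  // c.join(b) then avoids re-shuffling c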
>>>
>>> You can refer to
>>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
>>> for the details of per-split semi-join.
>>>
>>> Best Regards,
>>> Jiacheng Guo
>>>
>>>
>>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>
>>>> Hi guojc,
>>>>
>>>> It is not clear to me what problem you are trying to solve. What do you
>>>> want to do with the result of your
>>>> groupByKey(myPartitioner).flatMapValues(x => x)? Do you want to use it in
>>>> a join? Do you want to save it to your file system? Or do you want to do
>>>> something else with it?
>>>>
>>>> Thanks,
>>>> Meisam
>>>>
>>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <guoj...@gmail.com> wrote:
>>>> > Hi Meisam,
>>>> >     Thank you for the response. I know each RDD has a partitioner. What
>>>> > I want to achieve here is to re-partition a piece of data according to
>>>> > my custom partitioner. Currently I do that by
>>>> > groupByKey(myPartitioner).flatMapValues(x => x). But I'm a bit worried
>>>> > whether this will create an additional temporary object collection, as
>>>> > the result is first made into a Seq and then into a collection of
>>>> > tuples. Any suggestion?
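>>>> >
>>>> > Concretely, the workaround looks like this (data and myPartitioner
>>>> > stand in for my actual RDD and custom Partitioner):
>>>> >
>>>> >     // Group by key with the custom partitioner, then flatten the
>>>> >     // grouped Seqs back out. The per-key intermediate Seq is the
>>>> >     // temporary collection I'm worried about.
>>>> >     val repartitioned = data.groupByKey(myPartitioner).flatMapValues(x => x)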
>>>> >
>>>> > Best Regards,
>>>> > Jiacheng Guo
>>>> >
>>>> >
>>>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <meisam.fa...@gmail.com> wrote:
>>>> >>
>>>> >> Hi Jiacheng,
>>>> >>
>>>> >> Each RDD has a partitioner. You can define your own partitioner if the
>>>> >> default partitioner does not suit your purpose. You can take a look at
>>>> >> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
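>>>> >>
>>>> >> For example, a custom partitioner is just a small subclass of
>>>> >> Partitioner (an illustrative sketch, not from the slides):
>>>> >>
>>>> >>     import org.apache.spark.Partitioner
>>>> >>
>>>> >>     // Routes integer keys to partitions with a simple modulo scheme.
>>>> >>     class MyPartitioner(val numPartitions: Int) extends Partitioner {
>>>> >>       def getPartition(key: Any): Int = {
>>>> >>         val k = key.asInstanceOf[Int]
>>>> >>         ((k % numPartitions) + numPartitions) % numPartitions  // non-negative
>>>> >>       }
>>>> >>       // Spark compares partitioners with equals() to decide whether two
>>>> >>       // RDDs are co-partitioned, so implement it meaningfully.
>>>> >>       override def equals(other: Any): Boolean = other match {
>>>> >>         case p: MyPartitioner => p.numPartitions == numPartitions
>>>> >>         case _ => false
>>>> >>       }
>>>> >>     }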
>>>> >>
>>>> >> Thanks,
>>>> >> Meisam
>>>> >>
>>>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <guoj...@gmail.com> wrote:
>>>> >> > Hi,
>>>> >> >   I'm wondering whether Spark RDDs could have a partitionedByKey
>>>> >> > function. The use of this function would be to have an RDD
>>>> >> > distributed according to a certain partitioner and then cache it.
>>>> >> > Further joins between RDDs with the same partitioner would then see
>>>> >> > a great speedup. Currently, we only have a groupByKey function,
>>>> >> > which generates a Seq of the desired type and is not very
>>>> >> > convenient.
>>>> >> >
>>>> >> > Btw, sorry for the last empty-body email. I mistakenly hit the send
>>>> >> > shortcut.
>>>> >> >
>>>> >> >
>>>> >> > Best Regards,
>>>> >> > Jiacheng Guo
>>>> >
>>>> >
>>>>
>>>
>>>
>
