There is no way to avoid a shuffle if you use combineByKey, no matter
whether your data is cached in memory, because the shuffle write must
write the data to disk. And it seems that Spark cannot guarantee that a
given key (K1) always goes to the same container (Container_X).
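A minimal sketch in Scala of what that looks like (assuming a
SparkContext named `sc`; the path, key type, and partition count are
illustrative, not from your job):

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(100) // fixed partition count

    // Illustrative hourly pair RDD: RDD[(String, Long)]
    val hourly = sc.textFile("hdfs:///logs/hour=00")
      .map(line => (line.split(",")(0), 1L))

    // combineByKey performs a shuffle write unless the RDD is already
    // partitioned by the same partitioner.
    val combined = hourly.combineByKey(
      (v: Long) => v,                  // createCombiner
      (acc: Long, v: Long) => acc + v, // mergeValue
      (a: Long, b: Long) => a + b,     // mergeCombiners
      partitioner)

    // The key -> partition mapping of a HashPartitioner is
    // deterministic, but which executor (container) hosts a given
    // partition is decided by the scheduler and is not guaranteed
    // across jobs.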

You can use tmpfs for your shuffle dir; this can improve your shuffle
write speed.
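A sketch of wiring that up through SparkConf (the /dev/shm mount point
is an assumption; also note that on YARN the NodeManager's local dirs
take precedence over spark.local.dir):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("daily-combine")
      .set("spark.local.dir", "/dev/shm/spark-shuffle")
    val sc = new SparkContext(conf)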

If the number of worker nodes is large enough, then hundreds of GB is
not that big to deal with.


On Wed, Jan 14, 2015 at 5:30 AM, Puneet Kapoor <puneet.cse.i...@gmail.com>
wrote:

> Hi,
>
> I have a use case wherein I have an hourly Spark job which creates hourly
> RDDs, which are partitioned by key.
>
> At the end of the day I need to access all of these RDDs and combine the
> Key/Value pairs over the day.
>
> If there is a key K1 in RDD0 (1st hour of the day), RDD1 ... RDD23 (last
> hour of the day), we need to combine all the values of this K1 using some
> logic.
>
> What I want to do is avoid the shuffling at the end of the day, since
> the data is huge, ~hundreds of GB.
>
> Questions
> ---------------
> 1.) Is there a way that I can persist the hourly RDDs with partition
> information, so that when the RDDs are read back the partition information
> is restored?
> 2.) Can I ensure that partitioning is the same for different hours? For
> example, if K1 goes to container_X, it would go to the same container in
> the next hour, and so on.
>
> Regards
> Puneet
>
>
