I believe Spark's default HashPartitioner will send all records with the
same key to the same partition, and hence to the same machine within a
single job.
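
For example (a minimal sketch: the partition count, sample data, and names
are made up), pre-partitioning two hourly RDDs with the same HashPartitioner
co-locates each key, so cogrouping them later is planned as a narrow
dependency and does not re-shuffle the pre-partitioned data:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

    object CoPartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("co-partition-sketch"))

        // Reuse one partitioner everywhere; 48 partitions is an assumption.
        val partitioner = new HashPartitioner(48)

        val hour0 = sc.parallelize(Seq(("K1", 1), ("K2", 2))).partitionBy(partitioner)
        val hour1 = sc.parallelize(Seq(("K1", 10), ("K2", 20))).partitionBy(partitioner)

        // Both inputs expose the same partitioner, so K1's values are already
        // co-located and the cogroup needs no shuffle.
        val combined = hour0.cogroup(hour1).mapValues {
          case (vs0, vs1) => vs0.sum + vs1.sum
        }
        combined.collect().foreach(println)
        sc.stop()
      }
    }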

On Wed, Jan 14, 2015, 03:03 Puneet Kapoor <puneet.cse.i...@gmail.com> wrote:

> Hi,
>
> I have a use case in which an hourly Spark job creates hourly RDDs
> that are partitioned by key.
>
> At the end of the day I need to access all of these RDDs and combine
> the values for each key across the day.
>
> If there is a key K1 in RDD0 (1st hour of the day), RDD1, ... RDD23
> (last hour of the day), we need to combine all the values of this K1
> using some logic.
>
> What I want to do is avoid the shuffle at the end of the day, since
> the data is huge, ~hundreds of GB.
>
> Questions
> ---------------
> 1.) Is there a way I can persist the hourly RDDs with their partition
> information, so that when the RDDs are read back the partition
> information is restored?
> 2.) Can I ensure that the partitioning is the same across hours? E.g.,
> if K1 goes to container_X in one hour, it would go to the same
> container in the next hour, and so on.
>
> Regards
> Puneet
>
>
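
On question 1: as far as I know, saveAsObjectFile/objectFile do not persist
the partitioner, so the partition information is not restored automatically
when you read the RDDs back; you have to re-apply the same partitioner after
loading (one shuffle per hourly RDD). Once all 24 RDDs share a partitioner,
the day-end combine itself should not shuffle the bulk of the data. A sketch,
assuming a hypothetical /data/hourly/<h> path layout, (String, Long) records,
and 48 partitions:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

    object DailyCombineSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("daily-combine-sketch"))

        // Must match the numPartitions the hourly jobs used (48 is assumed).
        val partitioner = new HashPartitioner(48)

        // objectFile returns a plain RDD with no partitioner attached, so the
        // partitioning is re-applied after loading.
        val hourly = (0 to 23).map { h =>
          sc.objectFile[(String, Long)](s"/data/hourly/$h").partitionBy(partitioner)
        }

        // With all inputs co-partitioned, union keeps the common partitioner
        // (on recent Spark versions) and reduceByKey over it is a narrow
        // dependency: no day-end repartitioning of the hundreds of GB.
        val daily = sc.union(hourly).reduceByKey(partitioner, _ + _)
        daily.saveAsObjectFile("/data/daily") // hypothetical output path
        sc.stop()
      }
    }

On question 2: hash partitioning guarantees that K1 lands in the same
partition index every hour, but which executor/container hosts that
partition is up to the scheduler, so same-container placement across jobs
is not guaranteed.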
