Thank you very much for your reply.

I guess this approach balances the load across the cluster of machines.

However, I am looking for something that works on a heterogeneous cluster
for which the distribution is not known a priori.

Cheers,
Anis


On Tue, 14 Feb 2017 at 20:19, Galen Marchetti <galenmarche...@gmail.com>
wrote:

> Anis,
>
> I've typically seen people handle skew by seeding the keys corresponding
> to high volumes with random values, then partitioning the dataset based on
> the original key *and* the random value, then reducing.
>
> Ex: ( <digits_in_salary>, <name> ) -> ( <digits_in_salary>,
> <random_digit>, <name> )
>
> This transformation shrinks the huge partition, making it tractable for
> Spark, as long as you can work out the logic for aggregating the results
> of the seeded partitions back together again.
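>
> A minimal sketch of that salt / partial-reduce / merge pattern (assuming
> an RDD of (key, value) pairs and a sum as the aggregation; saltedSum and
> numSalts are illustrative names, not Spark API):
>
>     import scala.util.Random
>     import org.apache.spark.rdd.RDD
>
>     // Phase 1: append a random salt to each key and partially aggregate,
>     // so one hot key is spread across numSalts smaller partitions.
>     // Phase 2: drop the salt and combine the partial results.
>     def saltedSum(pairs: RDD[(String, Long)], numSalts: Int): RDD[(String, Long)] =
>       pairs
>         .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
>         .reduceByKey(_ + _)
>         .map { case ((k, _), v) => (k, v) }
>         .reduceByKey(_ + _)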
>
> On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir <aadi.a...@gmail.com> wrote:
>
> Dear All,
>
> I have a few use cases for Spark Streaming where the Spark cluster
> consists of heterogeneous machines.
>
> Additionally, there is skew in both the input distribution (e.g., each
> tuple is drawn from a Zipf distribution) and the service time (e.g., the
> service time required for each tuple follows a Zipf distribution).
>
> I want to know how Spark will handle such use cases.
>
> Any help will be highly appreciated!
>
>
> Regards,
> Anis
>
