Anis,

I've typically seen people handle skew by "salting": seeding the high-volume
keys with random values, partitioning the dataset on the original key *and*
the random value, then reducing.

Ex: ( <digits_in_salary>, <name> ) -> ( <digits_in_salary>, <random_digit>,
<name> )

This transformation shrinks the huge partition to a size Spark can handle,
as long as you can work out the logic for aggregating the results of the
seeded partitions back together again.
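
To make that concrete, here's a minimal sketch in Scala. Everything in it
is illustrative rather than from your pipeline: the (digits, name) pairs,
the 0-9 salt range, and the count aggregation are all assumptions.

    import scala.util.Random
    import org.apache.spark.sql.SparkSession

    object SaltedAggregation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("SaltedAggregation").getOrCreate()
        val sc = spark.sparkContext

        // Illustrative input: ( <digits_in_salary>, <name> ) pairs.
        val pairs = sc.parallelize(Seq((6, "alice"), (6, "bob"), (5, "carol")))

        // Step 1: seed each key with a random digit, so one huge
        // partition becomes up to ten smaller ones.
        val salted = pairs.map { case (digits, name) =>
          ((digits, Random.nextInt(10)), name)
        }

        // Step 2: reduce on the composite (key, salt) key; each salted
        // partition is now small enough for a single executor.
        val partial = salted
          .mapValues(_ => 1L)   // counting names per key, as an example
          .reduceByKey(_ + _)

        // Step 3: drop the salt and merge the partial results back
        // together under the original key.
        val merged = partial
          .map { case ((digits, _), count) => (digits, count) }
          .reduceByKey(_ + _)

        merged.collect().foreach(println)
        spark.stop()
      }
    }

The one real constraint is step 3: the aggregation has to be computable
from partial results (sums, counts, maxes work; a plain average doesn't
unless you carry the sum and count separately).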

On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir <aadi.a...@gmail.com> wrote:

> Dear All,
>
> I have a few use cases for Spark Streaming where the Spark cluster
> consists of heterogeneous machines.
>
> Additionally, there is skew in both the input distribution (e.g., each
> tuple is drawn from a Zipf distribution) and the service time (e.g., the
> service time required for each tuple comes from a Zipf distribution).
>
> I want to know how Spark will handle such use cases.
>
> Any help will be highly appreciated!
>
>
> Regards,
> Anis
>
