Re: skewed data in join

2017-02-16 Thread Anis Nasir
You can also do something similar to what is described in [1].

The basic idea is to use two hash functions for each key and assign the key
to the less loaded of the two candidate workers.
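
A minimal sketch of that two-choices idea, in plain Python rather than Spark.
The function names, the MD5-based hashing, and the toy stream are assumptions
made for illustration; they are not taken from [1] or from any Spark API.

```python
import hashlib


def _hash(key: str, seed: int, num_workers: int) -> int:
    """Seeded hash of a key, mapped onto a worker index (illustrative choice)."""
    digest = hashlib.md5(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def assign_two_choices(keys, num_workers):
    """Send each tuple to the less loaded of its key's two candidate workers."""
    load = [0] * num_workers          # tuples routed to each worker so far
    assignment = []
    for key in keys:
        a = _hash(key, 1, num_workers)
        b = _hash(key, 2, num_workers)
        worker = a if load[a] <= load[b] else b
        load[worker] += 1
        assignment.append((key, worker))
    return assignment, load


# Hypothetical skewed stream: one hot key plus many rare keys.
stream = ["hot"] * 900 + [f"k{i}" for i in range(100)]
assignment, load = assign_two_choices(stream, num_workers=4)
print(load)
```

Because a key can now land on either of two workers, any per-key state is
split in two and has to be merged in a later step.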

Cheers,
Anis


[1] https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf


On Fri, 17 Feb 2017 at 01:34, Yong Zhang wrote:

> Yes. You have to change your key or, in big-data terms, "add salt".
>
>
> Yong
>
> --
> *From:* Gourav Sengupta 
> *Sent:* Thursday, February 16, 2017 11:11 AM
> *To:* user
> *Subject:* skewed data in join
>
> Hi,
>
> Is there a way to do multiple reducers for joining on skewed data?
>
> Regards,
> Gourav
>
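
The "salt" suggestion quoted above can be sketched for a join as follows.
This is plain Python, not Spark code; `salted_join`, `num_salts`, and the toy
datasets are hypothetical names chosen for illustration. The large side gets
a random salt appended to each key, and the small side is replicated once per
salt value so that every salted key still finds its match.

```python
import random
from collections import defaultdict


def salted_join(large, small, num_salts, seed=0):
    """Inner-join two lists of (key, value) pairs on a salted key."""
    rng = random.Random(seed)

    # Replicate every small-side row under each possible salt value.
    small_index = defaultdict(list)
    for key, value in small:
        for salt in range(num_salts):
            small_index[(key, salt)].append(value)

    # Give each large-side row one random salt; in a cluster, the
    # (key, salt) pair would be the partitioning key, so a hot key is
    # spread over up to num_salts reducers instead of one.
    joined = []
    for key, value in large:
        salt = rng.randrange(num_salts)
        for other in small_index.get((key, salt), []):
            joined.append((key, value, other))
    return joined


# Hypothetical data: "hot" dominates the large side.
large = [("hot", i) for i in range(6)] + [("cold", 99)]
small = [("hot", "H"), ("cold", "C")]
joined = salted_join(large, small, num_salts=3)
print(sorted(joined))
```

The cost is that the small side grows by a factor of `num_salts`, so the salt
range is a trade-off between reducer balance and replication.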


Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Thank you very much for your reply.

I guess this approach balances the load across the cluster of machines.

However, I am looking for something that works on a heterogeneous cluster
for which the distribution is not known a priori.

Cheers,
Anis


On Tue, 14 Feb 2017 at 20:19, Galen Marchetti wrote:

> Anis,
>
> I've typically seen people handle skew by seeding the keys corresponding
> to high volumes with random values, then partitioning the dataset based on
> the original key *and* the random value, then reducing.
>
> Ex: ( key, value ) -> ( key, random value, value )
>
> This transformation reduces the size of the huge partition, making it
> tenable for spark, as long as you can figure out logic for aggregating the
> results of the seeded partitions together again.
>
> On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir wrote:
>
> Dear All,
>
> I have a few use cases for Spark Streaming where the Spark cluster consists
> of heterogeneous machines.
>
> Additionally, there is skew present in both the input distribution (e.g.,
> each tuple is drawn from a Zipf distribution) and the service time (e.g.,
> the service time required for each tuple comes from a Zipf distribution).
>
> I want to know how Spark will handle such use cases.
>
> Any help will be highly appreciated!
>
>
> Regards,
> Anis
>
>
>
>
>
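
The seeding approach quoted above (partition on the original key and a random
value, reduce, then combine the seeded partitions) can be sketched as a
two-stage aggregation. This is plain Python, not Spark code; `salted_sum` and
`num_salts` are hypothetical names, and for simplicity the sketch salts every
key rather than only the high-volume ones.

```python
import random
from collections import defaultdict


def salted_sum(records, num_salts, seed=0):
    """Sum values per key in two stages: per (key, salt), then per key."""
    rng = random.Random(seed)

    # Stage 1: reduce on (key, random salt). A hot key is split into
    # at most num_salts smaller groups.
    partial = defaultdict(int)
    for key, value in records:
        partial[(key, rng.randrange(num_salts))] += value

    # Stage 2: drop the salt and combine the partial results per key.
    total = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        total[key] += subtotal
    return dict(partial), dict(total)


# Hypothetical data: "hot" carries almost all of the records.
records = [("hot", i) for i in range(1000)] + [("cold", 5)]
partial, total = salted_sum(records, num_salts=8)
print(total)
```

This only works directly when the reduction can be merged from partial
results (sums, counts, min/max); other aggregations need their own
combining logic in stage 2.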


Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Dear All,

I have a few use cases for Spark Streaming where the Spark cluster consists
of heterogeneous machines.

Additionally, there is skew present in both the input distribution (e.g.,
each tuple is drawn from a Zipf distribution) and the service time (e.g.,
the service time required for each tuple comes from a Zipf distribution).

I want to know how Spark will handle such use cases.

Any help will be highly appreciated!


Regards,
Anis