Re: skewed data in join

2017-02-16 Thread Anis Nasir
You can also do something similar to what is mentioned in [1]. The basic idea is to use two hash functions for each key and assign it to the less loaded of the two hashed workers. Cheers, Anis [1].

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
> , )
>
> This transformation reduces the size of the huge partition, making it
> tenable for spark, as long as you can figure out logic for aggregating the
> results of the seeded partitions together again.
>
> On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir <aadi.a...@gmail.com>
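
The seeding idea in the quoted reply can be sketched without Spark as a two-stage aggregation. This is only an illustration under assumptions: the function name `salted_sum`, the random salt, and the use of a sum as the aggregate are all hypothetical, since the original code in the quoted message is truncated.

```python
import random
from collections import defaultdict


def salted_sum(records, num_salts, seed=0):
    """Sum values per key in two stages, splitting each key across salts."""
    rng = random.Random(seed)

    # Stage 1: aggregate on (key, salt), so a hot key is spread over up to
    # num_salts smaller groups instead of one huge partition.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts)
        partial[(key, salt)] += value

    # Stage 2: drop the salt and combine the partial sums per original key.
    total = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        total[key] += subtotal

    return dict(partial), dict(total)


if __name__ == "__main__":
    data = [("hot", 1)] * 1000 + [("cold", 1)] * 10
    partial, total = salted_sum(data, num_salts=4)
    print(total)
```

The second stage works here because a sum is associative and commutative; aggregates that are not would need their own logic for recombining the seeded partitions.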

Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Dear All, I have a few use cases for Spark Streaming where the Spark cluster consists of heterogeneous machines. Additionally, there is skew present in both the input distribution (e.g., each tuple is drawn from a Zipf distribution) and the service time (e.g., service time required for each tuple comes