Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti
Anis, If your random partitions are smaller than your smallest machine, and you request executors for your Spark jobs no larger than your smallest machine, then Spark and the cluster manager will automatically assign many executors to your larger machines. As long as you request small executors, you
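
A rough sketch of the sizing idea in this reply: pick an executor shape that fits the smallest machine, then count how many such executors each machine could hold. The machine names, core counts, and memory figures are made up for illustration, and both helper functions are hypothetical. The real packing is done by the cluster manager and also accounts for memory overhead and resources reserved for the OS, which this sketch ignores.

```python
def smallest_machine_executor(machines):
    """Return an executor shape (cores, mem_gb) that fits on every machine.

    machines: dict mapping a machine name to (cores, mem_gb).
    """
    cores = min(c for c, _ in machines.values())
    mem_gb = min(m for _, m in machines.values())
    return cores, mem_gb


def executors_per_machine(machines, executor_cores, executor_mem_gb):
    """Count how many executors of the given shape fit on each machine.

    A machine is limited by whichever runs out first, cores or memory.
    """
    plan = {}
    for name, (cores, mem_gb) in machines.items():
        by_cores = cores // executor_cores
        by_memory = int(mem_gb // executor_mem_gb)
        plan[name] = min(by_cores, by_memory)
    return plan


# Hypothetical heterogeneous cluster: one small and one large machine.
machines = {"small": (4, 16), "large": (16, 64)}
exec_cores, exec_mem_gb = smallest_machine_executor(machines)
plan = executors_per_machine(machines, exec_cores, exec_mem_gb)
```

In Spark the executor shape would be requested through properties such as `spark.executor.cores` and `spark.executor.memory`; the larger machine then simply hosts more executors and so receives proportionally more tasks.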

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Thank you very much for your reply. I guess this approach balances the load across the cluster of machines. However, I am looking for something for a heterogeneous cluster for which the distribution is not known a priori. Cheers, Anis On Tue, 14 Feb 2017 at 20:19, Galen Marchetti

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti
Anis, I've typically seen people handle skew by seeding the keys corresponding to high volumes with random values, then partitioning the dataset based on the original key *and* the random value, then reducing. Ex: (key, value) -> (key, random_value, value) This transformation reduces the size of the huge partition,
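
The salting approach described in this reply can be sketched in plain Python as a two-stage sum. This is only an illustration under stated assumptions: the function name `salted_sum`, the `hot_keys` set, and the `num_salts` parameter are hypothetical, and the trick as written only works for reductions that are associative and commutative (such as a sum). In Spark the two stages would roughly correspond to a map that adds the salt followed by a `reduceByKey`, then a map that drops the salt followed by a second `reduceByKey`.

```python
import random
from collections import defaultdict


def salted_sum(records, hot_keys, num_salts=8, seed=0):
    """Sum values per key, spreading each hot key over several salted sub-keys.

    records:   iterable of (key, value) pairs
    hot_keys:  keys assumed to be known (or estimated) as high volume
    num_salts: number of sub-keys each hot key is split into
    Returns (totals, partial) where partial holds the stage-1 sums.
    """
    rng = random.Random(seed)

    # Stage 1: reduce on (key, salt). Each hot key is split across up to
    # num_salts groups, so no single group holds all of its records.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += value

    # Stage 2: drop the salt and combine the much smaller set of partial sums.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal

    return dict(totals), dict(partial)


# Hypothetical skewed input: key "a" dominates the stream.
records = [("a", 1)] * 1000 + [("b", 2)] * 3
totals, partial = salted_sum(records, hot_keys={"a"}, num_salts=4)
```

The final totals match an unsalted sum, while the largest stage-1 group for the hot key is only a fraction of its full volume. The cost is a second reduction step, and the hot keys have to be identified somehow, which is the part that stays hard when the distribution is not known in advance.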

Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Dear All, I have a few use cases for Spark Streaming where the Spark cluster consists of heterogeneous machines. Additionally, there is skew present in both the input distribution (e.g., each tuple is drawn from a Zipf distribution) and the service time (e.g., service time required for each tuple comes