Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti

oss the cluster of machines.
>
> However, I am looking for something for heterogeneous cluster for which
> the distribution is not known in prior.
>
> Cheers,
> Anis
>
>
> On Tue, 14 Feb 2017 at 20:19, Galen Marchetti <galenmarche...@gmail.com>
> wrote:
>

Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti

Anis,

I've typically seen people handle skew by seeding the keys corresponding to high volumes with random values, then partitioning the dataset based on the original key *and* the random value, then reducing.

Ex: ( , ) -> ( , , )

This transformation reduces the size of the huge partition,
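
For illustration, here is a minimal sketch of the two-stage idea described above, in plain Python rather than any particular cluster framework. The names `salted_sum`, `hot_keys`, and `num_salts` are hypothetical and chosen only for this example. It assumes the reduce step is associative and commutative (a sum here) and that the high-volume keys are already known, which may not hold when the distribution is unknown up front.

```python
import random
from collections import defaultdict


def salted_sum(records, hot_keys, num_salts=4, seed=0):
    """Sum values per key in two stages, salting only the hot keys.

    Hypothetical helper for illustration; not taken from any library.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible

    # Stage 1: (key, value) -> ((key, salt), value), then reduce per (key, salt).
    # In a cluster setting, each (key, salt) pair could go to a different partition.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        partial[(key, salt)] += value

    # Stage 2: drop the salt and reduce the partial results per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal

    return dict(totals), dict(partial)


# One very frequent key and one rare key.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = salted_sum(records, hot_keys={"hot"})
```

The first stage splits the hot key's records across up to `num_salts` smaller groups, and the second stage combines at most `num_salts` partial results per key, so the final totals match an unsalted reduce.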