Thank you very much for your reply. I guess this approach balances the load across the cluster of machines.
However, I am looking for something for a heterogeneous cluster, for which the distribution is not known a priori.

Cheers,
Anis

On Tue, 14 Feb 2017 at 20:19, Galen Marchetti <galenmarche...@gmail.com> wrote:
> Anis,
>
> I've typically seen people handle skew by seeding the keys corresponding
> to high volumes with random values, then partitioning the dataset based on
> the original key *and* the random value, then reducing.
>
> Ex: ( <digits_in_salary>, <name> ) -> ( <digits_in_salary>, <random_digit>, <name> )
>
> This transformation reduces the size of the huge partition, making it
> tenable for Spark, as long as you can figure out logic for aggregating the
> results of the seeded partitions together again.
>
> On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir <aadi.a...@gmail.com> wrote:
>> Dear All,
>>
>> I have a few use cases for Spark Streaming where the Spark cluster consists of
>> heterogeneous machines.
>>
>> Additionally, there is skew present in both the input distribution (e.g.,
>> each tuple is drawn from a Zipf distribution) and the service time (e.g.,
>> the service time required for each tuple comes from a Zipf distribution).
>>
>> I want to know how Spark will handle such use cases.
>>
>> Any help will be highly appreciated!
>>
>> Regards,
>> Anis
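For reference, a minimal sketch of the salting approach Galen describes above, assuming an RDD of (key, value) pairs and a sum as the aggregation (the two-stage reduce only works when the aggregation is associative and commutative). The `numSalts` parameter and the `saltedSum` helper are hypothetical, not from the thread:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Spread each hot key across `numSalts` composite keys, reduce on the
// composite key, then drop the salt and reduce again to merge the
// partial results back into one value per original key.
def saltedSum(pairs: RDD[(String, Long)], numSalts: Int): RDD[(String, Long)] = {
  pairs
    // Seed the key with a random salt so one huge partition becomes
    // up to `numSalts` smaller ones.
    .map { case (key, value) => ((key, Random.nextInt(numSalts)), value) }
    // First reduction runs on the salted (key, salt) composite key.
    .reduceByKey(_ + _)
    // Strip the salt and aggregate the seeded partitions together again.
    .map { case ((key, _), partial) => (key, partial) }
    .reduceByKey(_ + _)
}
```

This doesn't address the heterogeneous-machine question, but it is the standard way to break up a skewed partition when the hot keys are not known in advance.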