Anis,

If your random partitions are smaller than your smallest machine, and you
request executors for your Spark jobs no larger than your smallest machine,
then Spark's cluster manager will automatically assign more executors to your
larger machines.

As long as you request small executors, you will utilize your large boxes
effectively because they will run many more executors simultaneously than
the small boxes do.
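As a rough sketch of the "small executors" idea: the machine sizes below are hypothetical (a 4-core/16 GB smallest node), and `my_job.py` is a placeholder for your application, but the flags themselves are standard spark-submit options.

```shell
# Hypothetical cluster: smallest node is 4 cores / 16 GB; larger nodes 16 cores / 64 GB.
# Requesting executors sized well under the smallest node lets the cluster
# manager pack ~2 executors on each small node and ~8 on each large node.
spark-submit \
  --master yarn \
  --executor-cores 2 \
  --executor-memory 6g \
  --conf spark.dynamicAllocation.enabled=true \
  my_job.py
```

With dynamic allocation enabled, Spark also scales the executor count up and down with the workload, so the large boxes fill up only when there are tasks to run.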

On Tue, Feb 14, 2017 at 5:09 PM, Anis Nasir <aadi.a...@gmail.com> wrote:

> Thank you very much for your reply.
>
> I guess this approach balances the load across the cluster of machines.
>
> However, I am looking for something for a heterogeneous cluster for which
> the distribution is not known a priori.
>
> Cheers,
> Anis
>
>
> On Tue, 14 Feb 2017 at 20:19, Galen Marchetti <galenmarche...@gmail.com>
> wrote:
>
>> Anis,
>>
>> I've typically seen people handle skew by seeding the high-volume keys
>> with random values, then partitioning the dataset on the original key
>> *and* the random value, then reducing.
>>
>> Ex: ( <digits_in_salary>, <name> ) -> ( <digits_in_salary>,
>> <random_digit>, <name> )
>>
>> This transformation reduces the size of the huge partition, making it
>> tenable for Spark, as long as you can figure out logic for aggregating the
>> results of the seeded partitions together again.
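The two-phase salting idea quoted above can be sketched in plain Python (no Spark dependency) for a sum aggregation; the function name and salt count are illustrative, but the same shape maps onto a pair of `reduceByKey` passes in Spark:

```python
import random
from collections import defaultdict

def salted_reduce(pairs, num_salts=4):
    """Two-phase sum that spreads a hot key across `num_salts` buckets."""
    # Phase 1: seed each key with a random salt and reduce per (key, salt),
    # so a single hot key is split into up to `num_salts` partial results.
    partial = defaultdict(int)
    for key, value in pairs:
        salt = random.randrange(num_salts)
        partial[(key, salt)] += value
    # Phase 2: strip the salt and combine the partials per original key.
    final = defaultdict(int)
    for (key, _salt), value in partial.items():
        final[key] += value
    return dict(final)

# The hot key "a" dominates the input; phase 1 splits its work into buckets.
data = [("a", 1)] * 1000 + [("b", 1)] * 10
print(salted_reduce(data))  # {'a': 1000, 'b': 10}
```

Phase 2 only works because sum is associative; for non-associative aggregates you need a combiner that can merge partial results.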
>>
>> On Tue, Feb 14, 2017 at 12:01 PM, Anis Nasir <aadi.a...@gmail.com> wrote:
>>
>> Dear All,
>>
>> I have a few use cases for Spark Streaming where the Spark cluster consists
>> of heterogeneous machines.
>>
>> Additionally, there is skew present in both the input distribution (e.g.,
>> each tuple is drawn from a Zipf distribution) and the service time (e.g.,
>> the service time required for each tuple comes from a Zipf distribution).
>>
>> I want to know how Spark will handle such use cases.
>>
>> Any help will be highly appreciated!
>>
>>
>> Regards,
>> Anis
>>
