Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti
Anis,

If your random partitions are smaller than your smallest machine, and you
request executors for your Spark jobs that are no larger than your smallest
machine, then Spark and the cluster manager will automatically assign more
executors to your larger machines.

As long as you request small executors, you will utilize your large boxes
effectively because they will run many more executors simultaneously than
the small boxes do.
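
As a rough sketch (hypothetical numbers, and in practice these settings are
usually passed to spark-submit rather than set in code), sizing executors to
fit the smallest box and enabling dynamic allocation lets the cluster manager
pack proportionally more executors onto the larger nodes:

import org.apache.spark.sql.SparkSession

// Hypothetical sizing: each executor fits on the smallest node, so the
// larger nodes simply end up hosting more executors at once.
val spark = SparkSession.builder
  .appName("small-executors")
  .config("spark.executor.memory", "4g")              // no larger than the smallest box allows
  .config("spark.executor.cores", "2")
  .config("spark.dynamicAllocation.enabled", "true")  // scale executor count to the work
  .config("spark.shuffle.service.enabled", "true")    // needed for dynamic allocation on YARN
  .getOrCreate()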



Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Thank you very much for your reply.

I guess this approach balances the load across the cluster of machines.

However, I am looking for something for a heterogeneous cluster where the
distribution is not known a priori.

Cheers,
Anis




Re: Handling Skewness and Heterogeneity

2017-02-14 Thread Galen Marchetti
Anis,

I've typically seen people handle skew by seeding the keys corresponding to
high volumes with random values, then partitioning the dataset based on the
original key *and* the random value, then reducing.

Ex: ( <key>, <value> ) -> ( <key>, <random value>, <value> )

This transformation reduces the size of the huge partition, making it
tractable for Spark, as long as you can work out the logic for aggregating
the results of the seeded partitions back together again.
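
A minimal sketch of that pattern on an RDD of (key, value) pairs, assuming a
simple sum as the aggregation and a hypothetical salt range of 10 (tune it to
the observed skew); the second reduceByKey merges the salted partial results
back per original key:

import org.apache.spark.sql.SparkSession
import scala.util.Random

object SaltedAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("salted-agg").getOrCreate()
    val sc = spark.sparkContext

    // Toy input: one very hot key plus a spread of uniform keys.
    val pairs = sc.parallelize(
      (1 to 100000).map(i => (if (i % 2 == 0) "hot" else s"k${i % 100}", 1L))
    )

    val numSalts = 10  // hypothetical; larger skew -> more salts

    val totals = pairs
      .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }  // seed the key with a random value
      .reduceByKey(_ + _)                                         // reduce per (key, salt) slice
      .map { case ((k, _), v) => (k, v) }                         // drop the salt
      .reduceByKey(_ + _)                                         // merge partials per original key

    totals.take(5).foreach(println)
    spark.stop()
  }
}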



Handling Skewness and Heterogeneity

2017-02-14 Thread Anis Nasir
Dear All,

I have a few use cases for Spark Streaming where the Spark cluster consists of
heterogeneous machines.

Additionally, there is skew present in both the input distribution (e.g.,
each tuple is drawn from a Zipf distribution) and the service time (e.g., the
service time required for each tuple follows a Zipf distribution).

I want to know how Spark will handle such use cases.

Any help will be highly appreciated!


Regards,
Anis