It depends on how you salt it.  See slide 40 onwards from a Spark Summit
talk here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
The speakers use a mod-8 integer salt appended to the end of the key; the
salt that works best for you might be different.
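
To make the salting idea concrete, here is a minimal sketch in plain
Python (not Spark code; the 8-way salt, key names, and data are all
illustrative): append a random 0-7 suffix to the skewed side's keys, and
replicate the other side once per suffix so an equi-join still matches.

```python
import random

NUM_SALTS = 8  # the talk uses a mod-8 salt; tune this for your data

# Skewed "big" side: one hot key dominates.
big = [("hot", v) for v in range(1000)] + [("cold", 1)]

# Small side to join against.
small = [("hot", "H"), ("cold", "C")]

# Salt the big side: append a random 0..7 suffix to each key, so the
# hot key is spread across up to NUM_SALTS partitions.
salted_big = [(f"{k}_{random.randrange(NUM_SALTS)}", v) for k, v in big]

# Replicate the small side once per salt value so every salted key
# on the big side finds its match.
salted_small = [(f"{k}_{s}", v) for k, v in small for s in range(NUM_SALTS)]

# An equi-join on the salted keys; strip the salt back off afterwards.
lookup = dict(salted_small)
joined = [(k.rsplit("_", 1)[0], v, lookup[k]) for k, v in salted_big]
```

In Spark you would do the same key transformation on the two
DataFrames/RDDs before the join; the cost is replicating the small side
NUM_SALTS times, which is the trade-off discussed below.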

On Thu, Feb 16, 2017 at 12:40 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> Thanks for your kind response. Hashing the key with random numbers
> increases the time for processing the data. My entire join for the month
> finishes within 150 seconds for 471 million records, and then takes
> another 6 minutes for the remaining 55 million records.
>
> Using hashed keys increases the processing time to 11 minutes, so I am
> not quite clear why I should do that. The overall idea was for the entire
> processing of around 520 million records to take perhaps only another 10
> seconds.
>
>
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 16, 2017 at 4:54 PM, Anis Nasir <aadi.a...@gmail.com> wrote:
>
>> You can also do something similar to what is mentioned in [1].
>>
>> The basic idea is to compute two hashes for each key and assign the key
>> to the less loaded of its two candidate workers.
>>
>> Cheers,
>> Anis
>>
>>
>> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>>
>>
>> On Fri, 17 Feb 2017 at 01:34, Yong Zhang <java8...@hotmail.com> wrote:
>>
>>> Yes. You have to change your key, or, in big data terms, add a "salt".
>>>
>>>
>>> Yong
>>>
>>> ------------------------------
>>> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
>>> *Sent:* Thursday, February 16, 2017 11:11 AM
>>> *To:* user
>>> *Subject:* skewed data in join
>>>
>>> Hi,
>>>
>>> Is there a way to do multiple reducers for joining on skewed data?
>>>
>>> Regards,
>>> Gourav
>>>
>>
>
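
For reference, the "power of two choices" assignment from [1] mentioned
upthread can be sketched as follows (a minimal illustration, not Spark
code; the worker count, hash construction, and stream are assumptions):
hash each key twice and send it to the less loaded candidate worker.

```python
import hashlib

NUM_WORKERS = 4  # assumed cluster size, for illustration only

def h(key: str, seed: int) -> int:
    """Deterministic seeded hash of key, mapped to a worker index."""
    digest = hashlib.md5(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_WORKERS

loads = [0] * NUM_WORKERS

def assign(key: str) -> int:
    """Send the key to the less loaded of its two candidate workers."""
    a, b = h(key, 1), h(key, 2)
    w = a if loads[a] <= loads[b] else b
    loads[w] += 1
    return w

# A skewed stream: "hot" appears far more often than the other keys,
# but its load is split between two candidate workers instead of one.
stream = ["hot"] * 100 + ["k1", "k2", "k3"] * 5
for k in stream:
    assign(k)
```

The paper shows this bounds the imbalance a single hot key can cause,
at the cost of having to aggregate each key's partial results from two
workers instead of one.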
