Hi,

Thanks for your kind response. Building the hash key from random numbers
increases the time for processing the data. My join over the entire month
finishes within 150 seconds for 471 million records, but then spends another
6 minutes on the remaining 55 million records.

Using the salted hash keys pushes the processing time to 11 minutes, so I am
not quite clear why I should do that. The overall idea was for the entire
processing of around 520 million records to finish in maybe another 10
seconds.
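
For context, this is roughly what the salted join looks like on my side. It
is only a minimal sketch: the table names (facts, dims), the join column
(join_key) and the salt factor are placeholders, not the actual job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salted-join-sketch").getOrCreate()

val saltBuckets = 16  // placeholder: tuned to the degree of skew

// Large, skewed side: tag each row with a random salt value.
val facts = spark.table("facts")
  .withColumn("salt", (rand() * saltBuckets).cast("int"))

// Small side: replicate each row once per salt value so every
// (join_key, salt) combination on the large side finds a match.
val dims = spark.table("dims")
  .withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

// The hot key is now spread over up to saltBuckets partitions
// instead of landing in a single reducer.
val joined = facts.join(dims, Seq("join_key", "salt")).drop("salt")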



Regards,
Gourav Sengupta

On Thu, Feb 16, 2017 at 4:54 PM, Anis Nasir <aadi.a...@gmail.com> wrote:

> You can also do something similar to what is mentioned in [1].
>
> The basic idea is to use two hash functions for each key and assign it
> to the less loaded of the two candidate workers.
>
> Cheers,
> Anis
>
>
> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>
>
> On Fri, 17 Feb 2017 at 01:34, Yong Zhang <java8...@hotmail.com> wrote:
>
>> Yes. You have to change your key, or in BigData terms, "add salt".
>>
>>
>> Yong
>>
>> ------------------------------
>> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
>> *Sent:* Thursday, February 16, 2017 11:11 AM
>> *To:* user
>> *Subject:* skewed data in join
>>
>> Hi,
>>
>> Is there a way to do multiple reducers for joining on skewed data?
>>
>> Regards,
>> Gourav
>>
>
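
For completeness, a minimal, self-contained sketch of the two-choices idea
Anis references in [1] could look like the following; the worker count, hash
seeds and key stream below are made up for illustration:

import scala.util.hashing.MurmurHash3

class TwoChoicesPartitioner(numWorkers: Int) {
  // Local count of how many tuples each worker has received so far.
  private val load = Array.fill(numWorkers)(0L)

  private def candidate(key: String, seed: Int): Int =
    Math.floorMod(MurmurHash3.stringHash(key, seed), numWorkers)

  // Hash the key with two different seeds and send the tuple to
  // whichever of the two candidate workers is currently less loaded.
  def partition(key: String): Int = {
    val c1 = candidate(key, seed = 17)   // made-up seed
    val c2 = candidate(key, seed = 31)   // made-up seed
    val chosen = if (load(c1) <= load(c2)) c1 else c2
    load(chosen) += 1
    chosen
  }
}

object TwoChoicesDemo extends App {
  val p = new TwoChoicesPartitioner(numWorkers = 8)
  // A heavily skewed stream: the hot key's tuples are split across its two
  // candidate workers (when they differ) rather than all going to one.
  val keys = Seq.fill(1000)("hot-key") ++ Seq("a", "b", "c", "d")
  keys.foreach(p.partition)
}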
