It depends on how you salt it. See slide 40 onwards from a Spark Summit talk here: http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications. The speakers use a mod-8 integer salt appended to the end of the key; the salt scheme that works best for you might be different.
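The salting idea from those slides can be sketched in plain Python (no Spark required) to show the mechanics: the skewed side gets a random salt bucket appended to each key, the small side is exploded once per bucket, and the join runs on the composite key. All names here (`SALT_BUCKETS`, the toy `big`/`small` data) are illustrative assumptions, not from the thread.

```python
import random
from collections import Counter

SALT_BUCKETS = 8  # mod-8 salt, as in the slides; tune for your skew

# Toy skewed join: almost every fact row shares the key "hot".
big = [("hot", i) for i in range(1000)] + [("rare", -1)]
small = {"hot": "H", "rare": "R"}  # small dimension side

# Explode the small side: one copy per salt bucket, keyed (key, salt),
# so every salted copy of a hot key still finds its match.
salted_small = {(k, s): v
                for k, v in small.items()
                for s in range(SALT_BUCKETS)}

# Salt the big side: each row picks a random bucket for its key.
salted_big = [((k, random.randrange(SALT_BUCKETS)), v) for k, v in big]

# Join on the salted composite key, then drop the salt from the result.
joined = [(k, v, salted_small[(k, s)]) for (k, s), v in salted_big]

# The hot key's rows now spread across up to 8 partitions/reducers
# instead of all landing on one.
spread = Counter(s for (k, s), _ in salted_big if k == "hot")
```

The trade-off Gourav observes below is real: exploding the small side multiplies its rows by the bucket count, so salting only pays off when the skew itself is the bottleneck.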
On Thu, Feb 16, 2017 at 12:40 PM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> Hi,
>
> Thanks for your kind response. The hash key using random numbers increases
> the time for processing the data. My entire join for the entire month
> finishes within 150 seconds for 471 million records and then stays for
> another 6 mins for 55 million records.
>
> Using hash keys increases the processing time to 11 mins, so I am not
> quite clear why I should do that. The overall idea was that the entire
> processing of around 520 million records should take maybe another 10
> seconds more.
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 16, 2017 at 4:54 PM, Anis Nasir <aadi.a...@gmail.com> wrote:
>
>> You can also do something similar to what is mentioned in [1].
>>
>> The basic idea is to use two hash functions for each key and assign it
>> to the less loaded of the two hashed workers.
>>
>> Cheers,
>> Anis
>>
>> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>>
>> On Fri, 17 Feb 2017 at 01:34, Yong Zhang <java8...@hotmail.com> wrote:
>>
>>> Yes. You have to change your key, or in Big Data terms, "add salt".
>>>
>>> Yong
>>>
>>> ------------------------------
>>> *From:* Gourav Sengupta <gourav.sengu...@gmail.com>
>>> *Sent:* Thursday, February 16, 2017 11:11 AM
>>> *To:* user
>>> *Subject:* skewed data in join
>>>
>>> Hi,
>>>
>>> Is there a way to use multiple reducers for joining on skewed data?
>>>
>>> Regards,
>>> Gourav
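The two-choice scheme Anis cites in [1] can also be sketched in plain Python. Each key hashes to two candidate workers, and every incoming tuple goes to whichever of the two currently carries less load, so a single hot key splits across two workers rather than overloading one. The function and data here are an illustrative assumption, not code from the paper.

```python
import random
from collections import Counter

def two_choice_assign(keys, n_workers, seed=0):
    """Assign each tuple in `keys` to a worker using the
    'power of both choices' idea: two candidate workers per key,
    tuple goes to the currently less loaded one.
    (Illustrative sketch; the paper uses real hash functions.)"""
    rnd = random.Random(seed)
    h1, h2 = {}, {}          # two simulated hash functions
    load = Counter()          # tuples assigned per worker
    assignment = []
    for k in keys:
        if k not in h1:
            h1[k] = rnd.randrange(n_workers)
            # Ensure the second candidate differs from the first.
            h2[k] = (h1[k] + 1 + rnd.randrange(n_workers - 1)) % n_workers
        a, b = h1[k], h2[k]
        target = a if load[a] <= load[b] else b
        load[target] += 1
        assignment.append(target)
    return assignment, load

# A skewed stream: 90% of tuples carry the same hot key.
stream = ["hot"] * 900 + [f"k{i}" for i in range(100)]
random.Random(1).shuffle(stream)
assignment, load = two_choice_assign(stream, n_workers=8)
# Under plain key hashing one worker would receive all 900 hot tuples;
# with two choices they split roughly evenly between two workers.
```

Note this balances load at partitioning time, which suits streaming aggregation; for a batch join it requires a merge step afterwards, since each key's state lives on two workers.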