OK, is there a way I can use hash partitioning to improve the performance?
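
To make the question concrete, here is a rough sketch of what I understand hash partitioning to do. It is plain Python, not Spark code: the function names (`partition_for`, `hash_partition`) and the sample records are made up for illustration, and I am assuming the keys are the small non-negative integers that zipWithIndex produces, so the key itself can stand in for its hash. In Spark I believe the equivalent would be `partitionBy` with a `HashPartitioner`.

```python
from collections import defaultdict


def partition_for(key: int, num_partitions: int) -> int:
    # Assumption: the key is a non-negative integer index, so it is used
    # directly as its own hash. Python's % is non-negative for a positive
    # modulus, which mirrors a "non-negative mod" of the hash code.
    return key % num_partitions


def hash_partition(records, num_partitions):
    # Group (key, value) records by target partition. Moving each record
    # to its target partition is the step that corresponds to a shuffle.
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical sample data: key 0 and key 1 each appear more than once.
records = [(0, "a"), (1, "b"), (0, "c"), (5, "d"), (1, "e")]
parts = hash_partition(records, num_partitions=4)

for pid in sorted(parts):
    print(pid, parts[pid])
```

In this sketch every record with the same key lands in the same partition, but only after the records have been redistributed once. So my understanding is that hash partitioning does not remove the shuffle; it pays for it once up front, and the hope is that later per-key steps can reuse that layout.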

On 17 April 2015 at 19:33, Archit Thakur <archit279tha...@gmail.com> wrote:

> By custom installation, I meant changing the code and building it. I have
> not done a complete impact analysis, I just had a look at the code.
>
> When you say the same key goes to the same node, that would need shuffling
> unless the raw data you are reading is already laid out that way.
> On Apr 17, 2015 6:30 PM, "Jeetendra Gangele" <gangele...@gmail.com> wrote:
>
>> Hi Archit, thanks for the reply.
>> How can I do the custom compilation to reduce it to 4 bytes? I want to
>> make it 4 bytes in any case, can you please guide me?
>>
>> I am applying flatMapValues in each step after zipWithIndex, so it should
>> stay on the same node, right? Why is it shuffling?
>> Also, I am currently running with very few records and it is still
>> shuffling.
>>
>> Regards
>> jeetendra
>>
>> On 17 April 2015 at 15:58, Archit Thakur <archit279tha...@gmail.com>
>> wrote:
>>
>>> I don't think you can change it to 4 bytes without a custom compilation.
>>> To make the same key go to the same node, you'll have to repartition the
>>> data, which is shuffling anyway. Unless your raw data is such that the
>>> same key is on the same node, you'll have to shuffle at least once to
>>> get the same key onto the same node.
>>>
>>> On Thu, Apr 16, 2015 at 10:16 PM, Jeetendra Gangele <
>>> gangele...@gmail.com> wrote:
>>>
>>>> Hi All
>>>>
>>>> I have an RDD which has 1 million keys, and each key has around 7000
>>>> values, so in total there will be around 1M*7K records in the RDD.
>>>>
>>>> Each key is created from zipWithIndex, so the keys run from 0 to M-1.
>>>> The problem with zipWithIndex is that it uses a long for the key,
>>>> which is 8 bytes. Can I reduce it to 4 bytes?
>>>>
>>>> Now, how can I make sure the records with the same key go to the same
>>>> node so that I can avoid shuffling? Also, how will the default
>>>> partitioner work here?
>>>>
>>>> Regards
>>>> jeetendra