Re: Custom partitioner
Yes, you can. Use the partitionBy method and pass a partitioner to it.

On Apr 17, 2015 8:18 PM, Jeetendra Gangele gangele...@gmail.com wrote:
OK, is there a way I can use hash partitioning to improve the performance?
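To illustrate what passing a hash partitioner to partitionBy does, here is a minimal pure-Python sketch of the idea, not Spark's actual implementation. The helper names `hash_partition` and `partition_by` are hypothetical and exist only for this illustration; in Spark the equivalent step would be partitionBy with a `HashPartitioner`. The sketch assumes small non-negative integer keys, like those produced by zipWithIndex, for which the hash equals the key itself.

```python
from collections import defaultdict


def hash_partition(key: int, num_partitions: int) -> int:
    """Pick a partition index from the key's hash (hypothetical helper)."""
    # Python's % returns a non-negative result for a positive modulus.
    return hash(key) % num_partitions


def partition_by(records, num_partitions):
    """Move every (key, value) record to the partition chosen by its key.

    This regrouping is the one shuffle the thread says cannot be avoided.
    """
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


# Six keys with two values each, spread over three partitions.
records = [(k, f"v{k}-{i}") for k in range(6) for i in range(2)]
parts = partition_by(records, num_partitions=3)

for index in sorted(parts):
    print(index, sorted({key for key, _ in parts[index]}))
```

After this step every record for a given key sits in one partition. Operations that leave keys untouched, such as flatMapValues, can then keep that layout without another shuffle.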
Re: Custom partitioner
On Apr 17, 2015 6:30 PM, Jeetendra Gangele gangele...@gmail.com wrote:
Hi Archit,

Thanks for the reply. How can I do the custom compilation to reduce it to 4 bytes? I want to make it 4 bytes in any case; can you please guide me?

I am applying flatMapValues in each step after zipWithIndex, so the data should stay on the same node, right? Why is it shuffling? Also, I am currently running with very few records, and it is still shuffling.

Regards,
Jeetendra

On 17 April 2015 at 15:58, Archit Thakur archit279tha...@gmail.com wrote:
I don't think you can change it to 4 bytes without a custom compilation. To make the same key go to the same node, you'll have to repartition the data, which is a shuffle anyway. Unless your raw data is laid out so that each key already sits on a single node, you'll have to shuffle at least once to bring each key onto one node.
Re: Custom partitioner
On 17 April 2015 at 19:33, Archit Thakur archit279tha...@gmail.com wrote:
By custom compilation, I meant changing the code and building it. I have not done a complete impact analysis; I just had a look at the code.

When you say the same key goes to the same node, that needs a shuffle unless the raw data you are reading is already laid out that way.
Custom partitioner
On Thu, Apr 16, 2015 at 10:16 PM, Jeetendra Gangele gangele...@gmail.com wrote:
Hi All,

I have an RDD with 1 million keys, and each key is repeated for around 7,000 values, so in total there will be around 1M * 7K records in the RDD. Each key is created from zipWithIndex, so keys run from 0 to M-1. The problem with zipWithIndex is that it uses a long for the key, which is 8 bytes. Can I reduce it to 4 bytes?

Also, how can I make sure that records with the same key go to the same node so that I can avoid shuffling? And how will the default partitioner work here?

Regards,
Jeetendra
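As a quick sanity check on the sizes in this question, the arithmetic below compares them against the largest signed 4-byte integer. It takes the figures as stated in the post (M = 1,000,000 keys, about 7,000 values per key); the variable names are chosen only for this illustration.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 4-byte integer can hold

num_keys = 1_000_000      # M, as stated in the post
values_per_key = 7_000    # approximate figure from the post

total_records = num_keys * values_per_key
largest_key = num_keys - 1  # keys run from 0 to M-1

key_fits_in_4_bytes = largest_key <= INT32_MAX
record_count_fits_in_4_bytes = total_records <= INT32_MAX

print(f"total records: {total_records:,}")
print(f"largest key fits in 4 bytes: {key_fits_in_4_bytes}")
print(f"record count fits in 4 bytes: {record_count_fits_in_4_bytes}")
```

Under these assumptions the key values themselves fit in 4 bytes, while the total record count does not. Whether narrowing the key is worthwhile still depends on how the index is generated and used in the job.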