Re: Custom partitioner

2015-04-18 Thread Archit Thakur
Yes, you can. Use the partitionBy method and pass a partitioner to it.
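
A minimal sketch of what hash partitioning does with integer keys, written in plain Python so it runs without a cluster. The helper names partition_of and partition_by are hypothetical and only mimic the idea behind a hash partitioner (partition index = hash of the key modulo the number of partitions). In PySpark the corresponding call on a pair RDD would be rdd.partitionBy(numPartitions, partitionFunc).

```python
from collections import defaultdict


def partition_of(key, num_partitions):
    """Map a key to a partition index, the way a hash partitioner would."""
    # Python's % returns a non-negative result for a positive modulus,
    # so the index always lies in [0, num_partitions).
    return hash(key) % num_partitions


def partition_by(records, num_partitions):
    """Group (key, value) records into partitions according to their key."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return dict(partitions)


# Hypothetical sample data: integer keys like those produced by zipWithIndex.
records = [(0, "a"), (1, "b"), (4, "c"), (0, "d"), (5, "e"), (4, "f")]
parts = partition_by(records, num_partitions=4)

# Every record with the same key ends up in the same partition.
for index, content in sorted(parts.items()):
    print(index, content)
```

Note that producing this layout from arbitrarily distributed input is itself a shuffle; the benefit is that later per-key operations can reuse it.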
On Apr 17, 2015 8:18 PM, Jeetendra Gangele gangele...@gmail.com wrote:

 OK, is there a way I can use hash partitioning to improve the
 performance?



Re: Custom partitioner

2015-04-17 Thread Jeetendra Gangele
Hi Archit, thanks for the reply.
How can I do the custom compilation to reduce it to 4 bytes? I want to make
it 4 bytes in any case; can you please guide me?

I am applying flatMapValues in each step after zipWithIndex, so the data should
stay on the same node, right? Why is it shuffling?
Also, I am currently running with very few records, and it is still shuffling.

regards
jeetendra




Re: Custom partitioner

2015-04-17 Thread Archit Thakur
By custom compilation, I meant changing the code and building it. I have not
done a complete impact analysis; I just had a look at the code.

When you say the same key goes to the same node, that would need shuffling
unless the raw data you are reading is already laid out that way.
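
As a rough illustration of the point above, the following plain-Python sketch simulates one shuffle followed by a per-key transformation. The functions hash_partition and flat_map_values are hypothetical stand-ins, not Spark APIs; they only model the behaviour that, once records are partitioned by key, a values-only transformation such as flatMapValues leaves every key where it is.

```python
def hash_partition(records, num_partitions):
    """Simulated shuffle: route each (key, value) record by hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def flat_map_values(partition, func):
    """Expand each value into several values without touching the key."""
    return [(key, out) for key, value in partition for out in func(value)]


# Hypothetical raw layout: keys 0 and 1 are both spread over two partitions.
raw = [[(0, "ab"), (1, "cd")], [(1, "ef"), (0, "gh")]]

# One shuffle is needed to bring equal keys together.
shuffled = hash_partition([rec for part in raw for rec in part], 2)

# The per-key expansion then works partition by partition, with no data movement.
expanded = [flat_map_values(part, list) for part in shuffled]
keys_per_partition = [{key for key, _ in part} for part in expanded]
print(keys_per_partition)
```

If the input already arrived with each key confined to one partition, the first step would be unnecessary, which is the exception described above.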

Custom partitioner

2015-04-16 Thread Jeetendra Gangele
Hi all,

I have an RDD with 1 million keys, and each key is repeated for around
7000 values, so in total there will be around 1M*7K records in the RDD.

Each key is created by zipWithIndex, so the keys run from 0 to M-1.
The problem with zipWithIndex is that it uses a long for the key, which is
8 bytes. Can I reduce it to 4 bytes?
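
A small arithmetic sketch of the sizes involved, in plain Python. It assumes M is the 1 million keys mentioned above and that "4 bytes" means a signed 32-bit integer; the variable names are illustrative only. It checks whether the key range fits in 4 bytes. It does not show that narrowing the key would make the job faster, which this thread does not establish.

```python
import struct

NUM_KEYS = 1_000_000          # "1 million keys" from the message above
VALUES_PER_KEY = 7_000        # "around 7000 values" per key
INT32_MAX = 2**31 - 1         # largest signed 4-byte integer

max_key = NUM_KEYS - 1        # keys run from 0 to M-1
total_records = NUM_KEYS * VALUES_PER_KEY

# The keys themselves fit comfortably in a signed 4-byte integer ...
key_fits_in_4_bytes = max_key <= INT32_MAX
# ... but an index over all 1M*7K records would not.
record_index_fits_in_4_bytes = (total_records - 1) <= INT32_MAX

# Serialized width of the same key as a 4-byte int and as an 8-byte long.
packed_as_int = struct.pack("<i", max_key)
packed_as_long = struct.pack("<q", max_key)

print(key_fits_in_4_bytes, record_index_fits_in_4_bytes)
print(len(packed_as_int), len(packed_as_long))
```

So narrowing is only safe when the index is taken over the distinct keys, not over the full record set.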

Now, how can I make sure that records with the same key go to the same node so
that I can avoid shuffling? Also, how will the default partitioner work here?

Regards
jeetendra