Re: Save RDD with partition information

2015-01-13 Thread Raghavendra Pandey
I believe the default hash partitioner logic in Spark will send all records with the same key to the same machine. On Wed, Jan 14, 2015, 03:03 Puneet Kapoor puneet.cse.i...@gmail.com wrote: Hi, I have a use case wherein I have an hourly Spark job which creates hourly RDDs, which are partitioned by keys. At
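The claim above can be sketched in plain Python (this is an illustration of the idea, not Spark's actual code): a hash partitioner makes the target partition a pure function of the key, so every record with a given key lands in the same partition, and hence on the same machine. The keys and partition count below are made up for the example.

```python
def get_partition(key, num_partitions):
    """Mimic a hash partitioner: partition is determined solely by the key.

    Spark's HashPartitioner similarly uses a non-negative
    key.hashCode() modulo numPartitions.
    """
    return hash(key) % num_partitions


# Hypothetical (key, value) records, as in an hourly keyed RDD.
records = [("user1", 10), ("user2", 20), ("user1", 30), ("user3", 40)]

partitions = {}
for key, value in records:
    # Same key always hashes to the same partition index.
    partitions.setdefault(get_partition(key, 4), []).append((key, value))
```

Within a single run, both ("user1", ...) records end up in the same partition bucket, which is what lets Spark co-locate them.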

Re: Save RDD with partition information

2015-01-13 Thread lihu
There is no way to avoid a shuffle if you use combineByKey, no matter whether your data is cached in memory, because the shuffle write must write the data to disk. And it seems that Spark cannot guarantee that the same key (K1) goes to Container_X. You can use tmpfs for your shuffle dir; this
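A minimal sketch of the tmpfs suggestion, assuming a Linux host and a standard Spark deployment; the mount point `/mnt/ramdisk` and the 16g size are hypothetical, and `spark.local.dir` is the setting Spark uses for shuffle and spill files:

```
# Mount a memory-backed filesystem (hypothetical mount point and size):
#   sudo mkdir -p /mnt/ramdisk
#   sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk

# spark-defaults.conf: point shuffle/spill files at the tmpfs mount.
# Shuffle data is still "written to disk" from Spark's point of view,
# but the backing store is RAM, which avoids the physical disk I/O.
spark.local.dir  /mnt/ramdisk/spark
```

Note that tmpfs contents count against machine memory and are lost on reboot, so size it with the executors' memory needs in mind.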

Re: Save RDD with partition information

2015-01-13 Thread lihu
By the way, I am not sure whether the shuffled key can go to the same container.