I believe the default hash partitioner logic in Spark will send all
occurrences of the same key to the same machine.
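For reference, a minimal sketch of that logic (Spark ships its own
HashPartitioner, which additionally sends null keys to partition 0; the
class name below is illustrative):

    import org.apache.spark.Partitioner

    // Sketch of hash partitioning: identical keys always hash to the
    // same partition index, so they land on the same machine.
    class SimpleHashPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = {
        val mod = key.hashCode % numPartitions
        if (mod < 0) mod + numPartitions else mod  // keep index non-negative
      }
    }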
On Wed, Jan 14, 2015, 03:03 Puneet Kapoor puneet.cse.i...@gmail.com wrote:
Hi,
I have a use case wherein I have an hourly Spark job which creates hourly
RDDs, partitioned by keys.
At
There is no way to avoid a shuffle if you use combineByKey, no matter
whether your data is cached in memory, because the shuffle write must
write the data to disk. And it seems that Spark cannot guarantee that the
same key (K1) goes to Container_X.
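As a concrete illustration, here is a minimal combineByKey sketch; unless
the input RDD already carries a matching partitioner, Spark shuffles so
that all values for a key meet on one machine (the per-key average, app
name, and sample data below are made up for the example):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("combine-sketch").setMaster("local[*]"))
    val rdd = sc.parallelize(Seq(("k1", 1.0), ("k2", 2.0), ("k1", 3.0)))
    // combineByKey must bring all values for a key together, which
    // triggers a shuffle (and hence shuffle writes) here.
    val avgByKey = rdd.combineByKey(
      (v: Double) => (v, 1),                                             // createCombiner
      (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),       // mergeValue
      (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
    ).mapValues { case (sum, n) => sum / n }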
You can use tmpfs for your shuffle dir; this keeps the shuffle writes in
RAM-backed storage rather than on physical disk.
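For example, a minimal sketch using the spark.local.dir property, which
controls where shuffle files are written (the /dev/shm path assumes your
nodes mount a tmpfs there; note that on YARN the cluster manager's local
directories override this setting):

    import org.apache.spark.SparkConf

    // Point Spark's scratch space (which includes shuffle files) at a
    // tmpfs mount so shuffle writes land in memory-backed storage.
    val conf = new SparkConf()
      .setAppName("tmpfs-shuffle-sketch")
      .set("spark.local.dir", "/dev/shm/spark")  // assumes /dev/shm is tmpfs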
By the way, I am not sure whether the same shuffle key will go to the
same container.