Re: spark single PROCESS_LOCAL task

2016-07-19 Thread Holden Karau
So it's possible that you have a lot of data in one of the partitions which is local to that process; maybe you could cache & count the upstream RDD and see what the input partitions look like? On the other hand, using groupByKey is often a bad sign to begin with. Can you rewrite your code to avoid it (for example with reduceByKey or aggregateByKey)?
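A minimal sketch of that check against the RDD API, in Scala. The input path, the key extraction, and the name "upstream" are placeholders for whatever actually feeds the groupByKey:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSkewCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionSkewCheck"))

    // Placeholder for the RDD that feeds the groupByKey.
    val upstream = sc.textFile("hdfs:///path/to/input")
      .map(line => (line.split('\t')(0), line))

    // Cache and materialize so the per-partition counts below
    // don't recompute the whole lineage.
    upstream.cache()
    upstream.count()

    // Record count per input partition; one outsized partition is the
    // likely home of the oversized PROCESS_LOCAL task.
    upstream
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
      .sortBy(-_._2)
      .take(10)
      .foreach { case (idx, n) => println(s"partition $idx -> $n records") }

    // If the goal is a per-key aggregate, reduceByKey combines map-side
    // and shuffles far less than groupByKey.
    val perKey = upstream.mapValues(_ => 1L).reduceByKey(_ + _)
    println(perKey.count())

    sc.stop()
  }
}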

spark single PROCESS_LOCAL task

2016-07-15 Thread Matt K
Hi all, I'm seeing some curious behavior which I have a hard time interpreting. I have a job which does a groupByKey and results in 300 tasks. 299 run in NODE_LOCAL mode; 1 task runs in PROCESS_LOCAL mode. The 1 task that runs in PROCESS_LOCAL mode gets about 10x as much data as the others.
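A skewed groupByKey task like this often traces back to a single hot key. A quick way to test that hypothesis (a sketch only; the input path and key extraction are stand-ins for the actual job):

import org.apache.spark.{SparkConf, SparkContext}

object HotKeyCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HotKeyCheck"))

    // Stand-in for the key/value RDD that the job groups.
    val pairs = sc.textFile("hdfs:///path/to/input")
      .map(line => (line.split('\t')(0), line))

    // Per-key record counts. reduceByKey combines map-side, so this is
    // cheap compared to shuffling full value lists with groupByKey.
    val hottest = pairs
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .top(10)(Ordering.by[(String, Long), Long](_._2))

    // One key holding ~10x the records of the rest would match a
    // single task reading ~10x the data.
    hottest.foreach { case (k, n) => println(s"key $k -> $n records") }

    sc.stop()
  }
}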