So its possible that you have a lot of data in one of the partitions which
is local to that process, maybe you could cache & count the upstream RDD
and see what the input partitions look like? On the otherhand - using
groupByKey is often a bad sign to begin with - can you rewrite your code to
Hi all,
I'm seeing some curious behavior which I have a hard time interpreting. I
have a job which does a "groupByKey" and results in 300 executors. 299 are
run in NODE_LOCAL mode. 1 executor is run in PROCESS_LOCAL mode.
The 1 executor that runs in PROCESS_LOCAL mode gets about 10x as much