So its possible that you have a lot of data in one of the partitions which is local to that process, maybe you could cache & count the upstream RDD and see what the input partitions look like? On the otherhand - using groupByKey is often a bad sign to begin with - can you rewrite your code to avoid this?
On Fri, Jul 15, 2016 at 10:57 AM, Matt K <matvey1...@gmail.com> wrote: > Hi all, > > I'm seeing some curious behavior which I have a hard time interpreting. I > have a job which does a "groupByKey" and results in 300 executors. 299 are > run in NODE_LOCAL mode. 1 executor is run in PROCESS_LOCAL mode. > > The 1 executor that runs in PROCESS_LOCAL mode gets about 10x as much > input as the other executors. It dies with OOM, and the job fails. > > Only working theory I have is that there's a single key which has a ton of > data tied to it. Even so, I can't explain why it's run in PROCESS_LOCAL > mode and not others. > > Anyone has ideas? > > Thanks, > -Matt > -- Cell : 425-233-8271 Twitter: https://twitter.com/holdenkarau