On 28 Dec 2017, at 19:40, Maximiliano Felice <maximilianofel...@gmail.com> 
wrote:
> I experienced a similar issue a few weeks ago. The situation was a result of 
> a mix of speculative execution and OOM issues in the container.

Interesting! However, I don't see any OOM exceptions in the logs. Does that 
rule out your hypothesis?

> We've managed to confirm that when speculative execution is enabled and 
> some YARN containers running speculative tasks die, each death still counts 
> against the task's max-attempts limit. Normally this would not be a problem, 
> but it seems that if all the retries are consumed by a task that has started 
> speculative execution, the application itself doesn't fail; instead the task 
> hangs, waiting to be rescheduled at some point. As no attempts remain, it is 
> never rescheduled and the application never finishes.

Hmm, this sounds like a serious design flaw to me, but I'm sure there are very 
complicated issues involved that go way over my head.
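
One way I could test this hypothesis is to rerun the job with speculation 
switched off and see whether the hang disappears. Below is a rough sketch of 
building the spark-submit arguments in Python. spark.speculation and 
spark.task.maxFailures are standard Spark settings, but the helper name, the 
values and the script name my_job.py are placeholders of mine, not something 
taken from this thread:

```python
def build_submit_args(app, speculation=False, task_max_failures=4):
    """Build a spark-submit argument list (hypothetical helper)."""
    conf = {
        # Turn speculative execution off to see whether the hang goes away.
        "spark.speculation": str(speculation).lower(),
        # Failures of a single task tolerated before Spark gives up on the job.
        "spark.task.maxFailures": str(task_max_failures),
    }
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # placeholder application script
    return args


if __name__ == "__main__":
    print(" ".join(build_submit_args("my_job.py")))
```

If the job completes with speculation off but hangs with it on, that would 
at least be consistent with the explanation above, though it would not prove 
it.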

> 1. Check the number of tasks scheduled. If you see one (or more) tasks 
> missing when you do the final sum, then you might be encountering this issue.
> 2. Check the container logs to see if anything broke. OOM is what failed 
> for me.

I can't find anything in the logs from EMR. Should I expect to find explicit 
OOM exception messages? 

JM


