On 28 Dec 2017, at 19:40, Maximiliano Felice <maximilianofel...@gmail.com> wrote:

> I experienced a similar issue a few weeks ago. The situation was a result of
> a mix of speculative execution and OOM issues in the container.

Interesting! However, I don't have any OOM exception in the logs. Does that rule out your hypothesis?

> We've managed to check that when we have speculative execution enabled and
> some YARN containers which were running speculative tasks died, they did take
> a chance from the max-attempts number. This wouldn't represent any issue in
> normal behavior, but it seems that if all the retries were consumed in a task
> that has started speculative execution, the application itself doesn't fail,
> but it hangs the task expecting to reschedule it sometime. As the attempts
> are zero, it never reschedules it and the application itself fails to finish.

Hmm, this sounds like a huge design fail to me, but I'm sure there are very complicated issues that go way over my head.

> 1. Check the number of tasks scheduled. If you see one (or more) tasks
> missing when you do the final sum, then you might be encountering this issue.
> 2. Check the container logs to see if anything broke. OOM is what failed to
> me.

I can't find anything in the logs from EMR. Should I expect to find explicit OOM exception messages?

JM
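
For the log check in step 2, here is a minimal sketch of a scan for memory-related container deaths. It rests on assumptions that are not from this thread: the marker strings below are ones commonly seen in JVM, Spark-on-YARN and YARN NodeManager output, their exact wording may differ between versions, and `find_memory_kills` is a hypothetical helper name. A container killed by YARN for exceeding its memory limit does not necessarily leave a `java.lang.OutOfMemoryError` in the logs, so the sketch looks for both kinds of message.

```python
import re
import sys

# Assumed markers (not taken from this thread); wording may vary by version.
MEMORY_KILL_PATTERNS = [
    r"java\.lang\.OutOfMemoryError",                          # JVM-level OOM
    r"Container killed by YARN for exceeding memory limits",  # Spark driver side
    r"running beyond (?:physical|virtual) memory limits",     # YARN NodeManager side
]
_COMBINED = re.compile("|".join(MEMORY_KILL_PATTERNS))


def find_memory_kills(log_lines):
    """Return (line_number, line) pairs for lines matching any assumed marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if _COMBINED.search(line)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_logs.py <path-to-aggregated-log-file>
    with open(sys.argv[1], errors="replace") as handle:
        for number, line in find_memory_kills(handle):
            print(f"{number}: {line}")
```

An empty result only means that none of these assumed markers appeared in that file. It does not rule out a memory problem, since the relevant message may sit in a different log (driver, executor or NodeManager).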