[ https://issues.apache.org/jira/browse/SPARK-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-9745:
------------------------------
    Priority: Critical  (was: Major)

> Applications hangs when the last executor fails with dynamic allocation
> -----------------------------------------------------------------------
>
>                 Key: SPARK-9745
>                 URL: https://issues.apache.org/jira/browse/SPARK-9745
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Scheduler, YARN
>    Affects Versions: 1.5.0
>         Environment: YARN + Pyspark + Dynamic Allocation
>            Reporter: Alex Angelini
>            Priority: Critical
>         Attachments: am_hung_job.png, executors_hung_job.png, logs_hung_job.png, tasks_hung_job.png
>
>
> When a job has only a single executor remaining and that executor dies (due
> to something like an OOM), the application fails to notice that there are no
> executors left and it hangs indefinitely.
> This only happens when dynamic allocation is enabled.
> The following images were taken from a hung application with no executors:
> !logs_hung_job.png!
> ^^ *Notice how 1 executor was lost, but the application never requested it to be removed*
> !am_hung_job.png!
> !executors_hung_job.png!
> !tasks_hung_job.png!

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
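The hang described in this report can be illustrated with a toy model. This is a hypothetical sketch, not Spark's ExecutorAllocationManager. Every name in it (ToyAllocationModel, executor_added, executor_died, executors_to_request, is_hung, notify_allocator) is invented for illustration. It also assumes two things the report does not confirm: that the allocator sizes its requests from its own record of executors, and that it requests one executor per pending task. Under those assumptions, a lost executor that is never removed from the allocator's bookkeeping, as the logs above suggest, leaves work queued with nothing to run it and no new executor requested.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAllocationModel:
    """Hypothetical model of a driver tracking executors (not Spark code)."""

    min_executors: int = 0
    known_executors: set = field(default_factory=set)  # what the allocator believes exists
    live_executors: set = field(default_factory=set)   # what actually exists
    pending_tasks: int = 0

    def executor_added(self, executor_id: str) -> None:
        self.known_executors.add(executor_id)
        self.live_executors.add(executor_id)

    def executor_died(self, executor_id: str, notify_allocator: bool) -> None:
        # The executor process is gone whether or not the allocator hears about it.
        self.live_executors.discard(executor_id)
        if notify_allocator:
            self.known_executors.discard(executor_id)

    def executors_to_request(self) -> int:
        # Assumed policy: one executor per pending task (at least min_executors),
        # minus the executors the allocator *believes* it already has.
        wanted = max(self.pending_tasks, self.min_executors)
        return max(wanted - len(self.known_executors), 0)

    def is_hung(self) -> bool:
        # Work is queued, no live executor can run it, and nothing new is requested.
        return (
            self.pending_tasks > 0
            and not self.live_executors
            and self.executors_to_request() == 0
        )


# The last executor dies (e.g. OOM) but its removal is never recorded.
stale = ToyAllocationModel(pending_tasks=1)
stale.executor_added("1")
stale.executor_died("1", notify_allocator=False)
print(stale.is_hung())  # True

# Same failure, but the removal is recorded, so a replacement is requested.
healthy = ToyAllocationModel(pending_tasks=1)
healthy.executor_added("1")
healthy.executor_died("1", notify_allocator=True)
print(healthy.executors_to_request())  # 1
print(healthy.is_hung())  # False
```

In this model the only difference between the two runs is whether the loss reaches the allocator's bookkeeping. That is consistent with the report's observation that the executor was lost but never removed, though the real cause in Spark may differ.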