Closing the loop: I’ve submitted this issue. TD, cc-ing you since it’s Spark Streaming; not sure who oversees the YARN module. https://issues.apache.org/jira/browse/SPARK-10792
-adrian

From: Adrian Tanase
Date: Friday, September 18, 2015 at 6:18 PM
To: "user@spark.apache.org"
Subject: Re: Spark on YARN / aws - executor lost on node restart

Hi guys,

Digging up this question after spending some more time trying to replicate it. It seems to be an issue with the YARN–Spark integration; is there a bug already tracking this?

If I just kill the process on the machine, YARN detects that the container is dead and the Spark framework requests a new container to be deployed. If the machine goes away completely, Spark sees that the executor is lost, but YarnAllocator never tries to request the container again. I’m wondering if there’s an implicit assumption that it will be notified by YARN, which might not happen if the node dies completely?

If there are no ideas on the list, I’ll prepare some logs and follow up with an issue.

Thanks,
-adrian

From: Adrian Tanase
Date: Wednesday, September 16, 2015 at 6:01 PM
To: "user@spark.apache.org"
Subject: Spark on YARN / aws - executor lost on node restart

Hi all,

We’re using Spark Streaming (1.4.0), deployed on AWS through YARN. It’s a stateful app that reads from Kafka (with the new direct API), and we’re checkpointing to HDFS.

During some resilience testing, we restarted one of the machines and brought it back online. During the offline period, the YARN cluster did not have the resources to re-create the missing executor. After we started all the services on the machine, it correctly re-joined the YARN cluster; however, the Spark Streaming app does not seem to notice that the resources are back and has not re-created the missing executor.

The app is running correctly with 6 out of 7 executors, but it’s running under capacity. If we manually kill the driver and re-submit the app to YARN, all the state is correctly recreated from checkpoint and all 7 executors come back online; however, this seems like a brutal workaround.
So, here are some questions:

* Isn’t the driver supposed to auto-heal after a machine is completely lost and then comes back after some time?
* Are there any configuration settings that influence how the Spark driver should poll YARN to check whether resources are available again?
* Is there a tool one can run to “force” the driver to re-create missing workers/executors?

Lastly, another issue was that the driver also crashed and YARN successfully restarted it. I’m not sure yet whether that was due to some retry setting or another exception; I will post the logs after I recreate the problem.

Thanks in advance for any ideas,
-adrian
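On the second question: the YARN-related knobs I’m aware of control failure counting and application restarts rather than re-acquiring executors after a node returns, but they may be relevant to the driver-restart behavior mentioned above. A sketch of passing them at submit time; the values shown are illustrative examples only, not recommendations:

```shell
# Illustrative spark-submit flags (all values are hypothetical examples):
#   spark.yarn.maxAppAttempts        - how many times YARN may restart the application master
#   spark.yarn.max.executor.failures - failure count after which the application is failed
#   spark.task.maxFailures           - per-task retry limit before the job is aborted
spark-submit \
  --master yarn-cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.max.executor.failures=32 \
  --conf spark.task.maxFailures=8 \
  my-streaming-app.jar
```

On the third question, as a possible programmatic workaround: `SparkContext.requestExecutors(n)` (a developer API) asks the cluster manager for additional containers, though I don’t know whether it helps when the allocator has already written off the lost executor.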