Closing the loop, I’ve submitted this issue. TD, cc-ing you since it’s Spark 
Streaming; not sure who oversees the YARN module.
https://issues.apache.org/jira/browse/SPARK-10792

-adrian

From: Adrian Tanase
Date: Friday, September 18, 2015 at 6:18 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Re: Spark on YARN / aws - executor lost on node restart

Hi guys,

Digging up this question after spending some more time trying to replicate it.
It seems to be an issue with the YARN / Spark integration; wondering if there’s 
already a bug tracking this?

If I just kill the process on the machine, YARN detects that the container is 
dead and the Spark framework requests a new container to be deployed.
If the machine goes away completely, Spark sees that the executor is lost but 
YarnAllocator never tries to request the container again. Wondering if there’s 
an implicit assumption that it would be notified by YARN, which might not 
happen if the node dies completely?

If there are no ideas on the list, I’ll prepare some logs and follow up with an 
issue.
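
In the meantime, here’s roughly how one can capture the executor lifecycle 
events from the driver, which should make the logs easier to follow (a minimal 
sketch against the SparkListener API as I understand it in 1.4; 
logExecutorEvents is just an illustrative name):

  import org.apache.spark.SparkContext
  import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

  // Log executor add/remove events so the driver log shows clearly whether a
  // replacement container was ever requested and granted after a node is lost.
  def logExecutorEvents(sc: SparkContext): Unit = {
    sc.addSparkListener(new SparkListener {
      override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
        println(s"Executor added: ${added.executorId} on ${added.executorInfo.executorHost}")
      override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
        println(s"Executor removed: ${removed.executorId}, reason: ${removed.reason}")
    })
  }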

Thanks,
-adrian

From: Adrian Tanase
Date: Wednesday, September 16, 2015 at 6:01 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: Spark on YARN / aws - executor lost on node restart

Hi all,

We’re using Spark Streaming (1.4.0), deployed on AWS through YARN. It’s a 
stateful app that reads from Kafka (with the new direct API) and checkpoints 
to HDFS.
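
For reference, the skeleton of the app looks roughly like this (a minimal 
sketch; the app name, broker list, topic and checkpoint path are placeholders, 
not our actual values):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val checkpointDir = "hdfs:///checkpoints/streaming-app"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("streaming-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Direct (receiver-less) Kafka stream.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // The stateful processing (updateStateByKey etc.) goes here; trivial example:
    stream.map(_._2).count().print()
    ssc
  }

  // Recover from the HDFS checkpoint if present, otherwise build a fresh context.
  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()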

During some resilience testing, we restarted one of the machines and brought it 
back online. During the offline period, the YARN cluster did not have enough 
resources to re-create the missing executor.
After starting all the services on the machine, it correctly rejoined the YARN 
cluster; however, the Spark Streaming app does not seem to notice that the 
resources are back and has not re-created the missing executor.

The app is still running correctly with 6 out of 7 executors, but it’s running 
under capacity.
If we manually kill the driver and re-submit the app to YARN, all the state is 
correctly recreated from the checkpoint and all 7 executors come back online; 
however, this seems like a brutal workaround.

So, here are some questions:

  *   Isn't the driver supposed to auto-heal after a machine is completely lost 
and then comes back after some time?
  *   Are there any configuration settings that influence how the Spark driver 
polls YARN to check whether resources are available again?
  *   Is there a tool one can run to “force” the driver to re-create missing 
workers/executors? (see the sketch right after this list)
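
On the last point, the closest thing I’ve found so far is the developer API 
below; just a sketch, assuming the app keeps a handle on the SparkContext, and 
I’m not sure requestExecutors actually gets honored on YARN without dynamic 
allocation enabled:

  import org.apache.spark.SparkContext

  // Ask YARN for replacement containers when the number of live executors
  // drops below the intended count (both methods are @DeveloperApi in 1.4).
  def topUpExecutors(sc: SparkContext, targetExecutors: Int): Unit = {
    // getExecutorMemoryStatus has one entry per live executor plus the driver.
    val liveExecutors = sc.getExecutorMemoryStatus.size - 1
    val missing = targetExecutors - liveExecutors
    if (missing > 0) {
      val accepted = sc.requestExecutors(missing)
      println(s"Requested $missing additional executor(s), accepted = $accepted")
    }
  }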

Lastly, another issue: the driver also crashed at one point and YARN 
successfully restarted it. I’m not sure yet whether that was due to some retry 
setting or another exception; I’ll post the logs after I recreate the problem.

Thanks in advance for any ideas,
-adrian
