Hi! We are running into an interesting behavior with the Spark driver. We are running Spark under YARN. The Spark driver seems to keep sending work to a dead executor for 3 hours before it recognizes that the executor is gone. The workload seems to have been processed by other executors just fine, and we see no loss in overall throughput. This Jira, https://issues.apache.org/jira/browse/SPARK-10586, seems to describe similar behavior.
The YARN ResourceManager log indicates the following:

2016-05-02 21:36:40,081 INFO util.AbstractLivelinessMonitor (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 Timed out after 600 secs
2016-05-02 21:36:40,082 INFO rmnode.RMNodeImpl (RMNodeImpl.java:transition(746)) - Deactivating Node dn-a01.example.org:45454 as it is now LOST

According to this log message, the executor was not reachable for 10 minutes, but the executor's log shows plenty of RDD processing during that time frame. This seems like a pretty big issue because the orphaned executor seems to cause a memory leak in the driver, and the driver becomes unresponsive due to heavy full GC.

Has anyone else run into a similar situation?

Thanks for any and all feedback / suggestions.

Shankar