Hi!

We are running into an interesting behavior with the Spark driver. We
are running Spark under YARN. The driver keeps sending work to a dead
executor for about 3 hours before it recognizes that the executor is
gone. The workload seems to have been processed by the other executors
just fine and we see no loss in overall throughput. This JIRA -
https://issues.apache.org/jira/browse/SPARK-10586 - seems to describe a
similar behavior.
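
To put a number on that lag, we are thinking of attaching a listener
like the sketch below (the log line is ours, but onExecutorRemoved is
part of the public SparkListener API) so we can timestamp the moment
the driver finally drops the executor:

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved}

// Log the moment the driver actually removes an executor, so it can
// be compared against the YARN RM's "Expired" timestamp.
sc.addSparkListener(new SparkListener {
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit = {
    println(s"Driver removed executor ${removed.executorId} " +
      s"at ${removed.time} ms, reason: ${removed.reason}")
  }
})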

The YARN ResourceManager log shows the following:

2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor
(AbstractLivelinessMonitor.java:run(127)) -
Expired:dn-a01.example.org:45454 Timed out after 600 secs
2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl
(RMNodeImpl.java:transition(746)) - Deactivating Node
dn-a01.example.org:45454 as it is now LOST
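
If we are reading this correctly, the 600 secs matches YARN's
NodeManager liveness expiry, which as far as we can tell is controlled
by this yarn-site.xml property on the ResourceManager (the value below
is just the stock default, shown to make the connection explicit):

<property>
  <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
  <value>600000</value>
</property>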

According to this message the node was unreachable for 10 minutes, but
the executor's log shows plenty of RDD processing during that time
frame. This seems like a pretty big issue, because the orphaned
executor appears to cause a memory leak in the driver, and the driver
eventually becomes unresponsive due to heavy full GC.
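
One mitigation we are considering (a sketch only - the property names
are from the Spark docs, but the values are guesses rather than tested
recommendations, and we still don't understand why the existing timeout
never fired) is to tighten the driver-side timeouts so unreachable
executors are dropped sooner:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ExecutorTimeoutTuning") // hypothetical app name
  // Executors heartbeat to the driver at this interval (default 10s).
  .set("spark.executor.heartbeatInterval", "10s")
  // The driver gives up on an executor after this much network
  // silence (default 120s); lowering it should surface dead
  // executors sooner.
  .set("spark.network.timeout", "60s")
val sc = new SparkContext(conf)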

Has anyone else run into a similar situation?

Thanks for any and all feedback / suggestions.

Shankar
