On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <shankarvenkataraman...@gmail.com> wrote:
> Hi!
>
> We are running into an interesting behavior with the Spark driver. We are
> running Spark under Yarn. The Spark driver seems to be sending work to a
> dead executor for 3 hours before it recognizes it. The workload seems to
> have been processed by other executors just fine, and we see no loss in
> overall throughput. This Jira -
> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate
> similar behavior.
>
> The Yarn resource manager log indicates the following:
>
> 2016-05-02 21:36:40,081 INFO util.AbstractLivelinessMonitor
> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454
> Timed out after 600 secs
> 2016-05-02 21:36:40,082 INFO rmnode.RMNodeImpl
> (RMNodeImpl.java:transition(746)) - Deactivating Node
> dn-a01.example.org:45454 as it is now LOST
>
> According to this log message the executor was not reachable for 10
> minutes, but the executor's log shows plenty of RDD processing during that
> time frame. This seems like a pretty big issue, because the orphaned
> executor appears to cause a memory leak in the driver, and the driver
> becomes unresponsive due to heavy full GC.
>
> Has anyone else run into a similar situation?
>
> Thanks for any and all feedback / suggestions.
>
> Shankar

I am not sure if this is exactly the same issue, but while we were doing
heavy processing of a large history of tweet data via streaming, we ran
into similar issues due to the load on the executors, and we bumped some
configurations to avoid losing some of those executors (they were alive,
but too busy to heartbeat, or something along those lines). Some of these
settings are described at
https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
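For reference, here is a minimal sketch of the kind of settings we mean.
The keys below are standard Spark configuration properties; the values are
illustrative only, and the exact keys and numbers we used are in the
ApplicationContext.scala linked above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative timeout-related settings; tune to your workload.
    val conf = new SparkConf()
      .setAppName("twitter-decahose")
      // How long the driver/network layer waits before giving up on a
      // busy executor (default 120s).
      .set("spark.network.timeout", "800s")
      // How often executors heartbeat to the driver (default 10s); keep
      // this well below spark.network.timeout.
      .set("spark.executor.heartbeatInterval", "30s")

    val sc = new SparkContext(conf)

The general idea is just to give heavily loaded executors more headroom
before the driver declares them lost.

--
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/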