Thanks Luciano. The case we are seeing is different: the YARN resource
manager is shutting down the container in which the executor is running
because it gets no response and deems the container dead. It started
another container, but the driver seems to be oblivious to this for nearly
2 hours. I am wondering whether there is a condition where the driver does
not see the notification from the YARN RM about the executor container
going away. We will try some of the settings you pointed to and see if
that alleviates the issue.
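
For anyone following along, here is a minimal sketch of how timeout and
heartbeat settings of this kind could be passed to spark-submit. It is an
assumption that these are the relevant knobs: spark.network.timeout and
spark.executor.heartbeatInterval are standard Spark properties, but this
thread does not confirm which keys the linked file sets, and the values and
helper names below are placeholders for illustration, not recommendations.

```python
# Illustrative values only; none of these numbers come from the thread.
ASSUMED_SETTINGS = {
    "spark.network.timeout": "600s",
    "spark.executor.heartbeatInterval": "60s",
}


def to_seconds(value: str) -> int:
    """Parse a duration such as '600s' or '10m' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


def spark_submit_conf_args(settings: dict) -> list:
    """Render settings as '--conf key=value' arguments for spark-submit."""
    heartbeat = to_seconds(settings["spark.executor.heartbeatInterval"])
    timeout = to_seconds(settings["spark.network.timeout"])
    # The heartbeat interval has to stay well below the network timeout,
    # otherwise a busy executor can be declared lost between heartbeats.
    if heartbeat >= timeout:
        raise ValueError("heartbeat interval must be shorter than the network timeout")
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(spark_submit_conf_args(ASSUMED_SETTINGS)))
```

The sanity check is there because raising only one of the two values can
leave the heartbeat interval at or above the timeout, which would defeat
the purpose of the change.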

Shankar

On Thu, 19 May 2016 at 16:20 Luciano Resende <luckbr1...@gmail.com> wrote:

>
> On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <
> shankarvenkataraman...@gmail.com> wrote:
>
>> Hi!
>>
>> We are running into an interesting behavior with the Spark driver. We
>> have Spark running under YARN. The Spark driver seems to be sending work
>> to a dead executor for 3 hours before it recognizes that the executor is
>> gone. The workload seems to have been processed by other executors just
>> fine and we see no loss in overall throughput. This Jira -
>> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate a
>> similar behavior.
>>
>> The YARN resource manager log indicates the following:
>>
>> 2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor 
>> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 
>> Timed out after 600 secs
>> 2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl 
>> (RMNodeImpl.java:transition(746)) - Deactivating Node 
>> dn-a01.example.org:45454 as it is now LOST
>>
>> The Executor is not reachable for 10 minutes according to this log
>> message, but the Executor's log shows plenty of RDD processing during
>> that time frame.
>> This seems like a pretty big issue because the orphaned executor seems to
>> cause a memory leak in the Driver, and the Driver becomes unresponsive
>> due to heavy Full GC.
>>
>> Has anyone else run into a similar situation?
>>
>> Thanks for any and all feedback / suggestions.
>>
>> Shankar
>>
>>
> I am not sure if this is exactly the same issue, but while we were doing
> heavy processing of a large history of tweet data via streaming, we were
> having similar issues due to the load on the executors, and we bumped some
> configurations to avoid losing some of these executors (even though they
> were alive, but too busy to heartbeat or something).
>
> Some of these are described at
>
> https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
