Re: Spark driver and yarn behavior

2016-05-20 Thread Steve Loughran

On 20 May 2016, at 00:34, Shankar Venkataraman 
mailto:shankarvenkataraman...@gmail.com>> 
wrote:

Thanks Luciano. The case we are seeing is different - the YARN resource manager
is shutting down the container in which the executor is running, since there
does not seem to be a response and it is deeming it dead. It started another
container, but the driver seems to be oblivious for nearly 2 hours. I am wondering
if there is a condition where the driver is not seeing the notification from
the YARN RM about the executor container going away. We will try some of the
settings you pointed to and see if they alleviate the issue.

Shankar



the YARN RM doesn't (AFAIK) do any liveness checks on executors.

1. The AM regularly heartbeats with the RM.
2. If that stops, the AM is killed, along with all its containers (unless it has
requested container preservation). The AM is then restarted if retries <
yarn.am.retry.count (?).
3. Node Managers, one per server, heartbeat to the RM.
4. If they stop checking in, the RM assumes the node and all of its running
containers are dead, reports the failures to the AM, and leaves it to deal with
them. (Special case: work-preserving NM restart.)
5. If the process running in a container fails, the NM picks it up and relays
that to the AM via the RM.
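
As a rough illustration of the liveness intervals behind points 2-4, here is a
minimal Scala sketch that reads them from the cluster's YARN configuration
(assuming Hadoop's YarnConfiguration API; the 600000 ms fallbacks are the stock
yarn-default.xml values and may differ on your cluster):

import org.apache.hadoop.yarn.conf.YarnConfiguration

// Picks up yarn-site.xml / yarn-default.xml from the classpath.
val yarnConf = new YarnConfiguration()

// How long the RM waits for an AM heartbeat before declaring it dead (point 2).
val amExpiryMs = yarnConf.getLong(YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, 600000L)

// How long the RM waits for a NodeManager heartbeat before marking the node LOST
// and reporting its containers as gone to the AM (points 3 and 4).
val nmExpiryMs = yarnConf.getLong(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, 600000L)

println(s"AM expiry: $amExpiryMs ms, NM expiry: $nmExpiryMs ms")

That NM interval is presumably where the "Timed out after 600 secs" in the RM log
further down comes from.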

some details: http://www.slideshare.net/steve_l/yarn-services


Have a look in the NM logs to see what it thinks is happening, but I think it
may well be some driver/executor communication problem.
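
If it is a driver/executor communication problem, the Spark-side settings worth
checking are the executor heartbeat interval and the network timeout. A minimal
sketch, with purely illustrative values rather than recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // How often each executor heartbeats metrics and liveness to the driver.
  .set("spark.executor.heartbeatInterval", "10s")
  // General network timeout; the driver also falls back to it when deciding
  // that an executor has stopped heartbeating.
  .set("spark.network.timeout", "120s")
  // How many executor failures the YARN backend tolerates before failing the app.
  .set("spark.yarn.max.executor.failures", "16")

Comparing those values against the driver and executor log timestamps should show
whether heartbeats were actually arriving during the gap.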




Re: Spark driver and yarn behavior

2016-05-19 Thread Shankar Venkataraman
Thanks Luciano. The case we are seeing is different - the YARN resource
manager is shutting down the container in which the executor is running,
since there does not seem to be a response and it is deeming it dead. It
started another container, but the driver seems to be oblivious for nearly 2
hours. I am wondering if there is a condition where the driver is not seeing
the notification from the YARN RM about the executor container going away.
We will try some of the settings you pointed to and see if they alleviate the
issue.

Shankar

On Thu, 19 May 2016 at 16:20 Luciano Resende wrote:

>
> On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <
> shankarvenkataraman...@gmail.com> wrote:
>
>> Hi!
>>
>> We are running into an interesting behavior with the Spark driver. We run
>> Spark under YARN. The Spark driver seems to be sending work to a
>> dead executor for 3 hours before it recognizes it. The workload seems to
>> have been processed by other executors just fine and we see no loss in
>> overall throughput. This Jira -
>> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate a
>> similar behavior.
>>
>> The yarn resource manager log indicates the following:
>>
>> 2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor 
>> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 
>> Timed out after 600 secs
>> 2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl 
>> (RMNodeImpl.java:transition(746)) - Deactivating Node 
>> dn-a01.example.org:45454 as it is now LOST
>>
>> The Executor is not reachable for 10 minutes according to this log
>> message but the Executor's log shows plenty of RDD processing during that
>> time frame.
>> This seems like a pretty big issue because the orphan executor seems to
>> cause a memory leak in the Driver and the Driver becomes unresponsive due
>> to heavy Full GC.
>>
>> Has anyone else run into a similar situation?
>>
>> Thanks for any and all feedback / suggestions.
>>
>> Shankar
>>
>>
> I am not sure if this is exactly the same issue, but while we were doing
> heavy processing of a large history of tweet data via streaming, we were
> having similar issues due to the load on the executors, and we bumped some
> configurations to avoid losing some of these executors (they were alive,
> but too busy to heartbeat, or something along those lines).
>
> Some of these are described at
>
> https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark driver and yarn behavior

2016-05-19 Thread Luciano Resende
On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <
shankarvenkataraman...@gmail.com> wrote:

> Hi!
>
> We are running into an interesting behavior with the Spark driver. We run
> Spark under YARN. The Spark driver seems to be sending work to a
> dead executor for 3 hours before it recognizes it. The workload seems to
> have been processed by other executors just fine and we see no loss in
> overall throughput. This Jira -
> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate a
> similar behavior.
>
> The yarn resource manager log indicates the following:
>
> 2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor 
> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 
> Timed out after 600 secs
> 2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl 
> (RMNodeImpl.java:transition(746)) - Deactivating Node 
> dn-a01.example.org:45454 as it is now LOST
>
> The Executor is not reachable for 10 minutes according to this log message
> but the Executor's log shows plenty of RDD processing during that time frame.
> This seems like a pretty big issue because the orphan executor seems to
> cause a memory leak in the Driver and the Driver becomes unresponsive due
> to heavy Full GC.
>
> Has anyone else run into a similar situation?
>
> Thanks for any and all feedback / suggestions.
>
> Shankar
>
>
I am not sure if this is exactly the same issue, but while we were doing
heavy processing of a large history of tweet data via streaming, we were
having similar issues due to the load on the executors, and we bumped some
configurations to avoid losing some of these executors (they were alive,
but too busy to heartbeat, or something along those lines).

Some of these are described at
https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
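
A minimal sketch of the kind of bumps described there; the exact keys and values
live in the ApplicationContext.scala linked above, so treat these as illustrative
assumptions rather than a copy of that file:

import org.apache.spark.SparkConf

// Illustrative values only; see ApplicationContext.scala for the real settings.
val sparkConf = new SparkConf()
  // Give heavily loaded executors longer before the driver treats them as lost.
  .set("spark.network.timeout", "600s")
  // Extra off-heap headroom (MB) so YARN does not kill busy executors for
  // exceeding their container's memory limit.
  .set("spark.yarn.executor.memoryOverhead", "1024")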



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/