Re: why the default value of 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in yarn-default.xml is so high?

Ravi Prakash Thu, 03 Nov 2016 15:22:32 -0700

Hi Tanvir!

Although an application may request that node, a container won't be
scheduled there until the nodemanager sends a heartbeat. If the application
hasn't specified a preference for that node, then whichever node heartbeats
next will be used to launch a container.
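
As a rough illustration of the behaviour described above, here is a toy
Python sketch. It is not the actual ResourceManager code: the class
ToyResourceManager, its methods, and the one-second heartbeat spacing are
all hypothetical, and only the two default values are taken from this
thread. It shows that a silent NM receives no containers even though it has
not yet been declared dead, because containers are only handed out when a
heartbeat arrives.

```python
from dataclasses import dataclass, field

# Defaults quoted in this thread (yarn-default.xml, v2.4.1).
NM_MONITOR_INTERVAL_MS = 1_000    # yarn.resourcemanager.nm.liveness-monitor.interval-ms
NM_EXPIRY_INTERVAL_MS = 600_000   # yarn.nm.liveness-monitor.expiry-interval-ms


@dataclass
class ToyResourceManager:
    """Hypothetical toy model, not the real YARN ResourceManager."""
    last_heartbeat_ms: dict = field(default_factory=dict)
    pending_requests: int = 0
    launched: list = field(default_factory=list)

    def register(self, node, now_ms):
        # Record the node as alive at registration time.
        self.last_heartbeat_ms[node] = now_ms

    def heartbeat(self, node, now_ms):
        # Containers are assigned only in response to a heartbeat.
        self.last_heartbeat_ms[node] = now_ms
        if self.pending_requests > 0:
            self.pending_requests -= 1
            self.launched.append((now_ms, node))

    def expired_nodes(self, now_ms):
        # Meant to be called every NM_MONITOR_INTERVAL_MS; a node counts as
        # dead only once the expiry interval has passed without a heartbeat.
        return [node for node, seen in self.last_heartbeat_ms.items()
                if now_ms - seen > NM_EXPIRY_INTERVAL_MS]


def simulate():
    rm = ToyResourceManager()
    rm.register("nm1", 0)
    rm.register("nm2", 0)
    rm.pending_requests = 2
    # nm1 goes silent after registering; nm2 keeps heartbeating
    # (the 1000 ms spacing is an assumption made for this sketch).
    rm.heartbeat("nm2", 1_000)
    rm.heartbeat("nm2", 2_000)
    return rm
```

Calling simulate() leaves both containers on nm2. expired_nodes(2_000)
returns an empty list, since nm1 is only silent at that point, and nm1 shows
up in expired_nodes() only once more than 600000 ms have passed since its
last heartbeat.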


HTH
Ravi

On Thu, Nov 3, 2016 at 12:12 PM, Tanvir Rahman <tanvir9982...@gmail.com>
wrote:

> Thank you Ravi for your reply.
> I found one parameter,
> 'yarn.resourcemanager.nm.liveness-monitor.interval-ms' (default value =
> 1000 ms), in yarn-default.xml (v2.4.1), which determines how often to check
> that node managers are still alive. So the RM checks the NMs' heartbeats
> every second, but it takes 10 min to decide whether an NM is dead
> (yarn.nm.liveness-monitor.expiry-interval-ms: how long to wait until a node
> manager is considered dead; default value = 600000 ms).
>
> What happens if the RM finds that one NM's heartbeat is missing but 10 min
> have not yet passed (that is, yarn.nm.liveness-monitor.expiry-interval-ms
> has not expired yet)? Will a new application still make container requests
> to that NM via the RM?
>
> Thanks
> Tanvir
>
> On Wed, Nov 2, 2016 at 5:41 PM, Ravi Prakash <ravihad...@gmail.com> wrote:
>
>> Hi Tanvir!
>>
>> It's hard to have a single configuration that works for all cluster
>> scenarios. I suspect that value was chosen to roughly mirror the time it
>> takes HDFS to realize a datanode is dead (which is also 10 mins, from what
>> I remember). The RM also has to reschedule the work when that timeout
>> expires. Also, there may be network glitches which could last that long.
>> Also, the NMs are pretty stable by themselves. Failing NMs have not been
>> too common in my experience.
>>
>> HTH
>> Ravi
>>
>> On Wed, Nov 2, 2016 at 10:44 AM, Tanvir Rahman <tanvir9982...@gmail.com>
>> wrote:
>>
>>> Hello,
>>> Can anyone please tell me why the default value of
>>> 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in
>>> yarn-default.xml
>>> <https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>
>>> is so high? This parameter determines "How often to check that containers
>>> are still alive". The default value is 600000 ms, or 10 minutes. So if a
>>> node manager fails, the resource manager detects the dead containers only
>>> after 10 minutes.
>>>
>>>
>>> I am running a wordcount job on my university cluster. In the middle of
>>> the run, I stopped the node manager on one node (the data node was still
>>> running) and found that the completion time increased by about 10 minutes
>>> because of the node manager failure.
>>>
>>> Thanks in advance
>>> Tanvir
