Hi Tanvir!

Although an application may request that node, a container won't be scheduled there until the node manager sends a heartbeat. If the application hasn't specified a preference for that node, then whichever node heartbeats next will be used to launch a container.
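To make the heartbeat-driven behaviour concrete, here is a minimal toy sketch in Python. It is not YARN's scheduler code: `ContainerRequest`, `ToyScheduler`, and the node names are hypothetical, and a real scheduler has more rules (for example, it may relax a locality preference). It only illustrates that nothing is assigned to a node that is not heartbeating.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContainerRequest:
    # Hypothetical request type for illustration only.
    app: str
    preferred_node: Optional[str] = None  # None means "any node"


class ToyScheduler:
    """Toy model: containers are assigned only when a node heartbeats."""

    def __init__(self):
        self.pending = deque()
        self.assignments = []  # (app, node) pairs

    def submit(self, request):
        self.pending.append(request)

    def on_heartbeat(self, node):
        # Assign the first pending request that accepts this node.
        for request in list(self.pending):
            if request.preferred_node in (None, node):
                self.pending.remove(request)
                self.assignments.append((request.app, node))
                return request
        return None


scheduler = ToyScheduler()
scheduler.submit(ContainerRequest("wordcount", preferred_node="node-a"))
scheduler.submit(ContainerRequest("sort"))

# node-a has stopped heartbeating; only node-b reports in.
scheduler.on_heartbeat("node-b")
scheduler.on_heartbeat("node-b")

print(scheduler.assignments)
print(list(scheduler.pending))
```

In this toy model the request with no preference lands on node-b at its first heartbeat, while the request that insists on node-a stays pending because node-a never reports in.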
HTH
Ravi

On Thu, Nov 3, 2016 at 12:12 PM, Tanvir Rahman <tanvir9982...@gmail.com> wrote:

> Thank you Ravi for your reply.
> I found one parameter, 'yarn.resourcemanager.nm.liveness-monitor.interval-ms'
> (default value = 1000 ms), in yarn-default.xml (v2.4.1), which determines
> how often to check that node managers are still alive. So the RM checks
> the heartbeat of each NM every second, but it takes 10 min to decide
> whether the NM is dead or not (yarn.nm.liveness-monitor.expiry-interval-ms:
> How long to wait until a node manager is considered dead; default
> value = 600000 ms).
>
> What happens if the RM finds that one NM's heartbeat is missing but it has
> not been 10 min yet (the yarn.nm.liveness-monitor.expiry-interval-ms time
> has not expired yet)?
> Will a new application still make a container request to that NM via the RM?
>
> Thanks
> Tanvir
>
> On Wed, Nov 2, 2016 at 5:41 PM, Ravi Prakash <ravihad...@gmail.com> wrote:
>
>> Hi Tanvir!
>>
>> It's hard to have one configuration that works for all cluster scenarios.
>> I suspect that value was chosen as somewhat of a mirror of the time it
>> takes HDFS to realize a datanode is dead (which is also 10 mins from what
>> I remember). The RM also has to reschedule the work when that timeout
>> expires. Also, there may be network glitches which could last that long.
>> Also, the NMs are pretty stable by themselves. Failing NMs have not been
>> too common in my experience.
>>
>> HTH
>> Ravi
>>
>> On Wed, Nov 2, 2016 at 10:44 AM, Tanvir Rahman <tanvir9982...@gmail.com>
>> wrote:
>>
>>> Hello,
>>> Can anyone please tell me why the default value of
>>> 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in
>>> yarn-default.xml
>>> <https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>
>>> is so high? This parameter determines "How often to check that containers
>>> are still alive". The default value is 600000 ms, or 10 minutes. So if a
>>> node manager fails, the resource manager detects the dead container after
>>> 10 minutes.
>>>
>>> I am running a wordcount job on my university cluster. In the middle of
>>> the run, I stopped the node manager on one node (the data node was still
>>> running) and found that the completion time increased by about 10 minutes
>>> because of the node manager failure.
>>>
>>> Thanks in advance
>>> Tanvir
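The timing discussed in the thread can be sketched with a little arithmetic. The snippet below is an illustration, not Hadoop code: `detection_time_ms` is a hypothetical helper, and it assumes that checks run at fixed multiples of the check interval and that a node is declared dead once the time since its last heartbeat is strictly greater than the expiry interval. The two values come from the properties quoted above (`yarn.resourcemanager.nm.liveness-monitor.interval-ms` = 1000 and `yarn.nm.liveness-monitor.expiry-interval-ms` = 600000).

```python
def detection_time_ms(last_heartbeat_ms, check_interval_ms, expiry_ms):
    """Return the time of the first periodic check at which the node
    would be considered dead (assumed rule: elapsed time > expiry)."""
    t = 0
    while True:
        t += check_interval_ms  # next periodic liveness check
        if t - last_heartbeat_ms > expiry_ms:
            return t


CHECK_INTERVAL_MS = 1000   # yarn.resourcemanager.nm.liveness-monitor.interval-ms
EXPIRY_MS = 600000         # yarn.nm.liveness-monitor.expiry-interval-ms

# Node manager stops right after heartbeating at t = 0.
detected_at = detection_time_ms(0, CHECK_INTERVAL_MS, EXPIRY_MS)
print(detected_at / 60000.0, "minutes")
```

Under these assumptions the check interval only controls how promptly the expiry is noticed (within about one second here); the roughly 10 minute delay comes from the expiry interval itself.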