[ https://issues.apache.org/jira/browse/YARN-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225312#comment-14225312 ]

Bruno Alexandre Rosa commented on YARN-2299:
--------------------------------------------

I tried to reproduce the first case on version 2.5.2 and the bug is still present. However, instead of host:port1 showing in Lost Nodes, I got host:port2. In the same fashion, I lost track of host:port1. The sum of Lost Nodes remains inconsistent.

> inconsistency at identifying node
> ---------------------------------
>
>                 Key: YARN-2299
>                 URL: https://issues.apache.org/jira/browse/YARN-2299
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
>            Priority: Critical
>
> If the port of "yarn.nodemanager.address" is not specified at the NM, the NM
> will choose a random port. If the NM dies ungracefully (OOM kill, kill -9, or
> OS restart) and is then restarted within
> "yarn.nm.liveness-monitor.expiry-interval-ms", "host:port1" and "host:port2"
> will both be present in "Active Nodes" on the WebUI for a while, and after
> host:port1 expires, we get host:port1 in "Lost Nodes" and host:port2 in
> "Active Nodes". If the NM dies ungracefully again, we get only host:port1 in
> "Lost Nodes". "host:port2" is neither in "Active Nodes" nor in "Lost Nodes".
> Another case: two NMs are running on the same host (miniYarnCluster or other
> test purposes). If both of them are lost, we get only one "Lost Nodes" entry
> in the WebUI.
> In both cases, the sum of "Active Nodes" and "Lost Nodes" is not the number
> of nodes we expected.
> The root cause is an inconsistency in how we decide that two nodes are
> identical.
> When we manage active nodes (RMContextImpl.nodes), we use NodeId, which
> contains the port. Two nodes with the same host but different ports are
> thought to be different nodes.
> But when we manage inactive nodes (RMContextImpl.inactiveNodes), we use only
> the host. Two nodes with the same host but different ports are thought to be
> identical.
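
The root cause described above can be reproduced with a small self-contained sketch. This is not the actual ResourceManager code: NodeTrackingSketch, NodeKey, register and markLost are hypothetical stand-ins, and the only behavior taken from the report is that the active map is keyed by host plus port while the inactive map is keyed by host alone. The sketch overwrites an existing inactive entry, which matches the host:port2 result seen on 2.5.2; which entry survives in the real code is an assumption here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class NodeTrackingSketch {

    /** Hypothetical stand-in for YARN's NodeId: identity is host plus port. */
    static final class NodeKey {
        final String host;
        final int port;

        NodeKey(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof NodeKey)) {
                return false;
            }
            NodeKey other = (NodeKey) o;
            return port == other.port && host.equals(other.host);
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }

        @Override
        public String toString() {
            return host + ":" + port;
        }
    }

    // Active nodes keyed by host:port, as the report describes for RMContextImpl.nodes.
    final Map<NodeKey, String> activeNodes = new HashMap<>();

    // Inactive nodes keyed by host only, as the report describes for
    // RMContextImpl.inactiveNodes.
    final Map<String, NodeKey> inactiveNodes = new HashMap<>();

    void register(String host, int port) {
        activeNodes.put(new NodeKey(host, port), "RUNNING");
    }

    void markLost(String host, int port) {
        NodeKey key = new NodeKey(host, port);
        if (activeNodes.remove(key) != null) {
            // Assumption: a second lost node on the same host replaces the first entry.
            inactiveNodes.put(host, key);
        }
    }

    public static void main(String[] args) {
        NodeTrackingSketch rm = new NodeTrackingSketch();
        rm.register("host", 1001);
        rm.register("host", 1002);
        rm.markLost("host", 1001);
        rm.markLost("host", 1002);
        // Two nodes were lost, but only one inactive entry remains.
        System.out.println("active=" + rm.activeNodes.size()
            + " lost=" + rm.inactiveNodes.size());
    }
}
```

Running main registers two nodes on the same host and then loses both. The active map ends up empty while the inactive map holds a single entry, so the two counts no longer add up to the number of nodes that existed.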
> To fix the inconsistency, we should differentiate the 2 cases below and be
> consistent for both of them:
>  - intentionally multiple NMs per host
>  - NM instances one after another on the same host
> Two possible solutions:
> 1) Introduce a boolean config like "one-node-per-host" (default "true"),
> and use the host to differentiate nodes on the RM if it's true.
> 2) Make it mandatory to have a valid port in the "yarn.nodemanager.address"
> config. In this situation, NM instances one after another on the same host
> will have the same NodeId, while intentionally multiple NMs per host will
> have different NodeIds.
> Personally I prefer option 1 because it's easier for users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)