Hi Chris,

Thank you very much! Yes, I am also concerned about decommissioning a NodeManager in a ResourceManager High Availability scenario. To decommission a NodeManager, can I add its address to a standby RM's exclude.xml and run "yarn rmadmin -refreshNodes"? Or can I only do that on the active RM? Do the RMs sync the exclude/include XML files?

Thanks.
Have a nice holiday.

On Tue, Dec 27, 2022 at 11:44 AM Chris Nauroth <cnaur...@apache.org> wrote:

> Every NodeManager registers and heartbeats to the active ResourceManager
> instance, which acts as the source of truth for cluster node status. If the
> active ResourceManager terminates, then another becomes active, and every
> NodeManager will start a new connection to register and heartbeat with that
> new active ResourceManager.
>
> As such, a standby ResourceManager cannot satisfy requests for node status
> and instead will redirect to the current active:
>
> curl -i '
> http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
> '
> HTTP/1.1 307 Temporary Redirect
> Date: Tue, 27 Dec 2022 19:28:38 GMT
> Cache-Control: no-cache
> Expires: Tue, 27 Dec 2022 19:28:38 GMT
> Date: Tue, 27 Dec 2022 19:28:38 GMT
> Pragma: no-cache
> Content-Type: text/plain;charset=utf-8
> X-Content-Type-Options: nosniff
> X-XSS-Protection: 1; mode=block
> X-Frame-Options: SAMEORIGIN
> Location:
> http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
> Content-Length: 136
>
> If it looked like you were able to query a standby, then perhaps you were
> using a browser or some other client that automatically follows redirects
> (e.g. curl -L)?
>
> The data really would have come from the active though, so you can trust
> that it's not stale. The only thing you might have to consider is that
> after a failover, it might take a while before every NodeManager registers
> with the new ResourceManager.
>
> Separately, if you're concerned about divergence of node include/exclude
> files, you can configure them to be stored at a shared file system (e.g.
> your preferred cloud object store) to be used by all ResourceManager
> instances.
>
> Chris Nauroth
>
>
> On Sat, Dec 24, 2022 at 6:27 PM Dong Ye <yedong...@gmail.com> wrote:
>
>> Hi, All:
>>
>> I have some questions about the state of the node manager. If I use
>> the rest API
>>
>> - http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>>
>> to get node manager state from a standby RM,
>> 1) is it possible that it could be stale?
>> 2) If it is possible, how long will the node manager state be updated?
>> 3) Is it possible that the NM state returned from standby RM be very
>> different from that returned from the active RM? Say one is returning
>> RUNNING while the other returns DECOMMISSIONED because the local
>> exclude.xml is very different/diverges?
>>
>> Thanks.
>> Have a good holiday.
>>
>
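
For anyone who wants to check the redirect behaviour described in the quoted reply from a script rather than with curl, here is a minimal sketch in Python. It assumes a standby RM always answers the node REST endpoint with a 307 and a Location header, as in the curl output above, and that a 200 means the RM that answered is the active one. The names classify_rm_response and probe_rm are hypothetical helpers for illustration, not part of any Hadoop client library.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following the standby's redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx reply.
        return None


def classify_rm_response(status, location=None):
    """Interpret a reply from /ws/v1/cluster/nodes/{nodeid}.

    Assumption: 200 means this RM is active; 307 plus a Location header
    means it is a standby pointing at the active RM.
    """
    if status == 200:
        return ("active", None)
    if status == 307 and location:
        # Keep only host:port of the RM named in the redirect target.
        return ("standby", urlsplit(location.strip()).netloc)
    return ("unknown", None)


def probe_rm(rm_http_address, node_id, timeout=5.0):
    """Query one RM for a node without following redirects."""
    url = f"http://{rm_http_address}/ws/v1/cluster/nodes/{node_id}"
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return classify_rm_response(resp.status)
    except urllib.error.HTTPError as err:
        return classify_rm_response(err.code, err.headers.get("Location"))
```

Calling probe_rm("rm-http-address:port", nodeid) against each RM in turn should report which one is active. A client that follows redirects, such as a browser or curl -L, would hide that difference, which is why the sketch disables redirect handling.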