You can only run "yarn rmadmin -refreshNodes" against the active ResourceManager instance. In an HA deployment, a standby instance returns a "not active" error if it receives this call, and the client then fails over to the other instance and retries.
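The flow above can be sketched in shell. This is a hedged sketch, not verbatim from the thread: the exclude-file path and hostname are illustrative, and the final command is echoed rather than executed so the sketch runs without a cluster.

```shell
# Hedged sketch: decommissioning one NodeManager.
# On a real cluster, EXCLUDE_FILE must match the file named by
# yarn.resourcemanager.nodes.exclude-path in yarn-site.xml
# (commonly somewhere under /etc/hadoop/conf).
EXCLUDE_FILE="/tmp/yarn.exclude"
NODE="worker-1.example.com"

# 1. Add the node to the exclude file (one hostname per line), skipping duplicates.
touch "$EXCLUDE_FILE"
grep -qx "$NODE" "$EXCLUDE_FILE" || printf '%s\n' "$NODE" >> "$EXCLUDE_FILE"

# 2. Ask the active ResourceManager to re-read the include/exclude files.
#    A standby answers with a "not active" error, and the client then fails
#    over to the active instance and retries.
#    (Echoed here so the sketch runs anywhere; drop the echo for real use.)
echo yarn rmadmin -refreshNodes
```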
The ResourceManagers do not synchronize the state of include/exclude files.

Chris Nauroth

On Wed, Dec 28, 2022 at 11:08 PM Dong Ye <yedong...@gmail.com> wrote:

> Hi, Chris:
>
> Thank you very much! Yes, I am also concerned with the
> decommissioning of a NodeManager in a ResourceManager High Availability
> scenario. In order to decommission a NodeManager:
>
> Can I add the NodeManager address to a standby RM's exclude.xml and run
> "yarn rmadmin -refreshNodes"? Or can I only do that on the active RM? Do
> RMs sync the exclude/include xml file?
>
> Thanks.
> Have a nice holiday.
>
> On Tue, Dec 27, 2022 at 11:44 AM Chris Nauroth <cnaur...@apache.org>
> wrote:
>
>> Every NodeManager registers and heartbeats to the active ResourceManager
>> instance, which acts as the source of truth for cluster node status. If the
>> active ResourceManager terminates, then another becomes active, and every
>> NodeManager will start a new connection to register and heartbeat with that
>> new active ResourceManager.
>>
>> As such, a standby ResourceManager cannot satisfy requests for node
>> status and instead will redirect to the current active:
>>
>> curl -i 'http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026'
>> HTTP/1.1 307 Temporary Redirect
>> Date: Tue, 27 Dec 2022 19:28:38 GMT
>> Cache-Control: no-cache
>> Expires: Tue, 27 Dec 2022 19:28:38 GMT
>> Pragma: no-cache
>> Content-Type: text/plain;charset=utf-8
>> X-Content-Type-Options: nosniff
>> X-XSS-Protection: 1; mode=block
>> X-Frame-Options: SAMEORIGIN
>> Location: http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
>> Content-Length: 136
>>
>> If it looked like you were able to query a standby, then perhaps you were
>> using a browser or some other client that automatically follows redirects
>> (e.g. curl -L)?
>>
>> The data really would have come from the active though, so you can trust
>> that it's not stale. The only thing you might have to consider is that
>> after a failover, it might take a while before every NodeManager registers
>> with the new ResourceManager.
>>
>> Separately, if you're concerned about divergence of node include/exclude
>> files, you can configure them to be stored on a shared file system (e.g.
>> your preferred cloud object store) to be used by all ResourceManager
>> instances.
>>
>> Chris Nauroth
>>
>> On Sat, Dec 24, 2022 at 6:27 PM Dong Ye <yedong...@gmail.com> wrote:
>>
>>> Hi, All:
>>>
>>> I have some questions about the state of the NodeManager. If I use
>>> the REST API
>>>
>>> - http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>>>
>>> to get the NodeManager state from a standby RM:
>>> 1) Is it possible that it could be stale?
>>> 2) If so, how long until the NodeManager state is updated?
>>> 3) Is it possible that the NM state returned from a standby RM is very
>>> different from that returned from the active RM? Say one returns
>>> RUNNING while the other returns DECOMMISSIONED because the local
>>> exclude.xml files diverge?
>>>
>>> Thanks.
>>> Have a good holiday.
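Chris's suggestion above of keeping the include/exclude files on shared storage corresponds to the yarn-site.xml properties below. This is a hedged sketch: the paths are illustrative, and whether non-local URIs are accepted depends on your Hadoop version and deployment.

```xml
<!-- Illustrative values, not from the thread. Point every ResourceManager
     at the same shared location so the node lists cannot diverge. -->
<property>
  <name>yarn.resourcemanager.nodes.include-path</name>
  <value>/shared/hadoop/conf/yarn.include</value>
</property>
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/shared/hadoop/conf/yarn.exclude</value>
</property>
```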