Re: stale_status_of_NM_from_standby_RM

Dong Ye Wed, 28 Dec 2022 23:08:37 -0800

Hi, Chris:

        Thank you very much! Yes, I am also concerned about
decommissioning a NodeManager in a ResourceManager High Availability
scenario. To decommission a NodeManager, can I add its address to the
exclude.xml of a standby RM and run "yarn rmadmin -refreshNodes"? Or can
I only do that on the active RM? Do the RMs sync the exclude/include XML
files with each other?

Thanks.
Have a nice holiday.


On Tue, Dec 27, 2022 at 11:44 AM Chris Nauroth <cnaur...@apache.org> wrote:

> Every NodeManager registers and heartbeats to the active ResourceManager
> instance, which acts as the source of truth for cluster node status. If the
> active ResourceManager terminates, then another becomes active, and every
> NodeManager will start a new connection to register and heartbeat with that
> new active ResourceManager.
>
> As such, a standby ResourceManager cannot satisfy requests for node status
> and instead will redirect to the current active:
>
> curl -i 'http://cnauroth-ha-m-2:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026'
> HTTP/1.1 307 Temporary Redirect
> Date: Tue, 27 Dec 2022 19:28:38 GMT
> Cache-Control: no-cache
> Expires: Tue, 27 Dec 2022 19:28:38 GMT
> Date: Tue, 27 Dec 2022 19:28:38 GMT
> Pragma: no-cache
> Content-Type: text/plain;charset=utf-8
> X-Content-Type-Options: nosniff
> X-XSS-Protection: 1; mode=block
> X-Frame-Options: SAMEORIGIN
> Location: http://cnauroth-ha-m-1.us-central1-c.c.hadoop-cloud-dev.google.com.internal.:8088/ws/v1/cluster/nodes/cnauroth-ha-w-0.us-central1-c.c.hadoop-cloud-dev.google.com.internal:8026
> Content-Length: 136
>
> If it looked like you were able to query a standby, then perhaps you were
> using a browser or some other client that automatically follows redirects
> (e.g. curl -L)?
>
> The data really would have come from the active though, so you can trust
> that it's not stale. The only thing you might have to consider is that
> after a failover, it might take a while before every NodeManager registers
> with the new ResourceManager.
>
> Separately, if you're concerned about divergence of node include/exclude
> files, you can configure them to be stored at a shared file system (e.g.
> your preferred cloud object store) to be used by all ResourceManager
> instances.
>
> Chris Nauroth
>
>
> On Sat, Dec 24, 2022 at 6:27 PM Dong Ye <yedong...@gmail.com> wrote:
>
>> Hi, All:
>>
>>     I have some questions about the state of the node manager. If I use
>> the REST API
>>
>>    - http://rm-http-address:port/ws/v1/cluster/nodes/{nodeid}
>>
>> to get node manager state from a standby RM,
>> 1) Is it possible that it could be stale?
>> 2) If so, how long does it take for the node manager state to be updated?
>> 3) Is it possible that the NM state returned from the standby RM is very
>> different from the one returned from the active RM? Say one returns
>> RUNNING while the other returns DECOMMISSIONED because their local
>> exclude.xml files have diverged?
>>
>> Thanks.
>> Have a good holiday.
>>
>
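
The redirect behavior Chris describes above can be checked from a script by sending a single request and not following redirects. The following is only a minimal sketch, not an official client: the function names `classify_rm_response` and `probe_rm` are made up for illustration, and treating any 200 response as "answered by the active RM" is an assumption based on the curl output quoted in this thread.

```python
import http.client
from urllib.parse import urlsplit


def classify_rm_response(status, location):
    """Interpret an RM REST response that was NOT followed through redirects.

    Returns ("active", None) for a 200 (assumed to come from the active RM),
    ("standby", base_url_of_active) for a 307 with a Location header,
    and ("unknown", None) for anything else.
    """
    if status == 200:
        return ("active", None)
    if status == 307 and location:
        parts = urlsplit(location.strip())
        # Keep only scheme://host:port of the RM we were redirected to.
        return ("standby", f"{parts.scheme}://{parts.netloc}")
    return ("unknown", None)


def probe_rm(host, port, node_id, timeout=10):
    """Send one GET to an RM. http.client does not follow redirects itself."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", f"/ws/v1/cluster/nodes/{node_id}")
        resp = conn.getresponse()
        resp.read()  # drain the body before closing the connection
        return classify_rm_response(resp.status, resp.getheader("Location"))
    finally:
        conn.close()
```

For example, `probe_rm("cnauroth-ha-m-2", 8088, node_id)` against the standby in the output above would be expected to report `"standby"` together with the base URL of the active RM, whereas a client that follows redirects automatically would hide that distinction.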
