Vinayakumar B created YARN-11839:
------------------------------------
Summary: [RM HA] - In corner case, RM stay in ACTIVE with
RMStateStore in FENCED state
Key: YARN-11839
URL: https://issues.apache.org/jira/browse/YARN-11839
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Vinayakumar B
In a corner case involving the following sequence of events, the RM stays in
ACTIVE while the RMStateStore is in FENCED state.
# Initially, the RM is in ACTIVE state.
# An event triggers `transitionToStandby()` on the RM.
# During *reinitialize(true)* in the RM, the CapacityScheduler is created but
not yet initialized.
# Another `transitionToActive()` command is triggered from the admin CLI,
which calls `reinitialize()` on the CapacityScheduler, resulting in a
`NullPointerException` and in turn generating
`RMFatalEventType.TRANSITION_TO_ACTIVE_FAILED`.
# This triggers the `StandByTransitionRunnable` and sets the flag
`hasAlreadyRun=true`, even though the RM is already in STANDBY at this stage.
# This state persists for some time.
# Later, the RM becomes active again after re-election, but
`StandByTransitionRunnable#hasAlreadyRun` is still true.
# Now, because ZK is unstable, the RMStateStore hits a ZK error and moves to
the *FENCED* state.
# This again triggers the `StandByTransitionRunnable`.
# Because the flag is already set, `StandByTransitionRunnable` exits silently.
# The RM continues to stay in *ACTIVE* with the RMStateStore in *FENCED* state.
# All new applications remain in *NEW_SAVING* state, and no application makes
any further state changes.
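
The sequence above can be replayed with a small, self-contained Java model. This is only a sketch of the reported behaviour, not the actual ResourceManager code: the class `FencedWhileActiveSketch`, the `SimulatedRm` type, and its `on...` methods are hypothetical names introduced for illustration. The only detail taken from the report is a one-shot `hasAlreadyRun` flag that is consumed once and is still set when the runnable is triggered again.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified model of the reported sequence. Not Hadoop code.
class FencedWhileActiveSketch {

    enum HaState { ACTIVE, STANDBY }

    /** Stand-in for a one-shot runnable guarded by a hasAlreadyRun flag. */
    static final class OneShotStandbyTransition implements Runnable {
        private final AtomicBoolean hasAlreadyRun = new AtomicBoolean(false);
        private final SimulatedRm rm;

        OneShotStandbyTransition(SimulatedRm rm) {
            this.rm = rm;
        }

        @Override
        public void run() {
            // The second and later invocations return without doing anything.
            if (hasAlreadyRun.getAndSet(true)) {
                return;
            }
            rm.haState = HaState.STANDBY;
        }
    }

    /** Assumed minimal RM model: an HA state, a store flag, one runnable. */
    static final class SimulatedRm {
        HaState haState = HaState.ACTIVE;
        boolean storeFenced = false;
        final Runnable standbyTransition = new OneShotStandbyTransition(this);

        // Models the handling of TRANSITION_TO_ACTIVE_FAILED.
        void onTransitionToActiveFailed() {
            standbyTransition.run();
        }

        // Models the RM becoming active again after re-election.
        void onReElectedActive() {
            haState = HaState.ACTIVE;
        }

        // Models the state store hitting a ZK error and becoming fenced.
        void onStateStoreFenced() {
            storeFenced = true;
            standbyTransition.run();
        }
    }

    /** Replays the steps from the report against the simplified model. */
    static SimulatedRm replay() {
        SimulatedRm rm = new SimulatedRm();
        rm.haState = HaState.STANDBY;    // RM has moved from ACTIVE to STANDBY
        rm.onTransitionToActiveFailed(); // flag consumed while already STANDBY
        rm.onReElectedActive();          // RM active again, flag still true
        rm.onStateStoreFenced();         // runnable exits silently
        return rm;
    }

    public static void main(String[] args) {
        SimulatedRm rm = replay();
        System.out.println("haState=" + rm.haState + ", storeFenced=" + rm.storeFenced);
    }
}
```

In this model the final state is `haState=ACTIVE` with `storeFenced=true`, which matches the stuck condition described in the last steps of the list.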
--
This message was sent by Atlassian Jira
(v8.20.10#820010)