Krishan Goyal created YARN-11473:
------------------------------------
Summary: Create a safe mode RM service to enable DB access
Key: YARN-11473
URL: https://issues.apache.org/jira/browse/YARN-11473
Project: Hadoop YARN
Issue Type: Task
Components: resourcemanager
Reporter: Krishan Goyal
Assignee: Krishan Goyal
We have seen several issues where the RM fails to start because bad state
causes exceptions on startup.
E.g. https://issues.apache.org/jira/browse/YARN-2340
Another issue we have seen internally involves an invalid capacity scheduler
config:
{noformat}
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
{noformat}
In such cases, we can't recover until a bug fix is deployed that lets the RM
start so that the data can be corrected. And while the RM is forcefully
brought up in those cases, it can still serve client / AM requests and further
complicate things.
Ideally we should be able to fix the database even when the RM is unable to
start. But with LevelDB, which is an embedded database, this isn't possible
without the RM being up. Separate tools like
[leveldb-cli|https://github.com/liderman/leveldb-cli] aren't always useful
because they need additional code to handle specific comparators etc. and have
to be deployed together with the RM binaries.
A patch to delete applications from the state store was implemented in
https://issues.apache.org/jira/browse/YARN-3410, but that won't work for other
bad entries in the state store, such as DTs / master keys / app attempts / CS
conf, from which we can't recover.
Generic DB access would be helpful to delete / update invalid keys.
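As a rough illustration of what generic DB access could look like, here is a
minimal sketch of key-level list / get / put / delete operations. Everything in
it is an assumption made for illustration: the class name, the method names and
the key layout are hypothetical, and an in-memory sorted map stands in for the
embedded database. It is not the RM state store API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch of generic, key-level access to a state store. */
public class StateStoreAdminSketch {
    // In-memory stand-in for the embedded database (assumption).
    // A TreeMap keeps keys sorted, so a prefix scan is a contiguous range.
    private final TreeMap<String, byte[]> db = new TreeMap<>();

    /** Returns all keys that start with the given prefix, in sorted order. */
    public List<String> listKeys(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : db.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break; // left the prefix range
            }
            keys.add(key);
        }
        return keys;
    }

    /** Returns the stored value, or null if the key is absent. */
    public byte[] get(String key) {
        return db.get(key);
    }

    /** Inserts or overwrites a value, e.g. to correct an invalid entry. */
    public void put(String key, byte[] value) {
        db.put(key, value.clone());
    }

    /** Deletes a key; returns true if it existed. */
    public boolean delete(String key) {
        return db.remove(key) != null;
    }

    public static void main(String[] args) {
        StateStoreAdminSketch store = new StateStoreAdminSketch();
        // Key names below are made up and do not reflect the real layout.
        store.put("apps/app_1", new byte[] {1});
        store.put("apps/app_1/attempt_1", new byte[] {2});
        store.put("tokens/token_7", new byte[] {3});

        System.out.println("Keys under apps/: " + store.listKeys("apps/"));
        System.out.println("Deleted bad attempt: "
            + store.delete("apps/app_1/attempt_1"));
        System.out.println("Keys under apps/: " + store.listKeys("apps/"));
    }
}
```

Working on raw keys and values rather than on typed records is what would let
the same operations cover any kind of bad entry, not just applications.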
A better solution is to create a safe mode feature in the RM that starts it
with only basic functionality so that the bad state can be fixed. The RM will
not serve client / AM / NM requests in this mode; it will enable only selective
admin functionality (read / write access to the state store).
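The request gating described above could be sketched roughly as follows. This
is only an illustration under stated assumptions: the class, the enum and its
values are hypothetical names, not existing RM code, and how the mode would
actually be enabled is left open.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of request gating while the RM is in safe mode. */
public class SafeModeSketch {
    // Request categories named in the proposal; the enum itself is an assumption.
    enum RequestKind { CLIENT, APPLICATION_MASTER, NODE_MANAGER, STATE_STORE_ADMIN }

    // Only state store admin access is served in safe mode.
    private static final Set<RequestKind> ALLOWED_IN_SAFE_MODE =
        EnumSet.of(RequestKind.STATE_STORE_ADMIN);

    private final boolean safeMode;

    SafeModeSketch(boolean safeMode) {
        this.safeMode = safeMode;
    }

    /** In normal mode everything is allowed (assumption); in safe mode only admin access. */
    boolean isAllowed(RequestKind kind) {
        return !safeMode || ALLOWED_IN_SAFE_MODE.contains(kind);
    }

    /** Rejects a request that must not be served in the current mode. */
    void check(RequestKind kind) {
        if (!isAllowed(kind)) {
            throw new IllegalStateException(
                "RM is in safe mode; rejecting " + kind + " request");
        }
    }

    public static void main(String[] args) {
        SafeModeSketch rm = new SafeModeSketch(true);
        for (RequestKind kind : RequestKind.values()) {
            System.out.println(kind + " allowed in safe mode: " + rm.isAllowed(kind));
        }
    }
}
```

An explicit allow-list means any request type added later is rejected in safe
mode by default, which seems the safer choice while the state is being
repaired.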