[
https://issues.apache.org/jira/browse/YARN-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18032305#comment-18032305
]
ASF GitHub Bot commented on YARN-11473:
---------------------------------------
github-actions[bot] closed pull request #5576: YARN-11473 Safe Mode RM with
LevelDB access
URL: https://github.com/apache/hadoop/pull/5576
> Create a safe mode RM service to enable DB access
> -------------------------------------------------
>
> Key: YARN-11473
> URL: https://issues.apache.org/jira/browse/YARN-11473
> Project: Hadoop YARN
> Issue Type: Task
> Components: resourcemanager
> Reporter: Krishan Goyal
> Assignee: Krishan Goyal
> Priority: Major
> Labels: pull-request-available
>
> We have seen various issues where the RM fails to start because bad state
> causes exceptions on startup.
> E.g.: https://issues.apache.org/jira/browse/YARN-2340
> Another issue we have seen internally involves an invalid capacity scheduler
> configuration:
> {noformat}
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
> java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
> {noformat}
> In such cases, we can't recover until a bug fix is deployed that lets the RM
> start so that the data can be corrected. And while the RM is forcibly brought
> up in those cases, it can still serve client / AM requests, which further
> complicates things.
> Ideally we should be able to fix the database even when the RM is unable to
> start. But LevelDB is an embedded database, so this isn't possible without
> the RM being up. Separate tools like
> [leveldb-cli|https://github.com/liderman/leveldb-cli] aren't always useful
> because they require additional code to handle specific comparators etc. and
> need to be deployed together with the RM binaries.
> A patch to delete applications from the state store was implemented in
> https://issues.apache.org/jira/browse/YARN-3410, but that won't work for other
> bad entries in the state store, like DTs / master keys / app attempts / CS
> conf, from which we can't recover.
> Generic DB access would be helpful for deleting / updating invalid keys.
> A better solution is to create a safe mode feature in the RM that starts the
> RM with only the basic functionality needed to fix the state store. The RM
> will not serve client / AM / NM requests in this mode. This mode will enable
> selective admin functionality only (read / write access to the state store).
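The safe mode behaviour described in the issue could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from pull request #5576: the class `SafeModeSketch`, the `Caller` enum, and all method names are hypothetical, and an in-memory map stands in for the embedded LevelDB state store.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical sketch of a safe-mode gate. None of these names come from
 * the YARN code base; they exist only to illustrate the idea.
 */
final class SafeModeSketch {
    /** The request sources named in the issue, plus an admin caller. */
    enum Caller { CLIENT, APPLICATION_MASTER, NODE_MANAGER, ADMIN }

    private final boolean safeMode;
    // Assumption: an in-memory map stands in for the LevelDB state store.
    private final Map<String, String> stateStore = new TreeMap<>();

    SafeModeSketch(boolean safeMode) {
        this.safeMode = safeMode;
    }

    /** In safe mode only admin requests are served; otherwise all are. */
    boolean accepts(Caller caller) {
        return !safeMode || caller == Caller.ADMIN;
    }

    /** Admin-only read of a raw state store entry (null if absent). */
    String read(Caller caller, String key) {
        requireAdmin(caller);
        return stateStore.get(key);
    }

    /** Admin-only create or update of a raw state store entry. */
    void write(Caller caller, String key, String value) {
        requireAdmin(caller);
        stateStore.put(key, value);
    }

    /** Admin-only delete; returns true if the key existed. */
    boolean delete(Caller caller, String key) {
        requireAdmin(caller);
        return stateStore.remove(key) != null;
    }

    private static void requireAdmin(Caller caller) {
        if (caller != Caller.ADMIN) {
            throw new SecurityException("state store access is admin-only: " + caller);
        }
    }
}
```

In this sketch an operator would start the process with `safeMode` set, remove or rewrite the offending key through the admin calls, and then restart normally. How the real patch exposes this access is not stated in the issue.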