[ 
https://issues.apache.org/jira/browse/YARN-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18031943#comment-18031943
 ] 

ASF GitHub Bot commented on YARN-11473:
---------------------------------------

github-actions[bot] commented on PR #5576:
URL: https://github.com/apache/hadoop/pull/5576#issuecomment-3430009509

   We're closing this stale PR because it has been open for 100 days with no 
activity. This isn't a judgement on the merit of the PR in any way. It's just a 
way of keeping the PR queue manageable.
   If you feel like this was a mistake, or you would like to continue working 
on it, please feel free to re-open it and ask for a committer to remove the 
stale tag and review again.
   Thanks all for your contribution.




> Create a safe mode RM service to enable DB access
> -------------------------------------------------
>
>                 Key: YARN-11473
>                 URL: https://issues.apache.org/jira/browse/YARN-11473
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: resourcemanager
>            Reporter: Krishan Goyal
>            Assignee: Krishan Goyal
>            Priority: Major
>              Labels: pull-request-available
>
> We have seen several cases where the RM fails to start because bad state 
> causes exceptions on startup.
> E.g. https://issues.apache.org/jira/browse/YARN-2340
> Another failure we have seen internally comes from an invalid capacity 
> scheduler config:
> {noformat}
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
> java.lang.IllegalArgumentException: Illegal queue capacity setting, (abs-capacity=0.009548) > (abs-maximum-capacity=0.0095). When label=[]
> {noformat}
> In such cases, we can't recover until a bug fix is deployed that lets the RM 
> start so that the data can be corrected. And while the RM is forcibly 
> brought up in those cases, it can still serve client / AM requests and 
> further complicate things.
> Ideally we should be able to fix the database even when the RM is unable to 
> start up. But with LevelDB, which is an embedded database, this isn't 
> possible without the RM being up. Separate tools like 
> [leveldb-cli|https://github.com/liderman/leveldb-cli] aren't always useful 
> because they require additional code to handle specific comparators etc. and 
> need to be deployed together with the RM binaries.
> A patch to delete applications from the state store was implemented in 
> https://issues.apache.org/jira/browse/YARN-3410 but that won't work for other 
> bad entries in the state store, like DTs / Master keys / App attempts / CS 
> Conf, from which we can't recover.
> Generic DB access would be helpful to delete / update invalid keys.
> A better solution is to create a safe mode feature in the RM that starts it 
> with only basic functionality so that the bad state can be fixed. The RM will 
> not serve client / AM / NM requests in this mode. This mode will enable 
> selective admin functionality only (read / write access to the state store).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

