[ https://issues.apache.org/jira/browse/YARN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Devaraj K resolved YARN-2832.
-----------------------------
    Resolution: Duplicate

It is fixed as part of YARN-3375, closing as duplicate.

> Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors
> ------------------------------------------------------------------
>
>                 Key: YARN-2832
>                 URL: https://issues.apache.org/jira/browse/YARN-2832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.1, 2.5.1
>         Environment: Any environment
>            Reporter: Tianyin Xu
>         Attachments: health.check.service.1.patch
>
>
> NodeManager allows users to specify the health checker script that will be
> invoked by the health-checker service via the configuration parameter
> "_yarn.nodemanager.health-checker.script.path_".
> During the _serviceInit()_ of the health-check service, NM checks whether the
> parameter is set correctly using _shouldRun()_, as follows:
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
>   protected void serviceInit(Configuration conf) throws Exception {
>     if (NodeHealthScriptRunner.shouldRun(conf)) {
>       nodeHealthScriptRunner = new NodeHealthScriptRunner();
>       addService(nodeHealthScriptRunner);
>     }
>     addService(dirsHandler);
>     super.serviceInit(conf);
>   }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission
> problem, wrong path), NM logs no message to inform users, which can cause
> latent errors or mysterious problems (e.g., "why does my script not
> work?").
> I see that the checking and printing logic is in the _serviceStart()_ function
> of _NodeHealthScriptRunner.java_ (see the following code snippet). However,
> this logic can never take effect. For an incorrect parameter that does not
> pass the "shouldRun" check, _serviceStart()_ is never called, because the
> _NodeHealthScriptRunner_ instance is never created (see the code snippet
> above).
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
>   protected void serviceStart() throws Exception {
>     // if health script path is not configured don't start the thread.
>     if (!shouldRun(conf)) {
>       LOG.info("Not starting node health monitor");
>       return;
>     }
>     ...
>   }
> {code}
> Basically, I think the checking and printing logic should be put in the
> serviceInit() in NodeHealthCheckerService instead of serviceStart() in
> NodeHealthScriptRunner.
> See the attachment for the simple patch.
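
The idea of moving the check and the log message to init time could be sketched roughly as follows. This is a minimal, Hadoop-free illustration, not the attached patch: HealthScriptCheck, findProblem, and shouldRunAndLog are hypothetical names, and the three conditions checked (unset path, missing file, non-executable file) are assumptions based on the misconfigurations the report mentions.

```java
import java.io.File;
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical stand-in for an init-time check; not code from Hadoop or the patch.
class HealthScriptCheck {
    private static final Logger LOG =
        Logger.getLogger(HealthScriptCheck.class.getName());

    /** Returns the reason the script cannot run, or empty if it looks usable. */
    static Optional<String> findProblem(String scriptPath) {
        if (scriptPath == null || scriptPath.trim().isEmpty()) {
            return Optional.of("no health-checker script path is configured");
        }
        File script = new File(scriptPath);
        if (!script.exists()) {
            return Optional.of("script does not exist: " + scriptPath);
        }
        if (!script.canExecute()) {
            return Optional.of("script is not executable: " + scriptPath);
        }
        return Optional.empty();
    }

    /** Decides at init time and logs the reason instead of skipping silently. */
    static boolean shouldRunAndLog(String scriptPath) {
        Optional<String> problem = findProblem(scriptPath);
        problem.ifPresent(p -> LOG.warning("Not starting node health monitor: " + p));
        return !problem.isPresent();
    }
}
```

In this sketch the caller would use shouldRunAndLog(...) where serviceInit() currently calls shouldRun(conf), so a misconfigured path produces a log line even though the runner is never created.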