[ 
https://issues.apache.org/jira/browse/YARN-6715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-6715:
---------------------------------
    Description: 
NodeHealthScriptRunner does *not* report bad health if the script exits with 
a non-zero exit code. Look at the {{FAILED_WITH_EXIT_CODE}} case:

{noformat}
    void reportHealthStatus(HealthCheckerExitStatus status) {
      long now = System.currentTimeMillis();
      switch (status) {
      case SUCCESS:
        setHealthStatus(true, "", now);
        break;
      case TIMED_OUT:
        setHealthStatus(false, NODE_HEALTH_SCRIPT_TIMED_OUT_MSG);
        break;
      case FAILED_WITH_EXCEPTION:
        setHealthStatus(false, exceptionStackTrace);
        break;
      case FAILED_WITH_EXIT_CODE:
        setHealthStatus(true, "", now);
        break;
      case FAILED:
        setHealthStatus(false, shexec.getOutput());
        break;
      }
    }
{noformat}

Based on the discussion in YARN-5567, this is intentional, but it conflicts 
with the upstream documentation, which says: 
"If the script *exits with a non-zero exit code*, times out or results in an 
exception being thrown, the node is marked as unhealthy"

This statement is misleading and must be corrected. We might also add a 
comment to {{reportHealthStatus()}} explaining that the 
{{FAILED_WITH_EXIT_CODE}} behavior is intentional and not a bug.

This case also lacks unit test coverage.
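
A test for this case could pin down the mapping from exit status to the reported health flag. The sketch below is only an illustration under stated assumptions: {{HealthStatusSketch}} and {{isHealthy()}} are hypothetical stand-ins that mirror the boolean passed to {{setHealthStatus()}} in the snippet above, not the real NodeHealthScriptRunner API, and a real test would exercise the actual class instead.

```java
// Hypothetical, self-contained model of the switch in reportHealthStatus().
// It is NOT the Hadoop class; it only mirrors the boolean health flag that
// the quoted snippet passes to setHealthStatus() for each exit status.
class HealthStatusSketch {

  // Same constant names as in the quoted switch statement.
  enum HealthCheckerExitStatus {
    SUCCESS, TIMED_OUT, FAILED_WITH_EXIT_CODE, FAILED_WITH_EXCEPTION, FAILED
  }

  // Returns the health flag the quoted code would report for a status.
  static boolean isHealthy(HealthCheckerExitStatus status) {
    switch (status) {
    case SUCCESS:
      return true;
    case FAILED_WITH_EXIT_CODE:
      // Intentional per YARN-5567: a non-zero exit code alone does not
      // mark the node unhealthy.
      return true;
    case TIMED_OUT:
    case FAILED_WITH_EXCEPTION:
    case FAILED:
      return false;
    default:
      throw new IllegalArgumentException("Unknown status: " + status);
    }
  }

  public static void main(String[] args) {
    // Print the full mapping so the surprising case is easy to spot.
    for (HealthCheckerExitStatus s : HealthCheckerExitStatus.values()) {
      System.out.println(s + " -> healthy=" + isHealthy(s));
    }
    // The case this issue is about: the node must stay healthy.
    if (!isHealthy(HealthCheckerExitStatus.FAILED_WITH_EXIT_CODE)) {
      throw new AssertionError("FAILED_WITH_EXIT_CODE should report healthy");
    }
  }
}
```

An assertion of this kind would also document the intended behavior for anyone who later reads the switch statement and suspects a bug.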

  was:
NodeHealthScriptRunner does *not* report a bad health if the script exits with 
an exit code other than 0. Look at the {{FAILED_WITH_EXIT_CODE}} case:

{noformat}
    void reportHealthStatus(HealthCheckerExitStatus status) {
      long now = System.currentTimeMillis();
      switch (status) {
      case SUCCESS:
        setHealthStatus(true, "", now);
        break;
      case TIMED_OUT:
        setHealthStatus(false, NODE_HEALTH_SCRIPT_TIMED_OUT_MSG);
        break;
      case FAILED_WITH_EXCEPTION:
        setHealthStatus(false, exceptionStackTrace);
        break;
      case FAILED_WITH_EXIT_CODE:
        setHealthStatus(true, "", now);
        break;
      case FAILED:
        setHealthStatus(false, shexec.getOutput());
        break;
      }
    }
{noformat}

Based on the discussion in YARN-5567, this is intional, but conflicts with the 
upstream document, which says: 
"If the script *exits with a non-zero exit code*, times out or results in an 
exception being thrown, the node is marked as unhealthy"

This statement can be extremely misleading and must be corrected. We might also 
add an extra comment to {{reportHealthStatus()}} which explains that 
{{FAILED_WITH_EXIT_CODE}} is not buggy.

This case also lacks unit test coverage.


> Fix documentation about NodeHealthScriptRunner 
> -----------------------------------------------
>
>                 Key: YARN-6715
>                 URL: https://issues.apache.org/jira/browse/YARN-6715
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> NodeHealthScriptRunner does *not* report bad health if the script exits 
> with a non-zero exit code. Look at the {{FAILED_WITH_EXIT_CODE}} case:
> {noformat}
>     void reportHealthStatus(HealthCheckerExitStatus status) {
>       long now = System.currentTimeMillis();
>       switch (status) {
>       case SUCCESS:
>         setHealthStatus(true, "", now);
>         break;
>       case TIMED_OUT:
>         setHealthStatus(false, NODE_HEALTH_SCRIPT_TIMED_OUT_MSG);
>         break;
>       case FAILED_WITH_EXCEPTION:
>         setHealthStatus(false, exceptionStackTrace);
>         break;
>       case FAILED_WITH_EXIT_CODE:
>         setHealthStatus(true, "", now);
>         break;
>       case FAILED:
>         setHealthStatus(false, shexec.getOutput());
>         break;
>       }
>     }
> {noformat}
> Based on the discussion in YARN-5567, this is intentional, but it conflicts 
> with the upstream documentation, which says: 
> "If the script *exits with a non-zero exit code*, times out or results in an 
> exception being thrown, the node is marked as unhealthy"
> This statement is misleading and must be corrected. We might also add a 
> comment to {{reportHealthStatus()}} explaining that the 
> {{FAILED_WITH_EXIT_CODE}} behavior is intentional and not a bug.
> This case also lacks unit test coverage.


