[jira] [Comment Edited] (YARN-9923) Detect missing Docker binary or not running Docker daemon

Szilard Nemeth (Jira) Sat, 09 Nov 2019 08:43:40 -0800


    [ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970862#comment-16970862
 ]


Szilard Nemeth edited comment on YARN-9923 at 11/9/19 4:42 PM:
---------------------------------------------------------------

Hi [~adam.antal]!

Really like this rework of the NM health checks and the way you did implement 
the common interface for all healthchecks. This was a refactor could have been 
done long time ago. Kudos for it :) 

Some comments: 
1. As the change is bigger, this patch is not only deals with Docker health 
checks but has further refactors. 
Please include your design decisions in the jira description: What is the 
purpose of the common interface, what refactors did you done, how the 
implementors of the interface behave, etc.
Without these written in the description, it was pretty difficult to review the 
patch.

2. Nit: In NodeHealthScriptRunner: I know you moved this part of the code, but 
let's fix this javadoc: 
  {code}
  /** Time after which the script should be timedout. */
  private long scriptTimeout;
  {code}

  "timedout" should be "timed out".

3. Nit: In NodeHealthScriptRunner.newInstance: I think you could modify this 
log statement a bit:
  {code}
 LOG.info("Node Manager health check script is not available "
          + "or doesn't have execute permission, so not "
          + "starting the node health script runner.");
  {code}
  I think you could change the string "node health script runner" to the class 
name instead: NodeHealthScriptRunner. Maybe this way, the log message is more 
explicit and to the point.

4. In NodeHealthScriptRunner.reportHealthStatus: I think this info log should 
be warn instead:
 {code}
       default:
        LOG.info("Unknown HealthCheckerExitStatus - ignored.");
        break;
 {code}

In the contstuctor of NodeHealthScriptRunner, you have this statement: 
{code}
    super(NodeHealthScriptRunner.class.getName(), chkInterval);
{code}
Typo in name "chkInterval".

5. Nit: Can you move the constructor of NodeHealthScriptRunner above or below 
the newInstane method? It was pretty hard to find it down there.

6. Nit: Javadoc of NodeHealthScriptRunner#shouldRun could be improved: 
{code}
     Method used to determine if or not node health monitoring service should be
   * started or not. Returns true if following conditions are met:
{code}
I would modify the first line as: "Method used to determine whether the health 
monitoring service should be started or not".

7. If NodeHealthScriptRunner#shouldRun returns false, can you log the same 
message but on warn or error level? I think this is an error case, as the 
script is null or empty. Isn't it? 
Maybe you could also log the script's file name on debug level, but maybe not 
in this method.

8. Can you use an enum to mark successful / unsuccessful cases 
NodeHealthMonitorExecutor#reportHealthStatus calls setHealthStatus? I don't 
think the boolean flag is the cleanest approach. Moreover, you have the log 
statement: 
{code}
LOG.info("health status being set as " + output);
{code}, that looks weird with a true/false value.

9. Nit: In NodeHealthMonitorExecutor#reportHealthStatus: The branches SUCCESS 
and FAILED_WITH_EXIT_CODE can be merged together as they are running the same 
code.

10. Nit: Javadoc of NodeHealthMonitorExecutor#hasErrors looks weird for 
parameter 'output'

11. In the constructor of NodeHealthScriptRunner: I would simply do: 
{code}
this.task = new NodeHealthMonitorExecutor(scriptArgs);
{code}, making the code more readable and making setTimerTask only used by 
tests.

12. Nit: Access on method TimedHealthReporterService#setLastReportedTime can be 
private.

13. Is it intentional that TimedHealthReporterService#setHealthStatus(boolean, 
java.lang.String) is not calling this.setLastReportedTime(); with the current 
time? If it is, why?

14. Nit: In the javadoc of class NodeHealthCheckerService: "The class..." 
should be "This class".

15. In the javadoc of class NodeHealthCheckerService: Can you add some basic 
examples on how to use this class, how reporters should be added, etc?

16. In NodeHealthCheckerService#addHealthReporter, if the 1st if-condition 
(noneMatch) fails, can you log something? AFAIU, if this condition is false, 
someone tried to add a duplicate service, so this is more likely a programming 
error.

17. Nit: Could you rephrase the javadoc of 
NodeHealthCheckerService#getHealthReport?

18. Question about NodeHealthCheckerService#isHealthy and its usage of 
allMatch: What if all reporters are returning false as the return value of 
isHealthy? Would allMatch return true for this case? Can you write a unit test 
for this case?

19. Nit: 
DockerHealthCheckerService.DockerDaemonMonitorExecutor#getPossiblePidFileLocations:
 Can you store "docker.pid" as a constant?

20. Comments for DockerDaemonMonitorExecutor#run: 
- Please log if none of the pid file variants are found - in other words, when 
you need to fallback to read the file from the daemon.json, you should log this 
case.
- You should also log if the daemon.json file does not exist or it's not a file.
- You should also log if the line of the daemon.json is not in the appropriate 
format, in other words, if the line with "pidfile" was not found.
- You should also log if the line of the daemon.json is not in the appropriate 
format, in other words, 
parts.length is not 2.
- Please do something with the nested if-chains. 
- Please also create 2 methods out of run: checkDefaultLocation / 
checkFromDockerDaemonConf. You should also break down the latter into smaller 
methods.

21. Comments for TestDockerHealthCheckerService: This test class seems a bit 
incomplete:
- Please delete the commented code block: 
{code}
/*private void setupFolderWithFile(File file, String content) {
  }*/
{code}

- PID_FILE / JSON_CONF variables are unused. Please either delete or use them.
- In testHealthChecker, there are many comments in the end. What's the purpose 
of them? Did you want to create new testcases, maybe?
- General comment: Please add explanation comments for the assertThat calls!
- I ran the test class and checked code coverage, apparently 
DockerHealthCheckerService.DockerDaemonMonitorExecutor#run is not covered at 
all. Can you cover this method with tests as well? 

Thanks!


was (Author: snemeth):
Hi [~adam.antal]!

Really like this rework of the NM health checks and the way you did implement 
the common interface for all healthchecks. This was a refactor could have been 
done long time ago. Kudos for it :) 

Some comments: 
1. As the change is bigger, this patch is not only deals with Docker health 
checks but has further refactors. 
Please include your design decisions in the jira description: What is the 
purpose of the common interface, what refactors did you done, how the 
implementors of the interface behave, etc.
Without these written in the description, it was pretty difficult to review the 
patch.

2. In NodeHealthScriptRunner: I know you moved this part of the code, but let's 
fix this javadoc: 
  {code}
  /** Time after which the script should be timedout. */
  private long scriptTimeout;
  {code}

  "timedout" should be "timed out".

3. In NodeHealthScriptRunner.newInstance: I think you could modify this log 
statement a bit:
  {code}
 LOG.info("Node Manager health check script is not available "
          + "or doesn't have execute permission, so not "
          + "starting the node health script runner.");
  {code}
  I think you could change the string "node health script runner" to the class 
name instead: NodeHealthScriptRunner. Maybe this way, the log message is more 
explicit and to the point.

4. In NodeHealthScriptRunner.reportHealthStatus: I think this info log should 
be warn instead:
 {code}
       default:
        LOG.info("Unknown HealthCheckerExitStatus - ignored.");
        break;
 {code}

In the contstuctor of NodeHealthScriptRunner, you have this statement: 
{code}
    super(NodeHealthScriptRunner.class.getName(), chkInterval);
{code}
Typo in name "chkInterval".

5. Can you move the constructor of NodeHealthScriptRunner above or below the 
newInstane method? It was pretty hard to find it down there.

6. Javadoc of NodeHealthScriptRunner#shouldRun could be improved: 
{code}
     Method used to determine if or not node health monitoring service should be
   * started or not. Returns true if following conditions are met:
{code}
I would modify the first line as: "Method used to determine whether the health 
monitoring service should be started or not".

7. If NodeHealthScriptRunner#shouldRun returns false, can you log the same 
message but on warn or error level? I think this is an error case, as the 
script is null or empty. Isn't it? 
Maybe you could also log the script's file name on debug level, but maybe not 
in this method.

8. Can you use an enum to mark successful / unsuccessful cases 
NodeHealthMonitorExecutor#reportHealthStatus calls setHealthStatus? I don't 
think the boolean flag is the cleanest approach. Moreover, you have the log 
statement: 
{code}
LOG.info("health status being set as " + output);
{code}, that looks weird with a true/false value.

9. In NodeHealthMonitorExecutor#reportHealthStatus: The branches SUCCESS and 
FAILED_WITH_EXIT_CODE can be merged together as they are running the same code.

10. Javadoc of NodeHealthMonitorExecutor#hasErrors looks weird for parameter 
'output'

11. In the constructor of NodeHealthScriptRunner: I would simply do: 
{code}
this.task = new NodeHealthMonitorExecutor(scriptArgs);
{code}, making the code more readable and making setTimerTask only used by 
tests.

12. Nit: Access on method TimedHealthReporterService#setLastReportedTime can be 
private.

13. Is it intentional that TimedHealthReporterService#setHealthStatus(boolean, 
java.lang.String) is not calling this.setLastReportedTime(); with the current 
time? If it is, why?

14. Nit: In the javadoc of class NodeHealthCheckerService: "The class..." 
should be "This class".

15. In the javadoc of class NodeHealthCheckerService: Can you add some basic 
examples on how to use this class, how reporters should be added, etc?

16. In NodeHealthCheckerService#addHealthReporter, if the 1st if-condition 
(nonematch) fails, can you log something? AFAIU, if this condition is false, 
someone tried to add a duplicate service, so this is more likely a programming 
error.

17. Nit: Could you rephrase the javadoc of 
NodeHealthCheckerService#getHealthReport?

18. Question about NodeHealthCheckerService#isHealthy and its usage of 
allMatch: What if all reporters are returning false as the return value of 
isHealthy? Would allMatch return true for this case? Can you write a unit test 
for this case?

19. Nit: 
DockerHealthCheckerService.DockerDaemonMonitorExecutor#getPossiblePidFileLocations:
 Can you store "docker.pid" as a constant?

20. Comments for DockerDaemonMonitorExecutor#run: 
- Please log if none of the pid file variants are found - in other words, when 
you need to fallback to read the file from the daemon.json, you should log this 
case.
- You should also log if the daemon.json file does not exist or it's not a file.
- You should also log if the line of the daemon.json is not in the appropriate 
format, in other words, if the line with "pidfile" was not found.
- You should also log if the line of the daemon.json is not in the appropriate 
format, in other words, 
parts.length is not 2.
- Please do something with the nested if-chains. 
- Please also create 2 methods out of run: checkDefaultLocation / 
checkFromDockerDaemonConf. You should also break down the latter into smaller 
methods.

21. Comments for TestDockerHealthCheckerService: This test class seems a bit 
incomplete:
- Please delete the commented code block: 
{code}
/*private void setupFolderWithFile(File file, String content) {
  }*/
{code}

- PID_FILE / JSON_CONF variables are unused. Please either delete or use them.
- In testHealthChecker, there are many comments in the end. What's the purpose 
of them? Did you want to create new testcases, maybe?
- General comment: Please add explanation comments for the assertThat calls!

> Detect missing Docker binary or not running Docker daemon
> ---------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9923.001.patch, YARN-9923.002.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-9923) Detect missing Docker binary or not running Docker daemon

Reply via email to