[ https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970862#comment-16970862 ]
Szilard Nemeth edited comment on YARN-9923 at 11/9/19 4:42 PM: --------------------------------------------------------------- Hi [~adam.antal]! Really like this rework of the NM health checks and the way you did implement the common interface for all healthchecks. This was a refactor could have been done long time ago. Kudos for it :) Some comments: 1. As the change is bigger, this patch is not only deals with Docker health checks but has further refactors. Please include your design decisions in the jira description: What is the purpose of the common interface, what refactors did you done, how the implementors of the interface behave, etc. Without these written in the description, it was pretty difficult to review the patch. 2. Nit: In NodeHealthScriptRunner: I know you moved this part of the code, but let's fix this javadoc: {code} /** Time after which the script should be timedout. */ private long scriptTimeout; {code} "timedout" should be "timed out". 3. Nit: In NodeHealthScriptRunner.newInstance: I think you could modify this log statement a bit: {code} LOG.info("Node Manager health check script is not available " + "or doesn't have execute permission, so not " + "starting the node health script runner."); {code} I think you could change the string "node health script runner" to the class name instead: NodeHealthScriptRunner. Maybe this way, the log message is more explicit and to the point. 4. In NodeHealthScriptRunner.reportHealthStatus: I think this info log should be warn instead: {code} default: LOG.info("Unknown HealthCheckerExitStatus - ignored."); break; {code} In the contstuctor of NodeHealthScriptRunner, you have this statement: {code} super(NodeHealthScriptRunner.class.getName(), chkInterval); {code} Typo in name "chkInterval". 5. Nit: Can you move the constructor of NodeHealthScriptRunner above or below the newInstane method? It was pretty hard to find it down there. 6. Nit: Javadoc of NodeHealthScriptRunner#shouldRun could be improved: {code} Method used to determine if or not node health monitoring service should be * started or not. Returns true if following conditions are met: {code} I would modify the first line as: "Method used to determine whether the health monitoring service should be started or not". 7. If NodeHealthScriptRunner#shouldRun returns false, can you log the same message but on warn or error level? I think this is an error case, as the script is null or empty. Isn't it? Maybe you could also log the script's file name on debug level, but maybe not in this method. 8. Can you use an enum to mark successful / unsuccessful cases NodeHealthMonitorExecutor#reportHealthStatus calls setHealthStatus? I don't think the boolean flag is the cleanest approach. Moreover, you have the log statement: {code} LOG.info("health status being set as " + output); {code}, that looks weird with a true/false value. 9. Nit: In NodeHealthMonitorExecutor#reportHealthStatus: The branches SUCCESS and FAILED_WITH_EXIT_CODE can be merged together as they are running the same code. 10. Nit: Javadoc of NodeHealthMonitorExecutor#hasErrors looks weird for parameter 'output' 11. In the constructor of NodeHealthScriptRunner: I would simply do: {code} this.task = new NodeHealthMonitorExecutor(scriptArgs); {code}, making the code more readable and making setTimerTask only used by tests. 12. Nit: Access on method TimedHealthReporterService#setLastReportedTime can be private. 13. Is it intentional that TimedHealthReporterService#setHealthStatus(boolean, java.lang.String) is not calling this.setLastReportedTime(); with the current time? If it is, why? 14. Nit: In the javadoc of class NodeHealthCheckerService: "The class..." should be "This class". 15. In the javadoc of class NodeHealthCheckerService: Can you add some basic examples on how to use this class, how reporters should be added, etc? 16. In NodeHealthCheckerService#addHealthReporter, if the 1st if-condition (noneMatch) fails, can you log something? AFAIU, if this condition is false, someone tried to add a duplicate service, so this is more likely a programming error. 17. Nit: Could you rephrase the javadoc of NodeHealthCheckerService#getHealthReport? 18. Question about NodeHealthCheckerService#isHealthy and its usage of allMatch: What if all reporters are returning false as the return value of isHealthy? Would allMatch return true for this case? Can you write a unit test for this case? 19. Nit: DockerHealthCheckerService.DockerDaemonMonitorExecutor#getPossiblePidFileLocations: Can you store "docker.pid" as a constant? 20. Comments for DockerDaemonMonitorExecutor#run: - Please log if none of the pid file variants are found - in other words, when you need to fallback to read the file from the daemon.json, you should log this case. - You should also log if the daemon.json file does not exist or it's not a file. - You should also log if the line of the daemon.json is not in the appropriate format, in other words, if the line with "pidfile" was not found. - You should also log if the line of the daemon.json is not in the appropriate format, in other words, parts.length is not 2. - Please do something with the nested if-chains. - Please also create 2 methods out of run: checkDefaultLocation / checkFromDockerDaemonConf. You should also break down the latter into smaller methods. 21. Comments for TestDockerHealthCheckerService: This test class seems a bit incomplete: - Please delete the commented code block: {code} /*private void setupFolderWithFile(File file, String content) { }*/ {code} - PID_FILE / JSON_CONF variables are unused. Please either delete or use them. - In testHealthChecker, there are many comments in the end. What's the purpose of them? Did you want to create new testcases, maybe? - General comment: Please add explanation comments for the assertThat calls! - I ran the test class and checked code coverage, apparently DockerHealthCheckerService.DockerDaemonMonitorExecutor#run is not covered at all. Can you cover this method with tests as well? Thanks! was (Author: snemeth): Hi [~adam.antal]! Really like this rework of the NM health checks and the way you did implement the common interface for all healthchecks. This was a refactor could have been done long time ago. Kudos for it :) Some comments: 1. As the change is bigger, this patch is not only deals with Docker health checks but has further refactors. Please include your design decisions in the jira description: What is the purpose of the common interface, what refactors did you done, how the implementors of the interface behave, etc. Without these written in the description, it was pretty difficult to review the patch. 2. In NodeHealthScriptRunner: I know you moved this part of the code, but let's fix this javadoc: {code} /** Time after which the script should be timedout. */ private long scriptTimeout; {code} "timedout" should be "timed out". 3. In NodeHealthScriptRunner.newInstance: I think you could modify this log statement a bit: {code} LOG.info("Node Manager health check script is not available " + "or doesn't have execute permission, so not " + "starting the node health script runner."); {code} I think you could change the string "node health script runner" to the class name instead: NodeHealthScriptRunner. Maybe this way, the log message is more explicit and to the point. 4. In NodeHealthScriptRunner.reportHealthStatus: I think this info log should be warn instead: {code} default: LOG.info("Unknown HealthCheckerExitStatus - ignored."); break; {code} In the contstuctor of NodeHealthScriptRunner, you have this statement: {code} super(NodeHealthScriptRunner.class.getName(), chkInterval); {code} Typo in name "chkInterval". 5. Can you move the constructor of NodeHealthScriptRunner above or below the newInstane method? It was pretty hard to find it down there. 6. Javadoc of NodeHealthScriptRunner#shouldRun could be improved: {code} Method used to determine if or not node health monitoring service should be * started or not. Returns true if following conditions are met: {code} I would modify the first line as: "Method used to determine whether the health monitoring service should be started or not". 7. If NodeHealthScriptRunner#shouldRun returns false, can you log the same message but on warn or error level? I think this is an error case, as the script is null or empty. Isn't it? Maybe you could also log the script's file name on debug level, but maybe not in this method. 8. Can you use an enum to mark successful / unsuccessful cases NodeHealthMonitorExecutor#reportHealthStatus calls setHealthStatus? I don't think the boolean flag is the cleanest approach. Moreover, you have the log statement: {code} LOG.info("health status being set as " + output); {code}, that looks weird with a true/false value. 9. In NodeHealthMonitorExecutor#reportHealthStatus: The branches SUCCESS and FAILED_WITH_EXIT_CODE can be merged together as they are running the same code. 10. Javadoc of NodeHealthMonitorExecutor#hasErrors looks weird for parameter 'output' 11. In the constructor of NodeHealthScriptRunner: I would simply do: {code} this.task = new NodeHealthMonitorExecutor(scriptArgs); {code}, making the code more readable and making setTimerTask only used by tests. 12. Nit: Access on method TimedHealthReporterService#setLastReportedTime can be private. 13. Is it intentional that TimedHealthReporterService#setHealthStatus(boolean, java.lang.String) is not calling this.setLastReportedTime(); with the current time? If it is, why? 14. Nit: In the javadoc of class NodeHealthCheckerService: "The class..." should be "This class". 15. In the javadoc of class NodeHealthCheckerService: Can you add some basic examples on how to use this class, how reporters should be added, etc? 16. In NodeHealthCheckerService#addHealthReporter, if the 1st if-condition (nonematch) fails, can you log something? AFAIU, if this condition is false, someone tried to add a duplicate service, so this is more likely a programming error. 17. Nit: Could you rephrase the javadoc of NodeHealthCheckerService#getHealthReport? 18. Question about NodeHealthCheckerService#isHealthy and its usage of allMatch: What if all reporters are returning false as the return value of isHealthy? Would allMatch return true for this case? Can you write a unit test for this case? 19. Nit: DockerHealthCheckerService.DockerDaemonMonitorExecutor#getPossiblePidFileLocations: Can you store "docker.pid" as a constant? 20. Comments for DockerDaemonMonitorExecutor#run: - Please log if none of the pid file variants are found - in other words, when you need to fallback to read the file from the daemon.json, you should log this case. - You should also log if the daemon.json file does not exist or it's not a file. - You should also log if the line of the daemon.json is not in the appropriate format, in other words, if the line with "pidfile" was not found. - You should also log if the line of the daemon.json is not in the appropriate format, in other words, parts.length is not 2. - Please do something with the nested if-chains. - Please also create 2 methods out of run: checkDefaultLocation / checkFromDockerDaemonConf. You should also break down the latter into smaller methods. 21. Comments for TestDockerHealthCheckerService: This test class seems a bit incomplete: - Please delete the commented code block: {code} /*private void setupFolderWithFile(File file, String content) { }*/ {code} - PID_FILE / JSON_CONF variables are unused. Please either delete or use them. - In testHealthChecker, there are many comments in the end. What's the purpose of them? Did you want to create new testcases, maybe? - General comment: Please add explanation comments for the assertThat calls! > Detect missing Docker binary or not running Docker daemon > --------------------------------------------------------- > > Key: YARN-9923 > URL: https://issues.apache.org/jira/browse/YARN-9923 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodemanager, yarn > Affects Versions: 3.2.1 > Reporter: Adam Antal > Assignee: Adam Antal > Priority: Major > Attachments: YARN-9923.001.patch, YARN-9923.002.patch > > > Currently if a NodeManager is enabled to allocate Docker containers, but the > specified binary (docker.binary in the container-executor.cfg) is missing the > container allocation fails with the following error message: > {noformat} > Container launch fails > Exit code: 29 > Exception message: Launch container failed > Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No > such file or directory > Could not inspect docker network to get type /usr/bin/docker network inspect > host --format='{{.Driver}}'. > Error constructing docker command, docker error code=-1, error > message='Unknown error' > {noformat} > I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" > to have the following options: > - STARTUP: setting this option the NodeManager would not start if Docker > binaries are missing or the Docker daemon is not running (the exception is > considered FATAL during startup) > - RUNTIME: would give a more detailed/user-friendly exception in > NodeManager's side (NM logs) if Docker binaries are missing or the daemon is > not working. This would also prevent further Docker container allocation as > long as the binaries do not exist and the docker daemon is not running. > - NONE (default): preserving the current behaviour, throwing exception during > container allocation, carrying on using the default retry procedure. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org