[ 
https://issues.apache.org/jira/browse/YARN-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991484#comment-16991484
 ] 

Adam Antal commented on YARN-9923:
----------------------------------

Thanks for the thorough review [~snemeth], uploaded patchset v9. Let me respond 
to the review items.
0. 1. 2. 3. 5. 7. 10. and 12 done cleanly, please double check.

bq. 4. More on this part, I don't get what this line mean:

It means that if you provide a string in 
{{yarn.nodemanager.health-checker.script}}, you have to add 
{{yarn.nodemanager.health-checker.%s.path}}, 
{{yarn.nodemanager.health-checker.%s.opts}} and possibly similar configs as 
well. E.g. you have:
{noformat}
yarn.nodemanager.health-checker.script=script1,script2
yarn.nodemanager.health-checker.script1.path=/path/to/the/first/script/
yarn.nodemanager.health-checker.script1.opts=args to the first script
yarn.nodemanager.health-checker.script2.path=/path/to/the/second/script/
yarn.nodemanager.health-checker.script2.opts=args to the second script
yarn.nodemanager.health-checker.script2.timeout-ms=3000
{noformat}

bq. 6. [...] I think NodeHealthCheckerService should have a method that 
receives an exception and propagates it to the ExceptionReporter

Actually this was the original behaviour, I decoupled the ExceptionReporter 
responsibility to handle exceptions separately. Restored the original.

bq. 8. What happens if the user provides an invalid value (e.g. negative) for 
health-checker interval?

The existing behaviour did not check the inputs, that's why I refrained from 
adding it. I think it's better to fix this bug now, so added 
{{IllegalArgumentException}} cases to the mentioned configs.

bq. 9 [...] TimedHealthReporterService#setHealthStatus is always receiving an 
empty string if the status is true (successful).

I think creating a class for that is a little bit of overkill. I modified the 
class so that it reports either healthy or not healthy. The latter function 
must be invoked with a String argument.

bq. 11. Is it enough that 
org.apache.hadoop.yarn.server.nodemanager.health.NodeHealthScriptRunner.NodeHealthMonitorExecutor#hasErrors
 only checks for uppercase "ERROR" in the shell output? Where is this 
documented that health checker scripts should print ERROR in their output?

It is documented in the markdowns. I don't want to touch that part of the code, 
as it may break existing behaviour. For further discussion see YARN-6715.

bq. 13. Isn't a helper method like this somewhere already implemented (Apache 
commons, lang3, YARN helper methods)?

I didn't find any one liner for this, but I could enhance that piece of a code 
into a few lines. It looks better. Also the use cases may vary for each 
invocation, so I couldn't decrease the level of duplication by a lot by moving 
to this {{FileUtil}}.

> Introduce HealthReporter interface to support multiple health checker files
> ---------------------------------------------------------------------------
>
>                 Key: YARN-9923
>                 URL: https://issues.apache.org/jira/browse/YARN-9923
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, yarn
>    Affects Versions: 3.2.1
>            Reporter: Adam Antal
>            Assignee: Adam Antal
>            Priority: Major
>         Attachments: YARN-9923.001.patch, YARN-9923.002.patch, 
> YARN-9923.003.patch, YARN-9923.004.patch, YARN-9923.005.patch, 
> YARN-9923.006.patch, YARN-9923.007.patch, YARN-9923.008.patch
>
>
> Currently if a NodeManager is enabled to allocate Docker containers, but the 
> specified binary (docker.binary in the container-executor.cfg) is missing the 
> container allocation fails with the following error message:
> {noformat}
> Container launch fails
> Exit code: 29
> Exception message: Launch container failed
> Shell error output: sh: <docker binary path, /usr/bin/docker by default>: No 
> such file or directory
> Could not inspect docker network to get type /usr/bin/docker network inspect 
> host --format='{{.Driver}}'.
> Error constructing docker command, docker error code=-1, error 
> message='Unknown error'
> {noformat}
> I suggest to add a property say "yarn.nodemanager.runtime.linux.docker.check" 
> to have the following options:
> - STARTUP: setting this option the NodeManager would not start if Docker 
> binaries are missing or the Docker daemon is not running (the exception is 
> considered FATAL during startup)
> - RUNTIME: would give a more detailed/user-friendly exception in 
> NodeManager's side (NM logs) if Docker binaries are missing or the daemon is 
> not working. This would also prevent further Docker container allocation as 
> long as the binaries do not exist and the docker daemon is not running.
> - NONE (default): preserving the current behaviour, throwing exception during 
> container allocation, carrying on using the default retry procedure.
> ------------------------------------------------------------------------------------------------
> A new interface called {{HealthChecker}} is introduced which is used in the 
> {{NodeHealthCheckerService}}. Currently existing implementations like 
> {{LocalDirsHandlerService}} are modified to implement this giving a clear 
> abstraction to the node's health. The {{DockerHealthChecker}} implements this 
> new interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to