[ https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025728#comment-17025728 ]
Szilard Nemeth commented on YARN-10107:
---------------------------------------

Thanks [~prabhujoseph].

> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery binary even if auto discovery is turned off
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10107
>                 URL: https://issues.apache.org/jira/browse/YARN-10107
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml,
>                      nm-config-beforechange-gpu.xml.xml,
>                      request-response-afterchange-with-autodiscovery.txt,
>                      request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
> - GPU is enabled
> - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set to "/usr/bin/ls" (any existing valid binary file)
> - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to "0:0,1:1,2:2", so auto-discovery is turned off.
> If the REST endpoint
> [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
> is called, the following exception is thrown in the NM:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
>     at org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
> *Let's break this down:*
> 1. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
> simply calls
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
> 2. In org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
> the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
> try {
>   lastDiscoveredGpuInformation =
>       nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
> } catch (IOException e) {
> {code}
> 3. org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
> finally throws the exception.
> This only happens if the parameter called "pathOfGpuBinary" is null.
> Since this method is only called from GpuDiscoverer#getGpuDeviceInformation,
> which passes its field called "pathOfGpuBinary" as the only parameter, we
> can be sure that if this field is null, we get the exception.
> 4. The "pathOfGpuBinary" field can only be set through this call chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration) (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
>   GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper) (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>   numOfErrorExecutionSinceLastSucceed = 0;
>   lookUpAutoDiscoveryBinary(config);
>   ....
> {code}
> So org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
> is set ONLY IF auto discovery is enabled.
> Since our tests don't have auto discovery enabled, we get this exception.
> In this sense, the exception message is very misleading to me:
> {code:java}
> Failed to find GPU discovery executable, please double check yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>
> Related jira: https://issues.apache.org/jira/browse/YARN-9337
> I think this exception message is very misleading and, of course, it does not
> make any sense at all to try to execute the discovery binary when auto
> discovery is turned off.
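
The breakdown in the issue description can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions and not the content of YARN-10107.001.patch: the class GpuDiscovererSketch, its fields, and the placeholder result of the "binary" are all hypothetical, and the value "auto" is assumed to be what enables auto discovery. It shows one possible guard: when devices are configured manually, the device list comes from the configuration and the discovery binary is never touched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical, simplified model of the guard; none of these names are Hadoop APIs. */
final class GpuDiscovererSketch {
  /** Assumption: this value of allowed-gpu-devices turns auto discovery on. */
  static final String AUTO = "auto";

  private final String allowedDevices;   // e.g. "0:0,1:1,2:2" or "auto"
  private final String pathOfGpuBinary;  // stays null unless auto discovery is enabled
  int binaryInvocations = 0;             // counts simulated executions of the binary

  GpuDiscovererSketch(String allowedDevices, String configuredBinaryPath) {
    this.allowedDevices = allowedDevices;
    // Same shape as the initialize() excerpt: the path is looked up
    // only when auto discovery is enabled.
    this.pathOfGpuBinary = isAutoDiscoveryEnabled() ? configuredBinaryPath : null;
  }

  boolean isAutoDiscoveryEnabled() {
    return AUTO.equals(allowedDevices);
  }

  List<String> getGpuDeviceInformation() {
    if (!isAutoDiscoveryEnabled()) {
      // Manual configuration: answer from the config, never run the binary.
      return parseAllowedDevices();
    }
    return runDiscoveryBinary();
  }

  private List<String> parseAllowedDevices() {
    List<String> devices = new ArrayList<>();
    for (String entry : allowedDevices.split(",")) {
      String trimmed = entry.trim();
      if (!trimmed.isEmpty()) {
        devices.add(trimmed);
      }
    }
    return devices;
  }

  private List<String> runDiscoveryBinary() {
    if (pathOfGpuBinary == null) {
      // Without the guard above, this is the branch a manual setup would hit.
      throw new IllegalStateException("Failed to find GPU discovery executable");
    }
    binaryInvocations++;
    // Placeholder: a real implementation would execute the binary and parse its output.
    return Collections.singletonList("discovered-by:" + pathOfGpuBinary);
  }

  public static void main(String[] args) {
    GpuDiscovererSketch manual = new GpuDiscovererSketch("0:0,1:1,2:2", "/usr/bin/ls");
    System.out.println(manual.getGpuDeviceInformation()); // prints [0:0, 1:1, 2:2]
    System.out.println(manual.binaryInvocations);         // prints 0
  }
}
```

With the configuration from the report ("0:0,1:1,2:2" and "/usr/bin/ls"), the sketch returns the three configured devices and the simulated binary is never invoked, so the misleading "Failed to find GPU discovery executable" path is not reached.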