[ 
https://issues.apache.org/jira/browse/YARN-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17025684#comment-17025684
 ] 

Hudson commented on YARN-10107:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17915 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17915/])
YARN-10107. Fix GpuResourcePlugin#getNMResourceInfo to honor Auto (pjoseph: rev 
825db8fe2ab37bd5a9a54485ea9ecbabf3766ed6)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuResourcePlugin.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuResourcePlugin.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/GpuDiscoverer.java


> Invoking NMWebServices#getNMResourceInfo tries to execute gpu discovery 
> binary even if auto discovery is turned off
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10107
>                 URL: https://issues.apache.org/jira/browse/YARN-10107
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: YARN-10107.001.patch, nm-config-afterchange-gpu.xml, 
> nm-config-beforechange-gpu.xml.xml, 
> request-response-afterchange-with-autodiscovery.txt, 
> request-response-afterchange.txt, request-response-beforechange.txt
>
>
> During internal end-to-end testing, I found the following issue:
> Configuration:
>  - GPU is enabled
>  - yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables is set 
> to "/usr/bin/ls" (any existing, valid binary file)
>  - yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices is set to 
> "0:0,1:1,2:2", so auto-discovery is turned off.
>  If the REST endpoint 
> [http://quasar-tsjqpq-3.vpc.cloudera.com:8042/ws/v1/node/resources/yarn.io%2Fgpu]
>  is called, the following exception is thrown in the NM:
> {code:java}
> 2020-01-23 07:55:24,803 ERROR 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin:
>  Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to find GPU discovery 
> executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper.getGpuDeviceInformation(NvidiaBinaryHelper.java:54)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer.getGpuDeviceInformation(GpuDiscoverer.java:125)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin.getNMResourceInfo(GpuResourcePlugin.java:104)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getNMResourceInfo(NMWebServices.java:515)
> {code}
> *Let's break this down:* 
>  1. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuResourcePlugin#getNMResourceInfo
>  simply makes this call:
> {code:java}
> gpuDeviceInformation = gpuDiscoverer.getGpuDeviceInformation();
> {code}
> 2. In 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#getGpuDeviceInformation,
>  the following code calls NvidiaBinaryHelper.getGpuDeviceInformation:
> {code:java}
>  try {
>       lastDiscoveredGpuInformation =
>           nvidiaBinaryHelper.getGpuDeviceInformation(pathOfGpuBinary);
>     } catch (IOException e) {
> {code}
> 3. 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.NvidiaBinaryHelper#getGpuDeviceInformation
>  finally throws the exception.
>  This only happens when the parameter called "pathOfGpuBinary" is null.
>  Since this method is only called from GpuDiscoverer#getGpuDeviceInformation, 
> which passes its field called "pathOfGpuBinary" as the only argument, we can 
> be sure that whenever this field is null, the exception is thrown.
>  4. The only way the "pathOfGpuBinary" field can be set is through this 
> call chain:
> {code:java}
> GpuDiscoverer.lookUpAutoDiscoveryBinary(Configuration)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
>   GpuDiscoverer.initialize(Configuration, NvidiaBinaryHelper)  
> (org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu)
> {code}
> 5. GpuDiscoverer#initialize contains this code:
> {code:java}
> if (isAutoDiscoveryEnabled()) {
>       numOfErrorExecutionSinceLastSucceed = 0;
>       lookUpAutoDiscoveryBinary(config);
>       ....
> {code}
> As a result, 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.resourceplugin.gpu.GpuDiscoverer#pathOfGpuBinary
>  is set ONLY IF auto discovery is enabled.
>  Since our tests don't have auto discovery enabled, we get this exception. 
> In this context, the exception message is very misleading:
> {code:java}
> Failed to find GPU discovery executable, please double check 
> yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables setting.
> {code}
>  
>  Related jira: https://issues.apache.org/jira/browse/YARN-9337
> I think this exception message is very misleading and, of course, it does 
> not make any sense at all to try to execute the discovery binary when auto 
> discovery is turned off.
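
The call chain described in the quoted report can be condensed into a small standalone sketch. It is not the Hadoop code and not the committed patch: the class and method names (GpuDiscoverySketch, Discoverer, unguardedResourceInfo, guardedResourceInfo) are hypothetical stand-ins, and it assumes that the literal value "auto" for allowed-gpu-devices is what turns auto discovery on. It only illustrates why the path stays null when an explicit device list is configured, and one possible way a caller could avoid the binary in that case.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-ins; none of this is the real Hadoop code.
class GpuDiscoverySketch {

    /** Mimics only the parts of GpuDiscoverer discussed in the report. */
    static class Discoverer {
        private final String allowedDevices;   // "auto" or a list such as "0:0,1:1,2:2"
        private final String configuredBinary; // value of path-to-discovery-executables
        private String pathOfGpuBinary;        // stays null unless auto discovery is on

        Discoverer(String allowedDevices, String configuredBinary) {
            this.allowedDevices = allowedDevices;
            this.configuredBinary = configuredBinary;
        }

        // Assumption: the literal "auto" switches auto discovery on.
        boolean isAutoDiscoveryEnabled() {
            return "auto".equals(allowedDevices);
        }

        // Like the quoted initialize(): the binary is looked up only
        // when auto discovery is enabled.
        void initialize() {
            if (isAutoDiscoveryEnabled()) {
                pathOfGpuBinary = configuredBinary;
            }
        }

        // Stands in for the helper call that fails on a null path.
        String getGpuDeviceInformation() {
            if (pathOfGpuBinary == null) {
                throw new IllegalStateException(
                    "Failed to find GPU discovery executable");
            }
            return "device information from " + pathOfGpuBinary;
        }

        List<String> getAllowedDevices() {
            return Arrays.asList(allowedDevices.split(","));
        }
    }

    // Behaviour described in the report: discovery is always attempted.
    static String unguardedResourceInfo(Discoverer d) {
        return d.getGpuDeviceInformation();
    }

    // One possible guard: skip the binary when auto discovery is off.
    static String guardedResourceInfo(Discoverer d) {
        if (d.isAutoDiscoveryEnabled()) {
            return d.getGpuDeviceInformation();
        }
        return "configured devices: " + d.getAllowedDevices();
    }

    public static void main(String[] args) {
        Discoverer d = new Discoverer("0:0,1:1,2:2", "/usr/bin/ls");
        d.initialize();
        try {
            unguardedResourceInfo(d);
        } catch (IllegalStateException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        System.out.println("guarded: " + guardedResourceInfo(d));
    }
}
```

With the configuration from the report (an explicit device list plus a valid binary path), the unguarded call fails even though the binary exists, because the path was never copied into the field. The guarded variant never touches the binary in that case.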



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

