[
https://issues.apache.org/jira/browse/YARN-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012646#comment-18012646
]
ASF GitHub Bot commented on YARN-11844:
---------------------------------------
ayushtkn commented on code in PR #7857:
URL: https://github.com/apache/hadoop/pull/7857#discussion_r2260938054
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml:
##########
@@ -4650,6 +4650,34 @@
<value></value>
</property>
+ <property>
+ <description>
+ Sets the maximum duration for executions of the discovery binary defined in
+ yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables. If
+ the binary takes longer than this amount of time to run, then the process
+ is aborted. Discovery may be attempted again, depending on
+ yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
+ </description>
+ <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
+ <value>10000ms</value>
Review Comment:
Any reason for not using 10s here instead of 10000ms?
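
For context on the question above: Hadoop's `Configuration.getTimeDuration` accepts unit suffixes, so `10s` and `10000ms` would normally resolve to the same duration if the property is read that way (whether this PR reads it that way is an assumption here). The sketch below is a stdlib-only illustration of that equivalence. `DurationSuffixSketch` and `toMillis` are hypothetical names, not Hadoop APIs, and only the `ms` and `s` suffixes are handled.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class DurationSuffixSketch {

  /** Converts a value such as "10s" or "10000ms" to milliseconds. */
  static long toMillis(String value) {
    String v = value.trim().toLowerCase(Locale.ROOT);
    // Check "ms" before "s", because "ms" also ends with "s".
    if (v.endsWith("ms")) {
      return Long.parseLong(v.substring(0, v.length() - 2).trim());
    }
    if (v.endsWith("s")) {
      long seconds = Long.parseLong(v.substring(0, v.length() - 1).trim());
      return TimeUnit.SECONDS.toMillis(seconds);
    }
    // Assumption of this sketch: a bare number is already in milliseconds.
    return Long.parseLong(v);
  }

  public static void main(String[] args) {
    // Both spellings describe the same 10-second timeout.
    System.out.println(toMillis("10s") == toMillis("10000ms")); // prints "true"
  }
}
```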
##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/gpu/TestGpuDiscoverer.java:
##########
@@ -297,6 +297,36 @@ public void testGetGpuDeviceInformationFaultyNvidiaSmiScriptConsecutiveRun()
assertNotNull(discoverer.getGpusUsableByYarn());
}
+ @Test
+ public void testGetGpuDeviceInformationDisableMaxErrors()
+ throws YarnException, IOException {
+ Configuration conf = new Configuration(false);
+ // A negative value should disable max errors enforcement.
+ conf.setInt(YarnConfiguration.NM_GPU_DISCOVERY_MAX_ERRORS, -1);
+
+ File fakeBinary = createFakeNvidiaSmiScriptAsRunnableFile(
+ this::createFaultyNvidiaSmiScript);
+
+ GpuDiscoverer discoverer = creatediscovererWithGpuPathDefined(conf);
+ assertEquals(fakeBinary.getAbsolutePath(),
+ discoverer.getPathOfGpuBinary());
+ assertNull(discoverer.getEnvironmentToRunCommand().get(PATH));
+
+ final String terminateMsg = "Failed to execute GPU device " +
+ "detection script (" + fakeBinary.getAbsolutePath() + ") for 10 times";
+ final String msg = "Failed to execute GPU device detection script";
+
+ // The default max errors is 10. Verify that it keeps going for an 11th try.
+ for (int i = 0; i < 11; ++i) {
Review Comment:
I changed this 11 to 15 and the test still doesn't fail for me. Can you check
this once?
> Support configuration of retry policy on GPU discovery
> ------------------------------------------------------
>
> Key: YARN-11844
> URL: https://issues.apache.org/jira/browse/YARN-11844
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: gpu, nodemanager
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Priority: Major
> Labels: pull-request-available
>
> The NodeManager invokes an external binary (e.g. {{nvidia-smi}}) to discover
> attached GPUs. Right now, there is a hard-coded 10-second timeout on
> execution of this binary and a hard-coded max error count of 10, beyond which
> the NodeManager will stop attempting discovery. This change will provide new
> configuration properties to control both the timeout and the max errors,
> which is useful in environments where there may be a delay in binding the GPU
> to the host. Default values for the new configuration properties will be set
> so as to maintain the existing behavior.
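
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the `GpuDiscoverer` code from the PR: the class `DiscoveryRetrySketch` and its methods are hypothetical, and treating a negative max-errors value as "no limit" is taken from the test comment in the review. Whether a successful run resets the error count is not modeled.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a timeout plus max-errors policy; not Hadoop code. */
final class DiscoveryRetrySketch {
  private final int maxErrors;   // negative value disables the limit (assumption)
  private final long timeoutMs;  // per-execution timeout in milliseconds
  private int errors = 0;

  DiscoveryRetrySketch(int maxErrors, long timeoutMs) {
    this.maxErrors = maxErrors;
    this.timeoutMs = timeoutMs;
  }

  /** True while another discovery attempt is still allowed. */
  boolean mayAttempt() {
    return maxErrors < 0 || errors < maxErrors;
  }

  /** Records one failed or timed-out execution. */
  void recordFailure() {
    errors++;
  }

  /** Runs the command once; true only if it exits with 0 before the timeout. */
  boolean runOnce(List<String> command)
      throws IOException, InterruptedException {
    Process process = new ProcessBuilder(command).inheritIO().start();
    if (!process.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
      process.destroyForcibly(); // abort a binary that runs too long
      recordFailure();
      return false;
    }
    if (process.exitValue() != 0) {
      recordFailure();
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // With a limit of 10, the 11th attempt is refused.
    DiscoveryRetrySketch limited = new DiscoveryRetrySketch(10, 10_000L);
    int attempts = 0;
    while (limited.mayAttempt()) {
      limited.recordFailure();
      attempts++;
    }
    System.out.println(attempts); // prints "10"

    // With a negative limit, attempts are never refused.
    DiscoveryRetrySketch unlimited = new DiscoveryRetrySketch(-1, 10_000L);
    for (int i = 0; i < 15; i++) {
      unlimited.recordFailure();
    }
    System.out.println(unlimited.mayAttempt()); // prints "true"
  }
}
```

With defaults of 10 errors and 10000 ms, a policy like this would match the hard-coded behavior the issue describes, while a larger timeout or a negative limit would suit hosts where GPU binding is delayed.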
--
This message was sent by Atlassian Jira
(v8.20.10#820010)