Chris Nauroth created YARN-11844:
------------------------------------
Summary: Support configuration of retry policy on GPU discovery
Key: YARN-11844
URL: https://issues.apache.org/jira/browse/YARN-11844
Project: Hadoop YARN
Issue Type: Improvement
Components: gpu, nodemanager
Reporter: Chris Nauroth
Assignee: Chris Nauroth

The NodeManager invokes an external binary (e.g. {{nvidia-smi}}) to discover
attached GPUs. Currently, execution of this binary is subject to a hard-coded
10-second timeout and a hard-coded maximum error count of 10, after which the
NodeManager stops attempting discovery. This change adds new configuration
properties to control both the timeout and the maximum error count. This is
useful in environments where binding the GPU to the host may be delayed.
Default values for the new properties will preserve the existing behavior.
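
To illustrate the idea, here is a minimal sketch of how the two settings could be read from configuration, with defaults that match the current hard-coded values. It is not the actual patch: the class name, the property names, and the use of {{java.util.Properties}} in place of the Hadoop {{Configuration}} class are all assumptions made for the example, as is the exact boundary at which discovery stops.

```java
import java.util.Properties;

/** Illustrative only: a retry policy for GPU discovery driven by configuration. */
class GpuDiscoveryRetryPolicy {
  // Hypothetical property names; the issue does not name the real ones.
  static final String TIMEOUT_KEY = "example.gpu.discovery.timeout-ms";
  static final String MAX_ERRORS_KEY = "example.gpu.discovery.max-errors";

  // Defaults mirror the hard-coded values described in the issue.
  static final long DEFAULT_TIMEOUT_MS = 10_000L;
  static final int DEFAULT_MAX_ERRORS = 10;

  final long timeoutMs;
  final int maxErrors;
  private int errorCount = 0;

  GpuDiscoveryRetryPolicy(Properties conf) {
    // Fall back to the defaults when a property is not set.
    this.timeoutMs = Long.parseLong(
        conf.getProperty(TIMEOUT_KEY, Long.toString(DEFAULT_TIMEOUT_MS)));
    this.maxErrors = Integer.parseInt(
        conf.getProperty(MAX_ERRORS_KEY, Integer.toString(DEFAULT_MAX_ERRORS)));
  }

  /** Returns true while the recorded error count is below the configured maximum. */
  boolean shouldAttempt() {
    return errorCount < maxErrors;
  }

  /** Records one failed (or timed-out) run of the discovery binary. */
  void recordFailure() {
    errorCount++;
  }

  public static void main(String[] args) {
    // With no overrides, the policy uses the default timeout and error limit.
    GpuDiscoveryRetryPolicy policy = new GpuDiscoveryRetryPolicy(new Properties());
    System.out.println("timeoutMs=" + policy.timeoutMs
        + " maxErrors=" + policy.maxErrors);
  }
}
```

In an environment where GPU binding is slow, an operator could raise either value (for example, a longer timeout or a higher error limit) without a code change, while a cluster that sets neither property would behave as it does today.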