[ https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763445#comment-16763445 ]
Zoltan Siegl commented on YARN-9217:
------------------------------------

In GpuDiscoverer.java:
{code:java}
else if (getFileNameFromFile(binaryPath).equals(DEFAULT_BINARY_NAME)) {
  // If path exists but file name is incorrect don't execute the file
  LOG.warn(
      "Please check the configuration value of {}. "
          + "It should point to an {} binary.",
      YarnConfiguration.NM_GPU_PATH_TO_EXEC, DEFAULT_BINARY_NAME);
}
{code}
To me it looks like we allow anything but nvidia-smi as the file name part of the binary path: the warning branch is taken when the file name equals DEFAULT_BINARY_NAME, not when it differs. If that reading is correct, we probably want the opposite here.

> Nodemanager will fail to start if GPU is misconfigured on the node or GPU drivers missing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-9217
>                 URL: https://issues.apache.org/jira/browse/YARN-9217
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Antal Bálint Steinbach
>            Assignee: Antal Bálint Steinbach
>            Priority: Major
>         Attachments: YARN-9217.001.patch, YARN-9217.002.patch,
>                      YARN-9217.003.patch, YARN-9217.004.patch
>
>
> Nodemanager will not start:
> 1. If autodiscovery is enabled:
>  * If the nvidia-smi path is misconfigured or the file does not exist.
>  * If 0 GPUs are found.
>  * If the file exists but does not point to an nvidia-smi binary.
>  * If the binary is OK but there is an IOException.
> 2. If the manually configured GPU devices are misconfigured:
>  * Any index:minor number format failure will cause a problem.
>  * 0 configured devices will cause a problem.
>  * NumberFormatException is not handled.
> It would be better to log warnings about the configuration, set the number of
> available GPUs to 0, and let the node keep working and run non-GPU jobs.
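For illustration, the file name check raised in the comment above could be sketched with the condition negated. This is only a minimal, standalone sketch and not the actual GpuDiscoverer code: the class GpuBinaryNameCheck and its methods are hypothetical, DEFAULT_BINARY_NAME is assumed to be "nvidia-smi" (as the comment implies), and System.err stands in for the real logger.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Hadoop YARN.
final class GpuBinaryNameCheck {
  // Assumption: the expected binary name is "nvidia-smi".
  static final String DEFAULT_BINARY_NAME = "nvidia-smi";

  private GpuBinaryNameCheck() {
  }

  /** Returns true only if the last path element is exactly the expected name. */
  static boolean hasExpectedBinaryName(String binaryPath) {
    if (binaryPath == null || binaryPath.isEmpty()) {
      return false;
    }
    // getFileName() returns null for a root path such as "/".
    Path fileName = Paths.get(binaryPath).getFileName();
    return fileName != null
        && DEFAULT_BINARY_NAME.equals(fileName.toString());
  }

  /**
   * Returns the path to execute, or null if the file name is wrong.
   * A null result means "skip discovery and report 0 GPUs" rather than
   * failing, which matches the behaviour proposed in the issue description.
   */
  static String resolveBinaryOrSkip(String binaryPath) {
    // Note the negation: warn when the name is NOT the expected one.
    if (!hasExpectedBinaryName(binaryPath)) {
      System.err.println("Please check the configured GPU binary path. "
          + "It should point to an " + DEFAULT_BINARY_NAME + " binary.");
      return null;
    }
    return binaryPath;
  }
}
```

With this shape, a path such as /usr/bin/nvidia-smi is accepted, while a path whose last element is anything else is rejected with a warning and the caller can fall back to 0 available GPUs.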