[ https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907985#comment-16907985 ]
Peter Bacsko commented on YARN-9217:
------------------------------------

Rebased the patch (again) and introduced a new fail-fast property.

> Nodemanager will fail to start if GPU is misconfigured on the node or GPU drivers missing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-9217
>                 URL: https://issues.apache.org/jira/browse/YARN-9217
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Antal Bálint Steinbach
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: YARN-9217.001.patch, YARN-9217.002.patch,
>                      YARN-9217.003.patch, YARN-9217.004.patch,
>                      YARN-9217.005.patch, YARN-9217.006.patch,
>                      YARN-9217.007.patch, YARN-9217.008.patch,
>                      YARN-9217.009.patch, YARN-9217.010.patch
>
>
> The Nodemanager will not start in the following cases:
> 1. Autodiscovery is enabled and:
>  * the nvidia-smi path is misconfigured or the file does not exist
>  * 0 GPUs are found
>  * the file exists but is not an nvidia-smi binary
>  * the binary is fine but an IOException occurs
> 2. The manually configured GPU devices are misconfigured:
>  * any failure in the index:minor number format causes a problem
>  * 0 configured devices cause a problem
>  * NumberFormatException is not handled
> A better option would be to log warnings about the configuration, set the
> number of available GPUs to 0, and let the node keep working and run non-GPU jobs.
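
For illustration only, the behaviour proposed in the description (warn, report 0 GPUs, keep the node running) combined with a fail-fast switch could be sketched roughly as below. This is not the code from the attached patches. The class GpuDeviceSpecParser, the GpuDevice type, the failFast parameter and the comma-separated list of index:minor_number entries are all assumptions made for the example, and the sketch discards the whole list on the first bad entry, which may differ from what the patch does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical helper; NOT the class used in the YARN-9217 patches. */
final class GpuDeviceSpecParser {
  private static final Logger LOG =
      Logger.getLogger(GpuDeviceSpecParser.class.getName());

  /** One manually configured device: GPU index plus device minor number. */
  static final class GpuDevice {
    final int index;
    final int minorNumber;

    GpuDevice(int index, int minorNumber) {
      this.index = index;
      this.minorNumber = minorNumber;
    }
  }

  /**
   * Parses a device list such as "0:0,1:1" (format assumed for this sketch).
   * With failFast=true a bad list throws; otherwise a warning is logged and
   * an empty list (0 usable GPUs) is returned so the caller can keep running.
   */
  static List<GpuDevice> parse(String spec, boolean failFast) {
    try {
      return parseStrict(spec);
    } catch (IllegalArgumentException e) {
      if (failFast) {
        throw e;
      }
      LOG.warning("Invalid GPU device list '" + spec + "': "
          + e.getMessage() + "; continuing with 0 GPUs");
      return Collections.emptyList();
    }
  }

  private static List<GpuDevice> parseStrict(String spec) {
    if (spec == null || spec.trim().isEmpty()) {
      throw new IllegalArgumentException("no GPU devices configured");
    }
    List<GpuDevice> devices = new ArrayList<>();
    for (String entry : spec.split(",")) {
      String[] parts = entry.trim().split(":");
      if (parts.length != 2) {
        throw new IllegalArgumentException(
            "expected index:minor_number, got '" + entry + "'");
      }
      try {
        devices.add(new GpuDevice(
            Integer.parseInt(parts[0].trim()),
            Integer.parseInt(parts[1].trim())));
      } catch (NumberFormatException e) {
        // Convert to a descriptive error instead of leaking the raw exception.
        throw new IllegalArgumentException(
            "non-numeric value in '" + entry + "'", e);
      }
    }
    if (devices.isEmpty()) {
      throw new IllegalArgumentException("no GPU devices configured");
    }
    return devices;
  }
}
```

With failFast set to false, GpuDeviceSpecParser.parse("0:x", false) logs a warning and returns an empty list, so a caller could go on to serve non-GPU jobs. With failFast set to true the same input throws IllegalArgumentException, which corresponds to the original "refuse to start" behaviour.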