[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
Artem-B wrote: > What's the config to set this by default without any graphics? https://docs.nvidia.com/deploy/driver-persistence/index.html I usually use "nvidia-smi -i -pm ENABLED" to force the driver to be loaded permanently. As for `__nvcc_device_query`, my guess is that it just uses a handful of driver API calls. Nothing unusual, it's probably the standard sequence of cuInit/cuDeviceGetCount/cuDeviceGetAttribute like in this example: https://github.com/NVIDIA/cuda-samples/blob/5f97d7d0dff880bc6567faa4c5e62e389a6d6999/Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv.cpp#L56 ``` checkCudaErrors(cuInit(0)); checkCudaErrors(cuDeviceGetCount(&deviceCount)); for (dev = 0; dev < deviceCount; ++dev) { checkCudaErrors(cuDeviceGetName(deviceName, 256, dev)); ... } ``` https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
jhuber6 wrote: > Ooh... I think I know exactly what may be causing this. I've observed this a few times. For my case it's usually when some application hangs on the GPU and no one notices, then these tools hang forever and it takes awhile to notice. Figured an error is friendlier since I highly doubt these tools will take over ten seconds to run even in the worst case. > On machines where NVIDIA GPUs are used for compute only (e.g. a headless > server machine), NVIDIA drivers are not always loaded by default and may not > have driver persistence enabled. What's the config to set this by default without any graphics? Would be nice to not need to worry about it on my dev machine. > For the GPU detection, we may be able to work around the issue by leaving the > detection app running for the duration of the compilation, and prevent driver > unloading, but it's a rather gross hack. I know for AMD stuff we used to just probe the PCI connections, but that leaked a lot of information so this is the easier way to do it. I wonder what `__nvcc_device_query` does internally. https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
Artem-B wrote: Ooh... I think I know exactly what may be causing this. On machines where NVIDIA GPUs are used for compute only (e.g. a headless server machine), NVIDIA drivers are not always loaded by default and may not have driver persistence enabled. The drivers get loaded when GPU is accessed, and then released and unloaded when there are no GPU users remaining. A parallel compilation with `--offload-arch=native` will be the worst-case stress test scenario for the driver init/deinit machinery, as GPU probing is both short-living and will be done repeatedly. Adding a timeout here would help, sort of, but it would be much better if we could figure out a way to either detect that GPU probing takes too long (and likely causes the driver to load/unload), or cache probing results somehow, so we do not have to run the same detection over and over again. This is a point towards pushing the detection out of clang into the build system, which would be the better place to do it. For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation, and prevent driver unloading, but it's a rather gross hack. https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
https://github.com/jhuber6 closed https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
https://github.com/arsenm approved this pull request. https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
https://github.com/jhuber6 updated https://github.com/llvm/llvm-project/pull/94751 >From 0e367c72a1cc163fd781f98b9fac809d90f4beb7 Mon Sep 17 00:00:00 2001 From: Joseph Huber Date: Fri, 7 Jun 2024 08:15:06 -0500 Subject: [PATCH] [Clang] Add timeout for GPU detection utilities Summary: The utilities `nvptx-arch` and `amdgpu-arch` are used to support `--offload-arch=native` among other utilities in clang. However, these rely on the GPU drivers to query the features. In certain cases these drivers can become locked up, which will lead to indefinate hangs on any compiler jobs running in the meantime. This patch adds a ten second timeout period for these utilities before it kills the job and errors out. --- clang/include/clang/Driver/ToolChain.h | 3 ++- clang/lib/Driver/ToolChain.cpp | 8 clang/lib/Driver/ToolChains/AMDGPU.cpp | 2 +- clang/lib/Driver/ToolChains/Cuda.cpp | 2 +- 4 files changed, 8 insertions(+), 7 deletions(-) diff --git a/clang/include/clang/Driver/ToolChain.h b/clang/include/clang/Driver/ToolChain.h index a4f9cad98aa8b..9789cfacafd78 100644 --- a/clang/include/clang/Driver/ToolChain.h +++ b/clang/include/clang/Driver/ToolChain.h @@ -205,7 +205,8 @@ class ToolChain { /// Executes the given \p Executable and returns the stdout. llvm::Expected> - executeToolChainProgram(StringRef Executable) const; + executeToolChainProgram(StringRef Executable, + unsigned SecondsToWait = 0) const; void setTripleEnvironment(llvm::Triple::EnvironmentType Env); diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp index 0e86bc07e0ea2..40ab2e91125d1 100644 --- a/clang/lib/Driver/ToolChain.cpp +++ b/clang/lib/Driver/ToolChain.cpp @@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T, } llvm::Expected> -ToolChain::executeToolChainProgram(StringRef Executable) const { +ToolChain::executeToolChainProgram(StringRef Executable, + unsigned SecondsToWait) const { llvm::SmallString<64> OutputFile; llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile); llvm::FileRemover OutputRemover(OutputFile.c_str()); @@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) const { }; std::string ErrorMessage; - if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, -/* SecondsToWait */ 0, -/*MemoryLimit*/ 0, &ErrorMessage)) + if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, SecondsToWait, +/*MemoryLimit=*/0, &ErrorMessage)) return llvm::createStringError(std::error_code(), Executable + ": " + ErrorMessage); diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp index 9ffea57b005de..11a98a0ec314d 100644 --- a/clang/lib/Driver/ToolChains/AMDGPU.cpp +++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp @@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) const { else Program = GetProgramPath("amdgpu-arch"); - auto StdoutOrErr = executeToolChainProgram(Program); + auto StdoutOrErr = executeToolChainProgram(Program, /*SecondsToWait=*/10); if (!StdoutOrErr) return StdoutOrErr.takeError(); diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp index bbc8be91fd70b..2dfc7457b0ac7 100644 --- a/clang/lib/Driver/ToolChains/Cuda.cpp +++ b/clang/lib/Driver/ToolChains/Cuda.cpp @@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const { else Program = GetProgramPath("nvptx-arch"); - auto StdoutOrErr = executeToolChainProgram(Program); + auto StdoutOrErr = executeToolChainProgram(Program, /*SecondsToWait=*/10); if (!StdoutOrErr) return StdoutOrErr.takeError(); ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
@@ -205,7 +205,7 @@ class ToolChain { /// Executes the given \p Executable and returns the stdout. llvm::Expected> - executeToolChainProgram(StringRef Executable) const; + executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const; arsenm wrote: Name this SecondsToWait to match ExecuteAndWait? https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
jhuber6 wrote: No active test because I have no clue how you would, but I intentionally made it time out and it returns a 'Child timed out` error as expected. https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
llvmbot wrote: @llvm/pr-subscribers-backend-amdgpu @llvm/pr-subscribers-clang-driver Author: Joseph Huber (jhuber6) Changes Summary: The utilities `nvptx-arch` and `amdgpu-arch` are used to support `--offload-arch=native` among other utilities in clang. However, these rely on the GPU drivers to query the features. In certain cases these drivers can become locked up, which will lead to indefinate hangs on any compiler jobs running in the meantime. This patch adds a ten second timeout period for these utilities before it kills the job and errors out. --- Full diff: https://github.com/llvm/llvm-project/pull/94751.diff 4 Files Affected: - (modified) clang/include/clang/Driver/ToolChain.h (+1-1) - (modified) clang/lib/Driver/ToolChain.cpp (+4-4) - (modified) clang/lib/Driver/ToolChains/AMDGPU.cpp (+1-1) - (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+1-1) ``diff diff --git a/clang/include/clang/Driver/ToolChain.h b/clang/include/clang/Driver/ToolChain.h index a4f9cad98aa8b..87a5034dfd78b 100644 --- a/clang/include/clang/Driver/ToolChain.h +++ b/clang/include/clang/Driver/ToolChain.h @@ -205,7 +205,7 @@ class ToolChain { /// Executes the given \p Executable and returns the stdout. llvm::Expected> - executeToolChainProgram(StringRef Executable) const; + executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const; void setTripleEnvironment(llvm::Triple::EnvironmentType Env); diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp index 0e86bc07e0ea2..8c746ac8066cb 100644 --- a/clang/lib/Driver/ToolChain.cpp +++ b/clang/lib/Driver/ToolChain.cpp @@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T, } llvm::Expected> -ToolChain::executeToolChainProgram(StringRef Executable) const { +ToolChain::executeToolChainProgram(StringRef Executable, + unsigned Timeout) const { llvm::SmallString<64> OutputFile; llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile); llvm::FileRemover OutputRemover(OutputFile.c_str()); @@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) const { }; std::string ErrorMessage; - if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, -/* SecondsToWait */ 0, -/*MemoryLimit*/ 0, &ErrorMessage)) + if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout, +/*MemoryLimit=*/0, &ErrorMessage)) return llvm::createStringError(std::error_code(), Executable + ": " + ErrorMessage); diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp index 9ffea57b005de..92895d8186e83 100644 --- a/clang/lib/Driver/ToolChains/AMDGPU.cpp +++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp @@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) const { else Program = GetProgramPath("amdgpu-arch"); - auto StdoutOrErr = executeToolChainProgram(Program); + auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10); if (!StdoutOrErr) return StdoutOrErr.takeError(); diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp index bbc8be91fd70b..47dac0e439f10 100644 --- a/clang/lib/Driver/ToolChains/Cuda.cpp +++ b/clang/lib/Driver/ToolChains/Cuda.cpp @@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const { else Program = GetProgramPath("nvptx-arch"); - auto StdoutOrErr = executeToolChainProgram(Program); + auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10); if (!StdoutOrErr) return StdoutOrErr.takeError(); `` https://github.com/llvm/llvm-project/pull/94751 ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)
https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/94751 Summary: The utilities `nvptx-arch` and `amdgpu-arch` are used to support `--offload-arch=native` among other utilities in clang. However, these rely on the GPU drivers to query the features. In certain cases these drivers can become locked up, which will lead to indefinate hangs on any compiler jobs running in the meantime. This patch adds a ten second timeout period for these utilities before it kills the job and errors out. >From a7bcd6b0568b00d0cac9bf0a6f9b17ca681425f2 Mon Sep 17 00:00:00 2001 From: Joseph Huber Date: Fri, 7 Jun 2024 08:15:06 -0500 Subject: [PATCH] [Clang] Add timeout for GPU detection utilities Summary: The utilities `nvptx-arch` and `amdgpu-arch` are used to support `--offload-arch=native` among other utilities in clang. However, these rely on the GPU drivers to query the features. In certain cases these drivers can become locked up, which will lead to indefinate hangs on any compiler jobs running in the meantime. This patch adds a ten second timeout period for these utilities before it kills the job and errors out. --- clang/include/clang/Driver/ToolChain.h | 2 +- clang/lib/Driver/ToolChain.cpp | 8 clang/lib/Driver/ToolChains/AMDGPU.cpp | 2 +- clang/lib/Driver/ToolChains/Cuda.cpp | 2 +- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/clang/include/clang/Driver/ToolChain.h b/clang/include/clang/Driver/ToolChain.h index a4f9cad98aa8b..87a5034dfd78b 100644 --- a/clang/include/clang/Driver/ToolChain.h +++ b/clang/include/clang/Driver/ToolChain.h @@ -205,7 +205,7 @@ class ToolChain { /// Executes the given \p Executable and returns the stdout. llvm::Expected> - executeToolChainProgram(StringRef Executable) const; + executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const; void setTripleEnvironment(llvm::Triple::EnvironmentType Env); diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp index 0e86bc07e0ea2..8c746ac8066cb 100644 --- a/clang/lib/Driver/ToolChain.cpp +++ b/clang/lib/Driver/ToolChain.cpp @@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T, } llvm::Expected> -ToolChain::executeToolChainProgram(StringRef Executable) const { +ToolChain::executeToolChainProgram(StringRef Executable, + unsigned Timeout) const { llvm::SmallString<64> OutputFile; llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile); llvm::FileRemover OutputRemover(OutputFile.c_str()); @@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) const { }; std::string ErrorMessage; - if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, -/* SecondsToWait */ 0, -/*MemoryLimit*/ 0, &ErrorMessage)) + if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout, +/*MemoryLimit=*/0, &ErrorMessage)) return llvm::createStringError(std::error_code(), Executable + ": " + ErrorMessage); diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp index 9ffea57b005de..92895d8186e83 100644 --- a/clang/lib/Driver/ToolChains/AMDGPU.cpp +++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp @@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) const { else Program = GetProgramPath("amdgpu-arch"); - auto StdoutOrErr = executeToolChainProgram(Program); + auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10); if (!StdoutOrErr) return StdoutOrErr.takeError(); diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp index bbc8be91fd70b..47dac0e439f10 100644 --- a/clang/lib/Driver/ToolChains/Cuda.cpp +++ b/clang/lib/Driver/ToolChains/Cuda.cpp @@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const { else Program = GetProgramPath("nvptx-arch"); - auto StdoutOrErr = executeToolChainProgram(Program); + auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10); if (!StdoutOrErr) return StdoutOrErr.takeError(); ___ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits