[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Artem Belevich via cfe-commits

Artem-B wrote:

> What's the config to set this by default without any graphics? 

https://docs.nvidia.com/deploy/driver-persistence/index.html

I usually use "nvidia-smi -i  -pm ENABLED" to force the driver to 
be loaded permanently. 

As for `__nvcc_device_query`, my guess is that it just uses a handful of driver 
API calls. Nothing unusual, it's probably the standard sequence of 
cuInit/cuDeviceGetCount/cuDeviceGetAttribute like in this example: 
https://github.com/NVIDIA/cuda-samples/blob/5f97d7d0dff880bc6567faa4c5e62e389a6d6999/Samples/1_Utilities/deviceQueryDrv/deviceQueryDrv.cpp#L56

```
  checkCudaErrors(cuInit(0));
  checkCudaErrors(cuDeviceGetCount(&deviceCount));
  for (dev = 0; dev < deviceCount; ++dev) {
checkCudaErrors(cuDeviceGetName(deviceName, 256, dev));
...
  }
```

https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Joseph Huber via cfe-commits

jhuber6 wrote:

> Ooh... I think I know exactly what may be causing this.

I've observed this a few times. For my case it's usually when some application 
hangs on the GPU and no one notices, then these tools hang forever and it takes 
awhile to notice. Figured an error is friendlier since I highly doubt these 
tools will take over ten seconds to run even in the worst case.

> On machines where NVIDIA GPUs are used for compute only (e.g. a headless 
> server machine), NVIDIA drivers are not always loaded by default and may not 
> have driver persistence enabled.

What's the config to set this by default without any graphics? Would be nice to 
not need to worry about it on my dev machine.



> For the GPU detection, we may be able to work around the issue by leaving the 
> detection app running for the duration of the compilation, and prevent driver 
> unloading, but it's a rather gross hack.

I know for AMD stuff we used to just probe the PCI connections, but that leaked 
a lot of information so this is the easier way to do it. I wonder what 
`__nvcc_device_query` does internally.

https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Artem Belevich via cfe-commits

Artem-B wrote:

Ooh... I think I know exactly what may be causing this.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server 
machine), NVIDIA drivers are not always loaded by default and may not have 
driver persistence enabled. The drivers get loaded when GPU is accessed, and 
then released and unloaded when there are no GPU users remaining. A parallel 
compilation with `--offload-arch=native` will be the worst-case stress test 
scenario for the driver init/deinit machinery, as GPU probing is both 
short-living and will be done repeatedly.

Adding a timeout here would help, sort of, but it would be much better if we 
could figure out a way to either detect that GPU probing takes too long (and 
likely causes the driver to load/unload), or cache probing results somehow, so 
we do not have to run the same detection over and over again. This is a point 
towards pushing the detection out of clang into the build system, which would 
be the better place to do it.

For the GPU detection, we may be able to work around the issue by leaving the 
detection app running for the duration of the compilation, and prevent driver 
unloading, but it's a rather gross hack.

https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 closed 
https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Matt Arsenault via cfe-commits

https://github.com/arsenm approved this pull request.


https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 updated 
https://github.com/llvm/llvm-project/pull/94751

>From 0e367c72a1cc163fd781f98b9fac809d90f4beb7 Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Fri, 7 Jun 2024 08:15:06 -0500
Subject: [PATCH] [Clang] Add timeout for GPU detection utilities

Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native` among other utilities in clang. However, these
rely on the GPU drivers to query the features. In certain cases these
drivers can become locked up, which will lead to indefinate hangs on any
compiler jobs running in the meantime.

This patch adds a ten second timeout period for these utilities before
it kills the job and errors out.
---
 clang/include/clang/Driver/ToolChain.h | 3 ++-
 clang/lib/Driver/ToolChain.cpp | 8 
 clang/lib/Driver/ToolChains/AMDGPU.cpp | 2 +-
 clang/lib/Driver/ToolChains/Cuda.cpp   | 2 +-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/clang/include/clang/Driver/ToolChain.h 
b/clang/include/clang/Driver/ToolChain.h
index a4f9cad98aa8b..9789cfacafd78 100644
--- a/clang/include/clang/Driver/ToolChain.h
+++ b/clang/include/clang/Driver/ToolChain.h
@@ -205,7 +205,8 @@ class ToolChain {
 
   /// Executes the given \p Executable and returns the stdout.
   llvm::Expected>
-  executeToolChainProgram(StringRef Executable) const;
+  executeToolChainProgram(StringRef Executable,
+  unsigned SecondsToWait = 0) const;
 
   void setTripleEnvironment(llvm::Triple::EnvironmentType Env);
 
diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp
index 0e86bc07e0ea2..40ab2e91125d1 100644
--- a/clang/lib/Driver/ToolChain.cpp
+++ b/clang/lib/Driver/ToolChain.cpp
@@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T,
 }
 
 llvm::Expected>
-ToolChain::executeToolChainProgram(StringRef Executable) const {
+ToolChain::executeToolChainProgram(StringRef Executable,
+   unsigned SecondsToWait) const {
   llvm::SmallString<64> OutputFile;
   llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile);
   llvm::FileRemover OutputRemover(OutputFile.c_str());
@@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) 
const {
   };
 
   std::string ErrorMessage;
-  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects,
-/* SecondsToWait */ 0,
-/*MemoryLimit*/ 0, &ErrorMessage))
+  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, SecondsToWait,
+/*MemoryLimit=*/0, &ErrorMessage))
 return llvm::createStringError(std::error_code(),
Executable + ": " + ErrorMessage);
 
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp 
b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 9ffea57b005de..11a98a0ec314d 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) 
const {
   else
 Program = GetProgramPath("amdgpu-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*SecondsToWait=*/10);
   if (!StdoutOrErr)
 return StdoutOrErr.takeError();
 
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp 
b/clang/lib/Driver/ToolChains/Cuda.cpp
index bbc8be91fd70b..2dfc7457b0ac7 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) 
const {
   else
 Program = GetProgramPath("nvptx-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*SecondsToWait=*/10);
   if (!StdoutOrErr)
 return StdoutOrErr.takeError();
 

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Matt Arsenault via cfe-commits


@@ -205,7 +205,7 @@ class ToolChain {
 
   /// Executes the given \p Executable and returns the stdout.
   llvm::Expected>
-  executeToolChainProgram(StringRef Executable) const;
+  executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const;

arsenm wrote:

Name this SecondsToWait to match ExecuteAndWait? 

https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Joseph Huber via cfe-commits

jhuber6 wrote:

No active test because I have no clue how you would, but I intentionally made 
it time out and it returns a 'Child timed out` error as expected.

https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread via cfe-commits

llvmbot wrote:



@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-clang-driver

Author: Joseph Huber (jhuber6)


Changes

Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native` among other utilities in clang. However, these
rely on the GPU drivers to query the features. In certain cases these
drivers can become locked up, which will lead to indefinate hangs on any
compiler jobs running in the meantime.

This patch adds a ten second timeout period for these utilities before
it kills the job and errors out.


---
Full diff: https://github.com/llvm/llvm-project/pull/94751.diff


4 Files Affected:

- (modified) clang/include/clang/Driver/ToolChain.h (+1-1) 
- (modified) clang/lib/Driver/ToolChain.cpp (+4-4) 
- (modified) clang/lib/Driver/ToolChains/AMDGPU.cpp (+1-1) 
- (modified) clang/lib/Driver/ToolChains/Cuda.cpp (+1-1) 


``diff
diff --git a/clang/include/clang/Driver/ToolChain.h 
b/clang/include/clang/Driver/ToolChain.h
index a4f9cad98aa8b..87a5034dfd78b 100644
--- a/clang/include/clang/Driver/ToolChain.h
+++ b/clang/include/clang/Driver/ToolChain.h
@@ -205,7 +205,7 @@ class ToolChain {
 
   /// Executes the given \p Executable and returns the stdout.
   llvm::Expected>
-  executeToolChainProgram(StringRef Executable) const;
+  executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const;
 
   void setTripleEnvironment(llvm::Triple::EnvironmentType Env);
 
diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp
index 0e86bc07e0ea2..8c746ac8066cb 100644
--- a/clang/lib/Driver/ToolChain.cpp
+++ b/clang/lib/Driver/ToolChain.cpp
@@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T,
 }
 
 llvm::Expected>
-ToolChain::executeToolChainProgram(StringRef Executable) const {
+ToolChain::executeToolChainProgram(StringRef Executable,
+   unsigned Timeout) const {
   llvm::SmallString<64> OutputFile;
   llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile);
   llvm::FileRemover OutputRemover(OutputFile.c_str());
@@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) 
const {
   };
 
   std::string ErrorMessage;
-  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects,
-/* SecondsToWait */ 0,
-/*MemoryLimit*/ 0, &ErrorMessage))
+  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout,
+/*MemoryLimit=*/0, &ErrorMessage))
 return llvm::createStringError(std::error_code(),
Executable + ": " + ErrorMessage);
 
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp 
b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 9ffea57b005de..92895d8186e83 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) 
const {
   else
 Program = GetProgramPath("amdgpu-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
 return StdoutOrErr.takeError();
 
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp 
b/clang/lib/Driver/ToolChains/Cuda.cpp
index bbc8be91fd70b..47dac0e439f10 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) 
const {
   else
 Program = GetProgramPath("nvptx-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
 return StdoutOrErr.takeError();
 

``




https://github.com/llvm/llvm-project/pull/94751
___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

2024-06-07 Thread Joseph Huber via cfe-commits

https://github.com/jhuber6 created 
https://github.com/llvm/llvm-project/pull/94751

Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native` among other utilities in clang. However, these
rely on the GPU drivers to query the features. In certain cases these
drivers can become locked up, which will lead to indefinate hangs on any
compiler jobs running in the meantime.

This patch adds a ten second timeout period for these utilities before
it kills the job and errors out.


>From a7bcd6b0568b00d0cac9bf0a6f9b17ca681425f2 Mon Sep 17 00:00:00 2001
From: Joseph Huber 
Date: Fri, 7 Jun 2024 08:15:06 -0500
Subject: [PATCH] [Clang] Add timeout for GPU detection utilities

Summary:
The utilities `nvptx-arch` and `amdgpu-arch` are used to support
`--offload-arch=native` among other utilities in clang. However, these
rely on the GPU drivers to query the features. In certain cases these
drivers can become locked up, which will lead to indefinate hangs on any
compiler jobs running in the meantime.

This patch adds a ten second timeout period for these utilities before
it kills the job and errors out.
---
 clang/include/clang/Driver/ToolChain.h | 2 +-
 clang/lib/Driver/ToolChain.cpp | 8 
 clang/lib/Driver/ToolChains/AMDGPU.cpp | 2 +-
 clang/lib/Driver/ToolChains/Cuda.cpp   | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/clang/include/clang/Driver/ToolChain.h 
b/clang/include/clang/Driver/ToolChain.h
index a4f9cad98aa8b..87a5034dfd78b 100644
--- a/clang/include/clang/Driver/ToolChain.h
+++ b/clang/include/clang/Driver/ToolChain.h
@@ -205,7 +205,7 @@ class ToolChain {
 
   /// Executes the given \p Executable and returns the stdout.
   llvm::Expected>
-  executeToolChainProgram(StringRef Executable) const;
+  executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const;
 
   void setTripleEnvironment(llvm::Triple::EnvironmentType Env);
 
diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp
index 0e86bc07e0ea2..8c746ac8066cb 100644
--- a/clang/lib/Driver/ToolChain.cpp
+++ b/clang/lib/Driver/ToolChain.cpp
@@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T,
 }
 
 llvm::Expected>
-ToolChain::executeToolChainProgram(StringRef Executable) const {
+ToolChain::executeToolChainProgram(StringRef Executable,
+   unsigned Timeout) const {
   llvm::SmallString<64> OutputFile;
   llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile);
   llvm::FileRemover OutputRemover(OutputFile.c_str());
@@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) 
const {
   };
 
   std::string ErrorMessage;
-  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects,
-/* SecondsToWait */ 0,
-/*MemoryLimit*/ 0, &ErrorMessage))
+  if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout,
+/*MemoryLimit=*/0, &ErrorMessage))
 return llvm::createStringError(std::error_code(),
Executable + ": " + ErrorMessage);
 
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp 
b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 9ffea57b005de..92895d8186e83 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) 
const {
   else
 Program = GetProgramPath("amdgpu-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
 return StdoutOrErr.takeError();
 
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp 
b/clang/lib/Driver/ToolChains/Cuda.cpp
index bbc8be91fd70b..47dac0e439f10 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) 
const {
   else
 Program = GetProgramPath("nvptx-arch");
 
-  auto StdoutOrErr = executeToolChainProgram(Program);
+  auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
   if (!StdoutOrErr)
 return StdoutOrErr.takeError();
 

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits