Your message dated Mon, 13 Sep 2021 18:16:15 +0200
with message-id <20210913161615.wpmcyar6w4fyj4cm@begin>
and subject line Re: Bug#994049: libhwloc-contrib-plugins: hwloc displays misleading CUDA and NVML errors when running MPI programs
has caused the Debian Bug report #994049,
regarding libhwloc-contrib-plugins: hwloc displays misleading CUDA and NVML errors when running MPI programs
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)

--
994049: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=994049
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: libhwloc-contrib-plugins
Version: 2.4.1+dfsg-2
Severity: important

Dear Maintainer,

When the libhwloc-contrib-plugins package is installed, running any MPI
program on a Debian 11 host with no GPU produces the following errors:

    $ mpirun hostname
    CUDA: Failed to get number of devices with cudaGetDeviceCount(): no CUDA-capable device is detected
    NVML: Failed to initialize with nvmlInit(): Driver Not Loaded
    CUDA: Failed to get number of devices with cudaGetDeviceCount(): no CUDA-capable device is detected
    NVML: Failed to initialize with nvmlInit(): Driver Not Loaded
    dahu-28.grenoble.grid5000.fr

For complex programs, it is quite hard to understand where these messages
come from and what the exact problem is.

After investigation, it turns out that these messages are "warnings" and
don't prevent the program from executing, so they can be ignored. But
when the program fails for unrelated reasons, these messages can mislead
the user into thinking the problem is CUDA-related, while it's actually
not.

The expected behaviour is that hwloc should not print warnings about
hardware detection when nothing is actually wrong.

This bug has already been fixed upstream in version 2.5.0rc1:

    835dfbe577fcd7 ("core: don't display "less critical" error messages by default")
    https://github.com/open-mpi/hwloc/issues/453

Would it be possible to backport this patch to Debian stable or, as an
alternative, publish hwloc 2.5.0 in bullseye-backports?

Thanks for your time,
Baptiste

-- System Information:
Debian Release: 11.0
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-8-amd64 (SMP w/64 CPU threads)
Kernel taint flags: TAINT_FIRMWARE_WORKAROUND
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8), LANGUAGE=en_US:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages libhwloc-contrib-plugins depends on:
ii  libc6          2.31-13
ii  libcudart11.0  5000.0g5k1
ii  libhwloc15     2.4.1+dfsg-1
ii  libnvidia-ml1  5000.0g5k1

libhwloc-contrib-plugins recommends no packages.

libhwloc-contrib-plugins suggests no packages.

-- no debconf information
--- End Message ---
--- Begin Message ---
Version: 2.5.0~

Brice Goglin, on Mon. 13 Sep. 2021 18:03:48 +0200, wrote:
> On 13/09/2021 at 17:42, Baptiste Jonglez wrote:
> > On Sat, Sep 11, 2021 at 12:41:04AM +0200, Samuel Thibault wrote:
> > > Hello,
> > > > Would it be possible to backport this patch to Debian stable or,
> > > > as an alternative, publish hwloc 2.5.0 in bullseye-backports?
> > > I was waiting for a good reason to take the time to backport hwloc 2.5,
> > > that was one, it's now in backports-NEW :)
> > Perfect, thanks :)
> >
> > > That said perhaps we can ask d-release for a stable upload, Brice what
> > > do you think?
> > The bullseye-backports package will be enough for us, but of course having
> > the fix in stable is also nice (if it's not considered too intrusive).
>
>
> The problem should only occur if you managed to install NVIDIA drivers
> without a NVIDIA GPU (which isn't possible by default IIRC). It basically
> means deploying a GPU node image on a non-GPU node. I've seen only one other
> report like this (but it wasn't running a Debian-based distro). I don't
> think we need a stable fix for now.

Ok, thus closing accordingly.

Thanks!
Samuel
--- End Message ---

