Your message dated Mon, 13 Sep 2021 18:16:15 +0200
with message-id <20210913161615.wpmcyar6w4fyj4cm@begin>
and subject line Re: Bug#994049: libhwloc-contrib-plugins: hwloc displays 
misleading CUDA and NVML errors when running MPI programs
has caused the Debian Bug report #994049,
regarding libhwloc-contrib-plugins: hwloc displays misleading CUDA and NVML 
errors when running MPI programs
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
994049: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=994049
Debian Bug Tracking System
Contact [email protected] with problems
--- Begin Message ---
Package: libhwloc-contrib-plugins
Version: 2.4.1+dfsg-2
Severity: important

Dear Maintainer,

When the libhwloc-contrib-plugins package is installed, running any MPI
program on a Debian 11 host with no GPU produces the following errors:

    $ mpirun hostname
    CUDA: Failed to get number of devices with cudaGetDeviceCount(): no CUDA-capable device is detected
    NVML: Failed to initialize with nvmlInit(): Driver Not Loaded
    CUDA: Failed to get number of devices with cudaGetDeviceCount(): no CUDA-capable device is detected
    NVML: Failed to initialize with nvmlInit(): Driver Not Loaded
    dahu-28.grenoble.grid5000.fr

For complex programs, it is quite hard to understand where these messages
come from and what the exact problem is.  After investigation, it turns out
that these messages are "warnings" and don't prevent the program from
executing, so they can be ignored.  But when the program fails for unrelated
reasons, these messages can mislead the user into thinking the problem is
CUDA-related, while it's actually not.

The expected behaviour is that hwloc should not print warnings about
hardware detection when nothing is actually wrong.

This bug has already been fixed upstream in version 2.5.0rc1:

    835dfbe577fcd7 ("core: don't display "less critical" error messages by default")
    https://github.com/open-mpi/hwloc/issues/453

Would it be possible to backport this patch to Debian stable or,
as an alternative, publish hwloc 2.5.0 in bullseye-backports?

Thanks for your time,
Baptiste

-- System Information:
Debian Release: 11.0
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-8-amd64 (SMP w/64 CPU threads)
Kernel taint flags: TAINT_FIRMWARE_WORKAROUND
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8), LANGUAGE=en_US:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages libhwloc-contrib-plugins depends on:
ii  libc6          2.31-13
ii  libcudart11.0  5000.0g5k1
ii  libhwloc15     2.4.1+dfsg-1
ii  libnvidia-ml1  5000.0g5k1

libhwloc-contrib-plugins recommends no packages.

libhwloc-contrib-plugins suggests no packages.

-- no debconf information

--- End Message ---
--- Begin Message ---
Version: 2.5.0~

Brice Goglin, on Mon 13 Sep 2021 18:03:48 +0200, wrote:
> On 13/09/2021 at 17:42, Baptiste Jonglez wrote:
> > On Sat, Sep 11, 2021 at 12:41:04AM +0200, Samuel Thibault wrote:
> > > Hello,
> > > > Would it be possible to backport this patch to Debian stable or,
> > > > as an alternative, publish hwloc 2.5.0 in bullseye-backports?
> > > I was waiting for a good reason to take the time to backport hwloc 2.5,
> > > that was one, it's now in backports-NEW :)
> > Perfect, thanks :)
> > 
> > > That said perhaps we can ask d-release for a stable upload, Brice what
> > > do you think?
> > The bullseye-backports package will be enough for us, but of course having
> > the fix in stable is also nice (if it's not considered too intrusive).
> 
> 
> The problem should only occur if you managed to install NVIDIA drivers
> without an NVIDIA GPU (which isn't possible by default IIRC). It basically
> means deploying a GPU node image on a non-GPU node. I've seen only one other
> report like this (but it wasn't running a Debian-based distro). I don't
> think we need a stable fix for now.

Ok, thus closing accordingly.

Thanks!
Samuel

--- End Message ---
