Re: [slurm-users] Wrong hwloc detected?
On 5/11/21 4:47 am, Diego Zuccato wrote:

> How can Slurm detect such an old HWLOC version?

Looking at the code, it's not actually checking the hwloc version; it's finding an error condition and suggesting that may be the cause, but it sounds like it's not for you.

src/plugins/task/cgroup/task_cgroup_cpuset.c:

    /* should never happen in normal scenario */
    if ((sock_loop > npdist) && !hwloc_success) {
        /* hwloc_get_obj_below_by_type() fails if no CPU set
         * configured, see hwloc documentation for details */
        error("hwloc_get_obj_below_by_type() failing, "
              "task/affinity plugin may be required to address bug "
              "fixed in HWLOC version 1.11.5");
        return XCGROUP_ERROR;
    }

[...]

If you've got support from SchedMD, open a bug with them, but if not and you're using the Debian packages, I'd suggest opening a bug with Debian about it.

Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
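Since the message is printed on a failure path rather than after a real version check, one quick way to rule out the hint is to compare the installed hwloc version against the 1.11.5 named in the error text. The sketch below does only that. It assumes the `hwloc-info --version` output looks like the line shown in this thread (`hwloc-info 2.4.1`); the helper names are made up for illustration and are not part of Slurm or hwloc.

```python
import re

# Version named in the Slurm error message.
FIXED_IN = (1, 11, 5)


def parse_hwloc_version(output: str) -> tuple:
    """Extract (major, minor, patch) from text such as 'hwloc-info 2.4.1'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def has_fix(output: str) -> bool:
    """True if the reported version is at least the one named in the message."""
    return parse_hwloc_version(output) >= FIXED_IN


if __name__ == "__main__":
    # Output string copied from the thread.
    print(has_fix("hwloc-info 2.4.1"))
```

This only tells you what the command-line tool reports, not which hwloc library the task/cgroup plugin was actually built against, so treat a positive result as one data point rather than proof.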
Re: [slurm-users] Wrong hwloc detected?
Hi Ole.

I'm using the packages from Debian stable (slurm 20.11.4, hwloc 2.4.1). And I checked: hwloc is installed on all the nodes. Quite obvious, since it's a dep for slurmd:
https://packages.debian.org/bullseye/slurmd

Being a dep, I "suspect" slurmd is built with hwloc support.

Diego

On 07/11/2021 20:22, Ole Holm Nielsen wrote:
> Hi Diego,
>
> Are you sure that the Slurm software installed on all compute nodes was actually built on a system which had the hwloc packages installed? They should also be installed on the compute nodes. The prerequisite packages are listed here:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites
>
> /Ole
>
> On 05-11-2021 15:38, Diego Zuccato wrote:
>> They aren't using modules so it must be something system-wide :(
>> But not all jobs are impacted. And it seems it's a bit random (doesn't happen always).
>> I'm out of ideas, currently :(
>>
>> On 05/11/2021 13:10, Ole Holm Nielsen wrote:
>>> On 11/5/21 12:47, Diego Zuccato wrote:
>>>> Some users are reporting this error:
>>>> slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5
>>>> slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'
>>>>
>>>> I checked on that node and hwloc is newer:
>>>> diego.zuccato@str957-mtx-01:~$ hwloc-info --version
>>>> hwloc-info 2.4.1
>>>>
>>>> How can Slurm detect such an old HWLOC version?
>>>
>>> Maybe the user loads a software module which also loads an old hwloc module? The user should do "module list" in the job to verify this.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] Wrong hwloc detected?
Hi Diego,

Are you sure that the Slurm software installed on all compute nodes was actually built on a system which had the hwloc packages installed? They should also be installed on the compute nodes. The prerequisite packages are listed here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites

/Ole

On 05-11-2021 15:38, Diego Zuccato wrote:
> They aren't using modules so it must be something system-wide :(
> But not all jobs are impacted. And it seems it's a bit random (doesn't happen always).
> I'm out of ideas, currently :(
>
> On 05/11/2021 13:10, Ole Holm Nielsen wrote:
>> On 11/5/21 12:47, Diego Zuccato wrote:
>>> Some users are reporting this error:
>>> slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5
>>> slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'
>>>
>>> I checked on that node and hwloc is newer:
>>> diego.zuccato@str957-mtx-01:~$ hwloc-info --version
>>> hwloc-info 2.4.1
>>>
>>> How can Slurm detect such an old HWLOC version?
>>
>> Maybe the user loads a software module which also loads an old hwloc module? The user should do "module list" in the job to verify this.
Re: [slurm-users] Wrong hwloc detected?
They aren't using modules so it must be something system-wide :(
But not all jobs are impacted. And it seems it's a bit random (doesn't happen always).
I'm out of ideas, currently :(

On 05/11/2021 13:10, Ole Holm Nielsen wrote:
> On 11/5/21 12:47, Diego Zuccato wrote:
>> Some users are reporting this error:
>> slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5
>> slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'
>>
>> I checked on that node and hwloc is newer:
>> diego.zuccato@str957-mtx-01:~$ hwloc-info --version
>> hwloc-info 2.4.1
>>
>> How can Slurm detect such an old HWLOC version?
>
> Maybe the user loads a software module which also loads an old hwloc module? The user should do "module list" in the job to verify this.
>
> My 2 cents,
> Ole

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Re: [slurm-users] Wrong hwloc detected?
On 11/5/21 12:47, Diego Zuccato wrote:
> Some users are reporting this error:
> slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5
> slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'
>
> I checked on that node and hwloc is newer:
> diego.zuccato@str957-mtx-01:~$ hwloc-info --version
> hwloc-info 2.4.1
>
> How can Slurm detect such an old HWLOC version?

Maybe the user loads a software module which also loads an old hwloc module? The user should do "module list" in the job to verify this.

My 2 cents,
Ole
[slurm-users] Wrong hwloc detected?
Hello all.

Some users are reporting this error:

    slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5
    slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'

I checked on that node and hwloc is newer:

    diego.zuccato@str957-mtx-01:~$ hwloc-info --version
    hwloc-info 2.4.1

How can Slurm detect such an old HWLOC version?

BTW I'm using TaskPlugin=task/cgroup in slurm.conf, and cgroup.conf contains:

    CgroupMountpoint=/sys/fs/cgroup
    ConstrainCores=yes
    TaskAffinity=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
    MemorySwappiness=0
    MaxSwapPercent=0
    AllowedSwapSpace=0

Any ideas?

Tks.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
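The second error line is worth reading on its own: a taskset mask is a hexadecimal bitmask in which bit N stands for CPU N, so '0x0' selects no CPU at all. Below is a small sketch for decoding such masks; the function name is just for illustration and is not anything from Slurm or hwloc.

```python
def cpus_in_mask(mask: str) -> list:
    """Return the CPU indices whose bits are set in a hex mask such as '0x5'."""
    # With base 16, int() accepts the string with or without a '0x' prefix.
    value = int(mask, 16)
    cpus = []
    index = 0
    while value:
        if value & 1:
            cpus.append(index)
        value >>= 1
        index += 1
    return cpus


if __name__ == "__main__":
    # Mask copied from the error message in this thread.
    print(cpus_in_mask("0x0"))
```

For '0x0' the result is an empty list, i.e. task 0 was being bound to an empty CPU set, which cannot be applied. That points at the computed binding rather than at the hwloc version as such.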