Re: [slurm-users] Wrong hwloc detected?

2021-11-09 Thread Chris Samuel

On 5/11/21 4:47 am, Diego Zuccato wrote:


How can Slurm detect such an old HWLOC version?


Looking at the code, it isn't actually checking the hwloc version. It 
hits an error condition and suggests that an old hwloc may be the cause, 
but it sounds like that's not the case for you.


src/plugins/task/cgroup/task_cgroup_cpuset.c:

        /* should never happen in normal scenario */
        if ((sock_loop > npdist) && !hwloc_success) {
                /* hwloc_get_obj_below_by_type() fails if no CPU set
                 * configured, see hwloc documentation for details */
                error("hwloc_get_obj_below_by_type() failing, "
                      "task/affinity plugin may be required to address bug "
                      "fixed in HWLOC version 1.11.5");
                return XCGROUP_ERROR;
        }
        [...]
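
A minimal, self-contained sketch of the same control flow may make the point clearer. It is not Slurm code: check_lookup(), SKETCH_ERROR and SKETCH_SUCCESS are hypothetical stand-ins, and only the condition and the message text are taken from the excerpt above. It shows that the "1.11.5" in the log is a fixed string emitted whenever the lookup fails, and that no installed version is read at run time.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_ERROR   (-1) /* hypothetical stand-in for XCGROUP_ERROR */
#define SKETCH_SUCCESS 0    /* hypothetical success code */

/* The hint is a string literal; it never changes with the hwloc version. */
static const char *const HINT =
    "hwloc_get_obj_below_by_type() failing, "
    "task/affinity plugin may be required to address bug "
    "fixed in HWLOC version 1.11.5";

/* Hypothetical helper mirroring the condition in the excerpt: when the
 * search ran past npdist without a successful lookup, copy the hint into
 * msg and report an error. Otherwise leave msg empty and report success. */
static int check_lookup(int sock_loop, int npdist, int hwloc_success,
                        char *msg, size_t msg_len)
{
    msg[0] = '\0';
    if ((sock_loop > npdist) && !hwloc_success) {
        snprintf(msg, msg_len, "%s", HINT);
        return SKETCH_ERROR;
    }
    return SKETCH_SUCCESS;
}
```

Calling check_lookup(5, 4, 0, buf, sizeof(buf)) fills buf with the hint and returns SKETCH_ERROR, while the same call with hwloc_success set to 1 leaves buf empty. The installed hwloc version plays no part in either outcome.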


If you've got support from SchedMD, open a bug with them. If not, and 
you're using the Debian packages, I'd suggest opening a bug with Debian 
about it.


Best of luck!
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Wrong hwloc detected?

2021-11-08 Thread Diego Zuccato

Hi Ole.

I'm using the packages from Debian stable (slurm 20.11.4, hwloc 2.4.1).
I checked, and hwloc is installed on all the nodes, which is to be 
expected since it's a dependency of slurmd:

https://packages.debian.org/bullseye/slurmd
Since it's a dependency, I "suspect" slurmd is built with hwloc support.

Diego





--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Wrong hwloc detected?

2021-11-07 Thread Ole Holm Nielsen

Hi Diego,

Are you sure that the Slurm software installed on all compute nodes was 
actually built on a system which had the hwloc packages installed?  They 
should also be installed on the compute nodes.  The prerequisite 
packages are listed here:

https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites

/Ole






Re: [slurm-users] Wrong hwloc detected?

2021-11-05 Thread Diego Zuccato

They aren't using modules, so it must be something system-wide :(
But not all jobs are impacted, and it seems to be a bit random (it 
doesn't always happen).

I'm currently out of ideas :(




--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786



Re: [slurm-users] Wrong hwloc detected?

2021-11-05 Thread Ole Holm Nielsen

On 11/5/21 12:47, Diego Zuccato wrote:

Some users are reporting this error:

slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, 
task/affinity plugin may be required to address bug fixed in HWLOC version 
1.11.5

slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'

I checked on that node and hwloc is newer:
diego.zuccato@str957-mtx-01:~$ hwloc-info --version
hwloc-info 2.4.1

How can Slurm detect such an old HWLOC version?


Maybe the user loads a software module which also loads an old hwloc 
module? The user should run "module list" in the job to verify this.


My 2 cents,
Ole



[slurm-users] Wrong hwloc detected?

2021-11-05 Thread Diego Zuccato

Hello all.

Some users are reporting this error:

slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, 
task/affinity plugin may be required to address bug fixed in HWLOC 
version 1.11.5

slurmstepd-str957-mtx-01: error: task[0] unable to set taskset '0x0'
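
The second line is worth a closer look. A minimal sketch, assuming '0x0' is the hexadecimal rendering of a CPU mask with no bits set: the helpers mask_from_cpus() and mask_to_str() below are hypothetical and use a plain 64-bit integer rather than the cpu_set_t that real affinity code works with. On Linux, sched_setaffinity(2) rejects a mask that contains no usable CPU with EINVAL, which would be consistent with "unable to set taskset".

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build a bitmask with one bit per CPU id in
 * cpus[0..n-1]. Only ids 0..63 fit in this simplified mask. */
static uint64_t mask_from_cpus(const int *cpus, size_t n)
{
    uint64_t mask = 0;
    for (size_t i = 0; i < n; i++) {
        if (cpus[i] >= 0 && cpus[i] < 64)
            mask |= UINT64_C(1) << cpus[i];
    }
    return mask;
}

/* Hypothetical helper: render the mask in the style of the log line,
 * for example "0x5" for CPUs 0 and 2, or "0x0" for an empty set. */
static void mask_to_str(uint64_t mask, char *buf, size_t len)
{
    snprintf(buf, len, "0x%" PRIx64, mask);
}
```

With CPUs {0, 2} the rendered mask is "0x5"; with no CPUs at all it is "0x0". Under that assumption, the log line means the task was handed an empty CPU set, which ties in with the failed lookup reported on the line before it.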

I checked on that node and hwloc is newer:
diego.zuccato@str957-mtx-01:~$ hwloc-info --version
hwloc-info 2.4.1

How can Slurm detect such an old HWLOC version?

BTW I'm using
 TaskPlugin=task/cgroup
in slurm.conf, and cgroup.conf contains:
 CgroupMountpoint=/sys/fs/cgroup
 ConstrainCores=yes
 TaskAffinity=yes
 ConstrainRAMSpace=yes
 ConstrainSwapSpace=yes
 MemorySwappiness=0
 MaxSwapPercent=0
 AllowedSwapSpace=0
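
For comparison, the error text itself points at the task/affinity plugin. A hedged sketch of the stacked setup that the Slurm cgroup.conf documentation recommends is shown below. It is an assumption that this applies to the cluster in this thread, nothing here confirms it resolves the error, and the remaining cgroup.conf lines are simply carried over unchanged from the configuration above.

```
# slurm.conf (sketch): task/affinity sets the binding, task/cgroup fences
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (sketch): affinity is left to task/affinity
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
TaskAffinity=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
```

In this arrangement the task/cgroup plugin only confines tasks to their allocated cores, and the hwloc-based binding path that TaskAffinity=yes enables in task/cgroup is not used.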

Any ideas?

Thanks.

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786