[slurm-users] Re: GPU shards not exclusive

2024-02-29 Thread Reed Dier via slurm-users
Hi Will,

I appreciate your corroboration.

After we upgraded to 23.02.$latest, the issue seemed easier to reproduce than 
before.
However, it now appears to have subsided, and the only change I can 
potentially attribute that to is turning on
> SlurmctldParameters=rl_enable 
in slurm.conf.
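
In case it is useful to anyone else, this is roughly what the change looked 
like on our side (a sketch; whether a plain "scontrol reconfigure" is enough 
or slurmctld needs a restart may depend on your version):

# slurm.conf on the controller
SlurmctldParameters=rl_enable

# apply and confirm it took effect
scontrol reconfigure
scontrol show config | grep -i SlurmctldParameters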

And here’s hoping that 23.11 will offer even more in the future.

Reed

> On Feb 28, 2024, at 7:28 AM, wdennis--- via slurm-users 
>  wrote:
> 
> Hi Reed,
> 
> Unfortunately, we had the same issue with 22.05.9; SchedMD's advice was to 
> upgrade to 23.11.x, and this appears to have resolved the issue for us. 
> SchedMD support told us, "We did a lot of work regarding shards in the 
> 23.11 release."
> 
> HTH,
> Will
> 






[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-29 Thread Dietmar Rieder via slurm-users

Hi list,

I finally got it working. I had completely overlooked that I set 
OverSubscribe=EXCLUSIVE for the partition I was using for testing. Stupid me.
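
For the archives, a sketch of the kind of partition line that causes this 
(the partition and node names here are made up): with OverSubscribe=EXCLUSIVE 
every job is allocated whole nodes, so no per-job core constraint is applied 
and Cpus_allowed_list covers all CPUs.

# a partition defined like this allocates whole nodes, so no CPU cgroup limit is expected
PartitionName=test Nodes=apollo-[01-02] OverSubscribe=EXCLUSIVE State=UP
# with OverSubscribe=NO (or the parameter removed) jobs are constrained to their allocated cores again
PartitionName=test Nodes=apollo-[01-02] OverSubscribe=NO State=UP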


Sorry for the noise, and thanks again for your answers.

Best
   Dietmar

On 2/29/24 13:19, Dietmar Rieder via slurm-users wrote:

Hi Josef, hi list,

I have now rebuilt the RPMs from OpenHPC, but using the original sources from 
version 23.11.4.


The configure command that is generated from the spec is the following:

./configure --build=x86_64-redhat-linux-gnu \
--host=x86_64-redhat-linux-gnu \
--program-prefix= \
--disable-dependency-tracking \
--prefix=/usr \
--exec-prefix=/usr \
--bindir=/usr/bin \
--sbindir=/usr/sbin \
--sysconfdir=/etc/slurm \
--datadir=/usr/share \
--includedir=/usr/include \
--libdir=/usr/lib64 \
--libexecdir=/usr/libexec \
--localstatedir=/var \
--sharedstatedir=/var/lib \
--mandir=/usr/share/man \
--infodir=/usr/share/info \
--enable-multiple-slurmd \
--with-pmix=/opt/ohpc/admin/pmix \
--with-hwloc=/opt/ohpc/pub/libs/hwloc

(Am I missing something here?)

The configure output shows:

[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]

config.log:

dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1'

confdefs.h:

#define WITH_CGROUP 1
#define HAVE_BPF 1

However, I still can't see any CPU limits when I use sbatch to run a 
batch job.



$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 
'grep Cpus /proc/$$/status'


$ cat slurm-72.out
Cpus_allowed:   ,,
Cpus_allowed_list:  0-95
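
For what it's worth, a rough way to check whether the batch step ended up in 
a cgroup with a cpuset at all (the exact cgroup paths depend on the Slurm 
version and on the systemd setup, so the job_72 pattern below is only 
illustrative):

$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 'cat /proc/$$/cgroup'
$ find /sys/fs/cgroup -path '*job_72*' -name 'cpuset.cpus*' -exec cat {} + 2>/dev/null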


The logs from the head node (leto) and the compute node (apollo-01) show:


Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: 
_slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: 
batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 
for UID 50001
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 
for UID 50001

Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: 
_start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: 
JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: 
JobId=72 done



Best
   Dietmar

On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:

 > I'm running slurm 22.05.11 which is available with OpenHPC 3.x
 > Do you think an upgrade is needed?

I feel that a lot of Slurm operators tend not to use third-party sources 
of Slurm binaries, since you do not have the build environment fully in 
your hands.


But before making such a complex decision, perhaps look for the build logs 
of the Slurm you use (somewhere in the OpenHPC build system?) and check 
whether it was built with the libraries needed for cgroup v2 to work.
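
A rough sketch of what I would check on a RHEL-family system (the package 
and plugin names below are from memory and may differ between distributions 
and Slurm versions):

# build host: -devel packages commonly needed for the cgroup/v2 plugin
rpm -q dbus-devel kernel-headers

# compute node: is the unified cgroup v2 hierarchy actually in use?
stat -fc %T /sys/fs/cgroup          # "cgroup2fs" means pure cgroup v2
cat /sys/fs/cgroup/cgroup.controllers

# was the cgroup plugin built and installed at all? (file name may vary)
ls /usr/lib64/slurm/ | grep -i cgroup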


Not having the cgroup v2 dependencies at build time is only one of the 
possible causes, though.


josef












[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-29 Thread Miriam Olmi via slurm-users
Dear Josef,

Thanks a lot again for your help.
Unfortunately, I still cannot solve this problem.

According to the Slurm documentation
(https://slurm.schedmd.com/quickstart_admin.html#upgrade), I only have to
upgrade slurmdbd at the very beginning, and the cluster should be able to
work even with slurmdbd-23.11.0-1 alongside slurmctld-23.02.3-1 and
slurmd-23.02.3-1.
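
For reference, the mixed versions can be double-checked with standard
commands like these (exact output formatting varies by release):

$ slurmdbd -V        # on the database host, should report 23.11.x
$ scontrol version   # reports the locally installed Slurm version
$ rpm -qa 'slurm*'   # lists all installed slurm packages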

The slurmdbd-23.11.0-1 package should provide the following files:

$ rpm -ql slurm-slurmdbd-23.11.0-1-no-frontend.el8.x86_64.rpm
/usr/lib/.build-id
/usr/lib/.build-id/01
/usr/lib/.build-id/01/da333fd28f1765164e46d00569ca55e55eb066
/usr/lib/.build-id/e7/4ab5829ee8f5b959cd71d47077cb09fb40fb54
/usr/lib/systemd/system/slurmdbd.service
/usr/lib64/slurm/accounting_storage_mysql.so
/usr/sbin/slurmdbd


I checked on my cluster and all of these files are present and come from
slurmdbd-23.11.0-1, as you can see from the example for these two files:

$ rpm -q --whatprovides /usr/lib/systemd/system/slurmdbd.service
slurm-slurmdbd-23.11.0-1.el8.x86_64
$ rpm -q --whatprovides /usr/lib64/slurm/accounting_storage_mysql.so
slurm-slurmdbd-23.11.0-1.el8.x86_64


All the other libraries in which the symbol 'slurm_conf' is mentioned come
from the other packages: slurm-23.02.3-1.el8.x86_64.rpm,
slurm-slurmctld-23.02.3-1.el8.x86_64.rpm, and
slurm-slurmd-23.02.3-1.el8.x86_64.rpm.
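
In case it helps to pin down where the old symbol comes from, this is roughly
how one can check which packages provide the shared objects that slurmdbd
actually loads (a sketch; it assumes the standard RPM layout in which the
daemons link a private libslurmfull.so under /usr/lib64/slurm):

$ ldd /usr/sbin/slurmdbd
$ ldd /usr/sbin/slurmdbd | awk '/slurm/ {print $3}' | xargs -r rpm -qf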

How can I solve this problem now?

Many thanks in advance,
Miriam



> I think installing/upgrading the "slurm" rpm will replace this shared lib.
>
> Indeed, as always, test it first on a not-so-critical system and use VM
> snapshots to be able to travel back in time ... because once you upgrade
> the DB schema (if that is part of the upgrade) you AFAIK cannot go back.
>
> josef
>
> On 28. 02. 24 15:51, Miriam Olmi via slurm-users wrote:
>> I installed the new version of slurm 23.11.0-1 by rpm.
>> How can I fix this?
>>


-- 
***
Miriam Olmi
Computing & Network Service

Laboratori Nazionali del Gran Sasso - INFN
Via G. Acitelli, 22
67100 Assergi (AQ) Italy
https://www.lngs.infn.it

email: miriam.o...@lngs.infn.it
   office: +39 0862 437222
***




[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-29 Thread Xaver Stiensmeier via slurm-users

I am wondering why my question (below) didn't catch anyone's attention.
Just as feedback for me: is it unclear where my problem lies, or is it
clear but no solution is known? I have looked through the documentation and
have now searched the Slurm repository, but I am still unable to clearly
identify how to handle "NOT_RESPONDING".

I would really like to improve my question if necessary.
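
To make it more concrete, what I would like to be able to do from the fail
script is roughly the following (whether scontrol can clear NOT_RESPONDING
at all is exactly what I am unsure about; passing the node list as the first
argument is just how my own setup does it):

#!/bin/bash
# sketch of a resume-fail handler: the affected node list arrives as $1
NODES="$1"
# mark the node(s) as available again; whether this also clears
# NOT_RESPONDING is the open question
scontrol update NodeName="$NODES" State=RESUME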

Best regards,
Xaver

On 23.02.24 18:55, Xaver Stiensmeier wrote:

Dear slurm-user list,

I have a cloud node that is powered up and down on demand. Rarely, it
can happen that Slurm's ResumeTimeout is reached and the node is
therefore powered down. We have set ReturnToService=2 in order to
avoid the node being marked DOWN, because the instance behind that
node is created on demand, so after a failure nothing prevents the
system from starting the node again, since it is a different instance.

I thought this would be enough, but apparently the node still gets
marked "NOT_RESPONDING", which leads to Slurm not trying to
schedule on it.

After a while NOT_RESPONDING is removed, but I would like to remove it
directly from within my fail script, if possible, so that the node can
return to service immediately and not be blocked by "NOT_RESPONDING".

Best regards,
Xaver


