[slurm-users] Re: GPU shards not exclusive
Hi Will,

I appreciate the corroboration. After we upgraded to 23.02.$latest, the issue seemed easier to reproduce than before. However, it now appears to have subsided, and the only change I can potentially attribute that to is turning on SlurmctldParameters=rl_enable in slurm.conf. Here's hoping that 23.11 will offer even more in the future.

Reed

> On Feb 28, 2024, at 7:28 AM, wdennis--- via slurm-users wrote:
>
> Hi Reed,
>
> Unfortunately, we had the same issue with 22.05.9; SchedMD's advice was to
> upgrade to 23.11.x, and this appears to have resolved the issue for us.
> SchedMD support told us, "We did a lot of work regarding shards in the
> 23.11 release."
>
> HTH,
> Will
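For anyone wanting to try the same change: a minimal sketch of the slurm.conf edit described above, with a quick check that the running controller picked it up. rl_enable is the only parameter named in this thread; any additional rl_* tuning knobs should be verified against the slurm.conf(5) man page for your version.

    # slurm.conf excerpt -- enable RPC rate limiting in slurmctld
    SlurmctldParameters=rl_enable

    # after restarting or reconfiguring slurmctld, verify:
    $ scontrol show config | grep -i SlurmctldParameters
    SlurmctldParameters     = rl_enable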
[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2
Hi list,

I finally got it working. I completely overlooked that I had set Oversubscribe=EXCLUSIVE for the partition that I used for testing, stupid me.

Sorry for the noise, and thanks again for your answers.

Best
Dietmar

On 2/29/24 13:19, Dietmar Rieder via slurm-users wrote:

Hi Josef, hi list,

I have now rebuilt the rpms from OpenHPC, but using the original sources from version 23.11.4. The configure command that is generated from the spec is the following:

./configure --build=x86_64-redhat-linux-gnu \
  --host=x86_64-redhat-linux-gnu \
  --program-prefix= \
  --disable-dependency-tracking \
  --prefix=/usr \
  --exec-prefix=/usr \
  --bindir=/usr/bin \
  --sbindir=/usr/sbin \
  --sysconfdir=/etc/slurm \
  --datadir=/usr/share \
  --includedir=/usr/include \
  --libdir=/usr/lib64 \
  --libexecdir=/usr/libexec \
  --localstatedir=/var \
  --sharedstatedir=/var/lib \
  --mandir=/usr/share/man \
  --infodir=/usr/share/info \
  --enable-multiple-slurmd \
  --with-pmix=/opt/ohpc/admin/pmix \
  --with-hwloc=/opt/ohpc/pub/libs/hwloc

(Am I missing something here?)

The configure output shows:

[...]
checking for bpf installation... /usr
checking for dbus-1... yes
[...]

config.log:
dbus_CFLAGS='-I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include '
dbus_LIBS='-ldbus-1'

confdefs.h:
#define WITH_CGROUP 1
#define HAVE_BPF 1

However, I still can't see any CPU limits when I use sbatch to run a batch job:

$ sbatch --time 5 --ntasks-per-node=1 --nodes=1 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status'
$ cat slurm-72.out
Cpus_allowed:   ,,
Cpus_allowed_list:      0-95

The logs from the head node (leto) and the compute node (apollo-01) show:

Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _slurm_rpc_submit_batch_job: JobId=72 InitPrio=1 usec=365
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 72
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU input mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: task/affinity: batch_bind: job 72 CPU final HW mask for node: 0x
Feb 29 12:55:05 apollo-01 slurmd[172835]: slurmd: Launching batch job 72 for UID 50001
Feb 29 12:55:06 apollo-01 kernel: slurm.epilog.cl (172966): drop_caches: 3
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: sched/backfill: _start_job: Started JobId=72 in standard on apollo-01
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 WEXITSTATUS 0
Feb 29 12:55:05 leto slurmctld[272883]: slurmctld: _job_complete: JobId=72 done

Best
Dietmar

On 2/28/24 16:25, Josef Dvoracek via slurm-users wrote:
> I'm running slurm 22.05.11 which is available with OpenHPC 3.x
> Do you think an upgrade is needed?

I feel that a lot of Slurm operators tend not to use third-party sources of Slurm binaries, as you do not have the build environment fully in your hands. But before making such a complex decision, perhaps look for the build logs of the Slurm you use (somewhere in the OpenHPC build system?) and check whether it was built with the libraries needed to have cgroup v2 working.

Not having the cgroup v2 dependencies at build time is only one of the possible causes...

josef
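Since the root cause here turned out to be a partition setting rather than the build, a quick sanity check for others seeing "no CPU limits" under cgroup v2 (the partition name "standard" is taken from the backfill log line above; output lines are illustrative):

    # OverSubscribe=EXCLUSIVE hands jobs whole nodes, so no per-job
    # CPU mask/confinement is applied
    $ scontrol show partition standard | grep -io 'OverSubscribe=[A-Z]*'
    OverSubscribe=EXCLUSIVE

    # confirm the node is really running cgroup v2 with the needed controllers
    $ cat /sys/fs/cgroup/cgroup.controllers
    cpuset cpu io memory pids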
[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object
Dear Josef,

Thanks a lot again for your help. Unfortunately, I cannot solve this problem.

According to the Slurm documentation (https://slurm.schedmd.com/quickstart_admin.html#upgrade) I have to upgrade only slurmdbd at the very beginning, and the cluster should be able to work even with slurmdbd-23.11.0-1, slurmctld-23.02.3-1, and slurmd-23.02.3-1.

The slurmdbd-23.11.0-1 package should provide the following files:

$ rpm -ql slurm-slurmdbd-23.11.0-1-no-frontend.el8.x86_64.rpm
/usr/lib/.build-id
/usr/lib/.build-id/01
/usr/lib/.build-id/01/da333fd28f1765164e46d00569ca55e55eb066
/usr/lib/.build-id/e7/4ab5829ee8f5b959cd71d47077cb09fb40fb54
/usr/lib/systemd/system/slurmdbd.service
/usr/lib64/slurm/accounting_storage_mysql.so
/usr/sbin/slurmdbd

I checked on my cluster and all these files are present and come from slurmdbd-23.11.0-1, as you can see from this example for two of the files:

$ rpm -q --whatprovides /usr/lib/systemd/system/slurmdbd.service
slurm-slurmdbd-23.11.0-1.el8.x86_64
$ rpm -q --whatprovides /usr/lib64/slurm/accounting_storage_mysql.so
slurm-slurmdbd-23.11.0-1.el8.x86_64

All the other libraries where the symbol 'slurm_conf' is mentioned come from the other packages: slurm-23.02.3-1.el8.x86_64.rpm, slurm-slurmctld-23.02.3-1.el8.x86_64.rpm, slurm-slurmd-23.02.3-1.el8.x86_64.rpm.

How can I solve this problem now?

Many thanks in advance,
Miriam

> I think installing/upgrading the "slurm" rpm will replace this shared lib.
>
> Indeed, as always, test it first on a not-so-critical system; use VM
> snapshots to be able to travel back in time... as once you upgrade
> the DB schema (if part of the upgrade) you AFAIK cannot go back.
>
> josef
>
> On 28. 02. 24 15:51, Miriam Olmi via slurm-users wrote:
>> I installed the new version of slurm 23.11.0-1 by rpm.
>> How can I fix this?

--
***
Miriam Olmi
Computing & Network Service
Laboratori Nazionali del Gran Sasso - INFN
Via G. Acitelli, 22
67100 Assergi (AQ) Italy
https://www.lngs.infn.it
email: miriam.o...@lngs.infn.it
office: +39 0862 437222
***
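The "different size in shared object" error typically means the newer slurmdbd binary is resolving symbols from an older copy of Slurm's internal shared library. A hedged way to confirm this on the slurmdbd host (the libslurmfull.so path assumes the standard rpm layout; adjust if your build differs):

    $ ldd /usr/sbin/slurmdbd | grep -i slurm
    $ rpm -q --whatprovides /usr/lib64/slurm/libslurmfull.so

If the second command reports the 23.02.3 base "slurm" package, that matches Josef's point above: the base rpm owns the shared library, so it also needs upgrading on the host running slurmdbd, while slurmctld and slurmd on the other hosts can stay at 23.02 per the documented upgrade order.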
[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING
I am wondering why my question (below) didn't catch anyone's attention. Just as feedback for me: is it unclear where my problem lies, or is it clear but no solution is known? I looked through the documentation and have now searched the Slurm repository, but I am still unable to clearly identify how to handle "NOT_RESPONDING". I would really like to improve my question if necessary.

Best regards,
Xaver

On 23.02.24 18:55, Xaver Stiensmeier wrote:

Dear slurm-user list,

I have a cloud node that is powered up and down on demand. Rarely, it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked DOWN, because the instance behind that node is created on demand; after a failure, nothing stops the system from starting the node again, as it is a different instance.

I thought this would be enough, but apparently the node is still marked "NOT_RESPONDING", which leads to Slurm not trying to schedule on it. After a while NOT_RESPONDING is removed, but I would like to remove it directly from within my fail script if possible, so that the node can return to service immediately and not be blocked by "NOT_RESPONDING".

Best regards,
Xaver
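In case a concrete shape helps the discussion: an untested sketch of the kind of fail script meant here, wired in as ResumeFailProgram in slurm.conf. Slurm passes the failed nodes as a hostlist expression in the first argument; whether "state=resume" also clears the NOT_RESPONDING flag (rather than only DOWN/DRAIN) is exactly the open question in this thread.

    #!/bin/bash
    # resume-fail.sh -- called by ResumeFailProgram with the failed
    # node list (hostlist expression) as $1
    failed_nodes="$1"

    # site-specific cleanup of the on-demand cloud instances would go here

    # attempt to put the nodes straight back into service
    scontrol update nodename="$failed_nodes" state=resume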