[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
; /sys/fs/cgroup/system.slice/cgroup.subtree_control /usr/sbin/slurmstepd infinity &
*From:* Josef Dvoracek via slurm-users
*Sent:* Thursday, April 11, 2024 11:14 AM
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: Slurmd enabled crash with CgroupV2
I observe same behavior on slurm 23.11.5

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
I observe the same behavior on Slurm 23.11.5, Rocky Linux 8.9:
> [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control
> memory pids
> [root@compute ~]# systemctl disable slurmd
> Removed /etc/systemd/system/multi-user.target.wants/slurmd.service.
> [root@compute ~]# cat

[slurm-users] visualisation of JobComp and JobacctGather data with Grafana - screenshots, ideas?

2024-04-10 Thread Josef Dvoracek via slurm-users
Does anybody here have a nice visualization of JobComp and JobacctGather data in Grafana? I save JobComp data in Elasticsearch and JobacctGather data in InfluxDB, and I am thinking about how to provide meaningful insights to $users. Things I'd like to show: especially memory & CPU utilization, job

[slurm-users] Re: cgroups_exporter for slurm on rhel9 (cgroups-v2)

2024-03-25 Thread Josef Dvoracek via slurm-users
I use Telegraf (which supports the "exporter" output format as well) to capture cgroupsv2 job data: https://github.com/jose-d/telegraf-configs/tree/master/slurm-cgroupsv2 I had to rework it when changing from cgroupsv1 to cgroupsv2, as the format/structure of the text files changed a bit. cheers

[slurm-users] Re: Slurm suspend preemption not working

2024-03-15 Thread Josef Dvoracek via slurm-users
I think you need to set a reasonable "DefMemPerCPU"; otherwise jobs will take all memory by default, and there is no memory left for the second job. We calculated DefMemPerCPU in such a way that the default allocated memory of a full node is slightly under half of the total node memory. So there
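The DefMemPerCPU calculation described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `def_mem_per_cpu`, the 48 % target, and the node size (256000 MB, 64 CPUs) are hypothetical and not taken from the thread; only the idea (a full-node default allocation slightly under half of node memory) comes from the message.

```python
def def_mem_per_cpu(real_memory_mb: int, cpus: int, percent: int = 48) -> int:
    """Return a DefMemPerCPU value in MB.

    The value is chosen so that a job using every CPU of the node is
    allocated `percent` % of the node memory by default. Keeping `percent`
    below 50 leaves room for a second full-node job.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    if not 0 < percent < 50:
        raise ValueError("percent must be between 1 and 49")
    # Integer arithmetic only: slurm.conf expects a whole number of megabytes.
    return real_memory_mb * percent // 100 // cpus


# Hypothetical node: 256000 MB of memory, 64 CPUs (assumed values).
node_mem_mb = 256000
node_cpus = 64

value = def_mem_per_cpu(node_mem_mb, node_cpus)
print(f"DefMemPerCPU={value}")
```

The printed line is meant to be compared against, or pasted into, slurm.conf; the right percentage depends on how much headroom a site wants for the OS and for suspended jobs.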

[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users
> I'm running slurm 22.05.11 which is available with OpenHPC 3.x
> Do you think an upgrade is needed?
I feel that a lot of Slurm operators tend not to use 3rd-party sources of Slurm binaries, as you do not have the build environment fully in your hands. But before making such a complex

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users
I think installing/upgrading the "slurm" rpm will replace this shared lib. Indeed, as always, test it first on a not-so-critical system, and use VM snapshots to be able to travel back in time ... as once you upgrade the DB schema (if that is part of the upgrade) you AFAIK cannot go back. josef On 28. 02. 24

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users
I see this question unanswered so far.. so I'll give you my 2 cents: A quick check reveals that the mentioned symbol is in libslurmfull.so:
[root@slurmserver2 ~]# nm -gD /usr/lib64/slurm/libslurmfull.so | grep "slurm_conf$"
000d2c06 T free_slurm_conf
000d3345 T init_slurm_conf

[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users
Hi Dietmar; I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently.. I must say that on my setup it looks like it works as expected, see the grepped stdout from your reproducer below. I use recent slurm 23.11.4. Wild guess.. Has your build machine bpf and dbus devel packages

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Josef Dvoracek via slurm-users
For some unclear reason, "--wrap" was not part of my /repertoire/ so far. thanks
On 26. 02. 24 9:47, Ward Poelmans via slurm-users wrote:
> sbatch --wrap 'screen -D -m'
> srun --jobid --pty screen -rd

[slurm-users] canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-26 Thread Josef Dvoracek via slurm-users
What is the recommended way to run longer interactive jobs on your systems? Our how-to includes starting screen on the front-end node and running srun with bash/zsh inside, but that indeed brings a dependency between the login node (with screen) and the compute node job. On systems with multiple

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Josef Dvoracek via slurm-users
> Just looking for some feedback, please. Is this OK? Is there a better way?
> I’m tempted to spec all new HPCs with only a high speed (200Gbps) IB network,
Well, you need Ethernet for OOB management (bmc/ipmi/ilo/whatever) anyway.. or? cheers josef On 25. 02. 24 21:12, Dan Healy via

[slurm-users] Re: Compilation question

2024-02-10 Thread Josef Dvoracek via slurm-users
Isn't your /softs.. filesystem e.g. some cluster network filesystem mount? It has happened to me multiple times that I was attempting to build some scientific software, and because of building on top of BeeGFS (I think hardlinks are not fully supported) or NFS (caching), I was getting

[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Josef Dvoracek via slurm-users
My impression is that there are multiple challenges that make it hard to create a good-for-all recent Slurm RPM:
- NVML dependency - different sites use different NVML lib versions with varying update cycles
- pmi* deps - some sites (like mine) are using only one reasonably recent openpmix, I