[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
… /sys/fs/cgroup/system.slice/cgroup.subtree_control /usr/sbin/slurmstepd infinity & (quoting the earlier message of April 11, 2024: "I observe same behavior on slurm 23.11.5")

[slurm-users] Re: Slurmd enabled crash with CgroupV2

2024-04-11 Thread Josef Dvoracek via slurm-users
I observe the same behavior on slurm 23.11.5, Rocky Linux 8.9. > [root@compute ~]# cat /sys/fs/cgroup/cgroup.subtree_control > memory pids > [root@compute ~]# systemctl disable slurmd > Removed /etc/systemd/system/multi-user.target.wants/slurmd.service. > [root@compute ~]# cat
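For readers hitting the same thing, a minimal set of checks on a cgroup v2 host; the service name slurmd and the controller list are the usual defaults, but treat the expected output as an assumption about this particular setup:

    # which controllers are enabled for children at the root and in system.slice
    cat /sys/fs/cgroup/cgroup.subtree_control
    cat /sys/fs/cgroup/system.slice/cgroup.subtree_control
    # is the slurmd unit getting its own delegated subtree from systemd?
    systemctl show -p Delegate -p ControlGroup slurmd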

[slurm-users] visualisation of JobComp and JobacctGather data with Grafana - screenshots, ideas?

2024-04-10 Thread Josef Dvoracek via slurm-users
Is anybody here who has a nice visualization of JobComp and JobacctGather data in Grafana? I save JobComp data in Elasticsearch and JobacctGather data in InfluxDB, and I am thinking about how to provide meaningful insights to $users. Things I'd like to show: especially memory & CPU utilization, job

[slurm-users] Re: cgroups_exporter for slurm on rhel9 (cgroups-v2)

2024-03-25 Thread Josef Dvoracek via slurm-users
I use telegraf (which supports the "exporter" output format as well) to capture cgroupsv2 job data: https://github.com/jose-d/telegraf-configs/tree/master/slurm-cgroupsv2 I had to rework it when changing from cgroupsv1 to cgroupsv2, as the format/structure of the text files changed a bit. cheers
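As an illustration of what such a collector reads on cgroup v2 (the hierarchy below slurmstepd.scope is my assumption about a typical slurm cgroup/v2 layout, not taken from the linked config):

    # per-job memory and CPU counters under the slurmstepd scope (path layout is an assumption)
    for j in /sys/fs/cgroup/system.slice/slurmstepd.scope/job_*; do
        echo "$j: mem=$(cat "$j/memory.current") cpu_usec=$(awk '/usage_usec/ {print $2}' "$j/cpu.stat")"
    done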

[slurm-users] Re: Slurm suspend preemption not working

2024-03-15 Thread Josef Dvoracek via slurm-users
I think you need to set a reasonable "DefMemPerCPU" - otherwise jobs will take all memory by default, and there is no memory remaining for the second job. We calculated DefMemPerCPU so that the default allocated memory of a full node is slightly under half of the total node memory. So there
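A worked example of that sizing, with hypothetical numbers (64 cores, 256 GB RAM; not taken from the original post):

    # half of total memory per CPU: 262144 MB / 2 / 64 cores = 2048 MB
    echo $(( 262144 / 2 / 64 ))
    # pick something slightly below that in slurm.conf, e.g.:
    #   DefMemPerCPU=1900    # 64 * 1900 MB ~= 119 GB, just under half of 256 GB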

[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users
> I'm running slurm 22.05.11 which is available with OpenHPC 3.x > Do you think an upgrade is needed? I feel that a lot of slurm operators tend not to use 3rd-party sources of slurm binaries, as you do not have the build environment fully in your hands. But before making such a complex

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users
I think installing/upgrading the "slurm" rpm will replace this shared lib. Indeed, as always, test it first on a not-so-critical system, and use VM snapshots to be able to travel back in time ... as once you upgrade the DB schema (if it is part of the upgrade), you AFAIK cannot go back. josef On 28. 02. 24

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users
I see this question is unanswered so far, so I'll give you my 2 cents: a quick check reveals that the mentioned symbol is in libslurmfull.so: [root@slurmserver2 ~]# nm -gD /usr/lib64/slurm/libslurmfull.so | grep "slurm_conf$" 000d2c06 T free_slurm_conf 000d3345 T init_slurm_conf
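One way to chase the mismatch on an RPM-based install; a sketch, assuming the common slurm / slurm-slurmdbd package split:

    # which package provides the shared lib slurmdbd is loading?
    rpm -qf /usr/lib64/slurm/libslurmfull.so
    rpm -q slurm slurm-slurmdbd
    # differing versions here usually explain the "different size in shared object" warning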

[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users
Hi Dietmar; I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently.. I must say that on my setup it looks like it works as expected; see the grepped stdout from your reproducer below. I use recent slurm 23.11.4. Wild guess.. does your build machine have the bpf and dbus devel packages

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Josef Dvoracek via slurm-users
For some unclear reason "--wrap" was not part of my /repertoire/ so far. thanks On 26. 02. 24 9:47, Ward Poelmans via slurm-users wrote: sbatch --wrap 'screen -D -m' srun --jobid --pty screen -rd
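Put together, the flow from that reply looks like this (the job id 12345 is a placeholder):

    # start a detached screen session as a batch job
    sbatch --wrap 'screen -D -m'
    # later, attach to it from a login node through the job's allocation
    srun --jobid 12345 --pty screen -rd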

[slurm-users] canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-26 Thread Josef Dvoracek via slurm-users
What is the recommended way to run a longer interactive job on your systems? Our how-to includes starting screen on a front-end node and running srun with bash/zsh inside, but that indeed creates a dependency between the login node (with screen) and the compute node job. On systems with multiple

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Josef Dvoracek via slurm-users
> Just looking for some feedback, please. Is this OK? Is there a better way? > I'm tempted to spec all new HPCs with only a high-speed (200Gbps) IB network, Well, you need Ethernet for OOB management (BMC/IPMI/iLO/whatever) anyway.. or not? cheers josef On 25. 02. 24 21:12, Dan Healy via

[slurm-users] Re: Compilation question

2024-02-10 Thread Josef Dvoracek via slurm-users
Isn't your /softs.. filesystem e.g. some cluster network filesystem mount? It has happened to me multiple times that I was attempting to build some scientific software and, because I was building on top of BeeGFS (I think hardlinks are not fully supported) or NFS (caching), I was getting

[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Josef Dvoracek via slurm-users
My impression is that there are multiple challenges that make it hard to create a good-for-all recent slurm RPM: - NVML dependency - different sites use different NVML lib versions with varying update cycles - pmi* deps - some sites (like mine) use only one reasonably recent openpmix, I

Re: [slurm-users] Database cluster

2024-01-25 Thread Josef Dvoracek
To protect against HW failure, and to have more freedom when upgrading the underlying OS, we use virtualization with "live migration"/HA and run the MariaDB server as a VM. A VM is easy to back up, restore from a snapshot, clone for possible tests, etc. In the past, I deployed (customer requirement) one site

Re: [slurm-users] Autodetect of nvml is not working in gres.conf

2023-11-30 Thread Josef Dvoracek
Couldn't it be that the "cuda-nvml-devel" library was not installed when you were building slurm? cheers josef On 30. 11. 23 15:06, Ravi Konila wrote: Hello, My gres.conf has AutoDetect=nvml; when I restart the slurmd service I do get *fatal: We were configured to autodetect nvml functionality, but we
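A quick way to check that guess on the compute node; the plugin directory is the usual RPM location, so treat the path as an assumption:

    # the NVML autodetect plugin only exists if slurm was built against NVML
    ls /usr/lib64/slurm/ | grep -i gpu_nvml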

Re: [slurm-users] SLURM new user query, does SLURM has GUI /Web based management version also

2023-11-28 Thread Josef Dvoracek
> can you please advise me on the monitoring tools, I ... I'm _somewhat_ satisfied with: Prometheus Slurm exporter (https://github.com/vpenso/prometheus-slurm-exporter), scraped by Telegraf (https://www.influxdata.com/time-series-platform/telegraf), sending metrics to InfluxDB.

[slurm-users] meaning of "next_state_after_reboot" in scontrol show node output / API

2023-11-14 Thread Josef Dvoracek
I'm writing an ansible module to interact with my clusters, so I'm currently diving into the --yaml output of `scontrol show node`.. What is the meaning of the "next_state_after_reboot" attribute of a node? E.g. for one of my nodes, it is: "next_state_after_reboot": [
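For reference, the attribute in question can be pulled out like this (node name is a placeholder; --yaml needs a slurm build with the serializer plugins):

    scontrol --yaml show node node001 | grep -A 3 next_state_after_reboot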

Re: [slurm-users] Granular or dynamic control of partitions?

2023-08-04 Thread Josef Dvoracek
Just remove the given node from the partition. Already running jobs will continue without interruption.. HTH josef On 04. 08. 23 16:40, Pacey, Mike wrote: ..
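A minimal sketch of that operation, with a hypothetical partition 'batch' and node 'node042':

    # check the partition's current node list
    scontrol show partition batch | grep ' Nodes='
    # redefine the node list without node042 (jobs already running there keep running)
    scontrol update PartitionName=batch Nodes=node[001-041,043-064]
    # mirror the change in slurm.conf so it survives a slurmctld restart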

[slurm-users] stopping job array after N failed jobs in row

2023-08-01 Thread Josef Dvoracek
My users have found the beauty of job arrays, and they tend to use them every now and then. Sometimes the human factor steps in, something is wrong in the job array specification, and the cluster "works" on one failed array job after another. Isn't there any way to automatically stop/scancel/? a job

Re: [slurm-users] monitoring and accounting

2023-06-12 Thread Josef Dvoracek
> But I'd be interested to see what other places do. We installed this: https://github.com/vpenso/prometheus-slurm-exporter and scrape this exporter with the "inputs.prometheus" Telegraf input, and it's sent to InfluxDB (and shown by Grafana) -- josef On 12. 06. 23 1:43, Andrew Elwell wrote: ...
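A quick sanity check of that chain before wiring up Telegraf; port 8080 is the exporter's usual default, but treat host and port as assumptions for your install:

    # the exporter should answer with slurm_* metrics in Prometheus text format
    curl -s http://localhost:8080/metrics | grep -m 5 '^slurm_'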

[slurm-users] assigning qos and account to new users

2023-04-18 Thread Josef Dvoracek
Hi all slurm ops, I'd like to improve my new-user workflow. When a new user (e.g. authenticated against external LDAP) tries to submit a job at my facilities, they see this: [test@login1 ~]$ sbatch sbatch_sleep.sh sbatch: error: Batch job submission failed: Invalid account or account/partition
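That error usually means the user has no association in slurmdbd yet; a minimal sketch of creating one (the account, organization and QOS names here are hypothetical):

    # create the account once, then attach the new LDAP user to it
    sacctmgr add account physics Description="physics users" Organization=myorg
    sacctmgr add user test Account=physics
    # optionally set a default QOS for the user
    sacctmgr modify user test set QOS=normal DefaultQOS=normal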

[slurm-users] slurm jobs and and amount of licenses (matlab)

2022-09-26 Thread Josef Dvoracek
Hello @list! Has anyone dealt with the following scenario? * we have a limited number of Matlab network licenses (and various features have various numbers of available seats, e.g. machine learning: N licenses, Image_Toolbox: M licenses) * licenses are being used by slurm jobs and by
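For the slurm-side half of this, one common approach is slurm's built-in static license counting; a sketch with placeholder names and counts, which by itself does not see seats taken outside slurm:

    # in slurm.conf (names and counts are placeholders):
    #   Licenses=matlab:20,matlab_image_toolbox:5
    # jobs then declare the seats they will consume:
    sbatch -L matlab:1,matlab_image_toolbox:1 run_matlab.sh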

Re: [slurm-users] container on slurm cluster

2022-05-18 Thread Josef Dvoracek
> I had config the right slurm and munge inside the container. This is the reason. Whoever has access to munge.key can effectively become root on the slurm cluster. You should not disclose munge.key to containers. cheers josef On 18. 05. 22 9:13, GHui wrote: ...I had config the right slurm and

Re: [slurm-users] work with sensitive data

2021-12-16 Thread Josef Dvoracek
ing filesystem being not secure by default.. On 15. 12. 21 10:29, Hermann Schwärzler wrote: ... -- Josef Dvoracek Institute of Physics | Czech Academy of Sciences | office 230A cell+signal: +420 608 563 558

[slurm-users] slurm and kerberized NFSv4 - current perspective?

2021-12-16 Thread Josef Dvoracek
Is kerberized NFS on compute nodes under slurm a not-so-common scenario? Thanks for any thoughts. josef -- Josef Dvoracek Institute of Physics | Czech Academy of Sciences | office 230A cell+signal: +420 608 563 558

[slurm-users] how to temporarily avoid node being suspended by SuspendProgram

2021-08-10 Thread Josef Dvoracek
powersaving mechanism for a particular node/node range? I'm aware that there is the SuspendExcNodes configuration parameter, but AFAIK it cannot be applied/changed without a slurmctld restart. cheers josef -- Josef Dvoracek Institute of Physics | Czech Academy of Sciences cell: +420 608 563 558 | https

Re: [slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

2021-06-24 Thread Josef Dvoracek
cked up - at least once with the cluster running.) It is absolutely safe to restart slurmctld (and slurmdbd) with jobs running on the cluster; that really is something that at least I do all the time. Tina On 24/06/2021 10:16, Josef Dvoracek wrote: hi, just set the partitions to "DO

Re: [slurm-users] Effect of slurmctld and slurmdb going down on running/pending jobs

2021-06-24 Thread Josef Dvoracek
ining all partitions and then restart the server. That is slurmctld, slurmdbd and mariadb? Or will restarting the slurm VM have no effect on running/pending jobs? Sincerely Amjad -- Josef Dvoracek Institute of Physics | Czech Academy of Sciences cell: +420 608 563 558 | https://telegram

[slurm-users] srun at front-end nodes with --enable_configless fails with "Can't find an address, check slurm.conf"

2021-03-22 Thread Josef Dvoracek
=slurmserver2.DOMAIN AccountingStorageType=accounting_storage/slurmdbd AccountingStorageHost=slurmserver2.DOMAIN AccountingStoragePort=7031 SlurmctldParameters=enable_configless -- Josef Dvoracek Institute of Physics | Czech Academy of Sciences cell: +420 608 563 558 | https://telegram.me/jose_d |

Re: [slurm-users] Major newbie - Slurm/jupyterhub

2020-05-05 Thread Josef Dvoracek
On 05. 05. 20 2:24, Lisa Kay Weihl wrote: ..

Re: [slurm-users] Monitoring with Telegraf

2019-09-27 Thread Josef Dvoracek
Some time ago I wrote this small collector: https://github.com/jose-d/influxdb-collectors/tree/master/slurm_metric_writer. Until you write/find a better one, feel free to use it, send PRs with improvements, etc. :) cheers. josef On 26. 09. 19 17:15, Marcus Boden wrote: Hey everyone, I am

[slurm-users] slurmctl listening on IPv4 only

2019-06-04 Thread Josef Dvoracek
0 127.0.0.11:57504 0.0.0.0:* - [root@slurmctld_container ~]# cheers josef -- Josef Dvoracek Institute of Physics @ Czech Academy of Sciences cell: +420 608 563 558 | office: +420 266 052 669 | fzu phone nr. : 2669