Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-07 Thread Stefan Staeglich
Hi Xaver, we also had a similar problem with Slurm 21.08 (see thread "error: power_save module disabled, NULL SuspendProgram"). Fortunately, we have not yet observed this since the upgrade to 23.02. But the time period (about a month) is still too short to know if the problem is really fixed

[slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-06 Thread Stefan Staeglich
Hi, for half a year we have been using the suspend/resume support for Slurm. This works quite well, but sometimes it breaks and no nodes are suspended or resumed anymore. In this case we see the following message in the log: error: power_save module disabled, NULL SuspendProgram A restart of slurmctld

Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-20 Thread Stefan Staeglich
u using a UnkillableStepProgram? Thank you :) Best, Stefan On Friday, 20 January 2023, 05:59:19 CET, Christopher Samuel wrote: > On 1/19/23 5:01 am, Stefan Staeglich wrote: > > Hi, > > Hiya, > > > I'm wondering where the UnkillableStepProgram is actually executed. >

Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-19 Thread Stefan Staeglich
Hi, I'm wondering where the UnkillableStepProgram is actually executed. According to Mike it has to be available on every one of the compute nodes. This makes sense only if it is executed there. But the slurm.conf man page of 21.08.x states: UnkillableStepProgram Must be execut

Re: [slurm-users] Setting up slurmrestd

2022-06-29 Thread Stefan Staeglich
Hi Karl, have you found a solution? Best, Stefan On Friday, 8 January 2021, 23:14:34 CEST, Karl Lovink wrote: > Hi Luke, > > Thanks it’s working now. Thanks. One last question, is it possible to create > a non-expiring token. Yes, I know it is not secure > Sincerely yours, > Karl > >

[slurm-users] Allow specific users to drain nodes

2022-04-27 Thread Stefan Staeglich
Hi, we want to allow specific users to drain nodes. This feature seems to be implemented in the nonstop plugin. But that plugin seems to be overkill if we only need this one feature. Is there any other plugin that implements this feature? Best, Stefan -- Stefan Stäglich, Universität Freiburg, Institut f

Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2022-02-18 Thread Stefan Staeglich
Hi Mike, thank you very much :) Stefan On Monday, 7 February 2022, 16:50:54 CET, Michael Robbert wrote: > They moved Arbiter2 to Github. Here is the new official repo: > https://github.com/CHPC-UofU/arbiter2 > > Mike > > On 2/7/22, 06:51, "slurm-users" > wrote: Hi, > > I've just noticed tha

Re: [slurm-users] Increasing /dev/shm max size?

2022-02-18 Thread Stefan Staeglich
Hi Diego, do you have any new insights regarding this issue? Best, Stefan On Monday, 26 October 2020, 14:48:17 CET, Diego Zuccato wrote: > On 22/10/20 12:56, Diego Zuccato wrote: > > 2) Is the shared memory accounted as belonging to the process and > > enforced accordingly by cgroups? > > Acc

Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2022-02-07 Thread Stefan Staeglich
Hi, I've just noticed that the repository https://gitlab.chpc.utah.edu/arbiter2 seems to be down. Does someone know more? Thank you! Best, Stefan On Tuesday, 27 April 2021, 17:35:35 CET, Prentice Bisbal wrote: > I think someone asked this same exact question a few weeks ago. The best > solutio

Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2021-06-11 Thread Stefan Staeglich
Hi Prentice, thanks for the hint. I'm evaluating this too. It seems that arbiter doesn't distinguish between RAM that's really used and RAM that's used as cache only. Or is my impression wrong? Best, Stefan On Tuesday, 27 April 2021, 17:35:35 CEST, Prentice Bisbal wrote: > I think someone ask

[slurm-users] Parent accounts

2021-05-28 Thread Stefan Staeglich
Hi, for our monitoring system I want to query the account hierarchy. Is there a better approach than to parse the output of sacctmgr list account withasso -nP ? Something like sacctmgr list account parent=bla withasso -nP doesn't work. Best, Stefan -- Stefan Stäglich, Universität Freiburg
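For readers with the same monitoring need, parsing the pipe-separated listing can be sketched as below. This is only a minimal sketch, not a confirmed recipe: the sample text is hypothetical, and it assumes each row has been reduced to two fields, an account name and its parent account, which may not match the columns your sacctmgr call actually prints.

```python
from collections import defaultdict

# Hypothetical sample of pipe-separated "account|parent" rows.
# Assumption: the real column layout of your sacctmgr output may differ.
SAMPLE = """\
root|
physics|root
astro|physics
chem|root
"""

def build_children(text):
    """Map each parent account to the sorted list of its direct children."""
    children = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        # Pad with an empty field so rows without a parent still unpack.
        account, parent = (line.split("|") + [""])[:2]
        if parent:
            # A set drops duplicate rows (e.g. one row per user association).
            children[parent].add(account)
    return {p: sorted(c) for p, c in children.items()}

def descendants(children, account):
    """Return all accounts below `account`, depth first."""
    result = []
    for child in children.get(account, []):
        result.append(child)
        result.extend(descendants(children, child))
    return result

tree = build_children(SAMPLE)
print(descendants(tree, "root"))
```

Filtering for one parent then becomes a lookup such as `descendants(tree, "physics")` on the parsed data, rather than a query-time filter.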

Re: [slurm-users] GRES Restrictions

2021-04-15 Thread Stefan Staeglich
Hello, is there a best practice for activating this feature (set ConstrainDevices=yes)? Do I have to restart the slurmds? Does this affect running jobs? We are using Slurm 19.05. Best, Stefan On Tuesday, 25 August 2020, 17:24:41 CEST, Christoph Brüning wrote: > Hello, > > we're using cgroup

Re: [slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

2021-03-18 Thread Stefan Staeglich
Hi Sven, I think it makes more sense to adjust the config file /etc/slurm-llnl/slurm.conf and not the systemd units: SlurmctldPidFile=/run/slurmctld.pid SlurmdPidFile=/run/slurmd.pid Best, Stefan On Wednesday, 17 March 2021, 19:16:38 CET, Sven Duscha wrote: > Hi, > > I experience with SLURM s

[slurm-users] Current status of checkpointing

2020-08-14 Thread Stefan Staeglich
Hi, what's the current status of the checkpointing support in SLURM? There was a CRIU plugin mentioned: https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf But it doesn't exist in SLURM 19.05.5 on Ubuntu 20.04. And the manual page mentions an OpenMPI plugin only. Best, Stefan -- Stefan Stäglich,

Re: [slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Stefan Staeglich
Hi, everything except /etc/ssl/certs/ca-certificates.crt is ignored. So I've copied it to /usr/local/share/ca-certificates/ and run update-ca-certificates. Now it's working :) Best, Stefan On Friday, 14 August 2020, 11:42:04 CEST, Stefan Staeglich wrote: > Hi, > >

[slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Stefan Staeglich
Hi, I'm trying to set up the acct_gather plugin ProfileInfluxDB. Unfortunately, our InfluxDB server has only a self-signed certificate: [2020-08-14T09:54:30.007] [46.0] error: acct_gather_profile/influxdb _send_data: curl_easy_perform failed to send data (discarded). Reason: SSL peer certificate or SS

Re: [slurm-users] Upgrade from Ubuntu 18.04 to 20.04

2020-03-25 Thread Stefan Staeglich
Hi Will, in this case it should be no problem to upgrade directly to Ubuntu 20.04? It ships 19.05, there is no 19.11. Best, Stefan On Monday, 16 March 2020, 15:41:56 CET, Will Dennis wrote: > Hi Stefan, > > I have not been able to find any 18.08.x PPAs; I myself have backported the > latest Debi

Re: [slurm-users] Usage splitting

2019-09-12 Thread Stefan Staeglich
Hi Chris, I'm not sure how this works. I'm not very experienced with QoS objects. Do I have to create two QoS objects a and b with UsageThreshold=0.1,Flags=EnforceUsageThreshold / UsageThreshold=0.9? And do I need two different accounts A and B like Daniel suggested? Or can I use a single account? Al

[slurm-users] Usage splitting

2019-08-30 Thread Stefan Staeglich
Hi, we have some compute nodes paid for by different project owners. 10% are owned by project A and 90% are owned by project B. We want to implement the following policy for each fixed time period (e.g. two weeks): - Project A doesn't use more than 10% of the cluster in this time period -