Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-07 Thread Stefan Staeglich
Hi Xaver,

we also had a similar problem with Slurm 21.08 (see thread "error: power_save 
module disabled, NULL SuspendProgram").

Fortunately, we have not observed this since the upgrade to 23.02. But the 
time span (about a month) is still too short to know whether the problem is 
really fixed, as we are still within the normal recurrence interval of that 
event.

Best regards,
Stefan

On Wednesday, 6 December 2023 at 12:14:46 CET, Xaver Stiensmeier wrote:
> Hi Ole,
> 
> for multiple reasons we build it ourselves. I am not really involved
> in that process, but I will contact the person who is. Thanks for the
> recommendation! We should probably implement a regular check for new
> Slurm versions. I am not 100% sure whether this will fix our issues,
> but it's worth a try.
> 
> Best regards
> Xaver
> 
> On 06.12.23 12:03, Ole Holm Nielsen wrote:
> > On 12/6/23 11:51, Xaver Stiensmeier wrote:
> >> Good idea. Here's our current version:
> >> 
> >> ```
> >> sinfo -V
> >> slurm 22.05.7
> >> ```
> >> 
> >> Quick googling told me that the latest version is 23.11. Does the
> >> upgrade change anything in that regard? I will keep reading.
> > 
> > There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
> > Power with Slurm" at https://slurm.schedmd.com/publications.html
> > 
> > For reasons of security and functionality it is recommended to follow
> > Slurm's releases (maybe not the first few minor versions of new major
> > releases like 23.11).  FYI, I've collected information about upgrading
> > Slurm in the Wiki page
> > https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
> > 
> > /Ole


-- 
Albert-Ludwigs-Universität Freiburg
Institut für Informatik
Professur für Maschinelles Lernen

Stefan Stäglich
System-Administrator

T +49 761 203-8223

staeg...@informatik.uni-freiburg.de
https://ml.informatik.uni-freiburg.de

Georges-Köhler-Allee 52
D-79110 Freiburg


smime.p7s
Description: S/MIME cryptographic signature


[slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-06 Thread Stefan Staeglich
Hi,

For about half a year we have been using Slurm's suspend/resume support. This 
works quite well, but sometimes it breaks and no nodes are suspended or 
resumed anymore.

In this case we see the following message in the log:
error: power_save module disabled, NULL SuspendProgram

A restart of slurmctld fixes the issue for a few weeks.

In the beginning we had also messages like
error: power_save: program exit status of 1

So we added error logging to the scripts and made sure they always terminate 
with exit code 0. The idea was to prevent Slurm from setting SuspendProgram 
to NULL.

This did not fix the main error, but it might have reduced how often it 
occurs. Has anyone observed similar issues? We will try a higher 
SuspendTimeout.
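For later readers, the logging-and-clean-exit pattern described above might look roughly like the sketch below. The wrapper path, log location, and the placeholder power-off call are illustrative assumptions, not the actual script from our cluster:

```
# Write the hypothetical SuspendProgram wrapper, then invoke it the way
# slurmctld would.
cat > /tmp/suspend.sh <<'EOF'
#!/bin/bash
# slurmctld calls this as: suspend.sh <hostlist>, e.g. suspend.sh node[001-004]
LOG=/tmp/slurm_suspend.log
NODES="$1"
{
    echo "$(date '+%F %T') suspend requested for: $NODES"
    # The real power-off command (IPMI, cloud CLI, ...) would go here.
    # Its failure is logged rather than propagated, since non-zero exits
    # were suspected of contributing to the power_save breakage.
    if ! true; then  # placeholder for the actual power-off command
        echo "$(date '+%F %T') suspend FAILED for: $NODES"
    fi
} >> "$LOG" 2>&1
exit 0  # always report success to slurmctld; failures go to the log
EOF
chmod +x /tmp/suspend.sh

/tmp/suspend.sh "node[001-004]"
echo "wrapper exit status: $?"   # prints: wrapper exit status: 0
```

Whether swallowing the exit status is wise is debatable; it merely ensures slurmctld never sees a failing SuspendProgram while the log keeps the diagnostics.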

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-8223


signature.asc
Description: This is a digitally signed message part.


Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-20 Thread Stefan Staeglich
Hi Chris,

thank you. I had overlooked that part.

But someone who is actually using a UnkillableStepProgram stated the opposite 
(that it's executed on the controller nodes). Are you aware of any change 
between Slurm releases? Maybe one of the two parts is just a leftover. Are you 
using a UnkillableStepProgram?

Thank you :)

Best,
Stefan

On Friday, 20 January 2023 at 05:59:19 CET, Christopher Samuel wrote:
> On 1/19/23 5:01 am, Stefan Staeglich wrote:
> > Hi,
> 
> Hiya,
> 
> > I'm wondering where the UnkillableStepProgram is actually executed.
> > According to Mike it has to be available on every one of the compute nodes.
> > This makes sense only if it is executed there.
> 
> That's right, it's only executed on compute nodes.
> 
> > But the man page slurm.conf of 21.08.x states:
> > UnkillableStepProgram
> > 
> >Must be executable by user SlurmUser.  The file must be
> > 
> > accessible by the primary and backup control machines.
> > 
> > So I would expect it's executed on the controller node.
> 
> That's strange, my slurm.conf man page from a system still running 21.08
> says:
> 
> UNKILLABLE STEP PROGRAM SCRIPT
> This program can be used to take special actions to clean up
> the unkillable processes and/or notify system administrators.
> The program will be run as SlurmdUser (usually "root") on
> the compute node where UnkillableStepTimeout was triggered.
> 
> Ah, I see, there's a later "FILE AND DIRECTORY PERMISSIONS" part which
> has the text that you've found - that part's wrong! :-)
> 
> All the best,
> Chris


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-8223




Re: [slurm-users] Slurm - UnkillableStepProgram

2023-01-19 Thread Stefan Staeglich
Hi,

I'm wondering where the UnkillableStepProgram is actually executed. According 
to Mike it has to be available on every one of the compute nodes. This makes 
sense only if it is executed there.

But the man page slurm.conf of 21.08.x states:
   UnkillableStepProgram
  Must be executable by user SlurmUser.  The file must be 
accessible by the primary and backup control machines.

So I would expect it's executed on the controller node.

Best,
Stefan

On Tuesday, 23 March 2021 at 05:30:01 CET, Chris Samuel wrote:
> Hi Mike,
> 
> On 22/3/21 7:12 pm, Yap, Mike wrote:
> > # I presume UnkillableStepTimeout is set in slurm.conf. and it act as a
> > timer to trigger UnkillableStepProgram
> 
> That is correct.
> 
> > # UnkillableStepProgram   can be use to send email or reboot compute node
> > – question is how do we configure it ?
> 
> Also - or to automate collecting debug info (which is what we do) and
> then we manually intervene to reboot the node once we've determined
> there's no more useful info to collect.
> 
> It's just configured in your slurm.conf.
> 
> UnkillableStepProgram=/path/to/the/unkillable/step/script.sh
> 
> Of course this script has to be present on every compute node.
> 
> All the best,
> Chris
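A minimal sketch of such a debug-collecting script (the output path, the reliance on SLURM_JOB_ID being present in the script's environment, and the choice of collected data are assumptions for illustration, not Chris's actual script):

```
# Write the hypothetical UnkillableStepProgram, then simulate slurmd
# invoking it on the compute node for job 12345.
cat > /tmp/unkillable.sh <<'EOF'
#!/bin/bash
OUT="/tmp/unkillable-${SLURM_JOB_ID:-unknown}.log"
{
    date
    echo "node: $(hostname)  job: ${SLURM_JOB_ID:-unknown}"
    echo "--- processes in uninterruptible sleep (D state) ---"
    # D-state processes are the usual cause of an unkillable step
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/'
} > "$OUT" 2>&1
EOF
chmod +x /tmp/unkillable.sh

SLURM_JOB_ID=12345 /tmp/unkillable.sh
ls /tmp/unkillable-12345.log
```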


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-8223




Re: [slurm-users] Setting up slurmrestd

2022-06-29 Thread Stefan Staeglich
Hi Karl,

have you found a solution in the meantime?

Best,
Stefan

On Friday, 8 January 2021 at 23:14:34 CEST, Karl Lovink wrote:
> Hi Luke,
> 
> Thanks, it's working now. One last question: is it possible to create
> a non-expiring token? Yes, I know it is not secure.
> Sincerely yours,
> Karl 
> 
> 
> > On 8 Jan 2021, at 20:25, Luke Yeager  wrote:
> > 
> > 
> > There's this: https://slurm.schedmd.com/rest.html
> > 
> >  
> > 
> > From my personal notes:
> > - Install these packages before re-compiling Slurm (Ubuntu 20.04):
> >   libhttp-parser-dev, libjwt-dev, and libyaml-dev
> > - Set up JWT using these instructions: https://slurm.schedmd.com/jwt.html
> > - Create /etc/slurm/slurmrestd.conf with contents like this:
> >   include {{ _sysconf_dir }}/slurm.conf
> >   AuthType=auth/jwt
> > - Run like this: slurmrestd -f /etc/slurm/slurmrestd.conf 0.0.0.0:6820
> > 
> >  
> > 
> > -Original Message-
> > From: slurm-users  On Behalf Of Karl Lovink
> > Sent: Friday, January 8, 2021 11:04 AM
> > To: slurm-us...@schedmd.com
> > Subject: Re: [slurm-users] Setting up slurmrestd
> > 
> > Hello,
> > 
> >  
> > 
> > I'm trying to configure slurmrestd but I haven't been very successful so
> > far. The plan is to use splunk to approach the endpoints. To test the
> > communication with slurmrestd I am now using curl. However, I always get
> > "Authentication failure" back. I think there are two options to
> > authenticate: JWT and an API key. I would like to use the API key.
> > slurmrestd, curl and soon Splunk will run on the same machine. So
> > communication via localhost is sufficient.
> 
> >  
> > 
> > Is there a HOWTO somewhere on configuring this? And can somebody help me
> > set it up?
> 
> >  
> > 
> > Sincerely,
> > Karl
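With JWT set up as in Luke's notes, a quick smoke test from the same machine might look like this. The REST API version segment in the URL depends on the Slurm release (v0.0.36 is a guess for a 20.x-era build), so treat these commands as a sketch rather than a recipe:

```
# Ask slurmctld for a token (requires AuthAltTypes=auth/jwt in slurm.conf):
$ scontrol token lifespan=3600
SLURM_JWT=eyJhbGciOiJIUzI1NiIs...

# Call the REST API on localhost, passing the token in the headers:
$ export SLURM_JWT=eyJhbGciOiJIUzI1NiIs...
$ curl -s -H "X-SLURM-USER-NAME: $USER" \
       -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
       http://localhost:6820/slurm/v0.0.36/ping
```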


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223




[slurm-users] Allow specific users to drain nodes

2022-04-27 Thread Stefan Staeglich
Hi,

we want to allow specific users to drain nodes. This feature seems to be 
implemented in the nonstop plugin, but using that whole plugin for only this 
feature seems to be overkill.

Is there any other plugin that implements this feature?

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223




Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2022-02-18 Thread Stefan Staeglich
Hi Mike,

thank you very much :)

Stefan

On Monday, 7 February 2022 at 16:50:54 CET, Michael Robbert wrote:
> They moved Arbiter2 to Github. Here is the new official repo:
> https://github.com/CHPC-UofU/arbiter2
> 
> Mike
> 
> On 2/7/22, 06:51, "slurm-users"  wrote:
> Hi,
> 
> I've just noticed that the repository https://gitlab.chpc.utah.edu/arbiter2
> seems to be down. Does someone know more?
> 
> Thank you!
> 
> Best,
> Stefan
> 
On Tuesday, 27 April 2021 at 17:35:35 CET, Prentice Bisbal wrote:
> > I think someone asked this same exact question a few weeks ago. The best
> > solution I know of is to use Arbiter, which was created exactly for this
> > situation. It uses cgroups to limit resource usage, but it adjusts those
> > limits based on login node utilization and each user's behavior ("bad"
> > users get their resources limited more severely when they do "bad"
> > things).
> > 
> > I will be deploying it myself very soon.
> > 
> > https://dylngg.github.io/resources/arbiterTechPaper.pdf
> > 
> > Prentice
> > 
> > On 4/23/21 10:37 PM, Cristóbal Navarro wrote:
> > > Hi Community,
> > > I have a set of users still not so familiar with slurm, and yesterday
> > > they bypassed srun/sbatch and just ran their CPU program directly on
> > > the head/login node thinking it would still run on the compute node. I
> > > am aware that I will need to teach them some basic usage, but in the
> > > meanwhile, how have you solved this type of user-behavior problem? Is
> > > there a preferred way to restrict the master/login resources, or
> > > actions,  to the regular users ?
> > > 
> > > many thanks in advance
> 
> --
> Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
> Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany
> 
> E-Mail : staeg...@informatik.uni-freiburg.de
> WWW: gki.informatik.uni-freiburg.de
> Telefon: +49 761 203-8223
> Fax: +49 761 203-8222


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
Fax: +49 761 203-8222




Re: [slurm-users] Increasing /dev/shm max size?

2022-02-18 Thread Stefan Staeglich
Hi Diego,

do you have any new insights regarding this issue?

Best,
Stefan

On Monday, 26 October 2020 at 14:48:17 CET, Diego Zuccato wrote:
> On 22/10/20 12:56, Diego Zuccato wrote:
> > 2) Is the shared memory accounted as belonging to the process and
> > enforced accordingly by cgroups?
> 
> According to some preliminary tests, it seems it's not enforced. Or
> maybe I haven't configured cgroups correctly.
> Hints?


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
Fax: +49 761 203-8222




Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2022-02-07 Thread Stefan Staeglich
Hi,

I've just noticed that the repository https://gitlab.chpc.utah.edu/arbiter2 
seems to be down. Does someone know more?

Thank you!

Best,
Stefan

On Tuesday, 27 April 2021 at 17:35:35 CET, Prentice Bisbal wrote:
> I think someone asked this same exact question a few weeks ago. The best
> solution I know of is to use Arbiter, which was created exactly for this
> situation. It uses cgroups to limit resource usage, but it adjusts those
> limits based on login node utilization and each user's behavior ("bad"
> users get their resources limited more severely when they do "bad" things).
> 
> I will be deploying it myself very soon.
> 
> https://dylngg.github.io/resources/arbiterTechPaper.pdf
> 
> 
> Prentice
> 
> On 4/23/21 10:37 PM, Cristóbal Navarro wrote:
> > Hi Community,
> > I have a set of users still not so familiar with slurm, and yesterday
> > they bypassed srun/sbatch and just ran their CPU program directly on
> > the head/login node thinking it would still run on the compute node. I
> > am aware that I will need to teach them some basic usage, but in the
> > meanwhile, how have you solved this type of user-behavior problem? Is
> > there a preferred way to restrict the master/login resources, or
> > actions,  to the regular users ?
> > 
> > many thanks in advance


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
Fax: +49 761 203-8222




Re: [slurm-users] [External] What is an easy way to prevent users run programs on the master/login node.

2021-06-11 Thread Stefan Staeglich
Hi Prentice,

thanks for the hint. I'm evaluating this too.

It seems that Arbiter doesn't distinguish between RAM that's really in use 
and RAM that's used only as cache. Or is my impression wrong?

Best,
Stefan

On Tuesday, 27 April 2021 at 17:35:35 CEST, Prentice Bisbal wrote:
> I think someone asked this same exact question a few weeks ago. The best
> solution I know of is to use Arbiter, which was created exactly for this
> situation. It uses cgroups to limit resource usage, but it adjusts those
> limits based on login node utilization and each user's behavior ("bad"
> users get their resources limited more severely when they do "bad" things).
> 
> I will be deploying it myself very soon.
> 
> https://dylngg.github.io/resources/arbiterTechPaper.pdf
> 
> 
> Prentice
> 
> On 4/23/21 10:37 PM, Cristóbal Navarro wrote:
> > Hi Community,
> > I have a set of users still not so familiar with slurm, and yesterday
> > they bypassed srun/sbatch and just ran their CPU program directly on
> > the head/login node thinking it would still run on the compute node. I
> > am aware that I will need to teach them some basic usage, but in the
> > meanwhile, how have you solved this type of user-behavior problem? Is
> > there a preferred way to restrict the master/login resources, or
> > actions,  to the regular users ?
> > 
> > many thanks in advance


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
Fax: +49 761 203-8222






[slurm-users] Parent accounts

2021-05-28 Thread Stefan Staeglich
Hi,

for our monitoring system I want to query the account hierarchy. Is there a 
better approach than to parse the output of

sacctmgr list account withasso -nP

?

Something like

sacctmgr list account parent=bla withasso -nP

doesn't work.
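Lacking a server-side filter, one fallback is to pull the whole account/parent table once and group it client-side. The sketch below assumes pipe-separated Account|ParentName output from something like `sacctmgr list association format=Account,ParentName -nP` (the command and format field names are assumptions to be checked against your sacctmgr version); here it runs on a hard-coded sample:

```
# Sample of the assumed "Account|ParentName" output:
assoc='root|
projA|root
projA-sub1|projA
projB|root'

# Group child accounts under their parent account:
tree=$(echo "$assoc" | awk -F'|' '
    $2 != "" { kids[$2] = kids[$2] " " $1 }   # skip the parentless root line
    END { for (p in kids) printf "%s:%s\n", p, kids[p] }' | sort)
echo "$tree"
# prints:
# projA: projA-sub1
# root: projA projB
```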

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
Fax: +49 761 203-8222






Re: [slurm-users] GRES Restrictions

2021-04-15 Thread Stefan Staeglich
Hello,

is there a best practice for activating this feature (setting 
ConstrainDevices=yes)? Do I have to restart the slurmds? Does this affect 
running jobs?

We are using Slurm 19.05.
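For reference, the setting in question is a cgroup.conf entry on the compute nodes. The fragment below is a sketch, and it does not answer the restart question above (restarting slurmd is generally how cgroup.conf changes are picked up, but that should be verified for 19.05):

```
# cgroup.conf (compute nodes, next to slurm.conf)
CgroupAutomount=yes
ConstrainDevices=yes

# gres.conf must list the GPU device files so the cgroup can restrict
# them per job, e.g.:
#   Name=gpu File=/dev/nvidia0
```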

Best,
Stefan

On Tuesday, 25 August 2020 at 17:24:41 CEST, Christoph Brüning wrote:
> Hello,
> 
> we're using cgroups to restrict access to the GPUs.
> 
> What I found particularly helpful, are the slides by Marshall Garey from
> last year's Slurm User Group Meeting:
> https://slurm.schedmd.com/SLUG19/cgroups_and_pam_slurm_adopt.pdf
> (NVML didn't work for us for some reason I cannot recall, but listing
> the GPU device files explicitly was not a big deal)
> 
> Best,
> Christoph
> 
> On 25/08/2020 16.12, Willy Markuske wrote:
> > Hello,
> > 
> > I'm trying to restrict access to gpu resources on a cluster I maintain
> > for a research group. There are two nodes put into a partition with gres
> > gpu resources defined. User can access these resources by submitting
> > their job under the gpu partition and defining a gres=gpu.
> > 
> > When a user includes the flag --gres=gpu:# they are allocated the number
> > of gpus and slurm properly allocates them. If a user requests only 1 gpu
> > they only see CUDA_VISIBLE_DEVICES=1. However, if a user does not
> > include the --gres=gpu:# flag they can still submit a job to the
> > partition and are then able to see all the GPUs. This has led to some
> > bad actors running jobs on all GPUs that other users have allocated and
> > causing OOM errors on the gpus.
> > 
> > Is it possible, and where would I find the documentation on doing so, to
> > require users to define a --gres=gpu:# to be able to submit to a
> > partition? So far reading the gres documentation doesn't seem to have
> > yielded any word on this issue specifically.
> > 
> > Regards,


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-8222






Re: [slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

2021-03-18 Thread Stefan Staeglich
Hi Sven,

I think it makes more sense to adjust the config file 
/etc/slurm-llnl/slurm.conf 
and not the systemd units:
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid

Best,
Stefan

On Wednesday, 17 March 2021 at 19:16:38 CET, Sven Duscha wrote:
> Hi,
> 
> I'm experiencing an error with the SLURM slurmctld on Ubuntu 20.04 when
> starting the service (through systemctl):
> 
> 
> I installed munge and SLURM version 19.05.5-1 through the package manager
> from the default repository:
> 
> apt-get install munge slurm-client slurm-wlm slurm-wlm-doc slurmctld slurmd
> 
> 
> systemctl start slurmctld &
> [1] 2735
> 18:55 [root@slurm ~]# systemctl status slurmctld
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled;
> vendor preset: enabled)
> Active: activating (start) since Wed 2021-03-17 18:55:49 CET; 5s ago
>   Docs: man:slurmctld(8)
>Process: 2737 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)
>  Tasks: 12
> Memory: 2.5M
> CGroup: /system.slice/slurmctld.service
> └─2759 /usr/sbin/slurmctld
> 
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurmctld.pid (yet?) after start: Operation not permitted
> 
> 
> 
> 
> After about 60 seconds slurmctld terminates:
> 
> 
> -- A stop job for unit slurmctld.service has finished.
> --
> -- The job identifier is 1043 and the job result is done.
> Mar 17 18:55:49 slurm systemd[1]: Starting Slurm controller daemon...
> -- Subject: A start job for unit slurmctld.service has begun execution
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit slurmctld.service has begun execution.
> --
> -- The job identifier is 1044.
> Mar 17 18:55:49 slurm systemd[1]: slurmctld.service: Can't open PID file
> /run/slurmctld.pid (yet?) after start: Operation not permitted
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:57:19 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
> 
> 
> 
> 
> My slurm.conf file lists custom PID file locations for slurmctld and slurmd:
> /etc/slurm-llnl/slurm.conf
> 
> SlurmctldPidFile=/run/slurm-llnl/slurmctld.pid
> SlurmdPidFile=/run/slurm-llnl/slurmd.pid
> 
> 
> 
> Starting the slurmctld executable by hand works fine:
> /usr/sbin/slurmctld &
> 
> pgrep slurmctld
> 2819
> [1]+  Done/usr/sbin/slurmctld
> pgrep slurmctld
> 2819
> squeue
>   JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
> sinfo -lNe
> Wed Mar 17 19:01:45 2021
> NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T  MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
> ekgen1    1      cluster*   unknown*  16    2:8:1  48      0         1       (null)    none
> ekgen2    1      cluster*   down*     16    2:8:1  25      0         1       (null)    Not responding
> ekgen3    1      debian     unknown*  16    2:8:1  25      0         1       (null)    none
> ekgen4    1      cluster*   unknown*  16    2:8:1  25      0         1       (null)    none
> ekgen5    1      cluster*   unknown*  16    2:8:1  25      0         1       (null)    none
> ekgen6    1      debian     unknown*  16    2:8:1  25      0         1       (null)    none
> ekgen7    1      cluster*   unknown*  16    2:8:1  25      0         1       (null)    none
> ekgen8    1      debian     down*     16    2:8:1  25      0         1       (null)    Not responding
> ekgen9    1      cluster*   unknown*  16    2:8:1  192000  0         1       (null)    none
> 
> 
> 
> I tried then to modify /lib/systemd/system/slurmd.service
> 
> cp /lib/systemd/system/slurmd.service
> /lib/systemd/system/slurmd.service.orig
> 
> changed
> PIDFile=/run/slurmd.pid
> to
> PIDFile=/run/slurm-llnl/slurmd.pid
> 
> systemctl start slurmctld &
> [1] 1869
> pgrep slurm
> 1875
> squeue
>   JOBID PARTITION NAME USER ST   TIME  NODES NODELIST(REASON)
> 
> after ca. 60 seconds:
> 
> Job for slurmctld.service failed because a timeout was exceeded.
> See "systemctl status slurmctld.service" and "journalctl -xe" for details
> 
> 
> - Subject: A start job for unit packagekit.service has finished successfully
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- A start job for unit packagekit.service has finished successfully.
> --
> -- The job identifier is 586.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: start operation
> timed out. Terminating.
> Mar 17 18:28:08 slurm systemd[1]: slurmctld.service: Failed with result
> 'timeout'.
> -- Subject: Unit failed
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> --
> -- The unit slurmctld.service has entered the 'failed' state with result
> 'timeout'.
> Mar 17 18:28:08 slurm systemd[1]: Failed to start Slurm controller daemon.
> -- Subject: A start job 

[slurm-users] Current status of checkpointing

2020-08-14 Thread Stefan Staeglich
Hi,

what's the current status of the checkpointing support in SLURM? There was a 
CRIU plugin mentioned:
https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf

But it doesn't exist in SLURM 19.05.5 on Ubuntu 20.04. And the manual page 
mentions an OpenMPI plugin only.

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.74,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-74217






Re: [slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Stefan Staeglich
Hi,

everything except /etc/ssl/certs/ca-certificates.crt is ignored. So I've 
copied the certificate to /usr/local/share/ca-certificates/ and ran 
update-ca-certificates.

Now it's working :)
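For the archive, the fix as commands (Debian/Ubuntu; the certificate file name comes from the curl example quoted below):

```
# Add the self-signed certificate to the system CA bundle; this
# regenerates /etc/ssl/certs/ca-certificates.crt, the only file
# libcurl appeared to consult here.
$ sudo cp influxdb.crt /usr/local/share/ca-certificates/influxdb.crt
$ sudo update-ca-certificates
```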

Best,
Stefan

On Friday, 14 August 2020 at 11:42:04 CEST, Stefan Staeglich wrote:
> Hi,
> 
> I try to setup the acct_gather plugin ProfileInfluxDB. Unfortunately our
> influxdb server has a self-signed certificate only:
> [2020-08-14T09:54:30.007] [46.0] error: acct_gather_profile/influxdb
> _send_data: curl_easy_perform failed to send data (discarded). Reason: SSL
> peer certificate or SSH remote key was not OK
> 
> I've copied the certificate to /etc/ssl/certs/ but this doesn't help. But
> this command is working:
> curl 'https://influxdb-server.privat:8086' --cacert /etc/ssl/certs/influxdb.crt
> 
> Has someone a solution for this issue?
> 
> Best,
> Stefan


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.74,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-74217






Re: [slurm-users] Upgrade from Ubuntu 18.04 to 20.04

2020-03-25 Thread Stefan Staeglich
Hi Will,

in this case it should be no problem to upgrade directly to Ubuntu 20.04? It 
ships 19.05; there is no 19.11.

Best,
Stefan

On Monday, 16 March 2020 at 15:41:56 CET, Will Dennis wrote:
> Hi Stefan,
> 
> I have not been able to find any 18.08.x PPAs; I myself have backported the
> latest Debian HPC Team release of 19.05.5 into my PPA -
> https://launchpad.net/~wdennis/+archive/ubuntu/dhpc-backports
> 
> I have also created local packages of 18.08.6.2, but only for Ubuntu 16.04,
> for my own use...
> 
> Could you not use my 19.05 PPA to upgrade your 17.11? I think that meets the
> "no more than 2 releases" rule (17.11 -> 18.08 -> 19.05)
> 
> Best,
> Will
> 
> 
> -Original Message-
> From: slurm-users  On Behalf Of "Stefan Stäglich"
> Sent: Monday, March 16, 2020 8:39 AM
> To: slurm-users@lists.schedmd.com
> Subject: Re: [slurm-users] Upgrade from Ubuntu 18.04 to 20.04
> 
> Hello,
> 
> it seems that there is only a single package in the PPA and the systemd
> units are missing.
> 
> Is there a PPA that provides Slurm 18.08.x split into different packages
> like Ubuntu does?
> 
> Best,
> Stefan
> 
> > Hello,
> > 
> > 
> > 
> > we are running our cluster with Ubuntu 18.04. I thinking already about
> > the upgrade to the coming release of Ubuntu 20.04.
> > 
> > 
> > 
> > I would like to do a in-place upgrade to Ubuntu 20.04 if possible.
> > Ubuntu ships 17.11.x per default and Ubuntu 20.04 will ship 19.11.x.
> > So it won't be possible to update to Ubuntu 20.04 directly. But the
> > PPA ppa:slurm-mainline/ppa ships 18.08.x.
> > 
> > 
> > 
> > So it should be possible to
> > 
> > *   Upgrade to Slurm 18.08.x first via the PPA
> > *   Upgrade to Slurm 19.11.x and Ubuntu 20.04 afterwards
> > *   As always, update the database first, then the scheduler, and
> > then the submit hosts/nodes
> > 
> > 
> > 
> > Has someone experiences with this PPA or any other comments/suggestions?
> > 
> > 
> > 
> > Best,
> > 
> > Stefan Stäglich


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.74,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-74217






Re: [slurm-users] Usage splitting

2019-09-12 Thread Stefan Staeglich
Hi Chris,

I'm not sure how this works; I'm not very experienced with QOS objects.

Do I have to create two QOS objects a and b with UsageThreshold=0.1,
Flags=EnforceUsageThreshold / UsageThreshold=0.9? And do I need two 
different accounts A and B like Daniel suggested, or can I use a single 
account?

All the best,
Stefan

On Monday, 2 September 2019 at 02:46:52 CEST, Chris Samuel wrote:
> On Sunday, 1 September 2019 1:44:26 AM PDT Daniel Letai wrote:
> > This won't do exactly what you want - it might allow 'A' to utilize more
> > than 10%, if the cluster is under utilized.
> 
> An alternative might be to try and use QOS's for this with their own
> fairshare targets and set "EnforceUsageThreshold"  on the QOS that you
> don't want to be able to exceed its limit.
> 
> https://slurm.schedmd.com/sacctmgr.html#lbAW
> 
> All the best,
> Chris
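Chris's suggestion might translate into something like the commands below. The QOS and account names are made up, and the exact semantics of UsageThreshold should be verified against the sacctmgr documentation before relying on this:

```
# QOS for project A: enforce its fairshare-usage limit
$ sacctmgr add qos qos_a UsageThreshold=0.10 Flags=EnforceUsageThreshold
# QOS for project B: no enforced threshold, so it may exceed its share
$ sacctmgr add qos qos_b
# Attach each QOS to the corresponding account
$ sacctmgr modify account proj_a set qos=qos_a defaultqos=qos_a
$ sacctmgr modify account proj_b set qos=qos_b defaultqos=qos_b
```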


-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.74,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-74217






[slurm-users] Usage splitting

2019-08-30 Thread Stefan Staeglich
Hi,

we have some compute nodes paid for by different project owners: 10% are 
owned by project A and 90% are owned by project B.

We want to enforce the following policy over each fixed time period (e.g. 
two weeks):
- Project A doesn't use more than 10% of the cluster in that period
- But project B is allowed to use more than 90%

What's the best way to enforce this?

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.74,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: ml.informatik.uni-freiburg.de
Telefon: +49 761 203-54216
Fax: +49 761 203-54217