[slurm-users] Re: Cgroup

2024-07-22 Thread Ole Holm Nielsen via slurm-users
On 7/22/24 12:05, stth via slurm-users wrote: I am configuring cgroups on my server for the first time. I've created a cgroup.conf file in the Slurm directory with the following values:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=90
AllowedSwapSpace=10
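As an illustration only, the Key=Value settings quoted above can be read and sanity-checked with a short Python sketch. The helper name `parse_cgroup_conf` is hypothetical and not part of Slurm; the sketch assumes one `Key=Value` pair per line and `#` comments.

```python
def parse_cgroup_conf(text):
    """Parse Key=Value lines into a dict, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# The values below are the ones quoted in the message above.
example = """\
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedRAMSpace=90
AllowedSwapSpace=10
"""

conf = parse_cgroup_conf(example)
# AllowedRAMSpace and AllowedSwapSpace are percentages of the allocated memory.
print(conf["AllowedRAMSpace"], conf["AllowedSwapSpace"])
```

A check like this may help spot typos in key names before the daemons are restarted; it does not validate the values against Slurm itself.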

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-18 Thread Ole Holm Nielsen via slurm-users
On 18-07-2024 08:15, William V via slurm-users wrote: yes! that works with the crb repo Thanks for the test, I'm glad the package installation works as expected now! I've corrected the repository documentation in the Wiki page at

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-17 Thread Ole Holm Nielsen via slurm-users
Hi William, Maybe you need to enable the CodeReady Linux Builder (CRB) repository for AlmaLinux/RockyLinux 9? Look for CRB in https://wiki.rockylinux.org/rocky/repo/ The command for EL9 is: dnf config-manager --set-enabled crb For EL8 enable this instead: dnf config-manager --set-enabled

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-17 Thread Ole Holm Nielsen via slurm-users
On 7/17/24 08:43, William V via slurm-users wrote: I had exactly this problem : https://www.reddit.com/r/SLURM/comments/152ef0c/problems_installing_slurm/ Jul 16 11:28:31 occitest slurmd[54981]: slurmd: error: Couldn't find the specified plugin name for cgroup/v2 looking at all files Jul 16

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-16 Thread Ole Holm Nielsen via slurm-users
On 16-07-2024 16:20, William V via slurm-users wrote: How can I propose modifications to the wiki? For example, for RHEL9, it is missing 'dnf install dbus-devel' for compiling with "cgroup v2". On my RockyLinux 9.4 system there was no requirement for the dbus-devel RPM package (it isn't

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread Ole Holm Nielsen via slurm-users
On 7/15/24 11:35, William V via slurm-users wrote: Thank you for your response, I hadn't considered that version 22 could be the problem. I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install .rpm directly because our

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread Ole Holm Nielsen via slurm-users
On 7/15/24 10:43, William VINCENT via slurm-users wrote: I am writing to report an issue with the Slurmctld process on our RHEL 9 (Rocky Linux) . Twice in the past 5 days, the Slurmctld process has encountered an error that resulted in the service stopping. The error message displayed was

[slurm-users] Re: problem with squeue --json with version 24.05.1

2024-07-03 Thread Ole Holm Nielsen via slurm-users
On 7/2/24 18:48, Markus Köberl via slurm-users wrote: $ squeue --version slurm 24.05.1 $ squeue --json malloc(): invalid size (unsorted) Aborted forcing an older data_parser version works: $ squeue --json=v0.0.40 The minimum version requirements are listed in the rest_quickstart guide[1]:

[slurm-users] Re: Help with slurmdbd and Slurmctld

2024-06-17 Thread Ole Holm Nielsen via slurm-users
On 6/17/24 17:56, stth via slurm-users wrote: Hello, I wanted help with a problem I have. I just updated my operating system and slurmdbd and slurmctld are not working anymore. I get these errors: slurmdbd: fatal: Database schema is too old for this version of Slurm to upgrade.

[slurm-users] Re: JWKS File Update - Reread File

2024-06-06 Thread Ole Holm Nielsen via slurm-users
On 06-06-2024 19:50, Pedro Daniel Da Rocha Santos via slurm-users wrote: I am currently implementing a service to monitor for jwks key rotation. If a change is detected, the jwks file is updated. Is it enough to use “scontrol reconfigure” to reread the jwks file? Or does it require a full

[slurm-users] Re: slurmdbd not connecting to mysql (mariadb)

2024-05-29 Thread Ole Holm Nielsen via slurm-users
This might be the firewall blocking communication to slurmdbd? You may perhaps find some useful information in this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/ /Ole On 29-05-2024 23:05, Radhouane Aniba via slurm-users wrote: Hi everyone I am trying to get

[slurm-users] Re: Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.

2024-05-26 Thread Ole Holm Nielsen via slurm-users
On 25-05-2024 03:49, Hongyi Zhao via slurm-users wrote: Ultimately, I found that the cause of the problem was that hyper-threading was enabled by default in the BIOS. If I disable hyper-threading, I observed that the computational efficiency is consistent between using slurm and using mpirun

[slurm-users] Re: Slurm DB upgrade failure behavior

2024-05-17 Thread Ole Holm Nielsen via slurm-users
On 5/16/24 20:27, Yuengling, Philip J. via slurm-users wrote: I'm writing up some Ansible code to manage Slurm software updates, and I haven't found any documentation about slurmdbd behavior if the mysql/mariadb database doesn't upgrade successfully. I would discourage the proposed Slurm

[slurm-users] Re: Removing safely a node

2024-05-17 Thread Ole Holm Nielsen via slurm-users
On 5/17/24 05:16, Ratnasamy, Fritz via slurm-users wrote:  What is the "official" process to remove nodes safely? I have drained the nodes so jobs are completed and put them in down state after they are completely drained. I edited the slurm.conf file to remove the nodes. After some time, I

[slurm-users] Re: srun launched mpi job occasionally core dumps

2024-05-07 Thread Ole Holm Nielsen via slurm-users
On 5/7/24 15:32, Henderson, Brent via slurm-users wrote: Over the past few days I grabbed some time on the nodes and ran for a few hours.  Looks like I **can** still hit the issue with cgroups disabled. Incident rate was 8 out of >11k jobs so dropped an order of magnitude or so.  Guessing

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-19 Thread Ole Holm Nielsen via slurm-users
= 13107200. Does this make sense? Best regards, Ole [1] "How to set limits for services in RHEL and systemd" https://access.redhat.com/solutions/1257953 [2] https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#slurmd-systemd-limits On 4/18/24 11:23, Ole Holm Nielsen wrote

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-18 Thread Ole Holm Nielsen via slurm-users
nged, you'll now get the errno associated with the backlog and can confirm EMFILE vs. ENFILE vs. ENOMEM. -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark, Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark E-mail: ole.h.niel...@fysik.dtu.dk Homep

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users
Hi Bjørn-Helge, On 4/16/24 12:08, Bjørn-Helge Mevik via slurm-users wrote: Ole Holm Nielsen via slurm-users writes: Therefore I believe that the root cause of the present issue is user applications opening a lot of files on our 96-core nodes, and we need to increase fs.file-max. You could
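To judge whether fs.file-max needs raising, one option is to compare the number of allocated file handles with the limit. The Python sketch below assumes a Linux node where /proc/sys/fs/file-nr holds three numbers (allocated handles, allocated-but-unused handles, and the maximum); the function names and the sample values are made up for illustration.

```python
def parse_file_nr(text):
    """Parse the three counters found in /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def usage_fraction(stats):
    """Fraction of the system-wide file handle limit currently allocated."""
    return stats["allocated"] / stats["max"]


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read the live counters on a Linux node."""
    with open(path) as fh:
        return parse_file_nr(fh.read())


# Illustrative sample, not a measurement from the cluster in this thread.
sample = parse_file_nr("4096\t0\t1000000\n")
print(f"{usage_fraction(sample):.2%} of fs.file-max in use")
```

Running `read_file_nr()` periodically on a busy compute node would show how close user applications come to the limit before any change is made.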

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users
e? Questions: Have other sites seen the present Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes? Thanks for sharing your insights, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical Unive

[slurm-users] Munge log-file fills up the file system to 100%

2024-04-15 Thread Ole Holm Nielsen via slurm-users
unge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes? Thanks for sharing your insights, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark -- slurm-users mailing list -- slurm-use

[slurm-users] Re: Lua script

2024-03-20 Thread Ole Holm Nielsen via slurm-users
What are the contents of your /etc/slurm/job_submit.lua file? Did you reconfigure slurmctld? Check the log file with: grep job_submit /var/log/slurm/slurmctld.log What is your Slurm version? You can read about job_submit plugins in this Wiki page:

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Ole Holm Nielsen via slurm-users
Hi Simon, Maybe you could print the user's limits using this tool: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits Which version of Slurm do you run? /Ole On 3/14/24 17:47, Simon Andrews via slurm-users wrote: Our cluster has developed a strange intermittent behaviour

[slurm-users] Re: Slurm management of dual-node server trays?

2024-03-07 Thread Ole Holm Nielsen via slurm-users
ats a very interesting design, and looking at the SD665 V3 documentation, am I correct that each node has dual 25Gb/s SFP28 interfaces? If so, then despite dual nodes in a 1U configuration, you actually have 2 separate servers? Sid On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users, mailto:s

[slurm-users] Slurm management of dual-node server trays?

2024-02-23 Thread Ole Holm Nielsen via slurm-users
and insights! Ole [1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server [2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-infiniband-qsfp112-adapters [3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-connectx-6-hdr-sharedio-lenovo-ser

[slurm-users] Re: URL for how to do for SLURM accounting setup

2024-02-15 Thread Ole Holm Nielsen via slurm-users
On 2/16/24 07:01, John Joseph via slurm-users wrote: we were able to setup a test SLURM based system, with 4 nodes , Ubuntu 22.04 LTS and we were able to run COMSOL using "comsol batch" command Now we plan to have accounting https://slurm.schedmd.com/accounting.html

[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Ole Holm Nielsen via slurm-users
On 1/31/24 09:02, Bjørn-Helge Mevik via slurm-users wrote: This isn't answering your question, but I strongly suggest you build Slurm from source. You can use the provided slurm.spec file to make rpms (we do) or use "configure + make". Apart from being able to upgrade whenever a new version is

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Ole Holm Nielsen
On 1/30/24 09:36, Fokke Dijkstra wrote: We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also

Re: [slurm-users] error

2024-01-18 Thread Ole Holm Nielsen
On 1/18/24 17:42, Felix wrote: I started a new AMD node, and the error is as follows: "CPU frequency setting not configured for this node" extended looks like this: [2024-01-18T18:28:06.682] CPU frequency setting not configured for this node [2024-01-18T18:28:06.691] slurmd started on Thu, 18

Re: [slurm-users] A fairshare policy that spans multiple clusters

2024-01-05 Thread Ole Holm Nielsen
On 05-01-2024 17:26, David Baker wrote: We are soon to install new Slurm cluster at our site. That means that we will have a total of three clusters running Slurm. Only two, that is the new clusters, will share a common file system. The original cluster has its own file system is independent

Re: [slurm-users] How to run one maintenance job on each node in the cluster

2023-12-23 Thread Ole Holm Nielsen
On 23-12-2023 05:09, Jeffrey Tunison wrote: Is there a straightforward way to create a batch job that runs once on every node in the cluster? A technique simpler than generating a list from sinfo output and dispatching the job in a for loop for the N nodes. That’s not very hard, but I

Re: [slurm-users] Slurm compute node with Intel 12th gen CPU

2023-12-20 Thread Ole Holm Nielsen
On 20-12-2023 15:59, Michael Bernasconi wrote: I'm trying to get slurm working on an Intel 12th gen CPU. slurmd instantly fails with the error message "Thread count (24) not multiple of core count (16)". I have tried adding "SlurmdParameters=config_overrides" to slurm.conf, and I have

Re: [slurm-users] install new slurm, no slurmctld found

2023-12-15 Thread Ole Holm Nielsen
Hi Farcas, On 12/15/23 11:00, Felix wrote: we are installing a new server with slurm on ALMA Linux 9.2 Slurm support on EL9 might perhaps be a little less mature than on EL8. we did the following: dnf install slurm The result is rpm -qa | grep slurm slurm-libs-22.05.9-1.el9.x86_64

Re: [slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-13 Thread Ole Holm Nielsen
On 12/13/23 10:44, John Joseph wrote: Thanks for the mail, and sorry for not properly explaining what info I was requesting. What I actually meant was how we could check how well the HPC system I set up is working. E.g. a program which can be run individually on a node, and comparing

Re: [slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-13 Thread Ole Holm Nielsen
On 12/13/23 07:13, John Joseph wrote: We have set up Slurm for an HPC setup of 4 nodes. We want to do a stress test; guidance is requested on getting code which can test the functionality and efficiency of Slurm. If there is such a program, we would like to try it out. Guidance requested Then

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ole Holm Nielsen
On 10-12-2023 17:29, Ryan Novosielski wrote: This is basically always somebody filling up /tmp and /tmp residing on the same filesystem as the actual SlurmdSpoolDirectory. /tmp, without modifications, it’s almost certainly the wrong place for temporary HPC files. Too large. Agreed! That's

Re: [slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Ole Holm Nielsen
Hi Xaver, On 12/8/23 16:00, Xaver Stiensmeier wrote: during a larger cluster run (the same I mentioned earlier 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a directory on the workers that is used for job state information

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
On 12/6/23 11:51, Xaver Stiensmeier wrote: Good idea. Here's our current version: ``` sinfo -V slurm 22.05.7 ``` Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading. There are nice bug fixes in 23.02 mentioned in my

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional though as otherwise the error would've occurred more often (i.e. every time when handling a fail and the command is execute). >> IHTH, Ole -- Ole Holm Niels

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
Hi Xaver, On 12/6/23 09:28, Xaver Stiensmeier wrote: using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in "slurm_update error: Invalid node state specified" when we called: scontrol update NodeName="$1" state=RESUME

Re: [slurm-users] RPC rate limiting for different users

2023-11-28 Thread Ole Holm Nielsen
On 11/28/23 11:59, Cutts, Tim wrote: Is the new rate limiting feature always global for all users, or is there an option, which I’ve missed, to have different settings for different users?  For example, to allow a higher rate from web services which submit jobs on behalf of a large number of

Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Ole Holm Nielsen
On 11/24/23 12:15, Ole Holm Nielsen wrote: On 11/24/23 09:31, Gestió Servidors wrote: Some days ago, I started to configure a new server with SLURM 23.02.5. Yesterday, I read in this mailing list that version 23.11.0 was released, so today I have compiled this latest version. However, after

Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Ole Holm Nielsen
On 11/24/23 09:31, Gestió Servidors wrote: Some days ago, I started to configure a new server with SLURM 23.02.5. Yesterday, I read in this mailing list that version 23.11.0 was released, so today I have compiled this latest version. However, after starting slurmdbd (with a database upgrade),

Re: [slurm-users] slurm comunication between versions

2023-11-23 Thread Ole Holm Nielsen
Hi Felix, On 11/23/23 18:14, Felix wrote: Will slurm-20.02, which is installed on a management node, communicate with slurm-22.05 installed on the worker nodes? They have the same configuration file slurm.conf. Or do the versions have to be the same? Slurm 20.02 was installed manually and slurm

Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Ole Holm Nielsen
On 11/23/23 11:50, Markus Kötter wrote: On 23.11.23 10:56, Schneider, Gerald wrote: I have a recurring problem with allocated TRES, which are not released after all jobs on that node are finished. The TRES are still marked as allocated and no new jobs can be scheduled on that node using those

Re: [slurm-users] SLURM new user query, does SLURM has GUI /Web based management version also

2023-11-19 Thread Ole Holm Nielsen
On 19-11-2023 09:11, Joseph John wrote: I am new user, trying out SLURM Like to check if the SLURM has a GUI/web based management tool also Did you read the Quick Start Administrator Guide at https://slurm.schedmd.com/quickstart_admin.html ? I don't believe there are any Slurm management

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-13 Thread Ole Holm Nielsen
le, On 10/11/2023 15:04, Ole Holm Nielsen wrote: On 11/5/23 21:32, Ward Poelmans wrote: Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11 This might disturb the logic in waitforib.sh, or at least cause some

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Ole Holm Nielsen
On 2/11/2023 09:28, Ole Holm Nielsen wrote: Hi Ward, Thanks a lot for the feedback!  The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package. Can I ask y

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-02 Thread Ole Holm Nielsen
Hi Ward, Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package. Can I ask you how you implement your script as a
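The probing method mentioned here can be sketched in a few lines of Python. This is not the waitforib.sh or lbnl_hw.nhc code; it is a minimal illustration that assumes each state file contains text such as "4: ACTIVE", and the function names are hypothetical.

```python
import glob
import os


def ib_port_states(sysfs_root="/sys/class/infiniband"):
    """Return {'device/port': state_name} for every port found under sysfs."""
    states = {}
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "state")
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            text = fh.read().strip()  # assumed format: "4: ACTIVE"
        name = text.split(":", 1)[-1].strip()
        parts = path.split(os.sep)
        device, port = parts[-4], parts[-2]
        states[f"{device}/{port}"] = name
    return states


def all_ports_active(states):
    """True only if at least one port exists and every port is ACTIVE."""
    return bool(states) and all(s == "ACTIVE" for s in states.values())


if __name__ == "__main__":
    print(ib_port_states())
```

A wrapper that polls `all_ports_active(ib_port_states())` with a timeout before slurmd starts would avoid the dependency on nmcli, at the cost of treating every port on the node as required.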

Re: [slurm-users] RES: RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
and nmcli connection, startup is still pending. ***" PUBLIC -----Original Message----- From: slurm-users On behalf of Ole Holm Nielsen Sent: Wednesday, 1 November 2023 05:19 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] RES: How to delay the start of slurm

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
jobs. We should configure NHC to make site-specific hardware and network checks, for example for Infiniband/OPA network or NVIDIA GPUs. Best regards, Ole On 11/1/23 09:44, Rémi Palancher wrote: Hi Ole, On 30/10/2023 at 13:50, Ole Holm Nielsen wrote: I'm fighting this strange scenario wh

Re: [slurm-users] RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
: ExecStart=/usr/bin/nm-online -s -q and this is causing our problems with Infiniband/OPA networks. This is the reason why we need Max's workaround wait-for-interfaces.service. /Ole -----Original Message----- From: slurm-users On behalf of Ole Holm Nielsen Sent: Tuesday, 31

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-31 Thread Ole Holm Nielsen
On Behalf Of Ole Holm Nielsen Sent: Monday, October 30, 2023 1:56 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up? ◆ This message was sent from a non-UWYO address. Please exercise caution when clicking links

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
Hi Jens, Thanks for your feedback: On 30-10-2023 15:52, Jens Elkner wrote: Actually there is no need for such a script since /lib/systemd/systemd-networkd-wait-online should be able to handle it. It seems that systemd-networkd exists in Fedora FC38 Linux, but not in RHEL 8 and clones,

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
adjustments, but it's pretty straight forward -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark, Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark E-mail: ole.h.niel...@fysik.dtu.dk Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/ Mobile: (+45

[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
workaround? Of course we could remove the Infiniband check in Node Health Check (NHC), but that would not really be acceptable during operations. Thanks for sharing any insights, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] RES: Change something in user's script using job_submit.lua plugin

2023-10-28 Thread Ole Holm Nielsen
, PUBLIC -----Original Message----- From: slurm-users On behalf of Ole Holm Nielsen Sent: Friday, 27 October 2023 03:31 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Change something in user's script using job_submit.lua plugin Hi Paulo, Which Slurm version do you have

Re: [slurm-users] Change something in user's script using job_submit.lua plugin

2023-10-27 Thread Ole Holm Nielsen
Hi Paulo, Which Slurm version do you have, and did you set this in slurm.conf: JobSubmitPlugins=lua ? Perhaps you may find some useful information in this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins /Ole On 26-10-2023 19:07, Paulo Jose Braga

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Ole Holm Nielsen
Hi Tim, I think the scontrol manual page explains the "scontrol reboot" function fairly well: reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>] {ALL|<NodeList>} Reboot the nodes in the system when they become idle using the RebootProgram as

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Ole Holm Nielsen
On 10/13/23 12:22, Taras Shapovalov wrote: Oh, does this mean that no one should use Slurm versions <= 21.08 any more? SchedMD recommends to use the currently supported versions (currently 22.05 or 23.02). Next month 23.11 will be released and 22.05 will become unsupported. The question

Re: [slurm-users] Slurm powersave

2023-10-06 Thread Ole Holm Nielsen
Hi Davide, On 10/5/23 15:28, Davide DelVento wrote: IMHO, "pretending" to power down nodes defies the logic of the Slurm power_save plugin. And it is sure useless ;) But I was using the suggestion from https://slurm.schedmd.com/power_save.html

Re: [slurm-users] Slurm powersave

2023-10-05 Thread Ole Holm Nielsen
Hi Davide, On 10/4/23 23:03, Davide DelVento wrote: I'm experimenting with slurm powersave and I have several questions. I'm following the guidance from https://slurm.schedmd.com/power_save.html and the great presentation from our own

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ole Holm Nielsen
On 29-09-2023 17:33, Ryan Novosielski wrote: I’ll just say, we haven’t done an online/jobs running upgrade recently (in part because we know our database upgrade will take a long time, and we have some processes that rely on -M), but we have done it and it does work fine. So the paranoia isn’t

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ole Holm Nielsen
On 9/28/23 17:58, Groner, Rob wrote: There's 14 steps to upgrading slurm listed on their website, including shutting down and backing up the database.  So far we've only updated slurm during a downtime, and it's been a major version change, so we've taken all the steps indicated. We now want

Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Ole Holm Nielsen
On 9/26/23 14:50, Groner, Rob wrote: There's a builtin slurm command, I can't remember what it is and google is failing me, that will take a compacted list of nodenames and return their full names, and I'm PRETTY sure it will do the opposite as well (what you're asking for). It's probably
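The built-in command being described is most likely `scontrol show hostnames` (expand a compact list) together with `scontrol show hostlist` (compress a list of names). For readers who want to see what the expansion does, here is a simplified Python sketch. It is not Slurm's implementation, the function name is hypothetical, and it only handles one bracket group per name.

```python
import re


def expand_hostlist(expr):
    """Expand e.g. 'node[001-003,007],login1' into individual host names."""
    hosts = []
    # Split on commas that are not inside square brackets.
    for item in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", item)
        if not m:
            hosts.append(item)  # plain name without a bracket group
            continue
        prefix, ranges, suffix = m.groups()
        for part in ranges.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo  # a single number such as "007"
            width = len(lo)  # preserve zero padding
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}{suffix}")
    return hosts


print(expand_hostlist("node[001-003,007],login1"))
```

On a real cluster the scontrol commands are the safer choice, since they follow Slurm's own hostlist rules, including cases this sketch ignores.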

Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Ole Holm Nielsen
On 9/20/23 01:39, Feng Zhang wrote: Restarting the slurmd daemon of the compute node should work, if the node is still online and normal. Probably not. If the filesystem used by the job is hung, the node must probably be rebooted, and the filesystem must be checked. /Ole On Tue, Sep 19,

Re: [slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Ole Holm Nielsen
On 9/19/23 13:59, Felix wrote: Hello I have a job on my system which is running more than its time, more than 4 days. 1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047 The job has state "CG" which means "Completing". The Completing status is explained in "man sinfo".

Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-30 Thread Ole Holm Nielsen
Hi Magnus, On 8/30/23 11:17, Hagdorn, Magnus Karl Moritz wrote: On Wed, 2023-08-30 at 10:38 +0200, Ole Holm Nielsen wrote: This is a very useful example!  I guess that you have also defined EnergyIPMIUsername and EnergyIPMIPassword in acct_gather.conf?  How is the EnergyIPMIPassword protected

Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-30 Thread Ole Holm Nielsen
Hi Magnus, On 8/30/23 10:12, Hagdorn, Magnus Karl Moritz wrote: Yes, but can you share the details of which parameters you configure in this plugin so that you can extract node power?  This doesn't seem obvious to me. not much needs configuring. We have EnergyIPMIFrequency=10

Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-29 Thread Ole Holm Nielsen
Hi Magnus, On 29-08-2023 13:56, Hagdorn, Magnus Karl Moritz wrote: I'm curious to learn about your energy gathering method:  How do you extract node power via IPMI using FreeIPMI (or some other toolset), and how do you configure Slurm for this? We are using the SLURM plugin which is

Re: [slurm-users] bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-29 Thread Ole Holm Nielsen
Hi Magnus, On 8/28/23 10:16, Hagdorn, Magnus Karl Moritz wrote: we recently enabled the energy gathering plugin using the IPMI gatherer with libfreeipmi. We are running the latest slurm 23.02.4 on rocky 8.5. We are getting sporadic buffer overflows in slurmd when it is trying to query the

Re: [slurm-users] Is there any public scientific-workflow example that can be run through Slurm?

2023-08-18 Thread Ole Holm Nielsen
Hi Alper, On 18-08-2023 18:39, Alper Alimoglu wrote: In slurm we can build pipelines using [slurm dependencies][1], which allows us to run workflows. In my work, I have stuck in a point regarding finding a workflow that I can run using Slurm. As an example, I have to use a workflow

Re: [slurm-users] slurm sinfo format memory

2023-07-21 Thread Ole Holm Nielsen
Hi Arsene, On 7/20/23 18:24, Arsene Marian Alain wrote: > I would like to see the following information of my nodes "hostname, total > mem, free mem and cpus". So, I used  ‘sinfo -o "%8n %8m %8e %C"’ but in > the output it shows me the memory in MB like "190560" and I need it in GB > (without
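One possible approach, sketched below in Python, is to post-process the sinfo output and convert the two memory columns from MB to GB. The function names are hypothetical, the sample row is made up apart from the "190560" value quoted above, and the conversion divides by 1024.

```python
def mb_to_gb(value):
    """Convert a megabyte count printed by sinfo to GB; pass non-numbers through."""
    try:
        return f"{int(value) / 1024:.1f}"
    except ValueError:
        return value  # e.g. "N/A" or a header word


def reformat_sinfo(text):
    """Rewrite lines of 'host total_mem free_mem cpus' with memory in GB."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        host, total, free, cpus = fields
        rows.append(f"{host:<8} {mb_to_gb(total):>8} {mb_to_gb(free):>8} {cpus}")
    return rows


# Illustrative input in the shape produced by: sinfo -h -o "%8n %8m %8e %C"
sample = "node001  190560   N/A      0/40/0/40\n"
for row in reformat_sinfo(sample):
    print(row)
```

In practice the text would come from running sinfo through `subprocess.run(..., capture_output=True, text=True)` rather than from a literal string.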

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-20 Thread Ole Holm Nielsen
Hi Lorenzo, On 7/20/23 12:16, Lorenzo Bosio wrote: > One more thing I'd like to point out, is that I need to monitor jobs going > from pending to running state (after waiting in the jobs queue). I > currently have a separate pthread to achieve this, but I think at this > point the

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Ole Holm Nielsen
Hi Lorenzo, On 7/19/23 14:22, Lorenzo Bosio wrote: > I'm developing a job submit plugin to check if some conditions are met > before a job runs. > I'd need a way to notify the user about the plugin actions (i.e. why its > jobs was killed and what to do), but after a lot of research I could only

Re: [slurm-users] Job step do not take the hole allocation

2023-06-30 Thread Ole Holm Nielsen
On 6/30/23 08:41, Tommi Tervo wrote: This was an annoying change: 22.05.x RELEASE_NOTES: -- srun will no longer read in SLURM_CPUS_PER_TASK. This means you will implicitly have to specify --cpus-per-task on your srun calls, or set the new SRUN_CPUS_PER_TASK env var to accomplish the
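The workaround named in the release note, setting SRUN_CPUS_PER_TASK, can be illustrated with a small Python wrapper that copies the value from SLURM_CPUS_PER_TASK before launching srun. The function name `srun_env` and the application path in the comment are hypothetical.

```python
import os


def srun_env(env=None):
    """Return a copy of the environment with SRUN_CPUS_PER_TASK filled in.

    The value is copied from SLURM_CPUS_PER_TASK only when that variable is
    set and SRUN_CPUS_PER_TASK has not already been chosen explicitly.
    """
    env = dict(os.environ if env is None else env)
    cpus = env.get("SLURM_CPUS_PER_TASK")
    if cpus and "SRUN_CPUS_PER_TASK" not in env:
        env["SRUN_CPUS_PER_TASK"] = cpus
    return env


# Hypothetical use inside a batch job:
#   subprocess.run(["srun", "./my_app"], env=srun_env(), check=True)
print(srun_env({"SLURM_CPUS_PER_TASK": "4"})["SRUN_CPUS_PER_TASK"])
```

Passing `--cpus-per-task` on each srun call is the other option the release note gives; the wrapper only saves repeating it.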

Re: [slurm-users] monitoring and accounting

2023-06-12 Thread Ole Holm Nielsen
Hi Andrew, On 6/12/23 01:43, Andrew Elwell wrote: Are your slurm to influx scripts publicly available anywhere? I do something similar for squeue via python subprocess to call squeue -M all -a -o "%P,%a,%u,%D,%q,%T,%r" And some sinfo calls for node/cpu usage: sinfo -M {} -o "%P,%a,%F" sinfo
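The comma-separated squeue format quoted above lends itself to simple parsing. The Python sketch below is not Andrew's script; the field labels, function names, and sample rows are assumptions chosen for illustration.

```python
from collections import Counter

# Labels for the squeue format "%P,%a,%u,%D,%q,%T,%r".
FIELDS = ["partition", "account", "user", "nodes", "qos", "state", "reason"]


def parse_squeue(text):
    """Turn squeue output in the format above into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        # The reason field may itself contain commas, so split at most 6 times.
        parts = line.strip().split(",", len(FIELDS) - 1)
        if len(parts) != len(FIELDS) or not parts[3].isdigit():
            continue  # skips the header row and unrelated lines
        job = dict(zip(FIELDS, parts))
        job["nodes"] = int(job["nodes"])
        jobs.append(job)
    return jobs


def nodes_by_state(jobs):
    """Sum the node counts per job state."""
    totals = Counter()
    for job in jobs:
        totals[job["state"]] += job["nodes"]
    return dict(totals)


# Made-up sample rows.
sample = """PARTITION,ACCOUNT,USER,NODES,QOS,STATE,REASON
batch,physics,alice,2,normal,RUNNING,None
batch,physics,bob,4,normal,PENDING,Resources
"""
print(nodes_by_state(parse_squeue(sample)))
```

The resulting dictionaries could then be written to a time-series database; that part is left out because it depends on the client library in use.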

Re: [slurm-users] Temporary Stop User Submission

2023-05-26 Thread Ole Holm Nielsen
On 5/26/23 01:56, Markuske, William wrote: I would but unfortunately they were creating 100s of TBs of data and I need them to log in and delete it but I don't want them creating more in the meantime. Does your filesystem have disk quotas? Using disk quotas would seem to be a good choice to

Re: [slurm-users] Temporary Stop User Submission

2023-05-26 Thread Ole Holm Nielsen
On 5/26/23 01:29, Doug Meyer wrote: I always like Sacctmgr update user where user= set grpcpus=0 This GrpCPUs group limit may perhaps affect the entire group? Anyway, GrpCPUs is undocumented in the sacctmgr manual page, for which I opened a bug in

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 15:23, Roger Mason wrote: NodeName=node012 CoresPerSocket=2 CPUAlloc=0 CPUTot=4 CPULoad=N/A AvailableFeatures=(null) ActiveFeatures=(null) Gres=(null) NodeAddr=node012 NodeHostName=node012 RealMemory=10193 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Ole Holm Nielsen
On 5/25/23 13:59, Roger Mason wrote: slurm 20.02.7 on FreeBSD. Uh, that's old! I have a couple of nodes stuck in the drain state. I have tried scontrol update nodename=node012 state=down reason="stuck in drain state" scontrol update nodename=node012 state=resume without success. I then

Re: [slurm-users] Limit run time of interactive jobs

2023-05-08 Thread Ole Holm Nielsen
On 5/8/23 08:39, Bjørn-Helge Mevik wrote: Angel de Vicente writes: But one possible way to something similar is to have a partition only for interactive jobs and a different partition for batch jobs, and then enforce that each job uses the right partition. In order to do this, I think we can

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-01 Thread Ole Holm Nielsen
On 5/1/23 12:08, Angel de Vicente wrote: Hello Ole, Ole Holm Nielsen writes: As Brian wrote: On a technical note: slurm keeps the detailed accounting data for each cluster in separate TABLES within a single database. In the Federation page https://urldefense.com/v3/__https

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-01 Thread Ole Holm Nielsen
Hi Angel, On 5/1/23 11:28, Angel de Vicente wrote: Ole Holm Nielsen writes: If I read Brian's comments correctly, he's saying that Slurm already has a well-tested and documented solution for multi-cluster sites: Federated clusters. Thanks Ole. Don't get me wrong, I have nothing against

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-05-01 Thread Ole Holm Nielsen
On 5/1/23 09:22, Angel de Vicente wrote: This is the first time that I'm installing Slurm, so things are not very clear to me yet (even more so for multi-cluster operation). Brian Andrus writes: You can do it however you like. You asked if there was a good or existing way to do it easily,

Re: [slurm-users] Several slurmdbds against one mysql server?

2023-04-29 Thread Ole Holm Nielsen
On 29-04-2023 11:44, Angel de Vicente wrote: Hello, I'm setting Slurm in a number of machines and (at least for the moment) we don't plan to let users submit across machines, so the initial plan was to install Slurm+slurmdbd+mysql in every machine. But in order to get stats for all the

Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-29 Thread Ole Holm Nielsen
sting more minutes than the GrpTRESMins limit should not be permitted to start. /Ole On Apr 28, 2023, at 6:43 AM, Ole Holm Nielsen wrote: Hi Hoot, I'm glad that you have figured out that GrpTRESMins is working as documented and kills running jobs when the limit is exceeded. This would

Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-28 Thread Ole Holm Nielsen
help. On Apr 24, 2023, at 1:55 PM, Ole Holm Nielsen wrote: On 24-04-2023 18:33, Hoot Thompson wrote: In my reading of the Slurm documentation, it seems that exceeding the limits set in GrpTRESMins should result in terminating a running job. However, in testing this, The ‘current value

Re: [slurm-users] Terminating Jobs based on GrpTRESMins

2023-04-24 Thread Ole Holm Nielsen
On 24-04-2023 18:33, Hoot Thompson wrote: In my reading of the Slurm documentation, it seems that exceeding the limits set in GrpTRESMins should result in terminating a running job. However, in testing this, The ‘current value’ of the GrpTRESMins only updates upon job completion and is not

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Ole Holm Nielsen
On 4/24/23 08:56, Purvesh Parmar wrote: Thank you.. will try this and get back. Any other step being missed here for migration? I don't know if any steps are missing, because I never tried moving a cluster like you want to do. /Ole On Mon, 24 Apr 2023 at 12:08, Ole Holm Nielsen

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-24 Thread Ole Holm Nielsen
if it works or not. Then you can always make multiple attempts without breaking anything. Best regards, Ole On Mon, 24 Apr 2023 at 11:25, Ole Holm Nielsen wrote: On 4/24/23 06:58, Purvesh Parmar wrote: thank you, but its change

Re: [slurm-users] Migration of slurm communication network / Steps / how to

2023-04-23 Thread Ole Holm Nielsen
On 4/24/23 06:58, Purvesh Parmar wrote: thank you, but it's a change of hostnames as well, apart from the IP addresses, of the slurm server, database server and slurmd compute nodes. I suggest that you talk to your networking people and request that the old DNS names be

Re: [slurm-users] sview not installed

2023-04-23 Thread Ole Holm Nielsen
On 23-04-2023 02:43, mohammed shambakey wrote: I installed slurm 23.11.0-0rc1, and sview is not installed, even though it exists in /src/sview/sview. I can execute it from that path but not /bin (because it does not exist there). I tried just copying it to /bin, but it complained about being

Re: [slurm-users] Resource LImits

2023-04-21 Thread Ole Holm Nielsen
chedMD is the best way to get consulting services, get general help, and report bugs. We have excellent experiences with SchedMD support (https://www.schedmd.com/support.php). Best regards, Ole On Thu, Apr 20, 2023 at 2:11 AM Ole Holm Nielsen wrote:

Re: [slurm-users] Resource LImits

2023-04-20 Thread Ole Holm Nielsen
missing something or I don’t understand how it’s supposed to work. Hoot On Apr 20, 2023, at 2:10 AM, Ole Holm Nielsen wrote: Hi Hoot, On 4/20/23 00:15, Hoot Thompson wrote: Is there a ‘how to’ or recipe document for setting up and enforcing resource limits? I can establish accounts, users

Re: [slurm-users] Resource LImits

2023-04-20 Thread Ole Holm Nielsen
Hi Hoot, On 4/20/23 00:15, Hoot Thompson wrote: Is there a ‘how to’ or recipe document for setting up and enforcing resource limits? I can establish accounts, users, and set limits but 'current value' is not incrementing after running jobs. I have written about resource limits in this Wiki

Re: [slurm-users] Submit sbatch to multiple partitions

2023-04-17 Thread Ole Holm Nielsen
On 4/17/23 11:36, Xaver Stiensmeier wrote: let's say I want to submit a large batch job that should run on 8 nodes. I have two partitions, each holding 4 nodes. Slurm will now tell me that "Requested node configuration is not available". However, my desired output would be that slurm makes use

Re: [slurm-users] Slurmdbd High Availability

2023-04-13 Thread Ole Holm Nielsen
On 4/13/23 11:49, Shaghuf Rahman wrote: I am setting up Slurmdb in my system and I need some inputs My current setup is like server1 : 192.168.123.12(slurmctld) server2: 192.168.123.13(Slurmctld) server3: 192.168.123.14(Slurmdbd) which is pointing to both Server1 and Server2. database: MySQL

Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-29 Thread Ole Holm Nielsen
gards, Ole On 3/29/23 14:16, Dr. Thomas Orgis wrote: Am Mon, 27 Mar 2023 13:17:01 +0200 schrieb Ole Holm Nielsen : FYI: Slurm power_save works very well for us without the issues that you describe below. We run Slurm 22.05.8, what's your version? I'm sure that there are setups where it wo

Re: [slurm-users] [External] Power saving method selection for different kinds of hardware

2023-03-27 Thread Ole Holm Nielsen
to suspend/resume each host should be very doable, if not ideal. Prentice On 11/8/22 09:36, Ole Holm Nielsen wrote: I'm thinking about the best way to configure power saving (see https://slurm.schedmd.com/power_save.html) when we have different types of node hardware whose power state have

Re: [slurm-users] error: power_save module disabled, NULL SuspendProgram

2023-03-27 Thread Ole Holm Nielsen
. Things are too intertangled, even with my simple concept of 'job' not beginning to describe what Slurm has in terms of various steps as scheduling entities that by default also use delayed allocation techniques (regarding prolog script behaviour, for example). Alrighty then, Thomas -- Ole Holm
