[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-24 Thread Paul Edmon via slurm-users
You might need to do some tuning on your backfill loop, as that loop should be the one that backfills in those lower priority jobs.  I would also look to see whether those lower priority jobs will actually fit in prior to the higher priority job running; they may not. -Paul Edmon- On 9/24/24 2:19
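
A hedged starting point for that backfill tuning in slurm.conf (the values shown are illustrative, not a recommendation):

    SchedulerParameters=bf_continue,bf_interval=30,bf_window=2880,bf_max_job_test=1000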

[slurm-users] Re: Nodelist syntax and semantics

2024-09-05 Thread Paul Edmon via slurm-users
;). If one or more numeric expressions are included, one of them must be at the end of the name (e.g. "unit[0-31]rack" is invalid), but arbitrary names can always be used in a comma-separated list." -Paul Edmon- On 9/5/24 3:24 PM, Jackson, Gary L. via slurm-users wrote: Is ther
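
The quoted expansion rules can be exercised with scontrol (the hostlists are illustrative):

    scontrol show hostnames "rack[1-2]unit[0-3]"   # valid: a numeric range ends the name
    scontrol show hostnames "unit[0-31]rack"       # invalid per the rule quoted above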

[slurm-users] Re: salloc not starting shell despite LaunchParameters=use_interactive_step

2024-09-05 Thread Paul Edmon via slurm-users
It's definitely working for 23.11.8, which is what we are using. -Paul Edmon- On 9/5/24 10:22 AM, Loris Bennett via slurm-users wrote: Jason Simms via slurm-users writes: Ours works fine, however, without the InteractiveStepOptions parameter. My assumption is also that default value should
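
For reference, the slurm.conf setting under discussion; the InteractiveStepOptions default shown is an assumption based on the documentation:

    LaunchParameters=use_interactive_step
    # implied default: InteractiveStepOptions="--interactive --preserve-env --pty $SHELL"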

[slurm-users] Re: Print Slurm Stats on Login

2024-08-29 Thread Paul Edmon via slurm-users
Thanks. I've made that fix. -Paul Edmon- On 8/28/24 5:42 PM, Davide DelVento wrote: Thanks everybody once again and especially Paul: your job_summary script was exactly what I needed, served on a golden plate. I just had to modify/customize the date range and change the following line (I

[slurm-users] Re: Print Slurm Stats on Login

2024-08-27 Thread Paul Edmon via slurm-users
tion about this. Lots of great ideas. -Paul Edmon- On 8/9/24 12:04 PM, Jeffrey T Frey wrote: You'd have to do this within e.g. the system's bashrc infrastructure. The simplest idea would be to add to e.g. /etc/profile.d/zzz-slurmstats.sh and have some canned commands/scripts
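
A minimal sketch of that profile.d approach (file name, date range, and output format are illustrative):

    # /etc/profile.d/zzz-slurmstats.sh
    if [ -n "$PS1" ] && command -v sacct >/dev/null 2>&1; then
        sacct -nX -u "$USER" -S "$(date -d '7 days ago' +%F)" \
              --format=JobID,JobName%18,Elapsed,State 2>/dev/null | tail -5
    fi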

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
e use Reframe for our testing: https://github.com/fasrc/reframe-fasrc). -Paul Edmon- On 8/26/2024 3:28 PM, Ole Holm Nielsen via slurm-users wrote: On 26-08-2024 20:30, Paul Edmon via slurm-users wrote: I haven't seen any behavior like that. For reference we are running Rocky 8.9 with MOFED 23.10.

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
I haven't seen any behavior like that.  For reference we are running Rocky 8.9 with MOFED 23.10.2 -Paul Edmon- On 8/26/2024 2:23 PM, Ole Holm Nielsen via slurm-users wrote: Hi Paul, On 26-08-2024 15:29, Paul Edmon via slurm-users wrote: We've had this exact hardware for years no

[slurm-users] Re: Slurm management of Lenovo SD665 V3 dual-server trays?

2024-08-26 Thread Paul Edmon via slurm-users
issue. That said you are free to reboot either node without loss of connectivity. We do that all the time with no issues. As noted, though, if you want to actually physically service the nodes, then you have to take out both. -Paul Edmon- On 8/26/2024 8:51 AM, Ole Holm Nielsen via slurm-users

[slurm-users] Re: How to select a container runtime system?

2024-08-23 Thread Paul Edmon via slurm-users
Containers -Paul Edmon- On 8/23/24 2:21 PM, wdennis--- via slurm-users wrote: We are getting a few calls to support container workloads on our Slurm cluster; I want to support these user's usecases, so am looking into it now. The problem for me is, I'm not super-familiar with containe

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-12 Thread Paul Edmon via slurm-users
Ah, that's even more fun. I know with Singularity you can launch MPI applications by calling MPI outside of the container and then having it link to the internal version: https://docs.sylabs.io/guides/3.3/user-guide/mpi.html  Not sure about docker though. -Paul Edmon- On 8/12/2024 10:

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-12 Thread Paul Edmon via slurm-users
hostlist, your ranks may not end up properly bound to the specific cores they are supposed to be allocated. So definitely proceed with caution and validate your ranks are being laid out properly, as you will be relying on mpirun/mpiexec to bootstrap rather than the scheduler. -Paul Edmon- On 8

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-12 Thread Paul Edmon via slurm-users
l way to do it if you need to would be the scontrol show hostnames command against the $SLURM_JOB_NODELIST (https://slurm.schedmd.com/scontrol.html#OPT_hostnames). That will give you the list of hosts your job is set to run on. -Paul Edmon- On 8/12/2024 8:34 AM, Jeffrey Layton via slurm-users
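
A sketch of that conversion into an mpirun-style host list (my_mpi_app is a placeholder):

    hosts=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | paste -sd,)
    mpirun -np "$SLURM_NTASKS" -host "$hosts" ./my_mpi_app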

[slurm-users] Re: Annoying canonical question about converting SLURM_JOB_NODELIST to a host list for mpirun

2024-08-09 Thread Paul Edmon via slurm-users
as an environment variable. -Paul Edmon- On 8/9/2024 12:34 PM, Jeffrey Layton via slurm-users wrote: Good afternoon, I know this question has been asked a million times, but what is the canonical way to convert the list of nodes for a job that is contained in a Slurm variable,

[slurm-users] Re: Print Slurm Stats on Login

2024-08-09 Thread Paul Edmon via slurm-users
Yup, we have that installed already. It's been very beneficial for overall monitoring. -Paul Edmon- On 8/9/2024 12:27 PM, Reid, Andrew C.E. (Fed) wrote: Maybe a heavier lift than you had in mind, but check out xdmod, open.xdmod.org. It was developed by the NSF as part of th

[slurm-users] Re: Print Slurm Stats on Login

2024-08-09 Thread Paul Edmon via slurm-users
Yeah, I was contemplating doing that so I didn't have a dependency on the scheduler being up or down or busy. What I was more curious about is whether anyone had any prebaked scripts for that. -Paul Edmon- On 8/9/2024 12:04 PM, Jeffrey T Frey wrote: You'd have to do this withi

[slurm-users] Print Slurm Stats on Login

2024-08-09 Thread Paul Edmon via slurm-users
curious what other sites do and if they would be willing to share their scripts and methodology. -Paul Edmon- -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Paul Edmon via slurm-users
I think this would be a good feature request. At least to me everything you can get in scontrol show job should be in sacct in some form. -Paul Edmon- On 8/7/2024 9:29 AM, Steffen Grunewald wrote: On Wed, 2024-08-07 at 08:55:21 -0400, Slurm users wrote: Warning on that one, it can eat up a

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Paul Edmon via slurm-users
. -Paul Edmon- On 8/7/2024 8:51 AM, Juergen Salk via slurm-users wrote: Hi Steffen, not sure if this is what you are looking for, but with `AccountingStoreFlags=job_env` set in slurm.conf, the batch job environment will be stored in the accounting database and can later be retrieved with `sacct -j
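
The pieces involved, assuming a Slurm version whose sacct supports the --env-vars option:

    # slurm.conf
    AccountingStoreFlags=job_env
    # later retrieval (job ID is illustrative)
    sacct -j 12345 --env-vars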

[slurm-users] Re: Find out submit host of past job?

2024-08-07 Thread Paul Edmon via slurm-users
That looks to be the case from my glance at sacct. Not everything in scontrol show job ends up in sacct, which is a bit frustrating at times. -Paul Edmon- On 8/7/2024 8:08 AM, Steffen Grunewald via slurm-users wrote: Hello everyone, I've grepped the manual pages and crawled the 

[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-09 Thread Paul Edmon via slurm-users
when the job ends the user's session will also end. However if the user has no job on that node, then they can ssh as normal to that host without any problem. -Paul Edmon- On 7/8/2024 5:48 PM, Chris Taylor via slurm-users wrote: On my Rocky9 cluster I got this to work fine also- Added a

[slurm-users] Re: Unsupported RPC version by slurmctld 19.05.3 from client slurmd 22.05.11

2024-06-17 Thread Paul Edmon via slurm-users
https://slurm.schedmd.com/upgrades.html#compatibility_window Looks like no. You have to be within two major releases. -Paul Edmon- On 6/17/24 5:40 AM, ivgeokig via slurm-users wrote: Hello!     I have a question. I have the server 19.05.3. No chance to upgrade it.   Have I any chance to

[slurm-users] Re: need to set From: address for slurm

2024-06-07 Thread Paul Edmon via slurm-users
There is no way to do it in slurm. You have to do it in the mail program you are using to send mail. In our case we use postfix and we set smtp_generic_maps to accomplish this. -Paul Edmon- On 6/7/2024 3:33 PM, Vanhorn, Mike via slurm-users wrote: All, When the slurm daemon is sending out
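
A sketch of the postfix side (addresses and hostnames are placeholders):

    # /etc/postfix/main.cf
    smtp_generic_maps = hash:/etc/postfix/generic
    # /etc/postfix/generic
    slurm@node01.cluster.local    hpc-notifications@example.com
    # then: postmap /etc/postfix/generic && postfix reload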

[slurm-users] Re: dynamical configuration || meta configuration mgmt

2024-05-29 Thread Paul Edmon via slurm-users
u are using a QoS to manage this (which I am assuming you are), I would use sacctmgr. As for a framework that does the state inspection, I'm not aware of one. You could do it via cron and batch scripts to do the state inspection. I don't know if someone has something more sophisticated

[slurm-users] HPC Principal System Engineer at the Broad

2024-04-25 Thread Paul Edmon via slurm-users
A friend asked me to pass this along. Figured some folks on this list might be interested. https://broadinstitute.avature.net/en_US/careers/JobDetail/HPC-Principal-System-Engineer/17773 -Paul Edmon- -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Paul Edmon via slurm-users
Usually to clear jobs like this you have to reboot the node they are on. That will then force the scheduler to clear them. -Paul Edmon- On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote: We are running a slurm cluster with version `slurm 22.05.8`. One of our users has reported
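
One way to schedule that reboot through Slurm itself (node name and reason are illustrative):

    scontrol reboot ASAP nextstate=RESUME reason="clear stuck completing jobs" node01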

[slurm-users] Re: Avoiding fragmentation

2024-04-09 Thread Paul Edmon via slurm-users
it to force jobs to one side of the partition, though generally the scheduler does this automatically. -Paul Edmon- On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote: Agree with that.   Plus, of course, even if the jobs run a bit slower by not having all the cores on a single node, they wi

[slurm-users] Re: FairShare priority questions

2024-03-27 Thread Paul Edmon via slurm-users
would be my recommendation. This is how we handle fairshare at FASRC: https://docs.rc.fas.harvard.edu/kb/fairshare/ As we use Classic Fairshare. You will need to enable this: https://slurm.schedmd.com/slurm.conf.html#OPT_NO_FAIR_TREE as Fair Tree is on by default. -Paul Edmon- On 3/27/2024 9
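
The slurm.conf lines that selection implies:

    PriorityType=priority/multifactor
    PriorityFlags=NO_FAIR_TREE   # disables the default Fair Tree algorithm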

[slurm-users] Slurm Utilities

2024-03-13 Thread Paul Edmon via slurm-users
utput for slurm partition information stdg: https://github.com/fasrc/stdg Slurm test deck generator prometheus-slurm-exporter: https://github.com/fasrc/prometheus-slurm-exporter  Slurm exporters for prometheus Hopefully people find these useful. Pull requests are always appreciated. -Paul

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
He's talking about recent versions of Slurm which now have this option: https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step -Paul Edmon- On 2/28/2024 10:46 AM, Paul Raines wrote: What do you mean "operate via the normal command line"?  When you salloc, you a

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
but swapped to salloc a few years back and haven't had any issues. -Paul Edmon- On 2/28/2024 10:17 AM, wdennis--- via slurm-users wrote: Hi list, In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun ..."

[slurm-users] Re: Question about IB and Ethernet networks

2024-02-26 Thread Paul Edmon via slurm-users
poses. So we haven't heavily invested in a high speed ethernet backbone but instead invested in IB. To invest in both seems to me to be overkill, you should focus on one or the other unless you have the cash to spend and a good use case. -Paul Edmon- On 2/26/24 7:07 AM, Dan Healy via s

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Paul Edmon via slurm-users
Are you using the job_script storage option? If so then you should be able to get at it by doing: sacct -B -j JOBID https://slurm.schedmd.com/sacct.html#OPT_batch-script -Paul Edmon- On 2/16/2024 2:41 PM, Jason Simms via slurm-users wrote: Hello all, I've used the "scon
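
The storage side that has to be enabled first (Slurm 21.08 or later):

    # slurm.conf
    AccountingStoreFlags=job_script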

[slurm-users] Re: Naive SLURM question: equivalent to LSF pre-exec

2024-02-14 Thread Paul Edmon via slurm-users
You probably want the Prolog option: https://slurm.schedmd.com/slurm.conf.html#OPT_Prolog along with: https://slurm.schedmd.com/slurm.conf.html#OPT_ForceRequeueOnFail -Paul Edmon- On 2/14/2024 8:38 AM, Cutts, Tim via slurm-users wrote: Hi, I apologise if I’ve failed to find this in the
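
A minimal Prolog sketch; healthcheck is a hypothetical site script, and by default a non-zero Prolog exit drains the node and requeues the batch job:

    #!/bin/bash
    # /etc/slurm/prolog.sh
    /usr/local/bin/healthcheck || exit 1
    exit 0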

Re: [slurm-users] Two jobs each with a different partition running on same node?

2024-01-29 Thread Paul Edmon
t is some obscure option. -Paul Edmon- On 1/29/2024 9:25 AM, Loris Bennett wrote: Hi, I seem to remember that in the past, if a node was configured to be in two partitions, the actual partition of the node was determined by the partition associated with the jobs running on it. Moreover, at an

Re: [slurm-users] preemptable queue

2024-01-12 Thread Paul Edmon
ry setting that default of PreemptMode=CANCEL and then set specific PreemptModes for all your partitions. That's what we do and it works for us. -Paul Edmon- On 1/12/2024 10:33 AM, Davide DelVento wrote: Thanks Paul, I don't understand what you mean by having a typo somewhere. I mean,
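
An illustrative shape for that configuration (partition names and node lists are placeholders):

    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL
    PartitionName=scavenger Nodes=node[01-10] PriorityTier=1  PreemptMode=CANCEL
    PartitionName=normal    Nodes=node[01-10] PriorityTier=10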

Re: [slurm-users] preemptable queue

2024-01-12 Thread Paul Edmon
At least in the example you are showing you have PreemptType commented out, which means it will return the default. PreemptMode Cancel should work, I don't see anything in the documentation that indicates it wouldn't.  So I suspect you have a typo somewhere in your conf. -Paul Edmon

Re: [slurm-users] Beginner admin question: Prioritization within a partition based on time limit

2024-01-09 Thread Paul Edmon
will work best for the policy you want to implement. -Paul Edmon- On 1/9/2024 10:43 AM, Kenneth Chiu wrote: I'm just learning about slurm. I understand that different different partitions can be prioritized separately, and can have different max time limits. I was wondering whether or not t

Re: [slurm-users] GPU Card Reservation?

2023-12-15 Thread Paul Edmon
t. A partition would be all or nothing for a node so that would not work. -Paul Edmon- On 12/15/23 12:16 PM, Jason Simms wrote: Hello all, At least at one point, I understood that it was not particularly possible, or at least not elegant, to provide priority preempt access to a specific GPU

Re: [slurm-users] Disabling SWAP space will it effect SLURM working

2023-12-11 Thread Paul Edmon
We've been running for years with swap disabled and no issues. You may want to set MemSpecLimit in your config to reserve memory for the OS, so that you don't OOM the system with user jobs: https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit -Paul Edmon- On 12/11/202
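
An illustrative node definition reserving 8 GB for the OS (counts and sizes are placeholders):

    NodeName=node[01-10] CPUs=64 RealMemory=257000 MemSpecLimit=8192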

Re: [slurm-users] enabling job script archival

2023-10-03 Thread Paul Edmon
You will probably need to. The way we handle it is that we add users when they first submit a job, via the job_submit.lua script. This way the database autopopulates with active users. -Paul Edmon- On 10/3/23 9:01 AM, Davide DelVento wrote: By increasing the slurmdbd verbosity level, I got
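
The manual equivalent of what such a job_submit.lua automates (account and user names are placeholders):

    sacctmgr -i add account lab_smith
    sacctmgr -i add user name=jharvard account=lab_smith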

Re: [slurm-users] enabling job script archival

2023-10-02 Thread Paul Edmon
At least in our setup, users can see their own scripts by doing sacct -B -j JOBID I would make sure that the scripts are being stored and how you have PrivateData set. -Paul Edmon- On 10/2/2023 10:57 AM, Davide DelVento wrote: I deployed the job_script archival and it is working, however it

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Paul Edmon
paranoia we generally stop everything. The entire process takes about an hour start to finish, with the longest part being the pausing of all the jobs. -Paul Edmon- On 9/29/2023 9:48 AM, Groner, Rob wrote: I did already see the upgrade section of Jason's talk, but it wasn't much abo

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
ssion which helps with the on-disk size. Raw and uncompressed, our database is about 90G.  We keep 6 months of data in our active database. -Paul Edmon- On 9/28/2023 1:57 PM, Ryan Novosielski wrote: Sorry for the duplicate e-mail in a short time: do you know (or anyone) when the hashing was added

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
job_scripts as they are functionally the same and thus you have many jobs pointed to the same script, but less so for job_envs. -Paul Edmon- On 9/28/2023 1:55 PM, Ryan Novosielski wrote: Thank you; we’ll put in a feature request for improvements in that area, and also thanks for the warning? I thought of

Re: [slurm-users] enabling job script archival

2023-09-28 Thread Paul Edmon
of them if they get large is to 0 out the column in the table. You can ask SchedMD for the mysql command to do this as we had to do it here to our job_envs. -Paul Edmon- On 9/28/2023 1:40 PM, Davide DelVento wrote: In my current slurm installation, (recently upgraded to slurm v23.02.3), I only

Re: [slurm-users] Submitting hybrid OpenMPI and OpenMP Jobs

2023-09-22 Thread Paul Edmon
You might also try swapping to use srun instead of mpiexec, as that way slurm can give more direction as to what cores have been allocated to what. I've found in the past that mpiexec will ignore what Slurm tells it. -Paul Edmon- On 9/22/23 8:24 AM, Lambers, Martin wrote: Hello, for
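
A sketch of a hybrid batch script using srun for placement (hybrid_app is a placeholder):

    #!/bin/bash
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=4
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./hybrid_app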

Re: [slurm-users] Best way to accurately calculate the CPU usage of an account when using fairshare?

2023-05-08 Thread Paul Edmon
I would recommend standing up an instance of XDMod as it handles most of this for you in its summary reports. https://open.xdmod.org/10.0/index.html -Paul Edmon- On 5/3/23 2:05 PM, Joseph Francisco Guzman wrote: Good morning, We have at least one billed account right now, where the

Re: [slurm-users] changing the operational network in slurm setup

2023-03-14 Thread Paul Edmon
We do this for our Infiniband set up.  What we do is that we populate /etc/hosts with the hostname mapped to the IP we want Slurm to use.  This way you get IP traffic traversing the address you want between nodes while not having to mess with DNS. -Paul Edmon- On 3/14/2023 12:19 AM, Purvesh
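
An illustrative /etc/hosts fragment with hypothetical IPoIB addresses:

    10.10.0.101  node01
    10.10.0.102  node02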

Re: [slurm-users] linting slurm.conf files

2023-01-27 Thread Paul Edmon
We have a gitlab runner that fires up a docker container that basically starts up a mini scheduler (slurmdbd and slurmctld) to confirm that both can start. It covers most bases but we would like to see an official syntax checker (https://bugs.schedmd.com/show_bug.cgi?id=3435). -Paul Edmon

Re: [slurm-users] Maintaining slurm config files for test and production clusters

2023-01-04 Thread Paul Edmon
The symlink method for slurm.conf is what we do as well. We have a NFS mount from the slurm master that we host the slurm.conf on that we then symlink slurm.conf to that NFS share. -Paul Edmon- On 1/4/2023 1:53 PM, Brian Andrus wrote: One of the simple ways I have dealt with different
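
For example (the NFS path is illustrative):

    ln -s /nfs/slurm/etc/slurm.conf /etc/slurm/slurm.conf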

Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-14 Thread Paul Edmon
The seff utility (in slurm-contribs) also gives good summary info. You can also use --parsable to make things more manageable. -Paul Edmon- On 12/14/22 3:41 PM, Ross Dickson wrote: I wrote a simple Python script to transpose the output of sacct from a row into a column.  See if it meets your
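
Both in one breath (job ID and field list are illustrative):

    seff 12345678
    sacct -j 12345678 --parsable2 --format=JobID,Elapsed,TotalCPU,MaxRSS,State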

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Paul Edmon
Yeah, our spec is based off of their spec with our own additional features plugged in. -Paul Edmon- On 12/2/22 2:12 PM, David Thompson wrote: Hi Paul, thanks for passing that along. The error I saw was coming from the rpmbuild %check stage in the el9/fc38 builds, which your .spec file

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread Paul Edmon
Yup, here is the spec we use that works for CentOS 7, Rocky 8, and Alma 8. -Paul Edmon- On 12/2/22 12:21 PM, David Thompson wrote: Hi folks, I’m working on getting Slurm v22 RPMs built for our Alma 8 Slurm cluster. We would like to be able to use the sbatch --prefer option, which isn’t

Re: [slurm-users] slurm 22.05 "hash_k12" related upgrade issue

2022-10-24 Thread Paul Edmon
It only happens for versions on the 22.05 series prior to the latest release (22.05.5).  So the 21 version isn't impacted and you should be fine to upgrade from 21 to 22.05.5 and not see the hash_k12 issue.  If you upgrade to any prior minor version though you will hit this issue. -Paul

Re: [slurm-users] Ideal NFS exported StateSaveLocation size.

2022-10-24 Thread Paul Edmon
the HA setup for slurmctld will protect you from the server hosting the slurmctld getting hosed, not the entire rack going down or the datacenter going down. -Paul Edmon- On 10/24/2022 4:14 AM, Ole Holm Nielsen wrote: On 10/24/22 09:57, Diego Zuccato wrote: Il 24/10/2022 09:32, Ole Holm

Re: [slurm-users] Check consistency

2022-10-07 Thread Paul Edmon
The slurmctld log will print out if hosts are out of sync with the slurmctld slurm.conf.  That said it doesn't report on cgroup consistency changes like that.  It's possible that dialing up the verbosity on the slurmd logs may give that info but I haven't seen it in normal ope

Re: [slurm-users] Recommended amount of memory for the database server

2022-09-26 Thread Paul Edmon
our database is bigger than that. -Paul Edmon- On 9/25/22 5:18 PM, byron wrote: Hi Does anyone know what is the recommended amount of memory to give slurms mariadb database server? I seem to remember reading a simple estimate based on the size of certain tables (or something along those

Re: [slurm-users] Providing users with info on wait time vs. run time

2022-09-16 Thread Paul Edmon
We also call scontrol in our scripts (as little as we can manage) and we run at the scale of 1500 nodes.  It hasn't really caused many issues, but we try to limit it as much as we possibly can. -Paul Edmon- On 9/16/22 9:41 AM, Sebastian Potthoff wrote: Hi Hermann, So you both are ha

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Paul Edmon
But not to any 20 release.  There are two 20.x versions, 20.02 and 20.11, and there was a previous 19.05.  So two versions ahead of 18.08 would be 20.02, not 20.11. -Paul Edmon- On 9/8/22 12:14 PM, Wadud Miah wrote: The previous version was 18 and now I am trying to upgrade to 20, so I am well within 2 major

Re: [slurm-users] Upgrading SLURM from 18 to 20.11.9

2022-09-08 Thread Paul Edmon
Typically Slurm only supports upgrading across two major versions.  If you are on 18.08 you likely can only go to 20.02. Then after you upgrade to 20.02 you can go to 20.11 or 21.08. -Paul Edmon- On 9/8/22 11:38 AM, Wadud Miah wrote: hi Mick, I have checked that all the compute nodes

Re: [slurm-users] maridb version compatibility with Slurm version

2022-08-24 Thread Paul Edmon
I've regularly upgraded the mariadb version without upgrading the slurm version with no issue. We are currently running 10.6.7 for MariaDB on CentOS 7.9 with Slurm 22.05.2.  So long as you do the mysql_upgrade after the upgrade and have a backup just in case, you should be fine. -Paul
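
A sketch of that sequence (the database name assumes the default slurm_acct_db):

    mysqldump -u root -p --single-transaction slurm_acct_db > slurm_acct_db.sql   # backup first
    # ...upgrade the MariaDB packages...
    mysql_upgrade -u root -p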

Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Paul Edmon
True.  Though be aware that Slurm will by default map the environment from login nodes to compute.  That's the real thing that matters.  So as long as the environment is set up properly, filesystems other than the home directory do not need to be mounted on the login nodes. -Paul Edmon- On 8/2

Re: [slurm-users] Does the slurmctld node need access to Parallel File system and Runtime libraries of the SW in the Compute nodes.

2022-08-02 Thread Paul Edmon
No, the node running the slurmctld does not need access to any of the customer facing filesystems or home directories.  While all the login and client nodes do, the slurmctld does not. -Paul Edmon- On 8/2/2022 9:30 AM, Richard Chang wrote: Hi, I am new to SLURM, so please bear with me. I

Re: [slurm-users] SlurmDB Archive settings?

2022-07-18 Thread Paul Edmon
ter=6month PurgeTXNAfter=6month PurgeUsageAfter=6month -Paul Edmon- On 7/15/2022 2:08 AM, Ole Holm Nielsen wrote: Hi Paul, On 7/14/22 15:10, Paul Edmon wrote: We just use the Archive function built into slurm.  That has worked fine for us for the past 6 years. We keep 6 months of data in the acti
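
A fuller slurmdbd.conf sketch along those lines (retention periods are site policy, paths illustrative):

    ArchiveDir=/var/spool/slurm/archive
    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveEvents=yes
    PurgeEventAfter=6month
    PurgeJobAfter=6month
    PurgeStepAfter=6month
    PurgeTXNAfter=6month
    PurgeUsageAfter=6month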

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
22.05 so that it is more efficient but getting from here to there is the trick. For details see the bug report we filed: https://bugs.schedmd.com/show_bug.cgi?id=14514 -Paul Edmon- On 7/14/2022 2:34 PM, Timony, Mick wrote: What I can tell you is that we have never had a problem

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
cripts and envs. -Paul Edmon- On 7/14/2022 12:55 PM, Timony, Mick wrote: Hi Paul If you have 6 years worth of data and you want to prune down to 2 years, I recommend going month by month rather than doing it in one go.  When we initially started archiving data several years back

Re: [slurm-users] SlurmDB Archive settings?

2022-07-14 Thread Paul Edmon
archive one month at a time which allowed it to get done in a reasonable amount of time. The archived data can be pulled into a different slurm database, which is what we do for importing historic data into our XDMod instance. -Paul Edmon- On 7/13/2022 4:55 PM, Timony, Mick wrote: Hi Slurm

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
sorts of problems. -Paul Edmon- On 5/17/22 2:50 PM, Ole Holm Nielsen wrote: Hi, You can upgrade from 19.05 to 20.11 in one step (2 major releases), skipping 20.02.  When that is completed, it is recommended to upgrade again from 20.11 to 21.08.8 in order to get the current major version. The

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
I think it should be, but you should be able to run a test and find out. -Paul Edmon- On 5/17/22 12:13 PM, byron wrote: Sorry, I should have been clearer.   I understand that with regards to slurmd / slurmctld you can skip a major release without impacting running jobs etc.  My questions was

Re: [slurm-users] upgrading slurm to 20.11

2022-05-17 Thread Paul Edmon
s they can hand out if you are bootstrapping to a newer release. -Paul Edmon- On 5/17/22 11:42 AM, byron wrote: Thanks Brian for the speedy responce. Am I not correct in thinking that if I just go from 19.05 to 20.11 then there is the advantage that I can upgrade slurmd and slurmctld in one

Re: [slurm-users] High log rate on messages like "Node nodeXX has low real_memory size"

2022-05-12 Thread Paul Edmon
They fixed this in newer versions of Slurm.  We had the same issue with older versions, so we had to run with the config_override option on to keep the logs quiet.  They changed the way logging was done in the more recent releases and it's not as chatty. -Paul Edmon- On 5/12/22 7:35 AM, Per

Re: [slurm-users] Slurm 21.08.8-2 upgrade

2022-05-06 Thread Paul Edmon
We upgraded from 21.08.6 to 21.08.8-1 yesterday morning but overnight we saw the communications issues described by Tim W.  We upgraded to 21.08.8-2 this morning and that did the trick to resolve all the communications problems we were having. -Paul Edmon- On 5/6/2022 4:38 AM, Ole Holm

Re: [slurm-users] what is the elegant way to drain node from epilog with self-defined reason?

2022-05-03 Thread Paul Edmon
them when you absolutely have no other work around then you should be fine. -Paul Edmon- On 5/3/2022 3:46 AM, taleinterve...@sjtu.edu.cn wrote: Hi, all: We need to detect some problem at job end timepoint, so we write some detection script in slurm epilog, which should drain the node if chec
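
The usual drain command an epilog would issue (the reason text is illustrative):

    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="epilog check failed"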

Re: [slurm-users] non-historical scheduling

2022-04-12 Thread Paul Edmon
tting hard limits for each user. -Paul Edmon- On 4/12/2022 8:55 AM, Chagai Nota wrote: Hi Loris Thanks for your answer. I tried to configure it and I didn't get the desired results. This is my configuration: PriorityType=priority/multifactor PriorityDecayHalfLife=0 PriorityUsageRe

Re: [slurm-users] Limit partition to 1 job at a time

2022-03-22 Thread Paul Edmon
I think you could do this by clever use of a partition level QoS but I don't have an obvious way of doing this. -Paul Edmon- On 3/22/2022 11:40 AM, Russell Jones wrote: Hi all, For various reasons, we need to limit a partition to being able to run max 1 job at a time. Not 1 job per
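
One hedged sketch of that QoS route (names are placeholders; GrpJobs caps running jobs QoS-wide, so it only works if the QoS belongs to that one partition):

    sacctmgr -i add qos onejob
    sacctmgr -i modify qos onejob set GrpJobs=1
    # slurm.conf: PartitionName=single Nodes=node01 QOS=onejob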

Re: [slurm-users] MPI Jobs OOM-killed which weren't pre-21.08.5

2022-02-10 Thread Paul Edmon
older versions of MPI): https://github.com/SchedMD/slurm/blob/slurm-21-08-5-1/NEWS  What we've recommended to users who have hit this was to swap over to using srun instead of mpirun and the situation clears up. -Paul Edmon- On 2/10/2022 8:59 AM, Ward Poelmans wrote: Hi Paul, On 10/02/20

Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Paul Edmon
, the specified memory will only be unavailable for user allocations. These will restrict specific memory and cores for system use. This is probably the best way to go rather than spoofing your config. -Paul Edmon- On 1/7/2022 2:36 AM, Rémi Palancher wrote: Le jeudi 6 janvier 2022 à 22:39,

Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-07 Thread Paul Edmon
You can actually spoof the number of cores and RAM on a node by using the config_override option.  I've used that before for testing purposes.  Mind you core binding and other features like that will not work if you start spoofing the number of cores and ram, so use with caution. -Paul
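
An illustrative spoof for testing (values are placeholders; the option spelling config_overrides is assumed for current slurm.conf):

    SlurmdParameters=config_overrides
    NodeName=node01 CPUs=8 RealMemory=16000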

Re: [slurm-users] export qos

2021-12-17 Thread Paul Edmon
Just out of curiosity, is there a reason you aren't just doing a mysqldump of the extant DB and then reimporting it? I'm not aware of a way to dump just the qos settings for import other than: sacctmgr show qos -Paul Edmon- On 12/17/2021 10:24 AM, Williams, Jenny Avis wrote: Sac
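
The two dump routes mentioned, side by side (qos_table assumes the default schema name):

    sacctmgr -P show qos > qos_dump.txt
    mysqldump -u root -p slurm_acct_db qos_table > qos_table.sql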

Re: [slurm-users] slurmdbd full backup so the primary can be purged

2021-12-13 Thread Paul Edmon
ably ping SchedMD as to any limitations they are aware of.  Usually they are pretty good about being comprehensive in their docs so they would have probably mentioned it if there was one. -Paul Edmon- On 12/13/2021 5:07 AM, Loris Bennett wrote: Hi Paul, Am I right in assuming that there are g

Re: [slurm-users] slurmdbd full backup so the primary can be purged

2021-12-10 Thread Paul Edmon
is writing your sql into the database. So you could set up a full mirror and then read the old archives into that.  You just want to make sure that mirror has archiving/purging turned off so it won't rearchive the data you restored. -Paul Edmon- On 12/10/2021 1:28 PM, Ransom, Geoff
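
Reading old archives back into such a mirror is then (path is illustrative):

    sacctmgr archive load file=/var/spool/slurm/archive/job_archive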

Re: [slurm-users] Database Compression

2021-12-09 Thread Paul Edmon
e dump and reimport will take a while (for me it was about 4 hours start to finish on my test system). -Paul Edmon- On 12/2/2021 1:06 PM, Baer, Troy wrote: My site has just updated to Slurm 21.08 and we are looking at moving to the built-in job script capture capability, so I'm curiou

Re: [slurm-users] A Slurm topological scheduling question

2021-12-07 Thread Paul Edmon
also have all our internode IP comms going over our IB fabric and it works fine. -Paul Edmon- On 12/7/2021 11:05 AM, David Baker wrote: Hello, These days we have now enabled topology aware scheduling on our Slurm cluster. One part of the cluster consists of two racks of AMD compute no
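
An illustrative topology setup (switch and node names are placeholders):

    # slurm.conf
    TopologyPlugin=topology/tree
    # topology.conf
    SwitchName=rack1 Nodes=node[01-32]
    SwitchName=rack2 Nodes=node[33-64]
    SwitchName=core  Switches=rack[1-2]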

Re: [slurm-users] [EXT] Re: slurmdbd does not work

2021-12-03 Thread Paul Edmon
I would check that you have MariaDB-shared installed too on the host you build on, prior to your build.  They changed the way the packaging is done in MariaDB, and Slurm needs to detect the files in MariaDB-shared to actually trigger the configure to build the mysql libs. -Paul Edmon- On 12/3

Re: [slurm-users] Preferential scheduling on a subset of nodes

2021-12-01 Thread Paul Edmon
*PreemptMode* for this partition. It can be set to OFF to disable preemption and gang scheduling for this partition. See also *PriorityTier* and the above description of the cluster-wide *PreemptMode* parameter for further details. This is at least how we manage that. -Paul Edmon- On

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-25 Thread Paul Edmon
g all the jobs and scheduling this is some what mitigated, though jobs will still exit due to timeout. -Paul Edmon- On 10/25/2021 4:47 AM, Alan Orth wrote: Dear Jurgen and Paul, This is an interesting strategy, thanks for sharing. So if I read the scontrol man page correctly, `scontrol su

Re: [slurm-users] Suspending jobs for file system maintenance

2021-10-19 Thread Paul Edmon
Yup, we follow the same process for when we do Slurm upgrades, this looks analogous to our process. -Paul Edmon- On 10/19/2021 3:06 PM, Juergen Salk wrote: Dear all, we are planning to perform some maintenance work on our Lustre file system which may or may not harm running jobs. Although
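
A sketch of the suspend/resume bracket around the maintenance window:

    squeue -h -t R -o %A | xargs -r -n1 scontrol suspend   # pause running jobs
    # ...file system maintenance...
    squeue -h -t S -o %A | xargs -r -n1 scontrol resume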

Re: [slurm-users] slurm.conf syntax checker?

2021-10-13 Thread Paul Edmon
then have it reject any changes that cause failure.  It's not perfect but it works.  A real syntax checker would be better. -Paul Edmon- On 10/12/2021 4:08 PM, bbenede...@goodyear.com wrote: Is there any sort of syntax checker that we could run our slurm.conf file through before com

[slurm-users] Using Nice to Break Ties

2021-09-14 Thread Paul Edmon
ernal to an account/group/lab?  What solutions have people used for this? -Paul Edmon-

Re: [slurm-users] User CPU limit across partitions?

2021-08-03 Thread Paul Edmon
I think you can accomplish this by setting a Partition QoS and pointing all the partitions there at the same QoS.  I believe that would force them to share the same pool. That said I don't know if that would work properly; it's worth a test.  That is my first guess though. -Paul Edmon- O

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
it's the sum total of all the TRES a Group could run in a partition at one time. -Paul Edmon- On 8/2/2021 12:05 PM, Adrian Sevcenco wrote: On 8/2/21 6:26 PM, Paul Edmon wrote: Probably more like MaxTRESPerJob=cpu=8 i see, thanks!! i'm still searching for the definition of GrpTRES :) T

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
Probably more like MaxTRESPerJob=cpu=8 You would need to specify how much TRES you need for each job in the normal tres format. -Paul Edmon- On 8/2/2021 11:24 AM, Adrian Sevcenco wrote: On 8/2/21 5:44 PM, Paul Edmon wrote: You can set up a Partition based QoS that can set this limit
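
Applied with sacctmgr (the QoS name is a placeholder):

    sacctmgr -i modify qos normal set MaxTRESPerJob=cpu=8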

Re: [slurm-users] declare availability of up to 8 cores//job

2021-08-02 Thread Paul Edmon
You can set up a Partition based QoS that can set this limit: https://slurm.schedmd.com/resource_limits.html  See the MaxTRESPerJob limit. -Paul Edmon- On 8/2/2021 10:40 AM, Adrian Sevcenco wrote: Hi! Is there a way to declare that jobs can request up to 8 cores? Or is it allowed by default

Re: [slurm-users] Can I get the original sbatch command, after the fact?

2021-07-16 Thread Paul Edmon
Not in the current version of Slurm.  In the next major version long term storage of job scripts will be available. -Paul Edmon- On 7/16/2021 2:16 PM, David Henkemeyer wrote: If I execute a bunch of sbatch commands, can I use sacct (or something else) to show me the original sbatch command

Re: [slurm-users] MinJobAge

2021-07-06 Thread Paul Edmon
conditions, the minimum non-zero value for *MinJobAge* recommended is 2. From my experience this does work.  We've been running with MinJobAge=600 for years without any problems to my knowledge. -Paul Edmon- On 7/6/2021 8:59 AM, Emre Brookes wrote:   Brian Andrus Nov 23, 2020

Re: [slurm-users] Long term archiving

2021-06-28 Thread Paul Edmon
We keep 6 months in our active database and then we archive and purge anything older than that.  The archive data itself is available for reimport and historical investigation.  We've done this when importing historical data into XDMod. -Paul Edmon- On 6/28/2021 10:43 AM, Yair Yarom

Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Paul Edmon
major version upgrades than minors. So if you are doing a minor version upgrade it's likely fine to do live.  For major versions I would recommend at least pausing all the jobs. -Paul Edmon- On 5/26/2021 2:48 PM, Ole Holm Nielsen wrote: On 26-05-2021 20:23, Will Dennis wrote: About to embark on my

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Paul Edmon
XDMod can give these sorts of stats.  I also have some diamond collectors we use in concert with grafana to pull data and plot it which is useful for seeing large scale usage trends: https://github.com/fasrc/slurm-diamond-collector -Paul Edmon- On 5/13/2021 6:08 PM, Sid Young wrote: Hi All

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Paul Edmon
Yup, we use XDMod for this sort of data as well. -Paul Edmon- On 5/11/2021 8:52 AM, Renfro, Michael wrote: XDMoD [1] is useful for this, but it’s not a simple script. It does have some user-accessible APIs if you want some report automation. I’m using that to create a lightning-talk-style

Re: [slurm-users] Testing Lua job submit plugins

2021-05-06 Thread Paul Edmon
We go the route of having a test cluster and vetting our lua scripts there before putting them in the production environment. -Paul Edmon- On 5/6/2021 1:23 PM, Renfro, Michael wrote: I’ve used the structure at https://gist.github.com/mikerenfro/92d70562f9bb3f721ad1b221a1356de5 <ht
