Re: [slurm-users] Different max number of jobs in individual and array jobs

2021-06-07 Thread Sebastian T Smith
Hi, This doesn't solve your problem but might be an option: In similar cases, we instruct our users to create `n` Jobs of `m` Steps. Some experimentation may be required to determine the number of Steps to maximize Job run time without hitting your limits. Our max limit is 14 days, so this
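The n-jobs-of-m-steps pattern described above can be sketched as a batch script; this is a minimal illustration, not the poster's actual script, and `./work`, the step count, and the job name are placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=steps-demo
#SBATCH --ntasks=1
#SBATCH --time=14-00:00:00

# Run m steps serially inside a single job allocation;
# ./work is a placeholder for the real per-step workload.
m=10
for i in $(seq 1 "$m"); do
    srun --ntasks=1 ./work "$i"
done
```

Submitting `n` copies of this script (e.g. `sbatch` in a loop) yields `n` jobs of `m` steps each; tune `m` so each job stays under the site's run-time limit.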

Re: [slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-24 Thread Sebastian T Smith
http://rc.unr.edu/> From: slurm-users on behalf of Yair Yarom Sent: Monday, November 23, 2020 4:21 AM To: Slurm User Community List Subject: Re: [slurm-users] NoDecay on accounts (or on GrpTRESMins in general) On Fri, Nov 20, 2020 at 12:11 AM Sebastian T Smith mailto:stsm.


Re: [slurm-users] NoDecay on accounts (or on GrpTRESMins in general)

2020-11-19 Thread Sebastian T Smith
Hi, We're setting GrpTRESMins on the account association and have NoDecay QOS's for different user classes. All user associations with a GrpTRESMins-limited account are assigned a NoDecay QOS. I'm not sure if it's a better approach... but it's an option. Our GrpTRESMins limits are applied
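A minimal `sacctmgr` sketch of the setup described above (GrpTRESMins on the account, NoDecay QOS on the user associations); the QOS name, account name, and limit value are hypothetical:

```shell
# Create a QOS whose accrued usage never decays:
sacctmgr add qos nodecay-qos Flags=NoDecay

# Limit the account's aggregate CPU-minutes (illustrative value):
sacctmgr modify account nodecay-acct set GrpTRESMins=cpu=1000000

# Attach the NoDecay QOS to user associations under that account:
sacctmgr modify user where account=nodecay-acct set qos=nodecay-qos
```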

Re: [slurm-users] can't lengthen my jobs log

2020-11-12 Thread Sebastian T Smith
Hi John, Have you tried specifying a start time? The default is 00:00:00 of the current day (depending on other options). Example: sacct -S 2020-11-01T00:00:00 Our accounting database retains all job data from the epoch of our system. Best, Sebastian -- [University of Nevada,

Re: [slurm-users] Using hyperthreaded processors

2020-11-04 Thread Sebastian T Smith
Hi, We have Hyper-threading/SMT enabled on our cluster. It's challenging to fully utilize threads, as Brian suggests. We have a few workloads that benefit from it being enabled, but they represent a minority of our overall workload. We use SelectTypeParameters=CR_Core_Memory. This
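For context, a slurm.conf fragment showing an SMT-enabled node definition alongside the `CR_Core_Memory` setting mentioned above; the node names, core counts, and memory are illustrative, not the poster's configuration:

```shell
# slurm.conf fragment (illustrative values)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# SMT/Hyper-threading visible to Slurm: 2 sockets x 16 cores x 2 threads
NodeName=cn[01-10] Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=192000
```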

[slurm-users] Help decoding step ID in slurmd log

2020-10-23 Thread Sebastian T Smith
Hi, I'm performing diagnostics on an application that isn't terminating correctly. While reviewing slurmd logs I found a couple of lines I need help decoding (logs are normal): Line 45: [2020-10-23T14:30:22.610] [2547451.batch] Sent signal 18 to 2547451.4294967294 Line 46:

Re: [slurm-users] Simple free for all cluster

2020-10-06 Thread Sebastian T Smith
Our MaxTime and DefaultTime are 14-days. Setting a high DefaultTime was a convenience to our users (and the support team) but has evolved into a mistake because it impacts backfill. Under high load we'll see small backfill jobs take over because the estimated start and end time of

Re: [slurm-users] Quickly throttling/limiting a specific user's jobs

2020-09-23 Thread Sebastian T Smith
I've used Paul's `MaxJobs` suggestion in emergencies with success. +1 vote. We've encountered RPC timeouts and have been able to tune the `sched_max_job_start` (decrease) and `sched_min_interval` (increase) options of `SchedulerParameters` to reduce/eliminate timeouts during high job flux.
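The two `SchedulerParameters` knobs mentioned above can be set in slurm.conf; the values here are illustrative starting points, not the poster's tuned numbers (note `sched_min_interval` is in microseconds):

```shell
# slurm.conf fragment (illustrative values)
# sched_max_job_start: cap jobs started per scheduling cycle (decrease)
# sched_min_interval:  minimum microseconds between cycles (increase)
SchedulerParameters=sched_max_job_start=50,sched_min_interval=2000000
```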

Re: [slurm-users] CR_Core_Memory behavior

2020-08-25 Thread Sebastian T Smith
Hi, I agree that this may be a node configuration issue. It might also be caused by your resource request. Can you provide your node configuration and an example submission script? - Sebastian -- [University of Nevada, Reno] Sebastian Smith High-Performance Computing

Re: [slurm-users] Tuning MaxJobs and MaxJobsSubmit per user and for the whole cluster?

2020-08-10 Thread Sebastian T Smith
My rule of thumb for our cluster is 1,024 jobs/node. Our nodes have 32 cores, so we're 32x core count (converting to Paul's units). We have 120 nodes with a maximum of 122,880 jobs. At a high-level, nodes are allocated to different partitions and each partition is allocated a maximum number
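The arithmetic behind the rule of thumb above checks out:

```shell
jobs_per_node=1024
cores_per_node=32
nodes=120

# 1,024 jobs/node expressed as a multiple of core count:
echo $((jobs_per_node / cores_per_node))   # 32

# Cluster-wide job ceiling:
echo $((nodes * jobs_per_node))            # 122880
```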

Re: [slurm-users] how to know the real utilization of a node when oversubscribe is set to FORCE (Mark Hahn)

2020-07-17 Thread Sebastian T Smith
Hi, I think the `Elapsed` or `ElapsedRaw` field is what you're looking for. Selected example from my system: $ sacct -X --allusers --format="AllocCPUS,Elapsed,ElapsedRaw,CPUTime,CPUTimeRAW" AllocCPUS    Elapsed    ElapsedRaw    CPUTime    CPUTimeRAW

Re: [slurm-users] Reset Fair-share tree account values

2020-07-16 Thread Sebastian T Smith
aware of that one. Hopefully they will support the ability to reset to other values in the future, as that would be a handy ability. -Paul Edmon- On 7/16/2020 12:56 PM, Sebastian T Smith wrote: `sacctmgr` can be used to reset the accrued RawUsage value. Example usage: # sacctmgr modify user where Acc

Re: [slurm-users] Reset Fair-share tree account values

2020-07-16 Thread Sebastian T Smith
`sacctmgr` can be used to reset the accrued RawUsage value. Example usage: # sacctmgr modify user where Account= set RawUsage=0 Review the `sacctmgr` documentation for more details: https://slurm.schedmd.com/sacctmgr.html Best, Sebastian -- [University of Nevada,
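The reset command above elides the account value; a sketch with labeled placeholders (ACCOUNT_NAME and USER_NAME are hypothetical, not from the original message):

```shell
# Reset accrued fair-share usage for one user association.
# ACCOUNT_NAME / USER_NAME are placeholders for real association names.
sacctmgr modify user USER_NAME where Account=ACCOUNT_NAME set RawUsage=0
```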

Re: [slurm-users] Allow certain users to run over partition limit

2020-07-08 Thread Sebastian T Smith
ding the hierarchy: Root/Cluster association, Partition limit, None. Where in this list do reservations fall? Do reservations override all of these if they are set to exceed the resources imposed by the partition configuration? Thanks! On 7/7/20, 6:02 PM, "slurm-users on behalf of Seb

Re: [slurm-users] Allow certain users to run over partition limit

2020-07-07 Thread Sebastian T Smith
Hi, We use Job QOS and Resource Reservations for this purpose. A QOS is a good option for a "permanent" change to a user's resource limits. We use reservations much as you're currently using partitions: to "temporarily" provide a resource boost without the complexities of re-partitioning