Re: [slurm-users] Slurm Node Unresponsive

2020-09-08 Thread Doug Meyer
Hi, Does scontrol ping from the node show the slurm server up? If so, munge is fine. Betting it is not this, but it is such an easy check. Ensure you have the same slurm.conf on master and client. The fact that you can restart slurmd and all is well is really odd. Suggests slurm is coming up too so
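
For reference, a minimal sketch of those two checks, run on the compute node (the slurm.conf path may differ on your install):

    # confirms slurmctld is reachable and munge auth is working
    scontrol ping
    # compare this checksum against the one on the master
    md5sum /etc/slurm/slurm.conf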

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-13 Thread Doug Meyer
Probably need to define the problem a bit better. sreport has very good functionality; see the bottom of the man page for examples. You can group orgs into accounting groups to map like usage, and use wckeys to provide accounting for specific users' billing groups. Configure TRES billing to get a charg
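
A sketch of the kind of sreport queries those man page examples cover (account name and dates are hypothetical):

    # usage broken down by account and user for May
    sreport cluster AccountUtilizationByUser account=physics start=05/01/21 end=06/01/21
    # usage grouped by wckey, useful for billing groups
    sreport cluster WCKeyUtilizationByUser start=05/01/21 end=06/01/21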

Re: [slurm-users] [ADMIN-QUESTION] Issue about usage report by user

2021-07-09 Thread Doug Meyer
Please check the output summary at the top of the page. I believe you will see you are missing the last day of each month. start=05/01 end=06/01 will get you May 1st 00:00:01 through May 31st 23:59:59. Doug On Fri, Jul 9, 2021 at 3:45 PM Jorge Ivan Diaz wrote: > Hi everyone, > > I would like to ask about s
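
In sreport terms (a sketch; two-digit years assumed):

    # end= is effectively exclusive, so to cover all of May use the first of June
    sreport cluster utilization start=05/01/21 end=06/01/21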

Re: [slurm-users] [External] Re: srun : Communication connection failure

2022-01-21 Thread Doug Meyer
Hi, Did you recently add nodes? We have seen that when we add nodes past the TreeWidth count, the most recently added nodes will lose communication (asterisk next to the node name in sinfo). We have to ensure the TreeWidth declaration in slurm.conf matches or exceeds the number of nodes. Doug On
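
A sketch of that declaration, assuming a 500-node cluster (the value just needs to match or exceed the node count, per the post):

    # slurm.conf: fan-out of the slurmd communication tree
    TreeWidth=500
    # then push the change out
    scontrol reconfigure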

Re: [slurm-users] [External] Re: srun : Communication connection failure

2022-01-25 Thread Doug Meyer
y network issue. > > Best, > Durai Arasan > MPI Tübingen > > On Fri, Jan 21, 2022 at 2:15 PM Doug Meyer wrote: > >> Hi, >> Did you recently add nodes? We have seen that when we add nodes past the >> treewidth count the most recently added nodes will lose comm

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-23 Thread Doug Meyer
Hi, Did you configure your node definition with the output of slurmd -C? Ignore boards. Don't know if it is still true, but several years ago declaring boards made things difficult. Also, if you have hyperthreaded AMD or Intel processors, your partition declaration should be oversubscribe:2. Start w
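
A sketch of that workflow (node and partition names hypothetical):

    # print the hardware exactly as slurmd sees it; paste into slurm.conf minus Boards
    slurmd -C
    # partition allowing two jobs per core on hyperthreaded CPUs
    PartitionName=batch Nodes=node01 OverSubscribe=YES:2 State=UP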

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Doug Meyer
s. It is a little more work up front but far easier than correcting scripts later. Doug On Thu, Feb 23, 2023 at 9:41 PM Analabha Roy wrote: > Howdy, and thanks for the warm welcome, > > On Fri, 24 Feb 2023 at 07:31, Doug Meyer wrote: > >> Hi, >> >> Did you con

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-25 Thread Doug Meyer
anks for your considered response. Couple of questions linger... > > On Sat, 25 Feb 2023 at 21:46, Doug Meyer wrote: > >> Hi, >> >> Declaring cores=64 will absolutely work but if you start running MPI >> you'll want a more detailed config description. The
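
A sketch of the more detailed form, assuming a hypothetical 2-socket, 16-core-per-socket, hyperthreaded node (64 CPUs total):

    NodeName=node01 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=256000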

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-26 Thread Doug Meyer
up all 64 cores. > > Then I logged in as another user and launched the same job with sbatch -n > 2. To my dismay, it started to run! > > Shouldn't slurm have figured out that all 64 cores were occupied and > queued the -n 2 job to pending? > > AR > > > On

Re: [slurm-users] Chaining srun commands

2023-02-28 Thread Doug Meyer
Hi, I read the problem differently. Might also want to look at heterogeneous jobs. Slurm Workload Manager - Heterogeneous Job Support (schedmd.com) Doug On Tue, Feb 28, 2023 at 3:27 PM Jake Jellinek wrote: > Hi Brian > > Thanks for your resp
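
A minimal heterogeneous-job sketch (program names hypothetical); each ":"-separated block gets its own resource request:

    # one control task plus 64 worker tasks across two nodes
    srun -N1 -n1 ./controller : -N2 -n64 ./worker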

Re: [slurm-users] Single Node cluster. How to manage oversubscribing

2023-02-28 Thread Doug Meyer
onfig from the last time it started. Doug On Sun, Feb 26, 2023 at 10:25 PM Analabha Roy wrote: > Hey, > > > Thanks for sticking with this. > > On Sun, 26 Feb 2023 at 23:43, Doug Meyer wrote: > >> Hi, >> >> Suggest removing "boards=1". The docs

Re: [slurm-users] Job killed for unknown reason

2023-04-04 Thread Doug Meyer
Hi, I don't think I have ever seen a sig 9 that wasn't a user. Is it possible you have folks with slurm coordinator/administrator rights who may be killing jobs or running a cleanup script? The only other thing I can think of is the user closing their remote session before the srun completes. I can't
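
If it helps to rule things out, sacct shows how the job ended (job id hypothetical):

    # ExitCode is return:signal, so a kill -9 shows up as 0:9
    sacct -j 123456 --format=JobID,State,ExitCode,Elapsed,End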

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Doug Meyer
Could also review the node log in /var/log/slurm/. Often sinfo -lR will tell you the cause, for example memory not matching the config. Doug On Thu, May 25, 2023 at 5:32 AM Ole Holm Nielsen wrote: > On 5/25/23 13:59, Roger Mason wrote: > > slurm 20.02.7 on FreeBSD. > > Uh, that's old! > > > I hav
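
A sketch of that check, plus the usual way to clear the drain once the cause is fixed (node name hypothetical):

    # list drained/down nodes with the reason slurm recorded
    sinfo -lR
    # after fixing the cause, return the node to service
    scontrol update nodename=node042 state=resume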

Re: [slurm-users] Temporary Stop User Submission

2023-05-25 Thread Doug Meyer
I always like: sacctmgr update user where user=<username> set grpcpus=0 On Thu, May 25, 2023, 4:19 PM Markuske, William wrote: > Hello, > > I have a badly behaving user that I need to speak with and want to > temporarily disable their ability to submit jobs. I know I can change their > account settings to
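
Spelled out (username hypothetical; the man page documents the "modify" verb, and -1 clears a limit):

    # block new job starts by zeroing the user's CPU limit
    sacctmgr modify user where name=baduser set GrpCPUs=0
    # restore when the chat is over
    sacctmgr modify user where name=baduser set GrpCPUs=-1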

Re: [slurm-users] How to automatically release jobs that failed with "launch failed requeued held"

2019-01-22 Thread Doug Meyer
scontrol release <jobid> Not sure if the system can be set to automatically release jobs, but I would not want it to, as a faulty system would go into a loop: start, fail, start. Doug On Tue, Jan 22, 2019 at 10:45 AM Roger Moye wrote: > This morning we had several jobs fail with “launch fa
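
A sketch for finding and releasing the held jobs (job id hypothetical):

    # list pending jobs with the reason slurm recorded
    squeue -t PENDING -o "%.10i %.20r"
    # release a held job by id
    scontrol release 123456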

Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-01-31 Thread Doug Meyer
Perhaps fire off the srun with -vvv to get maximum verbose messages as srun works through the job. Doug On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs wrote: > Hi All, > > Just checking to see if this sounds familiar to anyone. > > Environment: > - CentOS 7.5 x86_64 > - Slurm 17.11.10 (but this also happ
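
For example (program name hypothetical):

    # each extra v raises verbosity
    srun -vvv -N1 -n1 ./my_app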

[slurm-users] Source of SIGTERM

2019-03-07 Thread Doug Meyer
source of the SIGTERM. Thank you, Doug Meyer

Re: [slurm-users] How do I impose a limit the memory requested by a job?

2019-03-14 Thread Doug Meyer
We also run diskless. In slurm.conf we round down on memory so slurm does not have the total budget to work with, and use a default memory-per-job value reflecting declared memory / # of threads per node. If users don't declare a memory limit we are fine. If they declare more we are fine too. Mostly
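
A sketch of that scheme for a hypothetical 128 GB, 64-thread diskless node (numbers illustrative):

    # declare less than physical RAM to leave headroom for the OS image
    NodeName=node01 CPUs=64 RealMemory=120000
    # default per-CPU memory = declared memory / threads (120000 / 64)
    DefMemPerCPU=1875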

Re: [slurm-users] How should I do so that jobs are allocated to the thread and not to the core ?

2019-05-02 Thread Doug Meyer
Had the same problem in slurm 15, not sure if it affects newer versions. Don’t use the expanded node definition: NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=128000 Use the simpler: NodeName=DEFAULT Cores=36 RealMemory=128000 Slurm will us
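
Side by side as slurm.conf lines (a sketch with the values from the post; note that current slurm.conf spells the terse core count with the CPUs keyword):

    # expanded form that caused trouble on slurm 15
    NodeName=DEFAULT Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=128000
    # simpler form
    NodeName=DEFAULT CPUs=36 RealMemory=128000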

Re: [slurm-users] Forcibly end "zombie" jobs?

2020-01-08 Thread Doug Meyer
Totally agree with the solution. We were running slurm 15.xx for some time and the manual job edit was miserable. The command noted below was added in 16. Have used it often and been pleased. Doug On Wed, Jan 8, 2020 at 7:40 AM Douglas Jacobsen wrote: > Try running `sacctmgr show runawayjobs`;
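
The command in question takes no arguments and prompts before changing anything:

    # list orphaned jobs in the accounting database and offer to fix them
    sacctmgr show runawayjobs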

Re: [slurm-users] one job at a time - how to set?

2020-04-29 Thread Doug Meyer
Change the node definition in slurm.conf for that one node to 1 CPU. Doug Meyer From: slurm-users On Behalf Of Rutger Vos Sent: Wednesday, April 29, 2020 1:20 PM To: Slurm User Community List Subject: [External] Re: [slurm-users] one job at a time - how to set? Hi Michael, thanks very much for
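
A sketch of that change (node name and memory hypothetical):

    # with one CPU declared, slurm schedules only one single-core job at a time
    NodeName=node01 CPUs=1 RealMemory=64000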

[slurm-users] Re: Incorrect hyperthreading with Slurm 23.11

2024-03-27 Thread Doug Meyer via slurm-users
Hi, Please review the slurm.conf oversubscribe settings for CPU cores, and set jobs to use oversubscribe in sbatch. I don't know if it is still true, but delete boards=1 from the node definition. It used to mess up the math. Doug On Wed, Mar 27, 2024, 7:09 AM Guillaume COCHARD via slurm-
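
On the job side, the flag looks like this (script contents hypothetical):

    #SBATCH --oversubscribe
    # the partition must also allow it, e.g. OverSubscribe=YES in slurm.conf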