Re: [slurm-users] slurm cluster error - bad node index

2023-10-27 Thread Patrick Goetz
Hi - Very delayed response to this, as I'm working my way through a backlog of slurm-user posts. If this error is intermittent, it's likely a hardware issue. Recently I ran into a problem where a host with 8 GPUs was spontaneously rebooting a couple of minutes after a user would start an 8

Re: [slurm-users] Nodes stay drained no matter what I do

2023-08-25 Thread Patrick Goetz
t GPUs as '--gres=gpu:a100:X'. Tina On 24/08/2023 23:17, Patrick Goetz wrote: Hi Mick - Thanks for these suggestions.  I read over both release notes, but didn't find anything helpful. Note that I didn't include gres.conf in my original post.  That would be this:   
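For reference, a typed-GPU setup spans both files; a minimal sketch, with node name, device paths, and counts that are illustrative rather than taken from the thread:
    # gres.conf on the GPU node
    NodeName=gpunode01 Name=gpu Type=a100 File=/dev/nvidia[0-3]
    # matching entry in slurm.conf
    NodeName=gpunode01 Gres=gpu:a100:4 CPUs=64 RealMemory=515000 State=UNKNOWN
    # a job then requests typed GPUs with
    sbatch --gres=gpu:a100:2 job.sh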

Re: [slurm-users] Nodes stay drained no matter what I do

2023-08-24 Thread Patrick Goetz
Name in your gres.conf? Kind regards -- Mick Timony Senior DevOps Engineer Harvard Medical School -- -------- *From:* slurm-users on behalf of Patrick Goetz *Sent:* Thursday, August 24, 2023 11:27 AM *To:* Slurm User Commun

Re: [slurm-users] Nodes stay drained no matter what I do

2023-08-24 Thread Patrick Goetz
de to whatever slurmd -C says, or set config_overrides in slurm.conf Rob *From:* slurm-users on behalf of Patrick Goetz *Sent:* Thursday, August 24, 2023 11:27 AM *To:* Slurm User Community List *Subject:* [slurm-us
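A sketch of that workflow, with a made-up node name (config_overrides is the SlurmdParameters option in recent releases; older configs used FastSchedule instead):
    # on the compute node, see what the hardware actually reports
    slurmd -C
    # NodeName=node01 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64223
    # copy those values into the NodeName line in slurm.conf, or relax the check with
    SlurmdParameters=config_overrides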

[slurm-users] Nodes stay drained no matter what I do

2023-08-24 Thread Patrick Goetz
Master/Nodes: Ubuntu 20.04, Slurm 19.05.5 (as packaged by Debian) This is an upgrade from a working Ubuntu 18.04/Slurm 17.x system where I re-used the original slurm.conf (fearing this might cause issues). The hardware is the same. The Master and nodes all use the same slurm.conf, gres.con
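For anyone hitting the same symptom, the first diagnostic steps usually look like this (node name illustrative):
    sinfo -R                                        # list drained/down nodes with their Reason
    scontrol show node node01                       # full node detail, including why it drained
    scontrol update NodeName=node01 State=RESUME    # clear the drain once the cause is fixed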

[slurm-users] scanceling a job puts the node in a draining state

2023-04-25 Thread Patrick Goetz
Hi - This was a known bug: https://bugs.schedmd.com/show_bug.cgi?id=3941 However, the bug report says this was fixed in version 17.02.7. The problem is we're running version 17.11.2, but appear to still have this bug going on: [2023-04-18T17:09:42.482] _slurm_rpc_kill_job: REQUEST_KILL_JOB

[slurm-users] gres.conf and select/cons_res plugin

2022-09-13 Thread Patrick Goetz
I think reading the documentation is making me more confused; maybe this has to do with version changes. My current slurm cluster is using version 17.x Looking at the man page for gres.conf (https://slurm.schedmd.com/gres.conf.html) I see this: NOTE: Slurm support for gres/[mps|shard] requ

[slurm-users] resource selection algorithm cons_tres ?

2022-09-13 Thread Patrick Goetz
I'm working on an inherited Slurm cluster, and was reading through the Slurm documentation when I found this in the Easy Configurator section (https://slurm.schedmd.com/configurator.easy.html) - cons_tres: Allocate individual processors, memory, GPUs, and other trackable resources - Cons_re
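For context, switching between the two plugins is a one-line change in slurm.conf; a sketch (the parameters value shown is just one common choice):
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory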

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Patrick Goetz
On 5/25/21 11:07 AM, Loris Bennett wrote: PS Am I wrong to be surprised that this is something one needs to roll oneself? It seems to me that most clusters would want to implement something similar. Is that incorrect? If not, are people doing something else? Or did some vendor setting things

Re: [slurm-users] R jobs crashing when run in parallel

2021-03-29 Thread Patrick Goetz
Could this be a function of the R script you're trying to run, or are you saying you get this error running the same script which works at other times? On 3/29/21 7:47 AM, Simon Andrews wrote: I've got a weird problem on our slurm cluster.  If I submit lots of R jobs to the queue then as soon

Re: [slurm-users] salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Patrick Goetz
That sounds like a Linux issue. You probably need to raise the maximum number of open file descriptors somewhere. Maybe start here: https://rtcamp.com/tutorials/linux/increase-open-files-limit/ On 2/2/21 11:50 AM, Prentice Bisbal wrote: Has anyone seen this error message before? A user just reported it
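A sketch of the places those limits usually live (the numbers are arbitrary examples):
    ulimit -n; ulimit -Hn              # current soft/hard limits for this shell
    # system-wide, in /etc/security/limits.conf:
    *  soft  nofile  65536
    *  hard  nofile  65536
    # or, for a systemd-managed daemon such as slurmctld, in an override unit:
    [Service]
    LimitNOFILE=65536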

Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-24 Thread Patrick Goetz
This bug report appears to address the issue you're seeing: https://bugs.schedmd.com/show_bug.cgi?id=5868 On 2/24/20 4:46 AM, Pär Lundö wrote: Dear all, I started testing and evaluating Slurm roughly a year ago and used it successfully with MPI-programs. I have now identified that I need

Re: [slurm-users] can't create memory group (cgroup)

2018-09-10 Thread Patrick Goetz
On 9/8/18 5:11 AM, John Hearns wrote: Not an answer to your question - a good diagnostic for cgroups is the utility 'lscgroups' Where does one find this utility?

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-08-22 Thread Patrick Goetz
On 08/22/2018 10:58 AM, Kilian Cavalotti wrote: My guess is that you're experiencing first-hand the awesomeness of systemd. Yes, systemd uses cgroups. I'm trying to understand if the Slurm use of cgroups is incompatible with systemd, or if there is another way to resolve this issue? Look

Re: [slurm-users] Controller / backup controller q's

2018-05-29 Thread Patrick Goetz
On 05/25/2018 11:19 AM, Will Dennis wrote: Not yet time for us... There are problems with U18.04 that render it unusable for our environment. What problems have you run into with 18.04?

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-17 Thread Patrick Goetz
Does your SMS have a dedicated interface for node traffic? On 05/16/2018 04:00 PM, Sean Caron wrote: I see some chatter on 6818/TCP from the compute node to the SLURM controller, and from the SLURM controller to the compute node. The policy is to permit all packets inbound from SLURM controlle
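For context, the stock ports are 6817/TCP for slurmctld and 6818/TCP for slurmd, so a minimal iptables allowance on whichever host runs the corresponding daemon might look like this (interface and subnet are placeholders):
    iptables -A INPUT -i eth1 -s 10.0.0.0/24 -p tcp --dport 6817 -j ACCEPT   # inbound to slurmctld (controller)
    iptables -A INPUT -i eth1 -s 10.0.0.0/24 -p tcp --dport 6818 -j ACCEPT   # inbound to slurmd (compute nodes)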

Re: [slurm-users] Built in X11 forwarding in 17.11 won't work on local displays

2018-05-10 Thread Patrick Goetz
On 05/09/2018 04:14 PM, Nathan Harper wrote: Yep, exactly the same issue. Our dirty workaround is to ssh -X back into the same host and it will work. Hi - Since I'm having this problem too, can you elaborate? You're ssh -X ing into a machine and then ssh -X ing back to the original host?

Re: [slurm-users] sacct: error

2018-05-04 Thread Patrick Goetz
I concur with this. Make sure your nodes are in the /etc/hosts file on the SMS. Also, if you name them by base + numerical sequence, you can configure them with a single line in Slurm (using the example below): NodeName=radonc[01-04] CPUs=32 RealMemory=64402 Sockets=2 CoresPerSocket=8 Thread
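The matching /etc/hosts entries on the SMS would then be along these lines (addresses are placeholders):
    10.0.0.11  radonc01
    10.0.0.12  radonc02
    10.0.0.13  radonc03
    10.0.0.14  radonc04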

Re: [slurm-users] SLURM on Ubuntu 18.04

2018-05-03 Thread Patrick Goetz
Why wouldn't slurm.conf just go into /etc/slurm? On 05/03/2018 10:33 AM, Raymond Wan wrote: Hi Eric, On Thu, May 3, 2018 at 11:21 PM, Eric F. Alemany wrote: I will follow your advice. It doesn't hurt to try right (?) Thank you for your quick reply No, it doesn't hurt to try. If this was

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-26 Thread Patrick Goetz
I don't think the problem Chris is referring to (a SQL injection attack) is going to apply to you because you're way too small to need to worry about Slurm accounting, but if it is a concern, install the distro packages; confirm that things are roughly working and then just take note of how thi

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-26 Thread Patrick Goetz
Hi Chris - He has 4 nodes and one master. I'm pretty sure he's not going to be using slurmdbd? Of course something to keep in mind if things work out so well that his organization is commanding him to order an additional thousand nodes in 6 months. On 04/25/2018 07:03 PM, Christopher Samu

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-25 Thread Patrick Goetz
Hi Eric - Did you follow my suggestion of -- on 18.04, mind you; the packages on 16.04 are too old -- - Install the slurmctld package on the SMS (the master) - Install the slurmd package on the nodes? You'll still need to do some configuration, but my guess is this will pull in the neces
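On Ubuntu 18.04 that boils down to roughly the following (standard Debian/Ubuntu package names; munge and the client tools should come in as dependencies):
    sudo apt install slurmctld      # on the master (SMS)
    sudo apt install slurmd         # on each compute node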

Re: [slurm-users] FSU & Slurm

2018-04-13 Thread Patrick Goetz
On 04/11/2018 02:35 PM, Sean Caron wrote: As a protest to asking questions on this list and getting solicitations for pay-for support, let me give you some advice for free :) Now, now. Paid support is how they keep the project going. You like using Slurm, right?

[slurm-users] Running Slurm on a single host?

2018-04-06 Thread Patrick Goetz
I've been using Slurm on a traditional CPU compute cluster, but am now looking at a somewhat different issue. We recently purchased a single machine with 10 high-end graphics cards to be used for CUDA calculations and which will be shared among a couple of different user groups. Does it make sen

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-22 Thread Patrick Goetz
I forgot to add that you will need to reload the daemon after doing this (and systemd will probably prompt you to do so). On 03/22/2018 08:10 AM, Patrick Goetz wrote: Or even better, don't think about it.  If you type   sudo systemctl edit slurmd this will open an editor.  Type your ch

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-22 Thread Patrick Goetz
Or even better, don't think about it. If you type sudo systemctl edit slurmd this will open an editor. Type your changes into this and save it, and systemd will set up the snippet file for you automatically (in /etc/systemd/system/slurmd.service.d/). On 03/21/2018 02:14 PM, Ole Holm Nielse
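For the original core-dump question, the override snippet itself would typically be something like this (LimitCORE=0 is the generic systemd way to disable core files for a service; not quoted from the thread):
    sudo systemctl edit slurmd
    # in the editor that opens, add:
    [Service]
    LimitCORE=0
    # then reload and restart:
    sudo systemctl daemon-reload
    sudo systemctl restart slurmd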

Re: [slurm-users] ntasks and cpus-per-task

2018-02-22 Thread Patrick Goetz
On 02/22/2018 07:50 AM, Christopher Benjamin Coffey wrote: It’s a big deal if folks use -n when it’s not an mpi program. This is because the non mpi program is launched n times (instead of once with internal threads) and will stomp over logs and output files (uncoordinated) leading to poor per
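In job-script terms, the distinction is roughly this (program names are placeholders):
    # MPI program: n independent tasks (ranks)
    #SBATCH --ntasks=8
    srun ./my_mpi_prog

    # threaded program: one task, several CPUs
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    ./my_threaded_prog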

Re: [slurm-users] How to deal with user running stuff in frontend node?

2018-02-15 Thread Patrick Goetz
The simple solution is to tell people not to do this -- that's what I do. And if that doesn't work threaten to kick them off the system. On 02/15/2018 09:11 AM, Manuel Rodríguez Pascual wrote: Hi all, Although this is not strictly related to Slurm, maybe you can recommend me some actions to d

Re: [slurm-users] Single user consuming all resources of the cluster

2018-02-08 Thread Patrick Goetz
What is TRES? On 02/06/2018 06:03 AM, Christopher Samuel wrote: On 06/02/18 21:40, Matteo F wrote: I've tried to limit the number of running job using Qos -> MaxJobsPerAccount, but this wouldn't stop a user to just fill up the cluster with fewer (but bigger) jobs. You probably want to look
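(TRES = Trackable RESources: CPUs, memory, GPUs, licenses, and so on. As a hedged example, per-user caps can be attached to a QOS along these lines:)
    sacctmgr modify qos normal set MaxTRESPerUser=cpu=64,gres/gpu=4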

Re: [slurm-users] Limit resources on login node

2018-01-31 Thread Patrick Goetz
On 01/31/2018 03:52 AM, Christopher Samuel wrote: Short version, add this to slurm.conf: PropagateResourceLimits NONE I'm surprised that this isn't the default setting?

Re: [slurm-users] ntpd or chrony?

2018-01-17 Thread Patrick Goetz
On newer systemd-based systems you can just use timedatectl -- I find this does everything I need it to do. Although I think on RHEL/CentOS systems timedatectl is just set up to start chrony, or something like this. On 01/14/2018 08:11 PM, Lachlan Musicman wrote: Hi all, As part of both Munge and S
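For reference, the usual invocations:
    timedatectl status          # show clock and NTP sync state
    timedatectl set-ntp true    # enable the system NTP client (timesyncd, or chronyd on RHEL/CentOS)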

Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Patrick Goetz
On 01/17/2018 08:12 AM, Ole Holm Nielsen wrote: John: I would refrain from installing the old default package "environment-modules" from the Linux distribution, since it doesn't seem to be maintained any more. Lmod, on the other hand, is actively maintained and solves some problems with the o