[slurm-users] Re: Spread a multistep job across clusters
On 26/8/24 8:40 am, Di Bernardini, Fabio via slurm-users wrote:

> Hi everyone, for accounting reasons, I need to create only one job
> across two or more federated clusters with two or more srun steps.

The limitations for heterogeneous jobs say:
https://slurm.schedmd.com/heterogeneous_jobs.html#limitations

> In a federation of clusters, a heterogeneous job will execute
> entirely on the cluster from which the job is submitted. The
> heterogeneous job will not be eligible to migrate between clusters
> or to have different components of the job execute on different
> clusters in the federation.

However, from your script it's not clear to me that's what you mean, because you include multiple --cluster options. I'm not sure if that works; as you mention, the docs don't cover that case. They do say, however, that:

> If a heterogeneous job is submitted to run in multiple clusters not
> part of a federation (e.g. "sbatch --cluster=alpha,beta ...") then
> the entire job will be sent to the cluster expected to be able to
> start all components at the earliest time.

My gut instinct is that this isn't going to work; my feeling is that launching a heterogeneous job like this would require the slurmctld on each cluster to coordinate, and I'm not aware of that being possible currently.

All the best,
Chris

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
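For reference, the supported single-cluster case looks something like the sketch below (the `./controller` and `./worker` programs are made-up placeholders):

```bash
#!/bin/bash
#SBATCH --ntasks=1 --mem=4G    # component 0
#SBATCH hetjob
#SBATCH --ntasks=8 --mem=1G    # component 1
# Both components run on the cluster the job was submitted to;
# per the docs above, they cannot be split across a federation.
srun --het-group=0 ./controller &
srun --het-group=1 ./worker
wait
```

Both components share one job ID for accounting, which may be enough for the single-cluster case, but not across clusters.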
[slurm-users] Re: REST API - get_user_environment
On 27/8/24 10:26 am, jpuerto--- via slurm-users wrote:

> Is anyone in contact with the development team?

Folks with a support contract can submit bugs at https://support.schedmd.com/

> I feel that this is pretty basic functionality that was removed from
> the REST API without warning. Considering that this was a "patch"
> release (based on traditional semantic versioning guidelines), this
> type of modification shouldn't have happened and makes me worry about
> upgrading in the future.

Slurm hasn't used semantic versioning for a long time; it moved to a year.month.minor version scheme instead. Major releases are now every 6 months, so the most recent ones have been:

* 23.02.0
* 23.11.0 (old 9 month system)
* 24.05.0 (new 6 month system)

The next major release should be in November:

* 24.11.0

All the best,
Chris
[slurm-users] Re: REST API - get_user_environment
On 22/8/24 11:18 am, jpuerto--- via slurm-users wrote:

> Do you have a link to that code? Haven't had any luck finding that repo

It's here (on the 23.11 branch):

https://github.com/SchedMD/slurm/tree/slurm-23.11/src/slurmrestd/plugins/openapi/dbv0.0.38
[slurm-users] Re: REST API - get_user_environment
On 15/8/24 10:55 am, jpuerto--- via slurm-users wrote:

> Any ideas on whether there's a way to mirror this functionality in v0.0.40?

Sorry for not seeing this sooner; I don't, I'm afraid!

All the best,
Chris
[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?
On 26/2/24 12:27 am, Josef Dvoracek via slurm-users wrote:

> What is the recommended way to run longer interactive job at your systems?

We provide NX for our users and also access via JupyterHub. We also have high priority QOS's intended for interactive use for rapid response, but they are capped at 4 hours (or 6 hours for Jupyter users).

All the best,
Chris
[slurm-users] Re: slurm-23.11.3-1 with X11 and zram causing permission errors: error: _forkexec_slurmstepd: slurmstepd failed to send return code got 0: Resource temporarily unavailable; Requeue of Jo
On 24/2/24 06:14, Robert Kudyba via slurm-users wrote:

> For now I just set it to chmod 777 on /tmp and that fixed the errors.
> Is there a better option?

Traditionally /tmp and /var/tmp have been mode 1777. That leading "1" is the sticky bit, originally invented to indicate that the OS should attempt to keep a frequently used binary in memory, but later adopted to indicate special handling of a world-writable directory: users can only unlink objects they own, not those of others.

Hope that helps!

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
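A quick demonstration on a scratch directory (standing in for /tmp on a node):

```shell
# Create a throwaway directory and give it the conventional /tmp permissions.
d=$(mktemp -d)
chmod 1777 "$d"      # 1 = sticky bit, 777 = rwx for user/group/other
stat -c '%a' "$d"    # prints 1777
rmdir "$d"
```

On the real node it's simply `chmod 1777 /tmp /var/tmp`.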
Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account
On 12/9/23 9:22 am, Stephan Roth wrote:

> Thanks Noam, this looks promising!

I would suggest that as well as the "magnetic" flag you may want the "flex" flag on the reservation too, in order to let jobs that match it run on GPUs outside of the reservation.

All the best,
Chris
Re: [slurm-users] Dynamic Node Shrinking/Expanding for Running Jobs in Slurm
On 28/6/23 04:02, Rahmanpour Koushki, Maysam wrote:

> Upon reviewing the current FAQ, I found that it states node shrinking
> is only possible for pending jobs. Unfortunately, it does not provide
> additional information or examples to clarify if this functionality
> can be extended to running jobs.

You can definitely release nodes from a running job. What I believe the FAQ is saying is that you cannot do something like change the number of cores per node or the memory you requested once a job is running.

As for why you'd do that: we've had people who (before we set up a mechanism to automatically reboot nodes to address this) would request more nodes than they needed, look at how fragmented kernel hugepages were, and then exclude nodes where there were too many fragmented for their needs.

All the best,
Chris
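From my reading of the FAQ, the shrink sequence looks roughly like this when run from inside the job; check the resize-script name against your version's documentation before relying on it:

```
# Shrink the running job to 2 nodes (node count can only decrease):
scontrol update JobId=$SLURM_JOB_ID NumNodes=2
# Slurm then writes out a script with updated environment variables
# for the remaining steps of the batch script to source:
. ./slurm_job_${SLURM_JOB_ID}_resize.sh
```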
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 27/2/23 03:34, David Laehnemann wrote:

> Hi Chris, hi Sean,

Hiya!

> thanks also (and thanks again) for chiming in.

No worries.

> Quick follow-up question: Would `squeue` be a better fall-back command
> than `scontrol` from the perspective of keeping `slurmctld` responsive?

Sadly not. Whilst a site can do some tricks to enforce rate limiting on squeue via the cli_filter, that doesn't mean other sites have that set up, so they are vulnerable to the same issue.

> Also, just as a quick heads-up: I am documenting your input by linking
> to the mailing list archives, I hope that's alright for you?
> https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467

No problem, but I would say it's got to be sacct.

All the best,
Chris
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 27/2/23 06:53, Brian Andrus wrote:

> Sorry, I had to share that this is very much like "Are we there yet?"
> on a road trip with kids 😄 Slurm is trying to drive.

Oh I love this analogy! Whereas sacct is like talking to the navigator. The navigator does talk to the driver to give directions, and the driver keeps them up to date with the current situation, but the kids can talk to the navigator without disrupting the driver's concentration.

All the best,
Chris
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 23/2/23 2:55 am, David Laehnemann wrote:

> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?

Many others have commented on this, but use of scontrol in this way is really, really bad because of the impact it has on slurmctld. This is because responding to the RPC (IIRC) requires taking read locks on internal data structures, and on a large, busy system (like ours - we recently rolled slurm job IDs back over to 1 after ~6 years of operation and run at over 90% occupancy most of the time) this can really damage scheduling performance.

We've had numerous occasions where we've had to track down users abusing scontrol in this way and redirect them to use sacct instead. We already use the cli_filter abilities in Slurm to impose a form of rate limiting on RPCs from other commands, but unfortunately scontrol is not covered by that.

All the best,
Chris
Re: [slurm-users] Slurm - UnkillableStepProgram
On 20/1/23 3:51 am, Stefan Staeglich wrote:

> But someone who is actually using a UnkillableStepProgram stated the
> opposite (that it's executed on the controller nodes). Are you aware
> of any change between Slurm releases? Maybe one of the two parts is
> just a leftover. Are you using a UnkillableStepProgram?

Yes, we've been using it for years on 7 different systems in my time here. It runs on the compute nodes and collects troubleshooting info for us when a job fails to die in an allowed time.

Chris
Re: [slurm-users] slurmrestd service broken by 22.05.07 update
On 29/12/22 11:31 am, Timo Rothenpieler wrote:

> Having service files in top level dirs like /run or /var/lib is bound
> to cause issues like this.

You can use local systemd overrides for things like this. In this case I suspect you can create this directory:

/etc/systemd/system/slurmrestd.service.d/

and drop files into it via the Configuration Management System Of Your Choice to override/augment the vendor supplied configuration.

https://www.freedesktop.org/software/systemd/man/systemd.unit.html

> Along with a unit file foo.service, a "drop-in" directory
> foo.service.d/ may exist. All files with the suffix ".conf"
> from this directory will be merged in the alphanumeric order
> and parsed after the main unit file itself has been parsed.
> This is useful to alter or add configuration settings for a
> unit, without having to modify unit files. Each drop-in file
> must contain appropriate section headers. For instantiated
> units, this logic will first look for the instance ".d/"
> subdirectory (e.g. "foo@bar.service.d/") and read its ".conf"
> files, followed by the template ".d/" subdirectory
> (e.g. "foo@.service.d/") and the ".conf" files there.

Caveat: written whilst travelling and without testing, or even having access to a system where I can test, but we do use this method for other services already.

All the best,
Chris
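As an untested sketch, a drop-in for this case might look like the following (the RuntimeDirectory value is illustrative - use whatever directory your setup actually needs):

```ini
# /etc/systemd/system/slurmrestd.service.d/override.conf
[Service]
# Have systemd create /run/slurmrestd at service start with the right
# ownership, rather than relying on it pre-existing in a top-level dir.
RuntimeDirectory=slurmrestd
RuntimeDirectoryMode=0755
```

Remember to run `systemctl daemon-reload` after dropping the file in.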
Re: [slurm-users] Job cancelled into the future
On 20/12/22 6:01 pm, Brian Andrus wrote:

> You may want to dump the database, find what table/records need
> updated and try updating them. If anything went south, you could
> restore from the dump.

+lots to making sure you've got good backups first. Also stop slurmdbd before you start on the backups and don't restart it until you've made the changes, including setting the rollup times to be before the jobs started, so that the rollups include these changes! When you start slurmdbd after making the changes it should see that it needs to do rollups and kick those off.

All the best,
Chris
Re: [slurm-users] salloc problem
On 27/10/22 4:18 am, Gizo Nanava wrote:

> we run into another issue when using salloc interactively on a cluster
> where Slurm power saving is enabled. The problem seems to be caused by
> the job_container plugin and occurs when the job starts on a node
> which boots from a power down state. If I resubmit a job immediately
> after the failure to the same node, it always works. I can't find any
> other way to reproduce the issue other than booting a reserved node
> from a power down state.

Looking at this:

> slurmstepd: error: container_p_join: open failed for
> /scratch/job_containers/791670/.ns: No such file or directory

I'm wondering if /scratch is a separate filesystem and, if so, whether it could be getting mounted only _after_ slurmd has started on the node? If that's the case then it would explain the error and why it works immediately after. On our systems we always try to ensure that slurmd is the very last thing to start on a node, and it only starts if everything has succeeded up to that point.

All the best,
Chris
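If that turns out to be the cause, one way to express that ordering (assuming /scratch is a regular fstab/mount-unit mount and your nodes run systemd) is a drop-in for slurmd - an untested sketch:

```ini
# /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
[Unit]
# Don't start slurmd until /scratch is mounted, and stop it if the
# mount disappears, so job_container/tmpfs always has its base path.
RequiresMountsFor=/scratch
```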
Re: [slurm-users] Switch setting in slurm.conf breaks slurmctld if the switch type is not there in slurmcrld node
On 27/10/22 11:30 pm, Richard Chang wrote:

> Yes, the system is a HPE Cray EX, and I am trying to use
> switch/hpe_slingshot.

Which version of Slurm are you using, Richard?

All the best,
Chris
Re: [slurm-users] Prolog and job_submit
On 30/10/22 12:27 pm, Davide DelVento wrote:

> But if I understand correctly your Prolog vs TaskProlog distinction,
> the latter would have the environmental variable and run as user,
> whereas the former runs as root and doesn't get the environment,

That's correct. My personal view is that injecting arbitrary input from a user (such as these environment variables) would make life hazardous from a security point of view for a root privileged process such as a prolog.

> not even from the job_submit script.

That is correct; all the job_submit script will do is inject the environment variable into the job's environment, just as if a user had done so.

> The problem with a TaskProlog approach is that what I want to do
> (making a non-accessible file available) would work best as root. As a
> workaround I could make that just obscure but still user-possible. Not
> ideal, but better than nothing as it is now. Alternatively, I could
> use another way to let the job_submit lua script communicate with the
> Prolog, not sure exactly what (temp directory on the shared
> filesystem, writeable only by root??)

My only other thought is that you might be able to use node features & job constraints to communicate this without the user realising. For instance you could declare the nodes where the software is installed to have "Feature=mysoftware", and then your job_submit could spot users requesting the license and add the constraint "mysoftware" to their job. The (root privileged) Prolog can see that via the SLURM_JOB_CONSTRAINTS environment variable and so could react to it.

Then when 23.02 comes out you could use the new SLURM_JOB_LICENSES environment variable in addition, and retire the old way once jobs using the old method have completed.

> Thanks for pointing to that commit. I bit too down the road but good
> to know.

No worries, best of luck!

All the best,
Chris
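A sketch of the Prolog side of that idea (the feature name "mysoftware" and the echoed action are made up for illustration; in real use SLURM_JOB_CONSTRAINTS is set by slurmd, so it's forced here just to make the fragment self-contained):

```shell
# Hypothetical fragment of a root-run Prolog reacting to a job constraint.
SLURM_JOB_CONSTRAINTS="mysoftware"   # provided by slurmd in real use
case ",${SLURM_JOB_CONSTRAINTS}," in
  *mysoftware*) echo "exposing licensed software" ;;  # e.g. fix perms/bind-mount here
  *)            echo "nothing to do" ;;
esac
```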
Re: [slurm-users] Prolog and job_submit
On 30/10/22 10:23 am, Chris Samuel wrote:

> Unfortunately it looks like the license request information doesn't
> get propagated into any prologs from what I see from a scan of the
> documentation. 🙁

This _may_ be fixed in the next major Slurm release (February) if I'm reading this right:

https://github.com/SchedMD/slurm/commit/3c6c4c08d8deb89aa2c992a65964f53663097d26

All the best,
Chris
Re: [slurm-users] Prolog and job_submit
On 29/10/22 7:37 am, Davide DelVento wrote:

> So either I misinterpreted that "same environment as the user tasks"
> or there is something else that I am doing wrong.

Slurm has a number of different prologs that can run, which can cause confusion, and I suspect that's what's happening here. The "Prolog" in your configuration runs as root, but it's the "TaskProlog" that runs as the user and so has access to the job's environment (including the environment variable you are setting).

Unfortunately it looks like the license request information doesn't get propagated into any prologs, from what I see from a scan of the documentation. :-(

Best of luck,
Chris
Re: [slurm-users] job_time_limit: inactivity time limit reached ...
On 19/9/22 05:46, Paul Raines wrote:

> In slurm.conf I had InactiveLimit=60 which I guess is what is
> happening but my reading of the docs on this setting was it only
> affects the starting of a job with srun/salloc and not a job that has
> been running for days. Is it InactiveLimit that leads to the
> "inactivity time limit reached" message?

I believe so, but remember that this governs timeouts around communications between slurmctld and the srun/salloc commands, and not things like shell inactivity timeouts, which are quite different. See:

https://slurm.schedmd.com/faq.html#purge

# A job is considered inactive if it has no active job steps or
# if the srun command creating the job is not responding.

Hope this helps!

All the best,
Chris
Re: [slurm-users] admin users without a database
On 19/9/22 06:14, Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) wrote:

> Is it possible to make a user an admin without slurmdbd? The docs I've
> found indicate that I need to set the user's admin level with
> sacctmgr, but that command always says

I don't believe so; I believe that's all stored in slurmdbd (and sacctmgr is a command to communicate with slurmdbd).

All the best,
Chris
Re: [slurm-users] srun: error: io_init_msg_unpack: unpack error
On 6/8/22 10:43 am, David Magda wrote:

> It seems that the new srun(1) cannot talk to the old slurmd(8). Is
> this 'on purpose'? Does the backwards compatibility of the protocol
> not extend to srun(1)?

That's expected; what you're hoping for here is forward compatibility. Newer daemons know how to talk to older utilities, but it doesn't work the other way around.

What we do in this situation is upgrade slurmdbd, then slurmctld, change our images for compute nodes to be ones that have the new Slurm version, and then, before we bring partitions back up, issue an "scontrol reboot ASAP nextstate=resume" for all the compute nodes. This means existing jobs will keep going, but no new jobs will start on compute nodes with older versions of Slurm from that point on. As jobs on nodes finish, the nodes get rebooted into the new images and accept jobs again. (The "ASAP" flag drains the node; once it has successfully started its slurmd as the final thing on boot it undrains at that point - and slurmctld is smart about planning its scheduling for this situation.)

It's also safe to restart slurmd's with running jobs, though you may want to drain the nodes before that so slurmctld won't try and send them a job in the middle.

The one issue backwards compatibility in the Slurm protocol can't help with is if there are incompatible config file changes needed; then you need to bite the bullet and upgrade the slurmd's and commands at the same time everywhere the new config file goes (and for those of us running in configless mode that means everywhere).

Hope this helps!

All the best,
Chris
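In command form, the rolling part of that sequence is roughly the following (the node range is illustrative):

```
# after slurmdbd and slurmctld are upgraded and new node images staged:
scontrol reboot ASAP nextstate=resume nid[001000-001999]
# watch nodes drain, reboot into the new image and return to service:
sinfo -R
```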
Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?
On 3/8/22 10:20 pm, Gerhard Strangar wrote:

> With a fake license called reboot?

It's a neat idea, but I think there is a catch:

* 3 jobs start, each taking 1 license
* Other reboot jobs are all blocked
* Running reboot jobs trigger node reboot
* Running reboot jobs end when either the script exits and slurmd cleans it up before the reboot kills it, or it gets killed as NODE_FAIL when the node has been unresponsive for too long and is marked as down
* Licenses for those jobs are released
* 3 more reboot jobs start whilst the original 3 are rebooting
* 6 nodes are now rebooting
* Filesystem fall down go boom
* Also your rebooted nodes are now drained as "Node unexpectedly rebooted"

I guess you could change your Slurm config to not mark nodes as down if they stop responding, and make sure the job that's launched holds its license until the reboot completes, but that feels wrong to me.

All the best,
Chris
Re: [slurm-users] "Plugin is corrupted" message when using drmaa / debugging libslurm
On 1/7/22 07:51, Jean-Christophe HAESSIG wrote:

> The libraries were incompatible but that wasn't reflected in the
> packaging and due to the similar and long version string, I didn't
> spot it before.

Oh, good spot!

All the best,
Chris
Re: [slurm-users] "Plugin is corrupted" message when using drmaa / debugging libslurm
On 29/6/22 09:01, Jean-Christophe HAESSIG wrote:

> No, the job is placed through the DRMAA API which enables programs to
> place jobs in a cluster-agnostic way. The program doesn't know it is
> talking to Slurm. The DRMAA library makes the translation and loads
> libslurm36, where the messages come from. That's why I don't know how
> to tell libslurm to log more, since its use is hidden behind DRMAA.

My gut instinct with this is that it will be reading your slurm.conf file to find its configuration, and so you can adjust that to increase the log level (realising that everything that reads it at that point will pick those changes up). Academic now though, as you've solved it I guess!

All the best,
Chris
Re: [slurm-users] "Plugin is corrupted" message when using drmaa / debugging libslurm
On 28/6/22 12:19 pm, Jean-Christophe HAESSIG wrote:

> Hi, I'm facing a weird issue where launching a job through drmaa
> (https://github.com/natefoo/slurm-drmaa) aborts with the message
> "Plugin is corrupted", but only when that job is placed from one of my
> compute nodes. Running the command from the login node seems to work.

I suspect this is where your error is happening:

https://github.com/SchedMD/slurm/blob/1ce55318222f89fbc862ce559edfd17e911fee38/src/common/plugin.c#L284

It's when it's checking that it can load the plugin without hitting any unresolved library symbols. The fact you are hitting this sounds like you're missing libraries on the compute nodes that are present on the login node (or there's some reason they're not getting found if present).

[...]

> Anyway, the message seems to originate from libslurm36 and I would
> like to activate the debug messages (debug3, debug4). Is there a way
> to do this with an environment variable or any other convenient
> method?

This depends on what part of Slurm is generating these errors - is it something like sbatch or srun? If so, using multiple -v's will increase the debug level so you can pick those up. If it's from slurmd then you'll want to set SlurmdDebug to "debug3" in your slurm.conf. Once that's done you should get information on which symbols are not being found, and that should give you some insight into what's going on.

Best of luck,
Chris
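Two quick checks along those lines (the plugin path below is an assumption - adjust it to wherever your Slurm plugins are installed):

```
# run a client command with extra verbosity:
srun -vvv hostname
# look for unresolved symbols in a suspect plugin on the compute node:
ldd -r /usr/lib64/slurm/auth_munge.so | grep -i undefined
```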
Re: [slurm-users] Rolling upgrade of compute nodes
On 30/5/22 10:06 am, Chris Samuel wrote:

> If you switch that symlink those jobs will pick up the 20.11 srun
> binary and that's where you may come unstuck.

Just to quickly fix that: srun talks to slurmctld (which would also be 20.11 for you), slurmctld talks to the slurmd's running the job (which would be 19.05, so OK), but then the slurmd would try to launch a 20.11 slurmstepd, and that is where I suspect things could come undone.

Sorry - hadn't had coffee when I was writing earlier. :-)

Chris
Re: [slurm-users] Rolling upgrade of compute nodes
On 30/5/22 3:01 am, byron wrote:

> The one thing I'm unsure about is as much a Linux / NFS issue as a
> Slurm one. When I change the soft link for "default" to point to the
> new 20.11 slurm install, but all the compute nodes are still running
> the old 19.05 version because they haven't been restarted yet, will
> that not cause any problems? Or will they still just see the same old
> 19.05 version of slurm that they are running until they are restarted?

That may cause issues. Whilst the ASAP flag to scontrol reboot guarantees no new jobs will start on the selected nodes until after they've rebooted, it doesn't (and shouldn't) stop new job steps from srun starting on them. If you switch that symlink those jobs will pick up the 20.11 srun binary, and that's where you may come unstuck.

This is one of the reasons why we do everything with Slurm installed via RPM inside an image: you have a pretty straightforward A -> B transition. If your symlink were node-local in some way (say, created at boot time via some config management system before slurmd starts) then that could work around this, as the nodes would still see the appropriate slurm binaries for the running slurmd.

Best of luck!
Chris
Re: [slurm-users] Limit partition to 1 job at a time
On 22/3/22 11:40 am, Russell Jones wrote:

> I am struggling to figure out how to do this. Any tips?

My only thought to achieve this would be to define a license for the partition with a count of 1, and to use the job submit filter to ensure that any job that is submitted to (or ends up being directed to) that partition requests that one license.

Best of luck!
Chris
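The license half of that in slurm.conf might look like this (all names invented; the job submit filter then has to add "serialonly" to every job bound for that partition):

```
# slurm.conf: a site-local license with a single token
Licenses=serialonly:1
PartitionName=serial Nodes=node[01-04] State=UP
```

With only one token, at most one job holding the license can run at a time; everything else pends on "Licenses".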
Re: [slurm-users] initscript poll timeout @ 10000 msec :: what slurm conf var?
On 4/12/21 9:34 am, Adrian Sevcenco wrote:

> actually is not ... so, once again, does anyone have an idea about
> customization of the timeout of init script defined in
> job_container.conf?

Looking at the source, it's hard-coded in Slurm 21.08, so you'd need to patch and rebuild at present.

https://github.com/SchedMD/slurm/blob/934f3b543b6bc9f3335d1cc6813b8d95cb2c49b4/src/plugins/job_container/tmpfs/job_container_tmpfs.c#L473

All the best,
Chris
Re: [slurm-users] Wrong hwloc detected?
On 5/11/21 4:47 am, Diego Zuccato wrote:

> How can Slurm detect such an old HWLOC version?

Looking at the code, it's not actually checking the hwloc version; it's finding an error condition and suggesting that may be the cause, but it sounds like that's not it for you.

src/plugins/task/cgroup/task_cgroup_cpuset.c:

    /* should never happen in normal scenario */
    if ((sock_loop > npdist) && !hwloc_success) {
        /* hwloc_get_obj_below_by_type() fails if no CPU set
         * configured, see hwloc documentation for details */
        error("hwloc_get_obj_below_by_type() failing, "
              "task/affinity plugin may be required to address bug "
              "fixed in HWLOC version 1.11.5");
        return XCGROUP_ERROR;
    }

[...]

If you've got support from SchedMD, open a bug with them; if not, and you're using the Debian packages, I'd suggest opening a bug with Debian about it.

Best of luck!
Chris
Re: [slurm-users] How to get an estimate of job completion for planned maintenance?
On 9/11/21 5:42 am, Loris Bennett wrote:

> We just set up a reservation at a point in time which is further in
> the future than our maximum run-time. There is then no need to drain
> anything. Short running jobs can still run right up to the
> reservation.

This is the same technique we use too - works well!

All the best,
Chris
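As a concrete sketch (dates, duration and the reservation name are illustrative):

```
# With a 7-day MaxTime, create the maintenance reservation at least
# 7 days out so every job that can start will finish before it begins:
scontrol create reservation reservationname=maint_outage \
    starttime=2021-11-20T08:00:00 duration=08:00:00 \
    flags=maint nodes=ALL users=root
```

Jobs whose time limit would overrun into the reservation simply pend until after it ends.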
Re: [slurm-users] is there a way to temporarily freeze an account?
On 6/10/21 6:21 am, byron wrote:

> We have some accounts that we would like to suspend / freeze for the
> time being that have unused hours associated with them. Is there any
> way of doing this without removing the users associated with the
> accounts or zeroing their hours?

We have a QOS called "batchdisable" which has MaxJobs=0 and MaxSubmitJobs=0, and then we just set the user's list of QOS's to that:

sacctmgr update user where name=bar set qos=batchdisable

All the best,
Chris
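Spelled out end to end (user name "bar" as in the example above; "normal" is an assumed name for your usual default QOS):

```
# one-off: create the blocking QOS
sacctmgr add qos batchdisable set MaxJobs=0 MaxSubmitJobs=0
# freeze a user without touching their account or usage:
sacctmgr update user where name=bar set qos=batchdisable
# later, to re-enable:
sacctmgr update user where name=bar set qos=normal
```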
Re: [slurm-users] draining nodes due to failed killing of task?
On Friday, 6 August 2021 12:02:45 AM PDT Adrian Sevcenco wrote:

> i was wondering why a node is drained when killing of task fails and
> how can i disable it? (i use cgroups) moreover, how can the killing of
> task fails? (this is on slurm 19.05)

Slurm has tried to kill processes, but they refuse to go away. Usually this means they're stuck in a device or I/O wait for some reason, so look for processes that are in a "D" state on the node. As others have said, they can be stuck writing out large files and waiting for the kernel to complete that before they exit. This can also happen if you're using GPUs and something has gone wrong in the driver and the process is stuck in the kernel somewhere.

You can try doing "echo w > /proc/sysrq-trigger" on the node to see if the kernel reports tasks stuck and where they are stuck. If there are tasks stuck in that state then often the only recourse is to reboot the node back into health.

You can tell Slurm to run a program on the node should it find itself in this state, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepProgram

Best of luck,
Chris
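A quick way to spot those stuck processes (WCHAN shows where in the kernel each task is waiting):

```shell
# List processes in uninterruptible sleep ("D" state); usually empty on
# a healthy node, so no output here is a good sign.
ps -eo pid,stat,wchan:20,comm --no-headers | awk '$2 ~ /^D/'
```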
Re: [slurm-users] (no subject)
On Friday, 30 July 2021 11:21:19 AM PDT Soichi Hayashi wrote:

> I am running slurm-wlm 17.11.2

You are on a truly ancient version of Slurm there, I'm afraid (there have been 4 major releases and over 13,000 commits since that was tagged in January 2018). I would strongly recommend you try to get to a more recent release to pick up those bug fixes and improvements. A quick scan of the NEWS file shows a number that are cloud related.

https://github.com/SchedMD/slurm/blob/slurm-20.11/NEWS

All the best,
Chris
Re: [slurm-users] OpenMPI interactive change in behavior?
On Monday, 26 April 2021 2:12:41 PM PDT John DeSantis wrote:

> Furthermore, searching the mailing list suggests that the appropriate
> method is to use `salloc` first, despite version 17.11.9 not needing
> `salloc` for an "interactive" session.

Before 20.11, with salloc you needed to set a SallocDefaultCommand to use srun to push the session over on to a compute node, and then you needed to set a bunch of things to prevent that srun from consuming resources that the subsequent srun's would need. That was especially annoying when you were dealing with GPUs, as you would need to "srun" anything that needed to access them (when you used cgroups to control access).

With 20.11 there's a new "use_interactive_step" option that uses similar trickery, except Slurm handles not consuming those resources for you and handles GPUs correctly.

So for your 20.11 system I would recommend giving salloc and the "use_interactive_step" option a go and seeing if it helps.

All the best,
Chris
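From my reading of the 20.11 docs, that's a single slurm.conf setting:

```
# slurm.conf
LaunchParameters=use_interactive_step
```

After which a plain `salloc -N1` should drop the user straight into a shell on the allocated compute node, with no resources eaten by the interactive step itself.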
Re: [slurm-users] unable to Hold and release the job using scontrol
On Saturday, 22 May 2021 11:05:54 PM PDT Zainul Abiddin wrote:

> i am trying to hold the job from scontrol but not able to hold the job.

It looks like you're trying to hold a running job, which isn't possible. I see from the Slurm FAQ that you should be able to use "scontrol requeuehold" for what you are trying to achieve.

https://slurm.schedmd.com/faq.html#req

# Slurm supports requeuing jobs in a hold state with the command:
#
# scontrol requeuehold job_id
#
# The job can be in state RUNNING, SUSPENDED, COMPLETED or FAILED before
# being requeued.

Best of luck,
Chris
Re: [slurm-users] Slurm - UnkillableStepProgram
Hi Mike,

On 22/3/21 7:12 pm, Yap, Mike wrote:

> I presume UnkillableStepTimeout is set in slurm.conf and it acts as a
> timer to trigger UnkillableStepProgram

That is correct.

> UnkillableStepProgram can be used to send email or reboot a compute
> node - question is how do we configure it?

Also - or to automate collecting debug info (which is what we do), after which we manually intervene to reboot the node once we've determined there's no more useful info to collect.

It's just configured in your slurm.conf:

UnkillableStepProgram=/path/to/the/unkillable/step/script.sh

Of course this script has to be present on every compute node.

All the best,
Chris
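A minimal sketch of what such a script could do (the path and log location are examples; SLURM_JOB_ID is set by slurmd when it invokes the program, hence the "unknown" fallback when run by hand):

```shell
#!/bin/bash
# Hypothetical UnkillableStepProgram: snapshot node state for later triage.
log="/var/tmp/unkillable-${SLURM_JOB_ID:-unknown}.log"
{
  date
  echo "=== D-state processes ==="
  # processes in uninterruptible sleep are the usual unkillable culprits
  ps -eo pid,stat,wchan:20,cmd --no-headers | awk '$2 ~ /^D/'
} > "$log" 2>&1
```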
Re: [slurm-users] Job not running with Resource Reason even though resources appear to be available
On Saturday, 23 January 2021 9:54:11 AM PST Paul Raines wrote:

> Now rtx-08 which has only 4 GPUs seems to always get all 4 uses.
> But the others seem to always only get half used (except rtx-07
> which somehow gets 6 used so another wierd thing).
>
> Again if I submit non-GPU jobs, they end up allocating all hte
> cores/cpus on the nodes just fine.

What does your gres.conf look like for these nodes? One thing I've seen in the past is where the core specifications for the GPUs are out of step with the hardware, so Slurm thinks they're on the wrong socket. Then when all the cores in that socket are used up, Slurm won't put more GPU jobs on the node without the jobs explicitly asking to not do locality.

One thing I've noticed is that prior to Slurm 20.02 the documentation for gres.conf used to say:

# If your cores contain multiple threads only the first thread
# (processing unit) of each core needs to be listed.

but that language is gone from 20.02 and later, and the change isn't mentioned in the release notes for 20.02, so I'm not sure what happened there. The only clue is this commit:

https://github.com/SchedMD/slurm/commit/7461b6ba95bb8ae70b36425f2c7e4961ac35799e#diff-cac030b65a8fc86123176971a94062fafb262cb2b11b3e90d6cc69e353e3bb89

which says "xcpuinfo_abs_to_mac() expects a core list, not a CPU list."

Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
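For comparison, a gres.conf sketch for hypothetical 40-core, 8-GPU nodes (node names, device paths and core ranges are all made up; the point is that each Cores= range must match the socket the GPUs actually sit on, as reported by lstopo):

```
# gres.conf (illustrative): GPUs 0-3 hang off socket 0 (cores 0-19),
# GPUs 4-7 off socket 1 (cores 20-39).
NodeName=rtx-[01-02] Name=gpu File=/dev/nvidia[0-3] Cores=0-19
NodeName=rtx-[01-02] Name=gpu File=/dev/nvidia[4-7] Cores=20-39
```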
Re: [slurm-users] trying to add gres
On 24/12/20 4:42 pm, Erik Bryer wrote: I made sure my slurm.conf is synchronized across machines. My intention is to add some arbitrary gres for testing purposes. Did you update your gres.conf on all the nodes to match? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm Upgrade Philosophy?
On 24/12/20 6:24 am, Paul Edmon wrote: We then have a test cluster that we install the release on a run a few test jobs to make sure things are working, usually MPI jobs as they tend to hit most of the features of the scheduler. One thing I meant to mention last night was that we use Reframe from CSCS as the test framework for our systems, our user support folks maintain our local tests as they're best placed to understand the user requirements that need coverage and we feed in our system facing requirements to them so they can add tests for that side too. https://reframe-hpc.readthedocs.io/ All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm Upgrade Philosophy?
On Friday, 18 December 2020 10:10:19 AM PST Jason Simms wrote:

> Thanks to several helpful members on this list, I think I have a much better
> handle on how to upgrade Slurm. Now my question is, do most of you upgrade
> with each major release?

We do, though not immediately and not without a degree of testing on our test systems. One of the big reasons for us upgrading is that we've usually paid for features in Slurm for our needs (for example in 20.11 that includes scrontab so users won't be tied to favourite login nodes, as well as the experimental RPC queue code due to the large numbers of RPCs our systems need to cope with).

I also keep an eye out for discussions of what other sites find with new releases too, so I'm following the current concerns about 20.11 and the change in behaviour for job steps that do (expanding NVIDIA's example slightly):

#SBATCH --exclusive
#SBATCH -N2

srun --ntasks-per-node=1 python multi_node_launch.py

which (if I'm reading the bugs correctly) fails in 20.11 as that srun no longer gets all the allocated resources, just the default of --cpus-per-task=1. This also affects things like mpirun in OpenMPI built with Slurm support (as it effectively calls "srun orted" and that "orted" launches the MPI ranks, so in 20.11 it only has access to a single core for them all to fight over). Again - if I'm interpreting the bugs correctly! I don't currently have a test system that's free to try 20.11 on, but hopefully early in the new year I'll be able to test this out to see how much of an impact this is going to have and how we will manage it.

https://bugs.schedmd.com/show_bug.cgi?id=10383
https://bugs.schedmd.com/show_bug.cgi?id=10489

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
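One hedged workaround, assuming the behaviour is as described in those bugs, is to stop relying on the step inheriting the allocation and spell the step's resources out explicitly:

```
#!/bin/bash
#SBATCH --exclusive
#SBATCH -N2
# Sketch only: hand the step its CPU count explicitly rather than
# relying on pre-20.11 inheritance of the job's full allocation.
srun --ntasks-per-node=1 --cpus-per-task="$SLURM_CPUS_ON_NODE" python multi_node_launch.py
```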
Re: [slurm-users] 20.11.1 on Cray: job_submit.lua: SO loaded on CtlD restart: script skipped when job submitted
On 16/12/20 6:21 pm, Kevin Buckley wrote:

> The skip is occuring, in src/lua/slurm_lua.c, because of this trap

That looks right to me, that's Doug's code which is checking whether the file has been updated since slurmctld last read it in. If it has then it'll reload it, but if it hasn't then it'll skip it (and if you've got debugging up high then you'll see that message). So if you see that message then the lua has been read in to slurmctld and should get called.

You might want to check the log for when it last read it in, just in case there was some error detected at that point. You can also use luac to run a check over the script you've got, like this:

luac -p /etc/opt/slurm/job_submit.lua

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Backfill pushing jobs back
Hi David, On 9/12/20 3:35 am, David Baker wrote: We see the following issue with smaller jobs pushing back large jobs. We are using slurm 19.05.8 so not sure if this is patched in newer releases. This sounds like a problem that we had at NERSC (small jobs pushing back multi-thousand node jobs), and we carried a local patch for which Doug managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 20.02.6 is the current version). Hope this helps! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes
On 26/11/20 9:21 am, Steve Bland wrote:

> Sinfo always returns nodes not responding

One thing - do the nodes return to this state when you resume them with "scontrol update node=srvgridslurm[01-03] state=resume"?

If they do then what do your slurmctld logs say for the reason for this? You can bump up the log level on your slurmctld with, for instance, "scontrol setdebug debug" for more info (we run ours at debug all the time anyway).

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm Upgrade
On 11/2/20 7:31 am, Paul Edmon wrote:

> e. Run slurmdbd -Dv to do the database upgrade. Depending on the upgrade
> this can take a while because of database schema changes.

I'd like to emphasise the importance of doing the DB upgrade in this way. Do not use systemctl for this: if systemd runs out of patience waiting for slurmdbd to finish the migration and start up, it can kill slurmdbd part way through the migration. Fortunately that's not something I've run into myself, but as our mysqldump of our production DB is approaching 100GB now it's not something we want to run into!

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Jobs stuck in "completing" (CG) state
On 10/24/20 9:22 am, Kimera Rodgers wrote:

> [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

To me this looks like networking issues, perhaps firewall/iptables rules blocking connections.

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it
On Thursday, 15 October 2020 8:50:33 PM PDT Kevin Buckley wrote: > Maybe the SLES 15 SRPM will shed some light althought it seems odd > that the SPEC file inside the Slurm tarball can't recognise that's > on a SLES 15 OS. I've not had problems building Slurm 20.02.x on SLES15 SP0 (CLE7.0 UP01), so I'm wondering if something big happened with munge in SP1? I'd suggest opening a bug with SchedMD on this to check into what's happening, they'll likely be able to help with this! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Segfault with 32 processes, OK with 30 ???
On Monday, 12 October 2020 2:43:36 AM PDT Diego Zuccato wrote: > Seems so: > "The application appears to have been direct launched using "srun", > but OMPI was not built with SLURM's PMI support and therefore cannot > execute." > > So it seems I can't use srun to launch OpenMPI jobs. OK, I suspect this rules Slurm out of the running as the cause, I'd suggest either rebuilding OpenMPI with Slurm support or if it's a distro related package filing a bug with the distro, or alternatively trying for help with the OpenMPI users list: https://lists.open-mpi.org/mailman/listinfo/users Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] unable to run on all the logical cores
On 10/7/20 10:13 pm, David Bellot wrote:

> NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=257243 State=UNKNOWN

With this configuration Slurm is allocating a single physical core (with 2 thread units) per task, so you are using all (physical) cores. However, if what you want is to have 1 process per thread unit (not necessarily a good idea, depending on how your code works) then I think you'd need to adjust your config to lie to Slurm and tell it it's got 40 cores per socket and 1 thread per core instead.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
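That "lie" would look something like the following for the same node - one hardware thread presented per "core", keeping CPUs equal to Sockets × Cores × Threads:

```
# Illustrative only: schedule one task per hardware thread.
NodeName=foobar01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=40 ThreadsPerCore=1 RealMemory=257243 State=UNKNOWN
```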
Re: [slurm-users] Simple free for all cluster
On Tuesday, 6 October 2020 7:53:02 AM PDT Jason Simms wrote:

> I currently don't have a MaxTime defined, because how do I know how long a
> job will take? Most jobs on my cluster require no more than 3-4 days, but
> in some cases at other campuses, I know that jobs can run for weeks. I
> suppose even setting a time limit such as 4 weeks would be overkill, but at
> least it's not infinite. I'm curious what others use as that value, and how
> you arrived at it

My journey over the last 16 years in HPC has been one of decreasing time limits. Back in 2003 with VPAC's first Linux cluster we had no time limits; we then introduced a 90 day limit so we could plan quarterly maintenances (and yes, we had users who had jobs which legitimately ran longer than that, so they had to learn to checkpoint). At VLSCI we had 30 day limits (life sciences, so many long running poorly scaling jobs), then when I was at Swinburne it was a 7 day limit, and now here at NERSC we've got 2 day limits.

It really is down to what your use cases are and how much influence you have over your users. It's often the HPC sysadmin's responsibility to try and find that balance between good utilisation, effective use of the system and reaching the desired science/research/development outcomes.

Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
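Whichever value you land on, it's just a partition parameter. An illustrative slurm.conf fragment (the partition name, node list and limits are examples, not recommendations):

```
# A hard 7-day ceiling, with a shorter default for jobs that don't
# request a time limit of their own:
PartitionName=batch Nodes=node[001-100] MaxTime=7-00:00:00 DefaultTime=0-04:00:00 State=UP
```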
Re: [slurm-users] Segfault with 32 processes, OK with 30 ???
On Tuesday, 6 October 2020 12:12:41 AM PDT Diego Zuccato wrote: > At least I couldn't replicate launching manually (it always says "no > slots available" unless I use mpirun -np 16 ...). I'm no MPI expert > (actually less than a noob!) so I can't rule out it's unrelated to > Slurm. I mostly hope that on this list I can find someone with enough > experience with both Slurm and MPI. Launch it with "srun" rather than "mpirun", that way it'll be managed by Slurm. If your test program then says every rank is rank 0 that will tell you OpenMPI is not built with Slurm support. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] How to contact slurm developers
On 9/30/20 8:29 am, Relu Patrascu wrote: We have actually modified the code on both v 19 and 20 to do what we would like, preemption within the same QOS, but we think that the community would benefit from this feature, hence our request to have it in the release version. There's a special severity level for contributions of code in the SchedMD bugzilla "C - Contributions". All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Core reserved/bound to a GPU
On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote: > Every thing works great so far but now I would like to bound a specific > core to each GPUs on each node. By "bound" I mean to make a particular > core not assignable to a CPU job alone so that the GPU is available > whatever the CPU workload on the node. What I've done in the past (waves to Swinburne folks on the list) was to have overlapping partitions on GPU nodes where the GPU job partition had access to all the cores and the CPU only job partition had access to only a subset (limited by the MaxCPUsPerNode parameter on the partition). The problem you run into there though is that there's no way to reserve cores on a particular socket, which means problems for folks who care about locality for GPU codes as they can wait in the queue with GPUs free and cores free but not the right cores on the right socket to be able to use the GPUs. :-( Here's my bug from when I was in Australia for this issue where I suggested a MaxCPUsPerSocket parameter for partitions: https://bugs.schedmd.com/show_bug.cgi?id=4717 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
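The overlapping-partition arrangement looks roughly like this for hypothetical 40-core GPU nodes (names and numbers are illustrative):

```
# GPU jobs may use all 40 cores; CPU-only jobs are capped at 32 per node,
# so 8 cores stay free for GPU work regardless of CPU-only load.
PartitionName=gpu     Nodes=gpunode[01-04] State=UP
PartitionName=cpuonly Nodes=gpunode[01-04] MaxCPUsPerNode=32 State=UP
```

Note this caps cores per node, not per socket, which is exactly the locality gap the bug below describes.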
Re: [slurm-users] Alternatives for MailProg
On 8/27/20 3:42 pm, Brian Andrus wrote: Actually, you can add headers of all kinds: Quick search of "sendmail add headers" discovers: Problem is that Slurm doesn't directly call sendmail, it calls "mail" (or MailProg in your slurm.conf) instead, hence not being able to add headers. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] cgroup limits not created for jobs
On Friday, 24 July 2020 9:48:35 AM PDT Paul Raines wrote:

> But when I run a job on the node it runs I can find no
> evidence in cgroups of any limits being set
>
> Example job:
>
> mlscgpu1[0]:~$ salloc -n1 -c3 -p batch --gres=gpu:quadro_rtx_6000:1 --mem=1G
> salloc: Granted job allocation 17
> mlscgpu1[0]:~$ echo $$
> 137112
> mlscgpu1[0]:~$

You're not actually running inside a job at that point unless you've defined "SallocDefaultCommand" in your slurm.conf, and I'm guessing that's not the case there. You can make salloc fire up an srun for you in the allocation using that option, see the docs here:

https://slurm.schedmd.com/slurm.conf.html#OPT_SallocDefaultCommand

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] squeue reports ReqNodeNotAvail but node is available
On Friday, 10 July 2020 3:34:44 PM PDT Janna Ore Nugent wrote: > I’ve got an intermittent situation with gpu nodes that sinfo says are > available and idle, but squeue reports as “ReqNodeNotAvail”. We’ve cycled > the nodes to restart services but it hasn’t helped. Any suggestions for > resolving this or digging into it more deeply? What does "scontrol show job $JOB" say for an affected job, and what does "scontrol show node $NODE" look like for one of these nodes? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Jobs killed by OOM-killer only on certain nodes.
On Thursday, 2 July 2020 6:52:15 AM PDT Prentice Bisbal wrote: > [2020-07-01T16:19:19.463] [801777.extern] _oom_event_monitor: oom-kill > event count: 1 We get that line for pretty much every job, I don't think it reflects the OOM killer being invoked on something in the extern step. OOM killer invocations should be recorded in the kernel logs on the node, check with "dmesg -T" to see if it's being invoked (or whether they are getting logged to via syslog if they've got dropped from the ring buffer due to later messages). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
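A quick way to do that check on the node itself (the grep patterns cover the usual kernel messages, though exact wording varies by kernel version):

```shell
# Report genuine OOM-killer activity from the kernel ring buffer; the
# per-job _oom_event_monitor line alone doesn't imply any of this fired.
dmesg -T 2>/dev/null | grep -iE "out of memory|oom-killer" \
    || echo "no OOM events in ring buffer"
```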
Re: [slurm-users] Nodes do not return to service after scontrol reboot
On 17/6/20 11:32 pm, David Baker wrote: Thank you for your comments. The scontrol reboot command is now working as expected. Fantastic! For those who don't know, using scontrol reboot in this way also allows Slurm to take these rebooting nodes into account for scheduling; so if you have a large job needing a lot of nodes waiting to begin with high priority and you need to reboot some nodes then Slurm won't give up on them and put smaller jobs on the system on all the other nodes, delaying the larger job for no good reason. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Fw: slurm-users Digest, Vol 31, Issue 50
On Wednesday, 13 May 2020 6:15:53 PM PDT Abhinandan Patil wrote:

> However still:
> sinfo
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug*       up   infinite     1  down* abhi-Lenovo-ideapad-330-15IKB

What does "sinfo -R" say? If the node was down at some point you may need to resume it.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Job Step Resource Requests are Ignored
On Tuesday, 5 May 2020 11:00:27 PM PDT Maria Semple wrote:

> Is there no way to achieve what I want then? I'd like the first and last job
> steps to always be able to run, even if the second step needs too many
> resources (based on the cluster).

That should just work.

#!/bin/bash
#SBATCH -c 2
#SBATCH -n 1

srun -c 1 echo hello
srun -c 4 echo big wide
srun -c 1 echo world

gives:

hello
srun: Job step's --cpus-per-task value exceeds that of job (4 > 2). Job step may never run.
srun: error: Unable to create step for job 604659: More processors requested than permitted
world

> As a side note, do you know why it's not even possible to restrict the
> number of resources a single step uses (i.e. set less CPUs than are
> available to the full job)?

My suspicion is that you've not set up Slurm to use cgroups to restrict the resources a job can use to just those requested.

https://slurm.schedmd.com/cgroups.html

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
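For reference, constraining steps to their requested resources means enabling the cgroup plugins; a minimal sketch (parameter names are from the cgroup.conf docs, and the choice of constraints is illustrative):

```
# slurm.conf:
#   ProctrackType=proctrack/cgroup
#   TaskPlugin=task/cgroup
#
# cgroup.conf:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```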
Re: [slurm-users] "sacctmgr add cluster" crashing slurmdbd
On Tuesday, 5 May 2020 3:21:45 PM PDT Dustin Lang wrote: > Since this happens on a fresh new database, I just don't understand how I > can get back to a basic functional state. This is exceedingly frustrating. I have to say that if you're seeing this with 17.11, 18.08 and 19.05 and this only started when your colleague upgraded MySQL then this sounds like MySQL is triggering this problem. We're running with MariaDB 10.x (from SLES15) without issues (our database is huge). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Job Step Resource Requests are Ignored
On Tuesday, 5 May 2020 4:47:12 PM PDT Maria Semple wrote:

> I'd like to set different resource limits for different steps of my job. A
> sample script might look like this (e.g. job.sh):
>
> #!/bin/bash
> srun --cpus-per-task=1 --mem=1 echo "Starting..."
> srun --cpus-per-task=4 --mem=250 --exclusive
> srun --cpus-per-task=1 --mem=1 echo "Finished."
>
> Then I would run the script from the command line using the following
> command: sbatch --ntasks=1 job.sh.

You shouldn't ask for more resources with "srun" than have been allocated with "sbatch" - so if you want the job to be able to use up to 4 cores at once & that amount of memory you'll need to use:

sbatch -c 4 --mem=250 --ntasks=1 job.sh

I'd also suggest using suffixes for memory to disambiguate the values.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
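Putting that together, the corrected job.sh might look like this sketch (memory suffixes added as suggested; the middle step's command is a placeholder, since the original elided it):

```
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=250M
# Steps can use up to, but never more than, what sbatch allocated.
srun --cpus-per-task=1 --mem=1M echo "Starting..."
srun --cpus-per-task=4 --mem=250M ./heavy_step   # placeholder command
srun --cpus-per-task=1 --mem=1M echo "Finished."
```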
Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition
On Tuesday, 5 May 2020 3:48:22 PM PDT Sean Crosby wrote:

> sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

Also don't forget you need to tell Slurm to enforce QOS limits with:

AccountingStorageEnforce=safe,qos

in your Slurm configuration ("safe" is good to set, and turns on enforcement of other restrictions around associations too). See:

https://slurm.schedmd.com/resource_limits.html

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] IPv6 for slurmd and slurmctld
On Friday, 1 May 2020 8:31:47 AM PDT Thomas Schäfer wrote: > is there an switch, option, environment variable, configurable key word to > enable IP6 for the slurmd and slurmctld daemons? I don't believe those Slurm daemons support IPv6, my understanding is the only one that does is slurmrestd, see slide 22 of the presentation here: https://slurm.schedmd.com/SLUG19/REST_API.pdf All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Munge decode failing on new node
On Friday, 17 April 2020 2:22:00 PM PDT Dean Schulze wrote: > Both work. The only discrepancy is that the slurm controller output had > these two lines: > > UID: ??? (1000) > GID: ??? (1000) > > Like the controller doesn't know the username for UID 1000. What does this say on the controller and the compute node? getent passwd 1000 Are you using LDAP or the like to ensure that all nodes have the same user database? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Munge decode failing on new node
On 4/15/20 10:57 am, Dean Schulze wrote:

> error: Munge decode failed: Invalid credential
> ENCODED: Wed Dec 31 17:00:00 1969
> DECODED: Wed Dec 31 17:00:00 1969
> error: authentication: Invalid authentication credential

That's really interesting. I had one of these last week when on call; for us at least it seemed to be a hardware error, as when attempting to reboot it the node failed completely and would no longer boot. Worth checking whatever hardware logging capabilities your system has to see if MCEs are being reported.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Header lengths are longer than data received after changing SelectType & GresTypes to use MPS
On 8/4/20 7:20 am, Eric Berquist wrote: Once you’ve built SLURM, it’s enough to just have the GPU drivers on the nodes where SLURM will be installed. Yeah I checked that at the Slurm User Group - slurmd will try and dlopen() the required libraries and should gracefully deal with them not being present. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Running an MPI job across two partitions
On 23/3/20 8:32 am, CB wrote:

> I've looked at the heterogeneous job support but it creates two-separate jobs.

Yes, but the web page does say:

# By default, the applications launched by a single execution of
# the srun command (even for different components of the
# heterogeneous job) are combined into one MPI_COMM_WORLD with
# non-overlapping task IDs.

So it _should_ work. I know there are issues with Cray systems & hetjobs at the moment, but I suspect that's not likely to concern you.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] slurmd -C showing incorrect core count
On 10/3/20 1:40 pm, mike tie wrote:

> Here is the output of lstopo

Hmm, well I believe Slurm should be using hwloc (which provides lstopo) to get its information (at least it calls the xcpuinfo_hwloc_topo_get() function for that), so if lstopo works then slurmd should too.

Ah, looking a bit deeper I see in src/slurmd/common/xcpuinfo.c:

if (!hwloc_xml_whole)
        hwloc_xml_whole = xstrdup_printf("%s/hwloc_topo_whole.xml",
                                         conf->spooldir);

Do you happen to have a file called "hwloc_topo_whole.xml" in your spool directory on that node? I'm wondering if it's cached old config there. If so, move it out of the way somewhere safe (just in case) and try again.

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] slurmd -C showing incorrect core count
On 9/3/20 7:44 am, mike tie wrote: Specifically, how is slurmd -C getting that info? Maybe this is a kernel issue, but other than lscpu and /proc/cpuinfo, I don't know where to look. Maybe I should be looking at the slurmd source? It would be worth looking at what something like "lstopo" from the hwloc package says about your VM. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Question about determining pre-empted jobs
On 28/2/20 9:53 am, Jeffrey R. Lang wrote: We have had a request to generate a report showing the number of jobs by date showing pre-empted jobs. We used sacct to try to gather the data but we only found a few jobs with the state “PREEMPTED”. It might be that if jobs are being set to be requeued then you'll need to use the --duplicates option to sacct to see previous iterations of the job when it was preempted. Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
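A hedged sacct sketch for that kind of report (the date range and format fields are examples to adapt):

```
# Include requeued iterations (--duplicates) so preempted runs show up:
sacct --allusers --duplicates --state=PREEMPTED \
      --starttime=2020-02-01 --endtime=2020-02-29 \
      --format=JobID,User,Partition,State,Start,End
```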
Re: [slurm-users] Setup for backup slurmctld
On Wednesday, 26 February 2020 12:48:26 PM PST Joshua Baker-LePain wrote:

> We're planning the migration of our moderately sized cluster (~400 nodes,
> 40K jobs/day) from SGE to slurm. We'd very much like to have a backup
> slurmctld, and it'd be even better if our backup slurmctld could be in a
> separate data center from the primary (though they'd still be on the same
> private network). So, how are folks sharing the StateSaveLocation in such
> a setup? Any and all recommendations (including those with the 2
> slurmctld servers in the same rack) welcome. Thanks!

We use GPFS for our shared state directory (Cori is 12K nodes and we put 5K-30K jobs a day through it, very variable job mix). The important thing is the IOPS rate for the filesystem; if it can't keep up with Slurm then you're going to see performance issues.

Tim from SchedMD had some notes on HA (and other things) from the Slurm 2017 user group:

https://slurm.schedmd.com/SLUG17/FieldNotes.pdf

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries
On 21/2/20 9:02 am, Tina Friedrich wrote:

> In case that's of interest - this is actually SLURM 18.08.3 that I've now
> gotten to run (I haven't quite managed to upgrade to 19 yet). I've made
> minor modifications to the spec file - the unhardening of the flags and
> the the python dependency.

From what I can see there's a fix in for 20.02 (the same change you've made), but it's not (yet) backported to earlier releases.

commit d3b308aae6d63a9acecd50c0d63a5c8e3ff0086f
Author: Tim McMullan
Date:   Fri Feb 14 08:25:06 2020 -0500

    slurm.spec - disable "hardening" flags

    Disable the "hardening" flags - '-z,relro' or '-z,now' that RHEL8/Fedora
    inject by default which break Slurm's plugin stack.

    Bug 8499.

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Is it safe to convert cons_res to cons_tres on a running system?
On 20/2/20 2:16 pm, Nathan R Crawford wrote: I interpret this as, in general, changing SelectType will nuke existing jobs, but that since cons_tres uses the same state format as cons_res, it should work. We got caught with just this on our GPU nodes (though it was fixed before I got to see what was going on) - it seems that the format of the RPCs changes when you go from cons_res to cons_tres and we were having issues until we restarted slurmd on the compute nodes as well. My memory is that this was causing issues for starting new jobs (in a failing completely type of manner), I'm not sure what the consequences were for running jobs (though I suspect it would not have been great for them). If Doug sees this he may remember this (he caught and fixed it). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm Upgrade from 17.02
On 19/2/20 6:10 am, Ricardo Gregorio wrote: I am putting together an upgrade plan for slurm on our HPC. We are currently running old version 17.02.11. Would you guys advise us upgrading to 18.08 or 19.05? Slurm versions only support upgrading from 2 major versions back, so you could only upgrade from 17.02 to 17.11 or 18.08. I'd suggest going straight to 18.08. Remember you have to upgrade slurmdbd first, then upgrade slurmctld and then finally the slurmd's. Also, as Ole points out, 20.02 is due out soon at which point 18.08 gets retired from support, so you'd probably want to jump to 19.05 from 18.08. Don't forget to take backups first! We do a mysqldump of the whole accounting DB and rsync backups of our state directories before an upgrade. Best of luck! Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
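The backup step might look like this sketch (the database name and paths are assumptions - check StorageLoc in your slurmdbd.conf and StateSaveLocation in slurm.conf):

```
# Dump the accounting DB and snapshot the slurmctld state directory:
mysqldump --single-transaction slurm_acct_db > slurm_acct_db-$(date +%F).sql
rsync -a /var/spool/slurmctld/ /backup/slurmctld-$(date +%F)/
```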
Re: [slurm-users] Inconsistent cpu bindings with cpu-bind=none
On 17/2/20 12:48 am, Marcus Boden wrote: I am facing a bit of a weird issue with CPU bindings and mpirun: I think if you want Slurm to have any control over bindings you'll be wanting to use srun to launch your MPI program, not mpirun. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Cluster usage with Slurm
On 17/2/20 4:19 am, Parag Khuraswar wrote: Does Slurm provide cluster usage reports like mentioned below ? For the detailed info you're being asked for I'd probably suggest looking at the OpenXDMoD project. https://open.xdmod.org/ Its "shredder" data importer can import data from a bunch of different batch systems, including Slurm. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Longer queuing times for larger jobs
On 5/2/20 1:44 pm, Antony Cleave wrote: Hi, from what you are describing it sounds like jobs are backfilling in front and stopping the large jobs from starting We use a feature that SchedMD implemented for us called "bf_min_prio_reserve" which lets you set a priority threshold below which Slurm won't make a forward reservation for a job (and so can only start if it can start right now without delaying other jobs). https://slurm.schedmd.com/slurm.conf.html#OPT_bf_min_prio_reserve So if you can arrange your local priority system so that large jobs are over that threshold and smaller jobs are below it (or whatever suits your use case) then you should have a way to let these large jobs get a reliable start time without smaller jobs pushing them back in time. There's some useful background from the bug where this was implemented: https://bugs.schedmd.com/show_bug.cgi?id=2565 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] How should I configure a node with Autodetect=nvml?
On Tuesday, 11 February 2020 7:27:56 AM PST Dean Schulze wrote: > No other errors in the logs. Identical slurm.conf on all nodes and > controller. Only the node with gpus has the gres.conf (with the single > line Autodetect=nvml). It might be useful to post the output of "slurmd -C" and your slurm.conf for us to see (sorry if you've done that already and I've not seen it). You can also increase the debug level for slurmctld and slurm in slurm.conf (we typically run with SlurmctldDebug=debug, you may want to try SlurmdDebug=debug whilst experimenting). Best of luck, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] How should I configure a node with Autodetect=nvml?
On Monday, 10 February 2020 12:11:30 PM PST Dean Schulze wrote: > With this configuration I get this message every second in my slurmctld.log > file: > > error: _slurm_rpc_node_registration node=slurmnode1: Invalid argument What other errors are in the logs? Could you check that you've got identical slurm.conf and gres.conf files everywhere? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] sacct does always print all jobs regardless filter parameters with accounting_storage/filetxt
On 30/1/20 10:20 am, Dr. Thomas Orgis wrote: Matching for user (-u) and Job ID (-j) works, but not -N/-S/-E. So is this just the current state and it's up to me to provide a patch to enable it if I want that behaviour? You're using a very very very old version of Slurm there (15.08); you should upgrade to a recent one (I'd suggest 19.05.5) to check whether it's been fixed in the intervening years. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Question about slurm source code and libraries
On 25/1/20 8:08 am, dean.w.schu...@gmail.com wrote: I'm working on the 19.05.4 source code since it is stable, but I would prefer to use the same C REST library that will be used in 20.02. Does anyone know what C library that is? They're using OpenAPI (formerly Swagger) for this (see slide 5), and it seems that includes a code generator for various languages. https://swagger.io/tools/swagger-codegen/ Their source code is on Github here: https://github.com/swagger-api/swagger-codegen All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Multinode blast run
On 24/1/20 3:46 am, Mahmood Naderan wrote: Has anyone run blast on multiple nodes via slurm? I don't think blast is something that can run across nodes (or at least it didn't used to be). There is/was something called "mpiblast" that could do that. If you'll excuse the plug this sounds like a good question for the Beowulf list https://www.beowulf.org/ which is a more general purpose cluster computing list (disclaimer: I'm the caretaker of it these days). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Can't get node out of drain state
On 23/1/20 7:09 pm, Dean Schulze wrote: Pretty strange that having a Gres= property on a node that doesn't have a gpu would get it stuck in the drain state. Slurm verifies that nodes have the capabilities you say they have, so that if a node boots with less RAM than it should have, with a socket hidden, or with a GPU that has failed, you'll know about it and won't blindly send jobs to it only for them to fail because the node no longer meets their requirements. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Node can't run simple job when STATUS is up and STATE is idle
On 20/1/20 3:00 pm, Dean Schulze wrote: There's either a problem with the source code I cloned from github, or there is a problem when the controller runs on Ubuntu 19 and the node runs on CentOS 7.7. I'm downgrading to a stable 19.05 build to see if that solves the problem. I've run the master branch on a Cray XC without issues, and I concur with what the others have said and suggest it's worth checking the slurmd and slurmctld logs to find out why communication between them isn't working. Good luck, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Job completed but child process still running
On 1/13/20 5:55 am, Youssef Eldakar wrote: In an sbatch script, a user calls a shell script that starts a Java background process. The job completes immediately, but the child Java process is still running on the compute node. Is there a way to prevent this from happening? What I would recommend is to use Slurm's cgroups support so that processes that put themselves into the background this way are tracked as part of the job and cleaned up when the job exits. https://slurm.schedmd.com/cgroups.html Depending on how the Java process puts itself into the background you could try adding a "wait" command at the end of the shell script so that it doesn't exit immediately (it's not guaranteed though). With cgroups the batch script could also check the processes in your cgroup to monitor the existence of the Java process, sleeping for a while between checks, and exit when it's no longer found. For instance once you've got the PID of the Java process you can use "kill -0 $PID" to check if it's still there (rather than using ps). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
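A minimal sketch of that polling idea, with "sleep" standing in for the Java daemon (in the real sbatch script you'd capture the Java process's PID instead):

```shell
#!/bin/sh
# Stand-in for the Java background process
sleep 2 &
pid=$!

# kill -0 sends no signal; it only tests whether the PID still exists,
# so this loop blocks the job script until the daemon has gone away
while kill -0 "$pid" 2>/dev/null; do
    sleep 1
done
echo "background process $pid has exited"
```

With cgroups enabled the job's cgroup is then torn down when this script finishes, taking any stragglers with it.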
Re: [slurm-users] Submission without Scheduling
Hey Lev! :-) On Monday, 2 December 2019 2:29:06 PM PST Lev Lafayette wrote: > An idea that bouncing around our site at the moment is the possibility of > jobs being submitted without being scheduled, given that these are two > separate functions. Could you expand on that - do you mean some way to submit jobs whilst slurmctld is down, or just whilst nodes are down? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Maxjobs to accrue age priority points
On Friday, 13 December 2019 7:01:48 AM PST Christopher Benjamin Coffey wrote: > Maybe because that setting is just not included in the default list of > settings shown? That is counterintuitive to this in the man page for > sacctmgr: > > show [] > Display information about the specified entity. By default, > all entries are displayed, you can narrow results by specifying SPECS in > your query. Identical to the list command. > > Thoughts? Thanks! I _suspect_ what that's saying is that it has a default list that you can narrow, not that specifying something there will show it if it's not part of the default list. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] error: persistent connection experienced an error
On 13/12/19 12:19 pm, Christopher Benjamin Coffey wrote: error: persistent connection experienced an error Looking at the source code that comes from here:

if (ufds.revents & POLLERR) {
    error("persistent connection experienced an error");
    return false;
}

So your TCP/IP stack reported a problem with an existing connection. That's very odd if you're on the same box. If you are on a large system or putting a lot of small jobs through quickly then it's worth checking out the Slurm HTC guide for networking: https://slurm.schedmd.com/high_throughput.html Good luck. Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Multi-node job failure
On 11/12/19 8:05 am, Chris Woelkers - NOAA Federal wrote: Partial progress. The scientist that developed the model took a look at the output and found that instead of one model run being run in parallel, srun had run multiple instances of the model, one per thread, which for this test was 110 threads. This sounds like MVAPICH isn't built to support Slurm; per the Slurm MPI guide you need to build it with this to enable Slurm support (and of course add any other options you were using): ./configure --with-pmi=pmi2 --with-pm=slurm All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
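Spelled out as a rebuild sketch (add whatever configure options your original build used):

```shell
# rebuild MVAPICH2 against Slurm's PMI2 so srun can launch the ranks
./configure --with-pmi=pmi2 --with-pm=slurm
make -j
make install
```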
Re: [slurm-users] Is that possible to submit jobs to a Slurm cluster right from a developer's PC
On 12/12/19 7:38 am, Ryan Cox wrote: Be careful with this approach. You also need the same munge key installed everywhere. If the developers have root on their own system, they can submit jobs and run Slurm commands as any user. I would echo Ryan's caution on this and add that as root they will be able to run admin commands on the box too, create reservations, shut Slurm down, cancel other users jobs, etc. At the Slurm User Group this year Tim Wickberg foreshadowed (and demo'd with a very neat "pay-for-priority" box) a REST API planned for the Slurm 20.02 release. It has its own auth system separate to munge and would make this a lot safer. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Need help with controller issues
On 12/12/19 8:14 am, Dean Schulze wrote: configure:5021: gcc -o conftest -I/usr/include/mysql -g -O2 conftest.c -L/usr/lib/x86_64-linux-gnu -lmysqlclient -lpthread -lz -lm -lrt -latomic -lssl -lcrypto -ldl >&5 /usr/bin/ld: cannot find -lssl /usr/bin/ld: cannot find -lcrypto collect2: error: ld returned 1 exit status That looks like your failure, you're missing the package that provides those libraries it's trying to use - in this case for Debian/Ubuntu I suspect it's libssl-dev. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
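On Debian/Ubuntu the usual fix would be (assuming libssl-dev is indeed the missing package):

```shell
# provides the libssl/libcrypto development files the linker wants
sudo apt-get install libssl-dev
```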
Re: [slurm-users] Maxjobs to accrue age priority points
Hi Chris, On 12/12/19 3:16 pm, Christopher Benjamin Coffey wrote: What am I missing? It's just a setting on the QOS, not the user:

csamuel@cori01:~> sacctmgr show qos where name=regular_1 format=MaxJobsAccruePerUser
MaxJobsAccruePU
---------------
              2

So any user in that QOS can only have 2 jobs ageing at any one time. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
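For completeness, setting that limit on a QOS would look something like this (the QOS name is illustrative):

```shell
# cap the number of jobs per user that can accrue age priority in this QOS
sacctmgr modify qos regular_1 set MaxJobsAccruePerUser=2
```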
Re: [slurm-users] Need help with controller issues
On 11/12/19 11:31 am, Eli V wrote: Look for libmariadb-client. That's needed for slurmdbd on debian. Looking at the output from building some Slurm 19.05.4 RPMs earlier tonight, this is what I see in the output of configure: [...] checking for mysql_config... /usr/bin/mysql_config MySQL 10.4.3 test program built properly. [...] You should look at the config.log for the gory details of what it's trying to discover and what it found (or didn't). All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Need help with controller issues
On Tuesday, 10 December 2019 1:57:59 PM PST Dean Schulze wrote: > This bug report from a couple of years ago indicates a source code issue: > > https://bugs.schedmd.com/show_bug.cgi?id=3278 > > This must have been fixed by now, though. > > I built using slurm-19.05.2. Does anyone know if this has been fixed in > 19.05.4? I don't think this is a Slurm issue - have you checked that you have the MariaDB development package for your distro installed before trying to build Slurm? It will skip things it doesn't find and that could explain what you're seeing. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Multi-node job failure
Hi Chris, On Tuesday, 10 December 2019 11:49:44 AM PST Chris Woelkers - NOAA Federal wrote: > Test jobs, submitted via sbatch, are able to run on one node with no problem > but will not run on multiple nodes. The jobs are using mpirun and mvapich2 > is installed. Is there a reason why you aren't using srun for launching these? https://slurm.schedmd.com/mpi_guide.html If you're using mpirun (unless you've built mvapich2 with Slurm support) then you'll be relying on ssh to launch tasks, and that could be what's broken for you. Running with srun will avoid that and allow Slurm to track your processes correctly. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
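A minimal batch script using srun might look like this (node/task counts and the binary name are illustrative, and --mpi=pmi2 assumes MVAPICH2 was built with PMI2 support):

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=28
#SBATCH --time=01:00:00

# let Slurm launch and track the MPI ranks directly, no mpirun/ssh
srun --mpi=pmi2 ./model
```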
Re: [slurm-users] Slurm configuration, Weight Parameter
On 23/11/19 9:14 am, Chris Samuel wrote: My gut instinct (and I've never tried this) is to make the 3GB nodes be in a separate partition that is guarded by AllowQos=3GB and have a QOS called "3GB" that uses MinTRESPerJob to require jobs to ask for more than 2GB of RAM to be allowed into the QOS. Of course there's nothing to stop a user requesting more memory than they need to get access to these nodes, but that's a social issue not a technical one. :-) -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] Slurm configuration, Weight Parameter
On 21/11/19 7:25 am, Sistemas NLHPC wrote: Currently we have two types of nodes, one with 3GB and another with 2GB of RAM; on the 3GB nodes we need to disallow jobs requesting less than 2GB, to avoid underutilization of resources. My gut instinct (and I've never tried this) is to make the 3GB nodes be in a separate partition that is guarded by AllowQos=3GB and have a QOS called "3GB" that uses MinTRESPerJob to require jobs to ask for more than 2GB of RAM to be allowed into the QOS. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
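As an untested sketch of that idea (names and node list illustrative; memory TRES values are in MB, so 2049 means "more than 2GB"):

```shell
# create a QOS that demands jobs request more than 2GB of memory
sacctmgr add qos 3GB
sacctmgr modify qos 3GB set MinTRESPerJob=mem=2049

# then in slurm.conf, guard the 3GB nodes behind that QOS:
#   PartitionName=bigmem Nodes=node[01-04] AllowQos=3GB
```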
Re: [slurm-users] Force a use job to a node with state=drain/maint
On 23/11/19 8:54 am, René Neumaier wrote: In general, is it possible to move a pending job (means forcing as root) to a specific node which is marked as DRAIN for troubleshooting? I don't believe so. Instead: put a reservation on the node just for this user, add the reservation to the pending job, then resume the node. All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
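Roughly, those steps would look like this (reservation name, user, node and job ID are all illustrative):

```shell
# reserve the drained node for just that user
scontrol create reservation reservationname=debug users=alice \
    nodes=node042 starttime=now duration=infinite

# attach the pending job to the reservation, then resume the node
scontrol update jobid=12345 reservationname=debug
scontrol update nodename=node042 state=resume
```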