Re: [slurm-users] How to checkout a slurm node?
Joe Teumer wrote:
> However, if the user needs to reboot the node, set BIOS settings, etc. then
> `salloc` automatically terminates the allocation when the new shell is closed.

What kind of BIOS settings would a user need to change?
Re: [slurm-users] How to checkout a slurm node?
Hey Joe,

Have you considered using a reservation? An operator can reserve a (set of) nodes for a given time, and as a user, you would simply submit your jobs within this reservation.

Depending on your system configuration, a node might be marked as down if you reboot it, and an operator would have to bring it back into Slurm.

FWIW, the KNL Slurm plugin had some features that might be interesting for you: an end user submits a job with a required cluster and/or memory mode, the node config is automatically updated and the node rebooted if needed, and then the job starts. No operator intervention is required in the process.

Cheers,

Gilles
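For illustration, a rough sketch of that reservation workflow; the reservation name, user, node, and job script below are all hypothetical:

    # operator side: reserve a node for three hours for one user
    scontrol create reservation reservationname=joe_maint user=joe \
        starttime=now duration=03:00:00 nodes=some_node flags=maint

    # user side: run work inside the reservation
    sbatch --reservation=joe_maint job.sh
    salloc --reservation=joe_maint --time=3:00:00

(flags=maint is optional here; it marks the window as maintenance time and lets the reservation overlap existing ones.)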
Re: [slurm-users] How to checkout a slurm node?
I don't think Slurm does what you think it does. It manages the resources and the schedule, not the actual hardware of a node.

You are likely looking for something more along the lines of a hypervisor (if you are doing VMs) or a remote KVM (since you are mentioning BIOS access).

Brian Andrus
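Since that puts BIOS access outside of Slurm entirely, the usual out-of-band route is the node's BMC, for example via IPMI serial-over-LAN; a sketch, with the BMC hostname and credentials as placeholders:

    # open a serial-over-LAN console to reach the BIOS (placeholders throughout)
    ipmitool -I lanplus -H node-bmc.example.com -U admin -P secret sol activate

    # power-cycle the node out-of-band
    ipmitool -I lanplus -H node-bmc.example.com -U admin -P secret chassis power cycle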
[slurm-users] Slurm BoF and booth at SC21
The Slurm Birds-of-a-Feather session will be held virtually on Thursday from 12:15 - 1:15pm (Central). This is conducted through the SC21 HUBB platform, and you will need to have registered in some capacity through the conference to be able to participate live.

We'll be reviewing the Slurm 21.08 release, as well as taking a look at the roadmap for Slurm 22.05 and beyond. The remainder of the time will be reserved for live Q+A, as we've traditionally done.

One note: SC21 has told us that they will not be recording any of the BoFs this year; they will only be available live through their platform. However, SchedMD will be posting a recording of the Slurm BoF on our YouTube channel at a later point to ensure the broader community has access to it.

In addition to the BoF, there will be presentations in the Slurm booth - #1807 - over the course of the week. The tentative schedule is:

Tuesday:
  11am - Introduction to Slurm
  1pm - REST API
  3pm - Google Cloud
  5pm - Introduction to Slurm

Wednesday:
  11am - Slurm in the Clouds
  1pm - Introduction to Slurm
  3pm - REST API
  5pm - Introduction to Slurm

Thursday:
  11am - Introduction to Slurm
  1pm - Introduction to Slurm

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support
[slurm-users] How to checkout a slurm node?
Hello!

How best for a user to check out a Slurm node? Unfortunately, the `salloc` command doesn't appear to meet this need:

    salloc --nodelist some_node --time 3:00:00

This gives the user a new shell, and the user can use `srun` to start an interactive session. However, if the user needs to reboot the node, set BIOS settings, etc., then `salloc` automatically terminates the allocation when the new shell is closed:

    salloc: Relinquishing job allocation 82
    salloc: Job allocation 82 has been revoked.

Ideally, if a user requests a node for a few hours, then they can do all of their work in the allotted time (srun sessions, reboots, BIOS settings, etc.) using a single job allocation.

Also, how can I reply to posts and replies on https://groups.google.com/g/slurm-users/? The 'Reply all' and 'Reply to author' buttons on the site are greyed out.

Much appreciated!
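One piece of this is addressable within Slurm itself: salloc's --no-shell option creates the allocation without tying it to a shell's lifetime, and sessions can be attached to it later. A sketch, with the node name and job id purely illustrative; note that a reboot may still drain the node depending on settings such as ReturnToService, so this only solves the shell-lifetime part of the problem:

    # create an allocation that is not tied to a shell
    salloc --no-shell --nodelist=some_node --time=3:00:00
    # salloc prints the granted job id, e.g. "Granted job allocation 82"

    # attach an interactive session to the existing allocation
    srun --jobid=82 --pty bash

    # release the allocation when finished
    scancel 82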
Re: [slurm-users] enable_configless, srun and DNS vs. hosts file
Hi:

We run configless. If we add a node to slurm.conf and don't restart slurmd on our submit nodes, then attempts to submit to that new node will get the error you saw. Restarting slurmd on the submit node fixes it. This is the documented behavior (adding nodes needs slurmd restarted everywhere).

Could this be what you're seeing (as opposed to /etc/hosts vs. DNS)?

--
Wishing that I'd just listened this time,
Paul Brunk, system administrator, Workstation Support Group
GACRC (formerly RCC) UGA EITS (formerly UCNS)

-----Original Message-----
From: slurm-users On Behalf Of Mark Dixon
Sent: Wednesday, November 10, 2021 10:14
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] enable_configless, srun and DNS vs. hosts file

Hi,

I'm using the "enable_configless" mode to avoid the need for a shared slurm.conf file, and am having similar trouble to others when running `srun`, e.g.:

    srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
    srun: error: Task launch for StepId=113.0 failed on node cn120: Can't find an address, check slurm.conf
    srun: error: Application launch failed: Can't find an address, check slurm.conf
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

I understand that the accepted solution is to add the node names to DNS. Is that really correct? I ask because it would be a great help if Slurm instead used the more usual mechanism and consulted the sources listed in /etc/nsswitch.conf. We use a large /etc/hosts file instead of DNS for our cluster and would rather not start running named if we can help it.

Thanks,

Mark

PS Adding a line like "NodeName=cn[001-999]" to the submit/compute host slurm.conf file makes this go away (I hope skipping the node detail, or adding nodes that don't exist [yet] won't cause other problems).
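For reference, a minimal sketch of the configuration pieces under discussion; the controller hostname is a placeholder, and the NodeName stub is Mark's workaround rather than a documented fix:

    # slurm.conf on the controller: enable configless operation
    SlurmctldParameters=enable_configless

    # each slurmd fetches its config from the controller, e.g.
    #   slurmd --conf-server ctld-host
    # (or via a _slurmctld._tcp DNS SRV record)

    # after adding nodes to slurm.conf, restart slurmd everywhere,
    # including submit hosts:
    #   systemctl restart slurmd

    # Mark's PS workaround: a skeleton NodeName line in a local
    # slurm.conf on submit/compute hosts
    NodeName=cn[001-999]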