Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread Gerhard Strangar
Joe Teumer wrote:

> However, if the user needs to reboot the node, set BIOS settings, etc., 
> then `salloc` automatically terminates the allocation when the new 
> shell is closed.

What kind of BIOS settings would a user need to change?



Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread gilles
Hey Joe,



Have you considered using a reservation?

An operator can reserve a (set of) nodes for a given time, and as a 
user, you would simply submit your jobs within this reservation.
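
For example (the names here are made up, and the exact scontrol options 
depend on your Slurm version and site policy), the workflow could be:

    # operator creates the reservation:
    scontrol create reservation ReservationName=joe_maint \
        Users=joe Nodes=some_node StartTime=now \
        Duration=03:00:00 Flags=MAINT

    # the user then runs work inside it:
    salloc --reservation=joe_maint
    sbatch --reservation=joe_maint job.sh

The MAINT flag is a common choice for maintenance-style work, but 
whether it is appropriate depends on how your site uses reservations.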

Depending on your system configuration, a node might be marked as down 
if you reboot it, and an operator would have to bring it back into Slurm.



FWIW, the KNL slurm plugin had some features that might be interesting 
for you: an end user submits a job with a required cluster and/or 
memory mode, the node config is automatically updated and the node 
rebooted if needed, and then the job starts. No operator intervention 
is required in the process.
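
As a sketch of how that looked (the constraint names follow the KNL 
plugin's conventions; check the knl.conf man page for what your system 
actually supports):

    # request a KNL node in quad NUMA / cache MCDRAM mode; slurm 
    # reconfigures and reboots the node first if needed:
    sbatch --constraint="quad&cache" job.sh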





Cheers,



Gilles 

- Original Message -

Hello!

How best for a user to check out a slurm node?

Unfortunately, the 'salloc' command doesn't appear to meet this need.

Command `salloc --nodelist some_node --time 3:00:00`
This gives the user a new shell, from which they can use `srun` to 
start an interactive session.

However, if the user needs to reboot the node, set BIOS settings, etc., 
then `salloc` automatically terminates the allocation when the new 
shell is closed.

salloc: Relinquishing job allocation 82
salloc: Job allocation 82 has been revoked.

Ideally, if a user requests a node for a few hours then they can do all 
of their work in the allotted time (srun sessions, reboots, BIOS 
settings, etc) using a single job allocation.

Also, how can I reply to posts and replies on 
https://groups.google.com/g/slurm-users/?
The 'Reply all' and 'Reply to author' buttons on the site are greyed out.

Much appreciated!


 


Re: [slurm-users] How to checkout a slurm node?

2021-11-12 Thread Brian Andrus

I don't think Slurm does what you think it does.

It manages the resources and schedule, not the actual hardware of a node.

You are likely looking for something more along the lines of a 
hypervisor (if you are doing VMs) or a remote KVM (since you mention 
BIOS access).
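
For example, out-of-band control typically goes through the node's BMC 
rather than through Slurm at all; a hypothetical ipmitool session 
(hostname and credentials invented here) might look like:

    # power-cycle a node via its BMC:
    ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power cycle

    # open a serial-over-LAN console to watch POST/BIOS:
    ipmitool -I lanplus -H node01-bmc -U admin -P secret sol activate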


Brian Andrus

On 11/12/2021 2:00 PM, Joe Teumer wrote:

Hello!

How best for a user to check out a slurm node?

Unfortunately, the 'salloc' command doesn't appear to meet this need.

Command `salloc --nodelist some_node --time 3:00:00`
This gives the user a new shell, from which they can use `srun` to 
start an interactive session.


However, if the user needs to reboot the node, set BIOS settings, etc., 
then `salloc` automatically terminates the allocation when the new 
shell is closed.


salloc: Relinquishing job allocation 82
salloc: Job allocation 82 has been revoked.

Ideally, if a user requests a node for a few hours then they can do 
all of their work in the allotted time (srun sessions, reboots, BIOS 
settings, etc) using a single job allocation.


Also, how can I reply to posts and replies on 
https://groups.google.com/g/slurm-users/?

The 'Reply all' and 'Reply to author' buttons on the site are greyed out.

Much appreciated!






[slurm-users] Slurm BoF and booth at SC21

2021-11-12 Thread Tim Wickberg
The Slurm Birds-of-a-Feather session will be held virtually on Thursday, 
November 18, from 12:15 - 1:15pm (Central). This is conducted through 
the SC21 HUBB platform, and you will need to have registered in some 
capacity through the conference to be able to participate live.


We'll be reviewing the Slurm 21.08 release, as well as taking a look at 
the roadmap for Slurm 22.05 and beyond. The remainder of the time will 
be reserved for live Q+A, as we've traditionally done.


One note: SC21 has told us that they will not be recording any of the 
BoFs this year, and they will only be available live through their 
platform. However, SchedMD will be posting a recording of the Slurm BoF 
on our YouTube channel at a later point to ensure the broader community 
has access to it.


In addition to the BoF, there will be presentations in the Slurm booth - 
#1807 - over the course of the week. The tentative schedule is:


Tuesday:
11am - Introduction to Slurm
1pm - REST API
3pm - Google Cloud
5pm - Introduction to Slurm

Wednesday:
11am - Slurm in the Clouds
1pm - Introduction to Slurm
3pm - REST API
5pm - Introduction to Slurm

Thursday:
11am - Introduction to Slurm
1pm - Introduction to Slurm

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support



[slurm-users] How to checkout a slurm node?

2021-11-12 Thread Joe Teumer
Hello!

How best for a user to check out a slurm node?

Unfortunately, the 'salloc' command doesn't appear to meet this need.

Command `salloc --nodelist some_node --time 3:00:00`
This gives the user a new shell, from which they can use `srun` to 
start an interactive session.

However, if the user needs to reboot the node, set BIOS settings, etc., 
then `salloc` automatically terminates the allocation when the new 
shell is closed.

salloc: Relinquishing job allocation 82
salloc: Job allocation 82 has been revoked.

Ideally, if a user requests a node for a few hours then they can do all of
their work in the allotted time (srun sessions, reboots, BIOS settings,
etc) using a single job allocation.

Also, how can I reply to posts and replies on
https://groups.google.com/g/slurm-users/?
The 'Reply all' and 'Reply to author' buttons on the site are greyed out.

Much appreciated!


Re: [slurm-users] enable_configless, srun and DNS vs. hosts file

2021-11-12 Thread Paul Brunk
Hi:

We run configless.  If we add a node to slurm.conf and don't restart slurmd on 
our submit nodes, then attempts to submit to that new node will get the error 
you saw.  Restarting slurmd on the submit node fixes it.  This is the 
documented behavior (adding nodes requires restarting slurmd everywhere).  
Could this be what you're seeing (as opposed to /etc/hosts vs. DNS)?
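
As a sketch, assuming systemd-managed services, the sequence after 
adding a node looks like this:

    # on the controller, after editing slurm.conf:
    systemctl restart slurmctld

    # on every host running slurmd (compute nodes and configless 
    # submit nodes):
    systemctl restart slurmd

Note that "scontrol reconfigure" has historically not been enough when 
nodes are added, which is why the full restarts are needed; check the 
release notes for your version, as newer releases have relaxed this.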

-- 
Wishing that I'd just listened this time,
Paul Brunk, system administrator, Workstation Support Group
GACRC (formerly RCC) 
UGA EITS  (formerly UCNS)


-Original Message-
From: slurm-users  On Behalf Of Mark 
Dixon
Sent: Wednesday, November 10, 2021 10:14
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] enable_configless, srun and DNS vs. hosts file

[EXTERNAL SENDER - PROCEED CAUTIOUSLY]


Hi,

I'm using the "enable_configless" mode to avoid the need for a shared 
slurm.conf file, and am having similar trouble to others when running "srun", 
e.g.

   srun: error: fwd_tree_thread: can't find address for host cn120, check slurm.conf
   srun: error: Task launch for StepId=113.0 failed on node cn120: Can't find an address, check slurm.conf
   srun: error: Application launch failed: Can't find an address, check slurm.conf
   srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

I understand that the accepted solution is to add the nodenames to DNS. Is that 
really correct?

I ask because it would be a great help if slurm instead used the more usual 
mechanism and consulted the sources listed in /etc/nsswitch.conf. We use a 
large /etc/hosts file instead of DNS for our cluster and would rather not 
start running named if we can help it.

Thanks,

Mark

PS Adding a line like "NodeName=cn[001-999]" to the submit/compute host 
slurm.conf file makes this go away (I hope that skipping the node details, 
or adding nodes that don't exist [yet], won't cause other problems).
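
A minimal sketch of the line in question, as added to the existing 
slurm.conf on the submit/compute hosts (the range and naming are of 
course site-specific):

    # bare node names with no hardware details, so that srun's tree 
    # forwarding can resolve host addresses via /etc/hosts:
    NodeName=cn[001-999]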