Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Christopher Samuel

On 10/24/23 12:39, Tim Schneider wrote:

> Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
> <node>", the node goes into the "mix@" state (not drain), but no new jobs get
> scheduled until the node reboots. Essentially I get draining behavior,
> even though the node's state is not "drain". Note that this behavior is
> caused by "nextstate=RESUME"; if I leave that out, jobs get scheduled
> as expected. Does anyone have an idea why that could be?


The intent of the "ASAP" flag for "scontrol reboot" is to not let any 
more jobs onto a node until it has rebooted.
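
For example (the node name below is just a placeholder), the effect shows up in the node state straight away:

# Ask Slurm to reboot the node as soon as its running jobs have finished
scontrol reboot ASAP node001

# Until the reboot, the node reports a drain-style state and a reboot reason
sinfo -n node001 -o "%N %T"
scontrol show node node001 | grep -iE 'State=|Reason='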


IIRC that was from work we sponsored; the idea being that (given how our 
nodes are managed) we would build new images with the latest software 
stack, test them on a separate test system and then, once happy, bring 
them over to the production system and do an "scontrol reboot ASAP 
nextstate=resume reason=... $NODES" to ensure that from that point 
onwards no new jobs would start in the old software configuration, only 
the new one.


Also slurmctld would know that these nodes are due to come back in 
"ResumeTimeout" seconds after the reboot is issued and so could plan for 
them as part of scheduling large jobs, rather than thinking there was no 
way it could do so and letting lots of smaller jobs get in the way.
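
For reference, the value in play here can be checked on a running cluster with:

# Show how long slurmctld waits for a rebooted node to come back into service
scontrol show config | grep -i ResumeTimeout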


Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Tim Schneider

Hi,

From my understanding, if I run "scontrol reboot <node>", the node 
should continue to operate as usual and reboot once it is idle. When 
adding the ASAP flag (scontrol reboot ASAP <node>), the node should go 
into the drain state and not accept any more jobs.


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME 
<node>", the node goes into the "mix@" state (not drain), but no new jobs get 
scheduled until the node reboots. Essentially I get draining behavior, 
even though the node's state is not "drain". Note that this behavior is 
caused by "nextstate=RESUME"; if I leave that out, jobs get scheduled 
as expected. Does anyone have an idea why that could be?


I am running slurm 22.05.9.

Steps to reproduce:

# To prevent node from rebooting immediately (a stand-in for long_running_script.sh is sketched below)
sbatch -t 1:00:00 -c 1 --mem-per-cpu 1G -w <node> ./long_running_script.sh

# Request reboot
scontrol reboot nextstate=RESUME <node>

# Run interactive command, which does not start until "scontrol
# cancel_reboot <node>" is executed in another shell

srun -t 1:00:00 -c 1 --mem-per-cpu 1G -w <node> --pty bash
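
(long_running_script.sh is not included here; any script that keeps the node busy works, e.g. a minimal stand-in like:)

#!/bin/bash
# Stand-in for long_running_script.sh: just keep the node busy for an hour
sleep 3600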


Thanks a lot in advance!

Best,

Tim




Re: [slurm-users] Site factor plugin example?

2023-10-24 Thread Loris Bennett
Loris Bennett  writes:

> Christopher Samuel  writes:
>
>> On 10/13/23 10:10, Angel de Vicente wrote:
>>
>>> But, in any case, I would still be interested in a site factor
>>> plugin example, because I might revisit this in the future.
>>
>> I don't know if you saw, but there is a skeleton example in the Slurm
>> sources:
>>
>> src/plugins/site_factor/none
>>
>> Not sure if that helps?
>
> Thanks for the pointer, Chris.  I couldn't find the folder 'none' on
> Github at first, because it doesn't seem to be on the 'master' branch,
> but once I switched branches, I found it.
>
> I'll have a go at creating a memory-wasted factor.

OK, the structure of the plugin itself given here

  
https://github.com/SchedMD/slurm/blob/slurm-23.02/src/plugins/site_factor/none/site_factor_none.c

seems relatively straightforward.  The crux seems to be what one would
need in

  _update(void *x, void *ignored)

to get the data regarding the requested and used memory for each user
for a given period.
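
For reference, since that skeleton only exists on the release branches (not on master), it can be fetched with something like:

# Check out the 23.02 release branch and look at the skeleton plugin
git clone -b slurm-23.02 --depth 1 https://github.com/SchedMD/slurm.git
ls slurm/src/plugins/site_factor/none/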

From my rather limited understanding, the multifactor plugin just has
access to information such as 'effective usage' and 'normalized shares'
for the association.  Thus it is not possible to directly access the
amount of unused memory.  Therefore it seems like I would have to rely on
generating my own metric for memory-wasting and then reading that from
the plugin.
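
As a sketch of that last step (field formats such as ReqMem vary between Slurm versions, so this is only illustrative), the raw numbers could be pulled from the accounting database and aggregated per user outside of slurmctld:

# Requested vs. peak used memory per job for the last 7 days
# (step records are kept because MaxRSS is recorded per step)
sacct -a -S $(date -d '7 days ago' +%F) -E now \
  --units=M --parsable2 --noheader \
  --format=User,JobID,ReqMem,MaxRSS,State
# Aggregate per user (awk, Python, ...) and write the result somewhere the
# site_factor plugin can read it when its update hook runs.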

Or does anyone see an alternative?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin