Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Groner, Rob
Well, again, I don't want to tweak things just to make the test finish 
sooner.  I DO have to keep the scheduler and backfill settings in mind, 
though.  For instance, I believe the default intervals are 60 seconds for the 
main scheduler (sched_interval) and 30 seconds for backfill (bf_interval).  So, 
before I check the Scheduler value for the high priority job via scontrol, I 
wait a little over 90 seconds.  In a perfect world, that SHOULD give both the 
main scheduler and the backfill scheduler time to get to it.  I THINK, however, 
that on a sufficiently busy system there's no guarantee, even after that amount 
of time, that the new high priority job has been evaluated.

I'll take a look at sdiag and see if it can tell me where the job is at, thanks 
for the suggestion.
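
As a rough illustration of the sdiag idea: one way to tell whether the 
schedulers have run since the high priority job was submitted might be to 
compare the "Total cycles" counters from two sdiag snapshots.  The Python 
sketch below only parses text.  The sample output is hand-written and 
abbreviated, the section headings and counter label are assumed to match what 
sdiag prints on your Slurm version, and the function names are made up for this 
example.  A completed cycle also does not prove that a particular job was 
reached, so this is at best a necessary check, not a sufficient one.

```python
import re

# Hand-written, abbreviated stand-ins for `sdiag` output (assumed layout).
SAMPLE_BEFORE = """
Main schedule statistics (microseconds):
    Last cycle:   1200
    Total cycles: 40
Backfilling stats
    Total backfilled jobs (since last slurm start): 7
    Total cycles: 12
"""

SAMPLE_AFTER = """
Main schedule statistics (microseconds):
    Last cycle:   1500
    Total cycles: 43
Backfilling stats
    Total backfilled jobs (since last slurm start): 7
    Total cycles: 13
"""


def cycle_counts(sdiag_text):
    """Return {'main': n, 'backfill': m} taken from the 'Total cycles' lines."""
    counts = {}
    section = None
    for line in sdiag_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Main schedule statistics"):
            section = "main"
        elif stripped.startswith("Backfilling stats"):
            section = "backfill"
        match = re.match(r"Total cycles:\s*(\d+)", stripped)
        # Keep only the first counter seen in each section.
        if match and section and section not in counts:
            counts[section] = int(match.group(1))
    return counts


def both_schedulers_ran(before, after, min_cycles=1):
    """True if main and backfill each completed at least min_cycles cycles."""
    return all(after[name] - before[name] >= min_cycles
               for name in ("main", "backfill"))


before = cycle_counts(SAMPLE_BEFORE)
after = cycle_counts(SAMPLE_AFTER)
print(before, after, both_schedulers_ran(before, after))
```

In a real test, the first snapshot would be taken right after submitting the 
high priority job and the second just before checking its state.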

Rob



Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Ryan Novosielski
You can get some information on that from sdiag, and there are tweaks you can 
make to backfill scheduling that affect how quickly it will get to a job.

That doesn’t really answer your real question, but might help you when you are 
looking into this.



Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Groner, Rob
I'm not looking for a one-time answer.  We run these tests anytime we change 
anything related to Slurm (version, configuration, etc.).  We certainly run 
the test after the system comes back up after an outage, and an hour would be a 
long time to wait for that.  That's certainly the brute-force approach, but I'm 
hoping there's a definitive way to show, through scontrol job output, that the 
job won't preempt.

It's true that I could set PreemptExemptTime to a smaller value, say 5 minutes 
instead of 1 hour, but there are a few issues with that.


  1.  I would then no longer be testing the system as it actually is.  I want 
to test the system in its actual production configuration.
  2.  If I did lower its value, what would be a safe value?  5 minutes?  Does 
running for 5 minutes guarantee that the higher priority job had a chance to 
preempt it but didn't?  Or did the scheduler never even get to it?  On a test 
cluster with few jobs, you could be reasonably sure it did, but when running 
tests on the production cluster, isn't it possible the scheduler hasn't yet 
had a chance to process it, even after 5 minutes?  That depends on the Slurm 
scheduler settings, I suppose.

rob


From: slurm-users  on behalf of 
Bernstein, Noam CIV USN NRL (6393) Washington DC (USA) 

Sent: Friday, September 29, 2023 3:14 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Verifying preemption WON'T happen

You don't often get email from noam.bernst...@nrl.navy.mil. Learn why this is 
important<https://aka.ms/LearnAboutSenderIdentification>
On Sep 29, 2023, at 2:51 PM, Davide DelVento 
mailto:davide.quan...@gmail.com>> wrote:

I don't really have an answer for you other than a "hallway comment", that it 
sounds like a good thing which I would test with a simulator, if I had one. 
I've been intrigued by (but really not looked much into) 
https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf

On Fri, Sep 29, 2023 at 10:05 AM Groner, Rob 
mailto:rug...@psu.edu>> wrote:

I could obviously let the test run for an hour to verify the lower priority job 
was never preempted...but that's not really feasible.

Why not? Isn't it going to take longer than an hour to wait for responses to 
this post? Also, you could set up the minimum time to a much smaller value, so 
it won't take as long to test.


Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Sep 29, 2023, at 2:51 PM, Davide DelVento <davide.quan...@gmail.com> wrote:

> I don't really have an answer for you other than a "hallway comment", that it
> sounds like a good thing which I would test with a simulator, if I had one.
> I've been intrigued by (but really not looked much into)
> https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf
>
> On Fri, Sep 29, 2023 at 10:05 AM Groner, Rob <rug...@psu.edu> wrote:
>
>> I could obviously let the test run for an hour to verify the lower priority
>> job was never preempted...but that's not really feasible.

Why not? Isn't it going to take longer than an hour to wait for responses to 
this post? Also, you could set up the minimum time to a much smaller value, so 
it won't take as long to test.


Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Davide DelVento
I don't really have an answer for you other than a "hallway comment", that
it sounds like a good thing which I would test with a simulator, if I had
one. I've been intrigued by (but really not looked much into)
https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf



[slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Groner, Rob
On our system, for some partitions, we guarantee that a job can run at least an 
hour before being preempted by a higher priority job.  We use the QOS preempt 
exempt time for this, and it appears to be working.  But of course, I want to 
TEST that it works.

So on a test system, I start a lower priority job on a specific node, wait 
until it starts running, and then I start a higher priority job for the same 
node.  The test should only pass if the higher priority job has an OPPORTUNITY 
to preempt the lower priority job, and doesn't.

Now, I know I can get a preempt eligible time out of scontrol for the lower 
priority job and verify that it's set for an hour (I do check that already), 
but that's not good enough for me.  I could obviously let the test run for an 
hour to verify the lower priority job was never preempted...but that's not 
really feasible.  So instead, I want to verify that the higher priority job has 
had a chance to preempt the lower priority job, and it did not.
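
For reference, the "preempt eligible time" check mentioned above could look 
something like the Python sketch below.  It assumes the scontrol job output 
carries StartTime and PreemptEligibleTime fields in ISO format and that the 
difference between them equals the configured exempt time.  The sample text is 
hand-written, and the helper names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Hand-written, abbreviated stand-in for `scontrol show job` output.
SAMPLE_LOW_JOB = """
JobId=101 JobName=low_prio
   JobState=RUNNING Reason=None
   StartTime=2023-09-29T10:05:00 EndTime=2023-09-30T10:05:00
   PreemptEligibleTime=2023-09-29T11:05:00 PreemptTime=None
"""


def job_fields(scontrol_text):
    """Split the whitespace-separated Key=Value tokens into a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", scontrol_text))


def exempt_window(scontrol_text):
    """Time between the job start and the moment it becomes preemptable."""
    fields = job_fields(scontrol_text)
    start = datetime.fromisoformat(fields["StartTime"])
    eligible = datetime.fromisoformat(fields["PreemptEligibleTime"])
    return eligible - start


window = exempt_window(SAMPLE_LOW_JOB)
print(window, window >= timedelta(hours=1))
```

This only confirms what the controller has recorded for the job.  As noted, it 
does not show that a higher priority job actually declined to preempt.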

So far, the way I've been doing that is to check the reported Scheduler in the 
scontrol job output for the higher priority job.  I figure that when the 
scheduler changes to Backfill instead of Main, then the higher priority job has 
been seen by the main scheduler and it passed on the chance to preempt the 
lower priority job.
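
A minimal sketch of that check in Python, assuming the job output contains 
JobState, Scheduler and LastSchedEval fields.  Treating a LastSchedEval later 
than the submit time as "the scheduler has looked at this job" is my reading 
of the field name, not something confirmed in this thread.  The sample text is 
hand-written, and scheduler_status is a hypothetical helper.

```python
import re
from datetime import datetime

# Hand-written, abbreviated stand-in for `scontrol show job` output.
SAMPLE_HIGH_JOB = """
JobId=102 JobName=high_prio
   JobState=PENDING Reason=Resources Dependency=(null)
   SubmitTime=2023-09-29T10:06:00 EligibleTime=2023-09-29T10:06:00
   LastSchedEval=2023-09-29T10:07:30 Scheduler=Backfill
"""


def scheduler_status(scontrol_text, submitted_at):
    """Summarise whether the job was evaluated after submission."""
    fields = dict(re.findall(r"(\w+)=(\S+)", scontrol_text))
    try:
        last_eval = datetime.fromisoformat(fields.get("LastSchedEval", ""))
        evaluated = last_eval > submitted_at
    except ValueError:
        # Field missing or not a timestamp: treat as not yet evaluated.
        evaluated = False
    return {
        "state": fields.get("JobState"),
        "scheduler": fields.get("Scheduler"),
        "evaluated": evaluated,
    }


status = scheduler_status(SAMPLE_HIGH_JOB, datetime(2023, 9, 29, 10, 6, 0))
print(status)
```

A test could then pass only when the higher priority job is still PENDING, has 
been evaluated, and the lower priority job is still RUNNING.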

Is that a good assumption?  Is there any other, or potentially quicker, way to 
verify that the higher priority job will NOT preempt the lower priority job?

Rob