Re: [slurm-users] Verifying preemption WON'T happen
Well, again, I don't want to tweak things just to get the test to happen quicker. I DO have to keep in mind the scheduler and backfill settings, though. For instance, I think the default scheduler and backfill intervals are 60 and 30 seconds...or vice versa. So, before I check the Scheduler value for the high priority job via scontrol, I wait 90 seconds and then some. In a perfect world, that SHOULD have given the main scheduler and the backfill scheduler time to get to it. I THINK, however, that on a sufficiently busy system, there's no guarantee, even after that amount of time, that the new high priority job has been evaluated.

I'll take a look at sdiag and see if it can tell me where the job is at. Thanks for the suggestion.

Rob
Re: [slurm-users] Verifying preemption WON'T happen
From: slurm-users on behalf of Ryan Novosielski
Sent: Friday, September 29, 2023 4:19 PM

You can get some information on that from sdiag, and there are tweaks you can make to backfill scheduling that affect how quickly it will get to a job. That doesn’t really answer your real question, but might help you when you are looking into this.
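One way to act on the sdiag suggestion is to compare its cycle counters before and after submitting the high priority job. The following is only a sketch under stated assumptions: it assumes sdiag prints a "Main schedule statistics" section and a "Backfilling stats" section, each with a "Total cycles:" line, and the sample texts and both function names are hypothetical. Check the section and counter names against the sdiag output on your own Slurm version.

```python
import re

def cycle_counts(sdiag_text: str) -> dict[str, int]:
    """Return the first 'Total cycles' counter found in each scheduler section."""
    counts: dict[str, int] = {}
    section = None
    for line in sdiag_text.splitlines():
        stripped = line.strip()
        # Section header names are assumptions about the sdiag layout.
        if stripped.startswith("Main schedule statistics"):
            section = "main"
        elif stripped.startswith("Backfilling stats"):
            section = "backfill"
        match = re.match(r"Total cycles:\s*(\d+)", stripped)
        if match and section and section not in counts:
            counts[section] = int(match.group(1))
    return counts

def both_schedulers_cycled(before: dict[str, int], after: dict[str, int]) -> bool:
    """True if both the main and backfill counters increased between snapshots."""
    return all(after.get(k, 0) > before.get(k, 0) for k in ("main", "backfill"))

# Hypothetical snapshots, standing in for sdiag output captured around the submit.
SAMPLE_BEFORE = """Main schedule statistics (microseconds):
    Last cycle:   1200
    Total cycles: 10
Backfilling stats
    Total backfilled jobs (since last slurm start): 3
    Total cycles: 4
"""
SAMPLE_AFTER = """Main schedule statistics (microseconds):
    Last cycle:   900
    Total cycles: 12
Backfilling stats
    Total backfilled jobs (since last slurm start): 3
    Total cycles: 5
"""

before = cycle_counts(SAMPLE_BEFORE)
after = cycle_counts(SAMPLE_AFTER)
print(both_schedulers_cycled(before, after))
```

A completed cycle shows only that the scheduler ran, not that it reached a particular job, so this narrows the question rather than settling it.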
Re: [slurm-users] Verifying preemption WON'T happen
On Sep 29, 2023, at 16:10, Groner, Rob wrote:

I'm not looking for a one-time answer. We run these tests any time we change anything related to the Slurm version, configuration, etc. We certainly run the test after the system comes back up after an outage, and an hour would be a long time to wait for that. That's certainly the brute-force approach, but I'm hoping there's a definitive way to show, through scontrol job output, that the job won't preempt.

I could set the PreemptExemptTime to a smaller value, say 5 minutes instead of 1 hour, that is true, but there are a few issues with that.

1. I would then no longer be testing the system as it actually is. I want to test the system in its actual production configuration.
2. If I did lower its value, what would be a safe value? 5 minutes? Does running for 5 minutes guarantee that the higher priority job had a chance to preempt it but didn't? Or did the scheduler even ever get to it? On a test cluster with few jobs, you could be reasonably assured it did, but when running tests on the production cluster...isn't it possible the scheduler hasn't yet had a chance to process it, even after 5 minutes? It depends on the Slurm scheduler settings, I suppose.

Rob
Re: [slurm-users] Verifying preemption WON'T happen
From: slurm-users on behalf of Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
Sent: Friday, September 29, 2023 3:14 PM

> I could obviously let the test run for an hour to verify the lower priority job was never preempted...but that's not really feasible.

Why not? Isn't it going to take longer than an hour to wait for responses to this post?

Also, you could set the minimum time to a much smaller value, so it won't take as long to test.
Re: [slurm-users] Verifying preemption WON'T happen
On Sep 29, 2023, at 2:51 PM, Davide DelVento wrote:

I don't really have an answer for you other than a "hallway comment": it sounds like a good thing to test with a simulator, which is what I would do if I had one. I've been intrigued by (but have not really looked much into) https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf
[slurm-users] Verifying preemption WON'T happen
On Fri, Sep 29, 2023 at 10:05 AM Groner, Rob wrote:

On our system, for some partitions, we guarantee that a job can run at least an hour before being preempted by a higher priority job. We use the QOS preempt exempt time for this, and it appears to be working. But of course, I want to TEST that it works.

So on a test system, I start a lower priority job on a specific node, wait until it starts running, and then I start a higher priority job for the same node. The test should only pass if the higher priority job has an OPPORTUNITY to preempt the lower priority job, and doesn't.

Now, I know I can get a preempt eligible time out of scontrol for the lower priority job and verify that it's set for an hour (I do check that already), but that's not good enough for me. I could obviously let the test run for an hour to verify the lower priority job was never preempted...but that's not really feasible. So instead, I want to verify that the higher priority job has had a chance to preempt the lower priority job, and it did not.

So far, the way I've been doing that is to check the reported Scheduler in the scontrol job output for the higher priority job. I figure that when the scheduler changes to Backfill instead of Main, then the higher priority job has been seen by the main scheduler and it passed on the chance to preempt the lower priority job.

Is that a good assumption? Is there any other, or potentially quicker, way to verify that the higher priority job will NOT preempt the lower priority job?

Rob
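The checks described above could be scripted roughly as follows. This is a sketch, not a verified test: it assumes `scontrol show job` prints whitespace-separated Key=Value pairs including StartTime, PreemptEligibleTime, SubmitTime, LastSchedEval, Scheduler and JobState, and that the preempt eligible time is counted from the job's start time. The sample texts and all function names are hypothetical, and using LastSchedEval as the "has been evaluated" signal is an assumption added here, not something the thread confirms.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed scontrol timestamp layout

def parse_scontrol_job(text: str) -> dict[str, str]:
    """Split scontrol-style output into a key -> value mapping (first key wins)."""
    fields: dict[str, str] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:
            fields.setdefault(key, value)
    return fields

def exempt_window(low_job: dict[str, str]) -> timedelta:
    """Time between the job's start and the moment it becomes preemptable."""
    start = datetime.strptime(low_job["StartTime"], TIME_FORMAT)
    eligible = datetime.strptime(low_job["PreemptEligibleTime"], TIME_FORMAT)
    return eligible - start

def evaluated_without_preempting(high_job: dict[str, str], low_job: dict[str, str]) -> bool:
    """True if the high job was evaluated after submit, is still pending,
    and the low job is still running."""
    submitted = datetime.strptime(high_job["SubmitTime"], TIME_FORMAT)
    last_eval = datetime.strptime(high_job["LastSchedEval"], TIME_FORMAT)
    return (last_eval >= submitted
            and high_job["JobState"] == "PENDING"
            and low_job["JobState"] == "RUNNING")

# Hypothetical output for the two test jobs; real output has many more fields.
LOW_SAMPLE = """JobId=101 JobName=low
   JobState=RUNNING Reason=None
   StartTime=2023-09-29T10:00:05 EndTime=2023-09-29T12:00:05
   PreemptEligibleTime=2023-09-29T11:00:05 PreemptTime=None
"""
HIGH_SAMPLE = """JobId=102 JobName=high
   JobState=PENDING Reason=Resources
   SubmitTime=2023-09-29T10:01:00 EligibleTime=2023-09-29T10:01:00
   LastSchedEval=2023-09-29T10:02:30 Scheduler=Backfill
"""

low = parse_scontrol_job(LOW_SAMPLE)
high = parse_scontrol_job(HIGH_SAMPLE)
print(exempt_window(low), high["Scheduler"], evaluated_without_preempting(high, low))
```

The whitespace split is deliberately simple and would mangle values that contain spaces, such as some job names or commands, which is acceptable here because only timestamp and state fields are read.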