Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Chris Samuel

On 27/2/23 03:34, David Laehnemann wrote:


Hi Chris, hi Sean,


Hiya!


thanks also (and thanks again) for chiming in.


No worries.


Quick follow-up question:

Would `squeue` be a better fall-back command than `scontrol` from the
perspective of keeping `slurmctld` responsive?


Sadly not. Whilst a site can do some tricks to enforce rate limiting on 
squeue via the cli_filter, that doesn't mean other sites have that set up, 
so they are vulnerable to the same issue.



Also, just as a quick heads-up: I am documenting your input by linking
to the mailing list archives, I hope that's alright for you?
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467


No problem - but I would say it's got to be sacct.

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Chris Samuel

On 27/2/23 06:53, Brian Andrus wrote:

Sorry, I had to share that this is very much like "Are we there yet?" on 
a road trip with kids 


Slurm is trying to drive.


Oh I love this analogy!

Whereas sacct is like talking to the navigator. The navigator 
does talk to the driver to give directions, and the driver keeps them up 
to date with the current situation, but the kids can talk to the 
navigator without disrupting the driver's concentration.


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
Hi Brian,

thanks for your ideas. Follow-up questions, because further digging
through the docs didn't get me anywhere definitive on this:

> IMHO, the true solution is that if a job's info NEEDS updated that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.

Do you have any examples or suggestions of such ways without using
slurm commands?

> Perhaps there is some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe
> such an API could be created/expanded.

One option that might work would be to use the `--wait` option of the
`sbatch` command that snakemake uses to submit jobs, `--wrap`ping the
respective shell command. In addition, `sbatch` would have to record "Job
Accounting" info before exiting (it somehow does so implicitly in the log
file, although I am not sure how and where the printing of this accounting
info is configured; so I am not sure whether this info will always be
available in the logs or whether this depends on a Slurm cluster's
configuration). One could then have snakemake wait for the process to
finish, and only then parse the "Job Accounting" info in the log file to
determine what happened. But this means we do not know the `JobId`s of
submitted jobs in the meantime, as the `JobId` is what is usually returned
by `sbatch` upon successful submission (when `--wait` is not used). As a
result, things like running `scancel` on all currently running jobs when we
want to stop a snakemake run become more difficult, because we don't have a
list of `JobId`s of currently active jobs. Although a single run-specific
`name` for all jobs of a run (as suggested by Sean) might help, as `scancel`
seems to allow the use of job names; see the sketch below.
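
For illustration, a minimal sketch of what that could look like -- the job
name and the wrapped command are hypothetical placeholders, and whether this
fits snakemake's submission model is exactly what is still being worked out:

  # submit, blocking until the job finishes; tag it with a run-specific name
  sbatch --wait --job-name=snakemake-run-42 \
      --wrap="my_rule_command --input data.txt --output result.txt"

  # later, if the whole snakemake run needs to be aborted, cancel by name
  # instead of by JobId (scancel supports --name/-n)
  scancel --name=snakemake-run-42 --user="$USER"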

But as one can hopefully see, there are no simple solutions. And to me,
the documentation is not that easy to parse, especially if you are not
already familiar with the terminology, and I have not really found any
best practices regarding how to query for or otherwise determine job
status (which is not to say they don't exist, but at least they don't seem
easy to find -- pointers are welcome). I'll try to document whatever
solution I come up with as well as I can, so that others can hopefully
reuse as much of it as possible in their own contexts. But maybe some
publicly available best practices (and no-gos) for slurm cluster users
would be a useful resource that cluster admins can then point / link to.

cheers,
david


On Mon, 2023-02-27 at 06:53 -0800, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?"
> on a road trip with kids :)
> 
> Slurm is trying to drive. Any communication to slurmctld will involve
> an RPC call (sinfo, squeue, scontrol, etc). You can see how many with
> sinfo. Too many RPC calls will cause failures. Asking slurmdbd will not
> do that to you. In fact, you could have a separate slurmdbd just for
> queries if you wanted. This is why that was suggested as a better
> option.
> 
> So, even if you run 'squeue' once every few seconds, it would impact
> the system. More so depending on the size of the system. We have had
> that issue with users running 'watch squeue' and had to address it.
> 
> IMHO, the true solution is that if a job's info NEEDS updated that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.
> 
> Perhaps there is some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe
> such an API could be created/expanded.
> 
> Just a quick 2 cents (We may be up to a few dollars with all of those
> so far).
> 
> Brian Andrus
> 
> 
> On 2/27/2023 4:24 AM, Ward Poelmans wrote:
> > On 24/02/2023 18:34, David Laehnemann wrote:
> > > Those queries then should not have to happen too often, although
> > > do you
> > > have any indication of a range for when you say "you still
> > > wouldn't
> > > want to query the status too frequently." Because I don't really,
> > > and
> > > would probably opt for some compromise of every 30 seconds or so.
> > 
> > I think this is exactly why hpc sys admins are sometimes not very
> > happy about these tools. You're talking about 1000s of jobs on one
> > hand, yet you want to fetch the status every 30 seconds? What is the
> > point of that other than overloading the scheduler?
> > 
> > We're telling our users not to query Slurm too often and usually
> > give 5 minutes as a good interval. You have to let Slurm do its job.
> > There is no point in querying in a loop every 30 seconds when we're
> > talking about large numbers of jobs.
> > 
> > 
> > Ward




Re: [slurm-users] priority access and QoS

2023-02-27 Thread Jason Simms
Hello all,

I haven't found any guidance that seems to be the current "better
practice," but this does seem to be a common use case. I imagine there are
multiple ways to accomplish this goal. For example, you could assuredly do
it with QoS, but you can likely also accomplish this with some other
weighting scheme based on, e.g., account. At my last position, I
accomplished this by having a partition containing the purchased nodes that
permitted a specific account only, which also had a PriorityTier setting,
and ensuring the cluster was configured to preempt based on a partition's
priority setting. So, even if the same nodes were in a different partition,
if a user in the account requested resources, it would preempt (if needed)
jobs from users not in that account. These are sample configuration lines
to illustrate (obviously simplified):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

PartitionName=node PriorityTier=50 Nodes=node[01-06]
PartitionName=smithlab AllowAccounts=smithlab PriorityTier=100 Nodes=node06

I never heard from a user that this failed to preempt when necessary, so I
presume it works as advertised (in this case, if a user from smithlab ran a
job on node06, it would preempt non-smithlab users if the requested
resources were unavailable). Note that the user needs to specify the
smithlab account in, e.g., the batch submission file or on the command
line, especially if they have a non-smithlab account with the same username.
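
As an illustration only (the account, partition, and script contents below
just reuse the example names from the configuration above, they are not
anything prescribed by Slurm), a submission that would land in the priority
partition could look like:

  #!/bin/bash
  #SBATCH --account=smithlab
  #SBATCH --partition=smithlab
  #SBATCH --time=01:00:00

  srun ./my_analysis

or, equivalently, on the command line:
sbatch --account=smithlab --partition=smithlab job.sh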

If someone can recommend why this approach isn't advisable, or if there is
a preferred approach, I would welcome feedback.

Warmest regards,
Jason


On Mon, Feb 27, 2023 at 2:09 PM Styrk, Daryl  wrote:

> Marko,
>
>
>
> I’m in a similar situation. We have many Accounts with dedicated hardware
> and recently ran into a situation where a user with dedicated submitted
> hundreds of jobs and they overflowed into the community hardware which
> caused an unexpected backlog. I believe QoS will help us with that as well.
> I’ve been researching and reading about best practices.
>
>
>
> Regards,
>
> Daryl
>
>
>
> *From: *slurm-users  on behalf of
> Marko Markoc 
> *Date: *Wednesday, February 22, 2023 at 1:56 PM
> *To: *slurm-users@lists.schedmd.com 
> *Subject: *[slurm-users] priority access and QoS
>
>
> Hi All,
>
>
>
> Currently in our environment we only have one default "free" tier of
> access to our resources and we are looking to add an additional higher
> priority tier of access. That means that the jobs from the users that
> "purchased" a certain amount of service units will preempt jobs of the
> users in the free tier. I was thinking of using slurm QoS to achieve this
> by adding users/groups via sacctmgr to this newly created QoS tier, but I
> wanted to check with all of you if there is a better way to
> accomplish this through slurm. Also, could GrpTRESMins be used to
> automatically keep track of SU usage by a certain user or group, or is
> there some better usage tracking mechanism?
>
>
>
> Thank You all,
>
> Marko
>


-- 
*Jason L. Simms, Ph.D., M.P.H.*
Manager of Research Computing
Swarthmore College
Information Technology Services
(610) 328-8102
Schedule a meeting: https://calendly.com/jlsimms


Re: [slurm-users] priority access and QoS

2023-02-27 Thread Styrk, Daryl
Marko,

I’m in a similar situation. We have many Accounts with dedicated hardware and 
recently ran into a situation where a user with dedicated submitted hundreds of 
jobs and they overflowed into the community hardware which caused an unexpected 
backlog. I believe QoS will help us with that as well. I’ve been researching 
and reading about best practices.

Regards,
Daryl

From: slurm-users  on behalf of Marko 
Markoc 
Date: Wednesday, February 22, 2023 at 1:56 PM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] priority access and QoS
Hi All,

Currently in our environment we only have one default "free" tier of access to 
our resources and we are looking to add an additional higher priority tier of 
access. That means that the jobs from the users that "purchased" a certain 
amount of service units will preempt jobs of the users in the free tier. I was 
thinking of using slurm QoS to achieve this by adding users/groups via sacctmgr 
to this newly created QoS tier, but I wanted to check with all of you if there 
is a better way to accomplish this through slurm. Also, could GrpTRESMins be 
used to automatically keep track of SU usage by a certain user or group, or is 
there some better usage tracking mechanism?

Thank You all,
Marko


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Bas van der Vlies
We have many jupyterhub jobs on our cluster that also do a lot of job
queries. We could adjust the query interval, but what I did instead is have
one process query all the jobs with `squeue --json`, and the jupyterhub
query script looks up jobs in this output.

That way, instead of every jupyterhub job querying the batch system, only
one process does. This is specific to the hub environment, but if a lot of
users run snakemake you hit the same problem.

As an admin I can understand the queries, and it is not only snakemake;
there are plenty of other tools like jupyterhub that also do a lot of
queries. Some kind of caching mechanism is nice. Most solve it with a
wrapper script, for example along the lines of the sketch below.
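
(A minimal sketch of such a caching wrapper, assuming GNU stat/date and a
Slurm version where `squeue --json` is available; the cache path and maximum
age are arbitrary choices:

  #!/bin/bash
  # Refresh a shared squeue snapshot at most once per minute and print it.
  CACHE=/tmp/squeue-cache.json
  MAX_AGE=60

  if [ ! -f "$CACHE" ] || \
     [ $(( $(date +%s) - $(stat -c %Y "$CACHE") )) -ge "$MAX_AGE" ]; then
      # write atomically so concurrent readers never see a partial file
      squeue --json > "$CACHE.tmp" && mv "$CACHE.tmp" "$CACHE"
  fi

  cat "$CACHE"
)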

Just my 2 cents


On 27/02/2023 15:53, Brian Andrus wrote:
Sorry, I had to share that this is very much like "Are we there yet?" on 
a road trip with kids :)


Slurm is trying to drive. Any communication to slurmctld will involve an 
RPC call (sinfo, squeue, scontrol, etc). You can see how many with sinfo.
Too many RPC calls will cause failures. Asking slurmdbd will not do that 
to you. In fact, you could have a separate slurmdbd just for queries if 
you wanted. This is why that was suggested as a better option.


So, even if you run 'squeue' once every few seconds, it would impact the 
system. More so depending on the size of the system. We have had that 
issue with users running 'watch squeue' and had to address it.


IMHO, the true solution is that if a job's info NEEDS updated that 
often, have the job itself report what it is doing (but NOT via slurm 
commands). There are numerous ways to do that for most jobs.


Perhaps there is some additional lines that could be added to the job 
that would do a call to a snakemake API and report itself? Or maybe such 
an API could be created/expanded.


Just a quick 2 cents (We may be up to a few dollars with all of those so 
far).


Brian Andrus


On 2/27/2023 4:24 AM, Ward Poelmans wrote:

On 24/02/2023 18:34, David Laehnemann wrote:

Those queries then should not have to happen too often, although do you
have any indication of a range for when you say "you still wouldn't
want to query the status too frequently." Because I don't really, and
would probably opt for some compromise of every 30 seconds or so.


I think this is exactly why hpc sys admins are sometimes not very 
happy about these tools. You're talking about 1000s of jobs on one 
hand, yet you want to fetch the status every 30 seconds? What is the point 
of that other than overloading the scheduler?


We're telling our users not to query Slurm too often and usually 
give 5 minutes as a good interval. You have to let Slurm do its job. 
There is no point in querying in a loop every 30 seconds when we're 
talking about large numbers of jobs.



Ward




--
--
Bas van der Vlies
| High Performance Computing & Visualization | SURF| Science Park 140 | 
1098 XG  Amsterdam

| T +31 (0) 20 800 1300  | bas.vandervl...@surf.nl | www.surf.nl |



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Ümit Seren
As a side note:
In Slurm 23.x a new rate limiting feature for client RPC calls was added:
(see this commit:
https://github.com/SchedMD/slurm/commit/674f118140e171d10c2501444a0040e1492f4eab#diff-b4e84d09d9b1d817a964fb78baba0a2ea6316bfc10c1405329a95ad0353ca33e
)
This would give operators the ability to limit the negative effect of
workflow managers on the scheduler.
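
(For operators: enabling this is done in slurm.conf. The sketch below is
from memory -- the rl_* option names should be verified against the
slurm.conf man page for your 23.x release, and the numeric values are
arbitrary:

  # slurm.conf: token-bucket rate limiting of client RPCs
  SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_rate=10

Clients that exceed the limit should, as I understand it, be told to back
off and retry instead of piling more RPCs onto slurmctld.)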


On Mon, Feb 27, 2023 at 4:57 PM Davide DelVento 
wrote:

> > > And if you are seeing a workflow management system causing trouble on
> > > your system, probably the most sustainable way of getting this resolved
> > > is to file issues or pull requests with the respective project, with
> > > suggestions like the ones you made. For snakemake, a second good point
> > > to currently chime in, would be the issue discussing Slurm job array
> > > support: https://github.com/snakemake/snakemake/issues/301
> >
> > I have to disagree here.  I think the onus is on the people in a given
> > community to ensure that their software behaves well on the systems they
> > want to use, not on the operators of those systems.  Those of us running
> > HPC systems often have to deal with a very large range of different
> > pieces of software, and time and personnel are limited.  If some program
> > used by only a subset of the users is causing disruption, then it
> > already costs us time and energy to mitigate those effects.  Even if I
> > had the appropriate skill set, I don't see myself writing many
> > patches for workflow managers any time soon.
>
> As someone who has worked in both roles (and to a degree still is) and
> therefore can better understand the perspective from both parties, I
> side more with David than with Loris here.
>
> Yes, David wrote "or pull requests", but that's an OR.
>
> Loris, if you know or experience a problem, it takes close to zero
> time to file a bug report educating the author of the software about
> the problem (or pointing them to places where they can educate
> themselves). Otherwise they will never know about it, they will never
> fix it, and potentially they think it's fine and will make the problem
> worse. Yes, you could alternatively forbid the use of the problematic
> software on the machine (I've done that on our systems), but users
> with those needs will find ways to create the very same problem, and
> perhaps worse, in other ways (they have done it on our system). Yes,
> time is limited, and as operators of HPC systems we often don't have
> the time to understand all the nuances and needs of all the users, but
> that's not the point I am advocating. In fact it does seem to me that
> David is putting the onus on himself and his community to make the
> software behave correctly, and he is trying to educate himself about
> what "correct" is like. So just give him the input he's looking for,
> both here and (if and when snakemake causes troubles on your system)
> by opening tickets on that repo, explaining the problem (definitely
> not writing a PR for you, sorry David)
>
>


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Davide DelVento
> > And if you are seeing a workflow management system causing trouble on
> > your system, probably the most sustainable way of getting this resolved
> > is to file issues or pull requests with the respective project, with
> > suggestions like the ones you made. For snakemake, a second good point
> > to currently chime in, would be the issue discussing Slurm job array
> > support: https://github.com/snakemake/snakemake/issues/301
>
> I have to disagree here.  I think the onus is on the people in a given
> community to ensure that their software behaves well on the systems they
> want to use, not on the operators of those systems.  Those of us running
> HPC systems often have to deal with a very large range of different
> pieces of software, and time and personnel are limited.  If some program
> used by only a subset of the users is causing disruption, then it
> already costs us time and energy to mitigate those effects.  Even if I
> had the appropriate skill set, I don't see myself writing many
> patches for workflow managers any time soon.

As someone who has worked in both roles (and to a degree still is) and
therefore can better understand the perspective from both parties, I
side more with David than with Loris here.

Yes, David wrote "or pull requests", but that's an OR.

Loris, if you know or experience a problem, it takes close to zero
time to file a bug report educating the author of the software about
the problem (or pointing them to places where they can educate
themselves). Otherwise they will never know about it, they will never
fix it, and potentially they think it's fine and will make the problem
worse. Yes, you could alternatively forbid the use of the problematic
software on the machine (I've done that on our systems), but users
with those needs will find ways to create the very same problem, and
perhaps worse, in other ways (they have done it on our system). Yes,
time is limited, and as operators of HPC systems we often don't have
the time to understand all the nuances and needs of all the users, but
that's not the point I am advocating. In fact it does seem to me that
David is putting the onus on himself and his community to make the
software behave correctly, and he is trying to educate himself about
what "correct" is like. So just give him the input he's looking for,
both here and (if and when snakemake causes troubles on your system)
by opening tickets on that repo, explaining the problem (definitely
not writing a PR for you, sorry David)



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Loris Bennett
Hi David,

David Laehnemann  writes:

> Dear Ward,
>
> if used correctly (and that is a big caveat for any method of
> interacting with a cluster system), snakemake will only submit as many
> jobs as can fit within the resources of the cluster at one point in
> time (or however many resources you tell snakemake it can use). So
> unless there are thousands of cores available (or you "lie" to
> snakemake, telling it that there are many more cores than actually
> exist), it will only ever submit hundreds of jobs (or a lot less, if
> the jobs each require multiple cores). Accordingly, any queries will
> also only be for the number of jobs that snakemake currently has
> submitted. And snakemake will only submit new jobs once it registers
> previously submitted jobs as finished.
>
> So workflow managers can actually help reduce the strain on the
> scheduler, by only ever submitting stuff within the general limits of
> the system (as opposed to, for example, using some bash loop to just
> submit all of your analysis steps or samples at once).

I don't see this as a particular advantage for the scheduler.  If the
maximum number of jobs a user can submit is set to, say, 5000, then it makes
no difference whether these 5000 jobs are generated by snakemake or a
batch script.  On our system, strain tends mainly to occur when many
similar jobs fail immediately after they have started.

How does snakemake behave in such a situation?  If the job database is
already clogged up trying to record too many jobs completing within too
short a time, snakemake querying the database at that moment and maybe
starting more jobs (because others have failed and thus completed) could
potentially exacerbate the problem.

> And for example,
> snakemake has a mechanism to batch a number of smaller jobs into larger
> jobs for submission on the cluster, so this might be something to
> suggest to your users that cause trouble through using snakemake
> (especially the `--group-components` mechanism):
> https://snakemake.readthedocs.io/en/latest/executing/grouping.html

This seems to me, from the perspective of an operator, to be the main
advantage.

> The query mechanism for job status is a different story. And I'm
> specifically here on this mailing list to get as much input as possible
> to improve this -- and welcome anybody who wants to chime in on my
> respective work-in-progress pull request right here:
> https://github.com/snakemake/snakemake/pull/2136
>
> And if you are seeing a workflow management system causing trouble on
> your system, probably the most sustainable way of getting this resolved
> is to file issues or pull requests with the respective project, with
> suggestions like the ones you made. For snakemake, a second good point
> to currently chime in, would be the issue discussing Slurm job array
> support: https://github.com/snakemake/snakemake/issues/301

I have to disagree here.  I think the onus is on the people in a given
community to ensure that their software behaves well on the systems they
want to use, not on the operators of those systems.  Those of us running
HPC systems often have to deal with a very large range of different
pieces of software, and time and personnel are limited.  If some program
used by only a subset of the users is causing disruption, then it
already costs us time and energy to mitigate those effects.  Even if I
had the appropriate skill set, I don't see myself writing many
patches for workflow managers any time soon.

Cheers,

Loris

> And for Nextflow, another commonly used workflow manager in my field
> (bioinformatics), there's also an issue discussing Slurm job array
> support:
> https://github.com/nextflow-io/nextflow/issues/1477
>
> cheers,
> david
>
>
> On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
>> On 24/02/2023 18:34, David Laehnemann wrote:
>> > Those queries then should not have to happen too often, although do
>> > you
>> > have any indication of a range for when you say "you still wouldn't
>> > want to query the status too frequently." Because I don't really,
>> > and
>> > would probably opt for some compromise of every 30 seconds or so.
>> 
>> I think this is exactly why hpc sys admins are sometimes not very
>> happy about these tools. You're talking about 1000s of jobs on one
>> hand, yet you want to fetch the status every 30 seconds? What is the
>> point of that other than overloading the scheduler?
>> 
>> We're telling our users not to query Slurm too often and usually
>> give 5 minutes as a good interval. You have to let Slurm do its job.
>> There is no point in querying in a loop every 30 seconds when we're
>> talking about large numbers of jobs.
>> 
>> 
>> Ward
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
Dear Ward,

if used correctly (and that is a big caveat for any method of
interacting with a cluster system), snakemake will only submit as many
jobs as can fit within the resources of the cluster at one point in
time (or however many resources you tell snakemake it can use). So
unless there are thousands of cores available (or you "lie" to
snakemake, telling it that there are many more cores than actually
exist), it will only ever submit hundreds of jobs (or a lot less, if
the jobs each require multiple cores). Accordingly, any queries will
also only be for the number of jobs that snakemake currently has
submitted. And snakemake will only submit new jobs once it registers
previously submitted jobs as finished.

So workflow managers can actually help reduce the strain on the
scheduler, by only ever submitting stuff within the general limits of
the system (as opposed to, for example, using some bash loop to just
submit all of your analysis steps or samples at once). And for example,
snakemake has a mechanism to batch a number of smaller jobs into larger
jobs for submission on the cluster, so this might be something to
suggest to your users that cause trouble through using snakemake
(especially the `--group-components` mechanism):
https://snakemake.readthedocs.io/en/latest/executing/grouping.html
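
(As a rough illustration of that grouping mechanism -- the rule and group
names here are made up, and the exact invocation should be checked against
the snakemake documentation linked above:

  # put all instances of a (hypothetical) quick rule into one job group and
  # pack 20 of them into each submitted cluster job
  snakemake --jobs 100 \
      --groups count_reads=quick_jobs \
      --group-components quick_jobs=20

This way, 2000 small tasks would turn into roughly 100 cluster jobs instead
of 2000 separate submissions.)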

The query mechanism for job status is a different story. And I'm
specifically here on this mailing list to get as much input as possible
to improve this -- and welcome anybody who wants to chime in on my
respective work-in-progress pull request right here:
https://github.com/snakemake/snakemake/pull/2136

And if you are seeing a workflow management system causing trouble on
your system, probably the most sustainable way of getting this resolved
is to file issues or pull requests with the respective project, with
suggestions like the ones you made. For snakemake, a second good point
to currently chime in, would be the issue discussing Slurm job array
support: https://github.com/snakemake/snakemake/issues/301

And for Nextflow, another commonly used workflow manager in my field
(bioinformatics), there's also an issue discussing Slurm job array
support:
https://github.com/nextflow-io/nextflow/issues/1477

cheers,
david


On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
> On 24/02/2023 18:34, David Laehnemann wrote:
> > Those queries then should not have to happen too often, although do
> > you
> > have any indication of a range for when you say "you still wouldn't
> > want to query the status too frequently." Because I don't really,
> > and
> > would probably opt for some compromise of every 30 seconds or so.
> 
> I think this is exactly why hpc sys admins are sometimes not very
> happy about these tools. You're talking about 1000s of jobs on one
> hand, yet you want to fetch the status every 30 seconds? What is the
> point of that other than overloading the scheduler?
> 
> We're telling our users not to query Slurm too often and usually
> give 5 minutes as a good interval. You have to let Slurm do its job.
> There is no point in querying in a loop every 30 seconds when we're
> talking about large numbers of jobs.
> 
> 
> Ward




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Ward Poelmans

On 24/02/2023 18:34, David Laehnemann wrote:

Those queries then should not have to happen too often, although do you
have any indication of a range for when you say "you still wouldn't
want to query the status too frequently." Because I don't really, and
would probably opt for some compromise of every 30 seconds or so.


I think this is exactly why hpc sys admins are sometimes not very happy about 
these tools. You're talking about 1000s of jobs on one hand, yet you want to 
fetch the status every 30 seconds? What is the point of that other than 
overloading the scheduler?

We're telling our users not to query Slurm too often and usually give 5 
minutes as a good interval. You have to let Slurm do its job. There is no 
point in querying in a loop every 30 seconds when we're talking about large 
numbers of jobs.


Ward




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
Hi Chris, hi Sean,

thanks also (and thanks again) for chiming in.

Quick follow-up question:

Would `squeue` be a better fall-back command than `scontrol` from the
perspective of keeping `slurmctld` responsive? From what I can see in
the general overview of how Slurm works (
https://slurm.schedmd.com/overview.html ), both query `slurmctld`. But
would one be "better" than the other, as in generating less work for
`slurmctld`? Or is it roughly an equivalent amount of work, so that we
can simply pick whichever set of command-line arguments better suits
our needs?
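
(For context, the sacct-based status check being considered in the
snakemake PR would look roughly like the following; the exact format fields
are my choice for illustration, not necessarily what the PR ends up using:

  # query the state of a batch of previously submitted jobs in one call,
  # hitting slurmdbd instead of slurmctld
  sacct -j 123456,123457,123458 \
      --format=JobID,State,ExitCode --noheader --parsable2
)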

Also, just as a quick heads-up: I am documenting your input by linking
to the mailing list archives, I hope that's alright for you?
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467

cheers,
david


On Sat, 2023-02-25 at 10:51 -0800, Chris Samuel wrote:
> On 23/2/23 2:55 am, David Laehnemann wrote:
> 
> > And consequently, would using `scontrol` thus be the better default
> > option (as opposed to `sacct`) for repeated job status checks by a
> > workflow management system?
> 
> Many others have commented on this, but use of scontrol in this way is
> really really bad because of the impact it has on slurmctld. This is
> because responding to the RPC (IIRC) requires taking read locks on
> internal data structures and on a large, busy system (like ours, we
> recently rolled over slurm job IDs back to 1 after ~6 years of operation
> and run at over 90% occupancy most of the time) this can really damage
> scheduling performance.
> 
> We've had numerous occasions where we've had to track down users abusing
> scontrol in this way and redirect them to use sacct instead.
> 
> We already use the cli filter abilities in Slurm to impose a form of
> rate limiting on RPCs from other commands, but unfortunately scontrol is
> not covered by that.
> 
> All the best,
> Chris