Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Chris Samuel

On 27/2/23 03:34, David Laehnemann wrote:


> Hi Chris, hi Sean,

Hiya!

> thanks also (and thanks again) for chiming in.

No worries.

> Quick follow-up question:
>
> Would `squeue` be a better fall-back command than `scontrol` from the
> perspective of keeping `slurmctld` responsive?

Sadly not. Whilst a site can do some tricks to enforce rate limiting on
squeue via the cli_filter, that doesn't mean other sites have that set up,
so they are vulnerable to the same issue.

> Also, just as a quick heads-up: I am documenting your input by linking
> to the mailing list archives, I hope that's alright for you?
> https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467


No problem - but I would say it's got to be sacct.

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Chris Samuel

On 27/2/23 06:53, Brian Andrus wrote:

> Sorry, I had to share that this is very much like "Are we there yet?" on
> a road trip with kids :)
> 
> Slurm is trying to drive.


Oh I love this analogy!

Whereas sacct is like talking to the navigator. The navigator 
does talk to the driver to give directions, and the driver keeps them up 
to date with the current situation, but the kids can talk to the 
navigator without disrupting the driver's concentration.


All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
Hi Brian,

thanks for your ideas. Follow-up questions, because further digging
through the docs didn't get me anywhere definitive on this:

> IMHO, the true solution is that if a job's info NEEDS to be updated that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.

Do you have any examples or suggestions of such ways without using
slurm commands?

> Perhaps there are some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe such
> an API could be created/expanded.

One option that could work is the `--wait` option of the `sbatch` command
that snakemake uses to submit jobs, combined with `--wrap`ping the
respective shell command. In addition, `sbatch` would have to record "Job
Accounting" info before exiting (it somehow does so implicitly in the log
file, although I am not sure how and where the printing of this accounting
info is configured, so I don't know whether this info will always be
available in the logs or whether this depends on a Slurm cluster's
configuration). One could then have snakemake wait for the process to
finish, and only then parse the "Job Accounting" info in the log file to
determine what happened. But this means we do not know the `JobId`s of
submitted jobs in the meantime, as the `JobId` is what `sbatch` usually
returns upon successful submission (when `--wait` is not used). As a
result, things like running `scancel` on all currently running jobs when we
want to stop a snakemake run become more difficult, because we don't have a
list of `JobId`s of currently active jobs. However, a single run-specific
`name` for all jobs of a run (as suggested by Sean) might help, as
`scancel` seems to allow the use of job names; see the sketch below.
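
To make that name-based idea concrete, here is a minimal sketch in Python.
It is not snakemake's actual implementation, the run name and wrapped
command are made up for illustration, and it relies on the standard
`sbatch --parsable`, `--job-name`, `--wrap` and `scancel -n` (cancel by job
name) options:

    # Sketch only: give every job of one workflow run the same run-specific
    # job name, so a clean shutdown can cancel them all without tracking the
    # individual JobIds. The run name and wrapped command are hypothetical.
    import subprocess
    import uuid

    RUN_NAME = f"snakemake-{uuid.uuid4().hex[:8]}"

    def submit(shell_command: str) -> str:
        """Submit one wrapped command under the shared run name; return its JobId."""
        out = subprocess.run(
            ["sbatch", "--parsable", f"--job-name={RUN_NAME}",
             "--wrap", shell_command],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        return out.split(";")[0]  # --parsable prints "jobid[;cluster]"

    def cancel_run() -> None:
        """Cancel all remaining jobs of this run by their shared name."""
        subprocess.run(["scancel", "-n", RUN_NAME], check=True)

With `--parsable` the `JobId` is still captured at submission time, so an
explicit id list remains available as well; the shared name is just a
convenient safety net for `scancel`.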

But as one can hopefully see, there are no simple solutions. And to me,
the documentation is not that easy to parse, especially if you are not
already familiar with the terminology, and I have not really found any
best practices regarding how to query for (or otherwise determine) job
status (which is not to say they don't exist, but at least they don't
seem easy to find -- pointers are welcome). I'll try to document whatever
solution I come up with as best as I can, so that others can hopefully
reuse as much as possible in their own contexts. But maybe some publicly
available best practices (and no-gos) for Slurm cluster users would be a
useful resource that cluster admins could then point / link to.

cheers,
david


On Mon, 2023-02-27 at 06:53 -0800, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?"
> on 
> a road trip with kids :)
> 
> Slurm is trying to drive. Any communication to slurmctld will involve
> an 
> RPC call (sinfo, squeue, scontrol, etc). You can see how many with
> sinfo.
> Too many RPC calls will cause failures. Asking slurmdbd will not do
> that 
> to you. In fact, you could have a separate slurmdbd just for queries
> if 
> you wanted. This is why that was suggested as a better option.
> 
> So, even if you run 'squeue' once every few seconds, it would impact
> the 
> system. More so depending on the size of the system. We have had
> that 
> issue with users running 'watch squeue' and had to address it.
> 
> IMHO, the true solution is that if a job's info NEEDS to be updated that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.
> 
> Perhaps there are some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe such
> an API could be created/expanded.
> 
> Just a quick 2 cents (We may be up to a few dollars with all of those
> so 
> far).
> 
> Brian Andrus
> 
> 
> On 2/27/2023 4:24 AM, Ward Poelmans wrote:
> > On 24/02/2023 18:34, David Laehnemann wrote:
> > > Those queries then should not have to happen too often, although
> > > do you
> > > have any indication of a range for when you say "you still
> > > wouldn't
> > > want to query the status too frequently." Because I don't really,
> > > and
> > > would probably opt for some compromise of every 30 seconds or so.
> > 
> > I think this is exactly why hpc sys admins are sometimes not very
> > happy about these tools. You're talking about 1000s of jobs on one
> > hand, yet you want to fetch the status every 30 seconds? What is the
> > point of that other than overloading the scheduler?
> > 
> > We're telling our users not to query Slurm too often and usually
> > give 5 minutes as a good interval. You have to let Slurm do its job.
> > There is no point in querying in a loop every 30 seconds when we're
> > talking about large numbers of jobs.
> > 
> > 
> > Ward




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Bas van der Vlies
We have many jupyterhub jobs on our cluster that also do a lot of job
queries. We could adjust the query interval, but what I did instead is
have a single process query all jobs with `squeue --json`, and the
jupyterhub query script then looks up its job in this output.

So instead of every jupyterhub job querying the batch system, there is
only one process doing so. This is specific to the hub environment, but
if a lot of users run snakemake you hit the same problem.

As an admin I can understand the queries, and it is not only snakemake:
there are plenty of other tools like the hub that also do a lot of
queries. Some kind of caching mechanism is nice; most solve it with a
wrapper script, roughly sketched below.
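
Roughly, the caching pattern looks like this in Python (a sketch only, not
our actual wrapper; the cache path and TTL are arbitrary choices, `--json`
needs a reasonably recent Slurm, and the exact field names in the
`squeue --json` output differ between Slurm versions):

    # One shared cache file is refreshed at most every TTL seconds; all status
    # lookups read from it instead of hitting slurmctld individually.
    import json
    import os
    import subprocess
    import time

    CACHE_FILE = "/tmp/squeue_cache.json"  # hypothetical location
    TTL = 60                               # seconds between real squeue calls

    def _refresh_cache() -> None:
        data = subprocess.run(["squeue", "--me", "--json"],
                              check=True, capture_output=True, text=True).stdout
        with open(CACHE_FILE, "w") as fh:
            fh.write(data)

    def job_state(job_id: int):
        """Return the cached state of one job, refreshing the cache if stale."""
        try:
            stale = time.time() - os.path.getmtime(CACHE_FILE) > TTL
        except FileNotFoundError:
            stale = True
        if stale:
            _refresh_cache()
        with open(CACHE_FILE) as fh:
            jobs = json.load(fh).get("jobs", [])
        for job in jobs:
            if job.get("job_id") == job_id:
                state = job.get("job_state")
                # newer Slurm versions report job_state as a list of flags
                return state[0] if isinstance(state, list) else state
        return None  # job no longer (or not yet) in the queue

A real wrapper would also want some locking so that concurrent lookups
don't all refresh at once, but that is the whole idea in a nutshell.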

Just my 2 cents


On 27/02/2023 15:53, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?" on
> a road trip with kids :)
> 
> Slurm is trying to drive. Any communication to slurmctld will involve an
> RPC call (sinfo, squeue, scontrol, etc). You can see how many with sinfo.
> Too many RPC calls will cause failures. Asking slurmdbd will not do that
> to you. In fact, you could have a separate slurmdbd just for queries if
> you wanted. This is why that was suggested as a better option.
> 
> So, even if you run 'squeue' once every few seconds, it would impact the
> system. More so depending on the size of the system. We have had that
> issue with users running 'watch squeue' and had to address it.
> 
> IMHO, the true solution is that if a job's info NEEDS to be updated that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.
> 
> Perhaps there are some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe such
> an API could be created/expanded.
> 
> Just a quick 2 cents (We may be up to a few dollars with all of those so
> far).
> 
> Brian Andrus
> 
> On 2/27/2023 4:24 AM, Ward Poelmans wrote:
> 
>> On 24/02/2023 18:34, David Laehnemann wrote:
>> 
>>> Those queries then should not have to happen too often, although do you
>>> have any indication of a range for when you say "you still wouldn't
>>> want to query the status too frequently." Because I don't really, and
>>> would probably opt for some compromise of every 30 seconds or so.
>> 
>> I think this is exactly why hpc sys admins are sometimes not very
>> happy about these tools. You're talking about 1000s of jobs on one
>> hand, yet you want to fetch the status every 30 seconds? What is the
>> point of that other than overloading the scheduler?
>> 
>> We're telling our users not to query Slurm too often and usually
>> give 5 minutes as a good interval. You have to let Slurm do its job.
>> There is no point in querying in a loop every 30 seconds when we're
>> talking about large numbers of jobs.
>> 
>> Ward




--
Bas van der Vlies
| High Performance Computing & Visualization | SURF | Science Park 140 |
1098 XG  Amsterdam

| T +31 (0) 20 800 1300  | bas.vandervl...@surf.nl | www.surf.nl |



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Ümit Seren
As a side note:
In Slurm 23.x a new rate limiting feature for client RPC calls was added:
(see this commit:
https://github.com/SchedMD/slurm/commit/674f118140e171d10c2501444a0040e1492f4eab#diff-b4e84d09d9b1d817a964fb78baba0a2ea6316bfc10c1405329a95ad0353ca33e
)
This would give operators the ability to limit the negative effect of
workflow managers on the scheduler.


On Mon, Feb 27, 2023 at 4:57 PM Davide DelVento 
wrote:

> > > And if you are seeing a workflow management system causing trouble on
> > > your system, probably the most sustainable way of getting this resolved
> > > is to file issues or pull requests with the respective project, with
> > > suggestions like the ones you made. For snakemake, a second good point
> > > to currently chime in, would be the issue discussing Slurm job array
> > > support: https://github.com/snakemake/snakemake/issues/301
> >
> > I have to disagree here.  I think the onus is on the people in a given
> > community to ensure that their software behaves well on the systems they
> > want to use, not on the operators of those systems.  Those of us running
> > HPC systems often have to deal with a very large range of different
> > pieces of software and time and personnel are limited.  If some program
> > used by only a subset of the users is causing disruption, then it
> > already costs us time and energy to mitigate those effects.  Even if I
> > had the appropriate skill set, I don't see myself writing many
> > patches for workflow managers any time soon.
>
> As someone who has worked in both roles (and to a degree still is) and
> therefore can better understand the perspective from both parties, I
> side more with David than with Loris here.
>
> Yes, David wrote "or pull requests", but that's an OR.
>
> Loris, if you know or experience a problem, it takes close to zero
> time to file a bug report educating the author of the software about
> the problem (or pointing them to places where they can educate
> themselves). Otherwise they will never know about it, they will never
> fix it, and potentially they think it's fine and will make the problem
> worse. Yes, you could alternatively forbid the use of the problematic
> software on the machine (I've done that on our systems), but users
> with those needs will find ways to create the very same problem, and
> perhaps worse, in other ways (they have done it on our system). Yes,
> time is limited, and as operators of HPC systems we often don't have
> the time to understand all the nuances and needs of all the users, but
> that's not the point I am advocating. In fact it does seem to me that
> David is putting the onus on himself and his community to make the
> software behave correctly, and he is trying to educate himself about
> what "correct" is like. So just give him the input he's looking for,
> both here and (if and when snakemake causes troubles on your system)
> by opening tickets on that repo, explaining the problem (definitely
> not writing a PR for you, sorry David)
>
>


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Davide DelVento
> > And if you are seeing a workflow management system causing trouble on
> > your system, probably the most sustainable way of getting this resolved
> > is to file issues or pull requests with the respective project, with
> > suggestions like the ones you made. For snakemake, a second good point
> > to currently chime in, would be the issue discussing Slurm job array
> > support: https://github.com/snakemake/snakemake/issues/301
>
> I have to disagree here.  I think the onus is on the people in a given
> community to ensure that their software behaves well on the systems they
> want to use, not on the operators of those systems.  Those of us running
> HPC systems often have to deal with a very large range of different
> pieces of software and time and personnel are limited.  If some program
> used by only a subset of the users is causing disruption, then it
> already costs us time and energy to mitigate those effects.  Even if I
> had the appropriate skill set, I don't see myself writing many
> patches for workflow managers any time soon.

As someone who has worked in both roles (and to a degree still is) and
therefore can better understand the perspective from both parties, I
side more with David than with Loris here.

Yes, David wrote "or pull requests", but that's an OR.

Loris, if you know or experience a problem, it takes close to zero
time to file a bug report educating the author of the software about
the problem (or pointing them to places where they can educate
themselves). Otherwise they will never know about it, they will never
fix it, and potentially they think it's fine and will make the problem
worse. Yes, you could alternatively forbid the use of the problematic
software on the machine (I've done that on our systems), but users
with those needs will find ways to create the very same problem, and
perhaps worse, in other ways (they have done it on our system). Yes,
time is limited, and as operators of HPC systems we often don't have
the time to understand all the nuances and needs of all the users, but
that's not the point I am advocating. In fact it does seem to me that
David is putting the onus on himself and his community to make the
software behave correctly, and he is trying to educate himself about
what "correct" is like. So just give him the input he's looking for,
both here and (if and when snakemake causes troubles on your system)
by opening tickets on that repo, explaining the problem (definitely
not writing a PR for you, sorry David)



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Loris Bennett
Hi David,

David Laehnemann  writes:

> Dear Ward,
>
> if used correctly (and that is a big caveat for any method for
> interacting with a cluster system), snakemake will only submit as many
> jobs as can fit within the resources of the cluster at one point of
> time (or however many resources you tell snakemake that it can use). So
> unless there are thousands of cores available (or you "lie" to
> snakemake, telling it that there are many more cores than actually
> exist), it will only ever submit hundreds of jobs (or a lot fewer, if
> the jobs each require multiple cores). Accordingly, any queries will
> also only be for this number of jobs that snakemake currently has
> submitted. And snakemake will only submit new jobs, once it registers
> previously submitted jobs as finished.
>
> So workflow managers can actually help reduce the strain on the
> scheduler, by only ever submitting stuff within the general limits of
> the system (as opposed to, for example, using some bash loop to just
> submit all of your analysis steps or samples at once).

I don't see this as a particular advantage for the scheduler.  If the
maximum number of jobs a user can submit is limited to, say, 5000, then it
makes
no difference whether these 5000 jobs are generated by snakemake or a
batch script.  On our system strain tends mainly to occur when many
similar jobs fail immediately after they have started.

How does snakemake behave in such a situation?  If the job database is
already clogged up trying to record too many jobs completing within too
short a time, snakemake querying the database at that moment and maybe
starting more jobs (because others have failed and thus completed) could
potentially exacerbate the problem.

> And for example,
> snakemake has a mechanism to batch a number of smaller jobs into larger
> jobs for submission on the cluster, so this might be something to
> suggest to your users that cause trouble through using snakemake
> (especially the `--group-components` mechanism):
> https://snakemake.readthedocs.io/en/latest/executing/grouping.html

This seems to me, from the perspective of an operator, to be the main
advantage.

> The query mechanism for job status is a different story. And I'm
> specifically here on this mailing list to get as much input as possible
> to improve this -- and welcome anybody who wants to chime in on my
> respective work-in-progress pull request right here:
> https://github.com/snakemake/snakemake/pull/2136
>
> And if you are seeing a workflow management system causing trouble on
> your system, probably the most sustainable way of getting this resolved
> is to file issues or pull requests with the respective project, with
> suggestions like the ones you made. For snakemake, a second good point
> to currently chime in, would be the issue discussing Slurm job array
> support: https://github.com/snakemake/snakemake/issues/301

I have to disagree here.  I think the onus is on the people in a given
community to ensure that their software behaves well on the systems they
want to use, not on the operators of those systems.  Those of us running
HPC systems often have to deal with a very large range of different
pieces of software and time and personnel are limited.  If some program
used by only a subset of the users is causing disruption, then it
already costs us time and energy to mitigate those effects.  Even if I
had the appropriate skill set, I don't see myself writing many
patches for workflow managers any time soon.

Cheers,

Loris

> And for Nextflow, another commonly used workflow manager in my field
> (bioinformatics), there's also an issue discussing Slurm job array
> support:
> https://github.com/nextflow-io/nextflow/issues/1477
>
> cheers,
> david
>
>
> On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
>> On 24/02/2023 18:34, David Laehnemann wrote:
>> > Those queries then should not have to happen too often, although do
>> > you
>> > have any indication of a range for when you say "you still wouldn't
>> > want to query the status too frequently." Because I don't really,
>> > and
>> > would probably opt for some compromise of every 30 seconds or so.
>> 
>> I think this is exactly why hpc sys admins are sometimes not very
>> happy about these tools. You're talking about 1000s of jobs on one
>> hand, yet you want to fetch the status every 30 seconds? What is the
>> point of that other than overloading the scheduler?
>> 
>> We're telling our users not to query Slurm too often and usually
>> give 5 minutes as a good interval. You have to let Slurm do its job.
>> There is no point in querying in a loop every 30 seconds when we're
>> talking about large numbers of jobs.
>> 
>> 
>> Ward
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
Dear Ward,

if used correctly (and that is a big caveat for any method for
interacting with a cluster system), snakemake will only submit as many
jobs as can fit within the resources of the cluster at one point of
time (or however many resources you tell snakemake that it can use). So
unless there are thousands of cores available (or you "lie" to
snakemake, telling it that there are many more cores than actually
exist), it will only ever submit hundreds of jobs (or a lot fewer, if
the jobs each require multiple cores). Accordingly, any queries will
also only be for this number of jobs that snakemake currently has
submitted. And snakemake will only submit new jobs, once it registers
previously submitted jobs as finished.

So workflow managers can actually help reduce the strain on the
scheduler, by only ever submitting stuff within the general limits of
the system (as opposed to, for example, using some bash loop to just
submit all of your analysis steps or samples at once). And for example,
snakemake has a mechanism to batch a number of smaller jobs into larger
jobs for submission on the cluster, so this might be something to
suggest to your users who cause trouble through using snakemake
(especially the `--group-components` mechanism):
https://snakemake.readthedocs.io/en/latest/executing/grouping.html

The query mechanism for job status is a different story. And I'm
specifically here on this mailing list to get as much input as possible
to improve this -- and welcome anybody who wants to chime in on my
respective work-in-progress pull request right here:
https://github.com/snakemake/snakemake/pull/2136

And if you are seeing a workflow management system causing trouble on
your system, probably the most sustainable way of getting this resolved
is to file issues or pull requests with the respective project, with
suggestions like the ones you made. For snakemake, a second good point
to currently chime in, would be the issue discussing Slurm job array
support: https://github.com/snakemake/snakemake/issues/301

And for Nextflow, another commonly used workflow manager in my field
(bioinformatics), there's also an issue discussing Slurm job array
support:
https://github.com/nextflow-io/nextflow/issues/1477

cheers,
david


On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
> On 24/02/2023 18:34, David Laehnemann wrote:
> > Those queries then should not have to happen too often, although do
> > you
> > have any indication of a range for when you say "you still wouldn't
> > want to query the status too frequently." Because I don't really,
> > and
> > would probably opt for some compromise of every 30 seconds or so.
> 
> I think this is exactly why hpc sys admins are sometimes not very
> happy about these tools. You're talking about 1000s of jobs on one
> hand, yet you want to fetch the status every 30 seconds? What is the
> point of that other than overloading the scheduler?
> 
> We're telling our users not to query Slurm too often and usually
> give 5 minutes as a good interval. You have to let Slurm do its job.
> There is no point in querying in a loop every 30 seconds when we're
> talking about large numbers of jobs.
> 
> 
> Ward




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Ward Poelmans

On 24/02/2023 18:34, David Laehnemann wrote:

> Those queries then should not have to happen too often, although do you
> have any indication of a range for when you say "you still wouldn't
> want to query the status too frequently." Because I don't really, and
> would probably opt for some compromise of every 30 seconds or so.


I think this is exactly why hpc sys admins are sometimes not very happy about
these tools. You're talking about 1000s of jobs on one hand yet you want to
fetch the status every 30 seconds? What is the point of that other than
overloading the scheduler?

We're telling our users not to query Slurm too often and usually give 5
minutes as a good interval. You have to let Slurm do its job. There is no
point in querying in a loop every 30 seconds when we're talking about large
numbers of jobs.


Ward




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
Hi Chris, hi Sean,

thanks also (and thanks again) for chiming in.

Quick follow-up question:

Would `squeue` be a better fall-back command than `scontrol` from the
perspective of keeping `slurmctld` responsive? From what I can see in
the general overview of how slurm works (
https://slurm.schedmd.com/overview.html), both query `slurmctld`. But
would one be "better" than the other, as in generating less work for
`slurmctld`? Or will it roughly be an equivalent amount of work, so
that we can rather see which set of command-line arguments better suits
our needs?

Also, just as a quick heads-up: I am documenting your input by linking
to the mailing list archives, I hope that's alright for you?
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467

cheers,
david


On Sat, 2023-02-25 at 10:51 -0800, Chris Samuel wrote:
> On 23/2/23 2:55 am, David Laehnemann wrote:
> 
> > And consequently, would using `scontrol` thus be the better default
> > option (as opposed to `sacct`) for repeated job status checks by a
> > workflow management system?
> 
> Many others have commented on this, but use of scontrol in this way
> is 
> really really bad because of the impact it has on slurmctld. This is 
> because responding to the RPC (IIRC) requires taking read locks on 
> internal data structures and on a large, busy system (like ours, we 
> recently rolled over slurm job IDs back to 1 after ~6 years of
> operation 
> and run at over 90% occupancy most of the time) this can really
> damage 
> scheduling performance.
> 
> We've had numerous occasions where we've had to track down users
> abusing 
> scontrol in this way and redirect them to use sacct instead.
> 
> We already use the cli filter abilities in Slurm to impose a form of 
> rate limiting on RPCs from other commands, but unfortunately scontrol
> is 
> not covered by that.
> 
> All the best,
> Chris




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-25 Thread Chris Samuel

On 23/2/23 2:55 am, David Laehnemann wrote:


> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?


Many others have commented on this, but use of scontrol in this way is 
really really bad because of the impact it has on slurmctld. This is 
because responding to the RPC (IIRC) requires taking read locks on 
internal data structures and on a large, busy system (like ours, we 
recently rolled over slurm job IDs back to 1 after ~6 years of operation 
and run at over 90% occupancy most of the time) this can really damage 
scheduling performance.


We've had numerous occasions where we've had to track down users abusing 
scontrol in this way and redirect them to use sacct instead.


We already use the cli filter abilities in Slurm to impose a form of 
rate limiting on RPCs from other commands, but unfortunately scontrol is 
not covered by that.


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-24 Thread Sean Maxwell
Hi David,

> Those queries then should not have to happen too often, although do you
> have any indication of a range for when you say "you still wouldn't
> want to query the status too frequently." Because I don't really, and
> would probably opt for some compromise of every 30 seconds or so.
>

Every 30 seconds sounds reasonable. My cautioning was only in the sense
that everything has limitations. For example, the query processing time is
dependent on the size of the query and the overall load on the system, so
any static interval you select may not work well under some conditions. You
might want to defend against that by making the interval adaptive, like the
maximum of 30s or 5x the execution time of the last query, so that it
adapts to the overall burden of the query and the system load. That is just
an example to try and communicate what I was getting at.
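
As a minimal sketch of that adaptive interval in Python (the 30 s floor and
the 5x factor are just the example numbers above, and
`query_all_job_states` is a hypothetical stand-in for whatever batched
status query ends up being used):

    # Sketch: poll at an interval that adapts to how long the status query
    # itself takes, so an overloaded slurmdbd automatically gets queried
    # less often. query_all_job_states is a hypothetical batched query.
    import time

    MIN_INTERVAL = 30.0   # seconds, the static lower bound
    LOAD_FACTOR = 5.0     # wait at least 5x the duration of the last query

    def poll_loop(query_all_job_states, handle_states):
        while True:
            start = time.monotonic()
            states = query_all_job_states()   # one batched query for all jobs
            elapsed = time.monotonic() - start
            handle_states(states)
            time.sleep(max(MIN_INTERVAL, LOAD_FACTOR * elapsed))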


> One thing I didn't understand from your eMail is the part about job
> names, as the command I gave doesn't use job names for its query:
>
> > sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...
>
> Instead, it just uses the JobId, and isn't that guaranteed to be unique
> at any point in time? Or were you meaning to say that JobId can be non-
> unique? That would indeed spell trouble on a different level, and make
> status checks much more complicated...
>

Job id is unique. What I mean is, building a CSV list of jobs has
scalability issues. If you could assign the same job name to each job in
the snakemake pipeline, then the query is much shorter, and still returns
the status for each job id that snakemake has launched. Rather than falling
back to scontrol (which doesn't support querying by name), snakemake could
fall back to squeue, which does support querying by name; for example:
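
(A rough sketch, assuming the shared job name "foo" as above; these helpers
are illustrations built on `sacct --name` and `squeue --name`, not
snakemake code.)

    # Query all jobs of one pipeline run by their shared name instead of
    # building a long -j <jobid> list. "foo" is the hypothetical shared name.
    import subprocess

    def states_by_name(job_name: str = "foo") -> dict:
        """Map JobIdRaw -> State for every job carrying job_name (via sacct)."""
        out = subprocess.run(
            ["sacct", "-X", "-P", "-n", "--format=JobIdRaw,State",
             f"--name={job_name}"],
            check=True, capture_output=True, text=True,
        ).stdout
        return dict(line.split("|", 1) for line in out.splitlines() if line)

    def queued_or_running_by_name(job_name: str = "foo") -> dict:
        """Fallback via squeue for jobs that sacct does not know about yet."""
        out = subprocess.run(
            ["squeue", "-h", "-o", "%i|%T", f"--name={job_name}"],
            check=True, capture_output=True, text=True,
        ).stdout
        return dict(line.split("|", 1) for line in out.splitlines() if line)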

Best,

-Sean


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-24 Thread David Laehnemann
Hi Sean,

Thanks again for all the feedback!

I'll definitely try to implement batch queries, then. Both for the
default `sacct` query and for the fallback `scontrol` query. Also see
here:
https://github.com/snakemake/snakemake/pull/2136#issuecomment-1443295051

Those queries then should not have to happen too often, although do you
have any indication of a range for when you say "you still wouldn't
want to query the status too frequently." Because I don't really, and
would probably opt for some compromise of every 30 seconds or so.


One thing I didn't understand from your email is the part about job
names, as the command I gave doesn't use job names for its query:

sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...

Instead, it just uses the JobId, and isn't that guaranteed to be unique
at any point in time? Or were you meaning to say that JobId can be non-
unique? That would indeed spell trouble on a different level, and make
status checks much more complicated...

cheers,
david


On Thu, 2023-02-23 at 11:59 -0500, Sean Maxwell wrote:
> Hi David,
> 
> On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann <
> david.laehnem...@hhu.de>
> wrote:
> 
> > But from your comment I understand that handling these queries in
> > batches would be less work for slurmdbd, right? So instead of
> > querying
> > each jobid with a separate database query, it would do one database
> > query for the whole list? Is that really easier for the system, or
> > would it end up doing a call for each jobid, anyway?
> > 
> 
> From the perspective of avoiding RPC flood, it is much better to use
> a
> batch query. That said, if you have an extremely large number of jobs
> in
> the queue, you still wouldn't want to query the status too
> frequently.
> 
> 
> > And just to be as clear as possible, a call to sacct would then
> > look
> > like this:
> > sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...
> > 
> 
> That would be one way to do it, but I think there are other
> approaches that
> might be better. For example, there is no requirement for the job
> name to
> be unique. So if the snakemake pipeline has a configurable instance
> name="foo", and snakemake was configured to specify its own name as
> the job
> when submitting jobs (e.g. sbatch -J foo ...) then the query for all
> jobs
> in the pipeline is simply:
> 
> sacct --name=foo
> 
> Because we can of course rewrite the respective code section, so any
> > insight on how to do this job accounting more efficiently (and
> > better
> > tailored to how Slurm does things) is appreciated.
> > 
> 
> I appreciate that you are interested in improving the integration to
> make
> it more performant. We are seeing an increase in meta-scheduler use
> at our
> site, so this is a worthwhile problem to tackle.
> 
> Thanks,
> 
> -Sean




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Sean Maxwell
Hi David,

On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann 
wrote:

> But from your comment I understand that handling these queries in
> batches would be less work for slurmdbd, right? So instead of querying
> each jobid with a separate database query, it would do one database
> query for the whole list? Is that really easier for the system, or
> would it end up doing a call for each jobid, anyway?
>

From the perspective of avoiding RPC flood, it is much better to use a
batch query. That said, if you have an extremely large number of jobs in
the queue, you still wouldn't want to query the status too frequently.


> And just to be as clear as possible, a call to sacct would then look
> like this:
> sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...
>

That would be one way to do it, but I think there are other approaches that
might be better. For example, there is no requirement for the job name to
be unique. So if the snakemake pipeline has a configurable instance
name="foo", and snakemake was configured to specify its own name as the job
when submitting jobs (e.g. sbatch -J foo ...) then the query for all jobs
in the pipeline is simply:

sacct --name=foo

Because we can of course rewrite the respective code section, so any
> insight on how to do this job accounting more efficiently (and better
> tailored to how Slurm does things) is appreciated.
>

I appreciate that you are interested in improving the integration to make
it more performant. We are seeing an increase in meta-scheduler use at our
site, so this is a worthwhile problem to tackle.

Thanks,

-Sean


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread David Laehnemann
Hi Sean,

yes, this is exactly what snakemake currently does. I didn't write that
code, but from my previous debugging, I think handling one job at a
time was simply the logic of the general executor for cluster systems,
and makes things like querying via scontrol as a fallback easier to
handle. But this is not set in stone.

But from your comment I understand that handling these queries in
batches would be less work for slurmdbd, right? So instead of querying
each jobid with a separate database query, it would do one database
query for the whole list? Is that really easier for the system, or
would it end up doing a call for each jobid, anyway?

And just to be as clear as possible, a call to sacct would then look
like this:
sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...

Because we can of course rewrite the respective code section, so any
insight on how to do this job accounting more efficiently (and better
tailored to how Slurm does things) is appreciated.

cheers,
david


On Thu, 2023-02-23 at 09:46 -0500, Sean Maxwell wrote:
> Hi David,
> 
> On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann <
> david.laehnem...@hhu.de>
> wrote:
> 
> > Quick follow-up question: do you have any indication of the rate of
> > job
> > status checks via sacct that slurmdbd will gracefully handle (per
> > second)? Or any suggestions how to roughly determine such a rate
> > for a
> > given cluster system?
> > 
> 
> I looked at your PR for context, and this line of snakemake looks
> problematic (I know this isn't part of your PR, it is part of the
> original
> code)
> https://github.com/snakemake/snakemake/commit/a0f04bab08113196fe1616a621bd6bf20fc05688#diff-d1b47826c1fc35806df72508e2f5e7f1d0424f9b2f7b9124810b051f5fe97f1bL296
> :
> 
> sacct_cmd = f"sacct -P -n --format=JobIdRaw,State -j {jobid}"
> 
> Since jobid is an int, this looks like snakemake will individually probe
> each Slurm job it has launched. If snakemake was using batch logic to
> gather status for all your running jobs with one call to sacct, then
> you
> could probably set the interval low. But it looks like it is going to
> probe
> each job individually by ID, so it will make as many RPC calls as
> there are
> jobs in the pipeline when it is time to check the status.
> 
> I could be wrong, but this is how I evaluated the code without going
> farther upstream.
> 
> Best,
> 
> -Sean




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Sean Maxwell
Hi David,

On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann 
wrote:

> Quick follow-up question: do you have any indication of the rate of job
> status checks via sacct that slurmdbd will gracefully handle (per
> second)? Or any suggestions how to roughly determine such a rate for a
> given cluster system?
>

I looked at your PR for context, and this line of snakemake looks
problematic (I know this isn't part of your PR, it is part of the original
code)
https://github.com/snakemake/snakemake/commit/a0f04bab08113196fe1616a621bd6bf20fc05688#diff-d1b47826c1fc35806df72508e2f5e7f1d0424f9b2f7b9124810b051f5fe97f1bL296
:

sacct_cmd = f"sacct -P -n --format=JobIdRaw,State -j {jobid}"

Since jobid is an int, this looks like snakemake will individually probe
each Slurm job it has launched. If snakemake was using batch logic to
gather status for all your running jobs with one call to sacct, then you
could probably set the interval low. But it looks like it is going to probe
each job individually by ID, so it will make as many RPC calls as there are
jobs in the pipeline when it is time to check the status.

I could be wrong, but this is how I evaluated the code without going
farther upstream.
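
A batched variant might look roughly like this (a sketch only, not
snakemake's code; it simply joins the ids into a single -j argument using
the same format flags as the line quoted above):

    # One sacct call for all currently submitted jobs instead of one RPC per job.
    import subprocess

    def batched_job_states(job_ids):
        """Return {JobIdRaw: State} for all job_ids with a single sacct query."""
        if not job_ids:
            return {}
        out = subprocess.run(
            ["sacct", "-X", "-P", "-n", "--format=JobIdRaw,State",
             "-j", ",".join(str(j) for j in job_ids)],
            check=True, capture_output=True, text=True,
        ).stdout
        return dict(line.split("|", 1) for line in out.splitlines() if line)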

Best,

-Sean


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Loris Bennett
Hi David,

David Laehnemann  writes:

[snip (16 lines)]

> P.S.: @Loris and @Noam: Exactly, snakemake is a software distinct from
> slurm that you can use to orchestrate large analysis workflows---on
> anything from a desktop or laptop computer to all kinds of cluster /
> cloud systems. In the case of Slurm it will submit each analysis step
> on a particular sample as a separate job, specifying the resources it
> needs. The scheduler then handles it from there. But because you can
> have (hundreds of) thousands of jobs, and with dependencies among them,
> you can't just submit everything all at once, but have to keep track of
> where you are at. And make sure you don't submit much more than the
> system can handle at any time, so you don't overwhelm the Slurm queue.

[snip (86 lines)]

I know what Snakemake and other workflow managers, such as Nextflow are
for, but my maybe ill-informed impression is that, while something of
this sort is obviously needed to manage complex dependencies, the
current solutions, probably because they originated outside the HPC
context, try to do too much.  You say Snakemake helps

  make sure you don't submit much more than the system can handle

but that in my view should not be necessary.  Slurm has configuration
parameters which can be set to limit the number of jobs a user can
submit and/or run.  And when it comes to submitting (hundreds of)
thousands of jobs, Nextflow for example currently can't create job
arrays, and so generates large numbers of jobs with identical resource
requirements, which can prevent backfill from working properly.
Skimming the documentation for Snakemake, I also could not find any
reference to Slurm job arrays, so this could also be an issue.
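
To make the array point concrete, a minimal sketch of what a submitter
could do instead of issuing N nearly identical sbatch calls (`worker.sh` is
a hypothetical per-sample script that reads SLURM_ARRAY_TASK_ID; nothing
here is taken from Snakemake or Nextflow):

    import subprocess

    N_SAMPLES = 500

    # Instead of N individual, nearly identical submissions ...
    # for i in range(N_SAMPLES):
    #     subprocess.run(["sbatch", "worker.sh", str(i)], check=True)

    # ... submit one job array, here throttled to 50 concurrently running tasks:
    subprocess.run(["sbatch", f"--array=0-{N_SAMPLES - 1}%50", "worker.sh"],
                   check=True)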

Just my slightly grumpy 2¢.

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread David Laehnemann
Hi Sean, hi everybody,

thanks a lot for the quick insights!

My takeaway is: sacct is the better default for putting in lots of job
status checks after all, as it will not impact the slurmctld scheduler.

Quick follow-up question: do you have any indication of the rate of job
status checks via sacct that slurmdbd will gracefully handle (per
second)? Or any suggestions how to roughly determine such a rate for a
given cluster system?

cheers,
david


P.S.: @Loris and @Noam: Exactly, snakemake is a software distinct from
slurm that you can use to orchestrate large analysis workflows---on
anything from a desktop or laptop computer to all kinds of cluster /
cloud systems. In the case of Slurm it will submit each analysis step
on a particular sample as a separate job, specifying the resources it
needs. The scheduler then handles it from there. But because you can
have (hundreds of) thousands of jobs, and with dependencies among them,
you can't just submit everything all at once, but have to keep track of
where you are at. And make sure you don't submit much more than the
system can handle at any time, so you don't overwhelm the Slurm queue.



On Thu, 2023-02-23 at 07:55 -0500, Sean Maxwell wrote:
> Hi David,
> 
> scontrol - interacts with slurmctld using RPC, so it is faster, but
> requests put load on the scheduler itself.
> sacct - interacts with slurmdbd, so it doesn't place additional load
> on the
> scheduler.
> 
> There is a balance to reach, but the scontrol approach is riskier and
> can
> start to interfere with the cluster operation if used incorrectly.
> 
> Best,
> 
> -Sean
> 
> On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann <
> david.laehnem...@hhu.de>
> wrote:
> 
> > Dear Slurm users and developers,
> > 
> > TL;DR:
> > Do any of you know if `scontrol` status checks of jobs are always
> > expected to be quicker than `sacct` job status checks? Do you have
> > any
> > comparative timings between the two commands?
> > And consequently, would using `scontrol` thus be the better default
> > option (as opposed to `sacct`) for repeated job status checks by a
> > workflow management system?
> > 
> > 
> > And here's the long version with background infos and linkouts:
> > 
> > I have recently started using a Slurm cluster and am a regular user
> > of
> > the workflow management system snakemake (
> > https://snakemake.readthedocs.io/en/latest/). This workflow manager
> > recently integrated support for running analysis workflows pretty
> > seamlessly on Slurm clusters. It takes care of managing all job
> > dependencies and handles the submission of jobs according to your
> > global
> > (and job-specific) resource configurations.
> > 
> > One little hiccup when starting to use the snakemake-Slurm
> > combination
> > was a snakemake-internal rate-limitation for checking job statuses.
> > You
> > can find the full story here:
> > https://github.com/snakemake/snakemake/pull/2136
> > 
> > For debugging this, I obtained timings on `sacct` and `scontrol`,
> > with
> > `scontrol` consistently about 2.5x quicker in returning the job
> > status
> > when compared to `sacct`. Timings are recorded here:
> > 
> > https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
> > 
> > However, currently `sacct` is used for regularly checking the
> > status of
> > submitted jobs per default, and `scontrol` is only a fallback
> > whenever
> > `sacct` doesn't find the job (for example because it is not yet
> > running). Now, I was wondering if switching the default to
> > `scontrol`
> > would make sense. Thus, I would like to ask:
> > 
> > 1) Slurm users, whether they also have similar timings on different
> > Slurm clusters and whether those confirm that `scontrol` is
> > consistently quicker?
> > 
> > 2) Slurm developers, whether `scontrol` is expected to be quicker
> > from
> > its implementation and whether using `scontrol` would also be the
> > option that puts less strain on the scheduler in general?
> > 
> > Many thanks and best regards,
> > David
> > 
> > 
> > 




Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Sean Maxwell
Hi David,

scontrol - interacts with slurmctld using RPC, so it is faster, but
requests put load on the scheduler itself.
sacct - interacts with slurmdbd, so it doesn't place additional load on the
scheduler.

There is a balance to reach, but the scontrol approach is riskier and can
start to interfere with the cluster operation if used incorrectly.
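
In code, that balance might look roughly like the following sketch (not
what snakemake currently does): ask slurmdbd first via sacct, and only fall
back to an scontrol RPC for jobs the accounting database does not know
about yet.

    import subprocess

    def job_state(job_id: str):
        """Return the job's state string, preferring sacct over scontrol."""
        out = subprocess.run(
            ["sacct", "-X", "-P", "-n", "--format=State", "-j", job_id],
            capture_output=True, text=True,
        ).stdout.strip()
        if out:
            return out.splitlines()[0]
        # Fallback: one RPC to slurmctld; use sparingly, as discussed above.
        res = subprocess.run(["scontrol", "-o", "show", "job", job_id],
                             capture_output=True, text=True)
        if res.returncode != 0:
            return None  # unknown to both sacct and slurmctld
        for field in res.stdout.split():
            if field.startswith("JobState="):
                return field.split("=", 1)[1]
        return None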

Best,

-Sean

On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann 
wrote:

> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always
> expected to be quicker than `sacct` job status checks? Do you have any
> comparative timings between the two commands?
> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?
>
>
> And here's the long version with background infos and linkouts:
>
> I have recently started using a Slurm cluster and am a regular user of
> the workflow management system snakemake (
> https://snakemake.readthedocs.io/en/latest/). This workflow manager
> recently integrated support for running analysis workflows pretty
> seamlessly on Slurm clusters. It takes care of managing all job
> dependencies and handles the submission of jobs according to your global
> (and job-specific) resource configurations.
>
> One little hiccup when starting to use the snakemake-Slurm combination
> was a snakemake-internal rate-limitation for checking job statuses. You
> can find the full story here:
> https://github.com/snakemake/snakemake/pull/2136
>
> For debugging this, I obtained timings on `sacct` and `scontrol`, with
> `scontrol` consistently about 2.5x quicker in returning the job status
> when compared to `sacct`. Timings are recorded here:
>
> https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
>
> However, currently `sacct` is used for regularly checking the status of
> submitted jobs per default, and `scontrol` is only a fallback whenever
> `sacct` doesn't find the job (for example because it is not yet
> running). Now, I was wondering if switching the default to `scontrol`
> would make sense. Thus, I would like to ask:
>
> 1) Slurm users, whether they also have similar timings on different
> Slurm clusters and whether those confirm that `scontrol` is
> consistently quicker?
>
> 2) Slurm developers, whether `scontrol` is expected to be quicker from
> its implementation and whether using `scontrol` would also be the
> option that puts less strain on the scheduler in general?
>
> Many thanks and best regards,
> David
>
>
>


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Bernstein, Noam CIV USN NRL (6393) Washington DC (USA)
On Feb 23, 2023, at 7:40 AM, Loris Bennett <loris.benn...@fu-berlin.de> wrote:

> Hi David,
> 
> David Laehnemann <david.laehnem...@hhu.de> writes:
> 
>> [...] by a
>> workflow management system?
> 
> I am probably being a bit naive, but I would have thought that the batch
> system should just be able to start your jobs when resources become
> available.  Why do you need to check the status of jobs?  I would tend
> to think that it is not something users should be doing.

"workflow management system" generally means some other piece of software that 
submits jobs as needed to complete some task.  It might need to know how 
current jobs are doing (running yet, completed, etc) to decide what to submit 
next. I assume that's the use case here.


Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread Loris Bennett
Hi David,

David Laehnemann  writes:

> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always
> expected to be quicker than `sacct` job status checks? Do you have any
> comparative timings between the two commands?
> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?

I am probably being a bit naive, but I would have thought that the batch
system should just be able to start your jobs when resources become
available.  Why do you need to check the status of jobs?  I would tend
to think that it is not something users should be doing.

Cheers,

Loris

> And here's the long version with background infos and linkouts:
>
> I have recently started using a Slurm cluster and am a regular user of
> the workflow management system snakemake (
> https://snakemake.readthedocs.io/en/latest/). This workflow manager
> recently integrated support for running analysis workflows pretty
> seamlessly on Slurm clusters. It takes care of managing all job
> dependencies and handles the submission of jobs according to your global
> (and job-specific) resource configurations.
>
> One little hiccup when starting to use the snakemake-Slurm combination
> was a snakemake-internal rate-limitation for checking job statuses. You
> can find the full story here:
> https://github.com/snakemake/snakemake/pull/2136
>
> For debugging this, I obtained timings on `sacct` and `scontrol`, with
> `scontrol` consistently about 2.5x quicker in returning the job status
> when compared to `sacct`. Timings are recorded here:
> https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
>
> However, currently `sacct` is used for regularly checking the status of
> submitted jobs per default, and `scontrol` is only a fallback whenever
> `sacct` doesn't find the job (for example because it is not yet
> running). Now, I was wondering if switching the default to `scontrol`
> would make sense. Thus, I would like to ask:
>
> 1) Slurm users, whether they also have similar timings on different
> Slurm clusters and whether those confirm that `scontrol` is
> consistently quicker?
>
> 2) Slurm developers, whether `scontrol` is expected to be quicker from
> its implementation and whether using `scontrol` would also be the
> option that puts less strain on the scheduler in general?
>
> Many thanks and best regards,
> David
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin



[slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread David Laehnemann
Dear Slurm users and developers,

TL;DR:
Do any of you know if `scontrol` status checks of jobs are always
expected to be quicker than `sacct` job status checks? Do you have any
comparative timings between the two commands?
And consequently, would using `scontrol` thus be the better default
option (as opposed to `sacct`) for repeated job status checks by a
workflow management system?


And here's the long version with background infos and linkouts:

I have recently started using a Slurm cluster and am a regular user of
the workflow management system snakemake (
https://snakemake.readthedocs.io/en/latest/). This workflow manager
recently integrated support for running analysis workflows pretty
seamlessly on Slurm clusters. It takes care of managing all job
dependencies and handles the submission of jobs according to your global
(and job-specific) resource configurations.

One little hiccup when starting to use the snakemake-Slurm combination
was a snakemake-internal rate-limitation for checking job statuses. You
can find the full story here:
https://github.com/snakemake/snakemake/pull/2136

For debugging this, I obtained timings on `sacct` and `scontrol`, with
`scontrol` consistently about 2.5x quicker in returning the job status
when compared to `sacct`. Timings are recorded here:
https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
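
For anyone who wants to reproduce such a comparison on their own cluster, a
minimal timing sketch (the job id is a placeholder, and single-shot timings
will of course fluctuate with controller and database load):

    import subprocess
    import time

    JOBID = "12345"  # placeholder: use a job id that exists on your cluster

    def timed(cmd):
        start = time.perf_counter()
        subprocess.run(cmd, capture_output=True, text=True, check=True)
        return time.perf_counter() - start

    print("sacct:   ", timed(["sacct", "-X", "-P", "-n",
                              "--format=JobIdRaw,State", "-j", JOBID]))
    print("scontrol:", timed(["scontrol", "-o", "show", "job", JOBID]))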

However, currently `sacct` is used for regularly checking the status of
submitted jobs per default, and `scontrol` is only a fallback whenever
`sacct` doesn't find the job (for example because it is not yet
running). Now, I was wondering if switching the default to `scontrol`
would make sense. Thus, I would like to ask:

1) Slurm users, whether they also have similar timings on different
Slurm clusters and whether those confirm that `scontrol` is
consistently quicker?

2) Slurm developers, whether `scontrol` is expected to be quicker from
its implementation and whether using `scontrol` would also be the
option that puts less strain on the scheduler in general?

Many thanks and best regards,
David