Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 27/2/23 03:34, David Laehnemann wrote:
> Hi Chris, hi Sean, thanks also (and thanks again) for chiming in.

Hiya! No worries.

> Quick follow-up question: Would `squeue` be a better fall-back command
> than `scontrol` from the perspective of keeping `slurmctld` responsive?

Sadly not. Whilst a site can do some tricks to enforce rate limiting on squeue via the cli_filter, that doesn't mean other sites have that set up, so they are vulnerable to the same issue.

> Also, just as a quick heads-up: I am documenting your input by linking
> to the mailing list archives, I hope that's alright for you?
> https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467

No problem - but I would say it's got to be sacct.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 27/2/23 06:53, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?"
> on a road trip with kids. Slurm is trying to drive.

Oh, I love this analogy! Whereas sacct is like talking to the navigator. The navigator does talk to the driver to give directions, and the driver keeps them up to date with the current situation, but the kids can talk to the navigator without disrupting the driver's concentration.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi Brian,

thanks for your ideas. Follow-up questions, because further digging through the docs didn't get me anywhere definitive on this:

> IMHO, the true solution is that if a job's info NEEDS updating that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.

Do you have any examples or suggestions of such ways without using slurm commands?

> Perhaps there are some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe
> such an API could be created/expanded.

One option that could work would be to use the `--wait` option of the `sbatch` command that snakemake uses to submit jobs, `--wrap`ping the respective shell command. In addition, `sbatch` would have to record "Job Accounting" info before exiting (it somehow does implicitly in the log file, although I am not sure how and where the printing of this accounting info is configured; so I am not sure whether this info will always be available in the logs or whether this depends on a Slurm cluster's configuration). One could then have snakemake wait for the process to finish, and only then parse the "Job Accounting" info in the log file to determine what happened.

But this means we do not know the `JobId`s of submitted jobs in the meantime, as the `JobId` is what is usually returned by `sbatch` upon successful submission (when `--wait` is not used). As a result, things like running `scancel` on all currently running jobs when we want to stop a snakemake run become more difficult, because we don't have a list of `JobId`s of currently active jobs. Although a single run-specific `name` for all jobs of a run (as suggested by Sean) might help, as `scancel` seems to allow the use of job names.

But as one can hopefully see, there are no simple solutions.
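[Editorial aside: the run-specific name idea could be sketched roughly as below. `sbatch -J` and `scancel --name`/`--user` are real Slurm options, but the `smk-` naming scheme and the wrapped command are made-up illustrations, not existing snakemake behaviour.]

```shell
# Tag every job of one snakemake run with a shared, run-specific name,
# so the whole run can later be cancelled without tracking JobIds.
RUN_NAME="smk-$(date +%s)-$$"   # unique per run; this naming scheme is hypothetical

# Submission would then look like (sketch; only works on a cluster):
#   sbatch -J "$RUN_NAME" --wrap 'my_rule_command'

# Teardown: cancel all jobs of this run by name instead of by JobId:
#   scancel --user="$USER" --name="$RUN_NAME"
echo "$RUN_NAME"
```

This trades JobId bookkeeping for a single shared name, at the cost of relying on that name being unique to the run.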
And to me, the documentation is not that easy to parse, especially if you are not already familiar with the terminology. I have not really found any best practices regarding the best ways to query for or otherwise determine job status (which is not to say they don't exist, but at least they don't seem easy to find -- pointers are welcome). I'll try to document whatever solution I come up with as best as I can, so that others can hopefully reuse as much as possible in their own contexts. But maybe some publicly available best practices (and no-gos) for slurm cluster users would be a useful resource that cluster admins could then point / link to.

cheers,
david

On Mon, 2023-02-27 at 06:53 -0800, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?"
> on a road trip with kids :)
>
> Slurm is trying to drive. Any communication to slurmctld will involve
> an RPC call (sinfo, squeue, scontrol, etc). You can see how many with
> sinfo. Too many RPC calls will cause failures. Asking slurmdbd will not
> do that to you. In fact, you could have a separate slurmdbd just for
> queries if you wanted. This is why that was suggested as a better
> option.
>
> So, even if you run 'squeue' once every few seconds, it would impact
> the system. More so depending on the size of the system. We have had
> that issue with users running 'watch squeue' and had to address it.
>
> IMHO, the true solution is that if a job's info NEEDS updating that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.
>
> Perhaps there are some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe
> such an API could be created/expanded.
>
> Just a quick 2 cents (We may be up to a few dollars with all of those
> so far).
> Brian Andrus
>
> On 2/27/2023 4:24 AM, Ward Poelmans wrote:
> > On 24/02/2023 18:34, David Laehnemann wrote:
> > > Those queries then should not have to happen too often, although do
> > > you have any indication of a range for when you say "you still
> > > wouldn't want to query the status too frequently." Because I don't
> > > really, and would probably opt for some compromise of every 30
> > > seconds or so.
> >
> > I think this is exactly why hpc sys admins are sometimes not very
> > happy about these tools. You're talking about large numbers of jobs on
> > the one hand, yet you want to fetch the status every 30 seconds? What
> > is the point of that other than overloading the scheduler?
> >
> > We're telling our users not to query slurm too often and usually give
> > 5 minutes as a good interval. You have to let slurm do its job. There
> > is no point in querying in a loop every 30 seconds when we're talking
> > about large numbers of jobs.
> >
> > Ward
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
We have many jupyterhub jobs on our cluster that also do a lot of job queries. We could adjust the query time, but what I did instead is have a single process query all the jobs with `squeue --json`, and the jupyterhub query script looks jobs up in this output. So instead of every jupyterhub job querying the batch system, only one process does. This is specific to the hub environment, but if a lot of users run snakemake you hit the same problem. As an admin I can understand the queries, and it is not only snakemake; there are plenty of other tools, like the hub, that also do a lot of queries. Some kind of caching mechanism is nice. Most solve it with a wrapper script.

Just my 2 cents

On 27/02/2023 15:53, Brian Andrus wrote:
> Sorry, I had to share that this is very much like "Are we there yet?"
> on a road trip with kids :)
>
> Slurm is trying to drive. Any communication to slurmctld will involve
> an RPC call (sinfo, squeue, scontrol, etc). You can see how many with
> sinfo. Too many RPC calls will cause failures. Asking slurmdbd will not
> do that to you. In fact, you could have a separate slurmdbd just for
> queries if you wanted. This is why that was suggested as a better
> option.
>
> So, even if you run 'squeue' once every few seconds, it would impact
> the system. More so depending on the size of the system. We have had
> that issue with users running 'watch squeue' and had to address it.
>
> IMHO, the true solution is that if a job's info NEEDS updating that
> often, have the job itself report what it is doing (but NOT via slurm
> commands). There are numerous ways to do that for most jobs.
>
> Perhaps there are some additional lines that could be added to the job
> that would do a call to a snakemake API and report itself? Or maybe
> such an API could be created/expanded.
>
> Just a quick 2 cents (We may be up to a few dollars with all of those
> so far).
> Brian Andrus
>
> On 2/27/2023 4:24 AM, Ward Poelmans wrote:
> > On 24/02/2023 18:34, David Laehnemann wrote:
> > > Those queries then should not have to happen too often, although do
> > > you have any indication of a range for when you say "you still
> > > wouldn't want to query the status too frequently." Because I don't
> > > really, and would probably opt for some compromise of every 30
> > > seconds or so.
> >
> > I think this is exactly why hpc sys admins are sometimes not very
> > happy about these tools. You're talking about large numbers of jobs on
> > the one hand, yet you want to fetch the status every 30 seconds? What
> > is the point of that other than overloading the scheduler?
> >
> > We're telling our users not to query slurm too often and usually give
> > 5 minutes as a good interval. You have to let slurm do its job. There
> > is no point in querying in a loop every 30 seconds when we're talking
> > about large numbers of jobs.
> >
> > Ward

--
Bas van der Vlies | High Performance Computing & Visualization | SURF |
Science Park 140 | 1098 XG Amsterdam | T +31 (0) 20 800 1300 |
bas.vandervl...@surf.nl | www.surf.nl
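[Editorial aside: the single-query caching approach Bas describes could look something like the wrapper below. `squeue --json` is the real command; the cache path, refresh age, and function names are assumptions for the sketch.]

```shell
# Serve job status from a shared snapshot instead of letting every
# consumer hit slurmctld directly: one squeue call feeds all readers.
CACHE="${TMPDIR:-/tmp}/squeue-cache.json"
MAX_AGE=60   # seconds a snapshot stays valid; tune to taste

refresh_cache() {
    # one RPC for everybody; atomic rename so readers never see a partial file
    squeue --json > "$CACHE.tmp" && mv "$CACHE.tmp" "$CACHE"
}

cached_squeue() {
    now=$(date +%s)
    mtime=$(stat -c %Y "$CACHE" 2>/dev/null || echo 0)
    if [ $((now - mtime)) -gt "$MAX_AGE" ]; then
        refresh_cache
    fi
    cat "$CACHE"
}
```

Each jupyterhub poller (or snakemake status check) would call `cached_squeue` and filter for its own job id, so the scheduler sees at most one `squeue` per minute regardless of the number of consumers.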
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
As a side note: in Slurm 23.x a new rate-limiting feature for client RPC calls was added (see this commit: https://github.com/SchedMD/slurm/commit/674f118140e171d10c2501444a0040e1492f4eab#diff-b4e84d09d9b1d817a964fb78baba0a2ea6316bfc10c1405329a95ad0353ca33e ). This gives operators the ability to limit the negative effect of workflow managers on the scheduler.

On Mon, Feb 27, 2023 at 4:57 PM Davide DelVento wrote:
> > > And if you are seeing a workflow management system causing trouble
> > > on your system, probably the most sustainable way of getting this
> > > resolved is to file issues or pull requests with the respective
> > > project, with suggestions like the ones you made. For snakemake, a
> > > second good point to currently chime in would be the issue
> > > discussing Slurm job array support:
> > > https://github.com/snakemake/snakemake/issues/301
> >
> > I have to disagree here. I think the onus is on the people in a given
> > community to ensure that their software behaves well on the systems
> > they want to use, not on the operators of those systems. Those of us
> > running HPC systems often have to deal with a very large range of
> > different pieces of software, and time and personnel are limited. If
> > some program used by only a subset of the users is causing disruption,
> > then it already costs us time and energy to mitigate those effects.
> > Even if I had the appropriate skill set, I don't see myself writing
> > many patches for workflow managers any time soon.
>
> As someone who has worked in both roles (and to a degree still does)
> and can therefore better understand the perspective of both parties, I
> side more with David than with Loris here.
>
> Yes, David wrote "or pull requests", but that's an OR.
>
> Loris, if you know of or experience a problem, it takes close to zero
> time to file a bug report educating the author of the software about
> the problem (or pointing them to places where they can educate
> themselves).
> Otherwise they will never know about it, they will never fix it, and
> potentially they will think it's fine and make the problem worse. Yes,
> you could alternatively forbid the use of the problematic software on
> the machine (I've done that on our systems), but users with those needs
> will find ways to create the very same problem, and perhaps worse, in
> other ways (they have done so on our system). Yes, time is limited, and
> as operators of HPC systems we often don't have the time to understand
> all the nuances and needs of all the users, but that's not the point I
> am advocating. In fact, it does seem to me that David is putting the
> onus on himself and his community to make the software behave
> correctly, and he is trying to educate himself about what "correct"
> looks like. So just give him the input he's looking for, both here and
> (if and when snakemake causes trouble on your system) by opening
> tickets on that repo, explaining the problem (definitely not writing a
> PR for you, sorry David).
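[Editorial aside: on the Slurm side, the rate limiting mentioned at the top of this message is switched on via `SlurmctldParameters` in slurm.conf. The option names below are those introduced around Slurm 23.02 and the numbers are purely illustrative; check the slurm.conf man page for your version before relying on them.]

```
# slurm.conf sketch: token-bucket rate limiting of client RPCs per user
SlurmctldParameters=rl_enable,rl_refill_rate=10,rl_bucket_size=50
```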
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
> > And if you are seeing a workflow management system causing trouble on
> > your system, probably the most sustainable way of getting this
> > resolved is to file issues or pull requests with the respective
> > project, with suggestions like the ones you made. For snakemake, a
> > second good point to currently chime in would be the issue discussing
> > Slurm job array support:
> > https://github.com/snakemake/snakemake/issues/301
>
> I have to disagree here. I think the onus is on the people in a given
> community to ensure that their software behaves well on the systems
> they want to use, not on the operators of those systems. Those of us
> running HPC systems often have to deal with a very large range of
> different pieces of software, and time and personnel are limited. If
> some program used by only a subset of the users is causing disruption,
> then it already costs us time and energy to mitigate those effects.
> Even if I had the appropriate skill set, I don't see myself writing
> many patches for workflow managers any time soon.

As someone who has worked in both roles (and to a degree still does) and can therefore better understand the perspective of both parties, I side more with David than with Loris here.

Yes, David wrote "or pull requests", but that's an OR.

Loris, if you know of or experience a problem, it takes close to zero time to file a bug report educating the author of the software about the problem (or pointing them to places where they can educate themselves). Otherwise they will never know about it, they will never fix it, and potentially they will think it's fine and make the problem worse. Yes, you could alternatively forbid the use of the problematic software on the machine (I've done that on our systems), but users with those needs will find ways to create the very same problem, and perhaps worse, in other ways (they have done so on our system).
Yes, time is limited, and as operators of HPC systems we often don't have the time to understand all the nuances and needs of all the users, but that's not the point I am advocating. In fact, it does seem to me that David is putting the onus on himself and his community to make the software behave correctly, and he is trying to educate himself about what "correct" looks like. So just give him the input he's looking for, both here and (if and when snakemake causes trouble on your system) by opening tickets on that repo, explaining the problem (definitely not writing a PR for you, sorry David).
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

David Laehnemann writes:
> Dear Ward,
>
> if used correctly (and that is a big caveat for any method of
> interacting with a cluster system), snakemake will only submit as many
> jobs as can fit within the resources of the cluster at one point in
> time (or however many resources you tell snakemake that it can use).
> So unless there are thousands of cores available (or you "lie" to
> snakemake, telling it that there are many more cores than actually
> exist), it will only ever submit hundreds of jobs (or a lot fewer, if
> the jobs each require multiple cores). Accordingly, any queries will
> also only be for the number of jobs that snakemake currently has
> submitted. And snakemake will only submit new jobs once it registers
> previously submitted jobs as finished.
>
> So workflow managers can actually help reduce the strain on the
> scheduler, by only ever submitting work within the general limits of
> the system (as opposed to, for example, using some bash loop to just
> submit all of your analysis steps or samples at once).

I don't see this as a particular advantage for the scheduler. If the maximum number of jobs a user can submit is set to, say, 5000, then it makes no difference whether those 5000 jobs are generated by snakemake or a batch script.

On our system, strain tends mainly to occur when many similar jobs fail immediately after they have started. How does snakemake behave in such a situation? If the job database is already clogged up trying to record too many jobs completing within too short a time, snakemake querying the database at that moment and maybe starting more jobs (because others have failed and thus completed) could potentially exacerbate the problem.
> And for example, snakemake has a mechanism to batch a number of
> smaller jobs into larger jobs for submission on the cluster, so this
> might be something to suggest to your users that cause trouble through
> using snakemake (especially the `--group-components` mechanism):
> https://snakemake.readthedocs.io/en/latest/executing/grouping.html

This seems to me, from the perspective of an operator, to be the main advantage.

> The query mechanism for job status is a different story. And I'm
> specifically here on this mailing list to get as much input as
> possible to improve this -- and welcome anybody who wants to chime in
> on my respective work-in-progress pull request right here:
> https://github.com/snakemake/snakemake/pull/2136
>
> And if you are seeing a workflow management system causing trouble on
> your system, probably the most sustainable way of getting this
> resolved is to file issues or pull requests with the respective
> project, with suggestions like the ones you made. For snakemake, a
> second good point to currently chime in would be the issue discussing
> Slurm job array support:
> https://github.com/snakemake/snakemake/issues/301

I have to disagree here. I think the onus is on the people in a given community to ensure that their software behaves well on the systems they want to use, not on the operators of those systems. Those of us running HPC systems often have to deal with a very large range of different pieces of software, and time and personnel are limited. If some program used by only a subset of the users is causing disruption, then it already costs us time and energy to mitigate those effects. Even if I had the appropriate skill set, I don't see myself writing many patches for workflow managers any time soon.
Cheers,
Loris

> And for Nextflow, another commonly used workflow manager in my field
> (bioinformatics), there's also an issue discussing Slurm job array
> support: https://github.com/nextflow-io/nextflow/issues/1477
>
> cheers,
> david
>
> On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
>> On 24/02/2023 18:34, David Laehnemann wrote:
>> > Those queries then should not have to happen too often, although do
>> > you have any indication of a range for when you say "you still
>> > wouldn't want to query the status too frequently." Because I don't
>> > really, and would probably opt for some compromise of every 30
>> > seconds or so.
>>
>> I think this is exactly why hpc sys admins are sometimes not very
>> happy about these tools. You're talking about large numbers of jobs
>> on the one hand, yet you want to fetch the status every 30 seconds?
>> What is the point of that other than overloading the scheduler?
>>
>> We're telling our users not to query slurm too often and usually give
>> 5 minutes as a good interval. You have to let slurm do its job. There
>> is no point in querying in a loop every 30 seconds when we're talking
>> about large numbers of jobs.
>>
>> Ward

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Dear Ward,

if used correctly (and that is a big caveat for any method of interacting with a cluster system), snakemake will only submit as many jobs as can fit within the resources of the cluster at one point in time (or however many resources you tell snakemake that it can use). So unless there are thousands of cores available (or you "lie" to snakemake, telling it that there are many more cores than actually exist), it will only ever submit hundreds of jobs (or a lot fewer, if the jobs each require multiple cores). Accordingly, any queries will also only be for the number of jobs that snakemake currently has submitted. And snakemake will only submit new jobs once it registers previously submitted jobs as finished.

So workflow managers can actually help reduce the strain on the scheduler, by only ever submitting work within the general limits of the system (as opposed to, for example, using some bash loop to just submit all of your analysis steps or samples at once). And for example, snakemake has a mechanism to batch a number of smaller jobs into larger jobs for submission on the cluster, so this might be something to suggest to your users that cause trouble through using snakemake (especially the `--group-components` mechanism): https://snakemake.readthedocs.io/en/latest/executing/grouping.html

The query mechanism for job status is a different story. And I'm specifically here on this mailing list to get as much input as possible to improve this -- and welcome anybody who wants to chime in on my respective work-in-progress pull request right here: https://github.com/snakemake/snakemake/pull/2136

And if you are seeing a workflow management system causing trouble on your system, probably the most sustainable way of getting this resolved is to file issues or pull requests with the respective project, with suggestions like the ones you made.
For snakemake, a second good point to currently chime in would be the issue discussing Slurm job array support: https://github.com/snakemake/snakemake/issues/301

And for Nextflow, another commonly used workflow manager in my field (bioinformatics), there's also an issue discussing Slurm job array support: https://github.com/nextflow-io/nextflow/issues/1477

cheers,
david

On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote:
> On 24/02/2023 18:34, David Laehnemann wrote:
> > Those queries then should not have to happen too often, although do
> > you have any indication of a range for when you say "you still
> > wouldn't want to query the status too frequently." Because I don't
> > really, and would probably opt for some compromise of every 30
> > seconds or so.
>
> I think this is exactly why hpc sys admins are sometimes not very
> happy about these tools. You're talking about large numbers of jobs on
> the one hand, yet you want to fetch the status every 30 seconds? What
> is the point of that other than overloading the scheduler?
>
> We're telling our users not to query slurm too often and usually give
> 5 minutes as a good interval. You have to let slurm do its job. There
> is no point in querying in a loop every 30 seconds when we're talking
> about large numbers of jobs.
>
> Ward
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 24/02/2023 18:34, David Laehnemann wrote:
> Those queries then should not have to happen too often, although do
> you have any indication of a range for when you say "you still
> wouldn't want to query the status too frequently." Because I don't
> really, and would probably opt for some compromise of every 30 seconds
> or so.

I think this is exactly why hpc sys admins are sometimes not very happy about these tools. You're talking about large numbers of jobs on the one hand, yet you want to fetch the status every 30 seconds? What is the point of that other than overloading the scheduler?

We're telling our users not to query slurm too often and usually give 5 minutes as a good interval. You have to let slurm do its job. There is no point in querying in a loop every 30 seconds when we're talking about large numbers of jobs.

Ward
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi Chris, hi Sean,

thanks also (and thanks again) for chiming in.

Quick follow-up question: Would `squeue` be a better fall-back command than `scontrol` from the perspective of keeping `slurmctld` responsive? From what I can see in the general overview of how slurm works (https://slurm.schedmd.com/overview.html), both query `slurmctld`. But would one be "better" than the other, as in generating less work for `slurmctld`? Or will it be a roughly equivalent amount of work, so that we can instead see which set of command-line arguments better suits our needs?

Also, just as a quick heads-up: I am documenting your input by linking to the mailing list archives, I hope that's alright for you? https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467

cheers,
david

On Sat, 2023-02-25 at 10:51 -0800, Chris Samuel wrote:
> On 23/2/23 2:55 am, David Laehnemann wrote:
>
> > And consequently, would using `scontrol` thus be the better default
> > option (as opposed to `sacct`) for repeated job status checks by a
> > workflow management system?
>
> Many others have commented on this, but use of scontrol in this way is
> really, really bad because of the impact it has on slurmctld. This is
> because responding to the RPC (IIRC) requires taking read locks on
> internal data structures, and on a large, busy system (like ours; we
> recently rolled slurm job IDs back over to 1 after ~6 years of
> operation and run at over 90% occupancy most of the time) this can
> really damage scheduling performance.
>
> We've had numerous occasions where we've had to track down users
> abusing scontrol in this way and redirect them to use sacct instead.
>
> We already use the cli filter abilities in Slurm to impose a form of
> rate limiting on RPCs from other commands, but unfortunately scontrol
> is not covered by that.
>
> All the best,
> Chris
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On 23/2/23 2:55 am, David Laehnemann wrote:
> And consequently, would using `scontrol` thus be the better default
> option (as opposed to `sacct`) for repeated job status checks by a
> workflow management system?

Many others have commented on this, but use of scontrol in this way is really, really bad because of the impact it has on slurmctld. This is because responding to the RPC (IIRC) requires taking read locks on internal data structures, and on a large, busy system (like ours; we recently rolled slurm job IDs back over to 1 after ~6 years of operation and run at over 90% occupancy most of the time) this can really damage scheduling performance.

We've had numerous occasions where we've had to track down users abusing scontrol in this way and redirect them to use sacct instead.

We already use the cli filter abilities in Slurm to impose a form of rate limiting on RPCs from other commands, but unfortunately scontrol is not covered by that.

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

> Those queries then should not have to happen too often, although do
> you have any indication of a range for when you say "you still
> wouldn't want to query the status too frequently." Because I don't
> really, and would probably opt for some compromise of every 30 seconds
> or so.

Every 30 seconds sounds reasonable. My cautioning was only in the sense that everything has limitations. For example, the query processing time depends on the size of the query and the overall load on the system, so any static interval you select may not work well under some conditions. You might want to defend against that by making the interval adaptive, like the maximum of 30s or 5x the execution time of the last query, so that it adapts to the overall burden of the query and the system load. That is just an example to try and communicate what I was getting at.

> One thing I didn't understand from your eMail is the part about job
> names, as the command I gave doesn't use job names for its query:
>
> sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...
>
> Instead, it just uses the JobId, and isn't that guaranteed to be
> unique at any point in time? Or were you meaning to say that JobId can
> be non-unique? That would indeed spell trouble on a different level,
> and make status checks much more complicated...

Job id is unique. What I mean is, building a CSV list of jobs has scalability issues. If you could assign the same job name to each job in the snakemake pipeline, then the query is much shorter, and still returns the status for each job id that snakemake has launched. Rather than falling back to scontrol (which doesn't support querying by name), snakemake could fall back to squeue, which does support querying by name.

Best,

-Sean
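[Editorial aside: Sean's adaptive-interval suggestion can be sketched as a tiny helper. The function name and the 30 s floor are just the example values from the message above.]

```shell
# Next polling interval: at least MIN_INTERVAL seconds, but back off to
# 5x the duration of the previous query when the cluster answers slowly.
MIN_INTERVAL=30

next_interval() {
    last_query_secs=$1
    adaptive=$((last_query_secs * 5))
    if [ "$adaptive" -gt "$MIN_INTERVAL" ]; then
        echo "$adaptive"
    else
        echo "$MIN_INTERVAL"
    fi
}

# In the status loop one would time the sacct call itself, e.g.:
#   t0=$(date +%s); sacct ...; t1=$(date +%s)
#   sleep "$(next_interval $((t1 - t0)))"
```

A 2-second query keeps the 30 s floor, while a 10-second query pushes the next poll out to 50 s.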
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi Sean,

Thanks again for all the feedback! I'll definitely try to implement batch queries, then, both for the default `sacct` query and for the fallback `scontrol` query. Also see here: https://github.com/snakemake/snakemake/pull/2136#issuecomment-1443295051

Those queries then should not have to happen too often, although do you have any indication of a range for when you say "you still wouldn't want to query the status too frequently." Because I don't really, and would probably opt for some compromise of every 30 seconds or so.

One thing I didn't understand from your eMail is the part about job names, as the command I gave doesn't use job names for its query:

sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...

Instead, it just uses the JobId, and isn't that guaranteed to be unique at any point in time? Or were you meaning to say that JobId can be non-unique? That would indeed spell trouble on a different level, and make status checks much more complicated...

cheers,
david

On Thu, 2023-02-23 at 11:59 -0500, Sean Maxwell wrote:
> Hi David,
>
> On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann
> <david.laehnem...@hhu.de> wrote:
>
> > But from your comment I understand that handling these queries in
> > batches would be less work for slurmdbd, right? So instead of
> > querying each jobid with a separate database query, it would do one
> > database query for the whole list? Is that really easier for the
> > system, or would it end up doing a call for each jobid, anyway?
>
> From the perspective of avoiding RPC flood, it is much better to use a
> batch query. That said, if you have an extremely large number of jobs
> in the queue, you still wouldn't want to query the status too
> frequently.
>
> > And just to be as clear as possible, a call to sacct would then look
> > like this:
> > sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...
>
> That would be one way to do it, but I think there are other approaches
> that might be better.
> For example, there is no requirement for the job name to be unique. So
> if the snakemake pipeline has a configurable instance name="foo", and
> snakemake was configured to specify its own name as the job name when
> submitting jobs (e.g. sbatch -J foo ...), then the query for all jobs
> in the pipeline is simply:
>
> sacct --name=foo
>
> > We can of course rewrite the respective code section, so any insight
> > on how to do this job accounting more efficiently (and better
> > tailored to how Slurm does things) is appreciated.
>
> I appreciate that you are interested in improving the integration to
> make it more performant. We are seeing an increase in meta-scheduler
> use at our site, so this is a worthwhile problem to tackle.
>
> Thanks,
>
> -Sean
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann wrote:
> But from your comment I understand that handling these queries in
> batches would be less work for slurmdbd, right? So instead of querying
> each jobid with a separate database query, it would do one database
> query for the whole list? Is that really easier for the system, or
> would it end up doing a call for each jobid, anyway?

From the perspective of avoiding RPC flood, it is much better to use a batch query. That said, if you have an extremely large number of jobs in the queue, you still wouldn't want to query the status too frequently.

> And just to be as clear as possible, a call to sacct would then look
> like this:
> sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...

That would be one way to do it, but I think there are other approaches that might be better. For example, there is no requirement for the job name to be unique. So if the snakemake pipeline has a configurable instance name="foo", and snakemake was configured to specify its own name as the job name when submitting jobs (e.g. sbatch -J foo ...), then the query for all jobs in the pipeline is simply:

sacct --name=foo

> We can of course rewrite the respective code section, so any insight
> on how to do this job accounting more efficiently (and better tailored
> to how Slurm does things) is appreciated.

I appreciate that you are interested in improving the integration to make it more performant. We are seeing an increase in meta-scheduler use at our site, so this is a worthwhile problem to tackle.

Thanks,

-Sean
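[Editorial aside: to make the batched variant concrete, here is a sketch of splitting the `-P` (pipe-separated) output of one sacct call into id/state pairs. The sample data stands in for real `sacct` output, since the actual command only works on a cluster; the helper name is made up.]

```shell
# One batched call would be:
#   sacct -X -P -n --format=JobIdRaw,State -j 101,102,103
# and prints one "JobIdRaw|State" line per job. Split that into pairs:
parse_states() {
    awk -F'|' 'NF >= 2 { print $1, $2 }'
}

# Simulated sacct output for three jobs of one pipeline:
sample='101|COMPLETED
102|RUNNING
103|FAILED'

printf '%s\n' "$sample" | parse_states
# prints:
# 101 COMPLETED
# 102 RUNNING
# 103 FAILED
```

A workflow manager would issue one such call per polling interval for all pending job ids, instead of one RPC per job.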
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi Sean,

yes, this is exactly what snakemake currently does. I didn't write that code, but from my previous debugging, I think handling one job at a time was simply the logic of the general executor for cluster systems, and makes things like querying via scontrol as a fallback easier to handle. But this is not set in stone.

But from your comment I understand that handling these queries in batches would be less work for slurmdbd, right? So instead of querying each jobid with a separate database query, it would do one database query for the whole list? Is that really easier for the system, or would it end up doing a call for each jobid, anyway?

And just to be as clear as possible, a call to sacct would then look like this:

sacct -X -P -n --format=JobIdRaw,State -j <jobid1>,<jobid2>,...

Because we can of course rewrite the respective code section, any insight on how to do this job accounting more efficiently (and better tailored to how Slurm does things) is appreciated.

cheers,
david

On Thu, 2023-02-23 at 09:46 -0500, Sean Maxwell wrote:
> Hi David,
>
> On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann <david.laehnem...@hhu.de> wrote:
>
> > Quick follow-up question: do you have any indication of the rate of job status checks via sacct that slurmdbd will gracefully handle (per second)? Or any suggestions how to roughly determine such a rate for a given cluster system?
>
> I looked at your PR for context, and this line of snakemake looks problematic (I know this isn't part of your PR, it is part of the original code):
> https://github.com/snakemake/snakemake/commit/a0f04bab08113196fe1616a621bd6bf20fc05688#diff-d1b47826c1fc35806df72508e2f5e7f1d0424f9b2f7b9124810b051f5fe97f1bL296
>
> sacct_cmd = f"sacct -P -n --format=JobIdRaw,State -j {jobid}"
>
> Since jobid is an int, this looks like snakemake will individually probe each Slurm job it has launched.
> If snakemake was using batch logic to gather status for all your running jobs with one call to sacct, then you could probably set the interval low. But it looks like it is going to probe each job individually by ID, so it will make as many RPC calls as there are jobs in the pipeline when it is time to check the status.
>
> I could be wrong, but this is how I evaluated the code without going farther upstream.
>
> Best,
>
> -Sean
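The batched alternative discussed in this exchange could be sketched roughly as follows in Python. This is illustrative, not snakemake's actual implementation; the helper names are made up, and the sacct flags mirror the command line quoted above:

```python
import subprocess

def parse_sacct_states(text):
    """Parse "JobIdRaw|State" lines (sacct -P -n output) into a dict
    mapping job ID to state string."""
    states = {}
    for line in text.splitlines():
        if line:
            jobid, state = line.split("|", 1)
            states[int(jobid)] = state
    return states

def query_job_states(job_ids):
    """Check many jobs with a single sacct call: one batched query to
    slurmdbd instead of one RPC per job in the pipeline."""
    out = subprocess.run(
        ["sacct", "-X", "-P", "-n", "--format=JobIdRaw,State",
         "-j", ",".join(str(j) for j in job_ids)],
        capture_output=True, text=True, check=True).stdout
    return parse_sacct_states(out)
```

The RPC count then stays at one per polling interval, however many jobs are in flight.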
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann wrote:

> Quick follow-up question: do you have any indication of the rate of job status checks via sacct that slurmdbd will gracefully handle (per second)? Or any suggestions how to roughly determine such a rate for a given cluster system?

I looked at your PR for context, and this line of snakemake looks problematic (I know this isn't part of your PR, it is part of the original code):
https://github.com/snakemake/snakemake/commit/a0f04bab08113196fe1616a621bd6bf20fc05688#diff-d1b47826c1fc35806df72508e2f5e7f1d0424f9b2f7b9124810b051f5fe97f1bL296

sacct_cmd = f"sacct -P -n --format=JobIdRaw,State -j {jobid}"

Since jobid is an int, this looks like snakemake will individually probe each Slurm job it has launched. If snakemake was using batch logic to gather status for all your running jobs with one call to sacct, then you could probably set the interval low. But it looks like it is going to probe each job individually by ID, so it will make as many RPC calls as there are jobs in the pipeline when it is time to check the status.

I could be wrong, but this is how I evaluated the code without going farther upstream.

Best,

-Sean
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

David Laehnemann writes:

[snip (16 lines)]

> P.S.: @Loris and @Noam: Exactly, snakemake is a software distinct from slurm that you can use to orchestrate large analysis workflows---on anything from a desktop or laptop computer to all kinds of cluster / cloud systems. In the case of Slurm it will submit each analysis step on a particular sample as a separate job, specifying the resources it needs. The scheduler then handles it from there. But because you can have (hundreds of) thousands of jobs, and with dependencies among them, you can't just submit everything all at once, but have to keep track of where you are at. And make sure you don't submit much more than the system can handle at any time, so you don't overwhelm the Slurm queue.

[snip (86 lines)]

I know what Snakemake and other workflow managers, such as Nextflow, are for, but my maybe ill-informed impression is that, while something of this sort is obviously needed to manage complex dependencies, the current solutions, probably because they originated outside the HPC context, try to do too much.

You say Snakemake helps make sure you don't submit much more than the system can handle, but in my view that should not be necessary. Slurm has configuration parameters which can be set to limit the number of jobs a user can submit and/or run.

And when it comes to submitting (hundreds of) thousands of jobs, Nextflow for example currently can't create job arrays, and so generates large numbers of jobs with identical resource requirements, which can prevent backfill from working properly. Skimming the documentation for Snakemake, I also could not find any reference to Slurm job arrays, so this could also be an issue.

Just my slightly grumpy 2¢.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
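For reference, submitting a batch of identical tasks as a single Slurm job array, as Loris suggests, could be sketched like this. The Python helpers are purely illustrative (not part of either workflow manager):

```python
import subprocess

def array_submit_argv(script, n_tasks, name):
    """Build the sbatch argv that submits n identical tasks as ONE job
    array: a single job record instead of n separate jobs with identical
    resource requests, which is easier on the scheduler and on backfill.
    Inside the job, $SLURM_ARRAY_TASK_ID selects the task's input."""
    return ["sbatch", "--parsable",
            f"--array=0-{n_tasks - 1}", "-J", name, script]

def submit_array(script, n_tasks, name):
    """Run sbatch and return the array's base job ID. --parsable makes
    sbatch print just the ID (plus ";cluster" on federated setups)."""
    out = subprocess.run(array_submit_argv(script, n_tasks, name),
                         capture_output=True, text=True, check=True).stdout
    return int(out.strip().split(";")[0])
```

A single `sacct -j <base_id>` or `squeue -j <base_id>` then covers all tasks of the array at once.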
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi Sean, hi everybody,

thanks a lot for the quick insights! My takeaway is: sacct is the better default for putting in lots of job status checks after all, as it will not impact the slurmctld scheduler.

Quick follow-up question: do you have any indication of the rate of job status checks via sacct that slurmdbd will gracefully handle (per second)? Or any suggestions how to roughly determine such a rate for a given cluster system?

cheers,
david

P.S.: @Loris and @Noam: Exactly, snakemake is a software distinct from slurm that you can use to orchestrate large analysis workflows---on anything from a desktop or laptop computer to all kinds of cluster / cloud systems. In the case of Slurm it will submit each analysis step on a particular sample as a separate job, specifying the resources it needs. The scheduler then handles it from there. But because you can have (hundreds of) thousands of jobs, and with dependencies among them, you can't just submit everything all at once, but have to keep track of where you are at. And make sure you don't submit much more than the system can handle at any time, so you don't overwhelm the Slurm queue.

On Thu, 2023-02-23 at 07:55 -0500, Sean Maxwell wrote:
> Hi David,
>
> scontrol - interacts with slurmctld using RPC, so it is faster, but requests put load on the scheduler itself.
> sacct - interacts with slurmdbd, so it doesn't place additional load on the scheduler.
>
> There is a balance to reach, but the scontrol approach is riskier and can start to interfere with the cluster operation if used incorrectly.
>
> Best,
>
> -Sean
>
> On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann <david.laehnem...@hhu.de> wrote:
>
> > Dear Slurm users and developers,
> >
> > TL;DR:
> > Do any of you know if `scontrol` status checks of jobs are always expected to be quicker than `sacct` job status checks? Do you have any comparative timings between the two commands?
> > And consequently, would using `scontrol` thus be the better default option (as opposed to `sacct`) for repeated job status checks by a workflow management system?
> >
> > And here's the long version with background infos and linkouts:
> >
> > I have recently started using a Slurm cluster and am a regular user of the workflow management system snakemake (https://snakemake.readthedocs.io/en/latest/). This workflow manager recently integrated support for running analysis workflows pretty seamlessly on Slurm clusters. It takes care of managing all job dependencies and handles the submission of jobs according to your global (and job-specific) resource configurations.
> >
> > One little hiccup when starting to use the snakemake-Slurm combination was a snakemake-internal rate-limitation for checking job statuses. You can find the full story here:
> > https://github.com/snakemake/snakemake/pull/2136
> >
> > For debugging this, I obtained timings on `sacct` and `scontrol`, with `scontrol` consistently about 2.5x quicker in returning the job status when compared to `sacct`. Timings are recorded here:
> > https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
> >
> > However, currently `sacct` is used for regularly checking the status of submitted jobs per default, and `scontrol` is only a fallback whenever `sacct` doesn't find the job (for example because it is not yet running). Now, I was wondering if switching the default to `scontrol` would make sense. Thus, I would like to ask:
> >
> > 1) Slurm users, whether they also have similar timings on different Slurm clusters and whether those confirm that `scontrol` is consistently quicker?
> >
> > 2) Slurm developers, whether `scontrol` is expected to be quicker from its implementation and whether using `scontrol` would also be the option that puts less strain on the scheduler in general?
> >
> > Many thanks and best regards,
> > David
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

scontrol - interacts with slurmctld using RPC, so it is faster, but requests put load on the scheduler itself.
sacct - interacts with slurmdbd, so it doesn't place additional load on the scheduler.

There is a balance to reach, but the scontrol approach is riskier and can start to interfere with the cluster operation if used incorrectly.

Best,

-Sean

On Thu, Feb 23, 2023 at 5:59 AM David Laehnemann wrote:

> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always expected to be quicker than `sacct` job status checks? Do you have any comparative timings between the two commands?
> And consequently, would using `scontrol` thus be the better default option (as opposed to `sacct`) for repeated job status checks by a workflow management system?
>
> And here's the long version with background infos and linkouts:
>
> I have recently started using a Slurm cluster and am a regular user of the workflow management system snakemake (https://snakemake.readthedocs.io/en/latest/). This workflow manager recently integrated support for running analysis workflows pretty seamlessly on Slurm clusters. It takes care of managing all job dependencies and handles the submission of jobs according to your global (and job-specific) resource configurations.
>
> One little hiccup when starting to use the snakemake-Slurm combination was a snakemake-internal rate-limitation for checking job statuses. You can find the full story here:
> https://github.com/snakemake/snakemake/pull/2136
>
> For debugging this, I obtained timings on `sacct` and `scontrol`, with `scontrol` consistently about 2.5x quicker in returning the job status when compared to `sacct`. Timings are recorded here:
> https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
>
> However, currently `sacct` is used for regularly checking the status of submitted jobs per default, and `scontrol` is only a fallback whenever `sacct` doesn't find the job (for example because it is not yet running). Now, I was wondering if switching the default to `scontrol` would make sense. Thus, I would like to ask:
>
> 1) Slurm users, whether they also have similar timings on different Slurm clusters and whether those confirm that `scontrol` is consistently quicker?
>
> 2) Slurm developers, whether `scontrol` is expected to be quicker from its implementation and whether using `scontrol` would also be the option that puts less strain on the scheduler in general?
>
> Many thanks and best regards,
> David
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
On Feb 23, 2023, at 7:40 AM, Loris Bennett <loris.benn...@fu-berlin.de> wrote:

> Hi David,
>
> David Laehnemann <david.laehnem...@hhu.de> writes:
>
> > by a workflow management system?
>
> I am probably being a bit naive, but I would have thought that the batch system should just be able to start your jobs when resources become available. Why do you need to check the status of jobs? I would tend to think that it is not something users should be doing.

"workflow management system" generally means some other piece of software that submits jobs as needed to complete some task. It might need to know how current jobs are doing (running yet, completed, etc.) to decide what to submit next. I assume that's the use case here.
Re: [slurm-users] speed / efficiency of sacct vs. scontrol
Hi David,

David Laehnemann writes:

> Dear Slurm users and developers,
>
> TL;DR:
> Do any of you know if `scontrol` status checks of jobs are always expected to be quicker than `sacct` job status checks? Do you have any comparative timings between the two commands?
> And consequently, would using `scontrol` thus be the better default option (as opposed to `sacct`) for repeated job status checks by a workflow management system?

I am probably being a bit naive, but I would have thought that the batch system should just be able to start your jobs when resources become available. Why do you need to check the status of jobs? I would tend to think that it is not something users should be doing.

Cheers,

Loris

> And here's the long version with background infos and linkouts:
>
> I have recently started using a Slurm cluster and am a regular user of the workflow management system snakemake (https://snakemake.readthedocs.io/en/latest/). This workflow manager recently integrated support for running analysis workflows pretty seamlessly on Slurm clusters. It takes care of managing all job dependencies and handles the submission of jobs according to your global (and job-specific) resource configurations.
>
> One little hiccup when starting to use the snakemake-Slurm combination was a snakemake-internal rate-limitation for checking job statuses. You can find the full story here:
> https://github.com/snakemake/snakemake/pull/2136
>
> For debugging this, I obtained timings on `sacct` and `scontrol`, with `scontrol` consistently about 2.5x quicker in returning the job status when compared to `sacct`. Timings are recorded here:
> https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210
>
> However, currently `sacct` is used for regularly checking the status of submitted jobs per default, and `scontrol` is only a fallback whenever `sacct` doesn't find the job (for example because it is not yet running). Now, I was wondering if switching the default to `scontrol` would make sense. Thus, I would like to ask:
>
> 1) Slurm users, whether they also have similar timings on different Slurm clusters and whether those confirm that `scontrol` is consistently quicker?
>
> 2) Slurm developers, whether `scontrol` is expected to be quicker from its implementation and whether using `scontrol` would also be the option that puts less strain on the scheduler in general?
>
> Many thanks and best regards,
> David

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
[slurm-users] speed / efficiency of sacct vs. scontrol
Dear Slurm users and developers,

TL;DR:
Do any of you know if `scontrol` status checks of jobs are always expected to be quicker than `sacct` job status checks? Do you have any comparative timings between the two commands?
And consequently, would using `scontrol` thus be the better default option (as opposed to `sacct`) for repeated job status checks by a workflow management system?

And here's the long version with background infos and linkouts:

I have recently started using a Slurm cluster and am a regular user of the workflow management system snakemake (https://snakemake.readthedocs.io/en/latest/). This workflow manager recently integrated support for running analysis workflows pretty seamlessly on Slurm clusters. It takes care of managing all job dependencies and handles the submission of jobs according to your global (and job-specific) resource configurations.

One little hiccup when starting to use the snakemake-Slurm combination was a snakemake-internal rate-limitation for checking job statuses. You can find the full story here:
https://github.com/snakemake/snakemake/pull/2136

For debugging this, I obtained timings on `sacct` and `scontrol`, with `scontrol` consistently about 2.5x quicker in returning the job status when compared to `sacct`. Timings are recorded here:
https://github.com/snakemake/snakemake/blob/b91651d5ea2314b954a3b4b096d7f327ce743b94/snakemake/scheduler.py#L199-L210

However, currently `sacct` is used for regularly checking the status of submitted jobs per default, and `scontrol` is only a fallback whenever `sacct` doesn't find the job (for example because it is not yet running). Now, I was wondering if switching the default to `scontrol` would make sense. Thus, I would like to ask:

1) Slurm users, whether they also have similar timings on different Slurm clusters and whether those confirm that `scontrol` is consistently quicker?
2) Slurm developers, whether `scontrol` is expected to be quicker from its implementation and whether using `scontrol` would also be the option that puts less strain on the scheduler in general? Many thanks and best regards, David
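For anyone wanting to gather such comparative timings on their own cluster, a rough benchmarking sketch in Python (illustrative only; the helper name is made up, `jobid` is a placeholder, and results are only meaningful when averaged over several repeats on an otherwise quiet system):

```python
import subprocess
import time

def mean_runtime(argv, repeats=10):
    """Average wall-clock time of one status-check command over a few
    repeats. A rough benchmark only; run it on an otherwise quiet
    system for a fair sacct-vs-scontrol comparison."""
    start = time.perf_counter()
    for _ in range(repeats):
        subprocess.run(argv, capture_output=True, check=True)
    return (time.perf_counter() - start) / repeats

# Usage (jobid is a placeholder for a real job ID):
# t_sacct = mean_runtime(["sacct", "-X", "-P", "-n",
#                         "--format=JobIdRaw,State", "-j", jobid])
# t_scontrol = mean_runtime(["scontrol", "show", "job", jobid])
```

Note that raw latency is only half the picture: as discussed above, scontrol answers may come back faster while still putting more load on slurmctld than sacct puts on slurmdbd.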