Hi there, I know that you can’t directly raise the size of a running job, and I’ve read the FAQ item related to it:
http://slurm.schedmd.com/faq.html#job_size

It does appear as if there’s a way to raise the size of a job by scheduling a new job with --dependency=expand:<jobid>, but I can’t figure out exactly how to use it in the context of the example.

The reason I most often want to do something like this is that as a sysadmin, I’ll notice someone who has requested 1 core but is really using 16, for example. In many cases, I will not have noticed this for quite a while, and the job is running on a node by itself (because it is common for people to request full nodes). I’d like to adjust the allocation for this job to prevent other jobs from using the cores that are in use.

I tried to build on the example that exists, and did “salloc -N1 -c16 -n1 --dependency=expand:1066922” but got an error:

salloc: error: Job submit/allocate failed: Job dependency problem

Am I misunderstanding how this should work?

It would be nice if it were possible for root to make adjustments to raise the size of a job directly with some sort of override, at least in cases where it would not affect scheduling. In some cases, the user has willfully not followed instructions and is trying to game the system, and I feel fine just cancelling/requeueing their work with the correct parameters. But in my example case, someone has just made a mistake and done many hours of work, with no job waiting that will be affected by the resize. I think my only options are to drain the node until the job finishes, risk another user being able to run a job that would oversubscribe the node, or perhaps run a fake job of some sort that blocks the other resources on the node.

I’m all ears if anyone has any other ideas. Thanks!

--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'
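
For reference, the "fake job" workaround mentioned above might look roughly like the sketch below. It only builds and prints an sbatch command rather than submitting anything. The helper name build_blocker_cmd, the node name node001, the job name core-blocker and the time limit are all made up for illustration. It also assumes the cluster schedules individual cores as consumable resources, so that a placeholder job can hold the cores the original job is using without having requested them.

```shell
#!/bin/sh
# Sketch only: print (do not run) an sbatch command for a placeholder job
# that holds the cores an under-requested job is using.
# build_blocker_cmd is a hypothetical helper, not part of Slurm.
build_blocker_cmd() {
    node="$1"         # node the under-requested job is running on
    total_cores="$2"  # number of cores on that node
    requested="$3"    # cores the job actually requested
    time_limit="$4"   # e.g. the time the original job has left

    # Cores in use by the job but still free from the scheduler's view.
    spare=$(( total_cores - requested ))
    if [ "$spare" -le 0 ]; then
        echo "nothing to block on $node" >&2
        return 1
    fi

    # The placeholder job just sleeps until its time limit or scancel.
    printf 'sbatch --nodelist=%s --ntasks=1 --cpus-per-task=%d --time=%s --job-name=core-blocker --wrap="sleep infinity"\n' \
        "$node" "$spare" "$time_limit"
}

# Example with made-up values: a 16-core node where the job asked for 1 core.
build_blocker_cmd node001 16 1 12:00:00
```

The printed command can be reviewed and then run by hand. The placeholder job should be cancelled with scancel once the original job finishes, and it will be accounted to whoever submits it.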