Hi there,

I know that you can’t directly raise the size of a running job, and I’ve read 
the FAQ item related to it:

http://slurm.schedmd.com/faq.html#job_size

It does appear as if there’s a way to raise the size of a job by scheduling a 
new job with --dependency=expand:<jobid>, but I can’t figure out exactly how to 
use it in the context of the example.

The reason I most often want to do something like this is that as a sysadmin, 
I’ll notice someone who has requested 1 core but is really using 16, for 
example. In many cases, I will not have noticed this for quite a while, and the 
job is running on a node by itself (because it is common for people to request 
full nodes). I’d like to adjust the allocation for this job to prevent other 
jobs from using the cores that are in use.

I tried to build on the FAQ example and ran “salloc -N1 -c16 -n1 
--dependency=expand:1066922”, but got an error: “salloc: error: Job 
submit/allocate failed: Job dependency problem”. Am I misunderstanding how this 
should work?
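
For reference, the sequence in the FAQ example, adapted to this job ID, can be 
sketched roughly as below. This is a dry-run sketch that only prints the 
commands and submits nothing; print_expand_steps is a hypothetical helper name, 
and whether the sequence works when the requester is not the job's owner is 
exactly the open question here.

```shell
#!/bin/sh
# Dry-run sketch: prints the command sequence from the Slurm FAQ's
# job-expansion example instead of executing it.
# print_expand_steps is a hypothetical helper, not a Slurm tool.
print_expand_steps() {
    jobid="$1"   # ID of the running job to be expanded

    # 1. Request extra resources as a new job tied to the original one.
    echo "salloc -N1 --dependency=expand:${jobid} bash"
    # 2. Inside the new allocation, release its nodes to the original job.
    echo 'scontrol update jobid=$SLURM_JOBID NumNodes=0'
    # 3. Leave the new (now empty) allocation.
    echo "exit"
    # 4. Have the original job pick up all of the merged resources.
    echo "scontrol update jobid=${jobid} NumNodes=ALL"
}

print_expand_steps 1066922
```

In the FAQ, steps 1 and 4 are run from inside the original job's own shell, 
which may matter for the dependency check.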

It would be nice if root could increase the size of a job directly with some 
sort of override, at least in cases where it would not affect scheduling. In 
some cases, the user has willfully ignored instructions and is trying to game 
the system, and I feel fine just cancelling/requeueing their work with the 
correct parameters. But in my example case, someone has simply made a mistake 
and has already done many hours of work, and no waiting job would be affected 
by the resize. As far as I can tell, my only options are to drain the node 
until the job finishes, risk another user running a job that would 
oversubscribe the node, or perhaps run a fake job of some sort that blocks the 
other resources on the node.
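
The drain and placeholder-job workarounds above might look something like the 
following. This is a dry-run sketch that only prints candidate commands; the 
node name, the 24-hour time limit, and the helper name print_block_options are 
all hypothetical, and running the placeholder as root assumes the site allows 
root to submit jobs.

```shell
#!/bin/sh
# Dry-run sketch: prints candidate commands; nothing is submitted.
# Node name, core counts, and time limit are hypothetical values.
print_block_options() {
    node="$1"        # node the under-requested job is running on
    requested="$2"   # cores the job asked for
    in_use="$3"      # cores the job is actually using
    spare=$((in_use - requested))   # cores in use but not allocated

    # Option A: drain the node so no new jobs start there until it empties.
    echo "scontrol update NodeName=${node} State=DRAIN Reason=\"job using ${in_use} cores, allocated ${requested}\""
    # Option B: placeholder job that occupies the unallocated cores.
    echo "sbatch --nodelist=${node} --ntasks=${spare} --time=24:00:00 --wrap=\"sleep 86400\""
}

print_block_options node01 1 16
```

With 1 core requested and 16 in use, the placeholder asks for the remaining 15 
cores, so the scheduler sees the node as full without touching the running job.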

I’m all ears if anyone has any other ideas. Thanks!

--
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'
