Hi again,

I'd just like to raise the issue of GrpCPUMins and GrpWall causing running jobs
to be killed when limits are reached.

I personally think this is a bit heavy-handed.

I would prefer the system to prevent the job from being started, rather than
killing a running job.

This would obviously require (much) more logic at the job launch stage:
calculate requested time * allocated cpus, and check whether that, added to
the current usage, would bring the association over the limit. If you take
into account multiple users in an association submitting multiple jobs, I
appreciate that this is a non-trivial issue. It has shades of GOLD
pre-allocation of time, of which I don't have fond memories!
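To illustrate the kind of pre-check I mean, here's a rough sketch in C. All
of the names and parameters are hypothetical, purely for illustration; they
are not real Slurm internals:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical launch-time pre-check: would starting this job push the
 * association's accumulated usage over its GrpCPUMins limit?
 * (Illustrative only; not actual Slurm code.) */
static bool job_would_exceed_grp_cpu_mins(uint64_t current_usage_mins,
					  uint32_t time_limit_mins,
					  uint32_t alloc_cpus,
					  uint64_t grp_cpu_mins_limit)
{
	/* requested time * allocated cpus */
	uint64_t requested_mins = (uint64_t)time_limit_mins * alloc_cpus;

	return (current_usage_mins + requested_mins) > grp_cpu_mins_limit;
}
```

The hard part, of course, isn't this arithmetic but doing it race-free across
many users and many pending jobs in the same association.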


Perhaps a compromise might be an additional slurm.conf boolean value, something
like:

AccountingStorageEnforceAllowFinish=true

(that's a terrible name!)

It could default to false, preserving the current behaviour, but when set to
true it would allow running jobs to finish, even if they run over the limit.

That way it's less cruel to users: they still end up going over the limit,
but it affects their future jobs rather than their currently running ones.
Sure, a user could end up having multiple jobs go over the limit, but
eventually they won't be able to run.

To implement this, you'd need additional slurm.conf parsing logic, and then in
the src/slurmctld/job_mgr.c:job_time_limit() function you'd have an additional
boolean check in each of the usage checks, similar to my previously proposed
patch.
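The shape of that check might look something like the following. This is a
sketch only: 'allow_finish' stands in for the proposed (badly named!)
AccountingStorageEnforceAllowFinish option, and the function is a made-up
stand-in for the per-limit checks inside job_time_limit(), not actual code
from job_mgr.c:

```c
#include <stdbool.h>

/* Hypothetical gate applied at each usage check in job_time_limit():
 * only kill a running job when it is over a group limit AND the admin
 * has not opted to let running jobs finish. (Illustrative only.) */
static bool should_kill_for_usage(bool over_limit, bool allow_finish)
{
	return over_limit && !allow_finish;
}
```

Each of the GrpCPUMins / GrpWall checks would then call this instead of
killing the job unconditionally when over_limit is true.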


Any thoughts / comments?

Thanks,
Paddy

-- 
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/
