Hi again, I'd just like to raise the issue of GrpCPUMins and GrpWall causing running jobs to be killed when limits are reached.
I personally think this is a bit heavy-handed. I would prefer the system to prevent the job from starting in the first place, rather than killing a running job. That would obviously require (much) more logic at the job launch stage: calculate requested time * allocated CPUs, and check whether adding that to the current usage would bring the association over the limit. Once you take into account multiple users in an association submitting multiple jobs, I appreciate that this is a non-trivial issue. It has shades of GOLD's pre-allocation of time, of which I don't have fond memories!

Perhaps a compromise might be an additional slurm.conf boolean value, something like:

  AccountingStorageEnforceAllowFinish=true

(that's a terrible name!) It could default to false, preserving the current behaviour, but if set to true it would allow running jobs to finish even if they run over the limit. That way it's less cruel to users: they still end up going over the limit, but it affects their future jobs rather than their currently running ones. Sure, a user could have multiple jobs go over the limit, but eventually they won't be able to run anything new.

To implement this, you'd need additional slurm.conf parsing logic, and then in the src/slurmctld/job_mgr.c:job_time_limit() function you'd add a boolean check in each of the usage checks, similar to my previously proposed patch.

Any thoughts / comments?

Thanks,
Paddy

--
Paddy Doyle
Trinity Centre for High Performance Computing,
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
Phone: +353-1-896-3725
http://www.tchpc.tcd.ie/