Re: [slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?
Hi Loris,

We have a completely separate test system, complete with a few worker nodes and a
separate slurmctld/slurmdbd, so we can test Slurm upgrades etc.

Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Mon, 7 Dec 2020 at 19:01, Loris Bennett wrote:

> Hi Sean,
>
> Thanks for the code - it looks like you have put a lot more thought into
> it than I have into mine. I'll certainly have to look at handling the
> 'tres-per-*' options.
>
> By the way, how do you do your testing? As I don't have a test cluster,
> currently I'm doing "open heart" testing, but I really need a minimal
> test cluster, maybe using VMs.
>
> Cheers,
>
> Loris
>
> [...]
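For anyone in the same position without spare hardware, the pure-Lua logic of a
submit filter can also be smoke-tested with no cluster at all, by stubbing the
'slurm' table and a fake job_desc and calling slurm_job_submit directly. A minimal
sketch - the stub functions and the NO_VAL value are assumptions made for
illustration, to be checked against your Slurm version:

    -- test_job_submit.lua: stand-alone smoke test for job_submit.lua logic.
    -- NOT a real Slurm environment: the 'slurm' table here is a hand-rolled
    -- stub, and NO_VAL's value is an assumption to verify against your build.
    slurm = {
        SUCCESS = 0,
        ERROR   = -1,
        NO_VAL  = 4294967294,
        log_user = function(fmt, ...) print("user: " .. string.format(fmt, ...)) end,
        log_info = function(fmt, ...) print("info: " .. string.format(fmt, ...)) end,
        user_msg = function(fmt, ...) print("msg:  " .. string.format(fmt, ...)) end,
    }

    dofile("job_submit.lua")  -- must define slurm_job_submit(job_desc, part_list, submit_uid)

    -- Fake job description exercising the "GPU partition, no GPUs requested" path.
    local job_desc = {
        partition = "gpgpu",
        gres = nil,
        tres_per_node = nil, tres_per_socket = nil, tres_per_task = nil,
        num_tasks = slurm.NO_VAL,
    }

    print("return code: " .. tostring(slurm_job_submit(job_desc, {}, 1000)))

Running a handful of job_desc variants through this catches syntax errors and
obvious logic slips before the script ever reaches a slurmctld, though it is no
substitute for a real test cluster when it comes to scheduler behaviour.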
Re: [slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?
Hi Sean,

Thanks for the code - it looks like you have put a lot more thought into it
than I have into mine. I'll certainly have to look at handling the
'tres-per-*' options.

By the way, how do you do your testing? As I don't have a test cluster,
currently I'm doing "open heart" testing, but I really need a minimal test
cluster, maybe using VMs.

Cheers,

Loris

Sean Crosby writes:

> Hi Loris,
>
> This is our submit filter for what you're asking. It checks for both
> --gres and --gpus.
>
> [...]

--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
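As a sketch of what consolidating the 'tres-per-*' handling might look like, all
the fields that can carry a GPU request could sit behind one helper. The field
names below are taken from the job_submit plugin documentation, and the mapping
of tres_per_job to '--gpus' is an assumption to verify against your Slurm release:

    -- Hypothetical helper: did the job request GPUs via any known channel?
    -- Field names are from the job_submit docs; confirm that each exists
    -- (and that --gpus really lands in tres_per_job) on your Slurm version.
    local function requests_gpus(job_desc)
        local fields = {
            "gres",             -- --gres=gpu:...
            "tres_per_job",     -- --gpus (assumed mapping)
            "tres_per_node",    -- --gpus-per-node
            "tres_per_socket",  -- --gpus-per-socket
            "tres_per_task",    -- --gpus-per-task
        }
        for _, name in ipairs(fields) do
            local value = job_desc[name]
            if value ~= nil and string.find(value, "gpu") then
                return true
            end
        end
        return false
    end

A single predicate like this would reduce the partition check to one 'if'
instead of a ladder of nested nil tests.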
Re: [slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?
Hi Loris,

This is our submit filter for what you're asking. It checks for both --gres
and --gpus:

    ESLURM_INVALID_GRES = 2072
    ESLURM_BAD_TASK_COUNT = 2025

    if (job_desc.partition ~= slurm.NO_VAL) then
        if (job_desc.partition ~= nil) then
            if (string.match(job_desc.partition, "gpgpu") or
                string.match(job_desc.partition, "gpgputest")) then
                -- slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
                -- Alert on invalid gpu count, e.g. gpu:0, gpu:p100:0
                if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
                    local numgpu = string.match(job_desc.gres, ":%d+$")
                    if (numgpu ~= nil) then
                        numgpu = numgpu:gsub(':', '')
                        if (tonumber(numgpu) < 1) then
                            slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
                            return ESLURM_INVALID_GRES
                        end
                    end
                else
                    -- Alternative: the --gpus options in newer versions of Slurm
                    if (job_desc.tres_per_node == nil) then
                        if (job_desc.tres_per_socket == nil) then
                            if (job_desc.tres_per_task == nil) then
                                slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
                                return ESLURM_INVALID_GRES
                            else
                                if (job_desc.num_tasks == slurm.NO_VAL) then
                                    slurm.user_msg("--gpus-per-task option requires --tasks specification")
                                    return ESLURM_BAD_TASK_COUNT
                                end
                            end
                        end
                    end
                end
            end
        end
    end

Let me know if you improve it, please? We're always on the hunt to fix up some
of the logic in the submit filter.

Cheers,
Sean

--
Sean Crosby | Senior DevOps/HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Fri, 4 Dec 2020 at 23:58, Loris Bennett wrote:

> Hi,
>
> I want to reject jobs that don't specify any GPUs when accessing our GPU
> partition and have the following in job_submit.lua:
>
>     if (job_desc.partition == "gpu" and job_desc.gres == nil) then
>         slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
>                                      "e.g. '#SBATCH --gres=gpu:1'. " ..
>                                      "Please see 'man sbatch' for more details."))
>         slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
>                                      username))
>         return slurm.ERROR
>     end
>
> If GRES is not given for the GPU partition, this produces
>
>     sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1'. Please see 'man sbatch' for more details.
>     sbatch: error: Batch job submission failed: Unspecified error
>
> My questions are:
>
> 1. Is there a better error to return? The 'slurm.ERROR' produces the
>    generic second error line above (slurm_errno.h just seems to have
>    ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
>    might raise). This is misleading, since the error is in fact known
>    and specific.
> 2. Am I right in thinking that 'job_desc' does not, as of 20.02.6, have
>    a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
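On question 1, the filter above already hints at the answer: returning a numeric
value lifted from slurm_errno.h, rather than slurm.ERROR, makes sbatch print
that error's own text instead of "Unspecified error". A trimmed illustration -
2072 matched ESLURM_INVALID_GRES in the Slurm source at the time of this thread,
but the numbers can shift between releases, so re-check your slurm_errno.h:

    -- Return a specific errno from slurm_errno.h instead of slurm.ERROR.
    -- 2072 is assumed to be ESLURM_INVALID_GRES; verify it against the
    -- slurm_errno.h of the release you actually run.
    local ESLURM_INVALID_GRES = 2072

    function slurm_job_submit(job_desc, part_list, submit_uid)
        if (job_desc.partition == "gpu" and job_desc.gres == nil) then
            slurm.log_user("Please request GPU resources in partition 'gpu', e.g. '#SBATCH --gres=gpu:1'")
            return ESLURM_INVALID_GRES
        end
        return slurm.SUCCESS
    end

With that return value, the second sbatch line becomes the text attached to the
errno (for ESLURM_INVALID_GRES, wording along the lines of "Invalid generic
resource (gres) specification") rather than the generic "Unspecified error".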