Re: [slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?

2020-12-07 Thread Sean Crosby
Hi Loris,

We have a completely separate test system, complete with a few worker
nodes, separate slurmctld/slurmdbd, so we can test Slurm upgrades etc.
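
For anyone wanting to replicate it, a minimal test cluster needs little more
than a stripped-down slurm.conf. As a rough sketch (hostnames, node counts
and GPU specs below are placeholders rather than our actual config, and the
GPU entries also need a matching gres.conf):

  ClusterName=testcluster
  SlurmctldHost=test-ctld
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStorageHost=test-ctld
  JobSubmitPlugins=lua
  GresTypes=gpu
  NodeName=test-node[1-2] CPUs=4 RealMemory=4000 Gres=gpu:2 State=UNKNOWN
  PartitionName=gpgputest Nodes=test-node[1-2] Default=YES State=UP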

Sean

--
Sean Crosby | Senior DevOps HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Mon, 7 Dec 2020 at 19:01, Loris Bennett 
wrote:

> UoM notice: External email. Be cautious of links, attachments, or
> impersonation attempts
>
> Hi Sean,
>
> Thanks for the code - looks like you have put a lot more thought into it
> than I have into mine.  I'll certainly have to look at handling the
> 'tres-per-*' options.
>
> By the way, how do you do your testing?  As I don't have a test
> cluster, currently I'm doing "open heart" testing, but I really need a
> minimal test cluster, maybe using VMs.
>
> Cheers,
>
> Loris
>
> Sean Crosby  writes:
>
> > Hi Loris,
> >
> > This is our submit filter for what you're asking. It checks for both --gres and --gpus
> >
> >   ESLURM_INVALID_GRES=2072
> >   ESLURM_BAD_TASK_COUNT=2025
> >   if (job_desc.partition ~= slurm.NO_VAL) then
> >     if (job_desc.partition ~= nil) then
> >       if (string.match(job_desc.partition,"gpgpu") or string.match(job_desc.partition,"gpgputest")) then
> >         --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
> >         --Alert on invalid gpu count - eg: gpu:0 , gpu:p100:0
> >         if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
> >           local numgpu = string.match(job_desc.gres, ":%d+$")
> >           if (numgpu ~= nil) then
> >             numgpu = numgpu:gsub(':', '')
> >             if (tonumber(numgpu) < 1) then
> >               slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
> >               return ESLURM_INVALID_GRES
> >             end
> >           end
> >         else
> >           --Alternative: use the gpus options in newer versions of Slurm
> >           if (job_desc.tres_per_node == nil) then
> >             if (job_desc.tres_per_socket == nil) then
> >               if (job_desc.tres_per_task == nil) then
> >                 slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
> >                 return ESLURM_INVALID_GRES
> >               else
> >                 if (job_desc.num_tasks == slurm.NO_VAL) then
> >                   slurm.user_msg("--gpus-per-task option requires --tasks specification")
> >                   return ESLURM_BAD_TASK_COUNT
> >                 end
> >               end
> >             end
> >           end
> >         end
> >       end
> >     end
> >   end
> >
> > Let me know if you improve it please? We're always on the hunt to fix up some of the logic in the submit filter.
> >
> > Cheers,
> > Sean
> >
> > --
> > Sean Crosby | Senior DevOps HPC Engineer and HPC Team Lead
> > Research Computing Services | Business Services
> > The University of Melbourne, Victoria 3010 Australia
> >
> > On Fri, 4 Dec 2020 at 23:58, Loris Bennett  wrote:
> >
> >  UoM notice: External email. Be cautious of links, attachments, or impersonation attempts
> >
> >  Hi,
> >
> >  I want to reject jobs that don't specify any GPUs when accessing our GPU
> >  partition and have the following in job_submit.lua:
> >
> >    if (job_desc.partition == "gpu" and job_desc.gres == nil) then
> >       slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
> >                                    "e.g. '#SBATCH --gres=gpu:1' " ..
> >                                    "Please see 'man sbatch' for more details)"))
> >       slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
> >                                    username))
> >       return slurm.ERROR
> >    end
> >
> >  If GRES is not given for the GPU partition, this produces
> >
> >    sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1' Please see 'man sbatch' for more details)
> >sbatch: error: Batch job submission failed: Unspecified error
> >
> >  My questions are:
> >
> >  1. Is there a better error to return?  The 'slurm.ERROR' produces the
> >     generic second error line above (slurm_errno.h just seems to have
> >     ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
> >     might raise).  This is misleading, since the error is in fact known
> >     and specific.
> >  2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
> >     a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
> >
> >  Cheers,
> >
> >  Loris
> >
> >  --
> >  Dr. Loris Bennett (Hr./Mr.)
> >  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
> >
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>


Re: [slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?

2020-12-07 Thread Loris Bennett
Hi Sean,

Thanks for the code - looks like you have put a lot more thought into it
than I have into mine.  I'll certainly have to look at handling the
'tres-per-*' options.

By the way, how do you do your testing?  As I don't have a test
cluster, currently I'm doing "open heart" testing, but I really need a
minimal test cluster, maybe using VMs.
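
In the meantime, one can at least smoke-test the pure Lua logic outside
Slurm by stubbing the 'slurm' table and calling slurm_job_submit() by hand.
An untested sketch (the constants and the fake job_desc fields are made-up
stand-ins, and only the functions my filter actually calls are stubbed):

  -- Minimal stand-in for the 'slurm' table the real plugin provides.
  slurm = {
    NO_VAL   = 4294967294,  -- assumed value; check your Slurm's NO_VAL
    SUCCESS  = 0,
    ERROR    = -1,
    log_user = function(fmt, ...) print("log_user: " .. string.format(fmt, ...)) end,
    log_info = function(fmt, ...) print("log_info: " .. string.format(fmt, ...)) end,
    user_msg = function(fmt, ...) print("user_msg: " .. string.format(fmt, ...)) end,
  }

  dofile("job_submit.lua")  -- defines slurm_job_submit()

  -- Fake submission: GPU partition, no GRES requested.
  local job_desc = { partition = "gpu", gres = nil }
  print("rc =", slurm_job_submit(job_desc, nil, 1000))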

Cheers,

Loris

Sean Crosby  writes:

> Hi Loris,
>
> This is our submit filter for what you're asking. It checks for both --gres 
> and --gpus
>
>   ESLURM_INVALID_GRES=2072
>   ESLURM_BAD_TASK_COUNT=2025
>   if (job_desc.partition ~= slurm.NO_VAL) then
>     if (job_desc.partition ~= nil) then
>       if (string.match(job_desc.partition,"gpgpu") or string.match(job_desc.partition,"gpgputest")) then
>         --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
>         --Alert on invalid gpu count - eg: gpu:0 , gpu:p100:0
>         if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
>           local numgpu = string.match(job_desc.gres, ":%d+$")
>           if (numgpu ~= nil) then
>             numgpu = numgpu:gsub(':', '')
>             if (tonumber(numgpu) < 1) then
>               slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
>               return ESLURM_INVALID_GRES
>             end
>           end
>         else
>           --Alternative: use the gpus options in newer versions of Slurm
>           if (job_desc.tres_per_node == nil) then
>             if (job_desc.tres_per_socket == nil) then
>               if (job_desc.tres_per_task == nil) then
>                 slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
>                 return ESLURM_INVALID_GRES
>               else
>                 if (job_desc.num_tasks == slurm.NO_VAL) then
>                   slurm.user_msg("--gpus-per-task option requires --tasks specification")
>                   return ESLURM_BAD_TASK_COUNT
>                 end
>               end
>             end
>           end
>         end
>       end
>     end
>   end
>
> Let me know if you improve it please? We're always on the hunt to fix up some 
> of the logic in the submit filter.
>
> Cheers,
> Sean
>
> --
> Sean Crosby | Senior DevOps HPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia
>
> On Fri, 4 Dec 2020 at 23:58, Loris Bennett  wrote:
>
>  UoM notice: External email. Be cautious of links, attachments, or 
> impersonation attempts
>
>  Hi,
>
>  I want to reject jobs that don't specify any GPUs when accessing our GPU
>  partition and have the following in job_submit.lua:
>
>    if (job_desc.partition == "gpu" and job_desc.gres == nil) then
>       slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
>                                    "e.g. '#SBATCH --gres=gpu:1' " ..
>                                    "Please see 'man sbatch' for more details)"))
>       slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
>                                    username))
>       return slurm.ERROR
>    end
>
>  If GRES is not given for the GPU partition, this produces
>
>    sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1' Please see 'man sbatch' for more details)
>sbatch: error: Batch job submission failed: Unspecified error
>
>  My questions are:
>
>  1. Is there a better error to return?  The 'slurm.ERROR' produces the
>     generic second error line above (slurm_errno.h just seems to have
>     ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
>     might raise).  This is misleading, since the error is in fact known
>     and specific.
>  2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
>     a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
>
>  Cheers,
>
>  Loris
>
>  -- 
>  Dr. Loris Bennett (Hr./Mr.)
>  ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
-- 
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] [EXT] job_submit.lua - choice of error on failure / job_desc.gpus?

2020-12-04 Thread Sean Crosby
Hi Loris,

This is our submit filter for what you're asking. It checks for both --gres
and --gpus

  ESLURM_INVALID_GRES=2072
  ESLURM_BAD_TASK_COUNT=2025
  if (job_desc.partition ~= slurm.NO_VAL) then
    if (job_desc.partition ~= nil) then
      if (string.match(job_desc.partition,"gpgpu") or string.match(job_desc.partition,"gpgputest")) then
        --slurm.log_info("slurm_job_submit (lua): detect job for gpgpu partition")
        --Alert on invalid gpu count - eg: gpu:0 , gpu:p100:0
        if (job_desc.gres and string.find(job_desc.gres, "gpu")) then
          local numgpu = string.match(job_desc.gres, ":%d+$")
          if (numgpu ~= nil) then
            numgpu = numgpu:gsub(':', '')
            if (tonumber(numgpu) < 1) then
              slurm.log_user("Invalid GPGPU count specified in GRES, must be greater than 0")
              return ESLURM_INVALID_GRES
            end
          end
        else
          --Alternative: use the gpus options in newer versions of Slurm
          if (job_desc.tres_per_node == nil) then
            if (job_desc.tres_per_socket == nil) then
              if (job_desc.tres_per_task == nil) then
                slurm.log_user("You tried submitting to a GPGPU partition, but you didn't request one with GRES or GPUS")
                return ESLURM_INVALID_GRES
              else
                if (job_desc.num_tasks == slurm.NO_VAL) then
                  slurm.user_msg("--gpus-per-task option requires --tasks specification")
                  return ESLURM_BAD_TASK_COUNT
                end
              end
            end
          end
        end
      end
    end
  end

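For what it's worth, the trailing-count pattern can be sanity-checked with
plain Lua, no cluster needed (a quick throwaway test; the sample GRES
strings are just examples):

  for _, gres in ipairs({"gpu", "gpu:0", "gpu:2", "gpu:p100:0", "gpu:p100:4"}) do
    local numgpu = string.match(gres, ":%d+$")
    if numgpu ~= nil then
      numgpu = numgpu:gsub(':', '')
      print(gres, "->", tonumber(numgpu))  -- gpu:0 and gpu:p100:0 print 0
    else
      print(gres, "-> no trailing count")  -- bare "gpu" has no :N suffix
    end
  end
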
Let me know if you improve it please? We're always on the hunt to fix up
some of the logic in the submit filter.
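
One refinement we haven't verified ourselves (an assumption, not something
we run in production): a bare --gpus request may end up in
job_desc.tres_per_job rather than the per-node/socket/task fields checked
above, so a helper along these lines might close that gap:

  -- Assumption: job_desc.tres_per_job exists in this Slurm release and is
  -- where a plain --gpus request lands; verify before relying on it.
  local function requests_gpus(job_desc)
    return job_desc.tres_per_job ~= nil
        or job_desc.tres_per_node ~= nil
        or job_desc.tres_per_socket ~= nil
        or job_desc.tres_per_task ~= nil
  end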

Cheers,
Sean

--
Sean Crosby | Senior DevOps HPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia



On Fri, 4 Dec 2020 at 23:58, Loris Bennett 
wrote:

> UoM notice: External email. Be cautious of links, attachments, or
> impersonation attempts
>
> Hi,
>
> I want to reject jobs that don't specify any GPUs when accessing our GPU
> partition and have the following in job_submit.lua:
>
>   if (job_desc.partition == "gpu" and job_desc.gres == nil) then
>      slurm.log_user(string.format("Please request GPU resources in the partition 'gpu', " ..
>                                   "e.g. '#SBATCH --gres=gpu:1' " ..
>                                   "Please see 'man sbatch' for more details)"))
>      slurm.log_info(string.format("check_parameters: user '%s' did not request GPUs in partition 'gpu'",
>                                   username))
>      return slurm.ERROR
>   end
>
> If GRES is not given for the GPU partition, this produces
>
>   sbatch: error: Please request GPU resources in the partition 'gpu', e.g. '#SBATCH --gres=gpu:1' Please see 'man sbatch' for more details)
>   sbatch: error: Batch job submission failed: Unspecified error
>
> My questions are:
>
> 1. Is there a better error to return?  The 'slurm.ERROR' produces the
>generic second error line above (slurm_errno.h just seems to have
>ESLURM_MISSING_TIME_LIMIT and ESLURM_INVALID_KNL as errors a plugin
>might raise).  This is misleading, since the error is in fact known
>and specific.
> 2. Am I right in thinking that 'job_desc' does not, as of 20.02.06, have
>a 'gpus' field corresponding to the sbatch/srun option '--gpus'?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Hr./Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>