Re: [slurm-users] What is the complete logic to calculate node number in job_submit.lua

2022-09-26 Thread Ole Holm Nielsen

On 9/26/22 08:48, taleinterve...@sjtu.edu.cn wrote:
When designing restrictions in job_submit.lua, I found there is no member 
of the job_desc struct that can directly be used to determine the number of 
nodes finally allocated to a job. job_desc.min_nodes seems to be a close 
answer, but it will be 0xFFFE when the user does not specify the --nodes 
option. In that case we thought we could use job_desc.num_tasks and 
job_desc.ntasks_per_node to calculate the node count. But again, we found 
that ntasks_per_node may also be the default value 0xFFFE if the user does 
not specify the related option.


The hex values which you quote are actually defined as symbols in Slurm as 
slurm.NO_VAL16, slurm.NO_VAL, and slurm.NO_VAL64, which are easier to 
understand :-)  See my notes in 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#lua-functions-for-the-job-submit-plugin
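
A minimal sketch of how those constants can be used in job_submit.lua 
(field names assume a recent Slurm release; the log messages are only 
illustrative):

    -- Sketch only: detect whether the submit options were actually given.
    -- min_nodes and num_tasks are 32-bit fields (unset == slurm.NO_VAL),
    -- ntasks_per_node is a 16-bit field (unset == slurm.NO_VAL16).
    function slurm_job_submit(job_desc, part_list, submit_uid)
       if job_desc.min_nodes == slurm.NO_VAL then
          slurm.log_info("job_submit: --nodes not specified")
       end
       if job_desc.ntasks_per_node == slurm.NO_VAL16 then
          slurm.log_info("job_submit: --ntasks-per-node not specified")
       end
       return slurm.SUCCESS
    end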


So what is the complete and elegant way to predict the job node count in 
job_submit.lua in all cases, no matter how users write their submit options?


The sbatch command provides defaults for nodes and tasks if these are not 
defined in the user's job script; see the sbatch manual page:


If -N is not specified, the default behavior is to allocate enough nodes to 
satisfy the requested resources as expressed by per-job specification 
options, e.g. -n, -c and --gpus.


and


-n The default is one task per node, but note that the --cpus-per-task option 
will change this default.


Therefore you do not need to guess the number of nodes and tasks in your 
job_submit.lua script.
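
If you nevertheless want a rough lower bound on the node count inside 
job_submit.lua, a sketch that falls back to the documented sbatch defaults 
when a field is unset could look like this (untested, and it ignores any 
extra nodes implied by -c or --gpus):

    local function estimated_min_nodes(job_desc)
       -- the user gave -N/--nodes explicitly
       if job_desc.min_nodes ~= slurm.NO_VAL then
          return job_desc.min_nodes
       end
       local ntasks = job_desc.num_tasks
       if ntasks == slurm.NO_VAL then
          ntasks = 1                  -- sbatch default: a single task
       end
       local per_node = job_desc.ntasks_per_node
       if per_node == slurm.NO_VAL16 or per_node == 0 then
          per_node = ntasks           -- no per-node limit requested
       end
       return math.ceil(ntasks / per_node)
    end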


/Ole



Re: [slurm-users] What is the complete logic to calculate node number in job_submit.lua

2022-09-26 Thread Loris Bennett
taleinterve...@sjtu.edu.cn writes:

> Hi all:
>
>  
>
> When designing restrictions in job_submit.lua, I found there is no member 
> of the job_desc struct that can directly be used to determine the number 
> of nodes finally allocated to a job. job_desc.min_nodes seems to
> be a close answer, but it will be 0xFFFE when the user does not specify 
> the --nodes option. In that case we thought we could use job_desc.num_tasks 
> and job_desc.ntasks_per_node to calculate the node count.
> But again, we found that ntasks_per_node may also be the default value 
> 0xFFFE if the user does not specify the related option.
>
> So what is the complete and elegant way to predict the job node count in 
> job_submit.lua in all cases, no matter how users write their submit options?

I don't think you can expect to know the node(s) a job will eventually
run on at submission time.  How would this work?  Resources will become
available earlier than Slurm expects if jobs finish before their given
time limit (or if they crash).  If you are using fairshare, jobs can be
scheduled which have a higher priority than the currently waiting jobs.

What is your use case for needing to know the node the job will run on?

Cheers,

Loris



[slurm-users] slurm jobs and the amount of licenses (matlab)

2022-09-26 Thread Josef Dvoracek

hello @list!

Has anyone been dealing with the following scenario?

* we have a limited number of Matlab network licenses (and various 
features have various numbers of available seats, e.g. Machine Learning: N 
licenses, Image_Toolbox: M licenses)
* licenses are being used by Slurm jobs and by individual users directly 
at their workstations (workstations are not under my control)

Sometimes it happens that the licenses for a certain feature used in 
a particular Slurm job are already fully consumed, and the job fails.

Is there any straightforward trick for dealing with that? Other than 
buying a dedicated pool of licenses for our Slurm-based data processing 
facility?

E.g. let the Slurm job wait until the required license is available?

cheers

josef






Re: [slurm-users] slurm jobs and the amount of licenses (matlab)

2022-09-26 Thread Davide DelVento
Are your licenses used only for the Slurm cluster(s), or are they
shared with laptops, workstations and/or other computing equipment not
managed by Slurm?
In the former case, the "local" licenses described in the
documentation will do the trick (but Slurm does not automatically
enforce their use, so either strong user education is needed, or
further scripting). In the latter case, more work is needed. See my
other thread on this topic from two weeks ago, which I plan to pick up
later this week.
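
For the "local" case, a hedged sketch (feature names and counts are purely
illustrative): define the pool in slurm.conf, e.g.

    Licenses=Image_Toolbox:2,Machine_Learning:4

and have jobs request a token with e.g. "sbatch -L Image_Toolbox:1 job.sh".
Slurm then keeps the job pending until a token is free, but only for jobs
that actually pass -L, hence the need for user education or extra scripting.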

On Mon, Sep 26, 2022 at 5:07 AM Josef Dvoracek  wrote:
>
> hello @list!
>
> Has anyone been dealing with the following scenario?
>
> * we have a limited number of Matlab network licenses (and various
> features have various numbers of available seats, e.g. Machine Learning: N
> licenses, Image_Toolbox: M licenses)
> * licenses are being used by Slurm jobs and by individual users directly
> at their workstations (workstations are not under my control)
>
> Sometimes it happens that the licenses for a certain feature used in
> a particular Slurm job are already fully consumed, and the job fails.
>
> Is there any straightforward trick for dealing with that? Other than
> buying a dedicated pool of licenses for our Slurm-based data processing
> facility?
>
> E.g. let the Slurm job wait until the required license is available?
>
> cheers
>
> josef



Re: [slurm-users] What is the complete logic to calculate node number in job_submit.lua

2022-09-26 Thread Ole Holm Nielsen

Hi Loris,

On 9/26/22 12:51, Loris Bennett wrote:

When designing restrictions in job_submit.lua, I found there is no member 
of the job_desc struct that can directly be used to determine the number of 
nodes finally allocated to a job. job_desc.min_nodes seems to be a close 
answer, but it will be 0xFFFE when the user does not specify the --nodes 
option. In that case we thought we could use job_desc.num_tasks and 
job_desc.ntasks_per_node to calculate the node count. But again, we found 
that ntasks_per_node may also be the default value 0xFFFE if the user does 
not specify the related option.

So what is the complete and elegant way to predict the job node count in 
job_submit.lua in all cases, no matter how users write their submit options?


I don't think you can expect to know the node(s) a job will eventually
run on at submission time.  How would this work?  Resources will become
available earlier than Slurm expects if jobs finish before their given
time limit (or if they crash).  If you are using fairshare, jobs can be
scheduled which have a higher priority than the currently waiting jobs.

What is your use case for needing to know the node the job will run on?


I think he meant the *number of nodes*, and not the *hostnames* of the 
compute nodes selected by Slurm at a later time.


/Ole



Re: [slurm-users] Recommended amount of memory for the database server

2022-09-26 Thread Paul Edmon
It should generally be as much as you need to hold the full database in 
memory.  That said, if you are storing job envs and scripts, that will be 
a lot of data, even with the deduping they are doing.  We've generally 
used a buffer size of about 90 GB here without much of any issue, even 
though our database is bigger than that.
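
For reference, that buffer is MariaDB's InnoDB buffer pool, configured in 
my.cnf; the value below is only a placeholder, not a recommendation:

    [mysqld]
    innodb_buffer_pool_size=32G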


-Paul Edmon-

On 9/25/22 5:18 PM, byron wrote:

Hi

Does anyone know what the recommended amount of memory is to give 
Slurm's MariaDB database server?

I seem to remember reading a simple estimate based on the size of 
certain tables (or something along those lines), but I can't find it now.


Thanks





Re: [slurm-users] slurm jobs and the amount of licenses (matlab)

2022-09-26 Thread Alois Schlögl




On 9/26/22 at 13:04, Josef Dvoracek wrote:

hello @list!

Has anyone been dealing with the following scenario?

* we have a limited number of Matlab network licenses (and various 
features have various numbers of available seats, e.g. Machine Learning: 
N licenses, Image_Toolbox: M licenses)
* licenses are being used by Slurm jobs and by individual users 
directly at their workstations (workstations are not under my control)

Sometimes it happens that the licenses for a certain feature used in 
a particular Slurm job are already fully consumed, and the job fails.

Is there any straightforward trick for dealing with that? Other than 
buying a dedicated pool of licenses for our Slurm-based data processing 
facility?

E.g. let the Slurm job wait until the required license is available?

cheers

josef




Hello Josef,


yes, we have a similar scenario. There is no straightforward way of handling 
this, and Slurm configuration can help only to a certain extent.


The main reason for this is that license usage depends on how the jobs 
are distributed among the nodes; licenses are counted per user and per node.
That means a user running 5 jobs on a single node requires 1 license, 5 
users each running 1 job require 5 licenses, and 1 user running 5 jobs on 5 
different nodes also requires 5 licenses.
If 5 users run 5 jobs on 5 different nodes each, up to 25 licenses might 
be needed. If you have 10 users and 20 nodes, you might need up to 200 
licenses.
And node-based licenses cannot be accessed remotely by multiple users. 
Because of that, even a dedicated pool with one license per node might 
not be sufficient.


Our approach is to restrict the number of nodes where Matlab is running, 
and to identify those nodes with the feature "matlab", which can be selected 
with "--constraint=matlab".
Moreover, the largest nodes have Matlab, so that these jobs run on fewer 
nodes and a smaller number of licenses is needed. You might also want 
to enforce -singleCompThread, because the speedup from multithreading is 
not always what you expect, especially if you have nodes with a large 
number of cores.
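
A hedged sketch of such a job script (the module name, script name and the 
optional license request are placeholders; -batch assumes a recent Matlab 
release):

    #!/bin/bash
    #SBATCH --constraint=matlab         # only nodes tagged with the "matlab" feature
    #SBATCH --cpus-per-task=1           # matches -singleCompThread below
    #SBATCH --licenses=Image_Toolbox:1  # only if licenses are also tracked in slurm.conf

    module load matlab                  # site-specific
    matlab -singleCompThread -batch "my_analysis"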


In addition, we monitor the overall license usage on a per-toolbox 
basis, and based on these numbers we decide how many licenses are needed 
for which toolbox.


And if somebody is not happy with the limited number of licenses, we 
point out that there is also Octave, where the number of available 
licenses is not limited, and where the number of cores used can be 
controlled with OMP_NUM_THREADS (unlike in Matlab).
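
For example (the Octave invocation is only an illustration):

    OMP_NUM_THREADS=4 octave --eval "run('my_analysis.m')"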



Cheers,
  Alois





Re: [slurm-users] What is the complete logic to calculate node number in job_submit.lua

2022-09-26 Thread Loris Bennett
Hi Ole,

Ole Holm Nielsen  writes:

> Hi Loris,
>
> On 9/26/22 12:51, Loris Bennett wrote:
>>> When designing restrictions in job_submit.lua, I found there is no member 
>>> of the job_desc struct that can directly be used to determine the number 
>>> of nodes finally allocated to a job. job_desc.min_nodes seems to
>>> be a close answer, but it will be 0xFFFE when the user does not specify 
>>> the --nodes option. In that case we thought we could use job_desc.num_tasks 
>>> and job_desc.ntasks_per_node to calculate the node count.
>>> But again, we found that ntasks_per_node may also be the default value 
>>> 0xFFFE if the user does not specify the related option.
>>>
>>> So what is the complete and elegant way to predict the job node count in 
>>> job_submit.lua in all cases, no matter how users write their submit options?
>> I don't think you can expect to know the node(s) a job will eventually
>> run on at submission time.  How would this work?  Resources will become
>> available earlier than Slurm expects if jobs finish before their given
>> time limit (or if they crash).  If you are using fairshare, jobs can be
>> scheduled which have a higher priority than the currently waiting jobs.
>> What is your use case for needing to know the node the job will run on?
>
> I think he meant the *number of nodes*, and not the *hostnames* of the compute
> nodes selected by Slurm at a later time.

Ah, OK, you may be right.  However, unless the user restricts the job to
an exact number of nodes, the actual number which will be used in the end
is also unknowable at the time of submission, isn't it?

Cheers,

Loris