Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-28 Thread Miguel Oliveira
Hi Gérard,

If I understood you correctly, your goal is to limit the number of minutes each 
project can run. By associating each project with a Slurm account that has its 
own NoDecay QoS, you achieve exactly that.
Try a project with a very small limit and you will see that its jobs won't run.

You don't have to add anything up yourself. Each QoS accumulates its respective 
usage, i.e., the usage of all users on that account. Users can even belong to 
different accounts (projects) and charge the appropriate project with the 
--account parameter of sbatch.
The GrpTRESMins limit is always changed on the QoS, with a command like:

sacctmgr modify qos where name=... set GrpTRESMins=cpu=...
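
As a minimal end-to-end sketch (the project name "projX" and the cpu-minutes 
figure below are placeholders, not values from this thread):

    # create a per-project QoS whose usage counter never decays
    sacctmgr add qos projX
    sacctmgr modify qos where name=projX set Flags=NoDecay GrpTRESMins=cpu=100000
    # attach it to the project's account and make it the default
    sacctmgr modify account where name=projX set QOS=projX DefaultQOS=projX
    # users then charge that project explicitly
    sbatch --account=projX --qos=projX job.sh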

Hope that makes sense!

Best,

MAO

> On 28 Jun 2022, at 18:30, gerard@cines.fr wrote:
> 
> Hi Miguel,
> 
> OK, I didn't know this command.
> 
> I'm not sure I understand how it works with regard to my goal.
> I used the following command, inspired by the one you gave me, and I obtain 
> a UsageRaw for each QOS. 
> 
> scontrol -o show assoc_mgr accounts=myaccount users=" "
> 
> 
> Do I have to sum up all the QOS RawUsage values to obtain the RawUsage of 
> myaccount with NoDecay?
> If I set GrpTRESMins for an account and not for a QOS, does Slurm sum up 
> these QOS RawUsage values to check whether the account's GrpTRESMins limit 
> is reached?
> 
> Thanks again for your invaluable help.
> 
> Gérard 
>  
> 
> From: "Miguel Oliveira" 
> To: "Slurm-users" 
> Sent: Tuesday, 28 June 2022 17:23:18
> Subject: Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage
> Hi Gérard,
> 
> What you are checking is the association counter, which ought to decrease 
> so that fairshare works appropriately.
> The counter that does not decrease is on the QoS, not on the association. 
> You can check it with:
> 
> scontrol -o show assoc_mgr | grep "^QOS='+account+'"
> 
> That ought to give you two numbers: the first is the limit (or N for no 
> limit), and the second, in parentheses, is the usage.
> 
> Hope that helps.
> 
> Best,
> 
> Miguel Afonso Oliveira
> 
> On 28 Jun 2022, at 08:58, gerard@cines.fr wrote:
> 
> Hi Miguel,
> 
> 
> I modified my test configuration to evaluate the effect of NoDecay.
> 
> I modified all QOSes, adding the NoDecay flag.
> 
> toto@login1:~/TEST$ sacctmgr show QOS
>       Name Priority GraceTime PreemptMode   Flags UsageFactor   GrpTRES   MaxTRES    MaxWall MaxTRESPU
> ---------- -------- --------- ----------- ------- ----------- --------- --------- ---------- ---------
>     normal        0  00:00:00     cluster NoDecay        1.00
> interactif       10  00:00:00     cluster NoDecay        1.00   node=50   node=22 1-00:00:00   node=50
>      petit        4  00:00:00     cluster NoDecay        1.00 node=1500   node=22 1-00:00:00  node=300
>       gros        6  00:00:00     cluster NoDecay        1.00 node=2106  node=700 1-00:00:00  node=700
>      court        8  00:00:00     cluster NoDecay        1.00 node=1100  node=100   02:00:00  node=300
>       long        4  00:00:00     cluster NoDecay        1.00  node=500  node=200 5-00:00:00  node=200
>    special       10  00:00:00     cluster NoDecay        1.00 node=2106 node=2106 5-00:00:00 node=2106
>    support       10  00:00:00     cluster NoDecay        1.00 node=2106  node=700 1-00:00:00 node=2106
>       visu       10  00:00:00     cluster NoDecay        1.00    node=4  node=700   06:00:00    node=4

Re: [slurm-users] "Plugin is corrupted" message when using drmaa / debugging libslurm

2022-06-28 Thread Chris Samuel

On 28/6/22 12:19 pm, Jean-Christophe HAESSIG wrote:


Hi,

I'm facing a weird issue where launching a job through drmaa
(https://github.com/natefoo/slurm-drmaa) aborts with the message "Plugin
is corrupted", but only when that job is placed from one of my compute
nodes. Running the command from the login node seems to work.


I suspect this is where your error is happening:

https://github.com/SchedMD/slurm/blob/1ce55318222f89fbc862ce559edfd17e911fee38/src/common/plugin.c#L284

It's checking that it can load the plugin without hitting any unresolved 
library symbols. The fact that you are hitting this suggests you're missing 
libraries on the compute nodes that are present on the login node (or there's 
some reason they're not being found there even if present).
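
One quick way to confirm that is to resolve each plugin's symbols by hand on 
a compute node (the plugin directory below is an assumption; check PluginDir 
in your slurm.conf for the real location):

    # 'ldd -r' performs relocations and reports any undefined symbols
    for p in /usr/lib64/slurm/*.so; do
        ldd -r "$p" 2>&1 | grep -q 'undefined symbol' && echo "$p"
    done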


[...]

Anyway, the message seems to originate from libslurm36 and I would like
to activate the debug messages (debug3, debug4). Is there a way to do
this with an environment variable or any other convenient method?


This depends on what part of Slurm is generating these errors: is it 
something like sbatch or srun? If so, using multiple -v's will increase 
the debug level so you can pick those up. If it's from slurmd, then 
you'll want to set SlurmdDebug to "debug3" in your slurm.conf.
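
For instance (a sketch, not from the original message):

    # client side: each additional -v raises the verbosity one level
    srun -vvv hostname

    # daemon side: set SlurmdDebug=debug3 in slurm.conf, then re-read it
    scontrol reconfigure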


Once that's done you should get the information on what symbols are not 
being found and that should give you some insight into what's going on.


Best of luck,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-28 Thread gerard . gil
Hi Miguel, 

OK, I didn't know this command. 

I'm not sure I understand how it works with regard to my goal. 
I used the following command, inspired by the one you gave me, and I obtain a 
UsageRaw for each QOS. 

scontrol -o show assoc_mgr accounts=myaccount users=" " 

Do I have to sum up all the QOS RawUsage values to obtain the RawUsage of 
myaccount with NoDecay? 
If I set GrpTRESMins for an account and not for a QOS, does Slurm sum up these 
QOS RawUsage values to check whether the account's GrpTRESMins limit is 
reached? 

Thanks again for your invaluable help. 

Gérard 

> From: "Miguel Oliveira" 
> To: "Slurm-users" 
> Sent: Tuesday, 28 June 2022 17:23:18
> Subject: Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

> Hi Gérard,

> What you are checking is the association counter, which ought to decrease 
> so that fairshare works appropriately.
> The counter that does not decrease is on the QoS, not on the association. 
> You can check it with:

> scontrol -o show assoc_mgr | grep "^QOS='+account+'"

> That ought to give you two numbers: the first is the limit (or N for no 
> limit), and the second, in parentheses, is the usage.

> Hope that helps.

> Best,

> Miguel Afonso Oliveira

>> On 28 Jun 2022, at 08:58, gerard@cines.fr wrote:

>> Hi Miguel,

>> I modified my test configuration to evaluate the effect of NoDecay.

>> I modified all QOSes, adding the NoDecay flag.

>> toto@login1:~/TEST$ sacctmgr show QOS
>>       Name Priority GraceTime PreemptMode   Flags UsageFactor   GrpTRES   MaxTRES    MaxWall MaxTRESPU
>> ---------- -------- --------- ----------- ------- ----------- --------- --------- ---------- ---------
>>     normal        0  00:00:00     cluster NoDecay        1.00
>> interactif       10  00:00:00     cluster NoDecay        1.00   node=50   node=22 1-00:00:00   node=50
>>      petit        4  00:00:00     cluster NoDecay        1.00 node=1500   node=22 1-00:00:00  node=300
>>       gros        6  00:00:00     cluster NoDecay        1.00 node=2106  node=700 1-00:00:00  node=700
>>      court        8  00:00:00     cluster NoDecay        1.00 node=1100  node=100   02:00:00  node=300
>>       long        4  00:00:00     cluster NoDecay        1.00  node=500  node=200 5-00:00:00  node=200
>>    special       10  00:00:00     cluster NoDecay        1.00 node=2106 node=2106 5-00:00:00 node=2106
>>    support       10  00:00:00     cluster NoDecay        1.00 node=2106  node=700 1-00:00:00 node=2106
>>       visu       10  00:00:00     cluster NoDecay        1.00    node=4  node=700   06:00:00    node=4

>> I submitted a bunch of jobs to check that NoDecay works, and I noticed that
>> RawUsage, as well as GrpTRESRaw cpu, is still decreasing.

>> toto@login1:~/TEST$ sshare -A dci -u " " -o account,user,GrpTRESRaw%80,GrpTRESMins,RawUsage
>> Account User GrpTRESRaw                                                                    GrpTRESMins RawUsage
>> ------- ---- ----------------------------------------------------------------------------- ----------- --------
>> dci          cpu=6932,mem=12998963,energy=0,node=216,billing=6932,fs/disk=0,vmem=0,pages=0    cpu=17150   415966
>> toto@login1:~/TEST$ sshare -A dci -u " " -o account,user,GrpTRESRaw%80,GrpTRESMins,RawUsage
>> Account User GrpTRESRaw                                                                    GrpTRESMins RawUsage
>> ------- ---- ----------------------------------------------------------------------------- ----------- --------
>> dci          cpu=6931,mem=12995835,energy=0,node=216,billing=6931,fs/disk=0,vmem=0,pages=0    cpu=17150   415866
>> toto@login1:~/TEST$ sshare -A dci -u " " -o account,user,GrpTRESRaw%80,GrpTRESMins,RawUsage
>> Account User GrpTRESRaw                                                                    GrpTRESMins RawUsage
>> ------- ---- ----------------------------------------------------------------------------- ----------- --------
>> dci          cpu=6929,mem=12992708,energy=0,node=216,billing=6929,fs/disk=0,vmem=0,pages=0    cpu=17150   415766

>> Is there something I forgot to do?

>> Best,
>> Gérard

>> Cordialement,
>> Gérard Gil

>> Département Calcul Intensif
>> Centre Informatique National de l'Enseignement Superieur
>> 950, rue de Saint Priest
>> 34097 Montpellier CEDEX 5
>> FRANCE

>> tel : (334) 67 14 14 14
>> fax : (334) 67 52 37 63
>> web : http://www.cines.fr/

>>> From: "Gérard Gil" <gerard@cines.fr>
>>> To: "Slurm-users" <slurm-users@lists.schedmd.com>
>>> Cc: "slurm-users" <slurm-us...@schedmd.com>
>>> Sent: Friday, 24 June 2022 14:52:12
>>> Subject: Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

>>> Hi 

[slurm-users] "Plugin is corrupted" message when using drmaa / debugging libslurm

2022-06-28 Thread Jean-Christophe HAESSIG
Hi,

I'm facing a weird issue where launching a job through drmaa 
(https://github.com/natefoo/slurm-drmaa) aborts with the message "Plugin 
is corrupted", but only when that job is placed from one of my compute 
nodes. Running the command from the login node seems to work.

My cluster runs Slurm 20.11, and the issue appeared when it was migrated 
either to that version or to the one before (19.05); it is hard to tell 
because the two updates were very close together.

Anyway, the message seems to originate from libslurm36, and I would like 
to activate the debug messages (debug3, debug4). Is there a way to do 
this with an environment variable or any other convenient method?

I'd like to trace where exactly it fails, since I compared the Slurm 
libraries on the compute nodes and on my login node and couldn't find any 
difference. strace didn't yield anything interesting either.

Thank you,
J.C. Haessig

Re: [slurm-users] GrpTRESMins and GrpTRESRaw usage

2022-06-28 Thread Miguel Oliveira
Hi Gérard,

What you are checking is the association counter, which ought to decrease so 
that fairshare works appropriately.
The counter that does not decrease is on the QoS, not on the association. You 
can check it with:

scontrol -o show assoc_mgr | grep "^QOS='+account+'"

That ought to give you two numbers: the first is the limit (or N for no 
limit), and the second, in parentheses, is the usage.
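
As a concrete illustration (the QoS name "projX" is a placeholder, and the 
limit(usage) figures are taken from Gérard's output earlier in the thread):

    # restrict the assoc_mgr dump to one QoS and pull out GrpTRESMins;
    # the output looks like GrpTRESMins=cpu=17150(6932),
    # i.e. a limit of 17150 cpu-minutes with 6932 already used
    scontrol -o show assoc_mgr flags=qos qos=projX | grep -o 'GrpTRESMins=[^ ]*'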

Hope that helps.

Best,

Miguel Afonso Oliveira

> On 28 Jun 2022, at 08:58, gerard@cines.fr wrote:
> 
> Hi Miguel,
> 
> 
> I modified my test configuration to evaluate the effect of NoDecay.
> 
> I modified all QOSes, adding the NoDecay flag.
> 
> toto@login1:~/TEST$ sacctmgr show QOS
>       Name Priority GraceTime PreemptMode   Flags UsageFactor   GrpTRES   MaxTRES    MaxWall MaxTRESPU
> ---------- -------- --------- ----------- ------- ----------- --------- --------- ---------- ---------
>     normal        0  00:00:00     cluster NoDecay        1.00
> interactif       10  00:00:00     cluster NoDecay        1.00   node=50   node=22 1-00:00:00   node=50
>      petit        4  00:00:00     cluster NoDecay        1.00 node=1500   node=22 1-00:00:00  node=300
>       gros        6  00:00:00     cluster NoDecay        1.00 node=2106  node=700 1-00:00:00  node=700
>      court        8  00:00:00     cluster NoDecay        1.00 node=1100  node=100   02:00:00  node=300
>       long        4  00:00:00     cluster NoDecay        1.00  node=500  node=200 5-00:00:00  node=200
>    special       10  00:00:00     cluster NoDecay        1.00 node=2106 node=2106 5-00:00:00 node=2106
>    support       10  00:00:00     cluster NoDecay        1.00 node=2106  node=700 1-00:00:00 node=2106
>       visu       10  00:00:00     cluster NoDecay        1.00    node=4  node=700   06:00:00    node=4
> 
> 
> 
> I submitted a bunch of jobs to check that NoDecay works, and I noticed that 
> RawUsage, as well as GrpTRESRaw cpu, is still decreasing.
> 
> 
> toto@login1:~/TEST$ sshare -A dci -u " " -o account,user,GrpTRESRaw%80,GrpTRESMins,RawUsage
> Account User GrpTRESRaw                                                                    GrpTRESMins RawUsage
> ------- ---- ----------------------------------------------------------------------------- ----------- --------
> dci          cpu=6932,mem=12998963,energy=0,node=216,billing=6932,fs/disk=0,vmem=0,pages=0    cpu=17150   415966