Re: [slurm-users] [EXT] rejecting jobs that exceed QOS limits

2021-05-28 Thread Sean Crosby
Hi Paul,

Try

sacctmgr modify qos gputest set flags=DenyOnLimit

Sean

From: slurm-users on behalf of Paul Raines
Sent: Saturday, 29 May 2021 12:48
To: slurm-users@lists.schedmd.com 
Subject: [EXT] [slurm-users] rejecting jobs that exceed QOS limits


I want to dedicate one of our GPU servers for testing where
users are only allowed to run 1 job at a time using 1 GPU and
8 cores of the server.  So I put one server in a partition on its
own and set up a QOS for it as follows:

  sacctmgr add qos gputest
  sacctmgr modify qos gputest set priority=20
  sacctmgr modify qos gputest set MaxJobsPerUser=1
  sacctmgr modify qos gputest set MaxTRESPerUser=cpu=8,gres/gpu=1
  sacctmgr show qos format=name,priority,MaxTRESPerUser,MaxJobsPerUser

In slurm.conf I have:

AccountingStorageEnforce=safe,qos
AccountingStorageTRES=Billing,CPU,Energy,Mem,Node,FS/Disk,Pages,VMem,gres/gpu
EnforcePartLimits=ALL


This works, but when I submit a job asking for 2 or more GPUs, instead
of being immediately rejected it queues but never runs. The same happens
if I ask for more than 8 cores.

Is there a way to get it immediately rejected?
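
To illustrate the check that the DenyOnLimit suggestion above relies on, here is a minimal Python sketch. It is a toy model, not Slurm's code: the function names (parse_tres, exceeds_limit) are made up for this example, and the only values taken from the thread are the MaxTRESPerUser string and the kind of request that got stuck in the queue. It compares a single job's request against the per-user maximum, which is the comparison that would let an over-limit job be refused at submit time rather than left pending.

```python
from typing import Dict, List


def parse_tres(spec: str) -> Dict[str, int]:
    """Turn a TRES string such as 'cpu=8,gres/gpu=1' into a dict of limits."""
    limits = {}
    for item in spec.split(","):
        name, _, count = item.partition("=")
        limits[name.strip()] = int(count)
    return limits


def exceeds_limit(request: Dict[str, int], limit_spec: str) -> List[str]:
    """Return the TRES names for which one job asks for more than the limit."""
    limits = parse_tres(limit_spec)
    return [name for name, wanted in request.items()
            if name in limits and wanted > limits[name]]


# Limit string copied from the QOS definition in this thread.
MAX_TRES_PER_USER = "cpu=8,gres/gpu=1"

print(exceeds_limit({"cpu": 4, "gres/gpu": 2}, MAX_TRES_PER_USER))  # ['gres/gpu']
print(exceeds_limit({"cpu": 8, "gres/gpu": 1}, MAX_TRES_PER_USER))  # []
```

A non-empty result corresponds to a job that can never run under this QOS on its own. Without the flag such a job is accepted and simply pends, which matches the behaviour described in the question.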



[slurm-users] rejecting jobs that exceed QOS limits

2021-05-28 Thread Paul Raines



I want to dedicate one of our GPU servers for testing where
users are only allowed to run 1 job at a time using 1 GPU and
8 cores of the server.  So I put one server in a partition on its
own and set up a QOS for it as follows:

 sacctmgr add qos gputest
 sacctmgr modify qos gputest set priority=20
 sacctmgr modify qos gputest set MaxJobsPerUser=1
 sacctmgr modify qos gputest set MaxTRESPerUser=cpu=8,gres/gpu=1
 sacctmgr show qos format=name,priority,MaxTRESPerUser,MaxJobsPerUser

In slurm.conf I have:

AccountingStorageEnforce=safe,qos
AccountingStorageTRES=Billing,CPU,Energy,Mem,Node,FS/Disk,Pages,VMem,gres/gpu
EnforcePartLimits=ALL


This works, but when I submit a job asking for 2 or more GPUs, instead
of being immediately rejected it queues but never runs. The same happens
if I ask for more than 8 cores.

Is there a way to get it immediately rejected?



Re: [slurm-users] DMTCP or MANA with Slurm?

2021-05-28 Thread Christopher Samuel

On 5/27/21 12:26 pm, Prentice Bisbal wrote:

> Given the lack of traffic on the mailing list and lack of releases, I'm
> beginning to think that both of these projects are all but abandoned.


They're definitely actively working on it - I've given them a heads up 
on this to let them know how it's being perceived. Thanks for mentioning it!


All the best!
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



[slurm-users] REST API

2021-05-28 Thread Hoot Thompson
I have the REST API basically working, but I am having a problem with job
submission syntax. The error I receive is "Unable to parse query". I have
followed the guides found online to no avail. Is there somewhere to look for
what the issue may be?



[slurm-users] %x in job names

2021-05-28 Thread Bill Barth
We noticed today that a %x anywhere in a job name, such as

#SBATCH -J abcdefghijklmnopqrstuvw%xyz

will send scontrol (and maybe other %x-respecting programs) into an
infinite loop. We had a user's cron job launching 'scontrol show job ##'
regularly on a system, and it was just going off the rails and eating resources
until we killed it. The Slurm version 18.08.4 release email says that

-- Expand %x in job name in 'scontrol show job'.

...so I wonder if that is armored to look for self-referential calls. I
haven't looked at the code, myself. I thought I'd give a heads up. I don't 
think our user was being malicious, and their actual -J was

#SBATCH -J sd-PBEpvw9040%x

Probably a hash and probably machine-generated/unlucky. 

I hope this helps and is actually a problem report. We're on 18.08.5, so I hope 
we don't have to go backwards to stop this error.

Best regards,
Bill.

-- 
Bill Barth, Ph.D., Director, FutureTechnologies
bba...@tacc.utexas.edu|   Phone: (512) 232-7069
Office: ROC 1.435|   Fax:   (512) 475-9445
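
One way to picture the suspected failure mode is a substitution that keeps re-introducing the pattern it is trying to remove. The Python sketch below is only a guess at the mechanism, not Slurm's code (which, as noted above, has not been inspected): the function names and the pass limit are invented for the example. It expands %x to the job name repeatedly and reports whether the result ever stops changing.

```python
from typing import Tuple


def expand_once(template: str, job_name: str) -> str:
    """Replace every %x in the template with the job name (a single pass)."""
    return template.replace("%x", job_name)


def expand_until_stable(job_name: str, max_passes: int = 5) -> Tuple[str, bool]:
    """Expand %x repeatedly, giving up after max_passes.

    Returns (last_result, converged). A name that itself contains %x never
    converges, because each pass puts a fresh %x back into the string.
    """
    current = job_name
    for _ in range(max_passes):
        expanded = expand_once(current, job_name)
        if expanded == current:
            return current, True
        current = expanded
    return current, False


# Job name taken from the report above.
result, converged = expand_until_stable("sd-PBEpvw9040%x")
print(converged)        # False
print("%x" in result)   # True
```

If the real expansion loops in a similar way without a pass limit, it would never terminate for such names. A submit-time check for "%x" in the job name would be one possible stopgap, though that is a suggestion and not something confirmed in this thread.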
 
 



Re: [slurm-users] Parent accounts

2021-05-28 Thread Ole Holm Nielsen

Hi Stefan,

On 5/28/21 3:31 PM, Stefan Staeglich wrote:

> for our monitoring system I want to query the account hierarchy. Is there a
> better approach than to parse the output of
>
> sacctmgr list account withasso -nP


One approach is to use the Slurm sreport tool which displays the account 
hierarchy tree:


$ sreport -t hourper --tres=cpu,gpu cluster AccountUtilizationByUser \
    Start=0501 End=0528 format=Accounts,Login,Proper%30,TresName%9,Used tree


I think you could perhaps also be inspired by this example:

$ scontrol -o show assoc_mgr users=XXX account=camdvip flags=Assoc
Current Association Manager state
Association Records
ClusterName=niflheim Account=camdvip UserName= Partition= Priority=0 ID=25 
SharesRaw/Norm/Level/Factor=2147483647/0.00/549/0.00 
UsageRaw/Norm/Efctv=8677881859.88/0.18/0.18 ParentAccount=camd(16) 
Lft=1385 DefAssoc=No GrpJobs=N(109) GrpJobsAccrue=N(86) 
GrpSubmitJobs=N(678) GrpWall=N(2357791.66) 
GrpTRES=cpu=N(5176),mem=N(56741000),energy=N(0),node=N(179),billing=N(6788),fs/disk=N(0),vmem=N(0),pages=N(0) 
GrpTRESMins=cpu=N(110849595),mem=N(1098107088383),energy=N(0),node=N(4074184),billing=N(143938180),fs/disk=N(0),vmem=N(0),pages=N(0) 
GrpTRESRunMins=cpu=N(7715490),mem=N(85269922566),energy=N(0),node=N(272766),billing=N(9733605),fs/disk=N(0),vmem=N(0),pages=N(0) 
MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= 
MaxTRESMinsPJ= MinPrioThresh=
ClusterName=niflheim Account=camdvip UserName=XXX(261375) Partition= 
Priority=0 ID=712 SharesRaw/Norm/Level/Factor=3/0.01/549/0.21 
UsageRaw/Norm/Efctv=471217489.26/0.01/0.01 ParentAccount= Lft=1392 
DefAssoc=Yes GrpJobs=N(6) GrpJobsAccrue=30(0) GrpSubmitJobs=N(12) 
GrpWall=N(116502.58) 
GrpTRES=cpu=1500(240),mem=N(225),energy=N(0),node=N(6),billing=N(396),fs/disk=N(0),vmem=N(0),pages=N(0) 
GrpTRESMins=cpu=N(4727973),mem=N(44203335956),energy=N(0),node=N(119926),billing=N(7763388),fs/disk=N(0),vmem=N(0),pages=N(0) 
GrpTRESRunMins=cpu=400(214620),mem=N(2012062500),energy=N(0),node=N(5365),billing=N(354123),fs/disk=N(0),vmem=N(0),pages=N(0) 
MaxJobs=500(6) MaxJobsAccrue=30(0) MaxSubmitJobs=1000(12) MaxWallPJ= 
MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=



The line with UserName= (empty string) is the parent account.

I'm using this approach to print user limits in my showuserlimits tool,
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits

I hope this helps.

/Ole
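
The assoc_mgr output shown above lends itself to simple token parsing. The following Python sketch is one possible way to do it, and is not the method used by showuserlimits: the function names are invented for this example, and the sample text is a shortened copy of the two records from the output above. It relies only on two observations from that output: every field is a Key=Value token, and the record with an empty UserName is the account itself.

```python
import re
from typing import Dict, List


def parse_assoc_records(text: str) -> List[Dict[str, str]]:
    """Split `scontrol -o show assoc_mgr ... flags=Assoc` output into records.

    A new record starts at each ClusterName= token. Tokens without '=' (the
    two header lines) are ignored, so mail-wrapped output parses as well.
    """
    records: List[Dict[str, str]] = []
    for token in text.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue
        if key == "ClusterName":
            records.append({})
        if records:
            records[-1][key] = value
    return records


def account_parents(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Map each account to its parent, using only records with empty UserName."""
    parents = {}
    for rec in records:
        if rec.get("UserName", "") != "":
            continue  # a user association, not the account itself
        # ParentAccount looks like 'camd(16)': a name followed by a numeric ID.
        match = re.match(r"^(.+?)\(\d+\)$", rec.get("ParentAccount", ""))
        if match:
            parents[rec["Account"]] = match.group(1)
    return parents


# Shortened copy of the two records shown in the message above.
SAMPLE = """Current Association Manager state
Association Records
ClusterName=niflheim Account=camdvip UserName= Partition= Priority=0 ID=25
ParentAccount=camd(16) Lft=1385 DefAssoc=No
ClusterName=niflheim Account=camdvip UserName=XXX(261375) Partition=
Priority=0 ID=712 ParentAccount= Lft=1392 DefAssoc=Yes
"""

print(account_parents(parse_assoc_records(SAMPLE)))  # {'camdvip': 'camd'}
```

Repeating the query for each parent found this way would walk the hierarchy upwards one level at a time.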



[slurm-users] Parent accounts

2021-05-28 Thread Stefan Staeglich
Hi,

for our monitoring system I want to query the account hierarchy. Is there a 
better approach than to parse the output of

sacctmgr list account withasso -nP

?

Something like

sacctmgr list account parent=bla withasso -nP

doesn't work.

Best,
Stefan
-- 
Stefan Stäglich,  Universität Freiburg,  Institut für Informatik
Georges-Köhler-Allee,  Geb.52,   79110 Freiburg,Germany

E-Mail : staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Telefon: +49 761 203-8223
Fax: +49 761 203-8222
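
For the monitoring use case in the question, filtering by parent can be done on the client side once pipe-separated output is available. The Python sketch below is one hedged way to do that. It assumes the rows come from a command along the lines of `sacctmgr -nP list assoc format=account,parentname,user` (an assumption; check the field names against your sacctmgr version), and the sample rows, including all account and user names other than "bla", are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_children(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Build a parent -> child-accounts map from 'account|parent|user' rows."""
    children: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 3:
            continue  # blank or malformed line
        account, parent, user = fields[:3]
        if user or not parent:
            continue  # skip user associations and the root account
        children[parent].append(account)
    return children


def descendants(children: Dict[str, List[str]], parent: str) -> List[str]:
    """Return all accounts below `parent`, at any depth, sorted by name."""
    found = []
    stack = list(children.get(parent, []))
    while stack:
        account = stack.pop()
        found.append(account)
        stack.extend(children.get(account, []))
    return sorted(found)


# Hypothetical rows in account|parentname|user order.
ROWS = [
    "root||",
    "bla|root|",
    "grp1|bla|",
    "grp2|bla|",
    "sub|grp1|",
    "grp1||alice",
]

tree = build_children(ROWS)
print(descendants(tree, "bla"))  # ['grp1', 'grp2', 'sub']
```

This gives the effect the `parent=bla` filter was meant to have, at the cost of fetching the full association list once.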






Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Marcus Boden
I have the same in our config.log and the x11 forwarding works fine. No 
other lines around it (about some failing checks or something), just this:


[...]
configure:22134: WARNING: unable to locate rrdtool installation
configure:22176: support for ucx disabled
configure:22296: checking whether Slurm internal X11 support is enabled
configure:22311: result:
configure:22350: checking for check >= 0.9.8
[...]

Best,
Marcus


On 28.05.21 09:26, Bjørn-Helge Mevik wrote:
> Thekla Loizou writes:
>
>> Also, when compiling SLURM in the config.log I get:
>>
>> configure:22291: checking whether Slurm internal X11 support is enabled
>> configure:22306: result:
>>
>> The result is empty. I read that X11 is built by default so I don't
>> expect a special flag to be given during compilation time, right?
>
> My guess is that some X development library is missing.  Perhaps look in
> the configure script for how this test was done (typically it will try
> to compile something with those devel libraries, and fail).  Then see
> which package contains that library, install it and try again.



--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.:   +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de





[slurm-users] get job status of completed & cleared jobs from rest interface

2021-05-28 Thread Simone Riggi
Dear all,
I am writing to ask you a question.

Is it possible to retrieve the status of cleared jobs (e.g. after they have
completed, whether successfully or not) from the Slurm REST interface?

When a job (job id=131 in the example below) is cleared, the REST interface
returns this some time after completion:

{
  "meta": {
    "plugin": {"type": "openapi/v0.0.36", "name": "REST v0.0.36"},
    "Slurm": {"version": {"major": 20, "micro": 7, "minor": 11}, "release": "20.11.7"}
  },
  "errors": [{"error": "_handle_job_get: unknown job 131", "error_code": 0}],
  "jobs": []
}

I have enabled job status storage in MySQL:

sacct -j 131
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
131          testjob.sh     cirasa                     2  COMPLETED      0:0
131.batch         batch                                2  COMPLETED      0:0
131.0          hostname                                2  COMPLETED      0:0
131.1             sleep                                2  COMPLETED      0:0

but the REST service does not seem to pick up the status from it.
Do you have hints?

Just to understand more:

- For how many seconds does a completed job stay available to be queried from
squeue or the REST API methods? Can this "time-to-live-before-cleanup" be
configured, and possibly increased a bit? This would be useful to avoid
polling the status very frequently.

- Is there a push mechanism to send job status to external web services,
rather than polling it using the REST API methods?

Thanks very much for your help,

Cheers,

Simone

PS: Using Slurm v20.11.7 on CentOS 7







Simone Riggi, PhD
INAF, Osservatorio Astrofisico di Catania
Via S. Sofia 78
95123, Catania - Italy
phone:  +39 095 7332 extension 282 (or 310)
e-mail: simone.ri...@gmail.com,
sri...@inaf.it ,
sri...@pec.it 
skype: simone.riggi
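
A client polling the REST interface can at least tell the "job record already cleared" case apart from other replies. The Python sketch below shows one way to detect it from the response quoted in the message above. The function name is invented for this example, and matching on the error text is an assumption based solely on that one response, so it may not hold for other Slurm versions. On the retention question, slurm.conf has a MinJobAge parameter for how long finished job records stay in the controller's memory; whether raising it is appropriate for this site is not something the thread settles.

```python
import json

# The response from the message above, joined back onto as few lines as practical.
RESPONSE = (
    '{"meta":{"plugin":{"type":"openapi/v0.0.36","name":"REST v0.0.36"},'
    '"Slurm":{"version":{"major":20,"micro":7,"minor":11},"release":"20.11.7"}},'
    '"errors":[{"error":"_handle_job_get: unknown job 131","error_code":0}],'
    '"jobs":[]}'
)


def controller_forgot_job(response_text: str, job_id: int) -> bool:
    """Return True if the reply has no jobs and reports `job_id` as unknown."""
    reply = json.loads(response_text)
    if reply.get("jobs"):
        return False  # the controller still knows the job
    needle = "unknown job {}".format(job_id)
    return any(err.get("error", "").endswith(needle)
               for err in reply.get("errors", []))


print(controller_forgot_job(RESPONSE, 131))  # True
```

When this returns True, the caller could fall back to the accounting data, for example the `sacct -j 131` query shown in the message, instead of treating the reply as a failure.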



Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Bjørn-Helge Mevik
Thekla Loizou  writes:

> Also, when compiling SLURM in the config.log I get:
>
> configure:22291: checking whether Slurm internal X11 support is enabled
> configure:22306: result:
>
> The result is empty. I read that X11 is built by default so I don't
> expect a special flag to be given during compilation time, right?

My guess is that some X development library is missing.  Perhaps look in
the configure script for how this test was done (typically it will try
to compile something with those devel libraries, and fail).  Then see
which package contains that library, install it and try again.

-- 
B/H
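
As a first step before reading the configure script, the relevant check can be pulled out of config.log automatically. The Python sketch below is a small helper for that, with an invented function name; it relies only on the two log line shapes quoted in this thread ("checking whether Slurm internal X11 support is enabled" followed by a "result:" line). Note that another poster in this thread reports the same empty result on a build where X11 forwarding works, so an empty value by itself may not indicate a problem.

```python
from typing import Iterable, Optional

MARKER = "checking whether Slurm internal X11 support is enabled"


def x11_check_result(config_log_lines: Iterable[str]) -> Optional[str]:
    """Return the text after 'result:' for the X11 check.

    Returns None if the check, or a result line for it, is not found.
    An empty string means the result line was present but blank.
    """
    lines = list(config_log_lines)
    for index, line in enumerate(lines):
        if MARKER not in line:
            continue
        for follow in lines[index + 1:]:
            if ": result:" in follow:
                return follow.split(": result:", 1)[1].strip()
            if ": checking " in follow:
                break  # the next check started without a result line
        return None
    return None


# Lines copied from the config.log excerpt quoted above.
LOG = [
    "configure:22291: checking whether Slurm internal X11 support is enabled",
    "configure:22306: result:",
]

print(repr(x11_check_result(LOG)))  # ''
```

To run it against a real build, pass `open("config.log")` instead of the sample list.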




Re: [slurm-users] Building SLURM with X11 support

2021-05-28 Thread Thekla Loizou

Thank you both for your replies.

Our OS is CentOS 7.7. We have the dependencies installed and also have
PrologFlags=X11 set in slurm.conf.


Perhaps I am missing some X11 packages? But X11 is working outside SLURM.

When I request interactive access to a node, this is what I get:

salloc -N1 --x11
salloc: Granted job allocation 4694
salloc: Waiting for resource configuration
salloc: Job allocation 4694 has been revoked.

Also, when compiling SLURM in the config.log I get:

configure:22291: checking whether Slurm internal X11 support is enabled
configure:22306: result:

The result is empty. I read that X11 is built by default so I don't
expect a special flag to be given during compilation time, right?



Thanks,

Thekla

On 27/5/21 3:23 PM, Ole Holm Nielsen wrote:
> On 5/27/21 2:07 PM, Thekla Loizou wrote:
>> I am trying to use X11 forwarding in SLURM with no success.
>>
>> We are installing SLURM using RPMs that we generate with the command
>> "rpmbuild -ta slurm*.tar.bz2" as per the documentation.
>>
>> I am currently working with SLURM version 20.11.7-1.
>>
>> What am I missing when it comes to building SLURM with X11 enabled?
>> Which flags and packages are required?
>
> What is your OS?  Do you have X11 installed?
>
> Did you install all Slurm prerequisites?  For CentOS 7 they are:
>
> yum install rpm-build gcc openssl openssl-devel libssh2-devel
> pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel
> readline-devel rrdtool-devel ncurses-devel gtk2-devel
> libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
>
> see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites
>
> I hope this helps.
>
> /Ole