Re: [slurm-users] [EXT] rejecting jobs that exceed QOS limits
Hi Paul,

Try:

    sacctmgr modify qos gputest set flags=DenyOnLimit

Sean

From: slurm-users on behalf of Paul Raines
Sent: Saturday, 29 May 2021 12:48
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] rejecting jobs that exceed QOS limits

I want to dedicate one of our GPU servers for testing where users are only allowed to run 1 job at a time using 1 GPU and 8 cores of the server. So I put one server in a partition on its own and set up a QOS for it as follows:

    sacctmgr add qos gputest
    sacctmgr modify qos gputest set priority=20
    sacctmgr modify qos gputest set MaxJobsPerUser=1
    sacctmgr modify qos gputest set MaxTRESPerUser=cpu=8,gres/gpu=1
    sacctmgr show qos format=name,priority,MaxTRESPerUser,MaxJobsPerUser

In slurm.conf I have:

    AccountingStorageEnforce=safe,qos
    AccountingStorageTRES=Billing,CPU,Energy,Mem,Node,FS/Disk,Pages,VMem,gres/gpu
    EnforcePartLimits=ALL

This works, but when I submit a job asking for 2 or more GPUs, instead of being immediately rejected it queues but never runs. The same happens if I ask for more than 8 cores.

Is there a way to get it immediately rejected?
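To make the effect of DenyOnLimit concrete, here is a rough Python sketch of the comparison that matters at submit time: a single job's requested TRES against the MaxTRESPerUser string of the gputest QOS. The function names are hypothetical and this is only an illustration of the limit check, not Slurm's actual code.

```python
def parse_tres(spec):
    """Parse a TRES string such as 'cpu=8,gres/gpu=1' into a dict of ints."""
    limits = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        limits[name] = int(value)
    return limits


def exceeded_limits(requested, max_tres_per_user):
    """Return the sorted TRES names whose request exceeds the per-user limit."""
    limits = parse_tres(max_tres_per_user)
    return sorted(
        name
        for name, amount in requested.items()
        if name in limits and amount > limits[name]
    )


# MaxTRESPerUser as set on the gputest QOS in the message above.
QOS_LIMIT = "cpu=8,gres/gpu=1"

# A job asking for 2 GPUs can never fit, so rejecting it at submission is safe.
print(exceeded_limits({"cpu": 8, "gres/gpu": 2}, QOS_LIMIT))
# A job within both limits produces an empty list.
print(exceeded_limits({"cpu": 4, "gres/gpu": 1}, QOS_LIMIT))
```

Without the flag, a job that fails this check simply stays pending; with it, the request should be refused as soon as it is submitted.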
[slurm-users] rejecting jobs that exceed QOS limits
I want to dedicate one of our GPU servers for testing where users are only allowed to run 1 job at a time using 1 GPU and 8 cores of the server. So I put one server in a partition on its own and set up a QOS for it as follows:

    sacctmgr add qos gputest
    sacctmgr modify qos gputest set priority=20
    sacctmgr modify qos gputest set MaxJobsPerUser=1
    sacctmgr modify qos gputest set MaxTRESPerUser=cpu=8,gres/gpu=1
    sacctmgr show qos format=name,priority,MaxTRESPerUser,MaxJobsPerUser

In slurm.conf I have:

    AccountingStorageEnforce=safe,qos
    AccountingStorageTRES=Billing,CPU,Energy,Mem,Node,FS/Disk,Pages,VMem,gres/gpu
    EnforcePartLimits=ALL

This works, but when I submit a job asking for 2 or more GPUs, instead of being immediately rejected it queues but never runs. The same happens if I ask for more than 8 cores.

Is there a way to get it immediately rejected?
Re: [slurm-users] DMTCP or MANA with Slurm?
On 5/27/21 12:26 pm, Prentice Bisbal wrote:

> Given the lack of traffic on the mailing list and lack of releases, I'm
> beginning to think that both of these projects are all but abandoned.

They're definitely actively working on it - I've given them a heads up on this to let them know how it's being perceived. Thanks for mentioning it!

All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
[slurm-users] REST API
I have the REST API basically working, but I am having a problem with job submission syntax. The error I receive is "Unable to parse query". I have followed the guides found online to no avail. Is there somewhere to look for what the issue may be?
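One thing worth ruling out is the shape of the request itself. The sketch below builds (but does not send) a submission request as a JSON body with an explicit Content-Type header. Everything specific here is an assumption to check against your own slurmrestd: the v0.0.36 path, the port, the user and token placeholders, and the job fields (name, current_working_directory, environment) are illustrative, and nothing here is a confirmed cause of the error.

```python
import json
import urllib.request


def build_submit_request(base_url, user, token, script, cwd):
    """Build a POST request for job submission without sending it."""
    payload = {
        # Field names below are assumptions based on the v0.0.36 schema.
        "job": {
            "name": "rest-test",
            "current_working_directory": cwd,
            "environment": {"PATH": "/bin:/usr/bin"},
        },
        "script": script,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/slurm/v0.0.36/job/submit",  # assumed plugin version
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-SLURM-USER-NAME": user,
            "X-SLURM-USER-TOKEN": token,
        },
    )


# Hypothetical host, user and token, for illustration only.
req = build_submit_request(
    "http://localhost:6820", "alice", "TOKEN", "#!/bin/bash\nhostname\n", "/tmp"
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Printing the body this way makes it easy to confirm that what goes over the wire is valid JSON rather than form-encoded data or a shell-mangled string.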
[slurm-users] %x in job names
We noticed today that a %x anywhere in a job name, like

    #SBATCH -J abcdefghijklmnopqrstuvw%xyz

etc., will send scontrol (and maybe other %x-respecting programs) into an infinite loop. We had a user cron launching 'scontrol show job ##' regularly on a system and it was just going off the rails and eating resources until we killed it.

The Slurm version 18.08.4 release email says:

    -- Expand %x in job name in 'scontrol show job'.

...so I wonder if that is armored to look for self-referential calls. I haven't looked at the code myself. I thought I'd give a heads up. I don't think our user was being malicious, and their actual -J was

    #SBATCH -J sd-PBEpvw9040%x

Probably a hash and probably machine-generated/unlucky.

I hope this helps and is actually a problem report. We're on 18.08.5, so I hope we don't have to go backwards to stop this error.

Best regards,
Bill.
--
Bill Barth, Ph.D., Director, Future Technologies
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445
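Without having read the Slurm source, one plausible way for this to hang is an expansion that re-scans its own output: if the job name itself contains %x, every substitution reintroduces another %x. The sketch below is a hypothetical single-pass expander (not Slurm's implementation) that inserts the job name verbatim and never looks at it again, so it always terminates.

```python
def expand_job_name(pattern, job_name):
    """Expand %x (job name) and %% (literal percent) in one left-to-right pass."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "x":
                # The job name is appended as-is and is never re-scanned,
                # so a %x inside the name cannot trigger another expansion.
                out.append(job_name)
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


# The job name reported in the message above, used as both pattern and name.
name = "sd-PBEpvw9040%x"
print(expand_job_name(name, name))
```

The loop index only moves forward over the original pattern, which is what guarantees termination regardless of what the job name contains.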
Re: [slurm-users] Parent accounts
Hi Stefan,

On 5/28/21 3:31 PM, Stefan Staeglich wrote:

> for our monitoring system I want to query the account hierarchy. Is there
> a better approach than to parse the output of
> sacctmgr list account withasso -nP

One approach is to use the Slurm sreport tool, which displays the account hierarchy tree:

    $ sreport -t hourper --tres=cpu,gpu cluster AccountUtilizationByUser Start=0501 End=0528 format=Accounts,Login,Proper%30,TresName%9,Used tree

I think you could perhaps also be inspired by this example:

    $ scontrol -o show assoc_mgr users=XXX account=camdvip flags=Assoc
    Current Association Manager state

    Association Records

    ClusterName=niflheim Account=camdvip UserName= Partition= Priority=0 ID=25 SharesRaw/Norm/Level/Factor=2147483647/0.00/549/0.00 UsageRaw/Norm/Efctv=8677881859.88/0.18/0.18 ParentAccount=camd(16) Lft=1385 DefAssoc=No GrpJobs=N(109) GrpJobsAccrue=N(86) GrpSubmitJobs=N(678) GrpWall=N(2357791.66) GrpTRES=cpu=N(5176),mem=N(56741000),energy=N(0),node=N(179),billing=N(6788),fs/disk=N(0),vmem=N(0),pages=N(0) GrpTRESMins=cpu=N(110849595),mem=N(1098107088383),energy=N(0),node=N(4074184),billing=N(143938180),fs/disk=N(0),vmem=N(0),pages=N(0) GrpTRESRunMins=cpu=N(7715490),mem=N(85269922566),energy=N(0),node=N(272766),billing=N(9733605),fs/disk=N(0),vmem=N(0),pages=N(0) MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=

    ClusterName=niflheim Account=camdvip UserName=XXX(261375) Partition= Priority=0 ID=712 SharesRaw/Norm/Level/Factor=3/0.01/549/0.21 UsageRaw/Norm/Efctv=471217489.26/0.01/0.01 ParentAccount= Lft=1392 DefAssoc=Yes GrpJobs=N(6) GrpJobsAccrue=30(0) GrpSubmitJobs=N(12) GrpWall=N(116502.58) GrpTRES=cpu=1500(240),mem=N(225),energy=N(0),node=N(6),billing=N(396),fs/disk=N(0),vmem=N(0),pages=N(0) GrpTRESMins=cpu=N(4727973),mem=N(44203335956),energy=N(0),node=N(119926),billing=N(7763388),fs/disk=N(0),vmem=N(0),pages=N(0) GrpTRESRunMins=cpu=400(214620),mem=N(2012062500),energy=N(0),node=N(5365),billing=N(354123),fs/disk=N(0),vmem=N(0),pages=N(0) MaxJobs=500(6) MaxJobsAccrue=30(0) MaxSubmitJobs=1000(12) MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=

The record with UserName= (empty string) is the account itself, i.e. the parent of the user association, and its ParentAccount field names the account above it.

I'm using this approach to print user limits in my showuserlimits tool:
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits

I hope this helps.

/Ole
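For a monitoring script, the one-record-per-line output of scontrol -o can be split into key=value tokens fairly mechanically. The following Python sketch does that and derives an account-to-parent mapping from the records with an empty UserName. The helper names are hypothetical, the sample is an abbreviated copy of the output above, and it assumes field values never contain whitespace.

```python
import re


def parse_assoc_records(text):
    """Turn each 'ClusterName=...' line into a dict of key -> value strings."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("ClusterName="):
            continue  # skip headings such as 'Association Records'
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key] = value
        records.append(fields)
    return records


def strip_id(value):
    """Drop a trailing numeric id, e.g. 'camd(16)' -> 'camd'."""
    return re.sub(r"\(\d+\)$", "", value)


def parent_of(records):
    """Map each account to its parent, using records with an empty UserName."""
    parents = {}
    for rec in records:
        if rec.get("UserName", "") == "" and rec.get("ParentAccount"):
            parents[rec["Account"]] = strip_id(rec["ParentAccount"])
    return parents


# Abbreviated from the scontrol output shown above.
SAMPLE = """\
Current Association Manager state
Association Records
ClusterName=niflheim Account=camdvip UserName= Partition= Priority=0 ID=25 ParentAccount=camd(16) Lft=1385
ClusterName=niflheim Account=camdvip UserName=XXX(261375) Partition= Priority=0 ID=712 ParentAccount= Lft=1392
"""

print(parent_of(parse_assoc_records(SAMPLE)))
```

Splitting on the first "=" only is deliberate, because values such as GrpTRES=cpu=N(5176),... contain further equals signs.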
[slurm-users] Parent accounts
Hi,

for our monitoring system I want to query the account hierarchy. Is there a better approach than parsing the output of the following command?

    sacctmgr list account withasso -nP

Something like this doesn't work:

    sacctmgr list account parent=bla withasso -nP

Best,
Stefan
--
Stefan Stäglich, Universität Freiburg, Institut für Informatik
Georges-Köhler-Allee, Geb.52, 79110 Freiburg, Germany
E-Mail: staeg...@informatik.uni-freiburg.de
WWW: gki.informatik.uni-freiburg.de
Phone: +49 761 203-8223
Fax: +49 761 203-8222
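If parsing turns out to be unavoidable, the parent filter can be applied on the client side. The sketch below assumes a three-column, pipe-separated Account|ParentName|User listing (for example from something like sacctmgr show assoc format=Account,ParentName,User -nP; treat those field names as an assumption and check them against the sacctmgr man page for your version). Apart from "bla", the account and user names in the sample are invented.

```python
from collections import defaultdict


def build_children(parsable_text):
    """Map each parent account to the set of its direct child accounts."""
    children = defaultdict(set)
    for line in parsable_text.splitlines():
        if not line.strip():
            continue
        account, parent, user = line.split("|")[:3]
        # Account-level rows have an empty user and a non-empty parent.
        if user == "" and parent:
            children[parent].add(account)
    return children


def descendants(children, root):
    """List all accounts below root, depth-first, siblings in sorted order."""
    found = []
    stack = sorted(children.get(root, ()), reverse=True)
    while stack:
        acct = stack.pop()
        found.append(acct)
        stack.extend(sorted(children.get(acct, ()), reverse=True))
    return found


# Invented sample in the assumed Account|ParentName|User layout.
SAMPLE = """\
root||
bla|root|
proj1|bla|
proj1||alice
proj2|bla|
sub|proj1|
"""

tree = build_children(SAMPLE)
print(descendants(tree, "bla"))
```

This gives the effect of the parent=bla filter from the question, including accounts nested more than one level down.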
Re: [slurm-users] Building SLURM with X11 support
I have the same in our config.log and the X11 forwarding works fine. No other lines around it (about some failing checks or something), just this:

    [...]
    configure:22134: WARNING: unable to locate rrdtool installation
    configure:22176: support for ucx disabled
    configure:22296: checking whether Slurm internal X11 support is enabled
    configure:22311: result:
    configure:22350: checking for check >= 0.9.8
    [...]

Best,
Marcus

On 28.05.21 09:26, Bjørn-Helge Mevik wrote:

> Thekla Loizou writes:
>
>> Also, when compiling SLURM in the config.log I get:
>>
>> configure:22291: checking whether Slurm internal X11 support is enabled
>> configure:22306: result:
>>
>> The result is empty. I read that X11 is built by default so I don't
>> expect a special flag to be given during compilation time, right?
>
> My guess is that some X development library is missing. Perhaps look in
> the configure script for how this test was done (typically it will try
> to compile something with those devel libraries, and fail). Then see
> which package contains that library, install it and try again.

--
Marcus Vincent Boden, M.Sc.
eScience Working Group, HPC Team
Tel.: +49 (0)551 201-2191, E-Mail: mbo...@gwdg.de
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de
Support: Tel.: +49 551 201-1523, URL: https://www.gwdg.de/support
Secretariat: Tel.: +49 551 201-1510, Fax: -2150, E-Mail: g...@gwdg.de
Managing Director: Prof. Dr. Ramin Yahyapour
Chair of the Supervisory Board: Prof. Dr. Norbert Lossau
Registered office: Göttingen
Court of registration: Göttingen, commercial register no. B 598
Certified to ISO 9001
[slurm-users] get job status of completed & cleared jobs from rest interface
Dear all,

Is it possible to retrieve the status of cleared jobs (e.g. after they have completed, whether successfully or with a failure) from the Slurm REST interface?

When a job (job id=131 in the example below) is cleared, some time after completion the REST interface returns this:

    {"meta":{"plugin":{"type":"openapi/v0.0.36","name":"REST v0.0.36"},"Slurm":{"version":{"major":20,"micro":7,"minor":11},"release":"20.11.7"}},"errors":[{"error":"_handle_job_get: unknown job 131","error_code":0}],"jobs":[]}

I activated the job status storage in MySQL:

    sacct -j 131
    JobID JobName Partition Account AllocCPUS State ExitCode
    131 testjob.sh cirasa 2 COMPLETED 0:0
    131.batch batch 2 COMPLETED 0:0
    131.0 hostname 2 COMPLETED 0:0
    131.1 sleep 2 COMPLETED 0:0

but the REST service does not seem to pick up the status from it. Do you have any hints?

Just to understand more:

- For how many seconds does a completed job stay available to be queried from squeue or the REST API methods? Can this "time-to-live-before-cleanup" be configured, and possibly increased a bit? This would be useful to avoid polling the status very frequently.
- Is there a push mechanism to send job status to external web services, rather than polling it using the REST API methods?

Thanks very much for your help,

Cheers,
Simone

PS: Using Slurm v20.11.7 on CentOS 7

Simone Riggi, PhD
INAF, Osservatorio Astrofisico di Catania
Via S. Sofia 78
95123, Catania - Italy
phone: +39 095 7332 extension 282 (or 310)
e-mail: simone.ri...@gmail.com, sri...@inaf.it, sri...@pec.it
skype: simone.riggi
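As far as I know, how long a finished job stays visible to slurmctld is governed by the MinJobAge parameter in slurm.conf (300 seconds by default); after that, the accounting database is the place to look. One possible client-side workaround, sketched below in Python, is to detect the "unknown job" reply and fall back to sacct. The helper names are hypothetical, the sacct sample is the table above rewritten in -P form, and fetch_sacct is defined but not called here because it needs a live cluster.

```python
import json
import subprocess


def job_is_unknown(rest_response_text, job_id):
    """True if the REST reply has no jobs and reports 'unknown job <id>'."""
    doc = json.loads(rest_response_text)
    if doc.get("jobs"):
        return False
    needle = f"unknown job {job_id}"
    return any(
        err.get("error", "").rstrip().endswith(needle)
        for err in doc.get("errors", [])
    )


def parse_sacct_states(parsable_text):
    """Parse 'JobID|State|ExitCode' lines into {JobID: (State, ExitCode)}."""
    states = {}
    for line in parsable_text.splitlines():
        if not line.strip():
            continue
        job_id, state, exit_code = line.split("|")[:3]
        states[job_id] = (state, exit_code)
    return states


def fetch_sacct(job_id):
    """Query the accounting database; requires sacct on a real cluster."""
    cmd = ["sacct", "-j", str(job_id), "-n", "-P", "--format=JobID,State,ExitCode"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# REST reply copied from the message above.
REST_REPLY = (
    '{"meta":{"plugin":{"type":"openapi/v0.0.36","name":"REST v0.0.36"},'
    '"Slurm":{"version":{"major":20,"micro":7,"minor":11},"release":"20.11.7"}},'
    '"errors":[{"error":"_handle_job_get: unknown job 131","error_code":0}],"jobs":[]}'
)
# The sacct rows from the message above, rewritten in pipe-separated form.
SACCT_SAMPLE = "131|COMPLETED|0:0\n131.batch|COMPLETED|0:0\n131.0|COMPLETED|0:0\n131.1|COMPLETED|0:0\n"

if job_is_unknown(REST_REPLY, 131):
    print(parse_sacct_states(SACCT_SAMPLE)["131"])
```

In a real poller, SACCT_SAMPLE would be replaced by the return value of fetch_sacct(131).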
Re: [slurm-users] Building SLURM with X11 support
Thekla Loizou writes:

> Also, when compiling SLURM in the config.log I get:
>
> configure:22291: checking whether Slurm internal X11 support is enabled
> configure:22306: result:
>
> The result is empty. I read that X11 is built by default so I don't
> expect a special flag to be given during compilation time, right?

My guess is that some X development library is missing. Perhaps look in the configure script for how this test was done (typically it will try to compile something with those devel libraries, and fail). Then see which package contains that library, install it and try again.

--
B/H
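When comparing config.log files between builds, it can help to pull out each configure check together with its result line instead of scanning the whole file by eye. The following is a small Python sketch with a hypothetical helper name; it only assumes the "configure:NNN: checking ..." and "configure:NNN: result: ..." line shapes quoted above.

```python
def find_check_results(config_log_text, needle):
    """Return (check line, result text) pairs for checks mentioning needle."""
    lines = config_log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "checking" in line and needle in line:
            result = None
            for follow in lines[i + 1:]:
                if ": result:" in follow:
                    result = follow.split(": result:", 1)[1].strip()
                    break
                if ": checking" in follow:
                    break  # next check started before any result was logged
            hits.append((line.strip(), result))
    return hits


# Lines copied from the config.log excerpts in this thread.
SAMPLE = """\
configure:22291: checking whether Slurm internal X11 support is enabled
configure:22306: result:
configure:22350: checking for check >= 0.9.8
"""

for check, result in find_check_results(SAMPLE, "X11"):
    print(check)
    print("result:", repr(result))
```

An empty string means configure logged a result line with nothing after it, while None would mean no result line was found at all before the next check.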
Re: [slurm-users] Building SLURM with X11 support
Thank you both for your replies.

Our OS is CentOS 7.7. We have the dependencies installed and also PrologFlags=X11 in slurm.conf. Perhaps I am missing some X11 packages? But X11 is working outside SLURM.

When getting interactive access on a node, basically I get:

    salloc -N1 --x11
    salloc: Granted job allocation 4694
    salloc: Waiting for resource configuration
    salloc: Job allocation 4694 has been revoked.

Also, when compiling SLURM in the config.log I get:

    configure:22291: checking whether Slurm internal X11 support is enabled
    configure:22306: result:

The result is empty. I read that X11 is built by default so I don't expect a special flag to be given during compilation time, right?

Thanks,
Thekla

On 27/5/21 3:23 PM, Ole Holm Nielsen wrote:

> On 5/27/21 2:07 PM, Thekla Loizou wrote:
>
>> I am trying to use X11 forwarding in SLURM with no success.
>>
>> We are installing SLURM using RPMs that we generate with the command
>> "rpmbuild -ta slurm*.tar.bz2" as per the documentation.
>>
>> I am currently working with SLURM version 20.11.7-1.
>>
>> What am I missing when it comes to building SLURM with X11 enabled?
>> Which flags and packages are required?
>
> What is your OS? Do you have X11 installed? Did you install all Slurm
> prerequisites? For CentOS 7 it is:
>
> yum install rpm-build gcc openssl openssl-devel libssh2-devel pam-devel numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel rrdtool-devel ncurses-devel gtk2-devel libssh2-devel libibmad libibumad perl-Switch perl-ExtUtils-MakeMaker
>
> see https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites
>
> I hope this helps.
>
> /Ole