Re: [slurm-users] Limiting the number of CPU

2019-11-15 Thread Sukman
Hi all,

thank you for the comment and input.

Yes, it is true: the letter case was one of the main problems.

After correcting the case of the #SBATCH directives, the job no longer gets stuck.
However, as Daniel noticed, there is a memory problem.

Running the same script, the job now passes the QOS limit.
However, it cannot run to completion because it exceeds its memory limit.

Below is the job running output:

slurmstepd: error: Job 90 exceeded memory limit (1188 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 90 ON cn110 CANCELLED AT 2019-11-15T18:45:23 ***
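
For reference, the corrected directive syntax looks like this (uppercase #SBATCH so the options are actually honoured); the --mem value below is just a placeholder and is clearly part of what I am asking about, since the job above was killed at roughly 1188 MB:

#!/bin/bash
#SBATCH --job-name=hostname
#SBATCH --time=00:50
#SBATCH --mem=1G                # placeholder; needs to cover ~1188 MB yet stay within the QOS
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname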


Attached is my slurm.conf.
As far as I can see, there is no memory-related configuration in it, yet I still run into this problem.

Would anyone mind giving a comment or suggestion?
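
For context, a minimal slurm.conf fragment that treats memory as a consumable resource and gives a default to jobs that omit --mem would look roughly like the lines below; these values are only an illustration and are not taken from my attached file:

  # illustrative sketch, not the attached slurm.conf
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory
  DefMemPerCPU=1024        # MB granted per allocated CPU when no --mem/--mem-per-cpu is given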


Additionally, the following are the association limits for user sukman:

# sacctmgr show association where user=sukman format=user,grpTRES,grpwall,grptresmins,maxjobs,maxtres,maxtrespernode,maxwall,qos,defaultqos
      User    GrpTRES    GrpWall   GrpTRESMins MaxJobs   MaxTRES MaxTRESPerNode     MaxWall            QOS   Def QOS
---------- ---------- ---------- ------------- ------- --------- -------------- ----------- -------------- ---------
    sukman                                                                                  normal_compute
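
For completeness, limits like these are normally attached either to the QOS or to the association with commands along the following lines; the values here are purely illustrative and are not commands I have run:

  # illustrative only
  sacctmgr modify qos normal_compute set MaxTRESPerUser=cpu=2,mem=1G
  sacctmgr modify user sukman set GrpTRES=cpu=4,mem=4G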



Thanks.


--

Suksmandhira H
ITB Indonesia




- Original Message -
From: "Daniel Letai" 
To: slurm-users@lists.schedmd.com
Sent: Thursday, November 14, 2019 10:51:10 PM
Subject: Re: [slurm-users] Limiting the number of CPU



Three possible issues, inline below.



On 14/11/2019 14:58:29, Sukman wrote: 


Hi Brian,

thank you for the suggestion.

It appears that my node was in a drain state.
I rebooted the node and everything became fine.

However, the QOS still does not seem to be applied properly.
Do you have any opinion on this issue?


$ sacctmgr show qos where Name=normal_compute format=Name,Priority,MaxWal,MaxTRESPU
      Name   Priority     MaxWall     MaxTRESPU
---------- ---------- ----------- -------------
normal_co+         10    00:01:00  cpu=2,mem=1G


when I run the following script:

#!/bin/bash
#SBATCH --job-name=hostname
#sbatch --time=00:50
#sbatch --mem=1M

I believe those two should be uppercase #SBATCH.


#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodelist=cn110

srun hostname


It turns out that the QOSMaxMemoryPerUser limit has been hit:

$ squeue
 JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    88      defq hostname   sukman PD   0:00      1 (QOSMaxMemoryPerUser)


$ scontrol show job 88
JobId=88 JobName=hostname
   UserId=sukman(1000) GroupId=nobody(1000) MCS_label=N/A
   Priority=4294901753 Nice=0 Account=user QOS=normal_compute
   JobState=PENDING Reason=QOSMaxMemoryPerUser Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2019-11-14T19:49:37 EligibleTime=2019-11-14T19:49:37
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-11-14T19:55:50
   Partition=defq AllocNode:Sid=itbhn02:51072
   ReqNodeList=cn110 ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,node=1
   Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=257758M MinTmpDiskNode=0

MinMemoryNode seems to require more than the FreeMem in the node below.


Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/sukman/script/test_hostname.sh
   WorkDir=/home/sukman/script
   StdErr=/home/sukman/script/slurm-88.out
   StdIn=/dev/null
   StdOut=/home/sukman/script/slurm-88.out
   Power=


$ scontrol show node cn110
NodeName=cn110 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=56 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cn110 NodeHostName=cn110 Version=17.11
   OS=Linux 3.10.0-693.2.2.el7.x86_64 #1 SMP Tue Sep 12 22:26:13 UTC 2017
   RealMemory=257758 AllocMem=0 FreeMem=255742 Sockets=56 Boards=1 

This would appear to be wrong - 56 sockets? 

How did you configure the node in slurm.conf? 

FreeMem is lower than MinMemoryNode - not sure whether that is relevant.
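
For comparison, the topology line Slurm expects usually comes straight from running "slurmd -C" on the node itself; a plausible shape would be something like the following, where the 2 x 28 socket/core split is only a guess and not necessarily cn110's real layout:

  # run on cn110; slurmd -C prints the NodeName line it detects for the hardware
  NodeName=cn110 CPUs=56 Sockets=2 CoresPerSocket=28 ThreadsPerCore=1 RealMemory=257758 State=UNKNOWN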


State=IDLE ThreadsPerCore=1 TmpDisk=268629 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-11-14T18:50:56 SlurmdStartTime=2019-11-14T18:53:23
   CfgTRES=cpu=56,mem=257758M,billing=56
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


---

Sukman
ITB Indonesia




- Original Message -
From: "Brian Andrus"  To: slurm-users@lists.schedmd.com 
Sent: Tuesday, 

Re: [slurm-users] sacct: job state code CANCELLED+

2019-11-15 Thread Chris Samuel
On Friday, 15 November 2019 2:13:15 AM PST Loris Bennett wrote:

>   If the contents of the column are wider than the column, they
>   will be truncated - this is indicated by the '+'.

You can also use the -p option to sacct to make it parseable (which outputs 
the full width of fields too).
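
For example, an invocation along these lines (the field list is arbitrary) prints pipe-delimited values with no '+' truncation:

  sacct -p -j 277 -o JobID,JobName,State,Elapsed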

-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] [External] Re: Get GPU usage from sacct?

2019-11-15 Thread Prentice Bisbal

Thanks!

Prentice

On 11/15/19 6:58 AM, Janne Blomqvist wrote:

On 14/11/2019 20.41, Prentice Bisbal wrote:

Is there any way to see how much a job used the GPU(s) on a cluster
using sacct or any other slurm command?


We have created
https://github.com/AaltoScienceIT/ansible-role-sacct_gpu/ as a quick
hack to put GPU utilization stats into the comment field at the end of
the job.

The above is an ansible role, but if you're not using ansible you can
just pull the scripts from the "files" subdirectory.





Re: [slurm-users] Get GPU usage from sacct?

2019-11-15 Thread Miguel Oliveira
Thanks! Nice code and just what I was needing. A few wrinkles:

a) when reading the Gres from scontrol for each job, on my version it appears in a
TRES record rather than as an individual Gres field - possibly a version/configuration issue.
b) the pid2id conversion from /proc/<pid>/cgroup is problematic for array jobs.
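
For what it's worth, a rough sketch of that kind of lookup is below; it assumes cgroup v1 paths of the form .../slurm/uid_N/job_N/step_N, and for array tasks it returns the raw JobId rather than ArrayJobID_TaskId:

  # sketch only: map a PID to its Slurm job id via the cgroup path
  pid=12345                                  # hypothetical PID
  grep -o 'job_[0-9]*' "/proc/$pid/cgroup" | head -n 1 | cut -d_ -f2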

Again many thanks for sharing,

MAO

--
Miguel Afonso Oliveira
Senior HPC Systems Engineer
The Francis Crick Institute
1, Midland Road
London NW1 1AT
United Kingdom

T: 020 3796 3712
E: miguel.olive...@crick.ac.uk
W: www.crick.ac.uk

>
> We have created
> https://github.com/AaltoScienceIT/ansible-role-sacct_gpu/ as a quick
> hack to put GPU utilization stats into the comment field at the end of
> the job.
>
> The above is an ansible role, but if you're not using ansible you can
> just pull the scripts from the "files" subdirectory.
>
> --
> Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
> Aalto University School of Science, PHYS & NBE
> +358503841576 || janne.blomqv...@aalto.fi
>




Re: [slurm-users] Get GPU usage from sacct?

2019-11-15 Thread Janne Blomqvist
On 14/11/2019 20.41, Prentice Bisbal wrote:
> Is there any way to see how much a job used the GPU(s) on a cluster
> using sacct or any other slurm command?
> 

We have created
https://github.com/AaltoScienceIT/ansible-role-sacct_gpu/ as a quick
hack to put GPU utilization stats into the comment field at the end of
the job.

The above is an ansible role, but if you're not using ansible you can
just pull the scripts from the "files" subdirectory.
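
For instance, pulling them by hand could look like this; the install destination is only an example:

  # fetch the role and copy its helper scripts somewhere on the PATH
  git clone https://github.com/AaltoScienceIT/ansible-role-sacct_gpu.git
  cp ansible-role-sacct_gpu/files/* /usr/local/bin/   # destination is an assumption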

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



Re: [slurm-users] sacct: job state code CANCELLED+

2019-11-15 Thread Loris Bennett
Loris Bennett  writes:

> Hi Uwe,
>
> Uwe Seher  writes:
>
>> Hello!
>> What's the meaning of the plus sign? I cannot find anything in the
>> documentation. This is the full output when a job is cancelled:
>>
>> 277       1808_Modell_107vh1 CANCELLED+  UNLIMITED 2019-11-14T11:28:39 2019-11-14T13:12:06   01:43:27 115
>> 277.ba+   batch              CANCELLED             2019-11-14T11:28:39 2019-11-14T13:12:07   01:43:28 115
>> 277.0     orted              FAILED                2019-11-14T11:28:58 2019-11-14T13:12:06   01:43:08 1 1
>>
>
> If the contents of the column wider the column, they
> will truncated - this is indicated by the '+'.

That should be:

  If the contents of the column are wider than the column, they
  will be truncated - this is indicated by the '+'.

(I must have switched on broken-english-mode by mistake.)

> You can increase the width via the '-o' option, e.g.
>
>   sacct -o jobid%20,jobname%50,state%30
>
> Cheers,
>
> Loris
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



Re: [slurm-users] sacct: job state code CANCELLED+

2019-11-15 Thread Loris Bennett
Hi Uwe,

Uwe Seher  writes:

> Hello!
> What's the meaning of the plus sign? I cannot find anything in the
> documentation. This is the full output when a job is cancelled:
>
> 277       1808_Modell_107vh1 CANCELLED+  UNLIMITED 2019-11-14T11:28:39 2019-11-14T13:12:06   01:43:27 115
> 277.ba+   batch              CANCELLED             2019-11-14T11:28:39 2019-11-14T13:12:07   01:43:28 115
> 277.0     orted              FAILED                2019-11-14T11:28:58 2019-11-14T13:12:06   01:43:08 1 1
>

If the contents of the column wider the column, they
will truncated - this is indicated by the '+'.

You can increase the width via the '-o' option, e.g.

  sacct -o jobid%20,jobname%50,state%30

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de



[slurm-users] sacct: job state code CANCELLED+

2019-11-15 Thread Uwe Seher
Hello!
What's the meaning of the plus sign? I cannot find anything in the
documentation. This is the full output when a job is cancelled:

277       1808_Modell_107vh1 CANCELLED+  UNLIMITED 2019-11-14T11:28:39 2019-11-14T13:12:06   01:43:27 115
277.ba+   batch              CANCELLED             2019-11-14T11:28:39 2019-11-14T13:12:07   01:43:28 115
277.0     orted              FAILED                2019-11-14T11:28:58 2019-11-14T13:12:06   01:43:08 1 1


Best regards
Uwe Seher