Hello all,

We were recently alerted by a user whose long-running jobs (>= 6 days) were
being killed by the OOM killer.

A closer look revealed that jag_common_poll_data isn't consistent in logging
the total memory usage; what gets reported depends on the application in
question.  For example, consider the following job submission:

#!/bin/bash
#
#SBATCH --job-name=sif14_1co2
#SBATCH --output=runlog.log
#SBATCH --time=48:00:00
#SBATCH --nodes=8 --ntasks-per-node=6
#SBATCH --partition=p2016 --qos=p16
#SBATCH --mail-type=END
#SBATCH --mail-user=user

module purge
module add apps/vasp/5.4.1

# For standard vasp
mpirun vasp_std


Since no memory request was specified, the DefMemPerCPU value takes over,
which is 512 MB here.
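
For reference, the configured default can be confirmed with scontrol:

scontrol show config | grep -i DefMemPerCPU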

Looking at the process stats on the batch host, here is what the job is
actually using:

  PPID    LWP PSR     TIME S   RSS     ELAPSED COMMAND            VSZ
 28388  28394   1 00:00:00 S  1332    17:47:24 slurm_script    106116
 28394  28403   1 00:00:00 S  1404    17:47:24 mpirun          106248
 28403  28416  10 00:00:00 S  1664    17:47:24 mpiexec.hydra    17180
 28416  28417   5 00:00:00 S  4996    17:47:24 srun            708424
 28417  28421   3 00:00:00 S   668    17:47:24 srun             37712
 28453  28458   0 00:00:00 S  1928    17:47:24 pmi_proxy        16384
 28458  28471   4 15:55:13 R 1059004  17:47:24 vasp_std        2953636
 28458  28472  10 17:01:43 R 1057672  17:47:24 vasp_std        2950168
 28458  28473  20 17:35:54 R 1055412  17:47:24 vasp_std        2947896
 28458  28474   3 17:53:39 R 1060060  17:47:24 vasp_std        2952712
 28458  28475  11 17:51:52 R 1055168  17:47:24 vasp_std        2947772
 28458  28476  23 17:53:42 R 1063888  17:47:24 vasp_std        2956572
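
Per-thread listings like the ones in this message can be gathered with
something along the lines of (reconstructed here, only the RSS column
really matters):

ps -Leo ppid,lwp,psr,time,s,rss,etime,comm,vsz | grep -E 'slurm|mpi|srun|vasp'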

[2017-09-05T13:11:09.124] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
[2017-09-05T13:11:09.280] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
[2017-09-05T13:11:31.729] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
[2017-09-05T13:11:31.757] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
[2017-09-05T13:11:39.126] [10212120] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
[2017-09-05T13:11:39.283] [10212120.0] debug:  jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
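
Note that the step poll only ever reports pid 28458 (pmi_proxy) at 1928 kB
RSS, even though its vasp_std children are each holding over 1 GB.  A quick
way to see what the plugin ought to be accounting is to sum the RSS of the
whole process tree by hand; a rough sketch (tree_rss.sh is just a name I'm
using here; it walks /proc via pgrep and reports kB):

#!/bin/bash
# tree_rss.sh - sum VmRSS (kB) of a PID and all of its descendants,
# for comparison against the "mem size" jag_common_poll_data logs.
root=${1:?usage: $0 <pid>}
total=0
queue=("$root")
while ((${#queue[@]})); do
    pid=${queue[0]}; queue=("${queue[@]:1}")
    # The VmRSS line of /proc/<pid>/status reads "VmRSS:  <n> kB"
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    total=$((total + ${rss:-0}))
    # enqueue the children of this pid
    queue+=($(pgrep -P "$pid"))
done
echo "total RSS: ${total} kB"

Run against 28458 above, that should come out somewhere near 6.3 GB rather
than the ~1.9 MB being logged.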

However, if a slightly older version of the same software is used, memory is
reported correctly: the step's polled "mem size" closely matches the combined
RSS of the vasp processes.  Consider the submission script below.

#!/bin/sh

#SBATCH --time=04:00:00
#SBATCH -N 2 --ntasks-per-node=2 -J "vasp_run"
#SBATCH --mem-per-cpu=1800
#SBATCH --reservation=blah
#SBATCH -p blah
#SBATCH -w nodes-[1-2]

module purge
module load apps/vasp/5.3.3

mpirun vasp

And the reported stats:

 PPID   LWP PSR     TIME S   RSS     ELAPSED COMMAND
22046 22050   0 00:00:00 S  1324       03:32 slurm_script
22050 22053   0 00:00:00 S  1476       03:32 mpirun
22053 22111   1 00:00:00 S  1428       03:32 mpiexec.hydra
22111 22112   2 00:00:00 S  4800       03:32 srun
22112 22113   3 00:00:00 S   668       03:32 srun
22121 22126   3 00:00:00 S  1352       03:32 pmi_proxy
22126 22127   5 00:03:30 R 1781544     03:32 vasp
22126 22128   0 00:03:30 R 1778052     03:32 vasp


[2017-09-05T13:17:58.097] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:17:58.265] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22126 mem size 3538576 7561020 time 59.630000(56+2)
[2017-09-05T13:18:28.101] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:18:28.270] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2505 pid 22126 mem size 3560928 7655316 time 119.590000(116+3)
[2017-09-05T13:18:58.105] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:18:58.275] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2498 pid 22126 mem size 3560932 7655316 time 179.570000(175+3)
[2017-09-05T13:19:28.109] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:19:28.281] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2495 pid 22126 mem size 3560932 7655316 time 239.550000(234+4)
[2017-09-05T13:19:58.113] [10215464] debug:  jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:19:58.286] [10215464.0] debug:  jag_common_poll_data: Task average frequency = 2493 pid 22126 mem size 3560936 7655316 time 299.510000(294+5)
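
For anyone wanting to check their own logs: these lines come from the
jobacct gather plugin at debug verbosity, so something along these lines in
slurm.conf should surface them (a sketch, assuming the jobacct_gather/linux
plugin; the 30 second frequency matches the poll interval visible above):

# slurm.conf - surface jag_common_poll_data debug output in the stepd logs
SlurmdDebug=debug
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30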



Has anyone else noticed anything similar on their cluster(s)?  I cannot
confirm whether this was happening before we upgraded from 15.08.4 to
16.05.10-2.

Thanks,
John DeSantis
