Hello all,

We were recently alerted by a user whose long-running jobs (>= 6 days) were being killed by the OOM killer.
A closer look revealed that jag_common_poll_data isn't logging total memory usage consistently; what it reports depends on the application in question. For example, consider the following job submission:

#!/bin/bash
#
#SBATCH --job-name=sif14_1co2
#SBATCH --output=runlog.log
#SBATCH --time=48:00:00
#SBATCH --nodes=8 --ntasks-per-node=6
#SBATCH --partition=p2016 --qos=p16
#SBATCH --mail-type=END
#SBATCH --mail-user=user

module purge
module add apps/vasp/5.4.1

# For standard vasp
mpirun vasp_std

Since no memory request was specified, the DefMemPerCPU value of 512 MB took effect. Looking at the batch host, here is what ps actually reports:

 PPID   LWP PSR     TIME S     RSS  ELAPSED COMMAND           VSZ
28388 28394   1 00:00:00 S    1332 17:47:24 slurm_script   106116
28394 28403   1 00:00:00 S    1404 17:47:24 mpirun         106248
28403 28416  10 00:00:00 S    1664 17:47:24 mpiexec.hydra   17180
28416 28417   5 00:00:00 S    4996 17:47:24 srun           708424
28417 28421   3 00:00:00 S     668 17:47:24 srun            37712
28453 28458   0 00:00:00 S    1928 17:47:24 pmi_proxy       16384
28458 28471   4 15:55:13 R 1059004 17:47:24 vasp_std      2953636
28458 28472  10 17:01:43 R 1057672 17:47:24 vasp_std      2950168
28458 28473  20 17:35:54 R 1055412 17:47:24 vasp_std      2947896
28458 28474   3 17:53:39 R 1060060 17:47:24 vasp_std      2952712
28458 28475  11 17:51:52 R 1055168 17:47:24 vasp_std      2947772
28458 28476  23 17:53:42 R 1063888 17:47:24 vasp_std      2956572

[2017-09-05T13:11:09.124] [10212120] debug: jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
[2017-09-05T13:11:09.280] [10212120.0] debug: jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
[2017-09-05T13:11:31.729] [10212120] debug: jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
[2017-09-05T13:11:31.757] [10212120.0] debug: jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)
[2017-09-05T13:11:39.126] [10212120] debug: jag_common_poll_data: Task average frequency = 2197 pid 28394 mem size 4400 229544 time 0.020000(0+0)
[2017-09-05T13:11:39.283] [10212120.0] debug: jag_common_poll_data: Task average frequency = 2197 pid 28458 mem size 1928 205988 time 0.020000(0+0)

However, if a slightly older version of the same software is used, memory seems to be reported correctly. Consider the submission script below:

#!/bin/sh
#SBATCH --time=04:00:00
#SBATCH -N 2 --ntasks-per-node=2 -J "vasp_run"
#SBATCH --mem-per-cpu=1800
#SBATCH --reservation=blah
#SBATCH -p blah
#SBATCH -w nodes-[1-2]

module purge
module load apps/vasp/5.3.3

mpirun vasp

And the reported stats:

 PPID   LWP PSR     TIME S     RSS ELAPSED COMMAND
22046 22050   0 00:00:00 S    1324   03:32 slurm_script
22050 22053   0 00:00:00 S    1476   03:32 mpirun
22053 22111   1 00:00:00 S    1428   03:32 mpiexec.hydra
22111 22112   2 00:00:00 S    4800   03:32 srun
22112 22113   3 00:00:00 S     668   03:32 srun
22121 22126   3 00:00:00 S    1352   03:32 pmi_proxy
22126 22127   5 00:03:30 R 1781544   03:32 vasp
22126 22128   0 00:03:30 R 1778052   03:32 vasp

[2017-09-05T13:17:58.097] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:17:58.265] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2527 pid 22126 mem size 3538576 7561020 time 59.630000(56+2)
[2017-09-05T13:18:28.101] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:18:28.270] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2505 pid 22126 mem size 3560928 7655316 time 119.590000(116+3)
[2017-09-05T13:18:58.105] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:18:58.275] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2498 pid 22126 mem size 3560932 7655316 time 179.570000(175+3)
[2017-09-05T13:19:28.109] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:19:28.281] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2495 pid 22126 mem size 3560932 7655316 time 239.550000(234+4)
[2017-09-05T13:19:58.113] [10215464] debug: jag_common_poll_data: Task average frequency = 2527 pid 22050 mem size 4228 230384 time 0.010000(0+0)
[2017-09-05T13:19:58.286] [10215464.0] debug: jag_common_poll_data: Task average frequency = 2493 pid 22126 mem size 3560936 7655316 time 299.510000(294+5)

Note the difference: in the vasp 5.4.1 case, step 10212120.0 is only ever charged 1928 KB -- which is pmi_proxy's own RSS -- while the six vasp_std ranks at roughly 1 GB RSS each are never accounted for. In the vasp 5.3.3 case, the reported mem size (~3.5 GB) matches the combined RSS of the two vasp processes under pmi_proxy, so the children are being picked up there.

Has anyone else noticed anything similar on their cluster(s)? I cannot confirm whether this was happening before we upgraded from 15.08.4 to 16.05.10-2.

Thanks,
John DeSantis
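P.S. For anyone wanting to cross-check on their own nodes: below is a rough diagnostic sketch (my own throwaway script, not Slurm code) that reads /proc directly and sums VmRSS over a process and all of its descendants. Comparing its output for the step's top PID (e.g. pmi_proxy above) against the mem size in the jag_common_poll_data lines makes the discrepancy obvious.

```python
#!/usr/bin/env python3
"""Sum VmRSS (in kB) over a process and all of its descendants,
by reading /proc directly.  Diagnostic sketch only."""
import os

def _ppid_map():
    """Map every live PID to its parent PID via /proc/<pid>/stat."""
    parents = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                stat = f.read()
            # The ppid is the second field after the parenthesised comm
            # field, which may itself contain spaces -- split on the
            # last ')' to parse it safely.
            ppid = int(stat.rsplit(")", 1)[1].split()[1])
        except (OSError, IndexError, ValueError):
            continue  # process exited mid-scan
        parents[int(entry)] = ppid
    return parents

def tree_rss_kb(root_pid):
    """Return summed VmRSS in kB for root_pid plus all descendants."""
    parents = _ppid_map()
    tree, work = {root_pid}, [root_pid]
    while work:  # collect the descendant set with a simple worklist
        cur = work.pop()
        for pid, ppid in parents.items():
            if ppid == cur and pid not in tree:
                tree.add(pid)
                work.append(pid)
    total = 0
    for pid in tree:
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])  # value is in kB
                        break
        except OSError:
            continue  # process exited before we read it
    return total

if __name__ == "__main__":
    import sys
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    print(f"RSS of process tree rooted at {pid}: {tree_rss_kb(pid)} kB")
```

Run against the pmi_proxy PID of a step, this should come out close to the sum of the vasp RSS values in the ps output -- in the 5.3.3 case that matches what slurmd logs, and in the 5.4.1 case it clearly does not.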