Dear all,

Recently, some users reported that their jobs were failing silently. I picked one of them up to check and found:
11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit "h_vmem" of queue "[email protected]" (8942456832.00000 > limit:8589934592.00000) - sending SIGKILL

[root@alpha00 rocks_ansible]# qacct -j 610608
==============================================================
qname        short.q
hostname     xxxxxx.local
group        g_xxxxxxx
owner        glsai
project      NONE
department   xxxxxxx
jobname      .name.out
jobnumber    610608
taskid       undefined
account      sge
priority     0
qsub_time    Thu Nov 2 05:30:15 2017
start_time   Thu Nov 2 05:30:17 2017
end_time     Thu Nov 2 05:30:18 2017
granted_pe   NONE
slots        1
failed       100 : assumedly after job
exit_status  137
ru_wallclock 1
ru_utime     0.007
ru_stime     0.006
ru_maxrss    1388
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    640
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   16
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     15
ru_nivcsw    3
cpu          0.013
mem          0.000
io           0.000
iow          0.000
maxvmem      8.328G
arid         undefined

So of course, the job was killed for exceeding the h_vmem limit (exit status 137 = 128 + 9, i.e. SIGKILL). A few things on my mind:

1) The same jobs have been running fine for a long time; they only started failing two weeks ago (nothing has been changed, since I was away on holiday).
2) The job fails almost instantly (after about 1 second). It seems to fail on the very first command, which is a "cd" into a directory followed by printing some output. There is no way a "cd" command can consume 8 GB of memory, right?
3) The same job will likely run successfully after re-submitting, so currently our users just keep re-submitting the failed jobs until they go through.
4) This happens on multiple execution hosts and multiple queues, so it does not seem to be host- or queue-specific.

I am wondering whether this could be caused by the qmaster?

Regards,
Derrick
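
P.S. For completeness, this is roughly how the numbers above line up. It is only a sketch: the queue name and job id are the ones from the output above, and the spool path assumes the default layout, so adjust for your own setup.

# 1) The hard memory limit enforced by the queue (expect h_vmem around 8G here):
qconf -sq short.q | grep -E '_vmem'

# 2) How the job ended and what it peaked at:
qacct -j 610608 | grep -E 'failed|exit_status|maxvmem'

# 3) exit_status 137 = 128 + 9, i.e. the job was killed with signal 9:
kill -l 9    # prints KILL

# 4) The "exceeds job hard limit" warning comes from the execd messages file
#    on the node; this path assumes the default cell and local spool layout:
grep 610608 $SGE_ROOT/default/spool/*/messages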
