Dear all,

Recently, some users reported that their jobs were failing silently. I picked one of them up to check and found:
11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit "h_vmem" of queue "[email protected]" (8942456832.00000 > limit:8589934592.00000) - sending SIGKILL

[root@alpha00 rocks_ansible]# qacct -j 610608
==============================================================
qname        short.q
hostname     xxxxxx.local
group        g_xxxxxxx
owner        glsai
project      NONE
department   xxxxxxx
jobname      .name.out
jobnumber    610608
taskid       undefined
account      sge
priority     0
qsub_time    Thu Nov 2 05:30:15 2017
start_time   Thu Nov 2 05:30:17 2017
end_time     Thu Nov 2 05:30:18 2017
granted_pe   NONE
slots        1
failed       100 : assumedly after job
exit_status  137
ru_wallclock 1
ru_utime     0.007
ru_stime     0.006
ru_maxrss    1388
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    640
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   16
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     15
ru_nivcsw    3
cpu          0.013
mem          0.000
io           0.000
iow          0.000
maxvmem      8.328G
arid         undefined

So of course, the job was killed for exceeding the h_vmem limit (exit status 137 = 128 + 9, i.e. SIGKILL). A few things on my mind:

1) The same jobs have been running fine for a long time; they only started failing two weeks ago (nothing has been changed, since I was away on holiday).
2) The job fails almost instantly (after about 1 second). It seems to fail on the very first command, which is a "cd" into a directory followed by printing some output. There is no way a "cd" command can consume 8 GB of memory, right?
3) The same job will likely run successfully after re-submitting, so currently our users just keep re-submitting the failed jobs until they go through.
4) This happens on multiple execution hosts and multiple queues, so it does not seem to be host- or queue-specific.

I am wondering whether this could be caused by the qmaster?

Regards,
Derrick
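
P.S. For completeness, this is roughly how the numbers above line up. It is only a sketch: the queue name and job id are the ones from the output above, and the spool path assumes the default layout, so adjust for your own setup.

# 1) The hard memory limit enforced by the queue (expect h_vmem around 8G here):
qconf -sq short.q | grep -E '_vmem'

# 2) How the job ended and what it peaked at:
qacct -j 610608 | grep -E 'failed|exit_status|maxvmem'

# 3) exit_status 137 = 128 + 9, i.e. the job was killed with signal 9:
kill -l 9    # prints KILL

# 4) The "exceeds job hard limit" warning comes from the execd messages file
#    on the node; this path assumes the default cell and local spool layout:
grep 610608 $SGE_ROOT/default/spool/*/messages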
