Hi,

> On 02.11.2017 at 04:54, Derrick Lin <[email protected]> wrote:
>
> Dear all,
>
> Recently some users reported that their jobs fail silently. I picked one up
> to check and found:
>
> 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit
> "h_vmem" of queue "[email protected]" (8942456832.00000 >
> limit:8589934592.00000) - sending SIGKILL
>
> [root@alpha00 rocks_ansible]# qacct -j 610608
> ==============================================================
> qname        short.q
> hostname     xxxxxx.local
> group        g_xxxxxxx
> owner        glsai
> project      NONE
> department   xxxxxxx
> jobname      .name.out
> jobnumber    610608
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Thu Nov 2 05:30:15 2017
> start_time   Thu Nov 2 05:30:17 2017
> end_time     Thu Nov 2 05:30:18 2017
> granted_pe   NONE
> slots        1
> failed       100 : assumedly after job
> exit_status  137
> ru_wallclock 1
> ru_utime     0.007
> ru_stime     0.006
> ru_maxrss    1388
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    640
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   16
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     15
> ru_nivcsw    3
> cpu          0.013
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      8.328G
> arid         undefined
>
> So of course, it was killed for exceeding the h_vmem limit (exit status
> 137 = 128 + 9, i.e. SIGKILL). A few things on my mind:
>
> 1) the same jobs have been running fine for a long time; they started failing
> two weeks ago (nothing has changed, as I was on holiday)
>
> 2) the job fails almost instantly (after about 1 second). It seems to fail on
> the very first command, which is a "cd" into a directory followed by printing
> some output. There is no way a "cd" command can consume 8 GB of memory, right?
Depends on the command interpreter. Maybe it's a bash version with a huge
memory footprint. Is bash addressed in the #! line of the script, and do the
#$ lines for SGE have the proper format? Or do you use the -S option to SGE?
(See the sketch below.)

-- Reuti

> 3) the same job will usually run successfully after re-submission. So
> currently our users just keep re-submitting the failed jobs until they run
> successfully.
>
> 4) this happens on multiple execution hosts and multiple queues, so it does
> not seem to be host- or queue-specific.
>
> I am wondering whether this could be caused by the qmaster?
>
> Regards,
> Derrick
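To illustrate the question about the interpreter: below is a minimal sketch of
a job script header, assuming bash is the intended shell. The memory value,
the path and the script body are placeholders, not taken from the original job.

#!/bin/bash
# Minimal SGE job script sketch; all values below are placeholders.
# Run the script with bash on the execution host
# (same effect as passing "-S /bin/bash" to qsub):
#$ -S /bin/bash
# Start in the directory the job was submitted from:
#$ -cwd
# Per-slot hard virtual memory limit (adjust to the real requirement):
#$ -l h_vmem=8G

cd /path/to/workdir && echo "started in $(pwd)"

As far as I recall, whether the #! line is honoured at all depends on the
queue's shell_start_mode (unix_behavior honours #!; with the default
posix_compliant the queue's configured shell is used unless -S is given), so
the queue configuration is worth a look as well.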

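For what it's worth, the numbers in the log line and the accounting record are
consistent: 8942456832 bytes is about 8.33 GiB, matching the maxvmem of 8.328G
reported by qacct, and 8589934592 bytes is exactly 8 GiB, the queue's h_vmem
limit. A couple of standard commands to double-check this, assuming bash and a
stock SGE installation (queue name and job id taken from the output above):

# 137 - 128 = 9, i.e. the job was killed by signal 9 (SIGKILL):
kill -l $((137 - 128))

# Compare the queue's configured hard limit with what the job actually used:
qconf -sq short.q | grep h_vmem
qacct -j 610608 | grep maxvmem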