Hi,

> On 02.11.2017 at 04:54, Derrick Lin <[email protected]> wrote:
> 
> Dear all,
> 
> Recently, some of my users reported that some of their jobs failed
> silently. I picked one and checked it, and found:
> 
> 11/02/2017 05:30:18|  main|delta-5-3|W|job 610608 exceeds job hard limit 
> "h_vmem" of queue "[email protected]" (8942456832.00000 > 
> limit:8589934592.00000) - sending SIGKILL
> 
> [root@alpha00 rocks_ansible]# qacct -j 610608
> ==============================================================
> qname        short.q
> hostname     xxxxxx.local
> group        g_xxxxxxx
> owner        glsai
> project      NONE
> department   xxxxxxx
> jobname      .name.out
> jobnumber    610608
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Thu Nov  2 05:30:15 2017
> start_time   Thu Nov  2 05:30:17 2017
> end_time     Thu Nov  2 05:30:18 2017
> granted_pe   NONE
> slots        1
> failed       100 : assumedly after job
> exit_status  137
> ru_wallclock 1
> ru_utime     0.007
> ru_stime     0.006
> ru_maxrss    1388
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    640
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   16
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     15
> ru_nivcsw    3
> cpu          0.013
> mem          0.000
> io           0.000
> iow          0.000
> maxvmem      8.328G
> arid         undefined
> 
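
The figures in the log line and the qacct record above can be cross-checked with a few lines of Python. This is only a sketch: the helper names `to_gib` and `decode_exit_status` are made up for illustration, the byte counts are copied from the log line, and the exit status is interpreted under the usual shell convention of 128 + signal number.

```python
import signal

GIB = 1024 ** 3


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def decode_exit_status(status):
    """Return the signal name for a 128+N exit status, or None otherwise."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None


used = 8942456832   # value reported in the log line
limit = 8589934592  # h_vmem limit reported in the log line

print(f"limit: {to_gib(limit):.3f} GiB")
print(f"used:  {to_gib(used):.3f} GiB")
print(f"over by {(used - limit) / 1024 ** 2:.1f} MiB")
print("exit_status 137 ->", decode_exit_status(137))
```

The limit works out to exactly 8 GiB, and the reported usage rounds to the 8.328G shown as maxvmem, so the log line and the accounting record agree with each other.
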
> So clearly it was killed for exceeding the h_vmem limit (exit status 137 =
> 128 + 9, i.e. SIGKILL). A few things come to mind:
> 
> 1) The same jobs had been running fine for a long time; they started
> failing two weeks ago (nothing has changed, since I was on holiday).
> 
> 2) The job failed almost instantly (after about 1 second). It seems to
> fail on the very first command, which is a "cd" to a directory followed by
> printing some output. There is no way a "cd" command can consume 8 GB of
> memory, right?

That depends on the command interpreter. Maybe it's a huge bash version. Is
bash specified in the #! line of the script, and do any #$ lines for SGE
have the proper format? Or do you use the -S option of SGE?

-- Reuti


> 3) The same job will likely run successfully after being resubmitted. So
> currently our users just keep resubmitting the failed jobs until they run
> successfully.
> 
> 4) This happens on multiple execution hosts and in multiple queues, so it
> does not seem to be host- or queue-specific.
> 
> I am wondering whether this could be caused by the qmaster?
> 
> Regards,
> Derrick
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

