Hi Reuti,

One of the users indicated that -S was used in his job:
qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G ./cheat_script_0.sge

I have set up my own test that just does a simple dd on a local disk:

#!/bin/bash
#
#$ -j y
#$ -cwd
#$ -N bigtmpfile
#$ -l h_vmem=32G
#
echo "$HOST $tmp_requested $TMPDIR"
dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200

Our SGE has h_vmem=8G as the default for any job that does not specify h_vmem. With h_vmem=8G, some of my test jobs finished OK and some failed. I raised h_vmem to 32G and re-launched 10 jobs; all of them completed successfully. But I found something interesting about the maxvmem value in the qacct -j output, for example:

ru_nvcsw     46651
ru_nivcsw    1355
cpu          146.611
mem          87.885
io           199.501
iow          0.000
maxvmem      736.727M
arid         undefined

The maxvmem values for those 10 jobs are:

1 x 9.920G
1 x 5.540G
8 x 736.727M

So this explains why my test can fail when the default h_vmem=8G is used. I have to confess that I don't have a full understanding of maxvmem inside SGE. Why do a few out of 10 jobs running the same command show a much higher maxvmem value?
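In case it is useful, the per-job maxvmem values above can be pulled out of the accounting data with something like the one-liner below (just a rough sketch; it matches on the bigtmpfile job name from my test script, since qacct -j also accepts a job name):

# rough sketch: print "<jobnumber> <maxvmem>" for every accounting
# record that matches the test job name
qacct -j bigtmpfile | awk '/^jobnumber/ {id=$2} /^maxvmem/ {print id, $2}'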
Regards,
Derrick

On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> Hi,
>
>
> Am 02.11.2017 um 04:54 schrieb Derrick Lin <[email protected]>:
> >
> > Dear all,
> >
> > Recently, users reported that some of their jobs failed silently. I picked one up, checked, and found:
> >
> > 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit "h_vmem" of queue "[email protected]" (8942456832.00000 > limit: 8589934592.00000) - sending SIGKILL
> >
> > [root@alpha00 rocks_ansible]# qacct -j 610608
> > ==============================================================
> > qname        short.q
> > hostname     xxxxxx.local
> > group        g_xxxxxxx
> > owner        glsai
> > project      NONE
> > department   xxxxxxx
> > jobname      .name.out
> > jobnumber    610608
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Thu Nov 2 05:30:15 2017
> > start_time   Thu Nov 2 05:30:17 2017
> > end_time     Thu Nov 2 05:30:18 2017
> > granted_pe   NONE
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 1
> > ru_utime     0.007
> > ru_stime     0.006
> > ru_maxrss    1388
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    640
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   16
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     15
> > ru_nivcsw    3
> > cpu          0.013
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      8.328G
> > arid         undefined
> >
> > So of course, it is killed for going over the h_vmem limit (exit status 137, 137=128+9). A few things on my mind:
> >
> > 1) the same jobs have been running fine for a long time; they started failing two weeks ago (nothing has changed since I was on holiday)
> >
> > 2) the job almost failed instantly (like after 1 second). The job seems to fail on the very first command, which is a "cd" to a directory and printing an output. There is no way a "cd" command can consume 8GB of memory, right?
>
> Depends on the command interpreter. Maybe it's a huge bash version. Bash is addressed in the #! line of the script and any #$ lines for SGE have proper format? Or do you use the -S option to SGE?
>
> -- Reuti
>
> > 3) the same job will likely run successfully after re-submitting. So currently our users just keep re-submitting the failed jobs until they run successfully.
> >
> > 4) this happens on multiple execution hosts and multiple queues. So it seems not to be host- or queue-specific.
> >
> > I am wondering if it is possible that this is caused by the qmaster?
> >
> > Regards,
> > Derrick
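PS for anyone skimming the archives: the 137 = 128 + 9 reading quoted above can be double-checked with bash's builtin kill (just a quick sanity check, nothing SGE-specific):

# bash's builtin kill maps an exit status above 128 back to the signal name
kill -l 137    # prints KILL, i.e. 137 = 128 + SIGKILL(9)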
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
