Hi,

On 02.11.2017 at 11:39, Derrick Lin wrote:
> Hi Reuti,
>
> One of the users indicates that -S was used in his job:
>
> qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G ./cheat_script_0.sge
>
> I have set up my own test that just does a simple dd on a local disk:
>
> #!/bin/bash
> #
> #$ -j y
> #$ -cwd
> #$ -N bigtmpfile
> #$ -l h_vmem=32G
> #
>
> echo "$HOST $tmp_requested $TMPDIR"
>
> dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200
>
> Our SGE has h_vmem=8gb as the default for any job that does not specify h_vmem. With h_vmem=8gb, some of my test jobs finished OK, some failed. I raised h_vmem to 32gb, re-launched 10 jobs, and all of them completed successfully. But I found something interesting in the maxvmem value from the qacct -j result, such as:
>
> ru_nvcsw     46651
> ru_nivcsw    1355
> cpu          146.611
> mem          87.885
> io           199.501
> iow          0.000
> maxvmem      736.727M
> arid         undefined
>
> The maxvmem values for those 10 jobs are:
>
> 1 x 9.920G
> 1 x 5.540G
> 8 x 736.727M

Is anything else running on the nodes which by accident has the same additional group ID (from the range you defined in `qconf -mconf`)? This additional group ID is used to allow SGE to keep track of each job's resource consumption. Somehow I remember an issue where former additional group IDs were reused(?) although they were still in use.

Can you please try to extend the range for the additional group IDs and check whether the problem persists? Or, OTOH, shrink the range and check whether it happens more often.

-- Reuti
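A minimal sketch, assuming the standard qconf tooling, of how the additional group ID range mentioned above could be checked and then widened; the range in the comment is only an example value, not taken from this cluster:

# Show the current additional group ID range in the global cluster configuration
qconf -sconf | grep gid_range
# gid_range   20000-20100     (example output only; your range will differ)

# Open the global configuration in $EDITOR and enlarge the gid_range line
# (e.g. to 20000-21000), so that concurrently running jobs are less likely
# to be handed a recycled additional group ID:
qconf -mconf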
> So this explains why my test can fail if the default h_vmem=8gb is used. I have to confess that I don't have a full understanding of maxvmem inside SGE. Why do a few of the 10 jobs running the same command have a much higher maxvmem value?
>
> Regards,
> Derrick
>
> On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> Hi,
>
> > On 02.11.2017 at 04:54, Derrick Lin <[email protected]> wrote:
> >
> > Dear all,
> >
> > Recently, users have reported that some of their jobs failed silently. I picked one up to check and found:
> >
> > 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit "h_vmem" of queue "[email protected]" (8942456832.00000 > limit:8589934592.00000) - sending SIGKILL
> >
> > [root@alpha00 rocks_ansible]# qacct -j 610608
> > ==============================================================
> > qname        short.q
> > hostname     xxxxxx.local
> > group        g_xxxxxxx
> > owner        glsai
> > project      NONE
> > department   xxxxxxx
> > jobname      .name.out
> > jobnumber    610608
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Thu Nov 2 05:30:15 2017
> > start_time   Thu Nov 2 05:30:17 2017
> > end_time     Thu Nov 2 05:30:18 2017
> > granted_pe   NONE
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 1
> > ru_utime     0.007
> > ru_stime     0.006
> > ru_maxrss    1388
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    640
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   16
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     15
> > ru_nivcsw    3
> > cpu          0.013
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      8.328G
> > arid         undefined
> >
> > So of course, it was killed for exceeding the h_vmem limit (exit status 137 = 128 + 9). A few things on my mind:
> >
> > 1) the same jobs had been running fine for a long time; they started failing two weeks ago (nothing has changed, as I was on holiday)
> >
> > 2) the jobs fail almost instantly (like after 1 second). The job seems to fail on the very first command, which is a "cd" to a directory followed by printing some output. There is no way a "cd" command can consume 8GB of memory, right?
>
> Depends on the command interpreter. Maybe it's a huge bash version. Is bash addressed in the #! line of the script, and do the #$ lines for SGE have the proper format? Or do you use the -S option to SGE?
>
> -- Reuti
>
> > 3) the same job will likely run successfully after re-submitting. So currently our users just keep re-submitting the failed jobs until they run successfully.
> >
> > 4) this happens on multiple execution hosts and multiple queues, so it seems not to be host- or queue-specific.
> >
> > I am wondering whether this could be caused by the qmaster?
> >
> > Regards,
> > Derrick

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
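For reference, a minimal sketch of a job script header along the lines asked about above (#! line, #$ directives, -S); the resource value and the directory are made up, and whether the #! line itself is honoured depends on the queue's shell_start_mode (with posix_compliant, the queue's shell or the -S option decides the interpreter instead):

#!/bin/bash
#$ -S /bin/bash   # explicit interpreter, same effect as "qsub -S /bin/bash"
#$ -cwd           # run in the submission directory
#$ -j y           # merge stdout and stderr
#$ -l h_vmem=12G  # hard virtual memory limit; the job is killed if it is exceeded

cd /some/project/dir && echo "started on $(hostname) in $PWD"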

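As a side note on the "exit status 137 = 128 + 9" arithmetic quoted above, a quick bash sketch for decoding such a status (anything above 128 means the process died from signal status - 128):

status=137
sig=$((status - 128))   # 9
kill -l "$sig"          # prints KILL, i.e. the job received SIGKILL (here from SGE enforcing h_vmem)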