Hi,

On 02.11.2017 at 11:39, Derrick Lin wrote:

> Hi Reuti,
> 
> One of the users indicates -S was used in his job:
> 
> qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S 
> /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G 
> ./cheat_script_0.sge
> 
> I have set up my own test that just does a simple dd on a local disk:
> 
> #!/bin/bash
> #
> #$ -j y
> #$ -cwd
> #$ -N bigtmpfile
> #$ -l h_vmem=32G
> #
> 
> echo "$HOST $tmp_requested $TMPDIR"
> 
> dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200
> 
> Our SGE has h_vmem=8G as the default for any job that does not specify 
> h_vmem. With h_vmem=8G, some of my test jobs finished OK and some failed. I 
> raised h_vmem to 32G, re-launched 10 jobs, and all of them completed 
> successfully. But I found something interesting in the maxvmem values from 
> the qacct -j output, such as:
> 
> ru_nvcsw     46651
> ru_nivcsw    1355
> cpu          146.611
> mem          87.885
> io           199.501
> iow          0.000
> maxvmem      736.727M
> arid         undefined
> 
> The maxvmem values for those 10 jobs are:
> 
> 1 x 9.920G
> 1 x 5.540G
> 8 x 736.727M
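
For reference, a quick way to compare maxvmem across a batch of finished jobs 
is to loop over the job IDs and pick the value out of the qacct output. A 
rough sketch (the script name and the job IDs are only placeholders):

#!/bin/bash
# maxvmem_report.sh (hypothetical name): print the reported maxvmem for each
# finished job ID given on the command line, e.g. ./maxvmem_report.sh 610608
for jobid in "$@"; do
    # qacct -j prints the accounting record; keep only the maxvmem line.
    printf '%s: ' "$jobid"
    qacct -j "$jobid" | awk '/^maxvmem/ {print $2}'
done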

Is anything else running on the nodes which by accident has the same 
additional group ID (from the range you defined in `qconf -mconf`)? This 
additional group ID is used to allow SGE to keep track of each job's resource 
consumption. I vaguely remember an issue where former additional group IDs 
were reused although they were still in use.
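
One way to check is to scan the supplementary groups of all running processes 
on a node. A rough sketch, assuming a Linux /proc filesystem (the GID on the 
command line is just a placeholder from your configured range):

#!/bin/bash
# List processes whose supplementary groups contain the given GID, i.e.
# processes for which that additional group ID is (still) in use.
gid=${1:?usage: $0 <gid>}
for status in /proc/[0-9]*/status; do
    # /proc/<pid>/status carries a "Groups:" line with the supplementary GIDs.
    if grep -Eq "^Groups:.*[[:space:]]${gid}([[:space:]]|$)" "$status" 2>/dev/null; then
        pid=${status#/proc/}; pid=${pid%/status}
        echo "GID ${gid} held by PID ${pid}: $(tr '\0' ' ' < /proc/${pid}/cmdline 2>/dev/null)"
    fi
done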

Can you please try to extend the range for the additional group IDs and check 
whether the problem persists? Or, on the other hand, shrink the range and 
check whether it happens more often.
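
The range itself is the gid_range parameter of the global configuration, so 
before and after editing it with qconf -mconf you can verify it with 
something like:

# Show the currently configured additional group ID range.
qconf -sconf | grep gid_range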

-- Reuti


> 
> So this explains why my test can fail when the default h_vmem=8G is used. I 
> have to confess that I don't have a full understanding of maxvmem inside 
> SGE. Why do a few of the 10 jobs running the same command show a much higher 
> maxvmem value?
> 
> Regards,
> Derrick
> 
> On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> Hi,
> 
> > On 02.11.2017 at 04:54, Derrick Lin <[email protected]> wrote:
> >
> > Dear all,
> >
> > Recently, I have users reported some of their jobs failed silently. I 
> > picked one up and check, found:
> >
> > 11/02/2017 05:30:18|  main|delta-5-3|W|job 610608 exceeds job hard limit 
> > "h_vmem" of queue "[email protected]" (8942456832.00000 > 
> > limit:8589934592.00000) - sending SIGKILL
> >
> > [root@alpha00 rocks_ansible]# qacct -j 610608
> > ==============================================================
> > qname        short.q
> > hostname     xxxxxx.local
> > group        g_xxxxxxx
> > owner        glsai
> > project      NONE
> > department   xxxxxxx
> > jobname      .name.out
> > jobnumber    610608
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Thu Nov  2 05:30:15 2017
> > start_time   Thu Nov  2 05:30:17 2017
> > end_time     Thu Nov  2 05:30:18 2017
> > granted_pe   NONE
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 1
> > ru_utime     0.007
> > ru_stime     0.006
> > ru_maxrss    1388
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    640
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   16
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     15
> > ru_nivcsw    3
> > cpu          0.013
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      8.328G
> > arid         undefined
> >
> > So of course, it is killed because it exceeded the h_vmem limit (exit 
> > status 137 = 128 + 9, i.e. SIGKILL). A few things on my mind:
> >
> > 1) the same jobs had been running fine for a long time; they started 
> > failing two weeks ago (nothing has changed, since I was on holiday)
> >
> > 2) the job fails almost instantly (after about 1 second). The job seems to 
> > fail on the very first command, which is a "cd" into a directory followed 
> > by printing some output. There is no way a "cd" command can consume 8 GB 
> > of memory, right?
> 
> That depends on the command interpreter. Maybe it's a huge bash version. Is 
> bash addressed in the #! line of the script, and do the #$ lines for SGE 
> have the proper format? Or do you use the -S option to SGE?
> 
> -- Reuti
> 
> 
> > 3) the same job will likely run successfully after re-submitting. So 
> > currently our users just keep re-submitting the failed jobs until they run 
> > successfully.
> >
> > 4) this happens on multiple execution hosts and multiple queues, so it 
> > does not seem to be host- or queue-specific.
> >
> > I am wondering whether this could be caused by the qmaster?
> >
> > Regards,
> > Derrick
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
