Hi Reuti,

One of the users indicated that -S was used in his job:
qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G ./cheat_script_0.sge

I have set up my own test that just does a simple dd on a local disk:

#!/bin/bash
#
#$ -j y
#$ -cwd
#$ -N bigtmpfile
#$ -l h_vmem=32G
#
echo "$HOST $tmp_requested $TMPDIR"
dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200

Our SGE has h_vmem=8G as the default for any job that does not specify h_vmem. With h_vmem=8G, some of my test jobs finished OK and some failed. I raised h_vmem to 32G and re-launched 10 jobs; all of them completed successfully. But I found something interesting about the maxvmem value in the qacct -j output, for example:

ru_nvcsw     46651
ru_nivcsw    1355
cpu          146.611
mem          87.885
io           199.501
iow          0.000
maxvmem      736.727M
arid         undefined

The maxvmem values for those 10 jobs are:

1 x 9.920G
1 x 5.540G
8 x 736.727M

So this explains why my test can fail when the default h_vmem=8G is used. I have to confess that I don't have a full understanding of maxvmem inside SGE. Why do a few out of 10 jobs running the same command show a much higher maxvmem value?
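In case it is useful, the per-job maxvmem values above can be pulled out of the accounting data with something like the one-liner below (just a rough sketch; it matches on the bigtmpfile job name from my test script, since qacct -j also accepts a job name):

# rough sketch: print "<jobnumber> <maxvmem>" for every accounting
# record that matches the test job name
qacct -j bigtmpfile | awk '/^jobnumber/ {id=$2} /^maxvmem/ {print id, $2}'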
Regards,
Derrick

On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> Hi,
>
>
> Am 02.11.2017 um 04:54 schrieb Derrick Lin <[email protected]>:
> >
> > Dear all,
> >
> > Recently, users reported that some of their jobs failed silently. I picked one up, checked, and found:
> >
> > 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit "h_vmem" of queue "[email protected]" (8942456832.00000 > limit: 8589934592.00000) - sending SIGKILL
> >
> > [root@alpha00 rocks_ansible]# qacct -j 610608
> > ==============================================================
> > qname        short.q
> > hostname     xxxxxx.local
> > group        g_xxxxxxx
> > owner        glsai
> > project      NONE
> > department   xxxxxxx
> > jobname      .name.out
> > jobnumber    610608
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Thu Nov 2 05:30:15 2017
> > start_time   Thu Nov 2 05:30:17 2017
> > end_time     Thu Nov 2 05:30:18 2017
> > granted_pe   NONE
> > slots        1
> > failed       100 : assumedly after job
> > exit_status  137
> > ru_wallclock 1
> > ru_utime     0.007
> > ru_stime     0.006
> > ru_maxrss    1388
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    640
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   16
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     15
> > ru_nivcsw    3
> > cpu          0.013
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      8.328G
> > arid         undefined
> >
> > So of course, it is killed for going over the h_vmem limit (exit status 137, 137=128+9). A few things on my mind:
> >
> > 1) the same jobs have been running fine for a long time; they started failing two weeks ago (nothing has changed since I was on holiday)
> >
> > 2) the job almost failed instantly (like after 1 second). The job seems to fail on the very first command, which is a "cd" to a directory and printing an output. There is no way a "cd" command can consume 8GB of memory, right?
>
> Depends on the command interpreter. Maybe it's a huge bash version. Bash is addressed in the #! line of the script and any #$ lines for SGE have proper format? Or do you use the -S option to SGE?
>
> -- Reuti
>
> > 3) the same job will likely run successfully after re-submitting. So currently our users just keep re-submitting the failed jobs until they run successfully.
> >
> > 4) this happens on multiple execution hosts and multiple queues. So it seems not to be host- or queue-specific.
> >
> > I am wondering if it is possible that this is caused by the qmaster?
> >
> > Regards,
> > Derrick
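PS for anyone skimming the archives: the 137 = 128 + 9 reading quoted above can be double-checked with bash's builtin kill (just a quick sanity check, nothing SGE-specific):

# bash's builtin kill maps an exit status above 128 back to the signal name
kill -l 137    # prints KILL, i.e. 137 = 128 + SIGKILL(9)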
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
