Hi Reuti,

Before I make the change, I want to check that this is the setting I should be looking at:

gid_range                    20000-20100

gid_range
       The gid_range is a comma-separated list of range expressions of the
       form m-n, where m and n are integer numbers greater than 99, and m is
       an abbreviation for m-m.  These numbers are used in sge_execd(8)
       <https://arc.liv.ac.uk/SGE/htmlman/htmlman8/sge_execd.html> to
       identify processes belonging to the same job.

       Each sge_execd(8) may use a separate set of group ids for this
       purpose.  All numbers in the group id range have to be unused
       supplementary group ids on the system, where the sge_execd(8) is
       started.

       Changing gid_range will take immediate effect.  There is no default
       for gid_range.  The administrator will have to assign a value for
       gid_range during installation of Grid Engine.

       The global configuration entry for this value may be overwritten by
       the execution host local configuration.


It is true that the problematic hosts all seem to be busy with other jobs.
Array jobs are also very popular on these hosts, and it is common to have
more than 100 of their tasks running concurrently on each host. Since
20000-20100 only provides 101 group IDs, I can see how the range might get
exhausted.
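
To be sure, I will check on one of the busy hosts how many of those
additional group IDs are actually handed out at the moment. A rough check I
had in mind (just a sketch, assuming each running task has its own
sge_shepherd process carrying the additional GID as a supplementary group):

# count running shepherds and show their supplementary group IDs
pgrep -c sge_shepherd
for pid in $(pgrep sge_shepherd); do grep '^Groups:' /proc/$pid/status; done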

Is it safe to set it to something like 20000-20500?
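
If it is, I would verify and apply the change roughly like this (a sketch,
assuming qconf/getent are the right tools here; I would repeat the getent
check on the exec hosts as well, since the GIDs must be unused there):

qconf -sconf | grep gid_range                        # current global setting
getent group | awk -F: '$3 >= 20000 && $3 <= 20500'  # should print nothing
qconf -mconf                                         # then edit gid_range to 20000-20500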

Cheers,
Derrick

On Mon, Nov 6, 2017 at 9:57 AM, Reuti <[email protected]> wrote:

> Hi,
>
> On 02.11.2017 at 11:39, Derrick Lin wrote:
>
> > Hi Reuti,
> >
> > One of the users indicated that -S was used in his job:
> >
> > qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S
> > /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G
> > ./cheat_script_0.sge
> >
> > I have set up my own test that just does a simple dd on a local disk:
> >
> > #!/bin/bash
> > #
> > #$ -j y
> > #$ -cwd
> > #$ -N bigtmpfile
> > #$ -l h_vmem=32G
> > #
> >
> > echo "$HOST $tmp_requested $TMPDIR"
> >
> > dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200
> >
> > Our SGE has h_vmem=8G as the default for any job that does not specify
> > h_vmem. With h_vmem=8G, some of my test jobs finished OK and some failed.
> > After I raised h_vmem to 32G and re-launched 10 jobs, all of them completed
> > successfully. But I found something interesting in the maxvmem values from
> > the qacct -j output, such as:
> >
> > ru_nvcsw     46651
> > ru_nivcsw    1355
> > cpu          146.611
> > mem          87.885
> > io           199.501
> > iow          0.000
> > maxvmem      736.727M
> > arid         undefined
> >
> > The maxvmem values for those 10 jobs are:
> >
> > 1 x 9.920G
> > 1 x 5.540G
> > 8 x 736.727M
>
> Is anything else running on the nodes which by accident has the same
> additional group ID (from the range you defined in `qconf -mconf`)? This
> additional group ID is used to allow SGE to keep track of each job's
> resource consumption. Somehow I remember an issue where former additional
> group IDs were reused(?) although they were still in use.
>
> Can you please try to extend the range for the additional group IDs and
> check whether the problem persists? Or, OTOH, shrink the range and check
> whether it happens more often.
>
> -- Reuti
>
>
> >
> > So this explains why my test can fail if the default h_vmem=8G is used. I
> > have to confess that I don't have a full understanding of maxvmem inside
> > SGE. Why do a few out of the 10 jobs running the same command have a much
> > higher maxvmem value?
> >
> > Regards,
> > Derrick
> >
> > On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > > On 02.11.2017 at 04:54, Derrick Lin <[email protected]> wrote:
> > >
> > > Dear all,
> > >
> > > Recently, users have reported that some of their jobs fail silently. I
> > > picked one up to check and found:
> > >
> > > 11/02/2017 05:30:18|  main|delta-5-3|W|job 610608 exceeds job hard
> limit "h_vmem" of queue "[email protected]" (8942456832.00000 >
> limit:8589934592.00000) - sending SIGKILL
> > >
> > > [root@alpha00 rocks_ansible]# qacct -j 610608
> > > ==============================================================
> > > qname        short.q
> > > hostname     xxxxxx.local
> > > group        g_xxxxxxx
> > > owner        glsai
> > > project      NONE
> > > department   xxxxxxx
> > > jobname      .name.out
> > > jobnumber    610608
> > > taskid       undefined
> > > account      sge
> > > priority     0
> > > qsub_time    Thu Nov  2 05:30:15 2017
> > > start_time   Thu Nov  2 05:30:17 2017
> > > end_time     Thu Nov  2 05:30:18 2017
> > > granted_pe   NONE
> > > slots        1
> > > failed       100 : assumedly after job
> > > exit_status  137
> > > ru_wallclock 1
> > > ru_utime     0.007
> > > ru_stime     0.006
> > > ru_maxrss    1388
> > > ru_ixrss     0
> > > ru_ismrss    0
> > > ru_idrss     0
> > > ru_isrss     0
> > > ru_minflt    640
> > > ru_majflt    0
> > > ru_nswap     0
> > > ru_inblock   0
> > > ru_oublock   16
> > > ru_msgsnd    0
> > > ru_msgrcv    0
> > > ru_nsignals  0
> > > ru_nvcsw     15
> > > ru_nivcsw    3
> > > cpu          0.013
> > > mem          0.000
> > > io           0.000
> > > iow          0.000
> > > maxvmem      8.328G
> > > arid         undefined
> > >
> > > So of course it was killed for exceeding the h_vmem limit (exit status
> > > 137 = 128 + 9, i.e. SIGKILL). A few things on my mind:
> > >
> > > 1) the same jobs have been running fine for a long time; they started
> > > failing two weeks ago (nothing has changed on our side, since I was on
> > > holiday)
> > >
> > > 2) the job failed almost instantly (after about 1 second). The job seems
> > > to fail on the very first command, which is a "cd" into a directory
> > > followed by printing some output. There is no way a "cd" command can
> > > consume 8 GB of memory, right?
> >
> > It depends on the command interpreter. Maybe it's a huge bash version. Is
> > bash addressed in the #! line of the script, and do any #$ lines for SGE
> > have the proper format? Or do you use the -S option to SGE?
> >
> > -- Reuti
> >
> >
> > > 3) the same job will likely run successfully after re-submitting. So
> currently our users just keep re-submitting the failed jobs until they run
> successfully.
> > >
> > > 4) this happens on multiple execution hosts and multiple queues, so it
> > > does not seem to be host- or queue-specific.
> > >
> > > I am wondering if this could possibly be caused by the qmaster?
> > >
> > > Regards,
> > > Derrick
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
