Hi Reuti,
Before I make the change, I want to check that this is the parameter I should be looking at:
gid_range 20000-20100

gid_range

The gid_range is a comma-separated list of range expressions of the form m-n,
where m and n are integer numbers greater than 99, and m is an abbreviation
for m-m. These numbers are used in sge_execd(8)
<https://arc.liv.ac.uk/SGE/htmlman/htmlman8/sge_execd.html> to identify
processes belonging to the same job. Each sge_execd(8) may use a separate set
of group ids for this purpose.

All numbers in the group id range have to be unused supplementary group ids on
the system where the sge_execd(8) is started.

Changing gid_range will take immediate effect. There is no default for
gid_range. The administrator will have to assign a value for gid_range during
installation of Grid Engine. The global configuration entry for this value may
be overwritten by the execution host local configuration.
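
If I read that correctly, a quick way to confirm that a candidate range is
actually free on an execution host would be something along these lines (just
a rough check, assuming the groups are resolvable via getent and using the
20000-20500 range I have in mind):

getent group | awk -F: '$3 >= 20000 && $3 <= 20500'

No output would mean none of those group IDs are defined as real groups on
that host.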

It is true that the problematic hosts all seem to be busy with other jobs.
Also, array jobs are run very frequently on these hosts, and it is common to
have more than 100 sub-processes on each host.

Is it safe to set it to something like 20000-20500?
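
Just to double-check my understanding of the mechanics, I assume the change
itself is simply editing the configuration, roughly like this (a sketch; the
host name is only a placeholder):

qconf -mconf                # edit the global configuration
# or, for a single execution host's local configuration:
qconf -mconf <hostname>
# then change the entry to:
gid_range 20000-20500
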
Cheers,
Derrick
On Mon, Nov 6, 2017 at 9:57 AM, Reuti <[email protected]> wrote:
> Hi,
>
> Am 02.11.2017 um 11:39 schrieb Derrick Lin:
>
> > Hi Reuti,
> >
> > One of the users indicates -S was used in his job:
> >
> > qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S
> /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G
> ./cheat_script_0.sge
> >
> > I have set up my own test that just does a simple dd on a local disk:
> >
> > #!/bin/bash
> > #
> > #$ -j y
> > #$ -cwd
> > #$ -N bigtmpfile
> > #$ -l h_vmem=32G
> > #
> >
> > echo "$HOST $tmp_requested $TMPDIR"
> >
> > dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200
> >
> > Our SGE has h_vmem=8gb as the default for any job that does not specify
> h_vmem. With h_vmem=8gb, some of my test jobs finished OK and some failed. I
> raised h_vmem to 32gb and re-launched 10 jobs; all of them completed
> successfully. But I found something interesting about the maxvmem value in
> the qacct -j output, for example:
> >
> > ru_nvcsw 46651
> > ru_nivcsw 1355
> > cpu 146.611
> > mem 87.885
> > io 199.501
> > iow 0.000
> > maxvmem 736.727M
> > arid undefined
> >
> > The maxvmem values for those 10 jobs are:
> >
> > 1 x 9.920G
> > 1 x 5.540G
> > 8 x 736.727M
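> >
> > For comparison, I am just reading those values out of the qacct output,
> > roughly like this (the job IDs are only placeholders):
> >
> > for j in <jobid1> <jobid2>; do qacct -j $j | awk '/^maxvmem/ {print $2}'; done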
>
> Is anything else running on the nodes which by accident has the same
> additional group ID (from the range you defined in `qconf -mconf`)? This
> additional group ID is used to allow SGE to keep track of each job's
> resource consumption. Somehow I remember an issue where former additional
> group IDs were reused(?) although they were still in use.
>
> Can you please try to extend the range for the additional group IDs and
> check whether the problem persists? Or, on the other hand, shrink the range
> and check whether it happens more often.
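>
> To see which additional group ID a running job actually got, something like
> this on the execution host should do (the PID is just a placeholder for one
> of the job's processes):
>
> grep ^Groups: /proc/<pid>/status
>
> The high number out of the gid_range should show up in that list; if two
> different jobs show the same one, that would match the reuse issue.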
>
> -- Reuti
>
>
> >
> > So this explains why my test can fail if the default h_vmem=8gb is used. I
> have to confess that I don't have a full understanding of maxvmem inside SGE.
> Why do a few out of 10 jobs running the same command have much higher maxvmem
> values?
> >
> > Regards,
> > Derrick
> >
> > On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]>
> wrote:
> > Hi,
> >
> > > Am 02.11.2017 um 04:54 schrieb Derrick Lin <[email protected]>:
> > >
> > > Dear all,
> > >
> > > Recently, users have reported that some of their jobs fail silently. I
> picked one up to check and found:
> > >
> > > 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard
> limit "h_vmem" of queue "[email protected]" (8942456832.00000 >
> limit:8589934592.00000) - sending SIGKILL
> > >
> > > [root@alpha00 rocks_ansible]# qacct -j 610608
> > > ==============================================================
> > > qname short.q
> > > hostname xxxxxx.local
> > > group g_xxxxxxx
> > > owner glsai
> > > project NONE
> > > department xxxxxxx
> > > jobname .name.out
> > > jobnumber 610608
> > > taskid undefined
> > > account sge
> > > priority 0
> > > qsub_time Thu Nov 2 05:30:15 2017
> > > start_time Thu Nov 2 05:30:17 2017
> > > end_time Thu Nov 2 05:30:18 2017
> > > granted_pe NONE
> > > slots 1
> > > failed 100 : assumedly after job
> > > exit_status 137
> > > ru_wallclock 1
> > > ru_utime 0.007
> > > ru_stime 0.006
> > > ru_maxrss 1388
> > > ru_ixrss 0
> > > ru_ismrss 0
> > > ru_idrss 0
> > > ru_isrss 0
> > > ru_minflt 640
> > > ru_majflt 0
> > > ru_nswap 0
> > > ru_inblock 0
> > > ru_oublock 16
> > > ru_msgsnd 0
> > > ru_msgrcv 0
> > > ru_nsignals 0
> > > ru_nvcsw 15
> > > ru_nivcsw 3
> > > cpu 0.013
> > > mem 0.000
> > > io 0.000
> > > iow 0.000
> > > maxvmem 8.328G
> > > arid undefined
> > >
> > > So of course, it was killed for exceeding the h_vmem limit (exit
> status 137 = 128+9). A few things on my mind:
> > >
> > > 1) the same jobs have been running fine for a long time but started
> failing two weeks ago (nothing has changed since I was on holiday)
> > >
> > > 2) the job failed almost instantly (after about 1 second). The job
> seems to fail on the very first command, which is a "cd" to a directory and
> printing an output. There is no way a "cd" command can consume 8GB of memory,
> right?
> >
> > It depends on the command interpreter. Maybe it's a huge bash version. Is
> bash addressed in the #! line of the script, and do any #$ lines for SGE have
> the proper format? Or do you use the -S option to SGE?
> >
> > -- Reuti
> >
> >
> > > 3) the same job will likely run successfully after re-submitting. So
> currently our users just keep re-submitting the failed jobs until they run
> successfully.
> > >
> > > 4) this happens on multiple execution hosts and multiple queues, so it
> seems not to be host- or queue-specific.
> > >
> > > I am wondering if it is possible that this is caused by the qmaster?
> > >
> > > Regards,
> > > Derrick
> >
> >
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users