On 06.11.2017 00:42, Derrick Lin wrote:
> Hi Reuti,
>
> Before I make the change, I want to check that this is the setting I should
> be looking at:
>
> gid_range   20000-20100
>
> gid_range
>
> The gid_range is a comma-separated list of range expressions of the form
> m-n, where m and n are integer numbers greater than 99, and m is an
> abbreviation for m-m. These numbers are used in sge_execd(8) to identify
> processes belonging to the same job.
>
> Each sge_execd(8) may use a separate set of group ids for this purpose.
> All numbers in the group id range have to be unused supplementary group
> ids on the system where the sge_execd(8) is started.
>
> Changing gid_range will take immediate effect. There is no default for
> gid_range. The administrator will have to assign a value for gid_range
> during installation of Grid Engine.
>
> The global configuration entry for this value may be overwritten by the
> execution host local configuration.
>
> It is true that the problematic hosts all seem to be busy with other jobs.
> Array jobs are also very popular on these hosts, and it is common to have
> more than 100 sub-processes on each host.
>
> Is it safe to set it to something like 20000-20500?

Yes, unless your users' groups fall into this range.

-- Reuti
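To make that check concrete, here is a minimal sketch (not from the thread; it assumes the execution hosts resolve the group database through getent and takes 20000-20500 as the candidate range). Every line it prints is an existing group whose GID collides with the proposed gid_range and would have to be avoided:

# List existing groups whose GID falls inside the proposed gid_range.
# Run on an execution host; empty output means the range is free.
getent group | awk -F: '$3 >= 20000 && $3 <= 20500 {print $3, $1}'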
> Cheers,
> Derrick
>
> On Mon, Nov 6, 2017 at 9:57 AM, Reuti <[email protected]> wrote:
> Hi,
>
> On 02.11.2017 11:39, Derrick Lin wrote:
>
> > Hi Reuti,
> >
> > One of the users indicated that -S was used in his job:
> >
> > qsub -P RNABiologyandPlasticity -cwd -V -pe smp 1 -N CyborgSummer -S
> > /bin/bash -t 1-11 -v mem_requested=12G,h_vmem=12G,tmp_requested=50G
> > ./cheat_script_0.sge
> >
> > I have set up my own test that just does a simple dd on a local disk:
> >
> > #!/bin/bash
> > #
> > #$ -j y
> > #$ -cwd
> > #$ -N bigtmpfile
> > #$ -l h_vmem=32G
> > #
> >
> > echo "$HOST $tmp_requested $TMPDIR"
> >
> > dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200
> >
> > Our SGE has h_vmem=8gb as the default for any job that does not specify
> > h_vmem. With h_vmem=8gb, some of my test jobs finished OK and some failed.
> > I raised h_vmem to 32gb and re-launched 10 jobs; all of them completed
> > successfully. But I found something interesting in the maxvmem value
> > reported by qacct -j, such as:
> >
> > ru_nvcsw     46651
> > ru_nivcsw    1355
> > cpu          146.611
> > mem          87.885
> > io           199.501
> > iow          0.000
> > maxvmem      736.727M
> > arid         undefined
> >
> > The maxvmem values for those 10 jobs are:
> >
> > 1 x 9.920G
> > 1 x 5.540G
> > 8 x 736.727M
>
> Is anything else running on the nodes which by accident has the same
> additional group ID (the range you defined in `qconf -mconf`)? This
> additional group ID is used to allow SGE to keep track of each job's
> resource consumption. Somehow I remember an issue where former additional
> group IDs were reused(?) although they were still in use.
>
> Can you please try to extend the range for the additional group IDs and
> check whether the problem persists. Or OTOH shrink the range and check
> whether it happens more often.
>
> -- Reuti
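One way to test the reused-additional-group-ID suspicion directly is to look at which GID from the gid_range each running job actually carries on a suspect node. The following is only a sketch, assuming a Linux /proc filesystem, shepherd processes named sge_shepherd, and the 20000-20500 range discussed above; if two different shepherds report the same GID, the reuse is confirmed:

# For every sge_shepherd on this node, print the supplementary GIDs of its
# direct children (the job's processes) that fall inside the gid_range.
for pid in $(pgrep -f sge_shepherd); do
    for child in $(pgrep -P "$pid"); do
        gids=$(awk '/^Groups:/ {for (i=2; i<=NF; i++) if ($i>=20000 && $i<=20500) printf "%s ", $i}' /proc/"$child"/status 2>/dev/null)
        echo "shepherd $pid  child $child  gid_range GIDs: $gids"
    done
done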
> > So this explains why my test can fail if the default h_vmem=8gb is used.
> > I have to confess that I don't have a full understanding of maxvmem
> > inside SGE. Why do a few out of 10 jobs running the same command show a
> > much higher maxvmem value?
> >
> > Regards,
> > Derrick
> >
> > On Thu, Nov 2, 2017 at 6:17 PM, Reuti <[email protected]> wrote:
> > Hi,
> >
> > > On 02.11.2017 04:54, Derrick Lin <[email protected]> wrote:
> > >
> > > Dear all,
> > >
> > > Recently, users have reported that some of their jobs fail silently.
> > > I picked one up, checked, and found:
> > >
> > > 11/02/2017 05:30:18| main|delta-5-3|W|job 610608 exceeds job hard limit
> > > "h_vmem" of queue "[email protected]" (8942456832.00000 >
> > > limit:8589934592.00000) - sending SIGKILL
> > >
> > > [root@alpha00 rocks_ansible]# qacct -j 610608
> > > ==============================================================
> > > qname         short.q
> > > hostname      xxxxxx.local
> > > group         g_xxxxxxx
> > > owner         glsai
> > > project       NONE
> > > department    xxxxxxx
> > > jobname       .name.out
> > > jobnumber     610608
> > > taskid        undefined
> > > account       sge
> > > priority      0
> > > qsub_time     Thu Nov 2 05:30:15 2017
> > > start_time    Thu Nov 2 05:30:17 2017
> > > end_time      Thu Nov 2 05:30:18 2017
> > > granted_pe    NONE
> > > slots         1
> > > failed        100 : assumedly after job
> > > exit_status   137
> > > ru_wallclock  1
> > > ru_utime      0.007
> > > ru_stime      0.006
> > > ru_maxrss     1388
> > > ru_ixrss      0
> > > ru_ismrss     0
> > > ru_idrss      0
> > > ru_isrss      0
> > > ru_minflt     640
> > > ru_majflt     0
> > > ru_nswap      0
> > > ru_inblock    0
> > > ru_oublock    16
> > > ru_msgsnd     0
> > > ru_msgrcv     0
> > > ru_nsignals   0
> > > ru_nvcsw      15
> > > ru_nivcsw     3
> > > cpu           0.013
> > > mem           0.000
> > > io            0.000
> > > iow           0.000
> > > maxvmem       8.328G
> > > arid          undefined
> > >
> > > So of course it is killed for exceeding the h_vmem limit (exit status
> > > 137, i.e. 128 + 9, a SIGKILL). A few things on my mind:
> > >
> > > 1) The same jobs have been running fine for a long time; they started
> > > failing two weeks ago (nothing has changed since I was on holiday).
> > >
> > > 2) The job fails almost instantly (after about 1 second). It seems to
> > > fail on the very first command, which is a "cd" to a directory followed
> > > by printing some output. There is no way a "cd" command can consume 8GB
> > > of memory, right?
> >
> > Depends on the command interpreter. Maybe it's a huge bash version. Bash
> > is addressed in the #! line of the script, and any #$ lines for SGE have
> > the proper format? Or do you use the -S option to SGE?
> >
> > -- Reuti
> >
> > > 3) The same job will likely run successfully after re-submission. So
> > > currently our users just keep re-submitting the failed jobs until they
> > > run successfully.
> > >
> > > 4) This happens on multiple execution hosts and multiple queues, so it
> > > does not seem to be host- or queue-specific.
> > >
> > > I am wondering whether this could be caused by the qmaster?
> > >
> > > Regards,
> > > Derrick
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
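For anyone wanting to gauge how widespread these h_vmem kills are, before and after changing gid_range, here is a rough sketch (it assumes that qacct without a job argument dumps every record from the accounting file and that such kills show exit_status 137, as in the record above). It counts the affected jobs per execution host, which speaks to point 4 in the original report:

# Count jobs that ended with exit_status 137 (SIGKILL, e.g. the h_vmem
# enforcement shown above), grouped by execution host.
qacct -j 2>/dev/null | awk '
    $1 == "hostname"                 { host = $2 }
    $1 == "exit_status" && $2 == 137 { kills[host]++ }
    END { for (h in kills) print h, kills[h] }'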
