Yes, I set up local spool directories, and the sgeadmin account can write to them. In fact, it is writing to the messages file in that local directory, but I still get that error.

$ ll
total 108
drwxr-xr-x 3 sgeadmin sgeadmin  4096 Aug 22 13:44 active_jobs
-rw-r--r-- 1 sgeadmin sgeadmin     6 Jul 10 14:18 execd.pid
drwxr-xr-x 3 sgeadmin sgeadmin  4096 Aug 18 16:58 jobs
drwxr-xr-x 2 sgeadmin sgeadmin  4096 Jul 10 14:18 job_scripts
-rw-r--r-- 1 sgeadmin sgeadmin 93014 Aug 22 13:44 messages
[johnt@BJSMICDS126 BJSMICDS126]$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda4       486G  8.2G  453G   2% /
[johnt@BJSMICDS126 BJSMICDS126]$ qconf -sconf |grep spool
execd_spool_dir              /opt/sge/default/spool
[johnt@BJSMICDS126 BJSMICDS126]$ pwd
/opt/sge/default/spool/BJSMICDS126
[johnt@BJSMICDS126 BJSMICDS126]$

-----Original Message-----
From: Reuti [mailto:[email protected]]
Sent: Monday, August 21, 2017 5:13
To: John_Tai
Cc: [email protected]
Subject: Re: [gridengine users] error reason 1: can not find an unused add_grp_id

Hi,

> On 21.08.2017 at 09:18, John_Tai <[email protected]> wrote:
>
> I changed gid_range, it used to be just 20000. Now it's 20000-20200

Unless you have more than 201 cores per exechost, this is fine.

> However now when I submit a job the host goes in error state. I checked the
> messages log:
>
> 08/21/2017 15:06:55| main|BJSMICDS126|E|shepherd of job 89.1 exited with exit status = 7
> 08/21/2017 15:06:55| main|BJSMICDS126|E|can't open pid file "active_jobs/89.1/pid" for job 89.1
>

There must be another config problem. Can the exechosts write to the location of the spool directory? Often it's better to have at least the nodes writing to a local place. This can even be done after installation: shut down the exechosts, change the setting of the spool directory to a local place on the exechosts (`qconf -mconf`), create these directories like /var/spool/sge (the exechost specific directory will be created when the sge_execd starts up).

https://arc.liv.ac.uk/SGE/howto/nfsreduce.html

-- Reuti

> Any ideas?
>
> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Friday, August 18, 2017 4:37
> To: John_Tai
> Cc: [email protected]
> Subject: Re: [gridengine users] error reason 1: can not find an unused add_grp_id
>
> Hi,
>
>> On 18.08.2017 at 02:30, John_Tai <[email protected]> wrote:
>>
>> When I submit more than 1 job to a queue, the job is queued even though
>> there are free slots available. When I check this waiting job status with
>> qstat -j I find this error message:
>>
>> error reason 1: can not find an unused add_grp_id
>>
>> What does it mean?
>
> Each job in SGE gets an additional group ID attached, which enables SGE to
> track the consumed resources.
>
> What is your setting of:
>
> $ qconf -sconf
> #global:
> …
> gid_range 20000-20100
>
> Is this range in your case lower than the number of installed cores per
> exechost? As there might be a delay when old group IDs are released again, it
> would help to have some more IDs than the real number of cores (resp. threads
> in case you use them).
>
> -- Reuti
> ________________________________
>
> This email (including its attachments, if any) may be confidential and
> proprietary information of SMIC, and intended only for the use of the named
> recipient(s) above. Any unauthorized use or disclosure of this email is
> strictly prohibited. If you are not the intended recipient(s), please notify
> the sender immediately and delete this email from your computer.
> ________________________________

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
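
The gid_range advice in this thread (have somewhat more additional group IDs than cores per exechost) can be checked with a small shell sketch. The helper names gid_range_size and gid_range_covers are hypothetical, not part of SGE. The sketch assumes the gid_range value is a comma-separated list of entries of the form n-m or a single n, and it only does the arithmetic; it does not query the cluster itself.

```shell
#!/bin/sh
# Hypothetical helpers, not part of SGE.

# Print how many group IDs a gid_range value contains.
# Assumes a comma-separated list of "n-m" ranges or single "n" values,
# e.g. "20000-20200" or "20000".
# The body runs in a subshell so the IFS change does not leak to the caller.
gid_range_size() (
    total=0
    IFS=,
    for part in $1; do
        case $part in
            *-*) total=$(( total + ${part##*-} - ${part%%-*} + 1 )) ;;
            *)   total=$(( total + 1 )) ;;
        esac
    done
    echo "$total"
)

# Succeed when the range holds more IDs than the given core count,
# following the "some more IDs than cores" advice from the thread.
gid_range_covers() {
    [ "$(gid_range_size "$1")" -gt "$2" ]
}
```

For the values discussed above, gid_range_size 20000 counts a single ID, while gid_range_size 20000-20200 counts 201. On an exechost one might feed it live values, for example gid_range_covers "$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')" "$(nproc)", assuming awk and nproc are available there and that the global configuration prints gid_range as a two-column line.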
