Hi all, I'm having a crazy time fixing an issue with 3 qinstances stuck in the E (error) state.
[root@rndusljpp2 opt]# qstat -explain E
queuename                      qtype resv/used/tot. load_avg arch     states
---------------------------------------------------------------------------------
allhosts.q@c1                  BIP   0/0/12         0.03     lx-amd64 E
	queue allhosts.q marked QERROR as result of job 1546377's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546378's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546379's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546380's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546381's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546382's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546383's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546384's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546385's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546386's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546387's failure at host c1
	queue allhosts.q marked QERROR as result of job 1546388's failure at host c1

[root@rndusljpp2 opt]# qacct -j 1546377
==============================================================
qname        allhosts.q
hostname     c1
group        mseierst
owner        mseierst
project      NONE
department   defaultdepartment
jobname      macrocycle_2D_s
jobnumber    1546377
taskid       undefined
account      sge
priority     0
qsub_time    Wed Dec 31 16:00:00 1969
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        0
failed       1 : assumedly before job
exit_status  0
ru_wallclock 0s
ru_utime     0.000s
ru_stime     0.000s
ru_maxrss    0.000B
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    0
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   0
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     0
ru_nivcsw    0
cpu          0.000s
mem          0.000GBs
io           0.000GB
iow          0.000s
maxvmem      0.000B
arid         undefined
ar_sub_time  undefined
category     -q allhosts.q

[root@rndusljpp2 opt]# cat /opt/sge/default/spool/qmaster ---------
from qmaster:
06/02/2016 15:17:14|worker|rndusljpp2|W|job 1546377.1 failed on host c1 general assumedly before job because: can't create directory active_jobs/1546377.1: No such file or directory
06/02/2016 15:17:14|worker|rndusljpp2|W|rescheduling job 1546377.1
06/02/2016 15:17:14|worker|rndusljpp2|E|queue allhosts.q marked QERROR as result of job 1546377's failure at host c1

qstat shows the job has been rescheduled to c5 and is running:

[root@rndusljpp2 common]# qstat -u "*"
job-ID  prior   name       user     state submit/start at     queue         slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1386811 0.55500 macrocycle mseierst r     05/27/2016 10:02:53 allhosts.q@c6     1
1545754 0.55500 macrocycle mseierst r     06/02/2016 13:22:31 allhosts.q@c7     1
1545760 0.55500 macrocycle mseierst r     06/02/2016 13:23:08 allhosts.q@c7     1
1545772 0.55500 macrocycle mseierst r     06/02/2016 13:24:04 allhosts.q@c7     1
1545773 0.55500 macrocycle mseierst r     06/02/2016 13:24:10 allhosts.q@c7     1
1545780 0.55500 macrocycle mseierst r     06/02/2016 13:24:29 allhosts.q@c7     1
1545785 0.55500 macrocycle mseierst r     06/02/2016 13:24:57 allhosts.q@c7     1
1545787 0.55500 macrocycle mseierst r     06/02/2016 13:25:03 allhosts.q@c7     1
1545796 0.55500 macrocycle mseierst r     06/02/2016 13:25:45 allhosts.q@c7     1
1545806 0.55500 macrocycle mseierst r     06/02/2016 13:26:36 allhosts.q@c7     1
1545807 0.55500 macrocycle mseierst r     06/02/2016 13:26:36 allhosts.q@c7     1
1545815 0.55500 macrocycle mseierst r     06/02/2016 13:27:08 allhosts.q@c7     1
1545822 0.55500 macrocycle mseierst r     06/02/2016 13:27:54 allhosts.q@c7     1
1546062 0.55500 macrocycle mseierst r     06/02/2016 14:15:29 allhosts.q@c7     1
1546212 0.55500 macrocycle mseierst r     06/02/2016 14:46:40 allhosts.q@c7     1
1546313 0.55500 macrocycle mseierst r     06/02/2016 15:05:06 allhosts.q@c3     1
1546326 0.55500 macrocycle mseierst r     06/02/2016 15:06:38 allhosts.q@c8     1
1546327 0.55500 macrocycle mseierst r     06/02/2016 15:06:44 allhosts.q@c3     1
1546328 0.55500 macrocycle mseierst r     06/02/2016 15:06:47 allhosts.q@c8     1
1546331 0.55500 macrocycle mseierst r     06/02/2016 15:07:43 allhosts.q@c3     1
1546332 0.55500 macrocycle mseierst r     06/02/2016 15:07:43 allhosts.q@c3     1
1546333 0.55500 macrocycle mseierst r     06/02/2016 15:07:49 allhosts.q@c3     1
1546335 0.55500 macrocycle mseierst r     06/02/2016 15:07:55 allhosts.q@c8     1
1546336 0.55500 macrocycle mseierst r     06/02/2016 15:08:11 allhosts.q@c8     1
1546338 0.55500 macrocycle mseierst r     06/02/2016 15:10:07 allhosts.q@c6     1
1546340 0.55500 macrocycle mseierst r     06/02/2016 15:10:34 allhosts.q@c3     1
1546341 0.55500 macrocycle mseierst r     06/02/2016 15:10:34 allhosts.q@c8     1
1546343 0.55500 macrocycle mseierst r     06/02/2016 15:11:13 allhosts.q@c5     1
1546344 0.55500 macrocycle mseierst r     06/02/2016 15:11:19 allhosts.q@c8     1
1546346 0.55500 macrocycle mseierst r     06/02/2016 15:12:23 allhosts.q@c8     1
1546348 0.55500 macrocycle mseierst r     06/02/2016 15:12:43 allhosts.q@c7     1
1546349 0.55500 macrocycle mseierst r     06/02/2016 15:12:46 allhosts.q@c5     1
1546350 0.55500 macrocycle mseierst r     06/02/2016 15:12:46 allhosts.q@c6     1
1546351 0.55500 macrocycle mseierst r     06/02/2016 15:12:46 allhosts.q@c6     1
1546353 0.55500 macrocycle mseierst r     06/02/2016 15:12:52 allhosts.q@c8     1
1546354 0.55500 macrocycle mseierst r     06/02/2016 15:12:58 allhosts.q@c5     1
1546355 0.55500 macrocycle mseierst r     06/02/2016 15:13:01 allhosts.q@c8     1
1546357 0.55500 macrocycle mseierst r     06/02/2016 15:13:10 allhosts.q@c5     1
1546358 0.55500 macrocycle mseierst r     06/02/2016 15:13:22 allhosts.q@c3     1
1546359 0.55500 macrocycle mseierst r     06/02/2016 15:13:22 allhosts.q@c3     1
1546360 0.55500 macrocycle mseierst r     06/02/2016 15:13:22 allhosts.q@c8     1
1546361 0.55500 macrocycle mseierst r     06/02/2016 15:13:22 allhosts.q@c6     1
1546362 0.55500 macrocycle mseierst r     06/02/2016 15:13:28 allhosts.q@c5     1
1546363 0.55500 macrocycle mseierst r     06/02/2016 15:13:45 allhosts.q@c6     1
1546364 0.55500 macrocycle mseierst r     06/02/2016 15:13:55 allhosts.q@c8     1
1546365 0.55500 macrocycle mseierst r     06/02/2016 15:13:58 allhosts.q@c6     1
1546366 0.55500 macrocycle mseierst r     06/02/2016 15:14:04 allhosts.q@c3     1
1546367 0.55500 macrocycle mseierst r     06/02/2016 15:14:04 allhosts.q@c6     1
1546368 0.55500 macrocycle mseierst r     06/02/2016 15:14:55 allhosts.q@c8     1
1546369 0.55500 macrocycle mseierst r     06/02/2016 15:15:02 allhosts.q@c6     1
1546370 0.55500 macrocycle mseierst r     06/02/2016 15:15:05 allhosts.q@c3     1
1546371 0.55500 macrocycle mseierst r     06/02/2016 15:15:13 allhosts.q@c5     1
1546372 0.55500 macrocycle mseierst r     06/02/2016 15:15:13 allhosts.q@c5     1
1546373 0.55500 macrocycle mseierst r     06/02/2016 15:15:13 allhosts.q@c5     1
1546374 0.55500 macrocycle mseierst r     06/02/2016 15:15:13 allhosts.q@c8     1
1546375 0.55500 macrocycle mseierst r     06/02/2016 15:15:16 allhosts.q@c5     1
1546376 0.55500 macrocycle mseierst r     06/02/2016 15:15:36 allhosts.q@c6     1
1546377 0.55500 macrocycle mseierst r     06/02/2016 15:17:41 allhosts.q@c5     1  <--- running

Messages on c1:

06/02/2016 15:19:11| main|c1|E|can't create directory "active_jobs/1546377.1": No such file or directory
06/02/2016 15:19:11| main|c1|E|can't start job "1546377": can't create directory active_jobs/1546377.1: No such file or directory
06/02/2016 15:19:20| main|c1|E|received task belongs to job 1546377 but that job is not here
06/02/2016 15:19:20| main|c1|E|acknowledge for unknown job 1546377.1/master
06/02/2016 15:19:20| main|c1|E|can't find active jobs directory "active_jobs/1546377.1" for reaping job 1546377
06/02/2016 15:19:20| main|c1|E|unlink(jobs/00/0154/6377.1) failed: No such file or directory
06/02/2016 15:19:20| main|c1|E|can not remove file job spool file: jobs/00/0154/6377.1
06/02/2016 15:19:20| main|c1|E|can not remove file task spool file: No such file or directory
06/02/2016 15:19:20| main|c1|E|can not remove file task spool file: No such file or directory
06/02/2016 15:19:20| main|c1|E|can't remove directory "active_jobs/1546377.1":
opendir(active_jobs/1546377.1) failed: No such file or directory

The directory is there on c1 (and has 777 permissions):

[root@c1 active_jobs]# pwd
/opt/sge/default/spool/c1/active_jobs
[root@c1 active_jobs]#
[root@c1 c1]# ls -l
total 5980
drwxrwxrwx 32000 sgeadmin sgeadmin  999424 May 30 04:54 active_jobs   <--- 777
-rw-r--r--     1 sgeadmin sgeadmin       5 Jun  1 21:21 execd.pid
drwxr-xr-x     2 sgeadmin sgeadmin    4096 May 31 14:03 jobs
drwxr-xr-x     2 sgeadmin sgeadmin    4096 May 30 05:07 job_scripts
-rw-r--r--     1 sgeadmin sgeadmin 5095417 Jun  2 15:19 messages
[root@c1 c1]#

When I stop the execd service on c1, this happens (these are the last jobs that finished on the qinstance before it stopped working, so SGE was able to write to this directory with no problems before):

[root@c1 c1]# service sgeexecd.LJ_SGE_ClusterA stop
Shutting down Grid Engine execution daemon
Shutting down Grid Engine shepherd of job 1197405.1
Shutting down Grid Engine shepherd of job 1197432.1
Shutting down Grid Engine shepherd of job 1197433.1
Shutting down Grid Engine shepherd of job 1197434.1

So I try to clear the qinstance, or the queue itself:

[root@rndusljpp2 ~]# qmod -c allhosts.q
Queue instance "allhosts.q@c6" is already in the specified state: no error
r...@rndusljpp2.na.jnj.com changed state of "allhosts.q@c1" (no error)
Queue instance "allhosts.q@c2" is already in the specified state: no error
Queue instance "allhosts.q@c4" is already in the specified state: no error
Queue instance "allhosts.q@c8" is already in the specified state: no error
Queue instance "allhosts.q@c5" is already in the specified state: no error
Queue instance "allhosts.q@c7" is already in the specified state: no error
Queue instance "allhosts.q@c3" is already in the specified state: no error

...and the E state comes right back after a few seconds, so I am forced to disable the qinstance.

Any help would be appreciated.
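One thing that jumps out at me in the ls output above is the 32000 link count on active_jobs. On ext3, 32000 is the hard maximum for a directory's link count (2 plus 31998 subdirectories), so if the spool is on ext3, anything trying to mkdir in there would fail once old job directories stop getting cleaned up. Here is a quick check I could run on c1 (just a sketch; count_subdirs and check_dir_limit are helper names I made up):

```shell
# Count the immediate subdirectories of a directory.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l
}

# Report the subdirectory count and link count of a directory,
# and warn if it has hit the ext3 per-directory limit of 32000 links.
check_dir_limit() {
    dir="$1"
    links=$(stat -c %h "$dir")
    echo "$dir: $(count_subdirs "$dir") subdirs, link count $links"
    if [ "$links" -ge 32000 ]; then
        echo "WARNING: $dir is at the ext3 subdirectory limit; mkdir will fail"
    fi
}
```

I would run check_dir_limit /opt/sge/default/spool/c1/active_jobs (path from my output above), and probably also df -i on the spool filesystem in case it has run out of inodes instead.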
_______________________________________________ users mailing list users@gridengine.org https://gridengine.org/mailman/listinfo/users