On Mon, 15 Jun 2015 17:27:47 +0000
Daniele Gregori <[email protected]> wrote:
 
> [root@hactar ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E

Well, the above reveals the proximate cause of your problem: your queues are
all in an error state (the E in the states column).  This usually happens when
something goes wrong as a job starts and Grid Engine decides that the cause is
related to the node rather than the job.
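
On a larger cluster it can help to pull out just the queue instances that are
in the error state.  A minimal sketch, assuming the six-column qstat -f layout
quoted above (error_queues is a made-up helper name, not part of Grid Engine):

```shell
# Print the names of queue instances whose states column contains "E".
# Reads `qstat -f` output on stdin.  Assumes the layout quoted above:
#   queuename qtype resv/used/tot. load_avg arch states
# Lines without a states field (healthy queues) have only five fields.
error_queues() {
    awk 'NF == 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Typical use on a live cluster:
#   qstat -f | error_queues
```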

If you run qstat -qs E -explain E it will probably point at the job that
triggered the problem.  A clue to what happened may also appear in the output
of that job, or in the execd messages file of the affected node.
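
Where the execd messages file lives depends on your execd_spool_dir setting.
A rough sketch, assuming the default layout of
$SGE_ROOT/$SGE_CELL/spool/<host>/messages (check the real location with
qconf -sconf; execd_messages_path is a hypothetical helper name):

```shell
# Print the expected path of a node's execd messages file.
# ASSUMPTION: execd_spool_dir is the default $SGE_ROOT/$SGE_CELL/spool.
# Sites that use a local spool directory will have it somewhere else.
execd_messages_path() {
    printf '%s/%s/spool/%s/messages\n' \
        "${SGE_ROOT:?SGE_ROOT is not set}" "${SGE_CELL:-default}" "$1"
}

# Typical use, run on (or with access to the spool of) the affected node:
#   tail -n 50 "$(execd_messages_path compute-1-1)"
```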

If that doesn't tell you what the problem is, you can enable KEEP_ACTIVE in the
execd_params of the sge config.  It will retain the job's active directory
after the job exits.  The next time a job puts a queue into an error state, you
can examine the additional log files left in the active directory.  As the man
page says, this is a debug option, so turn it off again when you've finished
diagnosing and fixing the problem.
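
For reference, the change might look roughly like this.  It is a sketch rather
than a recipe: the exact execd_params line depends on what your site already
has set there, and show_execd_params is a made-up helper name.

```shell
# Sketch only: enable KEEP_ACTIVE while debugging, then remove it again.
#
# 1. Open the global configuration in $EDITOR:
#        qconf -mconf
# 2. Add KEEP_ACTIVE=true to the execd_params line, keeping any
#    parameters that are already there, for example:
#        execd_params   KEEP_ACTIVE=true
# 3. After the next failure, look on the affected node under
#        <execd_spool_dir>/<host>/active_jobs/<job_id>.<task_id>/

# Print the current execd_params line to confirm what is set.
show_execd_params() {
    qconf -sconf | grep '^execd_params'
}
```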

You can clear the error state with qmod -cq <queue name>, but if you haven't
identified and fixed the root cause it will likely recur.
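
With fourteen queue instances to clear, it may be easier to generate the
commands and review them before running anything.  A small sketch
(clear_commands is a hypothetical helper; it only prints, it does not call
qmod itself):

```shell
# Read queue instance names on stdin, one per line, and print the
# matching `qmod -cq` command for each.  Nothing is executed here, so
# the list can be reviewed first.
clear_commands() {
    while IFS= read -r q; do
        if [ -n "$q" ]; then
            printf 'qmod -cq %s\n' "$q"
        fi
    done
}

# Typical use:
#   printf '%s\n' all.q@compute-1-1 all.q@compute-1-2 | clear_commands
# and, once you are happy with the list, pipe the result to sh.
```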

 
-- 
William Hay <[email protected]>


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
