On Mon, 15 Jun 2015 17:27:47 +0000 Daniele Gregori <[email protected]> wrote:
> [root@hactar ~]# qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> all.q@compute-1-1              BIP   0/0/24         0.18     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-10             BIP   0/0/24         0.13     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-11             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-12             BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-13             BIP   0/0/24         0.03     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-14             BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-2              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-3              BIP   0/0/24         0.10     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-4              BIP   0/0/24         0.16     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-5              BIP   0/0/24         0.12     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-6              BIP   0/0/24         0.07     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-7              BIP   0/0/24         0.05     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-8              BIP   0/0/24         0.04     linux-x64     E
> ---------------------------------------------------------------------------------
> all.q@compute-1-9              BIP   0/0/24         0.09     linux-x64     E

Well the above reveals the proximate cause of your problem. Your queues are all in an error state. This usually happens when something goes wrong when a job starts and grid engine decides that the cause is related to the node rather than the job.
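
For what it's worth, the queue instances in the error state can be picked out of qstat -f output with a small filter. This is only a sketch: the function name error_queues is made up for illustration, and it assumes the six-column layout shown above, where the state is the sixth field and is simply absent for a healthy queue.

```shell
# Hypothetical helper (not part of Grid Engine): read `qstat -f` output on
# stdin and print the queue instances whose states column contains E.
# Assumes queue lines look like: all.q@host BIP 0/0/24 0.18 linux-x64 E
error_queues() {
    # Queue lines are the ones whose first field contains '@'; a healthy
    # queue has only five fields, so require a sixth before testing it.
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Usage on a live cluster (not run here): qstat -f | error_queues
```

Matching on the state field rather than grepping for a bare E avoids false hits on host or queue names that happen to contain that letter.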
If you run qstat -qs E -explain E it will probably point at the job that triggered the problem. A clue to what happened may appear in the output of that job or in the execd messages file of the node with the problem.

If that doesn't tell you what the problem is, you can enable KEEP_ACTIVE in the execd_params of the sge config; it will retain the job's active directory after the job exits. Next time a job triggers a queue into an error state you can examine the additional logfiles left in the active directory. As the man page says, this is a debug option, so turn it off again when you've finished diagnosing/fixing.

You can clear the error state with qmod -cq <queue name>, but if you haven't identified and fixed the root of the problem it will likely recur.

--
William Hay <[email protected]>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
