All,

How do I clear a node (E)rror status?  One of our users ran a job over the 
weekend that put 34 of our nodes in (E)rror status, effectively shutting down 
our cluster.  The error status looks like this:


$ qstat -f
[email protected]   BIP   0/2/20         2.00     linux-x64     E
... for 34 nodes...
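
I believe qstat can also filter queue instances by state, which would make 
finding the affected ones easier, but I have not verified this flag:

$ qstat -f -qs E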


"qstat -explain E" gives me this, with the same Job ID for all 34 nodes...

[email protected]   BIP   0/2/20         2.00     linux-x64     E
        queue standard.q marked QERROR as result of job 193498's failure at host compute-0-8.local
        queue standard.q marked QERROR as result of job 193498's failure at host compute-0-8.local


I cannot just delete the job, because it no longer exists...

$ qstat -j 193498
Following jobs do not exist:
193498
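
I assume the accounting file still has a record of it; I believe something 
like this would show the finished job, but I have not tried it yet:

$ qacct -j 193498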


Our records do, however, show who ran this job and when...


193498 0.54184 analyze-an xxxxx        r     04/29/2017 13:13:31 [email protected]      8 2


Node 19, oddly, is not one of the nodes with an (E)rror status.  This also 
appears to be the first time any of our users has submitted an array job to 
our cluster.  (I am going by the "2" in the ja-task-ID field; a quick check 
shows no values in that field for any of our previously submitted jobs.)
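
If I understand array jobs correctly, he would have submitted it with 
qsub's -t option, something along these lines (the script name here is 
just a placeholder on my part):

$ qsub -t 1-2 analyze-something.sh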


Pre-existing jobs appear to still be running, but new jobs sit in "qw" 
indefinitely.  Before this was discovered, the user complained that it was 
taking his jobs 30 minutes to go from "qw" to "r" even though nodes were 
available.  He had also complained about a disk quota problem, so that may 
or may not be related.  (Although we have had other users run into disk 
quotas without this happening.)
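
I can check one of the stuck jobs with qstat -j to see whether the 
scheduler reports a reason, although as far as I know the "scheduling 
info" section only gives details if schedd_job_info is enabled in the 
scheduler configuration:

$ qstat -j <id of a stuck job>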


Can someone tell me how to recover from this situation?
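
My guess from the qmod man page is that I need its clear-error option, 
run against each affected queue instance, something like the following 
(or with a wildcard for all of them), but I would appreciate confirmation 
before I clear 34 queues blindly:

$ qmod -cq standard.q@<node>
$ qmod -cq 'standard.q@*'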


--Chet Langin, SIU


