Assuming the cause of the error has been corrected, you can use "qmod -cq"
to clear the error state:

qmod -cq standard.q@compute-0-8.local

Or, if the error ended up on multiple queue instances on the same exec host:

qmod -cq '*'@compute-0-8.local
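
If instances on several different hosts are affected, one rough approach
(just a sketch; it assumes the default "qstat -f" column layout and GNU
xargs, so check the list before handing it to qmod) is to pull the queue
instances currently in error state out of "qstat -f -qs E" and feed them
back:

qstat -f -qs E | awk '/@/ {print $1}' | xargs -r qmod -cq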

On Mon, May 01, 2017 at 04:27:53PM +0000, Chester Langin wrote:
> All,
> 
> 
> How do I clear a node (E)rror status?  One of our users ran a job over the 
> weekend that put 34 of our nodes in (E)rror status, effectively shutting down 
> our cluster.  The error status looks like this:
> 
> 
> $ qstat -f
> [email protected]   BIP   0/2/20         2.00     linux-x64     E
> ... for 34 nodes...
> 
> 
> "qstat -explain E" gives me this, with the same Job ID for all 34 nodes...
> 
> [email protected]   BIP   0/2/20         2.00     linux-x64     E
>         queue standard.q marked QERROR as result of job 193498's failure at 
> host compute-0-8.local
>         queue standard.q marked QERROR as result of job 193498's failure at 
> host compute-0-8.local
> 
> 
> I cannot just delete the job, because this job no longer exists...
> 
> $ qstat -j 193498
> Following jobs do not exist:
> 193498
> 
> 
> Our records do however show who ran this job and when...
> 
> 
> 193498 0.54184 analyze-an xxxxx        r     04/29/2017 13:13:31 [email protected]      8 2
> 
> 
> Node 19, oddly, is not one of the nodes with an (E)rror.  This also 
> appears to be the first time that any of our users has submitted an array 
> job to our cluster.  (Going by the "2" in the ja-task-ID field.  I just 
> did a quick check and I do not see any values in this field in any of our 
> previously submitted jobs.)
> 
> 
> It appears that pre-existing jobs are still running.  New jobs are put into 
> "qw" indefinitely.  Before this was discovered, the user complained that it 
> was taking his jobs 30 minutes to go from "qw" to "r" even though nodes were 
> available.  He had also complained about a disk quota problem, so that may 
> or may not be related.  (Although we have had other users run into disk 
> quotas without this happening.)
> 
> 
> Can someone tell me how to recover from this situation?
> 
> 
> --Chet Langin, SIU
> 
> 


-- 
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine