Assuming the cause of the error has been corrected, you can use "qmod -cq" to clear the error state:
  qmod -cq standard.q@compute-0-8.local

Or, if the error ended up on multiple queue instances on the same exec
host:

  qmod -cq '*'@compute-0-8.local

(A sketch for clearing every affected queue instance in one pass follows
below the quoted message.)

On Mon, May 01, 2017 at 04:27:53PM +0000, Chester Langin wrote:
> All,
>
> How do I clear a node (E)rror status? One of our users ran a job over
> the weekend that put 34 of our nodes in (E)rror status, effectively
> shutting down our cluster. The error status looks like this:
>
> $ qstat -f
> standard.q@compute-0-8.local   BIP   0/2/20   2.00   linux-x64   E
> ... for 34 nodes...
>
> "qstat -explain E" gives me this, with the same Job ID for all 34
> nodes...
>
> standard.q@compute-0-8.local   BIP   0/2/20   2.00   linux-x64   E
>       queue standard.q marked QERROR as result of job 193498's failure at
>       host compute-0-8.local
>       queue standard.q marked QERROR as result of job 193498's failure at
>       host compute-0-8.local
>
> I cannot just delete the job, because this job no longer exists...
>
> $ qstat -j 193498
> Following jobs do not exist:
> 193498
>
> Our records do however show who ran this job and when...
>
> 193498 0.54184 analyze-an xxxxx  r  04/29/2017 13:13:31 standard.q@compute-0-19.local  8 2
>
> Node 19, oddly, is not one of the nodes with an (E)rror. This is also
> the first time that any of our users has apparently tried submitting an
> array job to our cluster. (Going by the "2" in the ja-task-ID field. I
> just did a quick check and I do not see any values in this field in any
> of our previously submitted jobs.)
>
> It appears that pre-existing jobs are still running. New jobs are put
> into "qw" indefinitely. Before this was discovered, the user complained
> that it was taking his jobs 30 minutes to go from "qw" to "r" even
> though nodes were available. He also had complained about a disk quota
> problem, so that may, or may not be related. (Although we have had
> other users run into disk quotas without this happening.)
>
> Can someone tell me how to recover from this situation?
>
> --Chet Langin, SIU

--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
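With the same flag sitting on 34 queue instances, clearing them host by
host is tedious. A minimal shell sketch, assuming a POSIX shell, the
standard SGE client tools on a submit host, and that accounting is
enabled (the job ID 193498 is the one from the thread above): it first
checks why the job failed via qacct, then clears every queue instance
still showing the (E)rror state.

  # Why did job 193498 fail? The accounting record usually survives even
  # after "qstat -j 193498" no longer knows about the job.
  qacct -j 193498 | grep -E 'hostname|failed|exit_status'

  # List every queue instance currently in the (E)rror state, then clear
  # the flag on each one with "qmod -cq".
  qstat -f -qs E | awk '$1 ~ /@/ {print $1}' | while read -r qi; do
      qmod -cq "$qi"
  done

If the wildcard form shown above works in your version, "qmod -cq '*'"
should also clear everything in a single command.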
