All,

How do I clear a node (E)rror status? One of our users ran a job over the weekend that put 34 of our nodes into (E)rror status, effectively shutting down our cluster. The error status looks like this:

$ qstat -f
[email protected]  BIP  0/2/20  2.00  linux-x64  E

... for 34 nodes.

"qstat -explain E" gives me this, with the same job ID for all 34 nodes:

[email protected]  BIP  0/2/20  2.00  linux-x64  E
        queue standard.q marked QERROR as result of job 193498's failure at host compute-0-8.local
        queue standard.q marked QERROR as result of job 193498's failure at host compute-0-8.local

I cannot simply delete the job, because it no longer exists:

$ qstat -j 193498
Following jobs do not exist:
193498

Our records do, however, show who ran this job and when:

193498 0.54184 analyze-an xxxxx   r   04/29/2017 13:13:31  [email protected]  8 2

Node 19, oddly, is not one of the nodes with an (E)rror. This also appears to be the first time any of our users has tried submitting an array job to our cluster, going by the "2" in the ja-task-ID field; a quick check shows no values in that field for any of our previously submitted jobs.

Pre-existing jobs appear to still be running, but new jobs sit in "qw" indefinitely. Before this was discovered, the user complained that it was taking his jobs 30 minutes to go from "qw" to "r" even though nodes were available. He had also complained about a disk quota problem, which may or may not be related (although other users have run into disk quotas without this happening).

Can someone tell me how to recover from this situation?

--Chet Langin, SIU
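
P.S. For what it is worth, the qmod(1) man page seems to have an option for clearing a queue error state. My guess is that something like the line below (using compute-0-8 only as an example host from the error message) would clear one of the affected queue instances, and presumably 'standard.q@*' would hit all of them, but I have not tried it and would appreciate confirmation before running it against 34 nodes:

$ qmod -cq standard.q@compute-0-8.local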
