Hi,
On 01.05.2017 at 18:27, Chester Langin wrote:
> All,
>
> How do I clear a node (E)rror status? One of our users ran a job over the
> weekend that put 34 of our nodes in (E)rror status, effectively shutting down
> our cluster. The error status looks like this:
>
> $ qstat -f
> standard.q@compute-0-8.local  BIP  0/2/20  2.00  linux-x64  E
> ... for 34 nodes...

$ qmod -cq standard.q

(One could even limit it to certain hosts, or address all queues on a
particular machine with "*@compute-0-8.local".)

> "qstat -explain E" gives me this, with the same job ID for all 34 nodes...
>
> standard.q@compute-0-8.local  BIP  0/2/20  2.00  linux-x64  E
>     queue standard.q marked QERROR as result of job 193498's failure at
>     host compute-0-8.local
>     queue standard.q marked QERROR as result of job 193498's failure at
>     host compute-0-8.local
>
> I cannot just delete the job, because this job no longer exists...
>
> $ qstat -j 193498
> Following jobs do not exist:
> 193498
>
> Our records do however show who ran this job and when...
>
> 193498 0.54184 analyze-an xxxxx  r  04/29/2017 13:13:31  [email protected]  8 2
>
> Node 19, oddly, is not one of the nodes with an (E)rror. This is also the
> first time that any of our users has apparently tried submitting an array job
> to our cluster. (Going by the "2" in the ja-task-ID field. I just did a
> quick check and I do not see any values in this field in any of our
> previously submitted jobs.)

What does:

$ qacct -j 193498

show? "8 2" means 8 slots for task ID 2. Maybe he submitted a larger range
of tasks.

> It appears that pre-existing jobs are still running. New jobs are put into
> "qw" indefinitely. Before this was discovered, the user complained that it
> was taking his jobs 30 minutes to go from "qw" to "r" even though nodes were
> available. He also had complained about a disk quota problem, so that may,
> or may not be related. (Although we have had other users run into disk
> quotas without this happening.)

What is shown in the messages file of the node?
In a common spool directory it would be in
/usr/sge/default/spool/compute-0-8.local/messages

-- 
Reuti
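[Editor's note: the advice above can be collected into a short shell sequence. This is a sketch, not a transcript from the thread: it assumes an SGE/Son of Grid Engine installation with qmod, qstat, and qacct on the PATH, operator privileges, and a classic spool directory under /usr/sge/default/spool; the queue and host names are the ones quoted in the message.]

```shell
# Clear the (E)rror state on all instances of standard.q
# (requires SGE operator or manager privileges):
qmod -cq standard.q

# ...or limit the clearing to one queue instance or one host:
qmod -cq standard.q@compute-0-8.local   # a single queue instance
qmod -cq '*@compute-0-8.local'          # every queue on that host

# Verify that no queue instance is still in (E)rror state:
qstat -f -explain E

# The job is gone from qstat, but the accounting file still has it.
# Look at exit_status/failed and at how many array tasks actually ran:
qacct -j 193498

# Check the execution daemon's messages file on the failing host
# for the original cause (path assumes a common spool directory):
grep 193498 /usr/sge/default/spool/compute-0-8.local/messages
```

Note that qmod only clears the error flag; if the underlying problem (e.g. a full filesystem or a hit disk quota on the node) is not fixed, the next job scheduled there can put the queue straight back into (E)rror state.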
