Hi,

On 01.05.2017 at 18:27, Chester Langin wrote:

> All,
> 
> How do I clear a node (E)rror status?  One of our users ran a job over the 
> weekend that put 34 of our nodes in (E)rror status, effectively shutting down 
> our cluster.  The error status looks like this:
> 
> $ qstat -f
> [email protected]   BIP   0/2/20         2.00     linux-x64     E
> ... for 34 nodes...

$ qmod -cq standard.q

(One could even limit this to certain hosts, or address all the queues on a
particular machine with something like "*@compute-0-8.local".)


> "qstat -explain E" gives me this, with the same Job ID for all 34 nodes...
> 
> [email protected]   BIP   0/2/20         2.00     linux-x64     E
>         queue standard.q marked QERROR as result of job 193498's failure at 
> host compute-0-8.local
>         queue standard.q marked QERROR as result of job 193498's failure at 
> host compute-0-8.local
> 
> I cannot just delete the job, because this job no longer exists...
> 
> $ qstat -j 193498
> Following jobs do not exist:
> 193498
> 
> Our records do however show who ran this job and when...
> 
> 193498 0.54184 analyze-an xxxxx        r     04/29/2017 13:13:31 
> [email protected]      8 2
> 
> Node 19, oddly, is not one of the nodes with an (E)rror.  This is also the 
> first time that any of our users has apparently tried submitting an array job 
> to our cluster.  (Going by the "2" in the ja-task-ID field.  I just did a 
> quick check and I do not see any values in this field in any of our 
> previously submitted jobs.)

What does:

$ qacct -j 193498

show? "8 2" means 8 slots for task id 2. Maybe he submitted a larger range of 
tasks.
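
For example (assuming the accounting records for this job are still in the
accounting file), filtering for a few fields shows what happened per task:

$ qacct -j 193498 | grep -E 'taskid|hostname|failed|exit_status'

The "failed" and "exit_status" values per task should hint at why the job
failed on those hosts, and hence why the queues were set to error.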


> It appears that pre-existing jobs are still running.  New jobs are put into 
> "qw" indefinitely.  Before this was discovered, the user complained that it 
> was taking his jobs 30 minutes to go from "qw" to "r" even though nodes were 
> available.  He also had complained about a disk quota problem, so that may or
> may not be related.  (Although we have had other users run into disk
> quotas without this happening.)

What is shown in the messages file of the node? With a common spool directory it
would be in /usr/sge/default/spool/compute-0-8.local/messages
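
For example (classic spooling assumed; substitute your own $SGE_ROOT and cell
name if they differ):

$ grep 193498 /usr/sge/default/spool/compute-0-8.local/messages
$ tail -50 /usr/sge/default/spool/compute-0-8.local/messages

The entries around the time of the failure should state why the execd set the
queue to error, e.g. a failing prolog/epilog or a shepherd problem.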

-- Reuti
