You need to find out what the job does to cause the node to go into error state. A queue instance is put into error state when Grid Engine tries to run a job and the attempt fails for a reason that Grid Engine does not consider specific to that job. The queue instance is put into error state as a precaution (to prevent further jobs being scheduled onto a faulty node).

There's no real way to prevent this from happening: if Grid Engine can't definitely decide that a failure was a job failure (in which case only the job goes into error state), it puts the queue instance into error state instead. You don't really want it to keep scheduling jobs onto actually faulty nodes, either!

You can use qstat -f -explain E to find out why queue instances are in an error state.
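
If many queue instances are affected, it can help to boil that output down to the job IDs responsible. The following is only a minimal sketch: the function name summarise_qerror_jobs is made up for illustration, and it assumes the explanation lines have the same wording as the ones quoted further down in this thread ("marked QERROR as result of job <id>'s failure"). Other Grid Engine versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): reads the output of
# "qstat -f -explain E" on stdin and prints one line per offending job:
#   <job id> <number of queue instances it put into error state>
# Assumes explanation lines of the form seen in this thread, e.g.
#   queue test.q marked QERROR as result of job 8169748's failure at host host1
summarise_qerror_jobs() {
    # Keep only the job ID from each matching explanation line,
    # then count how often each ID occurs.
    sed -n "s/.*marked QERROR as result of job \([0-9][0-9]*\)'s failure.*/\1/p" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It would be used as qstat -f -explain E | summarise_qerror_jobs. Once you have the job ID, the job's own output files and the execd messages file on the affected hosts are the places to look for the actual cause before clearing the error state.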

Tina

On 18/05/15 13:38, Gavin W. Burris wrote:
Hello, Sudha.

Give this a try:  qmod -c "*"

Cheers.


On 10:51AM Mon 05/18/15 +0000, [email protected] wrote:
Hi,

We have a few hosts added to a queue. Because of one single job submitted to
the queue, the whole queue goes into error state.

As a result, no new jobs can be submitted to the queue unless we clear the
error state.

Can anyone please let me know what the reason for this could be and how to
fix it permanently?

Example:

test.q@host1              BIP   7/40      10.86    lx24-amd64    E
         queue test.q marked QERROR as result of job 8169748's failure at host host1
----------------------------------------------------------------------------
test.q@host2              BIP   7/40      10.74    lx24-amd64    E
         queue test.q marked QERROR as result of job 8169748's failure at host host2
----------------------------------------------------------------------------
test.q@host3              BIP   10/40     10.73    lx24-amd64    E
         queue test.q marked QERROR as result of job 8169748's failure at host host3
----------------------------------------------------------------------------
test.q@host4              BIP   8/40      11.28    lx24-amd64    E
         queue test.q marked QERROR as result of job 8169748's failure at host host4
----------------------------------------------------------------------------
test.q@host5              BIP   7/40      11.52    lx24-amd64    E
         queue test.q marked QERROR as result of job 8169748's failure at host host5
----------------------------------------------------------------------------
test.q@host6              BIP   8/40      10.41    lx24-amd64    E
         queue test.q marked QERROR as result of job 8169748's failure at host host6

Regards,
Sudha

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users




--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
