You need to find out what the job does that causes the node to go into
error state. A queue instance is put into error state when Grid Engine
tries to run a job and the job fails for a reason that Grid Engine
considers not specific to the job. This is a precaution, to prevent
more jobs being scheduled onto a faulty node.
There's no real way to prevent this from happening: if Grid Engine
can't decide for certain that a failure was a job failure (in which case
the job itself goes into error state), it puts the queue instance into
error state instead. You don't really want it to keep scheduling jobs
onto nodes that are actually faulty, either!
You can use qstat with -explain E (for example, qstat -f -explain E) to
find out why queue instances are in an error state.
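If the full listing is long, a small filter helps to pick out just the affected queue instances. This is only a sketch: list_error_queues is a made-up helper name, and it assumes the usual six-column qstat -f layout (queue name, type, slots, load, arch, states), as in the output quoted further down.

```shell
# Print the name of every queue instance whose state column contains "E".
# Reads `qstat -f -explain E` style output on stdin.
# Assumption: the states are the 6th whitespace-separated column.
list_error_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# On a live cluster (not run here):
#   qstat -f -explain E | list_error_queues
```

The reason lines that qstat prints under each queue instance are skipped by the filter, so read those in the unfiltered output once you know which hosts to look at.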
Tina
On 18/05/15 13:38, Gavin W. Burris wrote:
Hello, Sudha.
Give this a try: qmod -c "*"
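To clear only specific queue instances instead of everything matched by "*", something along these lines might do. It is a rough sketch: clear_error_state and the QMOD override are hypothetical names used here so the commands can be previewed before they are run for real.

```shell
# Clear the error state of the named queue instances only.
# QMOD is a hypothetical override; it defaults to the real qmod binary.
clear_error_state() {
    for qi in "$@"; do
        # Left unquoted on purpose so that QMOD="echo qmod" splits into two words.
        ${QMOD:-qmod} -c "$qi"
    done
}

# Preview the commands without touching the cluster:
#   QMOD="echo qmod" clear_error_state test.q@host1 test.q@host2
# Run them for real:
#   clear_error_state test.q@host1 test.q@host2
```

Clearing the state does not fix whatever made the job fail, so the queue instances may go back into error state if the same job runs again.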
Cheers.
On 10:51AM Mon 05/18/15 +0000, [email protected] wrote:
Hi,
We have a few hosts added to a queue. Because of a single job submitted to
the queue, the whole queue goes into error state.
As a result, no new jobs can be submitted to the queue until we clear the
error state.
Can anyone please let me know what the reason for this could be and how to
fix it permanently?
Example:
test.q@host1 BIP 7/40 10.86 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure at host host1
----------------------------------------------------------------------------
test.q@host2 BIP 7/40 10.74 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure at host host2
----------------------------------------------------------------------------
test.q@host3 BIP 10/40 10.73 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure at host host3
----------------------------------------------------------------------------
test.q@host4 BIP 8/40 11.28 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure at host host4
----------------------------------------------------------------------------
test.q@host5 BIP 7/40 11.52 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure at host host5
----------------------------------------------------------------------------
test.q@host6 BIP 8/40 10.41 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure at host host6
Regards,
Sudha
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442