On Mon, 18 May 2015 12:45:48 +0000 "[email protected]" <[email protected]> wrote:
> Hi Gavin,
>
> I clear the error state using qmod -c "*".
>
> Wanted to know the root cause and the solution to fix the issue permanently.

For that you'll probably need to investigate to find the underlying issue.
qstat -j on the job, or qstat -explain E when a queue is in an E state, may
provide clues, but you'll probably need to poke through the execd log files
to really find out. You may need to instruct grid engine to retain the job's
spool directory with KEEP_ACTIVE (an execd_params setting) in order to get
all the useful log files.

There are really two possibilities:

1) There is a problem with all the nodes in your cluster that (presumably)
   only affects some jobs, in which case you need to fix that problem. That
   has nothing to do with grid engine per se.

2) There is a problem with the job that grid engine mistakes for a node
   problem. You may be able to detect this issue from a JSV or prolog
   script and arrange for the job to be rejected rather than the node
   taken offline. The details will depend on the nature of the problem.

If the problem is grid engine mistaking the source of an error (node when it
should be job), you could raise a ticket with the maintainers of the version
of grid engine you use.

William

> -----Original Message-----
> From: Gavin W. Burris [mailto:[email protected]]
> Sent: Monday, May 18, 2015 6:08 PM
> To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> Cc: [email protected]
> Subject: Re: [gridengine users] Grid queue goes into an error state due to
> one job
>
> Hello, Sudha.
>
> Give this a try: qmod -c "*"
>
> Cheers.
>
>
> On 10:51AM Mon 05/18/15 +0000, [email protected] wrote:
> > Hi,
> >
> > We have few hosts added to a queue. Due to one single job submitted to the
> > queue the whole queue goes into error state.
> >
> > As a result, no new jobs can be submitted to the queue unless we clear the
> > error state.
> >
> > Can anyone please let me know what could be the reason for this and how to
> > fix it permanently.
> >
> > Ex
> >
> > test.q@host1 BIP 7/40 10.86 lx24-amd64 E
> >     queue test.q marked QERROR as result of job 8169748's failure
> >     at host host1
> > ---------------------------------------------------------------------------
> > test.q@host2 BIP 7/40 10.74 lx24-amd64 E
> >     queue test.q marked QERROR as result of job 8169748's failure
> >     at host host2
> > ----------------------------------------------------------------------------
> > test.q@host3 BIP 10/40 10.73 lx24-amd64 E
> >     queue test.q marked QERROR as result of job 8169748's failure
> >     at host host3
> > ----------------------------------------------------------------------------
> > test.q@host4 BIP 8/40 11.28 lx24-amd64 E
> >     queue test.q marked QERROR as result of job 8169748's failure
> >     at host host4
> > ----------------------------------------------------------------------------
> > test.q@host5 BIP 7/40 11.52 lx24-amd64 E
> >     queue test.q marked QERROR as result of job 8169748's failure
> >     at host host5
> > ----------------------------------------------------------------------------
> > test.q@host6 BIP 8/40 10.41 lx24-amd64 E
> >     queue test.q marked QERROR as result of job 8169748's failure
> >     at host host6
> >
> > Regards,
> > Sudha
> > The information contained in this electronic message and any
> > attachments to this message are intended for the exclusive use of the
> > addressee(s) and may contain proprietary, confidential or privileged
> > information. If you are not the intended recipient, you should not
> > disseminate, distribute or copy this e-mail. Please notify the sender
> > immediately and destroy all copies of this message and any
> > attachments. WARNING: Computer viruses can be transmitted via email.
> > The recipient should check this email and any attachments for the
> > presence of viruses. The company accepts no liability for any damage
> > caused by any virus transmitted by this email.
> > www.wipro.com
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
>
>
> --
> Gavin W. Burris
> Senior Project Leader for Research Computing
> The Wharton School
> University of Pennsylvania

-- 
William Hay <[email protected]>
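
To decide which job to inspect with qstat -j, it can help to pull the job IDs out of the explain text first. What follows is only a minimal Python sketch, not part of grid engine: the function name jobs_behind_queue_errors is hypothetical, and the pattern is taken solely from the "marked QERROR as result of job ..." lines quoted in this thread, so the exact wording on other grid engine versions may differ. It assumes the text comes from something like saved `qstat -f -explain E` output.

```python
import re
from collections import defaultdict

# Pattern modelled on the explain text quoted in this thread (an assumption
# about the output format). Whitespace is matched loosely because the
# message wraps across two lines in the qstat output.
QERROR_RE = re.compile(
    r"queue\s+(?P<queue>\S+)\s+marked\s+QERROR\s+as\s+result\s+of\s+job\s+"
    r"(?P<job>\d+)'s\s+failure\s+at\s+host\s+(?P<host>\S+)"
)


def jobs_behind_queue_errors(qstat_text):
    """Map each blamed job ID to the sorted queue@host instances it put in E state."""
    blamed = defaultdict(set)
    for match in QERROR_RE.finditer(qstat_text):
        instance = f"{match.group('queue')}@{match.group('host')}"
        blamed[match.group("job")].add(instance)
    return {job: sorted(instances) for job, instances in blamed.items()}


# Two entries copied from the excerpt in the original question.
SAMPLE = """\
test.q@host1 BIP 7/40 10.86 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure
    at host host1
---------------------------------------------------------------------------
test.q@host2 BIP 7/40 10.74 lx24-amd64 E
    queue test.q marked QERROR as result of job 8169748's failure
    at host host2
"""

if __name__ == "__main__":
    for job, instances in jobs_behind_queue_errors(SAMPLE).items():
        print(job, "->", ", ".join(instances))
```

If a single job ID is blamed on every host, as in the excerpt, that points towards the second possibility above (a job problem mistaken for a node problem), and that job is the one to look at with qstat -j and in the execd logs.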
