That's been our experience too, with the second most common cause being a segfault in the user's code.
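
For what it's worth, here is a rough sketch of how one might narrow this down. It pulls the offending job ID and hosts out of the "marked QERROR" explanation lines (the format quoted further down in this thread, as printed by something like `qstat -f -explain E`) and checks whether a filesystem is nearly full. The function names, the 5% threshold, and the `/tmp` path are my own assumptions for illustration, not anything Grid Engine provides.

```python
import re
import shutil

# Matches explanation lines of the form quoted in this thread, e.g.
#   queue test.q marked QERROR as result of job 8169748's failure at host host1
QERROR_RE = re.compile(
    r"queue\s+(?P<queue>\S+)\s+marked\s+QERROR\s+as\s+result\s+of\s+"
    r"job\s+(?P<job>\d+)'s\s+failure\s+at\s+host\s+(?P<host>\S+)"
)


def find_qerror_jobs(qstat_text):
    """Return (queue, job_id, host) tuples for each QERROR explanation found."""
    # Collapse all whitespace so an explanation wrapped over two lines still matches.
    flat = " ".join(qstat_text.split())
    return [(m["queue"], int(m["job"]), m["host"]) for m in QERROR_RE.finditer(flat)]


def disk_nearly_full(path, min_free_fraction=0.05):
    """Return True if the filesystem holding `path` has less than the given fraction free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report zero size; treat them as not full.
        return False
    return usage.free / usage.total < min_free_fraction


if __name__ == "__main__":
    # Sample text copied from the original report (unquoted).
    sample = (
        "test.q@host1 BIP 7/40 10.86 lx24-amd64 E\n"
        "    queue test.q marked QERROR as result of job 8169748's failure\n"
        "    at host host1\n"
    )
    for queue, job_id, host in find_qerror_jobs(sample):
        print(f"{queue}: job {job_id} failed on {host}")

    # Hypothetical path; point this at the execution host's spool or tmp directory.
    print("nearly full:", disk_nearly_full("/tmp"))
```

If every queue instance names the same job, as in the report below, that job is the place to start looking. The disk check has to run on the execution host itself.
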
You can figure out for sure by looking at the exec daemon's messages file.

On Mon, May 18, 2015 at 02:52:15PM +0200, Nicolás Serrano Martínez-Santos wrote:
> It can be caused by multiple issues. The most common cause in my department is
> that HDD of the execution host is full, so Grid Engine put the host in error
> to prevent more errors.
>
> NiCo
>
> Excerpts from sudha.penmetsa's message of 2015-05-18 14:45:48 +0200:
> > Hi Gavin,
> >
> > I clear the error state using qmod -c "*".
> >
> > Wanted to know the root cause and the solution to fix the issue permanently.
> >
> > Regards,
> > Sudha
> >
> > -----Original Message-----
> > From: Gavin W. Burris [mailto:[email protected]]
> > Sent: Monday, May 18, 2015 6:08 PM
> > To: Sudha Padmini Penmetsa (WT01 - Global Media & Telecom)
> > Cc: [email protected]
> > Subject: Re: [gridengine users] Grid queue goes into an error state due to
> > one job
> >
> > Hello, Sudha.
> >
> > Give this a try: qmod -c "*"
> >
> > Cheers.
> >
> >
> > On 10:51AM Mon 05/18/15 +0000, [email protected] wrote:
> > > Hi,
> > >
> > > We have few hosts added to a queue. Due to one single job submitted to
> > > the queue the whole queue goes into error state.
> > >
> > > As a result, no new jobs can be submitted to the queue unless we clear
> > > the error state.
> > >
> > > Can anyone please let me know what could be the reason for this and how
> > > to fix it permanently.
> > >
> > > Ex
> > >
> > > test.q@host1 BIP 7/40 10.86 lx24-amd64 E
> > >     queue test.q marked QERROR as result of job 8169748's failure at host host1
> > > ----------------------------------------------------------------------------
> > > test.q@host2 BIP 7/40 10.74 lx24-amd64 E
> > >     queue test.q marked QERROR as result of job 8169748's failure at host host2
> > > ----------------------------------------------------------------------------
> > > test.q@host3 BIP 10/40 10.73 lx24-amd64 E
> > >     queue test.q marked QERROR as result of job 8169748's failure at host host3
> > > ----------------------------------------------------------------------------
> > > test.q@host4 BIP 8/40 11.28 lx24-amd64 E
> > >     queue test.q marked QERROR as result of job 8169748's failure at host host4
> > > ----------------------------------------------------------------------------
> > > test.q@host5 BIP 7/40 11.52 lx24-amd64 E
> > >     queue test.q marked QERROR as result of job 8169748's failure at host host5
> > > ----------------------------------------------------------------------------
> > > test.q@host6 BIP 8/40 10.41 lx24-amd64 E
> > >     queue test.q marked QERROR as result of job 8169748's failure at host host6
> > >
> > > Regards,
> > > Sudha
> > > The information contained in this electronic message and any
> > > attachments to this message are intended for the exclusive use of the
> > > addressee(s) and may contain proprietary, confidential or privileged
> > > information. If you are not the intended recipient, you should not
> > > disseminate, distribute or copy this e-mail. Please notify the sender
> > > immediately and destroy all copies of this message and any
> > > attachments. WARNING: Computer viruses can be transmitted via email.
> > > The recipient should check this email and any attachments for the
> > > presence of viruses. The company accepts no liability for any damage
> > > caused by any virus transmitted by this email.
> > > www.wipro.com
> >
> > > _______________________________________________
> > > users mailing list
> > > [email protected]
> > > https://gridengine.org/mailman/listinfo/users
> >
> >
> > --
> > Gavin W. Burris
> > Senior Project Leader for Research Computing
> > The Wharton School
> > University of Pennsylvania

--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206)-685-7354
-- University of Washington School of Medicine
