Hi William, Alex,

thank you very much for your replies and suggestions in Vol. 62, Issue 2.

I took up the point of deleting potentially offending jobs (actually, as most jobs were gone anyway, the entire job spool) from the internal job list in SGE, and so far (since Monday morning) the SGE cluster has been stable again.
I will continue to observe and follow up if more incidents come up.

Regards,
Manfred

Message: 2
Date: Fri, 5 Feb 2016 15:52:47 +0000
From: William Hay <[email protected]>
To: <[email protected]>
Subject: Re: [gridengine users] sgemaster crash
Message-ID: <[email protected]>
Content-Type: text/plain; charset="us-ascii"

On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
> Hi,
>
> this week I have observed the (6.2u5) sgemaster crashing several times on
> one of our sites.
>
> The last message in the "messages" file was always like this:
>
> 02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!
>
> Automatic migration to the alternate master hosts (as define in the shadow
> host list) also failed, with the new sge_qmaster also crashing (after one
> minute or less).
>
> Only after several attempts I was able to start the master again, but not
> without having some queues damaged (jobs being lost).
>
> This has never happened before since I took over the SGE admin role in our
> company more than four years ago, and the messages file does not provide
> an obvious reason. Sometimes I see a line like this before crashing:
>
> 02/05/2016 14:37:12| main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...

Is the jobid consistent? The most common cause of qmaster crashes in
my experience is a corrupted job spool. Normal procedure is to stop
the qmaster and manually delete the job from the spool (traditional spool)
before restarting.

> If anybody has a good idea what I could look into, I'd appreciate this a
> lot.
>
> Is there an efficient way to trace (strace?) the master process?

You could enable the built in debugging (man sge_dl).
William

------------------------------

Message: 3
Date: Fri, 5 Feb 2016 13:41:38 -0800
From: Alex Chekholko <[email protected]>
To: [email protected]
Subject: Re: [gridengine users] sgemaster crash
Message-ID: <[email protected]>
Content-Type: text/plain; charset=windows-1252; format=flowed

IME you are hitting some kind of rare bug. Last time we had a thing like
this it was because a user was specifying many hundreds of jobids in the
hold_jid parameter. Before that, it had something to do with parallel
jobs not cleaning up quite right, and IIRC disabling the scheduling
reporting parameters fixed it.

In each case, the "easiest" way is to delete your job spool and restart
your qmaster and then monitor closely to try to figure out which user's
jobs it is that makes it crash. And then get the user to modify their
job parameters till your qmaster doesn't crash anymore :)

On 02/05/2016 07:52 AM, William Hay wrote:
> On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
>> Hi,
>>
>> this week I have observed the (6.2u5) sgemaster crashing several times on
>> one of our sites.
>>
>> The last message in the "messages" file was always like this:
>>
>> 02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!
>>
>> Automatic migration to the alternate master hosts (as define in the shadow
>> host list) also failed, with the new sge_qmaster also crashing (after one
>> minute or less).
>>
>> Only after several attempts I was able to start the master again, but not
>> without having some queues damaged (jobs being lost).
>>
>> This has never happened before since I took over the SGE admin role in our
>> company more than four years ago, and the messages file does not provide
>> an obvious reason. Sometimes I see a line like this before crashing:
>>
>> 02/05/2016 14:37:12| main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...
> Is the jobid consistent?
> The most common cause of qmaster crashes in
> my experience is a corrupted job spool. Normal procedure is to stop
> the qmaster and manually delete the job from the spool (traditional spool)
> before restarting.
>
>> If anybody has a good idea what I could look into, I'd appreciate this a
>> lot.
>>
>> Is there an efficient way to trace (strace?) the master process?
> You could enable the built in debugging (man sge_dl).
>
> William
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>

--
Alex Chekholko [email protected] 347-401-4860
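
William's suggestion above (stop the qmaster, then delete the offending job from a traditional spool by hand) can be prepared with a small dry-run helper. The following is only a minimal Python sketch, not an SGE tool. It assumes classic (flat-file) spooling, a qmaster spool directory such as $SGE_ROOT/$SGE_CELL/spool/qmaster, and that jobs are spooled under jobs/ in a path built from the zero-padded 10-digit job ID split 2/4/4, so that job 5335536 would map to jobs/00/0533/5536. That layout and both function names are assumptions for illustration and should be checked against your own spool directory. The code only reports a candidate path and deletes nothing.

```python
from pathlib import Path


def classic_spool_job_path(spool_dir, job_id):
    """Return the path where job_id is ASSUMED to be spooled.

    Assumption (verify locally): classic spooling stores a job under
    <spool_dir>/jobs/AA/BBBB/CCCC, where AABBBBCCCC is the job ID
    zero-padded to 10 digits.
    """
    if not 0 < job_id < 10**10:
        raise ValueError("job id must be between 1 and 9999999999")
    digits = f"{job_id:010d}"  # e.g. 5335536 -> "0005335536"
    return Path(spool_dir) / "jobs" / digits[:2] / digits[2:6] / digits[6:]


def find_spooled_job(spool_dir, job_id):
    """Dry run: return the candidate path if it exists, else None."""
    path = classic_spool_job_path(spool_dir, job_id)
    return path if path.exists() else None


if __name__ == "__main__":
    # Hypothetical spool directory; replace with your qmaster spool dir.
    spool = "/opt/sge/default/spool/qmaster"
    hit = find_spooled_job(spool, 5335536)
    print(hit if hit else "no spool entry found at the assumed path")
```

If the reported path exists, it would be removed manually and only while sge_qmaster is stopped, as William describes.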
