[gridengine users] sgemaster crash

Manfred Selz Tue, 09 Feb 2016 23:37:03 -0800

Hi William, Alex,

thank you very much for your replies and suggestions in Vol. 62, Issue 2.
I followed the suggestion to delete potentially offending jobs from the
internal job list in SGE (in fact, as most jobs were gone anyway, I deleted the
entire job spool), and since Monday morning the SGE cluster has been stable again.


I will continue to observe and follow up if more incidents come up.

Regards,
Manfred


Message: 2
Date: Fri, 5 Feb 2016 15:52:47 +0000
From: William Hay <[email protected]>
To: <[email protected]>
Subject: Re: [gridengine users] sgemaster crash

On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
>    Hi,
>
>
>
>    this week I have observed the (6.2u5) sgemaster crashing several times on
>    one of our sites.
>
>    The last message in the "messages" file was always like this:
>
>
>
>    02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!
>
>
>
>    Automatic migration to the alternate master hosts (as defined in the shadow
>    host list) also failed, with the new sge_qmaster also crashing (after one
>    minute or less).
>
>    Only after several attempts was I able to start the master again, but not
>    without some queues being damaged (jobs were lost).
>
>
>
>    This has never happened before since I took over the SGE admin role in our
>    company more than four years ago, and the messages file does not provide
>    an obvious reason. Sometimes I see a line like this before crashing:
>
>
>
>    02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...
Is the job ID the same each time?  In my experience the most common cause of
qmaster crashes is a corrupted job spool.  The normal procedure is to stop the
qmaster, manually delete the job from the spool (traditional spool), and then
restart.
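
One way to check whether the job ID is consistent is to scan the qmaster
"messages" file for the warning quoted above and count how often each job ID
appears. The following is only a minimal sketch in Python. It assumes that each
log entry sits on a single line in the real file and has the five
pipe-separated fields seen in the two quoted lines (timestamp, thread, host,
level, text). The names parse_line, summarize and SAMPLE are hypothetical and
not part of SGE.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


# Assumption: the wording matches the warning quoted in this thread.
JOB_RE = re.compile(r"no longer existing job (\d+)")


def parse_line(line: str) -> Optional[Entry]:
    """Split one log line into its five fields, or return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return Entry(*(p.strip() for p in parts))


def summarize(lines: Iterable[str]) -> Tuple[Counter, List[Entry]]:
    """Count job IDs named in the warnings and collect entries with level 'C'."""
    job_ids: Counter = Counter()
    critical: List[Entry] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.level == "C":
            critical.append(entry)
        match = JOB_RE.search(entry.text)
        if match:
            job_ids[int(match.group(1))] += 1
    return job_ids, critical


# The two lines quoted in this thread, each joined onto one line.
SAMPLE = [
    "02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...",
    "02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!",
]

job_ids, critical = summarize(SAMPLE)
for job_id, count in job_ids.most_common():
    print(job_id, count)
for entry in critical:
    print(entry.timestamp, entry.thread, entry.text)
```

To run it against a real file, pass an open file object to summarize instead
of SAMPLE. A job ID that shows up shortly before every crash would be the first
candidate to remove from the spool.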


>    If anybody has a good idea what I could look into, I'd appreciate this a
>    lot.
>
>    Is there an efficient way to trace (strace?) the master process?
You could enable the built-in debugging (man sge_dl).

William



------------------------------

Message: 3
Date: Fri, 5 Feb 2016 13:41:38 -0800
From: Alex Chekholko <[email protected]>
To: [email protected]
Subject: Re: [gridengine users] sgemaster crash

IME you are hitting some kind of rare bug.

Last time we had a thing like this it was because a user was specifying many 
hundreds of jobids in the hold_jid parameter.

Before that, it had something to do with parallel jobs not cleaning up quite 
right, and IIRC disabling the scheduling reporting parameters fixed it.

In each case, the "easiest" way is to delete your job spool and restart your 
qmaster and then monitor closely to try to figure out which user's jobs it is 
that makes it crash.  And then get the user to modify their job parameters till 
your qmaster doesn't crash anymore :)



--
Alex Chekholko [email protected] 347-401-4860

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
