Re: [gridengine users] sgemaster crash

William Hay Fri, 05 Feb 2016 07:55:50 -0800

On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
>    Hi,
>
>    this week I have observed the (6.2u5) sgemaster crashing several times on
>    one of our sites.
> 
>    The last message in the "messages" file was always like this:
>
>    02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other
>    list !!!
>
>    Automatic migration to the alternate master hosts (as defined in the shadow
>    host list) also failed, with the new sge_qmaster also crashing (after one
>    minute or less).
> 
>    Only after several attempts was I able to start the master again, and not
>    without some queues being damaged (jobs were lost).
>
>    This has never happened before since I took over the SGE admin role in our
>    company more than four years ago, and the messages file does not provide
>    an obvious reason. Sometimes I see a line like this before crashing:
>
>    02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer
>    existing job 5335536 of user ...

Is the job ID consistent across crashes?  In my experience, the most common
cause of qmaster crashes is a corrupted job spool.  The normal procedure is to
stop the qmaster, manually delete the job from the spool (traditional spool),
and then restart the qmaster.
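
One way to check whether the same job ID keeps turning up is to count the
"no longer existing job" warnings in the qmaster messages file.  The following
is only a minimal sketch: the regular expression is modelled on the single
warning line quoted above, the function name is made up for illustration, and
the file location (commonly $SGE_ROOT/$SGE_CELL/spool/qmaster/messages) is an
assumption that may differ on your installation.

```python
import re
from collections import Counter

# Pattern modelled on the warning quoted in this thread (an assumption about
# the exact wording; adjust it if your messages file differs).
JOB_WARNING = re.compile(r"removing reference to no longer\s+existing job (\d+)")


def count_missing_job_refs(text):
    """Return a Counter mapping job ID -> number of matching warnings."""
    # Collapse all whitespace so that lines wrapped by a mail client or pager
    # still match the pattern.
    flat = " ".join(text.split())
    return Counter(JOB_WARNING.findall(flat))


# The warning exactly as quoted in the original post.
sample = (
    "02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer\n"
    "existing job 5335536 of user ...\n"
)

for job_id, hits in count_missing_job_refs(sample).most_common():
    print(job_id, hits)
```

To run it against a real file, read the file's contents into a string and pass
that to count_missing_job_refs().  If one job ID dominates the counts, that job
is the first candidate to remove from the spool while the qmaster is stopped.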



>    If anybody has a good idea what I could look into, I'd appreciate this a
>    lot.
> 
>    Is there an efficient way to trace (strace?) the master process?

You could enable the built-in debugging (see "man sge_dl").

William
