Hi,

This week I have seen the sge_qmaster (6.2u5) crash several times at one of 
our sites.
The last entry in the "messages" file was always like this:

02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!

Automatic migration to the alternate master hosts (as defined in the shadow 
host list) also failed: the new sge_qmaster crashed as well, after one minute 
or less.
Only after several attempts was I able to start the master again, and not 
without damage to some queues (jobs were lost).

This has never happened since I took over the SGE admin role in our company 
more than four years ago, and the messages file does not show an obvious 
reason. Sometimes I see a line like this shortly before the crash:

02/05/2016 14:37:12|  main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...

I have also looked at the host-specific messages files on the local hosts.
If anybody has a good idea what I could look into, I'd appreciate it a lot.
Is there an efficient way to trace (strace?) the master process?
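
In case it helps with narrowing this down, below is a minimal sketch of how 
the messages file could be scanned for critical entries together with the 
lines logged just before them. It assumes the pipe-separated layout seen in 
the two lines quoted above (timestamp|thread|host|level|text) and that level 
"C" marks a critical entry. The function names and the context size are 
hypothetical, made up for illustration only.

```python
from collections import deque


def parse_line(line):
    """Split one messages-file line into its five pipe-separated fields.

    Returns None for lines that do not match the assumed layout
    (for example wrapped or truncated lines).
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = parts
    return {
        "timestamp": timestamp.strip(),
        "thread": thread.strip(),
        "host": host.strip(),
        "level": level.strip(),
        "text": text.strip(),
    }


def context_before_critical(lines, context=5):
    """Yield (critical_entry, preceding_entries) for each level-'C' line.

    'context' is the maximum number of earlier entries kept per hit.
    """
    recent = deque(maxlen=context)
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry["level"] == "C":
            yield entry, list(recent)
        recent.append(entry)


def report(path, context=5):
    """Print every critical entry in the file at 'path' with its context."""
    with open(path, errors="replace") as fh:
        for crit, before in context_before_critical(fh, context):
            for e in before:
                print("   ", e["timestamp"], e["thread"], e["level"], e["text"])
            print(">>>", crit["timestamp"], crit["thread"], crit["level"], crit["text"])
            print()
```

Calling report() with the path to the qmaster messages file would print each 
"C" entry preceded by up to five earlier entries, which should make it easier 
to see whether the "removing reference to no longer existing job" warning 
always comes first.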

Regards,
Manfred

_____________________________________________________________
Manfred Selz
Senior CAD Engineer
Direct Dial: +49 (0)7021 805-562
[email protected] | www.diasemi.com
Dialog Semiconductor GmbH, Neue Strasse 95, 73230 Kirchheim/Teck-Nabern, Germany
_____________________________________________________________

