Hi, this week I have observed the (6.2u5) sgemaster crashing several times on one of our sites. The last message in the "messages" file was always like this:
02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!

Automatic migration to the alternate master hosts (as defined in the shadow host list) also failed: the new sge_qmaster crashed as well, after one minute or less. Only after several attempts was I able to start the master again, and not without damage to some queues (jobs were lost).

This has never happened since I took over the SGE admin role in our company more than four years ago, and the messages file does not show an obvious reason. Sometimes I see a line like this shortly before the crash:

02/05/2016 14:37:12| main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...

I have also looked at the host-specific local messages files.

If anybody has a good idea what I could look into, I'd appreciate it a lot. Is there an efficient way to trace (strace?) the master process?

Regards,
Manfred

Manfred Selz
Senior CAD Engineer
Dialog Semiconductor GmbH
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
