On Fri, Feb 05, 2016 at 03:02:52PM +0000, Manfred Selz wrote:
> Hi,
>
> this week I have observed the (6.2u5) sgemaster crashing several times on
> one of our sites.
>
> The last message in the "messages" file was always like this:
>
> 02/05/2016 14:37:30|worker|mnsrvgems-02v|C|Removing element from other list !!!
>
> Automatic migration to the alternate master hosts (as define in the shadow
> host list) also failed, with the new sge_qmaster also crashing (after one
> minute or less).
>
> Only after several attempts I was able to start the master again, but not
> without having some queues damaged (jobs being lost).
>
> This has never happened before since I took over the SGE admin role in our
> company more than four years ago, and the messages file does not provide
> an obvious reason. Sometimes I see a line like this before crashing:
>
> 02/05/2016 14:37:12| main|mnsrvgems-02v|W|removing reference to no longer existing job 5335536 of user ...

Is the jobid consistent? The most common cause of qmaster crashes in my
experience is a corrupted job spool. Normal procedure is to stop the qmaster
and manually delete the job from the spool (traditional spool) before
restarting.

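To check whether the job ID in those warnings is the same before each crash, a rough Python sketch along these lines could tally them. The field layout (timestamp|thread|host|level|message) is an assumption inferred from the two sample lines quoted above, and the function names are hypothetical, not part of Grid Engine.

```python
import re
from collections import Counter

# Pattern taken from the warning text quoted above.
JOB_WARNING = re.compile(r"removing reference to no longer existing job (\d+)")


def parse_line(line):
    """Split one messages-file line into its five fields, or return None.

    Assumed layout: timestamp|thread|host|level|message
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, message = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "message": message,
    }


def summarize(lines):
    """Count job IDs named in the warnings and collect critical (C) entries."""
    job_ids = Counter()
    critical = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that does not match the assumed layout
        match = JOB_WARNING.search(entry["message"])
        if match:
            job_ids[match.group(1)] += 1
        if entry["level"] == "C":
            critical.append(entry)
    return job_ids, critical
```

Passing an open file handle for the qmaster messages file to summarize() returns the per-job warning counts and the list of critical entries. If one job ID dominates the counts, that job would be the first candidate to remove from the spool.
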
> If anybody has a good idea what I could look into, I'd appreciate this a
> lot.
>
> Is there an efficient way to trace (strace?) the master process?

You could enable the built-in debugging (man sge_dl).

William
