On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:

   We have noticed that our sge_qmaster process fails inconsistently and
   jumps between shadow and master servers.

   Issue occurs every 2-5 days.
One possibility that occurs to me is that you might be suffering from a memory
leak that causes the oom_killer to target the qmaster.

That will be logged via syslog and in the kernel dmesg buffer.  It
should be fairly obvious if this is the case.
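
If you would rather not eyeball the logs, a rough sketch along these lines
can scan saved dmesg or syslog output for OOM-killer lines that name the
qmaster.  The regular expression is an assumption: the kernel's wording
differs between versions, so adjust it to whatever your kernel prints.
find_oom_kills is just a name picked for this example.

```python
import re

# Assumption: the kernel logs either "Kill process N (name)" or
# "Killed process N (name)"; the exact wording varies by kernel version.
OOM_RE = re.compile(r"(?:Kill|Killed) process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines, process_name="sge_qmaster"):
    """Return (pid, line) for every OOM-killer line naming process_name."""
    hits = []
    for line in log_lines:
        match = OOM_RE.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), line.rstrip("\n")))
    return hits
```

Feed it the lines of a file holding the output of dmesg, or of whichever
syslog file your distribution writes kernel messages to.  An empty result
does not rule out a leak, since the dmesg buffer may already have wrapped.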


   We don't understand the root cause and the qmaster messages file does not
   indicate any issue.

I would suggest increasing the loglevel and also checking to see if there is
anything that immediately precedes the failure repeatedly (the qmaster starting
up again should be fairly obvious).
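
As a sketch of that check, the following pulls out the few lines that
precede each qmaster startup in the messages file, so the restarts can be
compared side by side.  The marker string "starting up" is an assumption
about how the startup line reads; look at your own messages file and
substitute whatever text it actually uses.

```python
from collections import deque


def lines_before_startup(lines, marker="starting up", context=5):
    """Yield (startup_line, preceding_lines) for each line containing marker."""
    recent = deque(maxlen=context)  # rolling window of the last few lines
    for line in lines:
        line = line.rstrip("\n")
        if marker in line:
            yield line, list(recent)
        recent.append(line)
```

If the same message shows up in the window before every restart, that is
the place to start digging.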


It's been a while (years), but I've seen cases where bad configuration
settings will cause the qmaster to segfault.  This very annoyingly
happened during an upgrade, which made for a Bad Day.

The logs are usually kept in $SGE_ROOT/$SGE_CELL/spool/qmaster

However, in extreme cases, including startup, there may be logs put into
/tmp (for cases where writing into $SGE_ROOT fails for some reason).
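
A small sketch for collecting both locations in one go.  Two assumptions
here: that SGE_CELL falls back to "default" when unset, and that the
fallback files in /tmp have "messages" somewhere in their name.  Check
both against your installation; the glob in particular is only a guess and
may pick up unrelated files.

```python
import glob
import os


def candidate_log_files(env=None, tmp_dir="/tmp"):
    """List existing qmaster log candidates: spool messages file, then tmp_dir."""
    env = os.environ if env is None else env
    paths = []
    root = env.get("SGE_ROOT")
    if root:
        cell = env.get("SGE_CELL", "default")  # assumed default cell name
        paths.append(os.path.join(root, cell, "spool", "qmaster", "messages"))
    # Assumption: fallback logs in tmp_dir contain "messages" in their name.
    paths.extend(sorted(glob.glob(os.path.join(tmp_dir, "*messages*"))))
    return [p for p in paths if os.path.isfile(p)]
```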



   What are the best practices debugging this issue and resolving the problem
   without interrupting normal operation of sge_qmaster?

There is also running the qmaster with debugging turned up, but that
could easily generate an excessive volume of messages, especially
if you don't know what you are looking for.


Try running the qmaster under strace or GDB.  You'll likely have to
either modify the init script, or run it by hand.
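
For the strace route, here is a minimal sketch that only builds the
command line, either attaching to an already running qmaster by PID or
launching it by hand.  The flags are standard strace options (-f follows
child processes, -tt adds timestamps, -o writes to a file, -p attaches to
a PID).  The output path and the PID in the usage note are made-up
placeholders, and strace_command is just a name for this example.

```python
def strace_command(output_path, pid=None, program=None):
    """Build an strace argv that attaches to pid or launches program."""
    if (pid is None) == (program is None):
        raise ValueError("give exactly one of pid or program")
    cmd = ["strace", "-f", "-tt", "-o", output_path]
    if pid is not None:
        cmd += ["-p", str(pid)]   # attach to the running process
    else:
        cmd += list(program)      # start the program under strace
    return cmd
```

For example, strace_command("/tmp/qmaster.strace", pid=12345) gives a
list you can hand to subprocess.run.  Attaching to the running process is
an alternative to editing the init script, though you will need to
re-attach after each failover, and tracing will slow the qmaster down.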


--
Jesse Becker (Contractor)
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
