I doubt it's related to a memory issue, since the servers have plenty of RAM (128 GB)
and sge_qmaster does not usually go above ~1.5 GB of resident memory.

We use CentOS 6.5 x86_64.
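
In case it helps, this is roughly how we keep an eye on the qmaster's resident
size (a sketch; the interval and log path are arbitrary):

-----------------------------------
        # log sge_qmaster memory (RSS/VSZ in KB) once a minute
        while sleep 60; do
            date +'%F %T'
            ps -C sge_qmaster -o pid,rss,vsz,etime,cmd
        done >> /var/tmp/qmaster_mem.log
-----------------------------------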


-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, June 21, 2016 6:51 PM
To: Jesse Becker <[email protected]>
Cc: William Hay <[email protected]>; [email protected]; Yuri Burmachenko 
<[email protected]>
Subject: Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently 
and fail-over occurs quite often - best practices debugging and resolving the 
issue.

In the message dated: Tue, 21 Jun 2016 10:09:11 -0400, The pithy ruminations
from Jesse Becker on
<Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently and fail-
over occurs quite often - best practices debugging and resolving the issue.>
were:
=> On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
=> >On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:
=> >>
=> >>    We have noticed that our sge_qmaster process fails inconsistently and
=> >>    jumps between shadow and master servers.
=> >>
=> >>    Issue occurs every 2-5 days.
=> >One possibility that occurs to me is that you might be suffering from a
=> >memory leak that causes the oom_killer to target the qmaster.
=>
=> That will be logged via syslog, and in the kernel dmesg buffer.  It
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^             ^^^^^^^^^^^^^^^^^^^
               (1)                                    (2)

(1) If syslog hasn't already been killed by the OOM-killer. :(

(2) I seem to recall that older OS releases (CentOS5-ish?) only provided the
    PID in the dmesg buffer when a process was killed, not the command name.
    Recent kernels seem to be better at that, but I'm not willing to cause
    an OOM situation in order to run regression tests to check...

I consider any machine that's had an OOM event to be fundamentally unstable, 
and reboot it as soon as possible.
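
For the record, a quick way to check whether the OOM killer has fired at all
(assuming the stock CentOS syslog layout) is something like:

        dmesg | grep -iE 'out of memory|killed process'
        grep -iE 'out of memory|oom-killer|killed process' /var/log/messages*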

[SNIP!]

=>
=> It's been a while (years), but I've seen cases where bad configuration
=> settings will cause the qmaster to segfault.  This very annoyingly


Yeah, I've run into issues (very high memory use, CPU thrashing) when
        schedd_job_info
is enabled, but I believe that only affected versions well before 8.1.8.
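
If you want to rule it out anyway, the setting lives in the scheduler
configuration; from memory, something like:

        qconf -ssconf | grep schedd_job_info    # show the current value
        qconf -msconf                           # edit it; set schedd_job_info to false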

Another possibility:

        /tmp fills up (from SGE or another process), SGE dies, some
        cleaner process then frees up space in /tmp, SGE restarts. Wash,
        rinse, repeat.
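
A crude way to catch that in the act (the threshold and log file below are made
up) would be a cron job along these lines:

-----------------------------------
        # complain when /tmp goes above ~95% used
        PCT_USED=$(df -P /tmp | awk 'NR==2 {gsub("%","",$5); print $5}')
        [ "$PCT_USED" -ge 95 ] && \
            echo "$(date): /tmp at ${PCT_USED}% used" >> /var/log/tmp-watch.log
-----------------------------------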


=> happened during an upgrade, which made for a Bad Day.
=> 
=> The logs are usually kept in $SGE_ROOT/$SGE_CELL/spool/qmaster

Yeah, more generally, SGE doesn't behave very well if the qmaster can't write
stuff to disk.
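
Right after a fail-over, the first thing I'd look at is the tail of the qmaster
messages file on whichever host was master at the time, e.g. (path assumes
classic spooling; adjust if you spool elsewhere):

        tail -n 200 $SGE_ROOT/$SGE_CELL/spool/qmaster/messages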

=> 
=> However, in extreme cases, including startup, there may logs put into
=> /tmp (for cases where writing into $SGE_ROOT fails for some reason).
=> 
=> 

=> 
=> Try running the qmaster under strace or GDB.  You'll likely have to
=> either modify the init script, or run it by hand.
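
By hand, that would look roughly like this (the arch directory and output path
are illustrative; stop the managed qmaster first):

-----------------------------------
        # run the qmaster in the foreground under strace;
        # -f follows children, -tt adds timestamps
        strace -f -tt -o /tmp/qmaster.strace \
            $SGE_ROOT/bin/lx-amd64/sge_qmaster
-----------------------------------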

There are debugging hooks, see:

        $SGE_ROOT/util/dl.sh

Most easily used by putting the following into the sge_qmaster init script and
defining $DEBUG_LEVEL:

-----------------------------------
        source $SGE_ROOT/util/dl.sh     # for access to the "dl" command to enable debugging
        dl $DEBUG_LEVEL
-----------------------------------
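
As far as I recall, all "dl" does is export the SGE debugging environment
variables (SGE_DEBUG_LEVEL and friends), so the same trick works interactively
before starting the daemon by hand (the level and arch directory below are just
examples):

-----------------------------------
        source $SGE_ROOT/util/dl.sh
        dl 3                                   # higher levels are noisier
        $SGE_ROOT/bin/lx-amd64/sge_qmaster     # run in the foreground and watch the output
-----------------------------------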

Mark

=> 
=> 
=> -- 
=> Jesse Becker (Contractor)

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
