In the message dated: Tue, 21 Jun 2016 10:09:11 -0400,
The pithy ruminations from Jesse Becker on
<Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently and
fail-over occurs quite often - best practices debugging and resolving the issue.>
were:
=> On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
=> >On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:
=> >>
=> >> We have noticed that our sge_qmaster process fails inconsistently and
=> >> jumps between shadow and master servers.
=> >>
=> >> Issue occurs every 2-5 days.
=> >One possibility that occurs to me is that you might be suffering from a
=> >memory leak that causes the oom_killer to target the qmaster.
=>
=> That will be logged via syslog, and in the kernel dmesg buffer. It
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^             ^^^^^^^^^^^^^^^^^^^
                (1)                                   (2)
(1) If syslog hasn't already been killed by the OOM-killer. :(
(2) I seem to recall that older OS releases (CentOS5-ish?) only provided the
PID in the dmesg buffer when a process was killed, not the command name.
Recent kernels seem to be better at that, but I'm not willing to cause
an OOM situation in order to run regression tests to check...
I consider any machine that's had an OOM event to be fundamentally unstable,
and reboot it as soon as possible.
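For what it's worth, here is a rough sketch of how I'd look for OOM-killer traces after the fact. The function name is made up, and the grep patterns are my assumption about the usual kernel wording ("Out of memory", "oom-killer", "Killed process"); the exact text varies between kernel versions, so adjust as needed.

```shell
#!/bin/sh
# Hypothetical helper: print OOM-killer lines from the given log files,
# or from stdin if no file is given.
# The patterns are assumptions about typical kernel wording.
oom_events() {
    grep -i -E 'out of memory|oom-killer|killed process' "$@"
}

# Example usage (the syslog path varies by distro):
#   dmesg | oom_events
#   oom_events /var/log/messages
```

If this prints nothing, that doesn't rule out an OOM kill, for the two reasons given above.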
[SNIP!]
=>
=> It's been a while (years), but I've seen cases where bad configuration
=> settings will cause the qmaster to segfault. This very annoyingly
Yeah, I've run into issues (very high memory use, CPU thrashing) when
schedd_job_info is enabled, but I believe that only affected versions well
before 8.1.8.
Another possibility:
/tmp fills up (from SGE or another process), SGE dies, some
cleaner process then frees up space in /tmp, SGE restarts. Wash,
rinse, repeat.
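One way to test the /tmp theory is to log how full /tmp is at regular intervals (from cron, say) and compare the timestamps with the qmaster restarts. A minimal sketch follows; the function names and the 90% default are hypothetical, and it relies on the POSIX "df -P" output format, where the capacity is the fifth column.

```shell
#!/bin/sh
# Hypothetical helper: print the percentage of space used on the
# filesystem holding the given directory (default /tmp), digits only.
fs_used_pct() {
    df -P "${1:-/tmp}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn when usage reaches the limit (default 90%).
# Returns 0 if below the limit, 1 if at or above it, 2 if usage is unreadable.
warn_if_full() {
    dir=${1:-/tmp}
    limit=${2:-90}
    used=$(fs_used_pct "$dir")
    case "$used" in
        ''|*[!0-9]*) echo "could not read usage for $dir" >&2; return 2 ;;
    esac
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $dir is ${used}% full"
        return 1
    fi
    return 0
}

# Example usage:
#   warn_if_full /tmp 90
```

Note that this only looks at blocks; a filesystem can also run out of inodes, which "df -i" would show on Linux.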
=> happened during an upgrade, which made for a Bad Day.
=>
=> The logs are usually kept in $SGE_ROOT/$SGE_CELL/spool/qmaster
Yeah, more generally, SGE doesn't behave very well if the qmaster can't write
stuff to disk.
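If the qmaster did manage to write something before dying, the last error lines in its messages file are the first thing I'd look at. A rough sketch, assuming the log file is named "messages" in the spool directory quoted above and uses the pipe-separated layout "date time|thread|host|level|text" with E and C as the error and critical levels; the function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: print error (E) and critical (C) lines from a
# qmaster messages file. Assumes the pipe-separated layout
#   date time|thread|host|level|text
qmaster_errors() {
    awk -F'|' '$4 == "E" || $4 == "C"' "$@"
}

# Example usage (path assumes the default spool location):
#   qmaster_errors "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages" | tail -n 20
```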
=>
=> However, in extreme cases, including startup, there may be logs put into
=> /tmp (for cases where writing into $SGE_ROOT fails for some reason).
=>
=>
=>
=> Try running the qmaster under strace or GDB. You'll likely have to
=> either modify the init script, or run it by hand.
There are debugging hooks, see:
$SGE_ROOT/util/dl.sh
Most easily used by putting the following into the sge_qmaster init script and
defining $DEBUG_LEVEL
-----------------------------------
# for access to the "dl" command to enable debugging
source $SGE_ROOT/util/dl.sh
dl $DEBUG_LEVEL
-----------------------------------
Mark
=>
=>
=> --
=> Jesse Becker (Contractor)