On Tue, Jun 21, 2016 at 03:25:25PM +0000, Yuri Burmachenko wrote:
The dmesg is clean, with no indication of oom_killer activity or segfaults.
The only thing we have in dmesg & /var/log/messages is this line:

possible SYN flooding on port 7444. Sending cookies.

7444 is the sge_qmaster port.

How many exec nodes do you have?  While surprising, if the exec nodes
all hammer the qmaster host at the same time, I can imagine the kernel
thinking it's under attack.
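If the SYN-cookie message lines up with mass execd reconnects (e.g. right after a qmaster restart), enlarging the listen backlog may quiet it. A sketch only, with placeholder values you would tune for your site:

```shell
# Runtime change (lost on reboot):
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.core.somaxconn=1024

# Persist across reboots:
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 1024
EOF
```

Note that somaxconn also caps the backlog an application can request in listen(), so raising only tcp_max_syn_backlog may not be enough on its own.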

Didn't find any files in /tmp, and $SGE_ROOT/$SGE_CELL/spool/qmaster has the
following files & folders (looks like a regular layout):

admin_hosts  advance_reservations  arseqnum  calendars
centry  ckpt  cqueues  exec_hosts
heartbeat  hostgroups  job_scripts  jobs
jobseqnum  managers  messages  messages.1.gz
messages.2.gz  messages_shadowd.mtlxsge01  messages_shadowd.mtlxsge02
messages_shadowd.mtlxsge001  messages_shadowd.mtlxsge002  operators
pe  projects  qinstances  qmaster.pid
resource_quotas  shadowd_mtlxsge01.yok.mtl.com.pid  shadowd_mtlxsge02.yok.mtl.com.pid
shadowd_mtlxsge001.yok.mtl.com.pid  shadowd_mtlxsge002.yok.mtl.com.pid
submit_hosts  users  usersets  zombies

The 'messages' file would be where to look--shortly before a failover
event...
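For instance, you can stitch the rotated logs back together in time order and look at what immediately precedes each restart. The 'start' pattern below is only a guess at the startup banner text; adjust it once you see the real message:

```shell
# zcat -f passes plain files through unchanged, so gzipped and current
# logs can be concatenated in one pipeline, oldest first.
cd "$SGE_ROOT/$SGE_CELL/spool/qmaster"
zcat -f messages.2.gz messages.1.gz messages | grep -i -B 20 'start'
```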


Previously we ran strace, and these are the last several lines before it was
detached due to the process failure:
[pid 42490] <... futex resumed> )       = 0
[pid 42489] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
327103507, {1459997595, 48056000}, ffffffff <unfinished ...>
[pid 42490] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
327103508, {1459997595, 48126000}, ffffffff <unfinished ...>
[pid 42489] <... futex resumed> )       = -1 EAGAIN (Resource temporarily 
unavailable)
[pid 42489] futex(0xfba3d4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 
327103508, {1459997595, 48056000}, ffffffff <unfinished ...>
[pid 42487] mkdir("jobs", 0755)         = -1 EEXIST (File exists)
[pid 42487] mkdir("jobs/00", 0755)      = -1 EEXIST (File exists)
[pid 42487] mkdir("jobs/00/0133", 0755) = -1 EEXIST (File exists)
[pid 42487] open("jobs/00/0133/.1144", O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished 
...>
[pid 42491] write(2, "======= Backtrace: =========\n", 29) = 29
[pid 42491] writev(2, [{"/lib64/libc.so.6", 16}, {"[0x", 3}, {"3354076166", 10}, 
{"]\n", 2}], 4) = 31
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_free", 8}, {"+0x", 3}, {"16", 2}, 
{")", 1}, {"[0x", 3}, {"5e3e76", 6}, {"]\n", 2}], 9) = 59
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"lSetString", 10}, {"+0x", 3}, {"d8", 
2}, {")", 1}, {"[0x", 3}, {"590238", 6}, {"]\n", 2}], 9) = 61
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"[0x", 3}, 
{"4cbfbf", 6}, {"]\n", 2}], 4) = 44
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"[0x", 3}, 
{"4cd565", 6}, {"]\n", 2}], 4) = 44
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_select_parallel_environment", 31}, {"+0x", 
3}, {"939", 3}, {")", 1}, {"[0x", 3}, {"4ceb59", 6}, {"]\n", 2}], 9) = 83
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"scheduler_method", 16}, {"+0x", 3}, 
{"3150", 4}, {")", 1}, {"[0x", 3}, {"43f470", 6}, {"]\n", 2}], 9) = 69
[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_scheduler_main", 18}, {"+0x", 3}, 
{"f7c", 3}, {")", 1}, {"[0x", 3}, {"4342fc", 6}, {"]\n", 2}], 9) = 70
[pid 42491] writev(2, [{"/lib64/libpthread.so.0", 22}, {"[0x", 3}, {"33548079d1", 10}, 
{"]\n", 2}], 4) = 37
[pid 42491] writev(2, [{"/lib64/libc.so.6", 16}, {"(", 1}, {"clone", 5}, {"+0x", 3}, {"6d", 2}, {")", 1}, 
{"[0x", 3}, {"33540e8b6d", 10}, {"]\n", 2}], 9) = 43
[pid 42491] write(2, "======= Memory map: ========\n", 29) = 29
[pid 42491] open("/proc/self/maps", O_RDONLY) = 385
[pid 42491] read(385, "00400000-0065a000 r-xp 00000000 "..., 1024) = 1024
[pid 42491] write(2, "00400000-0065a000 r-xp 00000000 "..., 1024) = 1024
[pid 42491] read(385, "so\n3354000000-335418b000 r-xp 00"..., 1024) = 1024
[pid 42491] write(2, "so\n3354000000-335418b000 r-xp 00"..., 1024) = 1024
[pid 42491] read(385, "17000 08:03 16646600            "..., 1024) = 1024
[pid 42491] write(2, "17000 08:03 16646600            "..., 1024) = 1024
[pid 42491] read(385, "0000-3355c16000 r-xp 00000000 08"..., 1024) = 1024
[pid 42491] write(2, "0000-3355c16000 r-xp 00000000 08"..., 1024) = 1024
[pid 42491] read(385, "   /lib64/libcom_err.so.2.1\n335b"..., 1024) = 1024
[pid 42491] write(2, "   /lib64/libcom_err.so.2.1\n335b"..., 1024) = 1024
[pid 42491] read(385, ".3.1\n335c829000-335ca29000 ---p "..., 1024) = 1024
[pid 42491] write(2, ".3.1\n335c829000-335ca29000 ---p "..., 1024) = 1024
[pid 42491] read(385, "/libkrb5.so.3.3\n335d2e4000-335d2"..., 1024) = 1024
[pid 42491] write(2, "/libkrb5.so.3.3\n335d2e4000-335d2"..., 1024) = 1024
[pid 42491] read(385, " \n7faea0000000-7faea4000000 rw-p"..., 1024) = 1024
[pid 42491] write(2, " \n7faea0000000-7faea4000000 rw-p"..., 1024) = 1024
[pid 42491] read(385, ":00 0 \n7faed0000000-7faed4000000"..., 1024) = 1024
[pid 42491] write(2, ":00 0 \n7faed0000000-7faed4000000"..., 1024) = 1024
[pid 42491] read(385, "s_nis-2.12.so\n7faee70cf000-7faee"..., 1024) = 1024
[pid 42491] write(2, "s_nis-2.12.so\n7faee70cf000-7faee"..., 1024) = 1024
[pid 42491] read(385, "6646172                   /lib64"..., 1024) = 1024
[pid 42491] write(2, "6646172                   /lib64"..., 1024) = 1024
[pid 42491] read(385, "faef024e000 ---p 00138000 08:03 "..., 1024) = 889
[pid 42491] write(2, "faef024e000 ---p 00138000 08:03 "..., 889) = 889
[pid 42491] read(385, "", 1024)         = 0
[pid 42491] close(385)                  = 0
[pid 42491] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid 42491] tgkill(42474, 42491, SIGABRT) = 0
[pid 42491] --- SIGABRT (Aborted) @ 0 (0) ---


Now, this is kind of interesting.  It looks like the qmaster is killing
part of itself:

      tgkill()  sends  the  signal  sig to the thread with the
      thread ID tid in the thread group tgid.  (By contrast, kill(2)
      can only be used to send a signal to a process (i.e., thread
      group) as a whole, and the signal will be delivered to an
      arbitrary thread within that process.)

Given the argument order tgkill(tgid, tid, sig), 42474 is the thread group
(process) ID and 42491 is the thread sending SIGABRT to itself.  Together
with the "======= Backtrace: =========" and "======= Memory map: ========"
banners, this looks like glibc's allocator aborting after detecting heap
corruption: reading the writev() fragments innermost-out, the scheduler
thread was in sge_scheduler_main -> scheduler_method ->
sge_select_parallel_environment -> lSetString -> sge_free when it died.

Also, and this is possibly unrelated, one of the calls in the backtrace
is:

 >[pid 42491] writev(2, [{"/opt/sge/bin/lx-amd64/sge_qmaste"..., 33}, {"(", 1}, {"sge_select_parallel_environment", 31}, {"+0x", 
3}, {"939", 3}, {")", 1}, {"[0x", 3}, {"4ceb59", 6}, {"]\n", 2}], 9) = 83

Perhaps check your PE configurations to make sure they do not have
any strange characters or settings in them?  You may also want to
consider removing them and reconfiguring them if possible.
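One quick way to audit them: qconf -spl lists the PE names and qconf -sp shows one definition, so you can dump everything and make stray bytes visible with cat -v:

```shell
# Dump every parallel environment definition; cat -v renders
# non-printing characters (e.g. ^M, M-) so they are easy to spot.
for pe in $(qconf -spl); do
    printf '== %s ==\n' "$pe"
    qconf -sp "$pe" | cat -v
done
```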


Process 42491 detached
[pid 42490] +++ killed by SIGABRT +++
[pid 42489] +++ killed by SIGABRT +++
[pid 42488] +++ killed by SIGABRT +++
[pid 42487] +++ killed by SIGABRT +++
[pid 42486] +++ killed by SIGABRT +++
[pid 42485] +++ killed by SIGABRT +++
[pid 42484] +++ killed by SIGABRT +++
[pid 42478] +++ killed by SIGABRT +++
[pid 42476] +++ killed by SIGABRT +++
[pid 42475] +++ killed by SIGABRT +++
[pid 42477] +++ killed by SIGABRT +++
+++ killed by SIGABRT +++


Regarding GDB - what are the best practices for using it? By default it stops
the process once attached (we are running sge_qmaster in production) - how can
we attach to sge_qmaster so it won't stop its normal operation?

Once you attach, you should be able to type "continue" to resume the
process ("run" would try to start a fresh copy from scratch).
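To keep the pause short, a one-shot snapshot often beats staying attached. Both invocations below are standard gdb usage, though the pgrep -o pattern assumes the oldest matching process is the qmaster itself:

```shell
# One-shot: attach, dump every thread's stack, detach.  The qmaster is
# only stopped for the second or two this takes.
gdb -batch -p "$(pgrep -o sge_qmaster)" -ex 'thread apply all bt'

# Long-running: attach and immediately resume; gdb regains control when
# the process hits a fatal signal such as SIGABRT, so you can inspect
# the crash in place.
gdb -p "$(pgrep -o sge_qmaster)" -ex 'handle SIGABRT stop nopass' -ex continue
```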

Also, what will happen when sge_qmaster fails? AFAIK gdb holds the process in
"T" state - will fail-over occur or not if gdb is attached to the process
during its failure?

Failover will (probably) still occur, since the shadowmaster will notice
that the heartbeat file has not been updated.
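You can watch that liveness signal directly; as I recall the heartbeat file holds a counter the qmaster bumps roughly every 30 seconds, and the shadowd starts its takeover clock when it stops changing:

```shell
# If the counter keeps incrementing, the shadowd considers the
# qmaster alive.
watch -n 30 cat "$SGE_ROOT/$SGE_CELL/spool/qmaster/heartbeat"
```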


-----Original Message-----
From: Jesse Becker [mailto:[email protected]]
Sent: Tuesday, June 21, 2016 5:09 PM
To: William Hay <[email protected]>
Cc: Yuri Burmachenko <[email protected]>; [email protected]
Subject: Re: [gridengine users] SoGE 8.1.8 - sge_qmaster fails inconsistently 
and fail-over occurs quite often - best practices debugging and resolving the 
issue.

On Tue, Jun 21, 2016 at 10:12:41AM +0100, William Hay wrote:
On Tue, Jun 21, 2016 at 08:16:25AM +0000, Yuri Burmachenko wrote:

   We have noticed that our sge_qmaster process fails inconsistently and
   jumps between shadow and master servers.

   Issue occurs every 2-5 days.
One possibility that occurs to me is that you might be suffering from a
memory leak that causes the oom_killer to target the qmaster.

That will be logged via syslog, and in the kernel dmesg buffer.  It should be
fairly obvious if this is the case.
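A quick check, assuming standard syslog locations:

```shell
# The kernel logs OOM kills with an "Out of memory" line naming the
# victim process.
dmesg | grep -i -E 'out of memory|oom'
grep -i 'oom' /var/log/messages* 2>/dev/null
```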


   We don't understand the root cause and the qmaster messages file does not
   indicate any issue.

I would suggest increasing the loglevel and also checking to see if
there is anything that immediately precedes the failure repeatedly (the
qmaster starting up again should be fairly obvious).
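The knob is the loglevel parameter in the global cluster configuration (see sge_conf(5)); log_info is the most verbose setting:

```shell
# Opens $EDITOR on the global config; change the line to:
#   loglevel  log_info
qconf -mconf
```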


It's been a while (years), but I've seen cases where bad configuration settings 
will cause the qmaster to segfault.  This very annoyingly happened during an 
upgrade, which made for a Bad Day.

The logs are usually kept in $SGE_ROOT/$SGE_CELL/spool/qmaster

However, in extreme cases, including startup, there may be logs put into /tmp
(for cases where writing into $SGE_ROOT fails for some reason).



   What are the best practices debugging this issue and resolving the problem
   without interrupting normal operation of sge_qmaster?

There is also running the qmaster with debugging turned up, but that
could easily generate an excessive volume of messages, especially if
you don't know what you are looking for.


Try running the qmaster under strace or GDB.  You'll likely have to either 
modify the init script, or run it by hand.
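If you go the strace route, write the trace to a file so the tail survives the crash; -f follows all threads and -tt timestamps each line:

```shell
# Attaching to the running process avoids touching the init script,
# but note that strace adds noticeable overhead to a busy qmaster.
strace -f -tt -o /var/tmp/qmaster.strace -p "$(pgrep -o sge_qmaster)"
```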


--
Jesse Becker (Contractor)

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
