Running Grid Engine 6.2u5.
The qmaster runs on RHEL 5.4; the exec hosts run RHEL 5.5/5.6.
(Currently in the process of upgrading to 5.6.)
The qmaster keeps dying, seemingly at random (9 times since Friday afternoon).
We installed a year ago and had not seen this issue until recently.
The problem started a month or so ago and has increased in frequency.
As a workaround, a cron job runs every 2 minutes to check whether the
qmaster is down and, if so, restart it.
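For reference, the cron check described above could look roughly like the following. This is a minimal sketch rather than the exact script in use: it assumes Python 3 is available on the qmaster host, that the daemon's process name is `sge_qmaster`, and that the restart command is supplied by the caller (the path in the final comment is a placeholder). The names `is_running` and `check_and_restart` are made up for illustration.

```python
import os
import subprocess

def is_running(name, proc_root="/proc"):
    """Return True if some process under proc_root has the command name `name`."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories describe processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                data = f.read()
            # /proc/<pid>/stat looks like "pid (comm) state ...";
            # take the text between the first "(" and the last ")".
            comm = data[data.index("(") + 1:data.rindex(")")]
        except (OSError, ValueError):
            continue  # process exited mid-scan, or the line was malformed
        if comm == name:
            return True
    return False

def check_and_restart(name, restart_cmd, proc_root="/proc", run=subprocess.call):
    """Run restart_cmd if `name` is not running; return True if a restart was attempted."""
    if is_running(name, proc_root):
        return False
    run(restart_cmd)
    return True

# Example call from a script run by cron (the path is a placeholder, not a real location):
# check_and_restart("sge_qmaster", ["/path/to/sgemaster", "start"])
```

The `run` parameter exists only so the check can be exercised without actually starting anything.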
I can't find any indication, in the log files or anywhere else, of why
it is dying.
So I ran strace on the qmaster PID.
It shows a segmentation fault (last few lines below).
Any ideas?
 
[pid 24778] futex(0x7375e0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 24774] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 24753] futex(0x7375e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 24744] gettimeofday( <unfinished ...>
[pid 24743] futex(0x2b662bd40c24, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x2b662bd40bc0, 7404026 <unfinished ...>
[pid 24778] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 24776] <... futex resumed> )       = 0
[pid 24774] <... clock_gettime resumed> {1304038113, 8112000}) = 0
[pid 24753] <... futex resumed> )       = 0
[pid 24744] <... gettimeofday resumed> {1304038113, 8320}, NULL) = 0
[pid 24743] <... futex resumed> )       = 2
[pid 24778] futex(0x7375e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 24776] futex(0x2b662bd40bc0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 24774] futex(0x2aaaabc5aa0c, FUTEX_WAIT_PRIVATE, 2519512, {0, 998853000} <unfinished ...>
[pid 24753] gettimeofday( <unfinished ...>
[pid 24744] futex(0x2b662bd409e0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 24743] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 24778] <... futex resumed> )       = 0
[pid 24777] <... futex resumed> )       = 0
[pid 24776] <... futex resumed> )       = -1 EAGAIN (Resource temporarily unavailable)
[pid 24753] <... gettimeofday resumed> {1304038113, 9573}, {0, 1304038113}) = 0
[pid 24744] <... futex resumed> )       = 0
[pid 24743] <... futex resumed> )       = 1
[pid 24778] gettimeofday( <unfinished ...>
[pid 24777] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 24776] futex(0x2b662bd40bc0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 24753] gettimeofday( <unfinished ...>
[pid 24744] poll([{fd=38, events=POLLOUT}], 1, 5 <unfinished ...>
[pid 24743] gettimeofday( <unfinished ...>
[pid 24778] <... gettimeofday resumed> {1304038113, 10670}, {0, 1304038113}) = 0
[pid 24777] <... futex resumed> )       = 0
[pid 24776] <... futex resumed> )       = 0
[pid 24753] <... gettimeofday resumed> {1304038113, 11054}, NULL) = 0
[pid 24744] <... poll resumed> )        = 1 ([{fd=38, revents=POLLOUT}])
[pid 24743] <... gettimeofday resumed> {1304038113, 11228}, NULL) = 0
[pid 24778] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
Process 24778 detached
[pid 24794] +++ killed by SIGSEGV +++
[pid 24793] +++ killed by SIGSEGV +++
[pid 24790] +++ killed by SIGSEGV +++
[pid 24789] +++ killed by SIGSEGV +++
[pid 24788] +++ killed by SIGSEGV +++
[pid 24787] +++ killed by SIGSEGV +++
[pid 24786] +++ killed by SIGSEGV +++
[pid 24785] +++ killed by SIGSEGV +++
[pid 24784] +++ killed by SIGSEGV +++
[pid 24783] +++ killed by SIGSEGV +++
[pid 24782] +++ killed by SIGSEGV +++
[pid 24781] +++ killed by SIGSEGV +++
[pid 24780] +++ killed by SIGSEGV +++
[pid 24779] +++ killed by SIGSEGV +++
[pid 24777] +++ killed by SIGSEGV +++
[pid 24776] +++ killed by SIGSEGV +++
[pid 24774] +++ killed by SIGSEGV +++
[pid 24755] +++ killed by SIGSEGV +++
[pid 24754] +++ killed by SIGSEGV +++
[pid 24753] +++ killed by SIGSEGV +++
[pid 24752] +++ killed by SIGSEGV +++
[pid 24744] +++ killed by SIGSEGV +++
[pid 24743] +++ killed by SIGSEGV +++
[pid 24742] +++ killed by SIGSEGV +++
[pid 24740] +++ killed by SIGSEGV +++
+++ killed by SIGSEGV +++
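
In the excerpt above, the "--- SIGSEGV ---" line names pid 24778 as the thread that received the signal; the "+++ killed by SIGSEGV +++" lines show the remaining threads going down with the process. In a long capture, that first thread can be picked out mechanically. A minimal Python sketch, assuming the trace lines carry strace's `[pid N]` prefix as they do here (the function name `first_signal` is made up for illustration):

```python
import re

# Matches strace signal-delivery lines such as:
#   [pid 24778] --- SIGSEGV (Segmentation fault) @ 0 (0) ---
SIGNAL_LINE = re.compile(r"^\[pid\s+(\d+)\]\s+--- (SIG[A-Z0-9]+)\b")

def first_signal(lines, signal="SIGSEGV"):
    """Return the pid (as a string) of the first thread shown receiving
    `signal`, or None if no such line is present."""
    for line in lines:
        match = SIGNAL_LINE.match(line)
        if match and match.group(2) == signal:
            return match.group(1)
    return None
```

Fed the lines above, this returns "24778". It identifies only which thread faulted, not why.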

 
 
Best Regards,
Brian Murphy
________________________________________
Siemens Energy, Inc.
Global Engineering Computing Operations
Engineering Applications Administrator
Compute Grid Administrator
Orlando, Florida, USA
407.736.5215
 
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
