[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834089#comment-16834089 ]

Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:54 PM:

The Mesos master stopped responding to HTTP requests at around 16:30. At around 17:00, the master was restarted. Logs are attached after the stack trace.

Logs of the Mesos master around the same time:

{noformat}
I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669457 85889 master.cpp:8397]
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088 ]

Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:50 PM:

Stack trace of the Mesos master when the hang was detected. Captured using gdb.

{noformat}
Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf128758) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at