[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834156#comment-16834156 ]

Gaurav Garg commented on MESOS-9767:

I only queried the /metrics/snapshot endpoint when the master was stuck. Next time I will also query the /master/health endpoint. Unfortunately, I captured the stack trace only once. My bad. Next time I will capture multiple stack traces.

> Add self health monitoring in Mesos master
> ------------------------------------------
>
>                 Key: MESOS-9767
>                 URL: https://issues.apache.org/jira/browse/MESOS-9767
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Gaurav Garg
>            Priority: Major
>
> We have seen an issue where the Mesos master got stuck and was not responding to
> HTTP endpoints like "/metrics/snapshot". As a result, calls to the master from
> frameworks and the metrics collector hang. Currently we emit a
> 'master alive' metric using prometheus. If the master hangs, this metric is not
> published, and we detect the hang using alerts on top of this metric. By the
> time someone has received the alert and restarted the master process,
> 15-30 minutes have passed. This results in SLA violations for Mesos
> cluster users.
>
> It would be nice to implement self health check monitoring to detect whether the
> Mesos master is hung/stuck. This would let us quickly crash the master
> process so that one of the other members of the quorum can acquire the ZK
> leadership lock.
>
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|[https://github.com/apache/mesos/blob/master/src/master/main.cpp]]
> just after the child master process is
> [spawned.|[https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543]]
>
> We can leverage the
> [HealthChecker|[https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]]
> for this one. One downside is that HealthChecker currently takes a TaskId as
> an input, which is not valid for a master health check.
>
> We can add the following flags to control the self health checking:
> # self_monitoring_enabled: Whether self monitoring is enabled.
> # self_monitoring_consecutive_failures: Number of consecutive health check
> failures after which the master is crashed.
> # self_monitoring_interval_secs: Interval at which health checks are
> performed.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
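Because the failure described in the ticket is a hang rather than an error response, a self health check would itself need a deadline, or it would hang along with the endpoint it is probing. The ticket does not say how that would be done. The following is only a minimal sketch under stated assumptions: `probeWithDeadline` is a hypothetical helper (not part of Mesos or libprocess), and `probe` stands in for whatever call would actually query "/master/health".

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <thread>

// Hypothetical helper, not a Mesos API.
// Runs `probe` on a helper thread and returns its result, or false if the
// probe does not finish within `timeout`.
bool probeWithDeadline(std::function<bool()> probe,
                       std::chrono::milliseconds timeout) {
  // The promise is shared so it outlives this call if the probe is still
  // running when the deadline expires.
  auto result = std::make_shared<std::promise<bool>>();
  std::future<bool> future = result->get_future();

  // Detached so that a probe that never returns cannot block the caller.
  std::thread([result, probe = std::move(probe)]() {
    bool healthy = false;
    try {
      healthy = probe();
    } catch (...) {
      // A throwing probe is treated as an unhealthy result.
    }
    result->set_value(healthy);
  }).detach();

  if (future.wait_for(timeout) != std::future_status::ready) {
    return false;  // Deadline expired: treat a hung endpoint as unhealthy.
  }
  return future.get();
}
```

A `std::promise` with a detached thread is used here instead of `std::async`, because the future returned by `std::async` blocks in its destructor until the task finishes, which would defeat the deadline. The trade-off is that a probe that never returns leaks its helper thread.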
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834089#comment-16834089 ]

Gaurav Garg commented on MESOS-9767:

The Mesos master stopped responding to HTTP requests at around 16:30. At around 17:00, the master was restarted. Logs are attached after the stack trace.

Logs of the Mesos master around the same time:
===
I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669457 85889
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834088#comment-16834088 ]

Gaurav Garg commented on MESOS-9767:

Stack trace of the Mesos master when the hang was detected. Captured using gdb.

Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5 0x7fa1f14ee22f in operator() (__closure=0x561dcf128758) at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6 _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700
#7 operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688
#8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115
#9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1 0x7fa1f14d6e82 in w
[jira] [Created] (MESOS-9767) Add self health monitoring in Mesos master
Gaurav Garg created MESOS-9767:
----------------------------------

             Summary: Add self health monitoring in Mesos master
                 Key: MESOS-9767
                 URL: https://issues.apache.org/jira/browse/MESOS-9767
             Project: Mesos
          Issue Type: Task
          Components: master
    Affects Versions: 1.6.2
            Reporter: Gaurav Garg
             Fix For: 1.7.2

We have seen an issue where the Mesos master got stuck and was not responding to HTTP endpoints like "/metrics/snapshot". As a result, calls to the master from frameworks and the metrics collector hang. Currently we emit a 'master alive' metric using prometheus. If the master hangs, this metric is not published, and we detect the hang using alerts on top of this metric. By the time someone has received the alert and restarted the master process, 15-30 minutes have passed. This results in SLA violations for Mesos cluster users.

It would be nice to implement self health check monitoring to detect whether the Mesos master is hung/stuck. This would let us quickly crash the master process so that one of the other members of the quorum can acquire the ZK leadership lock.

We can use the "/master/health" endpoint for health checks. Health checks can be initiated in [src/master/main.cpp|[https://github.com/apache/mesos/blob/master/src/master/main.cpp]] just after the child master process is [spawned.|[https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543]]

We can leverage the [HealthChecker|[https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]] for this one. One downside is that HealthChecker currently takes a TaskId as an input, which is not valid for a master health check.

We can add the following flags to control the self health checking:
# self_monitoring_enabled: Whether self monitoring is enabled.
# self_monitoring_consecutive_failures: Number of consecutive health check failures after which the master is crashed.
# self_monitoring_interval_secs: Interval at which health checks are performed.
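To make the three proposed flags concrete, here is a rough sketch of how they might interact. It is an illustration only, not the Mesos implementation and not based on the existing HealthChecker class. `SelfMonitoringFlags`, `FailureTracker` and `monitor` are hypothetical names, the default values are arbitrary, and `probe` is a placeholder for a real request to "/master/health".

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical settings mirroring the flags proposed in the ticket.
// The defaults are placeholders, not values from the ticket.
struct SelfMonitoringFlags {
  bool enabled = false;               // self_monitoring_enabled
  unsigned consecutive_failures = 3;  // self_monitoring_consecutive_failures
  std::chrono::seconds interval{10};  // self_monitoring_interval_secs
};

// Counts consecutive failed probes. Any success resets the count.
class FailureTracker {
public:
  explicit FailureTracker(unsigned threshold) : threshold_(threshold) {}

  // Records one probe result and returns true once the number of
  // consecutive failures has reached the threshold.
  bool record(bool healthy) {
    failures_ = healthy ? 0 : failures_ + 1;
    return failures_ >= threshold_;
  }

  unsigned failures() const { return failures_; }

private:
  unsigned threshold_;
  unsigned failures_ = 0;
};

// Probes at the configured interval until the threshold is reached, then
// invokes `onUnhealthy` (for example, a callback that aborts the process).
// Returns immediately if self monitoring is disabled.
void monitor(const SelfMonitoringFlags& flags,
             const std::function<bool()>& probe,
             const std::function<void()>& onUnhealthy) {
  if (!flags.enabled) {
    return;
  }
  FailureTracker tracker(flags.consecutive_failures);
  while (!tracker.record(probe())) {
    std::this_thread::sleep_for(flags.interval);
  }
  onUnhealthy();
}
```

Requiring consecutive failures rather than a total count means a single slow response does not crash an otherwise healthy master. In this sketch `monitor` blocks its calling thread, so it would have to run on a thread of its own, separate from whatever is hung.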