[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834156#comment-16834156
 ] 

Gaurav Garg commented on MESOS-9767:


I only queried the /metrics/snapshot endpoint when the master was stuck. Next 
time I will also query the /master/health endpoint. 
Unfortunately, I captured the stack trace only once. My bad. Next time I will 
capture multiple stack traces.

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
>
> We have seen an issue where the Mesos master got stuck and did not respond to 
> HTTP endpoints like "/metrics/snapshot". This causes calls from the 
> frameworks and the metrics collector to the master to hang. Currently we emit 
> a 'master alive' metric using Prometheus. If the master hangs, this metric is 
> not published, and we detect the hang using alerts on top of this metric. By 
> the time someone gets the alert and restarts the master process, 
> 15-30 minutes have passed. This results in SLA violations for Mesos 
> cluster users.
> It would be nice to implement self health check monitoring to detect whether 
> the Mesos master is hung/stuck. This would let us quickly crash the master 
> process so that one of the other members of the quorum can acquire the ZK 
> leadership lock.
> We can use the "/master/health" endpoint for health checks. 
> Health checks can be initiated in 
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
>  just after the child master process is 
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the 
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
>  for this. One downside is that HealthChecker currently takes a TaskId as 
> input, which is not valid for a master health check. 
> We can add the following flags to control the self health checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: After this many consecutive health 
> check failures, the master is crashed.
>  # self_monitoring_interval_secs: Interval at which health checks are 
> performed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834089#comment-16834089
 ] 

Gaurav Garg commented on MESOS-9767:


The Mesos master stopped responding to HTTP requests at around 16:30. At around 
17:00, the master was restarted. Logs are attached after the stack trace.

Logs of Mesos master around the same time:

===

I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED 
for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 
3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669457 85889 

[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834088#comment-16834088
 ] 

Gaurav Garg commented on MESOS-9767:


Stack trace of the Mesos master when the hang was detected. Captured using gdb.

 

Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in wait (this=)

    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115

#2  wait (this=) at 
/mesos/3rdparty/libprocess/src/semaphore.hpp:154

#3  wait (this=) at 
/mesos/3rdparty/libprocess/src/run_queue.hpp:73

#4  process::ProcessManager::dequeue (this=0x561dcf063970)

    at /mesos/3rdparty/libprocess/src/process.cpp:3305

#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf128758)

    at /mesos/3rdparty/libprocess/src/process.cpp:2505

#6  _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700

#7  operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688

#8  
std::thread::_Impl()>
 >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115

#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6

#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at 
pthread_create.c:309

#11 0x7fa1ee25562d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):

#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85

#1  0x7fa1f14d6e82 in w

[jira] [Created] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-03 Thread Gaurav Garg (JIRA)
Gaurav Garg created MESOS-9767:
--

 Summary: Add self health monitoring in Mesos master
 Key: MESOS-9767
 URL: https://issues.apache.org/jira/browse/MESOS-9767
 Project: Mesos
  Issue Type: Task
  Components: master
Affects Versions: 1.6.2
Reporter: Gaurav Garg
 Fix For: 1.7.2


We have seen an issue where the Mesos master got stuck and did not respond to 
HTTP endpoints like "/metrics/snapshot". This causes calls from the frameworks 
and the metrics collector to the master to hang. Currently we emit a 'master 
alive' metric using Prometheus. If the master hangs, this metric is not 
published, and we detect the hang using alerts on top of this metric. By the 
time someone gets the alert and restarts the master process, 15-30 minutes have 
passed. This results in SLA violations for Mesos cluster users.

It would be nice to implement self health check monitoring to detect whether 
the Mesos master is hung/stuck. This would let us quickly crash the master 
process so that one of the other members of the quorum can acquire the ZK 
leadership lock.



We can use the "/master/health" endpoint for health checks. 
Health checks can be initiated in 
[src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
 just after the child master process is 
[spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].

We can leverage the 
[HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
 for this. One downside is that HealthChecker currently takes a TaskId as 
input, which is not valid for a master health check. 



We can add the following flags to control the self health checking:
 # self_monitoring_enabled: Whether self monitoring is enabled.
 # self_monitoring_consecutive_failures: After this many consecutive health 
check failures, the master is crashed.
 # self_monitoring_interval_secs: Interval at which health checks are performed.
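
To make the intended behaviour of the three flags concrete, here is a rough, standalone C++ sketch of the consecutive-failure logic. It is not Mesos code and does not use HealthChecker: the SelfMonitoringFlags struct, the SelfMonitor class, and the probe and onUnhealthy callbacks are all hypothetical names made up for illustration. Only the flag names come from the list above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical container for the three proposed flags. The default
// values are placeholders, not values taken from the proposal.
struct SelfMonitoringFlags {
  bool enabled = false;                  // self_monitoring_enabled
  std::size_t consecutive_failures = 3;  // self_monitoring_consecutive_failures
  unsigned interval_secs = 10;           // self_monitoring_interval_secs
};

// Illustrative only: counts consecutive failed probes and invokes
// `onUnhealthy` once the configured threshold is reached.
class SelfMonitor {
public:
  SelfMonitor(SelfMonitoringFlags flags,
              std::function<bool()> probe,
              std::function<void()> onUnhealthy)
    : flags_(flags),
      probe_(std::move(probe)),
      onUnhealthy_(std::move(onUnhealthy)) {
    assert(flags_.consecutive_failures > 0);
  }

  // Runs a single probe. The caller is expected to invoke this once
  // every `interval_secs` seconds. Returns true if the failure
  // threshold was reached (and `onUnhealthy` was invoked) on this call.
  bool tick() {
    if (!flags_.enabled) {
      return false;
    }
    if (probe_()) {
      failures_ = 0;  // Any success resets the streak.
      return false;
    }
    if (++failures_ < flags_.consecutive_failures) {
      return false;
    }
    onUnhealthy_();
    return true;
  }

  std::size_t failures() const { return failures_; }

private:
  SelfMonitoringFlags flags_;
  std::function<bool()> probe_;
  std::function<void()> onUnhealthy_;
  std::size_t failures_ = 0;
};
```

In a real master, the probe would presumably be an HTTP request to "/master/health" with a timeout, and onUnhealthy would crash the process (for example via std::abort()) so that another quorum member can take over. The monitor would also have to run on a thread that cannot itself be blocked by whatever is hanging the master. All of that is left out of this sketch.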


