[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-10-21 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956543#comment-16956543
 ] 

Benjamin Mahler commented on MESOS-9767:


[~ggarg] is this issue still affecting you or should we close it?

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
>
> We have seen an issue where the Mesos master got stuck and stopped
> responding to HTTP endpoints such as "/metrics/snapshot". As a result, calls
> from the frameworks and the metrics collector to the master hang. Currently
> we emit a 'master alive' metric using Prometheus. If the master hangs, this
> metric is not published, and we detect the hang using alerts on top of this
> metric. By the time someone gets the alert and restarts the master process,
> 15-30 minutes have passed. This results in SLA violations for Mesos cluster
> users.
> It would be nice to implement self health check monitoring to detect whether
> the Mesos master is hung/stuck. This would let us quickly crash the master
> process so that one of the other members of the quorum can acquire the ZK
> leadership lock.
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
> for this. One downside is that HealthChecker currently takes a TaskId as
> input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: The master is crashed after this
> many consecutive health check failures.
>  # self_monitoring_interval_secs: Interval at which health checks are
> performed.
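
For illustration only, here is a minimal sketch of the counting logic that the three proposed flags imply. It is written in Python rather than Mesos C++, and nothing in it is existing Mesos code: the `SelfMonitor` and `run` names, the default values, and the injected `probe`, `sleep` and `crash` callables are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SelfMonitor:
    """Tracks consecutive health check failures (hypothetical helper)."""

    enabled: bool = True            # self_monitoring_enabled
    consecutive_failures: int = 3   # self_monitoring_consecutive_failures (assumed default)
    interval_secs: float = 10.0     # self_monitoring_interval_secs (assumed default)
    _failed_in_a_row: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when the master should crash."""
        if not self.enabled:
            return False
        # Any success resets the streak, so only back-to-back failures count.
        self._failed_in_a_row = 0 if healthy else self._failed_in_a_row + 1
        return self._failed_in_a_row >= self.consecutive_failures


def run(monitor, probe, sleep, crash):
    """Probe every interval until the failure threshold is reached, then crash."""
    if not monitor.enabled:
        return
    while True:
        if monitor.record(probe()):
            crash()
            return
        sleep(monitor.interval_secs)
```

In a real deployment `probe` would issue the health request with a timeout, `sleep` would be `time.sleep`, and `crash` would be something like `os.abort`. Injecting them keeps the counting logic testable without a running master.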



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834156#comment-16834156
 ] 

Gaurav Garg commented on MESOS-9767:


I only queried the /metrics/snapshot endpoint when the master was stuck. Next 
time I will also query the /master/health endpoint. 
Unfortunately, I captured the stack trace only once. My bad. Next time I will 
capture multiple stack traces.
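
One possible way to capture several stack traces a few seconds apart is sketched below. This is an assumption about workflow, not something taken from this ticket: it shells out to gdb in batch mode, and the function names, the dump count and the gap between dumps are made up for the example. Turning pagination off keeps gdb from writing interactive "continue" prompts into the output.

```python
import subprocess
import time


def gdb_backtrace_command(pid: int) -> list:
    """Build a non-interactive gdb command that dumps all thread stacks."""
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "set pagination off",
        "-ex", "thread apply all bt",
    ]


def capture_backtraces(pid, count=3, gap_secs=5.0,
                       run=subprocess.run, sleep=time.sleep):
    """Return `count` stack dumps of process `pid`, taken `gap_secs` apart."""
    dumps = []
    for i in range(count):
        result = run(gdb_backtrace_command(pid), capture_output=True, text=True)
        dumps.append(result.stdout)
        if i + 1 < count:
            sleep(gap_secs)  # no need to wait after the last dump
    return dumps
```

Comparing the dumps shows whether a thread is making progress or sitting in the same frames each time.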



[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Benjamin Mahler (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834119#comment-16834119
 ] 

Benjamin Mahler commented on MESOS-9767:


The bizarre thread stack is:

{noformat}
Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)):
#0  0x7fa1f05f01c2 in hash_combine_impl (k=52, h=)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:264
#1  hash_combine (v=, seed=)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337
#2  hash_range<__gnu_cxx::__normal_iterator > > (last=..., first=52 '4')
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:351
#3  hash_value > (v=...)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:410
#4  operator() (this=, v=...)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:486
#5  boost::hash_combine (seed=seed@entry=@0x7fa1e0e4c770: 0, v=...)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337
#6  0x7fa1f06ad178 in operator() (this=0x7fa1cc02d068, taskId=...)
    at /mesos/include/mesos/type_utils.hpp:634
#7  _M_hash_code (this=0x7fa1cc02d068, __k=...)
    at /usr/include/c++/4.9/bits/hashtable_policy.h:1261
#8  std::_Hashtable > > >, std::allocator > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::count (this=this@entry=0x7fa1cc02d068, __k=...)
    at /usr/include/c++/4.9/bits/hashtable.h:1336
#9  0x7fa1f0663eb2 in count (__x=..., this=0x7fa1cc02d068)
    at /usr/include/c++/4.9/bits/unordered_map.h:592
#10 contains (key=..., this=0x7fa1cc02d068)
    at /mesos/3rdparty/stout/include/stout/hashmap.hpp:88
#11 erase (key=..., this=0x7fa1cc02d050)
    at /mesos/3rdparty/stout/include/stout/boundedhashmap.hpp:92
#12 mesos::internal::master::Master::__reregisterSlave(process::UPID const&, mesos::internal::ReregisterSlaveMessage&&, process::Future const&) (this=0x561dcf047380, pid=..., reregisterSlaveMessage=, future=...)
    at /mesos/src/master/master.cpp:7369
#13 0x7fa1f14d54e1 in operator() (args#0=0x561dcf048620, this=)
    at /mesos/3rdparty/libprocess/../stout/include/stout/lambda.hpp:443
#14 process::ProcessBase::consume(process::DispatchEvent&&) (this=, event=)
    at /mesos/3rdparty/libprocess/src/process.cpp:3577
#15 0x7fa1f14e89b2 in serve (event=, this=0x561dcf048620)
    at /mesos/3rdparty/libprocess/include/process/process.hpp:87
#16 process::ProcessManager::resume (this=, process=0x561dcf048620)
    at /mesos/3rdparty/libprocess/src/process.cpp:3002
#17 0x7fa1f14ee226 in operator() (__closure=0x561dcf119158)
    at /mesos/3rdparty/libprocess/src/process.cpp:2511
#18 _M_invoke<> (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1700
#19 operator() (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1688
#20 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf119140)
    at /usr/include/c++/4.9/thread:115
#21 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x7fa1ee520064 in start_thread (arg=0x7fa1e0e4d700) at pthread_create.c:309
#23 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}

[~ggarg] is this trace present whenever it's hanging?
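
If several dumps are available, the question of whether a given trace shows up every time can be checked mechanically. The sketch below is only an illustration, and its helper names are hypothetical. It splits `thread apply all bt` output into per-thread stacks keyed by LWP and reports whether some thread mentions a given symbol (for example `__reregisterSlave`) in every dump.

```python
import re

# Matches gdb thread headers such as "Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)):"
THREAD_RE = re.compile(r"^Thread \d+ \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")


def split_threads(dump: str) -> dict:
    """Map each LWP id to the text of that thread's stack."""
    stacks = {}
    lwp = None
    for line in dump.splitlines():
        match = THREAD_RE.match(line.strip())
        if match:
            lwp = int(match.group(1))
            stacks[lwp] = []
        elif lwp is not None:
            stacks[lwp].append(line)
    return {key: "\n".join(lines) for key, lines in stacks.items()}


def threads_in(dump: str, symbol: str) -> set:
    """LWP ids of the threads whose stack mentions `symbol`."""
    return {lwp for lwp, stack in split_threads(dump).items() if symbol in stack}


def present_in_all(dumps, symbol: str) -> bool:
    """True if every dump has at least one thread inside `symbol`."""
    return all(threads_in(dump, symbol) for dump in dumps)
```

A plain substring match is crude, but it is enough to tell a thread that is always inside the same function from one that was only caught there once.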


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834118#comment-16834118
 ] 

Greg Mann commented on MESOS-9767:
--

[~ggarg] thanks for the info! Did the master respond to requests for the 
{{/master/health}} endpoint while it was stuck?
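
A sketch of how such a probe could be scripted, purely as an illustration. The `probe_health` name, the 5 second default timeout and the example host and port are assumptions. The timeout matters because a stuck process may accept the connection and then never answer, so a probe without one would hang just like the other callers.

```python
import urllib.request


def probe_health(url, timeout_secs=5.0, opener=urllib.request.urlopen):
    """Return True only if `url` answers 200 within `timeout_secs`."""
    try:
        with opener(url, timeout=timeout_secs) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses,
        # so refused connections, error statuses and hangs all count as unhealthy.
        return False
```

Usage would look like `probe_health("http://master.example.com:5050/master/health")`, where the host and port are placeholders.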

> Add self health monitoring in Mesos master
> --
>
> Key: MESOS-9767
> URL: https://issues.apache.org/jira/browse/MESOS-9767
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Affects Versions: 1.6.0
>Reporter: Gaurav Garg
>Priority: Major
> Fix For: 1.7.2
>


[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834089#comment-16834089
 ] 

Gaurav Garg commented on MESOS-9767:


The Mesos master stopped responding to HTTP requests at around 16:30. At around 
17:00, the master was restarted. Logs are attached after the stack trace.

Logs of Mesos master around the same time:

===

I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'
I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered'

I0429 16:26:45.669457 85889 master.cpp:8397] 
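
These lines follow the usual glog prefix layout (severity letter, month and day, time, thread id, then source file and line). The third field is the id of the logging thread, here 85889, which is the same number as the LWP of Thread 21 in the gdb trace posted on this ticket. A minimal parser sketch for pulling that field out of unwrapped log lines follows. The `LogLine` and `parse_glog_line` names are hypothetical, and the regular expression is an assumption based on the lines shown here.

```python
import re
from typing import NamedTuple, Optional

# Severity, MMDD, HH:MM:SS.fraction, thread id, file:line, then the message.
GLOG_RE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([^:\s]+):(\d+)\] ?(.*)$"
)


class LogLine(NamedTuple):
    severity: str
    date: str
    time: str
    thread_id: int
    source_file: str
    source_line: int
    message: str


def parse_glog_line(line: str) -> Optional[LogLine]:
    """Parse one unwrapped glog line; return None if it does not match."""
    match = GLOG_RE.match(line.strip())
    if match is None:
        return None
    severity, date, time, thread_id, source_file, source_line, message = match.groups()
    return LogLine(severity, date, time, int(thread_id),
                   source_file, int(source_line), message)
```

Filtering parsed lines by `thread_id` gives the last thing a particular thread logged before it went quiet, which can then be lined up against the gdb dump.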

[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

2019-05-06 Thread Gaurav Garg (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088
 ] 

Gaurav Garg commented on MESOS-9767:


Stack trace of the Mesos master when the hang was detected. Captured using gdb.

 

Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688
#8  std::thread::_Impl()> >::_M_run(void) (this=0x561dcf0ae750)
    at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688
#8  std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11ff20)
    at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688
#8  std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11d970)
    at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)
    at /mesos/3rdparty/libprocess/src/semaphore.hpp:115
#2  wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154
#3  wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73
#4  process::ProcessManager::dequeue (this=0x561dcf063970)
    at /mesos/3rdparty/libprocess/src/process.cpp:3305
#5  0x7fa1f14ee22f in operator() (__closure=0x561dcf128758)
    at /mesos/3rdparty/libprocess/src/process.cpp:2505
#6  _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700
#7  operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688
#8  std::thread::_Impl()> >::_M_run(void) (this=0x561dcf128740)
    at /usr/include/c++/4.9/thread:115
#9  0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309
#11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

 

Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)):
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x7fa1f14d6e82 in wait (this=)