[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master

Benjamin Mahler (JIRA) Mon, 06 May 2019 11:59:37 -0700


    [ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834119#comment-16834119 ]


Benjamin Mahler commented on MESOS-9767:
----------------------------------------

The bizarre thread stack is:

{noformat}
Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)):

#0  0x00007fa1f05f01c2 in hash_combine_impl (k=52, h=<synthetic pointer>)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:264
#1  hash_combine<char> (v=<optimized out>, seed=<synthetic pointer>)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337
#2  hash_range<__gnu_cxx::__normal_iterator<char const*, std::basic_string<char> > > (last=...,
    first=52 '4') at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:351
#3  hash_value<char, std::allocator<char> > (v=...)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:410
#4  operator() (this=<optimized out>, v=...)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:486
#5  boost::hash_combine<std::string> (seed=seed@entry=@0x7fa1e0e4c770: 0, v=...)
    at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337
#6  0x00007fa1f06ad178 in operator() (this=0x7fa1cc02d068, taskId=...)
    at /mesos/include/mesos/type_utils.hpp:634
#7  _M_hash_code (this=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable_policy.h:1261
#8  std::_Hashtable<mesos::TaskID, std::pair<mesos::TaskID const, std::_List_iterator<std::pair<mesos::TaskID, process::Owned<mesos::Task> > > >, std::allocator<std::pair<mesos::TaskID const, std::_List_iterator<std::pair<mesos::TaskID, process::Owned<mesos::Task> > > > >, std::__detail::_Select1st, std::equal_to<mesos::TaskID>, std::hash<mesos::TaskID>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::count (this=this@entry=0x7fa1cc02d068, __k=...)
    at /usr/include/c++/4.9/bits/hashtable.h:1336
#9  0x00007fa1f0663eb2 in count (__x=..., this=0x7fa1cc02d068)
    at /usr/include/c++/4.9/bits/unordered_map.h:592
#10 contains (key=..., this=0x7fa1cc02d068) at /mesos/3rdparty/stout/include/stout/hashmap.hpp:88
#11 erase (key=..., this=0x7fa1cc02d050)
    at /mesos/3rdparty/stout/include/stout/boundedhashmap.hpp:92
#12 mesos::internal::master::Master::__reregisterSlave(process::UPID const&, mesos::internal::ReregisterSlaveMessage&&, process::Future<bool> const&) (this=0x561dcf047380, pid=...,
    reregisterSlaveMessage=<unknown type in /usr/local/lib/libmesos-1.6.0.so, CU 0x30075d6, DIE 0x38a83be>, future=...) at /mesos/src/master/master.cpp:7369
#13 0x00007fa1f14d54e1 in operator() (args#0=0x561dcf048620, this=<optimized out>)
    at /mesos/3rdparty/libprocess/../stout/include/stout/lambda.hpp:443
#14 process::ProcessBase::consume(process::DispatchEvent&&) (this=<optimized out>,
    event=<optimized out>) at /mesos/3rdparty/libprocess/src/process.cpp:3577
#15 0x00007fa1f14e89b2 in serve (
    event=<unknown type in /usr/local/lib/libmesos-1.6.0.so, CU 0x14b8d81b, DIE 0x14e9f25d>,
    this=0x561dcf048620) at /mesos/3rdparty/libprocess/include/process/process.hpp:87
#16 process::ProcessManager::resume (this=<optimized out>, process=0x561dcf048620)
    at /mesos/3rdparty/libprocess/src/process.cpp:3002
#17 0x00007fa1f14ee226 in operator() (__closure=0x561dcf119158)
    at /mesos/3rdparty/libprocess/src/process.cpp:2511
#18 _M_invoke<> (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1700
#19 operator() (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1688
#20 std::thread::_Impl<std::_Bind_simple<process::ProcessManager::init_threads()::<lambda()>()> >::_M_run(void) (this=0x561dcf119140) at /usr/include/c++/4.9/thread:115
#21 0x00007fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007fa1ee520064 in start_thread (arg=0x7fa1e0e4d700) at pthread_create.c:309
#23 0x00007fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
{noformat}
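
For readers less familiar with the containers involved, frames #5 to #11 can be read as an erase on an insertion-ordered hash map that first checks membership, which in turn hashes the TaskID string. The following is only a simplified sketch of that call shape, not the Mesos or stout source: `OrderedMap` and the `TaskID` struct are hypothetical stand-ins, and `std::hash<std::string>` is used in place of the `boost::hash_combine` call that appears in the trace.

```cpp
#include <cstddef>
#include <functional>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for mesos::TaskID; only the string value matters here.
struct TaskID
{
  std::string value;
  bool operator==(const TaskID& that) const { return value == that.value; }
};

namespace std {
template <>
struct hash<TaskID>
{
  size_t operator()(const TaskID& taskId) const
  {
    // The trace shows boost::hash_combine here (frames #0 to #6);
    // std::hash<std::string> is an assumption made to keep this self-contained.
    return hash<string>()(taskId.value);
  }
};
} // namespace std

// Hypothetical insertion-ordered map: a list holds the entries and an
// unordered_map indexes each key to its position in the list.
template <typename K, typename V>
class OrderedMap
{
public:
  void put(const K& key, const V& value)
  {
    erase(key); // Replace any existing entry for this key.
    entries.emplace_back(key, value);
    index[key] = std::prev(entries.end());
  }

  // Corresponds to frames #9 and #10: a count() lookup that hashes the key.
  bool contains(const K& key) const { return index.count(key) > 0; }

  // Corresponds to frame #11: check membership first, then remove.
  std::size_t erase(const K& key)
  {
    if (!contains(key)) {
      return 0;
    }
    entries.erase(index[key]);
    index.erase(key);
    return 1;
  }

  std::size_t size() const { return entries.size(); }

private:
  std::list<std::pair<K, V>> entries;
  std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index;
};
```

In this sketch each `erase` does one hash computation plus a list splice, so a single call is cheap. That is part of why a thread repeatedly observed inside the hash function looks unusual.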

[~ggarg] is this trace present whenever it's hanging?

> Add self health monitoring in Mesos master
> ------------------------------------------
>
>                 Key: MESOS-9767
>                 URL: https://issues.apache.org/jira/browse/MESOS-9767
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Gaurav Garg
>            Priority: Major
>             Fix For: 1.7.2
>
>
> We have seen an issue where the Mesos master got stuck and stopped responding 
> to HTTP endpoints like "/metrics/snapshot". This causes calls to the master 
> from frameworks and the metrics collector to hang. Currently we emit a 
> 'master alive' metric using Prometheus. If the master hangs, this metric is 
> not published, and we detect the hang using alerts on top of this metric. By 
> the time someone has received the alert and restarted the master process, 
> 15-30 minutes have passed. This results in SLA violations for Mesos 
> cluster users.
> It would be nice to implement self health check monitoring to detect when the 
> Mesos master is hung/stuck. This would let us quickly crash the master 
> process so that one of the other members of the quorum can acquire the ZK 
> leadership lock.
> We can use the "/master/health" endpoint for health checks. 
> Health checks can be initiated in 
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is 
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the 
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
> for this. One downside is that HealthChecker currently takes a TaskID as 
> an input, which is not valid for a master health check. 
> We can add the following flags to control the self health checking:
>  # self_monitoring_enabled: Whether self monitoring is enabled.
>  # self_monitoring_consecutive_failures: After this many consecutive health 
> check failures, the master is crashed.
>  # self_monitoring_interval_secs: Interval at which health checks are 
> performed.
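
The consecutive-failure rule proposed in the flags above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not an existing Mesos API: `SelfMonitor` is a hypothetical helper, its constructor argument plays the role of the proposed self_monitoring_consecutive_failures flag, and the HTTP request to "/master/health" is left to the caller.

```cpp
#include <cstddef>

// Hypothetical helper, not part of Mesos: counts consecutive failed health
// checks and reports when the configured limit has been reached.
class SelfMonitor
{
public:
  // `maxConsecutiveFailures` corresponds to the proposed
  // self_monitoring_consecutive_failures flag (an assumption of this sketch).
  explicit SelfMonitor(std::size_t maxConsecutiveFailures)
    : limit(maxConsecutiveFailures), failures(0) {}

  // Records the result of one health check. Returns true when the number of
  // consecutive failures has reached the limit, meaning the caller should
  // crash the master. Any healthy result resets the count.
  bool record(bool healthy)
  {
    if (healthy) {
      failures = 0;
      return false;
    }
    ++failures;
    return failures >= limit;
  }

  std::size_t consecutiveFailures() const { return failures; }

private:
  std::size_t limit;
  std::size_t failures;
};
```

A caller would presumably issue one check per self_monitoring_interval_secs, treat a timed-out request as a failure, and terminate the process (for example with `std::abort()`) once `record` returns true. Resetting the count on any success keeps a single slow response from crashing an otherwise healthy master.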



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
