Ji Huang created MESOS-2014:
-------------------------------

             Summary: error of Recovery failed: Failed to recover registrar: 
Failed to perform fetch within 5mins
                 Key: MESOS-2014
                 URL: https://issues.apache.org/jira/browse/MESOS-2014
             Project: Mesos
          Issue Type: Bug
          Components: master
    Affects Versions: 0.20.1
         Environment: CentOS 6.3
 3.10.5-12.1.x86_64 #1 SMP Fri Aug 16 01:42:38 UTC 2013 x86_64 x86_64 x86_64 
GNU/Linux
            Reporter: Ji Huang


I set up a Mesos master cluster with 3 nodes. At first everything works 
well, but when the leading master dies, the other candidate nodes cannot 
recover and elect a new leader; all of the candidate nodes then die too.

I1030 15:01:32.005691  6741 detector.cpp:138] Detected a new leader: (id='16')
I1030 15:01:32.005692  6737 network.hpp:423] ZooKeeper group memberships changed
I1030 15:01:32.006089  6741 group.cpp:658] Trying to get 
'/mesos/info_0000000016' in ZooKeeper
I1030 15:01:32.006222  6738 group.cpp:658] Trying to get 
'/mesos/log_replicas/0000000015' in ZooKeeper
I1030 15:01:32.007230  6738 group.cpp:658] Trying to get 
'/mesos/log_replicas/0000000016' in ZooKeeper
I1030 15:01:32.007268  6736 detector.cpp:426] A new leading master 
(UPID=master@10.99.169.5:5050) is detected
I1030 15:01:32.007546  6742 master.cpp:1196] The newly elected leader is 
master@10.99.169.5:5050 with id 20141030-150042-94987018-5050-6735
I1030 15:01:32.007640  6742 master.cpp:1209] Elected as the leading master!
I1030 15:01:32.007730  6742 master.cpp:1027] Recovering from registrar
I1030 15:01:32.007895  6736 registrar.cpp:313] Recovering registrar
I1030 15:01:32.008388  6742 network.hpp:461] ZooKeeper group PIDs: { 
log-replica(1)@10.99.169.5:5050, log-replica(1)@10.99.169.6:5050 }
I1030 15:01:32.051316  6742 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:32.889194  6738 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:33.469511  6743 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:34.324684  6740 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:35.263629  6736 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:36.212492  6739 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:37.015682  6742 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:37.781746  6743 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:38.494547  6737 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:39.186830  6740 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:40.072258  6736 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:40.855337  6743 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:41.516916  6739 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:41.556437  6744 recover.cpp:111] Unable to finish the recover 
protocol in 10secs, retrying
I1030 15:01:41.557253  6741 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:41.557502  6739 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I1030 15:01:41.558156  6741 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I1030 15:01:42.153370  6737 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:42.505698  6742 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I1030 15:01:42.506060  6738 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I1030 15:01:42.507046  6742 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
......
F1030 15:06:32.009464  6741 master.cpp:1016] Recovery failed: Failed to recover 
registrar: Failed to perform fetch within 5mins
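
As a rough cross-check of the timestamps in the log above, the following 
sketch parses two of its lines and measures the time between "Recovering 
registrar" and the fatal message. The names GLOG_LINE and parse_glog_line 
are hypothetical helpers written for this illustration and are not part of 
Mesos or glog; the regular expression assumes the prefix layout seen in 
this log (severity letter, MMDD, time with microseconds, thread id, 
file:line).

```python
import re
from datetime import timedelta

# Assumed glog prefix layout: severity letter, MMDD, HH:MM:SS.micros,
# thread id, source file:line, then "] message".
GLOG_LINE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}):(\d{2}):(\d{2})\.(\d{6})\s+(\d+) (\S+):(\d+)\] (.*)$"
)


def parse_glog_line(line):
    """Return (severity, time_of_day, source_file, message), or None if no match."""
    match = GLOG_LINE.match(line)
    if match is None:
        return None
    severity, _mmdd, hh, mm, ss, micros, _tid, source, _lineno, message = match.groups()
    time_of_day = timedelta(
        hours=int(hh), minutes=int(mm), seconds=int(ss), microseconds=int(micros)
    )
    return severity, time_of_day, source, message


# Two lines copied from the log above, with the email line wrapping undone.
start_line = "I1030 15:01:32.007895  6736 registrar.cpp:313] Recovering registrar"
fatal_line = (
    "F1030 15:06:32.009464  6741 master.cpp:1016] Recovery failed: "
    "Failed to recover registrar: Failed to perform fetch within 5mins"
)

_, started, _, _ = parse_glog_line(start_line)
severity, failed, source, _ = parse_glog_line(fatal_line)

# Both lines carry the same date (1030), so time-of-day subtraction is enough.
elapsed = failed - started
print(severity, source, elapsed)
```

The elapsed time comes out just over five minutes, which matches the 
"within 5mins" timeout in the fatal message: the master aborted when the 
registrar fetch timed out, not because of a separate crash.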


Core dump info:
#0  0x0000003d636328a5 in raise () from /lib64/libc.so.6
#1  0x0000003d63634085 in abort () from /lib64/libc.so.6
#2  0x00007f7a452f0e19 in google::DumpStackTraceAndExit () at 
src/utilities.cc:147
#3  0x00007f7a452e7d5d in google::LogMessage::Fail () at src/logging.cc:1458
#4  0x00007f7a452ebd77 in google::LogMessage::SendToLog (this=0x7f7a41d8f9d0) 
at src/logging.cc:1412
#5  0x00007f7a452e9bf9 in google::LogMessage::Flush (this=0x7f7a41d8f9d0) at 
src/logging.cc:1281
#6  0x00007f7a452e9efd in google::LogMessageFatal::~LogMessageFatal 
(this=0x7f7a41d8f9d0, __in_chrg=<value optimized out>) at src/logging.cc:1984
#7  0x00007f7a44d6759c in mesos::internal::master::fail (message="Recovery 
failed", failure="Failed to recover registrar: Failed to perform fetch within 
5mins") at ../../src/master/master.cpp:1016
#8  0x00007f7a44da75a6 in __call<std::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&, 0, 1> (__functor=<value 
optimized out>, __args#0=
    "Failed to recover registrar: Failed to perform fetch within 5mins") at 
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
#9  operator()<const std::basic_string<char, std::char_traits<char>, 
std::allocator<char> > > (__functor=<value optimized out>, __args#0="Failed to 
recover registrar: Failed to perform fetch within 5mins")
    at 
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
#10 std::tr1::_Function_handler<void(const std::string&), std::tr1::_Bind<void 
(*(const char*, std::tr1::_Placeholder<1>))(const std::string&, const 
std::string&)> >::_M_invoke(const std::tr1::_Any_data &, const std::string &) 
(__functor=<value optimized out>, __args#0="Failed to recover registrar: Failed 
to perform fetch within 5mins")
    at 
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
#11 0x00007f7a44caff3c in process::Future<Nothing>::fail (this=0x7f7a140164f8, 
_message=<value optimized out>) at 
../../3rdparty/libprocess/include/process/future.hpp:1628
#12 0x00007f7a44de1a6a in fail (promise=std::tr1::shared_ptr (count 1) 
0x7f7a140164f0, f=..., future=<value optimized out>) at 
../../3rdparty/libprocess/include/process/future.hpp:789
#13 process::internal::thenf<mesos::internal::Registry, Nothing>(const 
std::tr1::shared_ptr<process::Promise<Nothing> > &, const 
std::tr1::function<process::Future<Nothing>(const mesos::internal::Registry&)> 
&, const process::Future<mesos::internal::Registry> &) 
(promise=std::tr1::shared_ptr (count 1) 0x7f7a140164f0, f=..., future=<value 
optimized out>) at ../../3rdparty/libprocess/include/process/future.hpp:1438
#14 0x00007f7a44e18ffc in process::Future<mesos::internal::Registry>::fail 
(this=0x7f7a2800be68, _message=<value optimized out>) at 
../../3rdparty/libprocess/include/process/future.hpp:1634
#15 0x00007f7a44e18f9c in process::Future<mesos::internal::Registry>::fail 
(this=0x7f7a2801c488, _message=<value optimized out>) at 
../../3rdparty/libprocess/include/process/future.hpp:1628
#16 0x00007f7a44e0cf4c in fail (this=0x2179b80, info=<value optimized out>, 
recovery=<value optimized out>) at 
../../3rdparty/libprocess/include/process/future.hpp:789
#17 mesos::internal::master::RegistrarProcess::_recover (this=0x2179b80, 
info=<value optimized out>, recovery=<value optimized out>) at 
../../src/master/registrar.cpp:341
#18 0x00007f7a44e24181 in __call<process::ProcessBase*&, 0, 1> 
(__functor=<value optimized out>, __args#0=<value optimized out>)
    at 
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1137
#19 operator()<process::ProcessBase*> (__functor=<value optimized out>, 
__args#0=<value optimized out>) at 
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1191
#20 std::tr1::_Function_handler<void(process::ProcessBase*), 
std::tr1::_Bind<void (*(std::tr1::_Placeholder<1>, 
std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)>
 >))(process::ProcessBase*, 
std::tr1::shared_ptr<std::tr1::function<void(mesos::internal::master::RegistrarProcess*)>
 >)> >::_M_invoke(const std::tr1::_Any_data &, process::ProcessBase *) 
(__functor=<value optimized out>, 
    __args#0=<value optimized out>) at 
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/tr1_impl/functional:1668
#21 0x00007f7a452814f4 in process::ProcessManager::resume (this=0x214b690, 
process=0x2179e28) at ../../../3rdparty/libprocess/src/process.cpp:2848
#22 0x00007f7a45281dec in process::schedule (arg=<value optimized out>) at 
../../../3rdparty/libprocess/src/process.cpp:1479
#23 0x0000003d63a07851 in start_thread () from /lib64/libpthread.so.0
#24 0x0000003d636e811d in clone () from /lib64/libc.so.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
