[ https://issues.apache.org/jira/browse/MESOS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220245#comment-14220245 ]
Benjamin Mahler commented on MESOS-2148: ---------------------------------------- Yes, that's definitely a requirement of running HA masters, they need to be restarted in a timely manner (O(seconds)). > Cluster formation often fails when starting mesos-masters with > --registry=replicated_log > ---------------------------------------------------------------------------------------- > > Key: MESOS-2148 > URL: https://issues.apache.org/jira/browse/MESOS-2148 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.20.1 > Environment: Mesos 0.20.1 > Zookeeper 3.3.5 > Debian Wheezy > Reporter: Jeremy Lingmann > > When starting mesos-masters in close proximity (within 1min of each other) to > form a new cluster, we often see the cluster fail to form with 1 master > committing suicide and the other 2 exiting with a stack trace. Here are the > logs for a cluster with 3 mesos masters showing this behavior. > https://gist.github.com/lingmann/78d1d65f02b28a419b12#file-gistfile1-txt-L447 > {code} > Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:01.858300 2321 recover.cpp:188] Received a recover response from a > replica in EMPTY status > Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:01.859736 2321 recover.cpp:188] Received a recover response from a > replica in EMPTY status > Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:01.927644 2321 replica.cpp:638] Replica in EMPTY status received a > broadcasted recover request > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: F1118 > 22:30:02.621289 2328 master.cpp:1016] Recovery failed: Failed to recover > registrar: Failed to perform fetch within 1mins > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: *** Check > failure stack trace: *** > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:02.654814 2321 replica.cpp:638] Replica in EMPTY status received a > broadcasted recover request > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:02.655992 2327 recover.cpp:188] Received a recover response from a > replica in EMPTY status > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:02.658712 2327 recover.cpp:188] Received a recover response from a > replica in EMPTY status > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 > 22:30:02.708247 2327 replica.cpp:638] Replica in EMPTY status received a > broadcasted recover request > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383caf3acd google::LogMessage::Fail() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383caf5883 google::LogMessage::SendToLog() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383caf36e8 google::LogMessage::Flush() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383caf60de google::LogMessageFatal::~LogMessageFatal() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c5c7946 mesos::internal::master::fail() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c622c31 > _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c622caa std::tr1::_Function_handler<>::_M_invoke() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c521aa9 process::Future<>::fail() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c661b7a process::internal::thenf<>() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c67b1e8 process::Future<>::fail() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c67b179 process::Future<>::fail() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c66a629 mesos::internal::master::RegistrarProcess::_recover() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c67f1cb > _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383c67f224 std::tr1::_Function_handler<>::_M_invoke() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383ca9e57c process::ProcessManager::resume() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383ca9ee1c process::schedule() > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383a7bdb50 start_thread > Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ > 0x7f383a5077bd (unknown) > {code} > https://gist.github.com/lingmann/fcf36fcce8ecc7eb7838#file-gistfile1-txt-L468 > {code} > Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 > 22:30:14.013592 2329 master.cpp:1196] The newly elected leader is > master@10.6.163.44:5050 with id 20141118-222859-748881418-5050-2157 > Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 > 22:30:14.013720 2329 master.cpp:1209] Elected as the leading master! > Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 > 22:30:14.013757 2329 master.cpp:1027] Recovering from registrar > Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 > 22:30:14.013967 2327 registrar.cpp:313] Recovering registrar > Nov 18 22:30:23 highly-available-2162-ac2 mesos-master[2157]: I1118 > 22:30:23.576931 2330 recover.cpp:111] Unable to finish the recover protocol > in 10secs, retrying > Nov 18 22:30:33 highly-available-2162-ac2 mesos-master[2157]: I1118 > 22:30:33.578037 2330 recover.cpp:111] Unable to finish the recover protocol > in 10secs, retrying > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: F1118 > 22:31:14.015543 2329 master.cpp:1016] Recovery failed: Failed to recover > registrar: Failed to perform fetch within 1mins > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: *** Check > failure stack trace: *** > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491c36bacd google::LogMessage::Fail() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491c36d883 google::LogMessage::SendToLog() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491c36b6e8 google::LogMessage::Flush() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491c36e0de google::LogMessageFatal::~LogMessageFatal() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491be3f946 mesos::internal::master::fail() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491be9ac31 > _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491be9acaa std::tr1::_Function_handler<>::_M_invoke() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bd99aa9 process::Future<>::fail() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bed9b7a process::internal::thenf<>() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bef31e8 process::Future<>::fail() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bef3179 process::Future<>::fail() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bee2629 mesos::internal::master::RegistrarProcess::_recover() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bef71cb > _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491bef7224 std::tr1::_Function_handler<>::_M_invoke() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491c31657c process::ProcessManager::resume() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491c316e1c process::schedule() > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f491a035b50 start_thread > Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ > 0x7f4919d7f7bd (unknown) > {code} > https://gist.github.com/lingmann/5691795a4f9d54632e5a#file-gistfile1-txt-L168 > {code} > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 > 22:28:35,644:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1697: > Socket [10.6.163.44:2181] zk retcode=-4, errno=111(Connection refused): > server refused to accept the client > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 > 22:28:35,644:2124(0x7f2e51d04700):ZOO_INFO@check_events@1703: initiated > connection to server [10.67.129.108:2181] > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 > 22:28:35,645:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1721: > Socket [10.67.129.108:2181] zk retcode=-4, errno=112(Host is down): failed > while receiving a server response > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: W1118 > 22:28:35.643823 2180 group.cpp:456] Timed out waiting to reconnect to > ZooKeeper. Forcing ZooKeeper session (sessionId=149c5026bf20001) expiration > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 > 22:28:35.645727 2182 group.cpp:472] ZooKeeper session expired > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 > 22:28:35.645884 2179 detector.cpp:138] Detected a new leader: None > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 > 22:28:35.645786 2181 group.cpp:472] ZooKeeper session expired > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 > 22:28:35.645982 2179 contender.cpp:196] Membership cancelled: 1 > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 > 22:28:35.646026 2179 master.cpp:1196] The newly elected leader is None > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: Lost > leadership... committing suicide! > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 > 22:28:35.646095 2180 group.cpp:472] ZooKeeper session expired > Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 > 22:28:35,646:2124(0x7f2e56543700):ZOO_INFO@zookeeper_close@2522: Freeing > zookeeper resources for sessionId=0x149c5026bf20002 > {code} > Here are the mesos-master startup options being used: > {code} > /usr/local/sbin/mesos-master \ > --zk=zk://10.132.193.161:2181,10.132.193.153:2181,10.132.193.160:2181/mesos > \ > --port=5050 \ > --log_dir=/var/log/mesos \ > --cluster=test \ > --hostname=10.132.193.161 \ > --ip=10.132.193.161 \ > --quorum=2 \ > --work_dir=/var/lib/mesos > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)