When you start a new group of masters, the masters will not initialize their replicated log (from the EMPTY state) until all masters are present. This means (quorum * 2 - 1) masters must be up and reachable.
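As a quick back-of-the-envelope check (a hypothetical snippet, not part of Mesos itself), the sizing rule can be written out in shell:

```shell
# Sizing rule for bootstrapping a new replicated log from the EMPTY state:
# a group in which `quorum` is a strict majority has (quorum * 2 - 1)
# members, and all of them must be up and reachable for initialization
# to proceed.
quorum=2   # e.g. MESOS_quorum=2
masters_required=$(( quorum * 2 - 1 ))
echo "MESOS_quorum=$quorum => $masters_required masters must be running"
```

With MESOS_quorum=2, as in the config further down this thread, only three running masters can initialize the log; starting two leaves the registrar fetch waiting until it times out.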
We enforce this behavior because the replicated log can otherwise get into an inconsistent state. Consider a simple case where you have an existing group of 3 masters:

1) Each master's replicated log is up-to-date with the leader.
2) Two masters are completely destroyed (their disks blow up, or something).
3) You bring two new masters up.

If we allowed a quorum of new masters to rejoin an existing cluster, the surviving old master's data would become the source of truth, because it has the highest log position. This is not necessarily correct. By disallowing a quorum of new masters from rejoining an existing cluster, it becomes the operator's job to recover after catastrophic failures.

On Tue, Jul 19, 2016 at 2:57 AM, 梦开始的地方 <382607...@qq.com> wrote:

> Yes, I started 1 master and it works fine. I will try 3 masters later, thanks.
>
> ------------------ Original Message ------------------
> *From:* "haosdent" <haosd...@gmail.com>
> *Sent:* Tuesday, July 19, 2016, 5:39 PM
> *To:* "user" <user@mesos.apache.org>
> *Subject:* Re: mesos crash
>
> I think it may be because you started 2 masters and set MESOS_quorum to 2,
> so the election could not finish successfully. Could you start 3 masters?
> Or remove ZooKeeper and just start 1 master.
>
> On Tue, Jul 19, 2016 at 5:26 PM, 梦开始的地方 <382607...@qq.com> wrote:
>
>> No, I deployed them on two different servers.
>>
>> ------------------ Original Message ------------------
>> *From:* "haosdent" <haosd...@gmail.com>
>> *Sent:* Tuesday, July 19, 2016, 5:05 PM
>> *To:* "user" <user@mesos.apache.org>
>> *Subject:* Re: mesos crash
>>
>> Hi,
>> > I start two master node:
>> Did you start them on the same server with the same work dir?
>>
>> On Tue, Jul 19, 2016 at 12:18 PM, 梦开始的地方 <382607...@qq.com> wrote:
>>
>>> Hello, I deployed Mesos on CentOS (kernel 3.14.73), Mesos version 1.0.0.
>>> This is my master config:
>>>
>>> export MESOS_log_dir=/apps/mesos/logs/
>>> export MESOS_ip=0.0.0.0
>>> export MESOS_hostname=`hostname`
>>> export MESOS_logging_level=INFO
>>> export MESOS_quorum=2
>>> export MESOS_work_dir=/apps/mesos/master
>>> export MESOS_zk=zk://zk1:2181,zk2:2181,zk3:2181/oss-mesos
>>> export MESOS_allocator=HierarchicalDRF
>>> export MESOS_cluster=oss-mesos
>>> export MESOS_credentials=/apps/mesos/etc/mesos/credentials.txt
>>> export MESOS_registry=replicated_log
>>> export MESOS_webui_dir=/apps/mesos/share/mesos/webui
>>> export MESOS_zk_session_timeout=90secs
>>> export MESOS_max_executors_per_slave=10
>>> export MESOS_registry_fetch_timeout=2mins
>>>
>>> I started two master nodes, but they crash after a few minutes.
>>> The log messages are:
>>>
>>> I0719 11:50:22.673280  5376 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (287)@10.10.186.76:5050
>>> I0719 11:50:23.154119  5381 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (504)@10.10.179.252:5050
>>> I0719 11:50:23.154749  5376 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:23.156838  5378 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:23.563072  5382 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (289)@10.10.186.76:5050
>>> I0719 11:50:23.883855  5376 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (507)@10.10.179.252:5050
>>> I0719 11:50:23.884414  5380 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:23.886569  5375 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:24.163056  5379 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (291)@10.10.186.76:5050
>>> I0719 11:50:24.425379  5378 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (510)@10.10.179.252:5050
>>> I0719 11:50:24.425864  5379 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:24.428951  5375 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:24.935673  5379 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (293)@10.10.186.76:5050
>>> F0719 11:50:25.262277  5381 master.cpp:1662] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
>>> *** Check failure stack trace: ***
>>>     @     0x7fe6fa0ac37c  google::LogMessage::Fail()
>>>     @     0x7fe6fa0ac2d8  google::LogMessage::SendToLog()
>>>     @     0x7fe6fa0abcce  google::LogMessage::Flush()
>>>     @     0x7fe6fa0aea88  google::LogMessageFatal::~LogMessageFatal()
>>>     @     0x7fe6f900a64c  mesos::internal::master::fail()
>>>     @     0x7fe6f90deffb  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>>>     @     0x7fe6f90b98df  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>>>     @     0x7fe6f9086783  _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>>>     @     0x7fe6f90df0cd  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>     @           0x4a4833  std::function<>::operator()()
>>>     @           0x49f0eb  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>     @           0x4997c2  process::Future<>::fail()
>>>     @     0x7fe6f8ccfa22  process::Promise<>::fail()
>>>     @     0x7fe6f90dc4f0  process::internal::thenf<>()
>>>     @     0x7fe6f9120bd9  _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>>     @     0x7fe6f91178cd  std::_Bind<>::operator()<>()
>>>     @     0x7fe6f90fe821  std::_Function_handler<>::_M_invoke()
>>>     @     0x7fe6f9117aff  std::function<>::operator()()
>>>     @     0x7fe6f90fe955  _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>>>     @     0x7fe6f9120c85  _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>>>     @     0x7fe6f9117aff  std::function<>::operator()()
>>>     @     0x7fe6f91807c4  process::internal::run<>()
>>>     @     0x7fe6f9176ef4  process::Future<>::fail()
>>>     @     0x7fe6f91b12de  std::_Mem_fn<>::operator()<>()
>>> I0719 11:50:25.414069  5382 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (513)@10.10.179.252:5050
>>> I0719 11:50:25.414718  5376 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>>     @     0x7fe6f91ac6c7  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>> I0719 11:50:25.416431  5377 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:25.418115  5379 http.cpp:381] HTTP GET for /master/state from 10.10.159.106:3363 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
>>>     @     0x7fe6f91a4d23  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>>>     @     0x7fe6f919ac63  _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>>>     @     0x7fe6f91ac752  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>     @           0x4a4833  std::function<>::operator()()
>>>     @           0x49f0eb  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>     @     0x7fe6f9176ecc  process::Future<>::fail()
>>>     @     0x7fe6f916feac  process::Promise<>::fail()
>>>
>>> The error log is:
>>>
>>> Log file created at: 2016/07/19 11:50:25
>>> Running on machine: oss-mesos-master-bjc-001
>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>> F0719 11:50:25.262277  5381 master.cpp:1662] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
>>>
>>> Can you help me? Thanks.
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
> --
> Best Regards,
> Haosdent Huang
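Following the suggestion above, the minimal fix is to start three masters while keeping MESOS_quorum=2 (or to drop ZooKeeper and run a single master with no quorum). A sketch of a pre-start sanity check, assuming the poster's settings and that the same config is deployed to three hosts (MASTERS is an illustrative variable, not a Mesos flag):

```shell
# Number of master hosts this config will be deployed to (assumption: 3).
MASTERS=3
export MESOS_quorum=2

# Bootstrapping a fresh replicated log from the EMPTY state needs all
# (quorum * 2 - 1) masters up, so refuse to start with fewer configured.
required=$(( MESOS_quorum * 2 - 1 ))
if [ "$MASTERS" -lt "$required" ]; then
    echo "MESOS_quorum=$MESOS_quorum needs $required masters, only $MASTERS configured" >&2
    exit 1
fi
echo "ok: $MASTERS masters, quorum $MESOS_quorum"
```

With MASTERS=2 and MESOS_quorum=2, as in the original report, this check fails, which matches the registrar fetch timing out after 2 minutes.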