When you start a new group of masters, the masters will not initialize their replicated log (from the EMPTY state) until all masters are present. This means (quorum * 2 - 1) masters must be up and reachable.
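As a quick back-of-the-envelope check (a hypothetical snippet, not part of Mesos itself), the sizing rule can be written out in shell:

```shell
# Sizing rule for bootstrapping a new replicated log from the EMPTY state:
# a group in which `quorum` is a strict majority has (quorum * 2 - 1)
# members, and all of them must be up and reachable for initialization
# to proceed.
quorum=2   # e.g. MESOS_quorum=2
masters_required=$(( quorum * 2 - 1 ))
echo "MESOS_quorum=$quorum => $masters_required masters must be running"
```

With MESOS_quorum=2, as in the config further down this thread, only three running masters can initialize the log; starting two leaves the registrar fetch waiting until it times out.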
We enforce this behavior because the replicated log can otherwise get into an inconsistent state. Consider a simple case where you have an existing group of 3 masters:

1) Each master's replicated log is up-to-date with the leader.
2) Two masters are completely destroyed (their disks blow up, or something).
3) You bring two new masters up.

If we allowed a quorum of new masters to rejoin an existing cluster, the surviving old master's data would become the source of truth, because it has the highest log position. This is not necessarily correct. By disallowing a quorum of new masters from rejoining an existing cluster, it becomes the operator's job to recover after catastrophic failures.

On Tue, Jul 19, 2016 at 2:57 AM, 梦开始的地方 <382607...@qq.com> wrote:

> Yes, I started 1 master and it works fine. I will try 3 masters later, thanks.
>
> ------------------ Original Message ------------------
> *From:* "haosdent" <haosd...@gmail.com>
> *Sent:* Tuesday, July 19, 2016, 5:39 PM
> *To:* "user" <user@mesos.apache.org>
> *Subject:* Re: mesos crash
>
> I think it may be because you started 2 masters and set MESOS_quorum to 2,
> so the election could not finish successfully. Could you start 3 masters?
> Or remove ZooKeeper and just start 1 master.
>
> On Tue, Jul 19, 2016 at 5:26 PM, 梦开始的地方 <382607...@qq.com> wrote:
>
>> No, I deployed them on two different servers.
>>
>> ------------------ Original Message ------------------
>> *From:* "haosdent" <haosd...@gmail.com>
>> *Sent:* Tuesday, July 19, 2016, 5:05 PM
>> *To:* "user" <user@mesos.apache.org>
>> *Subject:* Re: mesos crash
>>
>> Hi,
>> > I start two master node:
>> Did you start them on the same server with the same work dir?
>>
>> On Tue, Jul 19, 2016 at 12:18 PM, 梦开始的地方 <382607...@qq.com> wrote:
>>
>>> Hello, I deployed Mesos on CentOS (kernel 3.14.73), Mesos version 1.0.0.
>>> This is my master config:
>>>
>>> export MESOS_log_dir=/apps/mesos/logs/
>>> export MESOS_ip=0.0.0.0
>>> export MESOS_hostname=`hostname`
>>> export MESOS_logging_level=INFO
>>> export MESOS_quorum=2
>>> export MESOS_work_dir=/apps/mesos/master
>>> export MESOS_zk=zk://zk1:2181,zk2:2181,zk3:2181/oss-mesos
>>> export MESOS_allocator=HierarchicalDRF
>>> export MESOS_cluster=oss-mesos
>>> export MESOS_credentials=/apps/mesos/etc/mesos/credentials.txt
>>> export MESOS_registry=replicated_log
>>> export MESOS_webui_dir=/apps/mesos/share/mesos/webui
>>> export MESOS_zk_session_timeout=90secs
>>> export MESOS_max_executors_per_slave=10
>>> export MESOS_registry_fetch_timeout=2mins
>>>
>>> I started two master nodes, but they crash after a few minutes.
>>> The log messages are:
>>>
>>> I0719 11:50:22.673280  5376 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (287)@10.10.186.76:5050
>>> I0719 11:50:23.154119  5381 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (504)@10.10.179.252:5050
>>> I0719 11:50:23.154749  5376 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:23.156838  5378 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:23.563072  5382 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (289)@10.10.186.76:5050
>>> I0719 11:50:23.883855  5376 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (507)@10.10.179.252:5050
>>> I0719 11:50:23.884414  5380 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:23.886569  5375 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:24.163056  5379 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (291)@10.10.186.76:5050
>>> I0719 11:50:24.425379  5378 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (510)@10.10.179.252:5050
>>> I0719 11:50:24.425864  5379 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:24.428951  5375 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:24.935673  5379 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (293)@10.10.186.76:5050
>>> F0719 11:50:25.262277  5381 master.cpp:1662] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
>>> *** Check failure stack trace: ***
>>>     @     0x7fe6fa0ac37c  google::LogMessage::Fail()
>>>     @     0x7fe6fa0ac2d8  google::LogMessage::SendToLog()
>>>     @     0x7fe6fa0abcce  google::LogMessage::Flush()
>>>     @     0x7fe6fa0aea88  google::LogMessageFatal::~LogMessageFatal()
>>>     @     0x7fe6f900a64c  mesos::internal::master::fail()
>>>     @     0x7fe6f90deffb  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>>>     @     0x7fe6f90b98df  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
>>>     @     0x7fe6f9086783  _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
>>>     @     0x7fe6f90df0cd  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>     @           0x4a4833  std::function<>::operator()()
>>>     @           0x49f0eb  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>     @           0x4997c2  process::Future<>::fail()
>>>     @     0x7fe6f8ccfa22  process::Promise<>::fail()
>>>     @     0x7fe6f90dc4f0  process::internal::thenf<>()
>>>     @     0x7fe6f9120bd9  _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>>     @     0x7fe6f91178cd  std::_Bind<>::operator()<>()
>>>     @     0x7fe6f90fe821  std::_Function_handler<>::_M_invoke()
>>>     @     0x7fe6f9117aff  std::function<>::operator()()
>>>     @     0x7fe6f90fe955  _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
>>>     @     0x7fe6f9120c85  _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
>>>     @     0x7fe6f9117aff  std::function<>::operator()()
>>>     @     0x7fe6f91807c4  process::internal::run<>()
>>>     @     0x7fe6f9176ef4  process::Future<>::fail()
>>>     @     0x7fe6f91b12de  std::_Mem_fn<>::operator()<>()
>>> I0719 11:50:25.414069  5382 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (513)@10.10.179.252:5050
>>> I0719 11:50:25.414718  5376 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>>     @     0x7fe6f91ac6c7  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
>>> I0719 11:50:25.416431  5377 recover.cpp:197] Received a recover response from a replica in EMPTY status
>>> I0719 11:50:25.418115  5379 http.cpp:381] HTTP GET for /master/state from 10.10.159.106:3363 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
>>>     @     0x7fe6f91a4d23  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
>>>     @     0x7fe6f919ac63  _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
>>>     @     0x7fe6f91ac752  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
>>>     @           0x4a4833  std::function<>::operator()()
>>>     @           0x49f0eb  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
>>>     @     0x7fe6f9176ecc  process::Future<>::fail()
>>>     @     0x7fe6f916feac  process::Promise<>::fail()
>>>
>>> The error log is:
>>>
>>> Log file created at: 2016/07/19 11:50:25
>>> Running on machine: oss-mesos-master-bjc-001
>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>> F0719 11:50:25.262277  5381 master.cpp:1662] Recovery failed: Failed to recover registrar: Failed to perform fetch within 2mins
>>>
>>> Can you help me? Thanks.
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
> --
> Best Regards,
> Haosdent Huang
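Following the suggestion above, the minimal fix is to start three masters while keeping MESOS_quorum=2 (or to drop ZooKeeper and run a single master with no quorum). A sketch of a pre-start sanity check, assuming the poster's settings and that the same config is deployed to three hosts (MASTERS is an illustrative variable, not a Mesos flag):

```shell
# Number of master hosts this config will be deployed to (assumption: 3).
MASTERS=3
export MESOS_quorum=2

# Bootstrapping a fresh replicated log from the EMPTY state needs all
# (quorum * 2 - 1) masters up, so refuse to start with fewer configured.
required=$(( MESOS_quorum * 2 - 1 ))
if [ "$MASTERS" -lt "$required" ]; then
    echo "MESOS_quorum=$MESOS_quorum needs $required masters, only $MASTERS configured" >&2
    exit 1
fi
echo "ok: $MASTERS masters, quorum $MESOS_quorum"
```

With MASTERS=2 and MESOS_quorum=2, as in the original report, this check fails, which matches the registrar fetch timing out after 2 minutes.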