[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022
 ] 

Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:51 PM:
--------------------------------------------------------------------

I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_0000000000' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

With three configured ZK members that all resolve, everything is OK.
Now change one to an unresolvable hostname. Two members are still alive and
correct, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat

I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory

F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
    @     0x7f9bec6044c2  google::LogMessage::Fail()
    @     0x7f9bec6044c2  google::LogMessage::Fail()
    @     0x7f9bec60440e  google::LogMessage::SendToLog()
    @     0x7f9bec60440e  google::LogMessage::SendToLog()
    @     0x7f9bec603e10  google::LogMessage::Flush()
    @     0x7f9bec603e10  google::LogMessage::Flush()
    @     0x7f9bec603c25  google::LogMessage::~LogMessage()
    @     0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7f9bec603c25  google::LogMessage::~LogMessage()
    @     0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7f9bec00b825  ZooKeeperProcess::initialize()
    @     0x7f9bec00b825  ZooKeeperProcess::initialize()
    @     0x7f9bec57053d  process::ProcessManager::resume()
    @     0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @     0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f9bec57053d  process::ProcessManager::resume()
    @     0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @     0x7f9bec56d9ae  
_ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @     0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7f9bec577b54  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f9bec5779ed  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
    @     0x7f9bec577b04  
_ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @     0x7f9bec577986  
_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
    @     0x7f9be828ea40  (unknown)
    @     0x7f9bec577a96  
_ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7f9be7aab182  start_thread
    @     0x7f9be77d847d  (unknown)
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all; I am
no C++ expert. I fully admit the linked ZK bug may not be the root cause. But
Mesos is still trivial to crash if one of the ZK members is not valid (even if
a quorum of them is).
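
Until this is fixed, one possible stopgap is to pre-filter the --zk value in
whatever wrapper launches mesos-master, so that entries which do not resolve
are never handed to the master. The sketch below is only an illustration of
that idea and is not part of Mesos or ZooKeeper: filter_zk_url and can_resolve
are hypothetical helper names, and it assumes a plain zk://host:port,.../path
URL with no credentials and no IPv6 literals.

```python
import socket
from typing import Callable, List, Tuple


def can_resolve(host: str) -> bool:
    """Return True if the system resolver can look up `host`."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def filter_zk_url(
    zk_url: str,
    resolves: Callable[[str], bool] = can_resolve,
) -> Tuple[str, List[str]]:
    """Drop unresolvable host:port entries from a zk://h1:p1,h2:p2/path URL.

    Returns the filtered URL and the list of dropped entries.
    Hypothetical helper: assumes no user:password@ part in the URL.
    """
    prefix = "zk://"
    if not zk_url.startswith(prefix):
        raise ValueError("expected a zk:// URL")

    # Split "h1:p1,h2:p2/path" into the server list and the znode path.
    servers, slash, path = zk_url[len(prefix):].partition("/")

    kept: List[str] = []
    dropped: List[str] = []
    for entry in servers.split(","):
        host = entry.rsplit(":", 1)[0]  # strip the port, if any
        (kept if resolves(host) else dropped).append(entry)

    if not kept:
        raise ValueError("no ZooKeeper host resolves")
    return prefix + ",".join(kept) + slash + path, dropped


if __name__ == "__main__":
    import sys

    url, dropped = filter_zk_url(sys.argv[1])
    for entry in dropped:
        print("dropping unresolvable ZK entry: " + entry, file=sys.stderr)
    print(url)
```

The printed URL could then be passed as --zk. Note the trade-off: a host
dropped at startup stays dropped until the master is restarted, so this only
works around the crash and does not replace a proper fix.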




> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
>                 Key: MESOS-2186
>                 URL: https://issues.apache.org/jira/browse/MESOS-2186
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0, 0.26.0
>         Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>            Reporter: Daniel Hall
>            Priority: Critical
>              Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS, Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and IP addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 
> 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  
> google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 
> 28643 contender.cpp:131] Joining the ZK group
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 
> 28640 master.cpp:1202] Successfully attached file 
> '/var/log/mesos/mesos-master.INFO'
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  
> google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  
> google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  
> ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  
> process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f55fa21f  
> process::schedule()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e498079d1  
> (unknown)
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e494e89dd  
> (unknown)
> Dec  9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in 
> '/usr/local/sbin/mesos-master'
> Dec  9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed 
> by ABRT signal
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
