[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ]
Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM:
--------------------------------------------------------------------

I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat
I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_0000000000' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

Three configured ZK members, all is OK. Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable:

{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat
I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
    @ 0x7f9bec6044c2 google::LogMessage::Fail()
    @ 0x7f9bec6044c2 google::LogMessage::Fail()
    @ 0x7f9bec60440e google::LogMessage::SendToLog()
    @ 0x7f9bec60440e google::LogMessage::SendToLog()
    @ 0x7f9bec603e10 google::LogMessage::Flush()
    @ 0x7f9bec603e10 google::LogMessage::Flush()
    @ 0x7f9bec603c25 google::LogMessage::~LogMessage()
    @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage()
    @ 0x7f9bec603c25 google::LogMessage::~LogMessage()
    @ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage()
    @ 0x7f9bec00b825 ZooKeeperProcess::initialize()
    @ 0x7f9bec00b825 ZooKeeperProcess::initialize()
    @ 0x7f9bec57053d process::ProcessManager::resume()
    @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @ 0x7f9bec57053d process::ProcessManager::resume()
    @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
    @ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
    @ 0x7f9be828ea40 (unknown)
    @ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @ 0x7f9be7aab182 start_thread
    @ 0x7f9be77d847d (unknown)
Aborted (core dumped)
{code}

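To make the expectation in the reproduction concrete (two of three members still resolve, so startup should be able to continue), here is a minimal Python sketch of a tolerant pre-check. It is an illustration only, not how Mesos or the ZooKeeper C client handles the connect string; the function names `parse_zk_url`, `system_resolves` and `usable_members` are hypothetical, and the default port 2181 for entries without a port is an assumption.

```python
import socket
from typing import Callable, List, Tuple

Member = Tuple[str, int]


def parse_zk_url(url: str) -> Tuple[List[Member], str]:
    """Split 'zk://h1:p1,h2:p2/path' into ([(host, port), ...], '/path')."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("expected a zk:// URL")
    servers, _, path = url[len(prefix):].partition("/")
    members = []
    for entry in servers.split(","):
        host, sep, port = entry.partition(":")
        # Assumption: fall back to 2181 when an entry carries no port.
        members.append((host, int(port) if sep else 2181))
    return members, "/" + path


def system_resolves(host: str, port: int) -> bool:
    """Return True if the system resolver can look up the host."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


def usable_members(
    members: List[Member],
    resolves: Callable[[str, int], bool] = system_resolves,
) -> List[Member]:
    """Keep the members that resolve; fail only if none of them do."""
    good = [m for m in members if resolves(*m)]
    if not good:
        raise RuntimeError("no configured ZooKeeper host resolves")
    return good
```

With the second command line above, `usable_members(parse_zk_url(url)[0])` would drop `badhost.mycorp.com` and return the two remaining members instead of aborting. The resolver is passed in as a parameter so the filtering can be exercised without real DNS.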
[~rgs] I am very sorry if this turns out not to be a ZK problem at all; I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members is not valid (even if two are).

> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
>                 Key: MESOS-2186
>                 URL: https://issues.apache.org/jira/browse/MESOS-2186
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6
>                      Mesos: 0.21.0-1.0.centos65
>                      CentOS: CentOS release 6.6 (Final)
>            Reporter: Daniel Hall
>            Priority: Critical
>              Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue
> while we were rebuilding one of our zookeeper hosts in Google compute (which
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been
> sanitised).
> {noformat}
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 28627 main.cpp:292] Starting Mesos master
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 28643 contender.cpp:131] Joining the ZK group
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 28640 master.cpp:1202] Successfully attached file '/var/log/mesos/mesos-master.INFO'
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f55fa21f process::schedule()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x3e498079d1 (unknown)
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x3e494e89dd (unknown)
> Dec 9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in '/usr/local/sbin/mesos-master'
> Dec 9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed by ABRT signal
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)