@Qian, I think you're running into a firewall issue. Did you make sure your masters can reach each other?
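For example, a quick mesh check along those lines could look like this (a sketch, not from the thread: the IPs are the three VMs discussed here, `nc`/netcat is assumed to be installed, and 2888/3888 are the ZooKeeper quorum/election ports from the zoo.cfg quoted further down):

```shell
#!/usr/bin/env bash
# Mesh connectivity check: run on EACH node; every line should say OK.
# Ports: 5050 = Mesos master, 2181 = ZK client, 2888/3888 = ZK quorum/election.
check_mesh() {
  local h p
  for h in 192.168.122.132 192.168.122.171 192.168.122.225; do
    for p in 5050 2181 2888 3888; do
      if nc -z -w 2 "$h" "$p"; then
        echo "OK   $h:$p"
      else
        echo "FAIL $h:$p  (check iptables/firewalld on $h)"
      fi
    done
  done
}

# check_mesh   # uncomment and run on each of the three nodes
```

Any FAIL line points at a host whose firewall (or missing listener) would explain the symptoms.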
From master A:

  $ telnet B 5050

I think it fails to connect. Please make sure any firewall is shut down.

--
Thanks,
Chengwei

On Mon, Jun 06, 2016 at 09:06:43PM +0800, Qian Zhang wrote:
> I deleted everything in the work dir (/var/lib/mesos/master) and tried
> again; the same error still happened :-(
>
> Thanks,
> Qian Zhang
>
> On Mon, Jun 6, 2016 at 3:03 AM, Jean Christophe “JC” Martin <
> jch.mar...@gmail.com> wrote:
>
> > Qian,
> >
> > Zookeeper should be able to reach a quorum with 2, no need to start 3
> > simultaneously, but there is an issue with Zookeeper related to
> > connection timeouts:
> > https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> > In some circumstances, the timeout is higher than the sync timeout,
> > which causes the leader election to fail. Try setting the parameter
> > cnxTimeout in Zookeeper (by default it’s 5000ms) to the value 500
> > (500ms). After doing this, leader election in ZK will be super fast
> > even if a node is disconnected.
> >
> > JC
> >
> > > On Jun 4, 2016, at 4:34 PM, Qian Zhang <zhq527...@gmail.com> wrote:
> > >
> > > Thanks Vinod and Dick.
> > >
> > > I think my 3 ZK servers have formed a quorum; each of them has the
> > > following config:
> > > $ cat conf/zoo.cfg
> > > server.1=192.168.122.132:2888:3888
> > > server.2=192.168.122.225:2888:3888
> > > server.3=192.168.122.171:2888:3888
> > > autopurge.purgeInterval=6
> > > autopurge.snapRetainCount=5
> > > initLimit=10
> > > syncLimit=5
> > > maxClientCnxns=0
> > > clientPort=2181
> > > tickTime=2000
> > > quorumListenOnAllIPs=true
> > > dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
> > > dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
> > >
> > > And when I run "bin/zkServer.sh status" on each of them, I can see
> > > "Mode: leader" for one, and "Mode: follower" for the other two.
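On JC's cnxTimeout suggestion: as far as I know, ZooKeeper 3.4.x reads this as the JVM system property `zookeeper.cnxTimeout` (in milliseconds) rather than as a zoo.cfg key, so one way to apply it is via conf/java.env, which zkServer.sh sources if it exists. Treat the exact mechanism as an assumption to verify against your ZooKeeper version:

```shell
# conf/java.env -- sourced by zkServer.sh (via zkEnv.sh) when present.
# Lower the leader-election connection timeout from the 5000 ms default:
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.cnxTimeout=500"
```

Restart each ZooKeeper server after adding this so the property takes effect.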
> > >
> > > I have already tried to manually start 3 masters simultaneously, and
> > > here is what I see in their logs:
> > >
> > > In 192.168.122.171 (this is the first master I started):
> > > I0605 07:12:49.418721  1187 detector.cpp:152] Detected a new leader: (id='25')
> > > I0605 07:12:49.419276  1186 group.cpp:698] Trying to get '/mesos/log_replicas/0000000024' in ZooKeeper
> > > I0605 07:12:49.420013  1188 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
> > > I0605 07:12:49.423807  1188 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
> > > I0605 07:12:49.423841  1186 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.122.171:5050 }
> > > I0605 07:12:49.424281  1187 master.cpp:1951] The newly elected leader is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> > > I0605 07:12:49.424895  1187 master.cpp:1964] Elected as the leading master!
> > >
> > > In 192.168.122.225 (second master I started):
> > > I0605 07:12:51.918702  2246 detector.cpp:152] Detected a new leader: (id='25')
> > > I0605 07:12:51.919983  2246 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
> > > I0605 07:12:51.921910  2249 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.122.171:5050 }
> > > I0605 07:12:51.925721  2252 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (6)@192.168.122.225:5050
> > > I0605 07:12:51.927891  2246 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
> > > I0605 07:12:51.928444  2246 master.cpp:1951] The newly elected leader is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
> > >
> > > In 192.168.122.132 (last master I started):
> > > I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader: (id='25')
> > > I0605 07:12:53.555179 16429 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
> > > I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
> > >
> > > So right after I started these 3 masters, the first one
> > > (192.168.122.171) was successfully elected as leader, but after 60s,
> > > 192.168.122.171 failed with the error mentioned in my first mail, and
> > > then 192.168.122.225 was elected as leader, but it failed with the
> > > same error too after another 60s, and the same thing happened to the
> > > last one (192.168.122.132). So after about 180s, all my 3 masters
> > > were down.
> > >
> > > I tried both:
> > > sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
> > > and
> > > sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
> > > and I see the same error for both.
> > >
> > > 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs which
> > > are running on a KVM hypervisor host.
> > >
> > > Thanks,
> > > Qian Zhang
> > >
> > > On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <d...@hellooperator.net> wrote:
> > >
> > > > You told the master it needed a quorum of 2 and it's the only one
> > > > online, so it's bombing out. That's the expected behaviour.
> > > >
> > > > You need to start at least 2 zookeepers before it will be a
> > > > functional group, same for the masters.
> > > >
> > > > You haven't mentioned how you set up your zookeeper cluster, so I'm
> > > > assuming that's working correctly (3 nodes, all aware of the other 2
> > > > in their config). If not, you need to sort that out first.
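As a quick way to verify the ensemble state Dick is asking about, each node's role can also be read remotely with ZooKeeper's four-letter `srvr` command on the client port; a sketch (IPs from this thread, `nc` assumed installed):

```shell
#!/usr/bin/env bash
# Print each ZooKeeper node's role via the four-letter "srvr" command.
# A healthy 3-node ensemble shows one "Mode: leader" and two "Mode: follower".
zk_modes() {
  local h
  for h in 192.168.122.132 192.168.122.171 192.168.122.225; do
    printf '%s %s\n' "$h" "$(echo srvr | nc "$h" 2181 | grep '^Mode:')"
  done
}

# zk_modes   # run from any machine that can reach port 2181
```

If any node prints nothing, its ZooKeeper isn't reachable on 2181, which needs fixing before the masters can recover.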
> > > >
> > > > Also I think your zk URL is wrong: you want to list all 3 zookeeper
> > > > nodes, like this:
> > > >
> > > > sudo ./bin/mesos-master.sh --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
> > > >
> > > > When you've run that command on 2 hosts, things should start
> > > > working; you'll want all 3 up for redundancy.
> > > >
> > > > On 4 June 2016 at 16:42, Qian Zhang <zhq527...@gmail.com> wrote:
> > > > > Hi Folks,
> > > > >
> > > > > I am trying to set up a Mesos HA env with 3 nodes; each node has a
> > > > > Zookeeper running, so they form a Zookeeper cluster. And then when
> > > > > I started the first Mesos master on one node with:
> > > > > sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master
> > > > >
> > > > > I found it hangs here for 60 seconds:
> > > > > I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.132:5050) is detected
> > > > > I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
> > > > > I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading master!
> > > > > I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
> > > > > I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> > > > > I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
> > > > >
> > > > > And after 60s, the master fails:
> > > > > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> > > > > *** Check failure stack trace: ***
> > > > >     @     0x7f4b81372f4e  google::LogMessage::Fail()
> > > > >     @     0x7f4b81372e9a  google::LogMessage::SendToLog()
> > > > >     @     0x7f4b8137289c  google::LogMessage::Flush()
> > > > >     @     0x7f4b813757b0  google::LogMessageFatal::~LogMessageFatal()
> > > > >     @     0x7f4b8040eea0  mesos::internal::master::fail()
> > > > >     @     0x7f4b804dbeb3  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> > > > >     @     0x7f4b804ba453  _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> > > > >     @     0x7f4b804898d7  _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> > > > >     @     0x7f4b804dbf80  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> > > > >     @           0x49d257  std::function<>::operator()()
> > > > >     @           0x49837f  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> > > > >     @           0x493024  process::Future<>::fail()
> > > > >     @     0x7f4b8015ad20  process::Promise<>::fail()
> > > > >     @     0x7f4b804d9295  process::internal::thenf<>()
> > > > >     @     0x7f4b8051788f  _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> > > > >     @     0x7f4b8050fa3b  std::_Bind<>::operator()<>()
> > > > >     @     0x7f4b804f94e3  std::_Function_handler<>::_M_invoke()
> > > > >     @     0x7f4b8050fc69  std::function<>::operator()()
> > > > >     @     0x7f4b804f9609  _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> > > > >     @     0x7f4b80517936  _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> > > > >     @     0x7f4b8050fc69  std::function<>::operator()()
> > > > >     @     0x7f4b8056b1b4  process::internal::run<>()
> > > > >     @     0x7f4b80561672  process::Future<>::fail()
> > > > >     @     0x7f4b8059bf5f  std::_Mem_fn<>::operator()<>()
> > > > >     @     0x7f4b8059757f  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> > > > >     @     0x7f4b8058fad1  _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> > > > >     @     0x7f4b80585a41  _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> > > > >     @     0x7f4b80597605  _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> > > > >     @           0x49d257  std::function<>::operator()()
> > > > >     @           0x49837f  _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> > > > >     @     0x7f4b8056164a  process::Future<>::fail()
> > > > >     @     0x7f4b8055a378  process::Promise<>::fail()
> > > > >
> > > > > I tried both Zookeeper 3.4.8 and 3.4.6 with the latest code of
> > > > > Mesos, but no luck with either. Any ideas about what happened?
> > > > > Thanks.
> > > > >
> > > > > Thanks,
> > > > > Qian Zhang