Thanks Vinod and Dick. I think my 3 ZK servers have formed a quorum; each of them has the following config:

$ cat conf/zoo.cfg
server.1=192.168.122.132:2888:3888
server.2=192.168.122.225:2888:3888
server.3=192.168.122.171:2888:3888
autopurge.purgeInterval=6
autopurge.snapRetainCount=5
initLimit=10
syncLimit=5
maxClientCnxns=0
clientPort=2181
tickTime=2000
quorumListenOnAllIPs=true
dataDir=/home/stack/packages/zookeeper-3.4.8/snapshot
dataLogDir=/home/stack/packages/zookeeper-3.4.8/transactions
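(For a programmatic cross-check of the quorum state described above: ZooKeeper answers the 'stat' four-letter command on its client port with a "Mode:" line. A minimal Python sketch, using the three server addresses from the zoo.cfg; the helper names are illustrative, not part of any ZooKeeper API:)

```python
import socket

def parse_mode(stat_output):
    """Extract the value of the 'Mode:' line from ZooKeeper 'stat' output."""
    for line in stat_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None

def zk_mode(host, port=2181, timeout=2.0):
    """Send the 'stat' four-letter command and return the server's mode."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"stat")
        data = s.recv(65536).decode()
    return parse_mode(data)

# Usage against the three servers from the zoo.cfg above;
# a healthy 3-node ensemble reports one "leader" and two "follower":
# for h in ("192.168.122.132", "192.168.122.225", "192.168.122.171"):
#     print(h, zk_mode(h))
```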
And when I run "bin/zkServer.sh status" on each of them, I can see "Mode: leader" for one, and "Mode: follower" for the other two.

I have already tried to manually start the 3 masters simultaneously, and here is what I see in their logs:

In 192.168.122.171 (the first master I started):

I0605 07:12:49.418721 1187 detector.cpp:152] Detected a new leader: (id='25')
I0605 07:12:49.419276 1186 group.cpp:698] Trying to get '/mesos/log_replicas/0000000024' in ZooKeeper
I0605 07:12:49.420013 1188 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
I0605 07:12:49.423807 1188 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
I0605 07:12:49.423841 1186 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.122.171:5050 }
I0605 07:12:49.424281 1187 master.cpp:1951] The newly elected leader is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b
I0605 07:12:49.424895 1187 master.cpp:1964] Elected as the leading master!
In 192.168.122.225 (the second master I started):

I0605 07:12:51.918702 2246 detector.cpp:152] Detected a new leader: (id='25')
I0605 07:12:51.919983 2246 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
I0605 07:12:51.921910 2249 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@192.168.122.171:5050 }
I0605 07:12:51.925721 2252 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (6)@192.168.122.225:5050
I0605 07:12:51.927891 2246 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected
I0605 07:12:51.928444 2246 master.cpp:1951] The newly elected leader is master@192.168.122.171:5050 with id cdc459d4-a05f-4f99-9bf4-1ee9a91d139b

In 192.168.122.132 (the last master I started):

I0605 07:12:53.553949 16426 detector.cpp:152] Detected a new leader: (id='25')
I0605 07:12:53.555179 16429 group.cpp:698] Trying to get '/mesos/json.info_0000000025' in ZooKeeper
I0605 07:12:53.560045 16428 zookeeper.cpp:259] A new leading master (UPID=master@192.168.122.171:5050) is detected

So right after I started these 3 masters, the first one (192.168.122.171) was successfully elected as leader, but after 60s it failed with the error mentioned in my first mail. Then 192.168.122.225 was elected as leader, but it failed with the same error after another 60s, and the same thing happened to the last one (192.168.122.132). So after about 180s, all 3 of my masters were down.

I tried both:

sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master

and:

sudo ./bin/mesos-master.sh --zk=zk://192.168.122.132:2181,192.168.122.171:2181,192.168.122.225:2181/mesos --quorum=2 --work_dir=/var/lib/mesos/master

and I see the same error for both. 192.168.122.132, 192.168.122.225 and 192.168.122.171 are 3 VMs running on a KVM hypervisor host.
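(For reference: Mesos requires --quorum to be a strict majority of the total number of masters, so --quorum=2 is correct for 3 masters and the flag value itself is not the problem here. The arithmetic as a tiny sketch; the helper name is illustrative:)

```python
def majority_quorum(num_masters):
    """Smallest strict majority of num_masters: the value --quorum should be set to."""
    return num_masters // 2 + 1

# For the 3-master setup in this thread:
# majority_quorum(3) is 2, matching the --quorum=2 used above.
```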
Thanks,
Qian Zhang

On Sun, Jun 5, 2016 at 3:47 AM, Dick Davies <d...@hellooperator.net> wrote:
> You told the master it needed a quorum of 2 and it's the only one
> online, so it's bombing out.
> That's the expected behaviour.
>
> You need to start at least 2 zookeepers before it will be a functional
> group, same for the masters.
>
> You haven't mentioned how you setup your zookeeper cluster, so i'm
> assuming that's working correctly (3 nodes, all aware of the other 2
> in their config). If not, you need to sort that out first.
>
> Also I think your zk URL is wrong - you want to list all 3 zookeeper
> nodes like this:
>
> sudo ./bin/mesos-master.sh
> --zk=zk://host1:2181,host2:2181,host3:2181/mesos --quorum=2
> --work_dir=/var/lib/mesos/master
>
> when you've run that command on 2 hosts things should start working,
> you'll want all 3 up for redundancy.
>
> On 4 June 2016 at 16:42, Qian Zhang <zhq527...@gmail.com> wrote:
> > Hi Folks,
> >
> > I am trying to set up a Mesos HA env with 3 nodes, each of nodes has a
> > Zookeeper running, so they form a Zookeeper cluster. And then when I started
> > the first Mesos master in one node with:
> >
> > sudo ./bin/mesos-master.sh --zk=zk://127.0.0.1:2181/mesos --quorum=2
> > --work_dir=/var/lib/mesos/master
> >
> > I found it will hang here for 60 seconds:
> > I0604 23:39:56.488219 15330 zookeeper.cpp:259] A new leading master
> > (UPID=master@192.168.122.132:5050) is detected
> > I0604 23:39:56.489080 15337 master.cpp:1951] The newly elected leader is
> > master@192.168.122.132:5050 with id 40d387a6-4d61-49d6-af44-51dd41457390
> > I0604 23:39:56.489791 15337 master.cpp:1964] Elected as the leading master!
> > I0604 23:39:56.490401 15337 master.cpp:1651] Recovering from registrar
> > I0604 23:39:56.491706 15330 registrar.cpp:332] Recovering registrar
> > I0604 23:39:56.496448 15332 log.cpp:524] Attempting to start the writer
> >
> > And after 60s, master will fail:
> > F0604 23:40:56.499596 15337 master.cpp:1640] Recovery failed: Failed to
> > recover registrar: Failed to perform fetch within 1mins
> > *** Check failure stack trace: ***
> > @ 0x7f4b81372f4e google::LogMessage::Fail()
> > @ 0x7f4b81372e9a google::LogMessage::SendToLog()
> > @ 0x7f4b8137289c google::LogMessage::Flush()
> > @ 0x7f4b813757b0 google::LogMessageFatal::~LogMessageFatal()
> > @ 0x7f4b8040eea0 mesos::internal::master::fail()
> > @ 0x7f4b804dbeb3 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEE6__callIvJS1_EJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
> > @ 0x7f4b804ba453 _ZNSt5_BindIFPFvRKSsS1_EPKcSt12_PlaceholderILi1EEEEclIJS1_EvEET0_DpOT_
> > @ 0x7f4b804898d7 _ZZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvRKSsS6_EPKcSt12_PlaceholderILi1EEEEvEERKS2_OT_NS2_6PreferEENUlS6_E_clES6_
> > @ 0x7f4b804dbf80 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureI7NothingE8onFailedISt5_BindIFPFvS1_S1_EPKcSt12_PlaceholderILi1EEEEvEERKS6_OT_NS6_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> > @ 0x49d257 std::function<>::operator()()
> > @ 0x49837f _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> > @ 0x493024 process::Future<>::fail()
> > @ 0x7f4b8015ad20 process::Promise<>::fail()
> > @ 0x7f4b804d9295 process::internal::thenf<>()
> > @ 0x7f4b8051788f _ZNSt5_BindIFPFvRKSt8functionIFN7process6FutureI7NothingEERKN5mesos8internal8RegistryEEERKSt10shared_ptrINS1_7PromiseIS3_EEERKNS2_IS7_EEESB_SH_St12_PlaceholderILi1EEEE6__callIvISM_EILm0ELm1ELm2EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> > @ 0x7f4b8050fa3b std::_Bind<>::operator()<>()
> > @ 0x7f4b804f94e3 std::_Function_handler<>::_M_invoke()
> > @ 0x7f4b8050fc69 std::function<>::operator()()
> > @ 0x7f4b804f9609 _ZZNK7process6FutureIN5mesos8internal8RegistryEE5onAnyIRSt8functionIFvRKS4_EEvEES8_OT_NS4_6PreferEENUlS8_E_clES8_
> > @ 0x7f4b80517936 _ZNSt17_Function_handlerIFvRKN7process6FutureIN5mesos8internal8RegistryEEEEZNKS5_5onAnyIRSt8functionIS8_EvEES7_OT_NS5_6PreferEEUlS7_E_E9_M_invokeERKSt9_Any_dataS7_
> > @ 0x7f4b8050fc69 std::function<>::operator()()
> > @ 0x7f4b8056b1b4 process::internal::run<>()
> > @ 0x7f4b80561672 process::Future<>::fail()
> > @ 0x7f4b8059bf5f std::_Mem_fn<>::operator()<>()
> > @ 0x7f4b8059757f _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEE6__callIbIS8_EILm0ELm1EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
> > @ 0x7f4b8058fad1 _ZNSt5_BindIFSt7_Mem_fnIMN7process6FutureIN5mesos8internal8RegistryEEEFbRKSsEES6_St12_PlaceholderILi1EEEEclIJS8_EbEET0_DpOT_
> > @ 0x7f4b80585a41 _ZZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS4_FbRKSsEES4_St12_PlaceholderILi1EEEEbEERKS4_OT_NS4_6PreferEENUlS9_E_clES9_
> > @ 0x7f4b80597605 _ZNSt17_Function_handlerIFvRKSsEZNK7process6FutureIN5mesos8internal8RegistryEE8onFailedISt5_BindIFSt7_Mem_fnIMS8_FbS1_EES8_St12_PlaceholderILi1EEEEbEERKS8_OT_NS8_6PreferEEUlS1_E_E9_M_invokeERKSt9_Any_dataS1_
> > @ 0x49d257 std::function<>::operator()()
> > @ 0x49837f _ZN7process8internal3runISt8functionIFvRKSsEEJS4_EEEvRKSt6vectorIT_SaIS8_EEDpOT0_
> > @ 0x7f4b8056164a process::Future<>::fail()
> > @ 0x7f4b8055a378 process::Promise<>::fail()
> >
> > I tried both Zookeeper 3.4.8 and 3.4.6 with latest code of Mesos, but no
> > luck for both. Any ideas about what happened? Thanks.
> >
> > Thanks,
> > Qian Zhang
>
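(One common cause of "Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins" is that the masters' replicated-log replicas cannot reach one another over the libprocess port, 5050 by default, so the registrar never assembles a quorum of replicas. A minimal connectivity sketch, assuming the three master IPs from this thread; 'reachable' is an illustrative helper, not a Mesos API:)

```python
import socket

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

MASTERS = ["192.168.122.132", "192.168.122.225", "192.168.122.171"]

# Run from each master VM: every peer's port 5050 must be reachable,
# otherwise the replicated log cannot form a quorum and registrar
# recovery will time out, as seen in the logs above.
# for h in MASTERS:
#     print(h, reachable(h, 5050))
```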