[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264584#comment-15264584 ]
Benjamin Mahler edited comment on MESOS-5193 at 4/29/16 7:19 PM:
-----------------------------------------------------------------
[~prigupta] Looking at the logs, there was a ~3 minute window (18:33 - 18:36) in which the masters were experiencing ZooKeeper connectivity issues. Have you noticed this? Also, we require that the masters are run under supervision; are you ensuring that the masters are promptly restarted when they terminate? Since the recovery timeout is 1 minute by default, I would suggest a supervision restart delay that is much smaller, like 10 seconds. Were the masters restarted after the last recovery failures here?
{noformat}
Master 1:
W0429 18:33:08.726205  2518 logging.cpp:88] RAW: Received signal SIGTERM from process 2938 of user 0; exiting
I0429 18:33:28.846740  1083 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:37:26.008154  1134 master.cpp:1723] Elected as the leading master!
F0429 18:38:26.008847  1127 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Master 2:
W0429 18:36:04.716518  2410 logging.cpp:88] RAW: Received signal SIGTERM from process 3029 of user 0; exiting
I0429 18:36:30.429669  1091 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:38:34.699726  1144 master.cpp:1723] Elected as the leading master!
F0429 18:39:34.715205  1139 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Master 3:
I0429 18:32:12.877344  7962 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:36:16.489387  7963 master.cpp:1723] Elected as the leading master!
F0429 18:37:16.490408  7967 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
{noformat}
If they were restarted and the ZooKeeper connectivity was resolved, the masters should have been able to get back up and running.
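The supervision setup suggested above could be sketched as a systemd unit (a hypothetical example, not part of this ticket: the unit name, target names, and the 10-second `RestartSec` value are assumptions, and the mesos-master flags are copied from the ticket description):

{noformat}
# /etc/systemd/system/mesos-master.service -- hypothetical supervision unit
[Unit]
Description=Mesos Master
After=network-online.target

[Service]
ExecStart=/usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
# Always restart a terminated master, with a delay well inside the
# 1-minute default recovery (registry fetch) timeout mentioned above.
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
{noformat}

Under a scheme like this, a master that aborts with "Recovery failed" is back up within roughly 10 seconds, so once ZooKeeper connectivity is restored the next recovery attempt has a chance to succeed.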
> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>         Attachments: node1.log, node1_after_work_dir.log, node2.log, node2_after_work_dir.log, node3.log, node3_after_work_dir.log
>
> Hi all,
> We are using a 3 node cluster with mesos master, mesos slave, and zookeeper on all of them, with chronos on top. The problem is that when we reboot the mesos master leader, the other nodes try to get elected as leader but fail with a registrar recovery error:
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins"
> The next node then tries to become the leader but again fails with the same error. I am not sure about the cause. We are currently using mesos 0.22 and have also tried upgrading to mesos 0.27, but the problem continues to happen.
> /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system?
> Thanks,
> Priyanka

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)