[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264584#comment-15264584 ]

Benjamin Mahler edited comment on MESOS-5193 at 4/29/16 7:19 PM:
-----------------------------------------------------------------

[~prigupta] Looking at the logs, there was a roughly 3-minute window (from 18:33 
to 18:36) in which the masters were experiencing ZooKeeper connectivity issues. 
Did you notice this?

Also, we require that the masters are run under supervision; are you ensuring 
that the masters are promptly restarted if they terminate? Since the recovery 
timeout is 1 minute by default, I would suggest a supervision restart delay that 
is much smaller, like 10 seconds.
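
For example (a sketch only, assuming the masters run under systemd; the unit file 
and its location are hypothetical, and the command line is the one from this 
ticket), a ~10 second restart delay would look like:

{noformat}
# /etc/systemd/system/mesos-master.service  (hypothetical unit file)
[Unit]
Description=Mesos Master
After=network.target

[Service]
# Command line copied from this ticket; adjust flags for your setup.
ExecStart=/usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
# Restart promptly so a failed recovery attempt does not leave the cluster
# without a master for longer than the 1 minute recovery timeout.
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
{noformat}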

Were the masters restarted after the last recovery failures here?

{noformat}
Master 1:
W0429 18:33:08.726205  2518 logging.cpp:88] RAW: Received signal SIGTERM from process 2938 of user 0; exiting
I0429 18:33:28.846740  1083 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:37:26.008154  1134 master.cpp:1723] Elected as the leading master!
F0429 18:38:26.008847  1127 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Master 2:
W0429 18:36:04.716518  2410 logging.cpp:88] RAW: Received signal SIGTERM from process 3029 of user 0; exiting
I0429 18:36:30.429669  1091 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:38:34.699726  1144 master.cpp:1723] Elected as the leading master!
F0429 18:39:34.715205  1139 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins

Master 3:
I0429 18:32:12.877344  7962 main.cpp:230] Build: 2016-04-13 23:22:05 by screwdrv
I0429 18:36:16.489387  7963 master.cpp:1723] Elected as the leading master!
F0429 18:37:16.490408  7967 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
{noformat}

If they were restarted and the ZooKeeper connectivity was resolved, the masters 
should have been able to get back up and running.
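
As a quick sanity check (a rough sketch; adjust the hostnames, and note that 
--registry_fetch_timeout is the master flag as I recall it, so verify it against 
{{mesos-master --help}}), you could confirm ZooKeeper is answering and, if 
registry fetches in this environment legitimately take longer than a minute, 
give the registrar more time:

{noformat}
# Check that each ZooKeeper node is serving requests ("imok" expected).
for node in node1 node2 node3; do echo ruok | nc $node 2181; echo; done

# If fetches really do take longer than 1 minute, the timeout can be raised,
# e.g. to 2 minutes (flag name assumed; double-check with `mesos-master --help`).
/usr/sbin/mesos-master --work_dir=/tmp/mesos_dir \
  --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2 \
  --registry_fetch_timeout=2mins
{noformat}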



> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>         Attachments: node1.log, node1_after_work_dir.log, node2.log, 
> node2_after_work_dir.log, node3.log, node3_after_work_dir.log
>
>
> Hi all, 
> We are using a 3-node cluster with a mesos master, a mesos slave, and zookeeper 
> on each node, with chronos running on top. The problem is that when we reboot 
> the mesos master leader, the other nodes try to get elected as leader but fail 
> with a registrar recovery error: 
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 
> 1mins"
> The next node then tries to become the leader but fails with the same error. 
> I am not sure what is causing this. We are currently using mesos 0.22 and have 
> also tried upgrading to mesos 0.27, but the problem persists. 
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir 
> --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it is a production system?
> Thanks,
> Priyanka



