[ https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385050#comment-15385050 ]
Christopher M Luciano edited comment on MESOS-5832 at 7/19/16 11:18 PM:
------------------------------------------------------------------------

[~kaysoky] Tuning the variable you suggested caused an infinite loop, with the following in the mesos-master.INFO log:

{code}
I0719 23:12:37.737733 32308 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (474)@remotehost:5050
I0719 23:12:38.002034 32285 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (293)@remotehost:5050
I0719 23:12:38.095073 32292 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (164)@remotehost:5050
I0719 23:12:38.449486 32306 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (478)@remotehost:5050
I0719 23:12:38.491892 32301 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (629)@remotehost:5050
I0719 23:12:38.492111 32276 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492249 32307 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492570 32289 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492775 32282 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.950845 32277 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (297)@remotehost:5050
F0719 23:12:39.004389 32291 master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to perform fetch within 15secs
{code}
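For reference, a minimal sketch of how that timeout is raised on the master command line, assuming the flag in question is {{--registry_fetch_timeout}} (the zk path, quorum, and work_dir below are placeholders, not our real values):

{code}
mesos-master \
  --zk=zk://10.xx.xx.xx:2181/mesos \
  --quorum=3 \
  --work_dir=/var/lib/mesos \
  --registry_fetch_timeout=5mins
{code}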

> Mesos replicated log corruption with disconnects from ZK
> --------------------------------------------------------
>
>                  Key: MESOS-5832
>                  URL: https://issues.apache.org/jira/browse/MESOS-5832
>              Project: Mesos
>           Issue Type: Bug
>     Affects Versions: 0.25.1, 0.27.1
>             Reporter: Christopher M Luciano
>
> Setup:
> I set up 5 mesos and marathon masters (which I'll refer to as m1, m2, m3, m4, m5) running mesos version 0.27.2 (confirmed to affect 0.25.0 also).
> I set up 5 mesos agents (which I'll refer to as a1, a2, a3, a4, a5), installed with the same mesos version as the masters.
> All of these were pointed at a single zookeeper (NOT an ensemble).
> mesos-slave and mesos-master are run by upstart, and both are configured to be restarted on halting/crashing (a sketch of such an upstart job is included after the summary below).
>
> Procedure:
> 1) I confirm a mesos master has been elected and all agents have been discovered
> 2) On the zookeeper machine, I add an IPTABLES rule which blocks all incoming traffic from m1 and m2
> 3) The mesos-master processes on m1 and m2 halt; upstart restarts them. They are not able to communicate with zookeeper, and are therefore no longer considered part of the cluster
> 4) A leader election happens (m3 is elected leader)
> 5) I shut down the mesos-slave process on a1 (note: I do an initctl stop mesos-slave; just killing it would cause it to be restarted)
> 6) I wait to confirm the slave is reported as down by m3
> 7) I add IPTABLES rules on the zookeeper machine to block all incoming traffic from m3, m4, and m5
> 8) I confirm that the mesos-master processes on m3, m4, and m5 have all halted and restarted
> 9) I confirm that all masters report themselves as not in the cluster
> 10) I remove the IPTABLES rule from the zookeeper machine that is blocking all traffic from m1 and m2
> 11) m1 and m2 now report they are part of the cluster; there is a leader election and either m1 or m2 is now elected leader. NOTE: because the cluster does not have quorum, no agents are listed.
> 12) I shut down the mesos-slave process on a2
> 13) In the logs of the current master, I can see this information being processed by the master
> 14) I add IPTABLES rules on the zookeeper machine to block all masters
> 15) I wait for all masters to report themselves as not being in the cluster
> 16) I remove all IPTABLES rules on the zookeeper machine
> 17) All masters join the cluster, and a leader election happens
> 18) After ten minutes, the leader's mesos-master process halts, a leader election happens... and this repeats every 10 minutes
>
> Summary:
> Here is what I think is happening in the above test case: I think that at the end of step 17, the masters all try to do replicated log reconciliation, and can't. I think the state of the agents isn't actually relevant; the replicated log reconciliation causes a hang or a silent failure. After 10 minutes, it hits a timeout for communicating with the registry (i.e. zookeeper): even though it can communicate with zookeeper, it never does because of the previous hang or silent failure.
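>
> As an aside: the halt-and-restart behaviour relied on in steps 3 and 8 assumes upstart jobs with a respawn stanza. A minimal sketch of that kind of job configuration (the file path and exec line are illustrative, not our exact config):
>
> {code}
> # /etc/init/mesos-master.conf (sketch; restarts the master whenever it halts)
> description "mesos master"
> start on runlevel [2345]
> stop on runlevel [016]
> respawn
> exec /usr/sbin/mesos-master --zk=zk://10.xx.xx.xx:2181/mesos --quorum=3 --work_dir=/var/lib/mesos
> {code}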
>
> Attached is a perl script I used on the zookeeper machine to automate the steps above. If you want to use it, you'll need to change the IPs set in the script, and make sure that one of the first 2 IPs is the current mesos master.
>
> # Block all incoming traffic from the given IP at the zookeeper machine.
> sub drop_it {
>   print "dropping $_[0]\n";
>   `iptables -I INPUT -s $_[0] -j DROP;`;
> }
>
> # Stop the mesos-slave process on the given agent over ssh
> # (initctl stop, so that upstart does not respawn it).
> sub drop_agent {
>   print "dropping agent $_[0]\n";
>   print `ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root\@$_[0] "sudo initctl stop mesos-slave"`;
> }
>
> # Remove the DROP rule, restoring the host's connectivity to zookeeper.
> sub revive_it {
>   print "reviving $_[0]\n";
>   `iptables -D INPUT -s $_[0] -j DROP;`;
> }
>
> # Edit these; one of the first two must be the current mesos master.
> $master_1 = '10.xx.xx.xx';
> $master_2 = '10.xx.xx.xx';
> $master_3 = '10.xx.xx.xx';
> $master_4 = '10.xx.xx.xx';
> $master_5 = '10.xx.xx.xx';
> $agent_1 = '10.xx.xx.xx';
> $agent_2 = '10.xx.xx.xx';
>
> drop_it($master_1);    # steps 2-4: cut off m1 and m2
> drop_it($master_2);
> sleep(20);
> drop_agent($agent_1);  # steps 5-6: stop the slave on a1
> sleep(20);
> drop_it($master_3);    # steps 7-9: cut off m3, m4, and m5
> drop_it($master_4);
> drop_it($master_5);
> sleep(20);
> revive_it($master_1);  # steps 10-11: let m1 and m2 back in
> revive_it($master_2);
> sleep(180);
> drop_agent($agent_2);  # steps 12-13: stop the slave on a2
> sleep(20);
> drop_it($master_1);    # step 14: block all masters (m3-m5 are still blocked)
> drop_it($master_2);
> sleep(20);
> revive_it($master_1);  # step 16: remove all rules
> revive_it($master_2);
> revive_it($master_3);
> revive_it($master_4);
> revive_it($master_5);
>
> The end outcome of the above is a replicated log that can never be resolved. We keep hitting registrar timeouts and must blow away the log on all the masters in order for it to be recreated and for the cluster to recover.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)