[ https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371742#comment-15371742 ]
Christopher M Luciano commented on MESOS-5832:
----------------------------------------------

We believe that the replicated log is the problem, because we have observed some machines believing there are only X agents registered, while other machines believe that the old value of Y is the correct number of agents.

Mesos replicated log corruption with disconnects from ZK
--------------------------------------------------------

Key: MESOS-5832
URL: https://issues.apache.org/jira/browse/MESOS-5832
Project: Mesos
Issue Type: Bug
Affects Versions: 0.25.1, 0.27.1
Reporter: Christopher M Luciano

Setup:

I set up 5 Mesos and Marathon masters (which I'll refer to as m1, m2, m3, m4, m5) running Mesos 0.27.2 (confirmed to affect 0.25.0 also), and 5 Mesos agents (which I'll refer to as a1, a2, a3, a4, a5) running the same Mesos version as the masters. All of these were pointed at a single ZooKeeper (NOT an ensemble). mesos-slave and mesos-master are run by upstart, and both are configured to be restarted on halting/crashing.

Procedure:

1) I confirm a Mesos master has been elected and all agents have been discovered.
2) On the ZooKeeper machine, I add an iptables rule that blocks all incoming traffic from m1 and m2.
3) The mesos-master processes on m1 and m2 halt; upstart restarts them. They are not able to communicate with ZooKeeper, and are therefore no longer considered part of the cluster.
4) A leader election happens (m3 is elected leader).
5) I shut down the mesos-slave process on a1 (note: I do an initctl stop mesos-slave; just killing it would cause it to be restarted).
6) I wait to confirm the agent is reported as down by m3.
7) I add iptables rules on the ZooKeeper machine to block all incoming traffic from m3, m4, and m5.
8) I confirm that the mesos-master processes on m3, m4, and m5 have all halted and restarted.
9) I confirm that all masters report themselves as not in the cluster.
10) I remove the iptables rule on the ZooKeeper machine that blocks all traffic from m1 and m2.
11) m1 and m2 now report they are part of the cluster; a leader election happens and either m1 or m2 is elected leader. NOTE: because the cluster does not have quorum, no agents are listed.
12) I shut down the mesos-slave process on a2.
13) In the logs of the current master, I can see this information being processed by the master.
14) I add iptables rules on the ZooKeeper machine to block all masters.
15) I wait for all masters to report themselves as not being in the cluster.
16) I remove all iptables rules on the ZooKeeper machine.
17) All masters join the cluster, and a leader election happens.
18) After ten minutes, the leader's mesos-master process halts and a leader election happens... and this repeats every 10 minutes.

Summary:

Here is what I think is happening in the above test case: at the end of step 17, the masters all try to do replicated log reconciliation, and can't. I think the state of the agents isn't actually relevant; the replicated log reconciliation causes a hang or a silent failure. After 10 minutes, the leader hits a timeout for communicating with the registry (i.e. ZooKeeper). Even though it can communicate with ZooKeeper, it never does, because of the previous hang/silent failure.

Attached is a Perl script I used on the ZooKeeper machine to automate the steps above. If you want to use it, you'll need to change the IPs set in the script, and make sure that one of the first 2 IPs is the current Mesos master.
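The divergent agent counts described at the top of the comment can be checked by polling each master's /state.json endpoint and comparing what it reports. A minimal sketch follows; the master host list is a placeholder, and the activated_slaves and slaves fields are assumptions based on the master state schema in these Mesos versions:

```python
import json
from urllib.request import urlopen

# Placeholder endpoints; substitute the real master addresses.
MASTERS = ["m1:5050", "m2:5050", "m3:5050", "m4:5050", "m5:5050"]

def agent_counts(state):
    """Return (activated, total) agent counts from a /state.json payload."""
    return state.get("activated_slaves", 0), len(state.get("slaves", []))

def poll(master):
    """Fetch /state.json from one master; raises OSError if unreachable."""
    with urlopen("http://%s/state.json" % master, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Masters that disagree here exhibit the symptom described above.
    for m in MASTERS:
        try:
            active, total = agent_counts(poll(m))
            print("%s: %s active / %s known agents" % (m, active, total))
        except OSError as e:
            print("%s: unreachable (%s)" % (m, e))
```

Running this against each master during step 17 should show some masters reporting the old agent count and others the new one.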
#!/usr/bin/perl

# Block all traffic from a host at the ZooKeeper machine.
sub drop_it {
    print "dropping $_[0]\n";
    `iptables -I INPUT -s $_[0] -j DROP;`;
}

# Stop the mesos-slave process on an agent over ssh.
sub drop_agent {
    print "dropping agent $_[0]\n";
    print `ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root\@$_[0] "sudo initctl stop mesos-slave"`;
}

# Remove the block so the host can reach ZooKeeper again.
sub revive_it {
    print "reviving $_[0]\n";
    `iptables -D INPUT -s $_[0] -j DROP;`;
}

$master_1 = '10.xx.xx.xx';
$master_2 = '10.xx.xx.xx';
$master_3 = '10.xx.xx.xx';
$master_4 = '10.xx.xx.xx';
$master_5 = '10.xx.xx.xx';
$agent_1 = '10.xx.xx.xx';
$agent_2 = '10.xx.xx.xx';

drop_it($master_1);
drop_it($master_2);
sleep(20);
drop_agent($agent_1);
sleep(20);
drop_it($master_3);
drop_it($master_4);
drop_it($master_5);
sleep(20);
revive_it($master_1);
revive_it($master_2);
sleep(180);
drop_agent($agent_2);
sleep(20);
drop_it($master_1);
drop_it($master_2);
sleep(20);
revive_it($master_1);
revive_it($master_2);
revive_it($master_3);
revive_it($master_4);
revive_it($master_5);

The end outcome of the above is a replicated log that can never be resolved. We keep hitting registrar timeouts and must blow away the log on all the masters in order for it to be recreated and for the cluster to resolve.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)