[ https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371742#comment-15371742 ]
Christopher M Luciano commented on MESOS-5832:
----------------------------------------------

We believe that the replicated log is the problem, because we have observed some machines believing there are only X agents registered, while other machines believe that the old value of Y is the correct number of agents.

Mesos replicated log corruption with disconnects from ZK
--------------------------------------------------------

Key: MESOS-5832
URL: https://issues.apache.org/jira/browse/MESOS-5832
Project: Mesos
Issue Type: Bug
Affects Versions: 0.25.1, 0.27.1
Reporter: Christopher M Luciano

Setup:

I set up 5 Mesos and Marathon masters (which I'll refer to as m1, m2, m3, m4, m5) running Mesos 0.27.2 (confirmed to affect 0.25.0 also), and 5 Mesos agents (which I'll refer to as a1, a2, a3, a4, a5) running the same Mesos version as the masters. All of these were pointed at a single ZooKeeper (NOT an ensemble). mesos-slave and mesos-master are run by upstart, and both are configured to be restarted on halting/crashing.

Procedure:

1) I confirm a Mesos master has been elected and all agents have been discovered.
2) On the ZooKeeper machine, I add an iptables rule that blocks all incoming traffic from m1 and m2.
3) The mesos-master processes on m1 and m2 halt; upstart restarts them. They are not able to communicate with ZooKeeper, and are therefore no longer considered part of the cluster.
4) A leader election happens (m3 is elected leader).
5) I shut down the mesos-slave process on a1 (note: I do an initctl stop mesos-slave; just killing it would cause it to be restarted).
6) I wait to confirm the agent is reported as down by m3.
7) I add iptables rules on the ZooKeeper machine to block all incoming traffic from m3, m4, and m5.
8) I confirm that the mesos-master processes on m3, m4, and m5 have all halted and restarted.
9) I confirm that all masters report themselves as not in the cluster.
10) I remove the iptables rule on the ZooKeeper machine that blocks all traffic from m1 and m2.
11) m1 and m2 now report they are part of the cluster; a leader election happens and either m1 or m2 is elected leader. NOTE: because the cluster does not have quorum, no agents are listed.
12) I shut down the mesos-slave process on a2.
13) In the logs of the current master, I can see this information being processed by the master.
14) I add iptables rules on the ZooKeeper machine to block all masters.
15) I wait for all masters to report themselves as not being in the cluster.
16) I remove all iptables rules on the ZooKeeper machine.
17) All masters join the cluster, and a leader election happens.
18) After ten minutes, the leader's mesos-master process halts and a leader election happens... and this repeats every 10 minutes.

Summary:

Here is what I think is happening in the above test case: at the end of step 17, the masters all try to do replicated log reconciliation, and can't. I think the state of the agents isn't actually relevant; the replicated log reconciliation causes a hang or a silent failure. After 10 minutes, the leader hits a timeout for communicating with the registry (i.e. ZooKeeper). Even though it can communicate with ZooKeeper, it never does, because of the previous hang/silent failure.

Attached is a Perl script I used on the ZooKeeper machine to automate the steps above. If you want to use it, you'll need to change the IPs set in the script, and make sure that one of the first 2 IPs is the current Mesos master.
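The divergent agent counts described at the top of the comment can be checked by polling each master's /state.json endpoint and comparing what it reports. A minimal sketch follows; the master host list is a placeholder, and the activated_slaves and slaves fields are assumptions based on the master state schema in these Mesos versions:

```python
import json
from urllib.request import urlopen

# Placeholder endpoints; substitute the real master addresses.
MASTERS = ["m1:5050", "m2:5050", "m3:5050", "m4:5050", "m5:5050"]

def agent_counts(state):
    """Return (activated, total) agent counts from a /state.json payload."""
    return state.get("activated_slaves", 0), len(state.get("slaves", []))

def poll(master):
    """Fetch /state.json from one master; raises OSError if unreachable."""
    with urlopen("http://%s/state.json" % master, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Masters that disagree here exhibit the symptom described above.
    for m in MASTERS:
        try:
            active, total = agent_counts(poll(m))
            print("%s: %s active / %s known agents" % (m, active, total))
        except OSError as e:
            print("%s: unreachable (%s)" % (m, e))
```

Running this against each master during step 17 should show some masters reporting the old agent count and others the new one.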
#!/usr/bin/perl

# Block all traffic from a host at the ZooKeeper machine.
sub drop_it {
    print "dropping $_[0]\n";
    `iptables -I INPUT -s $_[0] -j DROP;`;
}

# Stop the mesos-slave process on an agent over ssh.
sub drop_agent {
    print "dropping agent $_[0]\n";
    print `ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root\@$_[0] "sudo initctl stop mesos-slave"`;
}

# Remove the block so the host can reach ZooKeeper again.
sub revive_it {
    print "reviving $_[0]\n";
    `iptables -D INPUT -s $_[0] -j DROP;`;
}

$master_1 = '10.xx.xx.xx';
$master_2 = '10.xx.xx.xx';
$master_3 = '10.xx.xx.xx';
$master_4 = '10.xx.xx.xx';
$master_5 = '10.xx.xx.xx';
$agent_1 = '10.xx.xx.xx';
$agent_2 = '10.xx.xx.xx';

drop_it($master_1);
drop_it($master_2);
sleep(20);
drop_agent($agent_1);
sleep(20);
drop_it($master_3);
drop_it($master_4);
drop_it($master_5);
sleep(20);
revive_it($master_1);
revive_it($master_2);
sleep(180);
drop_agent($agent_2);
sleep(20);
drop_it($master_1);
drop_it($master_2);
sleep(20);
revive_it($master_1);
revive_it($master_2);
revive_it($master_3);
revive_it($master_4);
revive_it($master_5);

The end outcome of the above is a replicated log that can never be resolved. We keep hitting registrar timeouts and must blow away the log on all the masters in order for it to be recreated and for the cluster to resolve.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)