[jira] [Comment Edited] (MESOS-5832) Mesos replicated log corruption with disconnects from ZK

2016-07-19 Thread Christopher M Luciano (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385050#comment-15385050
 ] 

Christopher M Luciano edited comment on MESOS-5832 at 7/19/16 11:18 PM:


[~kaysoky] Tuning that to the value you suggested caused an infinite loop, with the following in the mesos-master.INFO log:

{code}
I0719 23:12:37.737733 32308 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (474)@remotehost:5050
I0719 23:12:38.002034 32285 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (293)@remotehost:5050
I0719 23:12:38.095073 32292 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (164)@remotehost:5050
I0719 23:12:38.449486 32306 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (478)@remotehost:5050
I0719 23:12:38.491892 32301 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (629)@remotehost:5050
I0719 23:12:38.492111 32276 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492249 32307 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492570 32289 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492775 32282 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.950845 32277 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (297)@remotehost:5050
F0719 23:12:39.004389 32291 master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to perform fetch within 15secs
{code}


was (Author: cmluciano):
[~kaysoky] Tuning that to the value you suggested caused an infinite loop, with the following in the mesos-master.INFO log:

```
I0719 23:12:37.737733 32308 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (474)@remotehost:5050
I0719 23:12:38.002034 32285 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (293)@remotehost:5050
I0719 23:12:38.095073 32292 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (164)@remotehost:5050
I0719 23:12:38.449486 32306 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (478)@remotehost:5050
I0719 23:12:38.491892 32301 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (629)@remotehost:5050
I0719 23:12:38.492111 32276 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492249 32307 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492570 32289 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492775 32282 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.950845 32277 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (297)@remotehost:5050
F0719 23:12:39.004389 32291 master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to perform fetch within 15secs
```

[jira] [Commented] (MESOS-5832) Mesos replicated log corruption with disconnects from ZK

2016-07-19 Thread Christopher M Luciano (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385050#comment-15385050
 ] 

Christopher M Luciano commented on MESOS-5832:
--

[~kaysoky] Tuning that to the value you suggested caused an infinite loop, with the following in the mesos-master.INFO log:

```
I0719 23:12:37.737733 32308 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (474)@remotehost:5050
I0719 23:12:38.002034 32285 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (293)@remotehost:5050
I0719 23:12:38.095073 32292 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (164)@remotehost:5050
I0719 23:12:38.449486 32306 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (478)@remotehost:5050
I0719 23:12:38.491892 32301 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (629)@remotehost:5050
I0719 23:12:38.492111 32276 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492249 32307 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492570 32289 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.492775 32282 recover.cpp:193] Received a recover response from a replica in EMPTY status
I0719 23:12:38.950845 32277 replica.cpp:673] Replica in EMPTY status received a broadcasted recover request from (297)@remotehost:5050
F0719 23:12:39.004389 32291 master.cpp:1458] Recovery failed: Failed to recover registrar: Failed to perform fetch within 15secs
```
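
While the replicas sit in EMPTY status like this, the registrar never finishes recovery, the fetch times out, the FATAL log line aborts the master, and upstart restarts it back into the same state. A minimal sketch for watching that from the outside, assuming the standard /metrics/snapshot endpoint on port 5050, placeholder hostnames m1-m5, and the registrar/log/recovered metric (which may only be populated on the elected leader):

{code}
#!/usr/bin/env python3
# Minimal sketch: poll each master and report its election status and whether
# the registrar's replicated log has finished recovery. Hostnames m1-m5 are
# placeholders; missing metrics print as None.
import json
import time
import urllib.request

MASTERS = ["m1", "m2", "m3", "m4", "m5"]

def snapshot(host):
    url = "http://%s:5050/metrics/snapshot" % host
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))

while True:
    for host in MASTERS:
        try:
            m = snapshot(host)
        except Exception as exc:
            print("%s: unreachable (%s)" % (host, exc))
            continue
        print("%s: elected=%s log_recovered=%s" % (
            host, m.get("master/elected"), m.get("registrar/log/recovered")))
    time.sleep(5)
{code}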


[jira] [Commented] (MESOS-5832) Mesos replicated log corruption with disconnects from ZK

2016-07-11 Thread Christopher M Luciano (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371742#comment-15371742
 ] 

Christopher M Luciano commented on MESOS-5832:
--

We believe that the replicated log is the problem because we have observed some masters reporting that only X agents are registered, while other masters still report the old value of Y agents.
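
One way to make that discrepancy concrete is to compare what each master reports side by side. A minimal sketch, assuming the standard /metrics/snapshot endpoint on port 5050 and placeholder hostnames m1-m5:

{code}
#!/usr/bin/env python3
# Minimal sketch: print the agent counts each master reports so the X-vs-Y
# discrepancy is visible at a glance. Hostnames m1-m5 are placeholders.
import json
import urllib.request

MASTERS = ["m1", "m2", "m3", "m4", "m5"]

for host in MASTERS:
    url = "http://%s:5050/metrics/snapshot" % host
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            metrics = json.loads(resp.read().decode("utf-8"))
    except Exception as exc:
        print("%s: unreachable (%s)" % (host, exc))
        continue
    print("%s: elected=%s active_agents=%s inactive_agents=%s" % (
        host,
        metrics.get("master/elected"),
        metrics.get("master/slaves_active"),
        metrics.get("master/slaves_inactive")))
{code}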


[jira] [Commented] (MESOS-5832) Mesos replicated log corruption with disconnects from ZK

2016-07-11 Thread Christopher M Luciano (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371739#comment-15371739
 ] 

Christopher M Luciano commented on MESOS-5832:
--

We do have this flag --registry_fetch_timeout=15mins.
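
For reference, the value actually in effect on a running master can be read back from its /flags endpoint rather than from the init configuration. A minimal sketch with a placeholder hostname, assuming the endpoint returns the flags as a JSON map keyed under "flags":

{code}
#!/usr/bin/env python3
# Minimal sketch: print the registry-related flags a running master is using.
# "m1" is a placeholder hostname; the response is assumed to carry the flags
# under a top-level "flags" key.
import json
import urllib.request

with urllib.request.urlopen("http://m1:5050/flags", timeout=5) as resp:
    data = json.loads(resp.read().decode("utf-8"))

flags = data.get("flags", data)
for name in sorted(flags):
    if name.startswith("registry"):
        print("--%s=%s" % (name, flags[name]))
{code}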


[jira] [Created] (MESOS-5832) Mesos replicated log corruption with disconnects from ZK

2016-07-11 Thread Christopher M Luciano (JIRA)
Christopher M Luciano created MESOS-5832:


 Summary: Mesos replicated log corruption with disconnects from ZK
 Key: MESOS-5832
 URL: https://issues.apache.org/jira/browse/MESOS-5832
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.27.1, 0.25.1
Reporter: Christopher M Luciano


Setup:
I set up 5 mesos and marathon masters (which I'll refer to as m1, m2, m3, m4, m5) running mesos 0.27.2 (confirmed to affect 0.25.0 also).
I set up 5 mesos agents (which I'll refer to as a1, a2, a3, a4, a5) running the same mesos version as the masters.
All of these were pointed at a single zookeeper (NOT an ensemble).
mesos-slave and mesos-master are run by upstart, and both are configured to be restarted on halting/crashing.

Procedure:
1) I confirm a mesos master has been elected and all agents have been discovered
2) On the zookeeper machine, I add an IPTABLES rule which blocks all incoming traffic from m1 and m2
3) The mesos-master processes on m1 and m2 halt and upstart restarts them. They are not able to communicate with zookeeper, and therefore are no longer considered part of the cluster
4) A leader election happens (m3 is elected leader)
5) I shut down the mesos-slave process on a1 (note: I do an "initctl stop mesos-slave"; just killing it would cause it to be restarted)
6) I wait to confirm the slave is reported as down by m3
7) I add IPTABLES rules on the zookeeper machine to block all incoming traffic from m3, m4, and m5
8) I confirm that the mesos-master processes on m3, m4, and m5 have all halted and restarted
9) I confirm that all masters report themselves as not in the cluster
10) I remove the IPTABLES rule from the zookeeper machine that is blocking all traffic from m1 and m2
11) m1 and m2 now report they are part of the cluster; there is a leader election and either m1 or m2 is now elected leader. NOTE: because the cluster does not have quorum, no agents are listed.

12) I shut down the mesos-slave process on a2
13) In the logs of the current leading master, I can see this information being processed by the master.
14) I add IPTABLES rules on the zookeeper machine to block all masters
15) I wait for all masters to report themselves as not being in the cluster
16) I remove all IPTABLES rules on the zookeeper machine
17) All masters join the cluster, and a leader election happens
18) After ten minutes, the leader's mesos-master process halts, a leader election happens... and this repeats every 10 minutes

Summary:
Here is what I think is happening in the above test case: at the end of step 17, the masters all try to do replicated log reconciliation and can't. I think the state of the agents isn't actually relevant; the replicated log reconciliation hangs or fails silently. After 10 minutes, the leader hits the timeout for communicating with the registry (i.e. zookeeper) - even though it can communicate with zookeeper, it never does because of the previous hang/silent failure.


Attached is a Perl script I used on the zookeeper machine to automate the steps above. If you want to use it, you'll need to change the IPs set in the script and make sure that one of the first 2 IPs is the current mesos master.
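
The attached script itself is not reproduced here; the sketch below is a hypothetical Python equivalent of that kind of automation (placeholder master IPs, run as root on the zookeeper machine) and is only meant to illustrate the shape of the test. The agent shutdowns in steps 5 and 12 are done by hand.

{code}
#!/usr/bin/env python3
# Hypothetical sketch of the automation described above: block and unblock the
# masters at the ZooKeeper host by adding/removing iptables DROP rules.
# The IP addresses and sleep durations are placeholders.
import subprocess
import time

MASTERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]

def block(ips):
    for ip in ips:
        subprocess.check_call(
            ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"])

def unblock(ips):
    for ip in ips:
        subprocess.check_call(
            ["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"])

# Step 2: partition m1/m2 from zookeeper.
block(MASTERS[:2])
time.sleep(120)  # placeholder wait for the masters to halt and restart

# Step 7: partition m3-m5 as well.
block(MASTERS[2:])
time.sleep(120)

# Step 10: let m1/m2 back in (still no quorum).
unblock(MASTERS[:2])
time.sleep(120)

# Step 14: block the remaining reachable masters (m3-m5 are still blocked).
block(MASTERS[:2])
time.sleep(120)

# Step 16: remove every rule and restore all connectivity.
unblock(MASTERS)
{code}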
