[jira] Commented: (QPID-2992) Cluster failing to resurrect durable static route depending on order of shutdown

2011-02-03 Thread michael j. goulish (JIRA)

[ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990165#comment-12990165 ]

michael j. goulish commented on QPID-2992:
--

Using modifications of Ken's script, I have reproduced two
bad behaviors, including the one that Mark is reporting.

I don't think this is a bug, exactly... or rather, it points to two
issues of a different kind. I will submit a doc bug, and probably one
enhancement request.

What's happening is this:  messaging systems that include
clusters and stores are sensitive to timing issues around
events like broker introduction and shut-down.

Here are the timing issues that I know of:

1. When you shut down a cluster that is using a store,
   there must be time for the last broker standing to
   realize his status and mark his store as clean, i.e.
   "my store is the one we should use at re-start."  If
   all brokers are killed too quickly, this will not
   happen, and the cluster will not be able to restart,
   because it will not find any store that has been
   marked clean.


2. When you make a topology change, e.g. adding a route
   from one cluster to another to create a federation of
   clusters, and then shut down the cluster soon afterwards,
   you may kill it before that topology change has had a
   chance to propagate across the cluster.

   This can cause a problem on re-start that depends on the
   order in which the brokers are killed.  If you *first* kill
   the broker that knew about the topology change, before he
   manages to communicate that knowledge to the other broker,
   that's bad: the other broker will be the last man standing,
   and it will be *his* store that gets marked as clean!  So
   his store will be re-used at startup, and the cluster will
   have lost all knowledge of the topology change.
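
To make the safe ordering concrete, here is a minimal shutdown sketch
for the two races above. It is an assumption-laden illustration, not
project tooling: hosts B1 and B2 are hypothetical, qpidd is stopped
via pkill over ssh, and the fixed sleeps stand in for real readiness
checks.

#!/bin/bash
# Sketch: shut down a two-broker cluster in an order that avoids
# both races above.  B1 is assumed to be the broker on which the
# route was created.  Hosts and timings are illustrative.

# Race 2: give a fresh topology change (e.g. a new qpid-route) time
# to propagate across the cluster before tearing anything down.
sleep 5

# Kill B2 first, so that B1 -- the broker that knows about the
# route -- ends up the last man standing.
ssh B2 'pkill qpidd'

# Race 1: let B1 notice that it is now the last broker, so it can
# mark its store as clean before we kill it too.
sleep 5
ssh B1 'pkill qpidd'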


By altering the timing of events in Ken's script, I was able
to:

A. get no failures in 200 runs.  (Original script, plus explicit
   wait-loops for brokers; see the sketch just after this list.)

B. get 100% failure because of no clean store.  (Kill both brokers
   in the B cluster too close together.)

C. get the failure that Mark reported, about 7% of the time.
   (Place B1 under load, then kill it too soon after route
   creation.)
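
For reference, the "explicit wait-loops for brokers" in variant A were
along these lines. A minimal sketch, assuming a hypothetical HOST/PORT
and relying on qpid-config exiting non-zero while the broker is
unreachable:

#!/bin/bash
# Sketch of a wait-loop: poll a broker until it accepts connections,
# instead of sleeping a fixed interval after startup.
# HOST/PORT and the 60-try budget are illustrative assumptions.

HOST=B1
PORT=5672

for i in $(seq 1 60); do
    # qpid-config fails while the broker is still unreachable.
    if qpid-config -a "$HOST:$PORT" queues >/dev/null 2>&1; then
        echo "broker $HOST:$PORT is up"
        exit 0
    fi
    sleep 1
done

echo "broker $HOST:$PORT did not come up in time" >&2
exit 1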


So, here's what I will propose...

I. A bit of documentation (I will take a first sketch-whack at it,
   then hand it to the doc professionals) to centralize the description
   of this type of problem -- the two I have mentioned above, plus
   whatever anyone else thinks up that is similar.

   This will include best-practices on how to avoid this type of
   problem.


II. A request for enhancement wherever there is no good way to avoid
    one of these multi-broker race conditions.


III. I'll come back and update this Jira with the numbers of any
     resultant Jiras that I open.



[jira] Commented: (QPID-2992) Cluster failing to resurrect durable static route depending on order of shutdown

2011-01-10 Thread Mark Moseley (JIRA)

[ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979883#action_12979883 ]

Mark Moseley commented on QPID-2992:


On one of the nodes in question, I tried reproducing with this script, and it 
seemed to work perfectly. I added authentication as well, and it continued to 
work OK. Your test script is pretty much exactly what I'm doing, too.

I wonder, though (I'm just trying to think of reasons why it'd act 
differently in the two scenarios): can you try this out on 4 separate nodes, 
even if virtualized? When I reproduce this on the physical nodes with debug 
logging turned on, the log doesn't mention the node on the other side of the 
federated link, whereas when it does work, I see this in the logs:

2011-01-10 19:35:12 debug Known hosts for peer of inter-broker link: 
amqp:tcp:10.1.58.3:5672 amqp:tcp:10.1.58.4:5672 

Running through this again today, I noticed that sometimes, with a completely 
fresh cluster, the connection in a B2-B1-B1-B2 shutdown/startup does work. 
But then I do it again and it doesn't. Or if I do the opposite order it breaks 
as well.

I just modified your script so that after the first round of 
stop/start/check-binding, it flips the order, shuts them down again, and 
starts them up -- and yes, I realize this is the opposite order from my 
ticket :) -- and re-checks the bindings, and they're gone (roughly as 
sketched below). I'm attaching the output of your script.
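
The modification was roughly of the following shape. This is a sketch
only: stop_broker, start_broker, and check_binding are hypothetical
stand-ins, not the real contents of cluster-fed.sh.

#!/bin/bash
# Sketch of the flipped-order test.  The three helpers below are
# hypothetical placeholders for whatever cluster-fed.sh actually does.

cycle () {
    local first=$1 second=$2
    stop_broker  "$first"
    stop_broker  "$second"
    start_broker "$second"
    start_broker "$first"
    # Did the durable route's binding reappear on the remote cluster?
    check_binding bosmyex1 || echo "binding missing after $first/$second shutdown"
}

cycle B1 B2   # round 1: stop B1 then B2, start B2 then B1 -- binding survives
cycle B2 B1   # round 2, flipped order -- binding is gone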

(Just for clarification, 10.1.58.3==exp01==A1, 10.1.58.4==exp02==A2, 
10.20.58.1==bosmsg01==B1, and 10.20.58.2==bosmsg02==B2. I've been trying to 
regex the hostnames so you guys didn't have to deal with following my 
hostnames, but if you guys prefer, I don't mind just using the real names.)



[jira] Commented: (QPID-2992) Cluster failing to resurrect durable static route depending on order of shutdown

2011-01-10 Thread Mark Moseley (JIRA)

[ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979891#action_12979891 ]

Mark Moseley commented on QPID-2992:


I also rewrote the script to do a B1-B2-B2-B1 shutdown/startup sequence 
first (the binding was visible after that), then do a B2-B1-B1-B2 stop/start, 
and the binding wasn't there. Maybe it gets a single freebie in a super-clean 
cluster?

I had originally posted to the list since I figured I was probably doing 
something wrong, so there could be some conceptual problem on my part, i.e. 
maybe it's not supposed to work like I'm expecting.

 Cluster failing to resurrect durable static route depending on order of 
 shutdown
 

 Key: QPID-2992
 URL: https://issues.apache.org/jira/browse/QPID-2992
 Project: Qpid
  Issue Type: Bug
  Components: C++ Broker, C++ Clustering
Affects Versions: 0.8
 Environment: Debian Linux Squeeze, 32-bit, kernel 2.6.36.2, Dell 
 Poweredge 1950s. Corosync==1.3.0, Openais==1.1.4
Reporter: Mark Moseley
Assignee: Alan Conway
 Attachments: cluster-fed.sh, error


 I've got a 2-node qpid test cluster at each of 2 datacenters, which are 
 federated together with a single durable static route between each. Qpid is 
 version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, 
 respectively). OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. 
 The static route is durable and is set up over SSL (but I can replicate as 
 well with non-SSL). I've tried to normalize the hostnames below to make 
 things clearer; hopefully I didn't mess anything up.
 Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B 
 (with B1 and B2), I've got a static exchange route from A1 to B1, as well as 
 another from B1 to A1. Federation is working correctly, so I can send a 
 message on A2 and have it successfully retrieved on B2. The exchange local to 
 cluster A is walmyex1; the local exchange for B is bosmyex1.
 If I shut down the cluster in this order: B2, then B1, and start back up with 
 B1, B2, the static route fails to get recreated. That is, on A1/A2, 
 looking at the bindings, exchange 'bosmyex1' does not get re-bound to cluster 
 B; the only output for it in qpid-config exchanges --bindings is just:
 <snip>
 Exchange 'bosmyex1' (direct)
 </snip>
 If however I shut the cluster down in this order: B1, then B2, and start B2, 
 then B1, the static route gets re-bound. The output then is:
 <snip>
 Exchange 'bosmyex1' (direct)
     bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
 </snip>
 and I can message over the federated link with no further modification. Prior 
 to a few minutes ago, I was seeing this with the Squeeze stock openais==1.1.2 
 and corosync==1.2.1. In debugging this, I've upgraded both to the latest 
 versions with no change.
 I can replicate this every time I try. These are just test clusters, so I 
 don't have any other activity going on on them, or any other 
 exchanges/queues. My steps:
 On all boxes in cluster A and B:
 * Kill the qpidd if it's running and delete all existing store files, i.e. 
 contents of /var/lib/qpid/
 On host A1 in cluster A (I'm leaving out the -a user/t...@host stuff):
 * Start up qpid
 * qpid-config add exchange direct bosmyex1 --durable
 * qpid-config add exchange direct walmyex1 --durable
 * qpid-config add queue walmyq1 --durable
 * qpid-config bind walmyex1 walmyq1 unix.waltham.cust
 On host B1 in cluster B:
 * qpid-config add exchange direct bosmyex1 --durable
 * qpid-config add exchange direct walmyex1 --durable
 * qpid-config add queue bosmyq1 --durable
 * qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
 On cluster A:
 * Start other member of cluster, A2
 * qpid-route route add amqps://user/p...@hosta1:5671 
 amqps://user/p...@hostb1:5671 walmyex1 unix.waltham.cust -d
 On cluster B:
 * Start other member of cluster, B2
 * qpid-route route add amqps://user/p...@hostb1:5671 
 amqps://user/p...@hosta1:5671 bosmyex1 unix.boston.cust -d
 On either cluster:
 * Check qpid-config exchanges --bindings to make sure bindings are correct 
 for remote exchanges
 * To see correct behaviour, stop cluster in the order B1-B2, or A1-A2, 
 start cluster back up, check bindings.
 * To see broken behaviour, stop cluster in the order B2-B1, or A2-A1, start 
 cluster back up, check bindings.
 This is a test cluster, so I'm free to do anything with it, debugging-wise, 
 that would be useful. 
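
One note on the final "check bindings" steps above: checking once,
immediately after restart, can race against the federation link
re-establishing. A minimal polling sketch, with the host and retry
budget as assumptions:

#!/bin/bash
# Sketch: poll for the re-bound static route instead of checking once.
# HOST and the 30-try budget are illustrative assumptions.

HOST=A1

for i in $(seq 1 30); do
    # The bridge queue (bridge_queue_1_<uuid>) shows up in the binding
    # once the durable route has been re-established.
    if qpid-config -a "$HOST" exchanges --bindings | grep -q bridge_queue; then
        echo "static route re-bound on $HOST"
        exit 0
    fi
    sleep 2
done

echo "binding never reappeared on $HOST" >&2
exit 1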

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

