[ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979883#action_12979883 ]
Mark Moseley commented on QPID-2992:
------------------------------------

On one of the nodes in question, I tried reproducing with this script and it seemed to work perfectly. I added authentication as well, and it continued to work OK. Your test script is pretty much exactly what I'm doing too.

I wonder, though (I'm just trying to think of reasons why it would act differently in the two scenarios): can you try this out on 4 separate nodes, even if virtualized? When I reproduce the failure on the physical nodes with debug logging turned on, the log doesn't mention the node on the other side of the federated link, whereas when it does work, I see this in the logs:

2011-01-10 19:35:12 debug Known hosts for peer of inter-broker link: amqp:tcp:10.1.58.3:5672 amqp:tcp:10.1.58.4:5672

Running through this again today, I noticed that sometimes, with a completely fresh cluster, the connection in a B2->B1->B1->B2 shutdown/startup does work. But then I do it again and it doesn't. Or if I do the opposite order, it breaks as well.

I just modified your script so that after the first round of stop/start/check-binding, it flips the order, shuts the brokers down again, starts them up, and re-checks the bindings (and yes, I realize this is the opposite order from my ticket :) ). After that second round the bindings are gone. I'm attaching the output of your script.

Just for clarification, the hosts map as follows:

10.1.58.3  == exp01    == A1
10.1.58.4  == exp02    == A2
10.20.58.1 == bosmsg01 == B1
10.20.58.2 == bosmsg02 == B2

I've been trying to regex the hostnames so you guys didn't have to deal with following my hostnames, but if you prefer, I don't mind just using the real names.
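To compare a working and a failing restart quickly, the debug line above can be pulled out of a broker log. This is only a minimal sketch: it assumes the log is fed on stdin and that the message text is exactly as shown above, and `known_hosts_for_link` is a made-up helper name, not part of qpid.

```shell
#!/bin/sh
# Hypothetical helper (not part of qpid): print the address list from each
# "Known hosts for peer of inter-broker link:" debug line read on stdin.
# Prints nothing if the broker never logged that line.
known_hosts_for_link() {
    sed -n 's/^.*Known hosts for peer of inter-broker link: *//p'
}
```

Empty output after a restart would match the failing case described above, where the log never mentions the peer. The function reads whatever file you redirect into it; where qpidd writes its debug log depends on your configuration.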
> Cluster failing to resurrect durable static route depending on order of shutdown
> --------------------------------------------------------------------------------
>
> Key: QPID-2992
> URL: https://issues.apache.org/jira/browse/QPID-2992
> Project: Qpid
> Issue Type: Bug
> Components: C++ Broker, C++ Clustering
> Affects Versions: 0.8
> Environment: Debian Linux Squeeze, 32-bit, kernel 2.6.36.2, Dell Poweredge 1950s. Corosync==1.3.0, Openais==1.1.4
> Reporter: Mark Moseley
> Assignee: Alan Conway
> Attachments: cluster-fed.sh
>
>
> I've got a 2-node qpid test cluster at each of 2 datacenters, which are federated together with a single durable static route between each. Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. The static route is durable and is set up over SSL (but I can replicate as well with non-SSL). I've tried to normalize the hostnames below to make things clearer; hopefully I didn't mess anything up.
>
> Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B (with B1 and B2), I've got a static exchange route from A1 to B1, as well as another from B1 to A1. Federation is working correctly, so I can send a message on A2 and have it successfully retrieved on B2. The exchange local to cluster A is walmyex1; the local exchange for B is bosmyex1.
>
> If I shut down the cluster in this order: B2, then B1, and start back up with B1, B2, the static route fails to get recreated. That is, on A1/A2, looking at the bindings, exchange 'bosmyex1' does not get re-bound to cluster B; the only output for it in "qpid-config exchanges --bindings" is just:
>
> <snip>
> Exchange 'bosmyex1' (direct)
> </snip>
>
> If however I shut the cluster down in this order: B1, then B2, and start B2, then B1, the static route gets re-bound. The output then is:
>
> <snip>
> Exchange 'bosmyex1' (direct)
>     bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
> </snip>
>
> and I can message over the federated link with no further modification. Prior to a few minutes ago, I was seeing this with the Squeeze stock openais==1.1.2 and corosync==1.2.1. In debugging this, I've upgraded both to the latest versions with no change.
>
> I can replicate this every time I try. These are just test clusters, so I don't have any other activity going on on them, or any other exchanges/queues. My steps:
>
> On all boxes in cluster A and B:
> * Kill the qpidd if it's running and delete all existing store files, i.e. contents of /var/lib/qpid/
>
> On host A1 in cluster A (I'm leaving out the -a user/t...@host stuff):
> * Start up qpid
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue walmyq1 --durable
> * qpid-config bind walmyex1 walmyq1 unix.waltham.cust
>
> On host B1 in cluster B:
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue bosmyq1 --durable
> * qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
>
> On cluster A:
> * Start other member of cluster, A2
> * qpid-route route add amqps://user/p...@hosta1:5671 amqps://user/p...@hostb1:5671 walmyex1 unix.waltham.cust -d
>
> On cluster B:
> * Start other member of cluster, B2
> * qpid-route route add amqps://user/p...@hostb1:5671 amqps://user/p...@hosta1:5671 bosmyex1 unix.boston.cust -d
>
> On either cluster:
> * Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
> * To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
> * To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.
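
The "check bindings" step in the list above can be scripted rather than eyeballed. What follows is a minimal sketch, not anything taken from the attached cluster-fed.sh: it assumes the `qpid-config exchanges --bindings` layout quoted in the snippets above (an `Exchange '<name>' (<type>)` header followed by zero or more `bind [...]` lines, possibly indented), and `exchange_has_bindings` is a made-up function name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the named exchange has at least one
# "bind [" line under its header in `qpid-config exchanges --bindings` output
# read from stdin; fail (exit 1) otherwise.
# Assumption: header lines look like "Exchange 'name' (type)", so the quoted
# name is the second whitespace-separated field.
exchange_has_bindings() {
    awk -v name="$1" '
        # \047 is a single quote; track whether we are inside the wanted exchange
        /^Exchange /                    { in_ex = ($2 == "\047" name "\047"); next }
        in_ex && /^[[:space:]]*bind \[/ { found = 1 }
        END                             { exit(found ? 0 : 1) }
    '
}
```

On A1 or A2 this would be used as `qpid-config exchanges --bindings | exchange_has_bindings bosmyex1`, adding whatever `-a` broker argument you normally pass. A non-zero exit status after a restart corresponds to the broken case, where the exchange header is printed with no binding beneath it.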
>
> This is a test cluster, so I'm free to do anything with it, debugging-wise, that would be useful.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscr...@qpid.apache.org