Sorry in advance that this is long. I've tried to explain it as succinctly but thoroughly as possible.
I've got a 2-node qpid test cluster at each of 2 datacenters, federated together with a single durable static route in each direction, set up over SSL. Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). The OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. This is quite possibly just a conceptual problem with how I'm setting this up, so if anyone has a 'right way' to do it, I'm all ears :)

Some naming first: call them cluster A with nodes A1 and A2, and cluster B with nodes B1 and B2. One static route is defined as A1->B1 for an exchange on cluster B (call it exchangeB), and the other is B1->A1 for an exchange on cluster A (call it exchangeA). (I've pasted the rough commands I used at the bottom of this mail.)

After setting this up, things seem to work pretty well. I can send from any node in cluster A to exchangeB and it's received by any receiving node in cluster B. Running "qpid-config ... exchanges --bindings" on the cluster A nodes shows the route to cluster B for exchangeB, and vice versa. That all seems good.

The trouble I'm having concerns failover. I'm finding that if I fail the cluster starting with the node the route lives on:

* Kill A1, kill A2, start A2, start A1 -> the bindings on cluster B for exchangeA get set back up automatically.

Also, after I kill A1, the route seems to fail over correctly to A2, i.e. with A1 dead and A2 still alive, qpid-route on B1 or B2 shows:

Exchange 'exchangeA' (direct) bind [mytopic] => bridge_queue_1_f6d80145-67d2-4659-b26e-80c4da3ae85b

If I stop the cluster in this order instead:

* Kill A2, kill A1, start A1, start A2 -> the bindings on cluster B for exchangeA don't get set up, i.e. on B1 or B2, qpid-route shows:

Exchange 'exchangeA' (direct)

Am I doing something wrong, or is this a known limitation? I'd expect that regardless of ordering, a durable route would come back up on its own, on either node. I'd also think that if it were a limitation, it would happen in the other order, when A2 was the last node standing, considering the route was created for A1.

I had tried earlier to use source routes for my routing, and they seemed to do better at coming back after failover, but on the source cluster's side the non-primary node (A2) would often blow up when cluster B was down and a node in cluster B came back online, always saying this in A2's qpid logs (10.1.58.3 is A1, 10.1.58.4 is A2):

2010-12-28 17:19:37 info ACL Allow id:walcl...@qpid action:create ObjectType:link Name:
2010-12-28 17:19:37 info Connection is a federation link
2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local error 3054 did not occur on member 10.1.58.3:3369: not-attached: Channel 1 is not)
2010-12-28 17:19:39 critical Error delivering frames: local error did not occur on all cluster members : not-attached: Channel 1 is not attached (qpid/a)
2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving cluster walclust
2010-12-28 17:19:39 notice Shut down

I'm pushing my luck with an email this long, but I'll mention one other weirdness. I was working on another test cluster where the IPs were 10.1.1.246 and 10.1.1.247. In the qpid logs they were fairly consistently referred to as 10.1.1.118 and 10.1.1.119, almost as if the 8th bit of the last octet was being cleared. Could be some localized bizarreness (though DNS and nsswitch both reported the IPs correctly), but I thought I'd mention it.
I haven't tried it out with other IPs where the 4th octet (or any octet) is over 128.
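
For reference, here's roughly how the two static routes were created. This is from memory rather than a paste of my shell history, so treat it as a sketch: the SSL connection options are left out, and I'm assuming the --durable flag and hostnames here match what I actually ran ('mytopic' is the binding key you can see in the qpid-route output above).

  # A1 -> B1: messages published to exchangeB on cluster A get forwarded to cluster B
  qpid-route --durable route add B1 A1 exchangeB mytopic

  # B1 -> A1: messages published to exchangeA on cluster B get forwarded to cluster A
  qpid-route --durable route add A1 B1 exchangeA mytopic

The earlier source-route attempt was, as far as I recall, the same commands with the src-local option (-s) added, so the link gets created on the source broker instead of the destination.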