Sorry in advance that this is long. I've tried to explain it as
succinctly but thoroughly as possible.

I've got a 2-node qpid test cluster at each of two datacenters,
federated together with durable static routes (one in each direction)
set up over SSL. Qpid is version 0.8. Corosync and openais are the
stock Squeeze packages (1.2.1-3 and 1.1.2-2, respectively). The OS is
Squeeze, 32-bit, kernel 2.6.36, on Dell PowerEdge 1950s.
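
In case the broker config matters: apart from the clustering, ACL and
SSL bits, qpidd.conf is essentially stock. Those bits look roughly like
this (the paths and values below are illustrative placeholders rather
than copies of the real config; only the cluster name is taken from the
logs further down):

  cluster-name=walclust
  auth=yes
  acl-file=/etc/qpid/qpidd.acl
  ssl-cert-db=/etc/qpid/ssl_db
  ssl-cert-name=<hostname>
  ssl-port=5671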

This is quite possibly just a conceptual problem with how I'm setting
this up, so if anyone has a 'right way' to do it, I'm all ears :)

Just a prelim: Call them cluster A with nodes A1 and A2, and cluster B
with nodes B1 and B2. The static route is defined as A1->B1 for an
exchange on cluster B (call it exchangeB), and the other route is
B1->A1 for an exchange on cluster A (call it exchangeA). After setting
this up, things seem to work pretty well. I can send from any node in
cluster A to exchangeB and it's received by consumers on either node in
cluster B. Running "qpid-config ... exchanges --bindings" on the
cluster A nodes shows the route to cluster B for exchangeB, and vice
versa on cluster B. That all seems good.
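
For reference, the routes were created along these lines. I'm typing
this from memory, the broker addresses are abbreviated, and the routing
key shown for exchangeB is a stand-in:

  # messages published to exchangeB on cluster A get forwarded to cluster B
  qpid-route --durable route add B1 A1 exchangeB mytopic
  # messages published to exchangeA on cluster B get forwarded to cluster A
  qpid-route --durable route add A1 B1 exchangeA mytopic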

The trouble I'm having is with failover. Whether the bindings come back
seems to depend on the order in which I stop and restart the cluster A
nodes. If I cycle the cluster in this order:
* Kill A1, kill A2, start A2, start A1  -> The bindings on cluster B
for exchangeA get set back up automatically

Also, after I kill A1, the route seems to fail over correctly to A2:
with A1 dead and A2 still alive, checking the bindings on B1 or B2
shows:
Exchange 'exchangeA' (direct)
    bind [mytopic] => bridge_queue_1_f6d80145-67d2-4659-b26e-80c4da3ae85b

If I stop the cluster in this order:

* Kill A2, kill A1, start A1, start A2  -> The bindings on cluster B
for exchangeA don't get set back up (repro sketch below); on B1 or B2
the bindings show only:
Exchange 'exchangeA' (direct)
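
To be concrete, the failing sequence is essentially the following (I'm
showing qpidd cycled via its init script purely for illustration;
substitute however you start and stop it):

  # stop A2, then A1
  ssh A2 /etc/init.d/qpidd stop
  ssh A1 /etc/init.d/qpidd stop
  # bring them back: A1 first, then A2
  ssh A1 /etc/init.d/qpidd start
  ssh A2 /etc/init.d/qpidd start
  # afterwards the exchangeA bridge binding is missing on cluster B
  qpid-config exchanges --bindings    # run on B1 or B2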

Am I doing something wrong, or is this a known limitation? I'd expect
that, regardless of ordering, a durable route would come back up on its
own on either node. And if it were a limitation, I'd have expected it
to show up in the other order, when A2 was the last node standing,
considering the route was created against A1.

I had earlier tried source routes instead (see the sketch after the log
excerpt below), and they seemed to do better at coming back after
failover. However, on the source cluster's side, the non-primary node
(A2) would often blow up when cluster B was down and a node in cluster
B came back online, always logging this in A2's qpid log (10.1.58.3 is
A1, 10.1.58.4 is A2):

2010-12-28 17:19:37 info ACL Allow id:walcl...@qpid action:create
ObjectType:link Name:
2010-12-28 17:19:37 info Connection is a federation link
2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1
is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local
error 3054 did not occur on member 10.1.58.3:3369: not-attached:
Channel 1 is not)
2010-12-28 17:19:39 critical Error delivering frames: local error did
not occur on all cluster members : not-attached: Channel 1 is not
attached (qpid/a)
2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving
cluster walclust
2010-12-28 17:19:39 notice Shut down
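
By "source routes" I mean routes added with qpid-route's -s/--src-local
option, so that the link and bridge configuration live on the source
broker and it pushes to the destination. Roughly, with the same
placeholder addresses and key as above:

  qpid-route --durable --src-local route add B1 A1 exchangeB mytopic
  qpid-route --durable --src-local route add A1 B1 exchangeA mytopic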


I'm pushing my luck with an email this long, but I'll mention one
other weirdness. I was working on another test cluster where the IPs
were 10.1.1.246 and 10.1.1.247. In the qpid logs they were fairly
consistently referred to as 10.1.1.118 and 10.1.1.119, almost as if
the 8th bit of the last octet were being cleared (246 - 128 = 118,
247 - 128 = 119). It could be some localized bizarreness (DNS and
nsswitch both reported the IPs correctly), but I thought I'd mention
it. I haven't tried other IPs where the fourth octet (or any octet)
is 128 or above.
