I have a requirement for a feature to replicate queue state between (clustered) brokers on a primary and DR site that I'm hoping to implement over the next couple of weeks.

Attached are some notes through which I've been trying to formulate my thoughts. If anyone else has any thoughts, ideas or comments, I'd be keen to hear them.

I'm looking for the simplest workable solution, both to minimise the amount of time needed to implement it and to keep things easy to reason about, which helps ensure robustness.

==Requirement:==

Need to replicate queue state between two clusters (or potentially two
non-clustered brokers); one of these is the primary site, the second
is a disaster recovery site. They will likely be connected over a WAN.

Two modes need to be supported concurrently on a per-queue
basis:

(i) only messages flowing into the queue need to be replicated; the DR
site will have active replicas of the consumers of such queues that
will be receiving and consuming them.

(ii) full queue state needs to be replicated, i.e. both the enqueuing
and dequeuing of messages on the primary site need to be reflected in
the DR site's replica of the queue.

==Design Notes:==

Core pattern is to have enqueue and dequeue events represented as
messages on a queue that can be consumed and processed. The aim here
is to reuse as much of the messaging platform itself as possible.

In the simplest form, we can have a replication queue on the primary
site into which we place events representing the enqueues and dequeues
as they occur. 
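As a rough sketch of what such events might look like, here is a minimal, hypothetical representation with a simple delimited encoding; the field names and wire format are illustrative only, not the actual Qpid encoding (a real implementation would presumably carry these as AMQP message properties/content):

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical replication event; field names are illustrative.
enum class EventType { Enqueue, Dequeue };

struct ReplicationEvent {
    EventType type;      // enqueue or dequeue
    std::string queue;   // name of the queue the event applies to
    uint64_t sequence;   // per-queue sequence number, for de-duplication
    std::string content; // message body (enqueue events only)
};

// Encode the event as a simple delimited string for the replication queue.
std::string encode(const ReplicationEvent& e) {
    std::ostringstream os;
    os << (e.type == EventType::Enqueue ? "enqueue" : "dequeue")
       << '|' << e.queue << '|' << e.sequence << '|' << e.content;
    return os.str();
}
```

The sequence number is carried from the start since later parts of the design (duplicate detection, reconciliation) depend on it.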

The qpid-route tool can then be used to have those events consumed
over a federation link, i.e. the primary broker will deliver them to
the DR broker, where they will appear as published messages.

We can then have a special exchange that processes these events,
performing the necessary operations on the queues they refer to. (Note:
There may be some overlap with the management exchange and management
methods here that can be exploited?).

The same approach works between clustered and unclustered brokers. In
the case of clustered brokers, a link will be terminated whenever the
node at one end of it fails and a replacement link will need to be
created. 

By using acknowledgments to ensure at-least-once delivery and event
sequence numbers to detect and ignore duplicate deliveries we can
still ensure that replication is reliable.
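The duplicate-suppression half of that could be as simple as the following sketch: the DR side tracks, per queue, the highest sequence number it has applied, and ignores anything at or below it (class and method names here are assumptions, not existing Qpid code):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Sketch of duplicate suppression on the DR side: after a failover the
// link may redeliver events, so each queue tracks the highest sequence
// number already applied and rejects anything at or below it.
class DuplicateFilter {
    std::map<std::string, uint64_t> lastApplied; // queue -> last seq applied
public:
    // Returns true if the event is new and should be applied.
    bool accept(const std::string& queue, uint64_t seq) {
        auto it = lastApplied.find(queue);
        if (it != lastApplied.end() && seq <= it->second) return false;
        lastApplied[queue] = seq;
        return true;
    }
};
```

This assumes events for a given queue arrive in order, which holds if they all flow through a single replication queue.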

Tasks required for simple option:

* add logic to broker/Queue.cpp to add messages to a replication queue
  for each enqueue and (optionally) dequeue

* create new exchange type to process messages representing
  enqueue/dequeue events and perform the relevant actions on the
  target queues

* resiliency of federation links between two clusters, such that if
  the node on either end goes down the link is re-established to/from
  another member of that cluster
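To make the second task concrete, here is a hedged sketch of what the core of the proposed replication exchange might do on the DR broker: apply each routed event to the replica of the queue it names. Real code would hook into the broker's exchange routing and operate on actual Queue objects; the in-memory deque here is purely illustrative:

```cpp
#include <deque>
#include <map>
#include <string>

// Illustrative core of the proposed replication exchange: enqueue
// events append to the named replica queue, dequeue events reflect the
// primary site consuming from the head.
class ReplicationExchange {
    std::map<std::string, std::deque<std::string>> replicas;
public:
    void applyEnqueue(const std::string& queue, const std::string& msg) {
        replicas[queue].push_back(msg);
    }
    void applyDequeue(const std::string& queue) {
        auto it = replicas.find(queue);
        if (it != replicas.end() && !it->second.empty())
            it->second.pop_front();
    }
    std::size_t depth(const std::string& queue) const {
        auto it = replicas.find(queue);
        return it == replicas.end() ? 0 : it->second.size();
    }
};
```

For mode (i) queues, dequeue events would simply never be generated, so only applyEnqueue comes into play.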

One issue with the above approach is that the consumption of messages
from the replication queue on the primary site will add extra load to
the primary broker. (Note: Can we determine how significant this is
likely to be? We can ack in large batches, which will reduce the load
to some extent).

To avoid the extra cluster load from the consumer pulling events from
the DR site, a special passive cluster member could be built. This
would join the cluster as a member and thus receive all the cluster
events, allowing it to keep its state in sync with the other
nodes. However, it would not generate any events of its own, and would
only be accessed by the DR site consuming from the replication queue.

Obviously this node would become a single point of failure for the
replication. That could be mitigated to some extent by using durable
storage for the replication queue. If the passive 'snooper' node were
restarted, it would recover from durable storage any replication
events it previously knew of that had not been acknowledged by the DR
site; it would then get a state dump from the cluster and would have
to somehow reconcile that with the state recovered from durable
storage. (Note: This could be quite tricky; the queue sequence number
may help if it is part of the state dump from the cluster.) It would
have missed the enqueue/dequeue events for any messages that were
enqueued *and* dequeued while it was not a member of the cluster, but
that would probably not be a problem.
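One way the sequence number could help, sketched below under the assumption that the state dump carries a per-queue sequence number (an assumption about the cluster protocol, not something it is known to provide today): the gap between the highest recovered event and the dump's sequence number tells the snooper how many events occurred while it was down and so cannot be replayed from storage.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A replication event recovered from durable storage; only the
// sequence number matters for this reconciliation sketch.
struct Recovered { uint64_t sequence; };

// Returns the number of events that happened while the snooper was not
// a cluster member, assuming the state dump reports the queue's
// current sequence number (snapshotSequence).
uint64_t missedEvents(const std::vector<Recovered>& recovered,
                      uint64_t snapshotSequence) {
    uint64_t last = 0;
    for (const auto& r : recovered) last = std::max(last, r.sequence);
    return snapshotSequence > last ? snapshotSequence - last : 0;
}
```

A non-zero result would at least let the snooper detect the gap, even if reconstructing the missed events requires further work.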

Extra work required for this approach:

* some way of reconciling state recovered from durable storage with
  that received through the cluster dump on rejoining

* altered hooks etc to prevent the replication queue being itself
  replicated in the local cluster while still allowing the DR site to
  consume from it

A different approach might be for one node in the primary cluster to
simply forward all the cluster traffic down a connection to the DR
site. The DR site would then resubmit this incoming data through the
cluster broadcast mechanism. The stream could be filtered to remove
those commands whose effects were not intended to be replicated
between the clusters.
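The filtering step might look something like this sketch; the command type names are made up for illustration and do not correspond to the actual cluster protocol:

```cpp
#include <string>
#include <vector>

// An illustrative cluster command: a type tag plus an opaque body.
struct ClusterCommand { std::string type; std::string body; };

// Keep only the commands whose effects should be replicated to the DR
// site; everything else (heartbeats, local subscriptions, etc.) is
// dropped from the forwarded stream.
std::vector<ClusterCommand> filterForReplication(
        const std::vector<ClusterCommand>& stream) {
    std::vector<ClusterCommand> out;
    for (const auto& c : stream)
        if (c.type == "enqueue" || c.type == "dequeue")
            out.push_back(c);
    return out;
}
```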

The issue with this approach is whether it's possible to ensure that
the incoming stream from the primary site can be safely (and simply)
merged with the cluster traffic generated by the DR site itself as
messages are consumed from the active queues.
