The journal getting corrupted could happen in two situations:

- The file system is damaged by the infrastructure (hardware failures, kernel issues, etc.).
  * If you have a reliable file system here, I'm not sure how concerned you should be.
- Some invalid data in the journal makes the broker fail upon restart. I have seen only a handful of issues raised like this, and as with any bug, we fix them when reported. I am not aware of any at the moment. So I think it would be considerably safe to just reconnect the pod.

So damage to the file system or journal after a failure is, IMO, a disaster situation, and to mitigate that I can only think of the mirror. (I put a rough sketch of the client-side failover piece below the quoted message.)

On Thu, Mar 11, 2021 at 8:53 AM David Martin <dav...@qoritek.com> wrote:

> Hi,
>
> Looking to host an Artemis cluster in Kubernetes and am not sure how to
> achieve full local resilience. (Clusters for DR and remote distribution
> will be added later using the mirroring feature introduced with v2.16.)
>
> It is configured as 3 active cluster members using static discovery because
> the particular cloud provider does not officially support UDP on its
> managed Kubernetes service network.
>
> There are no backup brokers (active/passive) because the stateful set takes
> care of restarting failed pods immediately.
>
> Each broker has its own networked storage so is resilient in terms of local
> state.
>
> Message redistribution is ON_DEMAND. Publishing is to topics and consuming
> is from durable topic subscription queues.
>
> Publishers and consumers are connecting round-robin with client IP
> affinity/stickiness.
>
> What I'm concerned about is the possibility of journal corruption on one
> broker. Publishers and consumers will fail over to either of the remaining 2
> brokers, which is fine, but some data could be lost permanently as follows.
>
> Hypothetically, consider that Publisher 1 is publishing to Broker 1 and
> Publisher 2 is publishing to Broker 3. Consumer 1 is consuming from Broker
> 2 and Consumer 2 is consuming from Broker 1. There are more consumers and
> publishers but using 2 of each just to illustrate.
>
> Publisher 1 -> Broker 1 -> Broker 2 -> Consumer 1
> Publisher 2 -> Broker 3 -> Broker 2 -> Consumer 1
> Publisher 1 -> Broker 1 -> Consumer 2
> Publisher 2 -> Broker 3 -> Broker 1 -> Consumer 2
>
> This all works very well with full data integrity and good performance :)
>
> However if, say, Broker 1's journal got corrupted and it went down
> permanently as a result, any data from Publisher 1 which hadn't yet been
> distributed to Consumer 1 (via Broker 2) or *particularly* Consumer 2
> (directly) would be lost (unless the journal could be recovered).
>
> Is there some straightforward configuration to avoid or reduce this
> possibility? Perhaps a 4-broker cluster could have affinity for publishers
> on 2 brokers and affinity for consumers on the other 2, somehow?
>
>
> Thanks for any advice you can offer.
>
>
> Dave Martin.

--
Clebert Suconic
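
For what it's worth, here is a minimal sketch of the client-side failover piece of the setup described above. It assumes hypothetical host names (broker-0 through broker-2) and the Artemis JMS client; adapt names and ports to your stateful set. Listing all three brokers in the connection URL lets a publisher or consumer bootstrap from, and reconnect to, whichever members are still alive; reconnectAttempts=-1 retries indefinitely. This is an illustration, not a drop-in configuration.

import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class FailoverPublisherSketch {

    public static void main(String[] args) throws JMSException {
        // Hypothetical pod host names; the real ones would come from the
        // Kubernetes headless service backing the stateful set. Listing
        // every broker lets the client fall back to surviving members.
        String url = "(tcp://broker-0:61616,tcp://broker-1:61616,tcp://broker-2:61616)"
                + "?reconnectAttempts=-1";

        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(url);

        try (Connection connection = factory.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Publish to a topic, matching the topology in the question.
            Topic topic = session.createTopic("example.topic");
            MessageProducer producer = session.createProducer(topic);
            producer.send(session.createTextMessage("hello"));
        }
        // Note: failover protects connectivity only. Messages already
        // persisted in a corrupted journal are not recovered by this;
        // that is what the mirror (or journal recovery) is for.
    }
}

A consumer would do the same, plus connection.setClientID(...) and session.createDurableSubscriber(topic, "subscription-name") so its durable subscription queue survives reconnects.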