The journal getting corrupted could happen in two situations:

- The file system is damaged by the infrastructure (hardware failures, kernel issues, etc.).
  * If you have a reliable file system here, I'm not sure how concerned you should be.
- Some invalid data in the journal makes the broker fail upon restart. I have seen only a handful of issues raised like this, and as with any bug, we fix them when reported. I am not aware of any at the moment. So I think it would be considerably safe to just reconnect the pod.

So damage to the file system or journal after a failure is, IMO, a disaster situation, and to mitigate that I can only think of the mirror. (I put a rough sketch of the client-side failover piece below the quoted message.)

On Thu, Mar 11, 2021 at 8:53 AM David Martin <dav...@qoritek.com> wrote:

> Hi,
>
> Looking to host an Artemis cluster in Kubernetes and am not sure how to
> achieve full local resilience. (Clusters for DR and remote distribution
> will be added later using the mirroring feature introduced with v2.16.)
>
> It is configured as 3 active cluster members using static discovery because
> the particular cloud provider does not officially support UDP on its
> managed Kubernetes service network.
>
> There are no backup brokers (active/passive) because the stateful set takes
> care of restarting failed pods immediately.
>
> Each broker has its own networked storage so is resilient in terms of local
> state.
>
> Message redistribution is ON_DEMAND. Publishing is to topics and consuming
> is from durable topic subscription queues.
>
> Publishers and consumers are connecting round-robin with client IP
> affinity/stickiness.
>
> What I'm concerned about is the possibility of journal corruption on one
> broker. Publishers and consumers will fail over to either of the remaining 2
> brokers, which is fine, but some data could be lost permanently as follows.
>
> Hypothetically, consider that Publisher 1 is publishing to Broker 1 and
> Publisher 2 is publishing to Broker 3. Consumer 1 is consuming from Broker
> 2 and Consumer 2 is consuming from Broker 1. There are more consumers and
> publishers but using 2 of each just to illustrate.
>
> Publisher 1 -> Broker 1 -> Broker 2 -> Consumer 1
> Publisher 2 -> Broker 3 -> Broker 2 -> Consumer 1
> Publisher 1 -> Broker 1 -> Consumer 2
> Publisher 2 -> Broker 3 -> Broker 1 -> Consumer 2
>
> This all works very well with full data integrity and good performance :)
>
> However if, say, Broker 1's journal got corrupted and it went down
> permanently as a result, any data from Publisher 1 which hadn't yet been
> distributed to Consumer 1 (via Broker 2) or *particularly* Consumer 2
> (directly) would be lost (unless the journal could be recovered).
>
> Is there some straightforward configuration to avoid or reduce this
> possibility? Perhaps a 4-broker cluster could have affinity for publishers
> on 2 brokers and affinity for consumers on the other 2, somehow?
>
>
> Thanks for any advice you can offer.
>
>
> Dave Martin.

--
Clebert Suconic
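
For what it's worth, here is a minimal sketch of the client-side failover piece of the setup described above. It assumes hypothetical host names (broker-0 through broker-2) and the Artemis JMS client; adapt names and ports to your stateful set. Listing all three brokers in the connection URL lets a publisher or consumer bootstrap from, and reconnect to, whichever members are still alive; reconnectAttempts=-1 retries indefinitely. This is an illustration, not a drop-in configuration.

import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class FailoverPublisherSketch {

    public static void main(String[] args) throws JMSException {
        // Hypothetical pod host names; the real ones would come from the
        // Kubernetes headless service backing the stateful set. Listing
        // every broker lets the client fall back to surviving members.
        String url = "(tcp://broker-0:61616,tcp://broker-1:61616,tcp://broker-2:61616)"
                + "?reconnectAttempts=-1";

        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(url);

        try (Connection connection = factory.createConnection()) {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Publish to a topic, matching the topology in the question.
            Topic topic = session.createTopic("example.topic");
            MessageProducer producer = session.createProducer(topic);
            producer.send(session.createTextMessage("hello"));
        }
        // Note: failover protects connectivity only. Messages already
        // persisted in a corrupted journal are not recovered by this;
        // that is what the mirror (or journal recovery) is for.
    }
}

A consumer would do the same, plus connection.setClientID(...) and session.createDurableSubscriber(topic, "subscription-name") so its durable subscription queue survives reconnects.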