Tran Hong Quan created JAMES-4027:
-------------------------------------
Summary: Make all queues on Rabbitmq quorum queue when option
enabled
Key: JAMES-4027
URL: https://issues.apache.org/jira/browse/JAMES-4027
Project: James Server
Issue Type: Bug
Components: eventbus, Queue, rabbitmq
Reporter: Tran Hong Quan
Today, when the quorum option is enabled, only some queues are declared as
quorum queues. Others are not, e.g. the event bus notification queues and the
Task Manager's termination queues.
On a James deployment that uses quorum queues on a 3-node RabbitMQ cluster,
James is not fault tolerant when a RabbitMQ node goes down.
I tried to reproduce the problem, and here is my theory:
The RabbitMQ node that stores the notification queues is down
-> James cannot publish messages to RabbitMQ, which causes e.g. IMAP SELECT,
STORE, APPEND, UNSELECT ... commands to fail
-> James keeps retrying the failed publishes (the retry for Group registration
seems to rely on a classic queue too) and queues other IMAP requests
-> The IMAP server queue fills up and the exception `The IMAP server has reached
its maximum capacity` is thrown
-> James IMAP becomes a zombie, which leads to cascading failures.
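The third step of the chain above can be illustrated with a toy model. This is
only a sketch under stated assumptions, not James code: the class name, the
method and the capacity are hypothetical, and a plain bounded queue stands in
for the IMAP server queue. It shows that once nothing drains the queue (because
the in-flight requests are stuck retrying), every request beyond the capacity
is rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy model (hypothetical, not James code): a bounded queue of pending
// requests that is never drained, as when workers are stuck retrying.
class BoundedRequestQueueSketch {

    // Returns how many of the incoming requests are rejected because the
    // bounded queue is already full and no consumer is draining it.
    static int rejectedWhenStalled(int capacity, int incoming) {
        BlockingQueue<String> pending = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (int i = 0; i < incoming; i++) {
            // offer() returns false instead of blocking when the queue is full
            if (!pending.offer("request-" + i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Capacity 3, 5 incoming requests, nothing consumed
        System.out.println(rejectedWhenStalled(3, 5));
    }
}
```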
James needs to be more fault-tolerant in this case.
I propose we apply quorum queues to all queues when
`quorum.queues.enable=true`, so that the queues remain available even when a
RabbitMQ node is down, which helps James keep functioning well.
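As a rough illustration of the proposal, the sketch below builds the optional
arguments for a queue declaration from a single flag. It is a minimal sketch
and not the James implementation: the class and method names are hypothetical.
What it does rely on is the RabbitMQ convention that a quorum queue is
requested with the `x-queue-type` argument set to `quorum`. Note that RabbitMQ
requires quorum queues to be durable and non-exclusive, so queues currently
declared otherwise would need their other declaration flags adjusted as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not James code): derives queue declaration arguments
// from the value of the quorum option.
class QuorumQueueArguments {

    // Returns the arguments map to pass when declaring a queue.
    // With the option disabled the map is empty, so the broker default
    // queue type applies.
    static Map<String, Object> forQueue(boolean quorumQueuesEnabled) {
        Map<String, Object> arguments = new HashMap<>();
        if (quorumQueuesEnabled) {
            // RabbitMQ selects the queue type through this argument
            arguments.put("x-queue-type", "quorum");
        }
        return arguments;
    }

    public static void main(String[] args) {
        // With the RabbitMQ Java client, the map would be passed as the last
        // parameter of Channel.queueDeclare(name, durable, exclusive,
        // autoDelete, arguments), using durable=true and exclusive=false.
        System.out.println(forQueue(true));
        System.out.println(forQueue(false));
    }
}
```

Routing every queue declaration through one such helper is one way to make
sure no queue is left out when the option is enabled.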
We did a POC [here|https://github.com/apache/james-project/pull/2191], and
using quorum queues everywhere made James more fault tolerant, as expected.