> Do you share this point of view?

+1
Quan

On Mon, 15 Apr 2024 at 16:43, Benoit TELLIER <btell...@apache.org> wrote:

> Hi Quan,
>
> First, thanks for the work done on this topic.
>
> I know some members of the community (Karsten?) already did significant
> work on the topic, but more oriented toward the POP3 server.
>
> This work is of course welcome, as it would result in higher
> reliability for the IMAP / JMAP components.
>
> > What do you think about making all queues on RabbitMQ quorum queues
> > when the option is enabled?
>
> On the principle, +1
>
> In practice that is slightly harder for the event bus notification queue...
>
> - We can likely afford losing some of those pub sub messages?
> - The queue is tied to a connection, thus if the node/connection goes
>   down it can be recreated elsewhere?
> - We would need to come up with a cleanup strategy in order to
>   eventually delete queues hanging around.
> - Also, how relevant is this RabbitMQ backend pub sub implementation
>   when compared with the work done with Redis?
>
> IMO the eventbus notification was the main blocker in order to achieve
> decent HA with RabbitMQ.
>
> Do you share this point of view?
>
> Best regards,
>
> Benoit TELLIER
>
> On 15/04/2024 09:53, Quan tran hong wrote:
> > Hi folks,
> >
> > Recently we encountered an issue on a deployment using a RabbitMQ
> > cluster, where the outage of one RabbitMQ node (for about 1 hour)
> > more or less forced the James service down too.
> >
> > I created a Jira ticket to report the issue:
> > https://issues.apache.org/jira/projects/JAMES/issues/JAMES-4027
> >
> > More details below for those who have not read the Jira ticket yet:
> >
> > Today, when the quorum option is enabled, only some queues are quorum
> > queues, not all (the exceptions are e.g. event bus notification queues
> > and Task Manager's termination queues).
> >
> > I tried to reproduce the issue and here is my theory:
> >
> > The RabbitMQ node that stores the notification queues is down
> > -> James cannot publish messages to RabbitMQ, which causes e.g. IMAP
> > SELECT, STORE, APPEND, UNSELECT ... commands to fail
> > -> James keeps retrying the failed publishes (retry for Group
> > registration, which seems to rely on the classic queue too) and queues
> > other IMAP requests in the meantime.
> > -> The IMAP server queue becomes full and the exception `The IMAP server
> > has reached its maximum capacity` is thrown.
> > -> James IMAP becomes a zombie, leading to cascading failures.
> >
> > James needs to be more fault-tolerant in this case.
> >
> > We think making all RabbitMQ queues quorum queues when
> > `quorum.queues.enable=true` would help James be more fault-tolerant in
> > that scenario.
> >
> > We investigated a POC at
> > https://github.com/apache/james-project/pull/2191 and the full quorum
> > queues helped James be more fault-tolerant, as expected.
> >
> > With full quorum queues, James performance is a bit slower but still
> > fine, and that cost is likely needed to make James more reliable.
> >
> > If we use Redis-backed event bus notifications, the performance is
> > better than with the RabbitMQ notification quorum queues.
> >
> > What do you think about making all queues on RabbitMQ quorum queues
> > when the option is enabled? Feedback and review are very welcome.
> >
> > Thanks for reading.
> >
> > Quan
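
For readers of this thread, the notification queue trade-off discussed above (a queue tied to a connection versus a quorum queue that needs a cleanup strategy) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the POC: the class `QuorumQueueSketch`, the holder type `QueueSpec` and the method `notificationQueue` are hypothetical names, and the classic variant is assumed to be exclusive and auto-delete. The queue arguments `x-queue-type` and `x-expires` are standard RabbitMQ arguments. As far as the RabbitMQ documentation describes, quorum queues have to be durable and cannot be exclusive, which is why an expiry is used here as one possible cleanup approach.

```java
import java.util.HashMap;
import java.util.Map;

public class QuorumQueueSketch {

    /** Hypothetical holder for the flags and arguments of one queue declaration. */
    public static final class QueueSpec {
        public final String name;
        public final boolean durable;
        public final boolean exclusive;
        public final boolean autoDelete;
        public final Map<String, Object> arguments;

        QueueSpec(String name, boolean durable, boolean exclusive,
                  boolean autoDelete, Map<String, Object> arguments) {
            this.name = name;
            this.durable = durable;
            this.exclusive = exclusive;
            this.autoDelete = autoDelete;
            this.arguments = arguments;
        }
    }

    /**
     * Builds the declaration for a per-node notification queue.
     * quorumEnabled mirrors the `quorum.queues.enable` option from the thread.
     * expiresMillis is an assumed cleanup delay for unused quorum queues.
     */
    public static QueueSpec notificationQueue(String name, boolean quorumEnabled,
                                              long expiresMillis) {
        Map<String, Object> args = new HashMap<>();
        if (quorumEnabled) {
            // Replicated queue: durable, not exclusive, not auto-delete.
            args.put("x-queue-type", "quorum");
            // The queue no longer disappears with its connection, so ask the
            // broker to drop it once it has been unused for expiresMillis.
            args.put("x-expires", expiresMillis);
            return new QueueSpec(name, true, false, false, args);
        }
        // Assumed classic behaviour: the queue lives and dies with the connection.
        return new QueueSpec(name, false, true, true, args);
    }
}
```

With the RabbitMQ Java client, such a spec would be passed to `Channel.queueDeclare(name, durable, exclusive, autoDelete, arguments)`. Whether an expiry is the right cleanup strategy for James is exactly the open question raised in the thread.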