René Cordier created JAMES-3955:
-----------------------------------
Summary: James stops consuming sometimes RabbitMQ queue
Key: JAMES-3955
URL: https://issues.apache.org/jira/browse/JAMES-3955
Project: James Server
Issue Type: Improvement
Components: rabbitmq
Reporter: René Cordier
We have sometimes had trouble with RabbitMQ in some production environments,
where James would stop consuming some queues (like the mail queue). We never
really understood why, and we would just restart James when it happened.
Recently I had a similar issue, but with the TaskManagerWorkQueue, and this
time we managed to reproduce the problem manually. We have a task that runs at
night and can take a long time to complete. With some other tasks planned
after it, we could observe the following pattern:
While the heavy task is being executed by James, the others are piling up in
the TaskManagerWorkQueue. They are left unacked by James, meaning it is telling
RabbitMQ that it will consume them later (as James executes one task at a
time). However, 30 minutes after the first item in the queue became unacked,
we could see James stop consuming the queue, and all items went back to the
ready state.
After looking into the RabbitMQ documentation:
[https://www.rabbitmq.com/consumers.html#acknowledgement-timeout]
RabbitMQ closes the channel with a `PRECONDITION_FAILED` channel exception
when it detects that a delivery (here, the first unacked item) has not been
acknowledged within 30 minutes. This matches what we observed.
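To make the mechanism concrete, here is a minimal Java sketch of the rule described above. It is a toy model only, not RabbitMQ or James code: the class `AckTimeoutModel` and its methods are hypothetical names, and the real broker evaluates the timeout periodically, so the actual closure can happen somewhat after the deadline.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model (hypothetical) of a consumer channel subject to the acknowledgement timeout. */
final class AckTimeoutModel {
    // 30 minutes is the timeout observed in this report.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(30);

    private final Duration timeout;
    // Delivery times of messages handed to the consumer but not acked yet, oldest first.
    private final Deque<Instant> unacked = new ArrayDeque<>();

    AckTimeoutModel(Duration timeout) {
        this.timeout = timeout;
    }

    /** Records a delivery that the consumer has received but not acknowledged. */
    void deliver(Instant at) {
        unacked.addLast(at);
    }

    /** Acknowledges the oldest outstanding delivery, if any. */
    void ackOldest() {
        unacked.pollFirst();
    }

    /** True when the oldest unacked delivery has been outstanding longer than the timeout. */
    boolean wouldCloseChannel(Instant now) {
        Instant oldest = unacked.peekFirst();
        return oldest != null && Duration.between(oldest, now).compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2023-01-01T00:00:00Z");
        AckTimeoutModel channel = new AckTimeoutModel(DEFAULT_TIMEOUT);
        channel.deliver(start);                 // first task left unacked
        channel.deliver(start.plusSeconds(60)); // another task piling up behind it
        // Before the deadline the channel stays open, after it the channel would be closed.
        System.out.println(channel.wouldCloseChannel(start.plus(Duration.ofMinutes(29))));
        System.out.println(channel.wouldCloseChannel(start.plus(Duration.ofMinutes(31))));
    }
}
```

Only the oldest unacked delivery matters in this model: acknowledging it moves the deadline to the next outstanding delivery.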
From this I guess we can deduce that when we had a similar issue with the
mail queue, James may have failed to consume a message properly, or failed to
acknowledge it for some reason, and got the channel closed by RabbitMQ.
From there, there are some actions we can take to prevent this:
* add error logs when the channel gets closed on such an exception
* try to reconnect to the channel when such an exception occurs, at least on
important queues like the task manager queue, mail queue and event bus
* potentially audit as well whether in some cases we do not ack/nack the
message back
* make it possible to increase the consumer timeout of the above queues
with the `x-consumer-timeout` queue argument (would require running RabbitMQ
3.12 at least)
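Two of the actions above (logging plus reconnecting, and the `x-consumer-timeout` argument) could look roughly like the following Java sketch. This is not how James does it today: `ConsumerRecoverySketch`, `queueArguments` and `resubscribeWithBackoff` are hypothetical names, the subscription itself is abstracted as a `Callable`, and `System.err` stands in for a real logger. The only RabbitMQ-specific detail is the `x-consumer-timeout` argument name, whose value is expressed in milliseconds.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical helpers illustrating the proposed mitigations. */
final class ConsumerRecoverySketch {

    /** Builds queue arguments carrying a per-queue consumer timeout, in milliseconds. */
    static Map<String, Object> queueArguments(Duration consumerTimeout) {
        Map<String, Object> arguments = new HashMap<>();
        arguments.put("x-consumer-timeout", consumerTimeout.toMillis());
        return arguments;
    }

    /**
     * Retries a (re)subscription action with exponential backoff, logging every failure.
     * The action is assumed to open a channel and register the consumer again.
     */
    static <T> T resubscribeWithBackoff(Callable<T> subscribe, int maxAttempts, Duration initialDelay)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Duration delay = initialDelay;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return subscribe.call();
            } catch (Exception e) {
                last = e;
                // Stand-in for an error log: today the channel closure goes unnoticed.
                System.err.println("Consumer subscription failed (attempt " + attempt + "/"
                        + maxAttempts + "): " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(delay.toMillis());
                    delay = delay.multipliedBy(2); // double the wait before the next attempt
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queueArguments(Duration.ofHours(2)));
        int[] calls = {0};
        // Simulated subscription that fails twice before succeeding.
        String result = resubscribeWithBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("channel closed");
            }
            return "subscribed";
        }, 5, Duration.ofMillis(10));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With the RabbitMQ Java client, such a map would be passed as the `arguments` parameter of `Channel.queueDeclare`. Note that redeclaring an already existing queue with different arguments is rejected by the broker, so existing queues would need a migration path.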
For now we can also increase that timeout in rabbitmq.conf to mitigate the
problem.
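As an illustration, the broker-wide setting is `consumer_timeout`, expressed in milliseconds. The one-hour value below is only an example, not a recommendation; it should be sized to the longest task expected to run.

```ini
# rabbitmq.conf
# Delivery acknowledgement timeout, in milliseconds (example value: 1 hour)
consumer_timeout = 3600000
```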
--
This message was sent by Atlassian Jira
(v8.20.10#820010)