[ https://issues.apache.org/jira/browse/JAMES-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Tellier reopened JAMES-3955:
-----------------------------------
> James stops consuming sometimes RabbitMQ queue
> ----------------------------------------------
>
> Key: JAMES-3955
> URL: https://issues.apache.org/jira/browse/JAMES-3955
> Project: James Server
> Issue Type: Improvement
> Components: rabbitmq
> Reporter: René Cordier
> Priority: Major
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> We have sometimes had trouble with RabbitMQ in production environments where
> James would stop consuming some queues (like the mail queue). We never really
> understood why, and we would just restart James in that case.
> Recently I had similar issues, but with TaskManagerWorkQueue, and this time
> we managed to reproduce the problem manually. We have a task that we run at
> night and that can take a long time to complete. With some other planned
> tasks scheduled as well, we could observe the following pattern:
> While the heavy task is being executed by James, others are piling up in the
> TaskManagerWorkQueue. They are left unacked by James, meaning it is telling
> RabbitMQ that it will consume them later (as James executes one task at a
> time). However, 30 minutes after the first item in the queue became unacked,
> we could see James stop consuming the queue and all items come back to the
> ready state.
> After looking into the RabbitMQ configuration:
> [https://www.rabbitmq.com/consumers.html#acknowledgement-timeout]
> RabbitMQ will close the channel with a `PRECONDITION_FAILED` channel
> exception when it detects that an item (here the first one left unacked)
> has not been acknowledged within 30 minutes. This matches what we observed.
> From this I guess we can deduce that when we had a similar issue with the
> mail queue, James may have failed to consume a message properly, or failed
> to acknowledge it for some reason, and got the channel closed by RabbitMQ.
> I guess this timeout is there to prevent messages from being stuck if the
> consumer has trouble acking them correctly.
> From there, there are some actions we can take to prevent this:
> * adding error logs when we get the channel closed on such an exception
> * trying to reconnect to the channel when such an exception occurs, at
> least on important queues like the task manager queue, mail queue, event bus
> * potentially auditing as well whether in some cases we do not ack/nack the
> message back
> * giving the possibility to increase the consumer timeout of the above
> queues with the `x-consumer-timeout` queue argument (would require running
> RabbitMQ 3.12 at least)
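
The last action in the list above can be sketched as follows. This is a minimal illustration, not James code: the class name, the helper `withConsumerTimeout`, the queue name in the comment, and the two-hour value are all assumptions made for the example. Only the `x-consumer-timeout` argument name comes from the description, and its value is expressed in milliseconds.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical helper: builds the optional arguments for a queue declaration.
class ConsumerTimeoutArguments {
    // Queue argument named in the issue description; the value is in milliseconds.
    static final String X_CONSUMER_TIMEOUT = "x-consumer-timeout";

    static Map<String, Object> withConsumerTimeout(Duration timeout) {
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("consumer timeout must be positive");
        }
        return Map.of(X_CONSUMER_TIMEOUT, timeout.toMillis());
    }

    public static void main(String[] args) {
        // Two hours is an illustrative value, to be sized to the longest expected task.
        Map<String, Object> arguments = withConsumerTimeout(Duration.ofHours(2));
        System.out.println(arguments); // prints {x-consumer-timeout=7200000}

        // With the RabbitMQ Java client, the map would be passed when declaring the queue:
        // channel.queueDeclare("hypothetical-work-queue", true, false, false, arguments);
    }
}
```

Keep in mind that RabbitMQ refuses to redeclare an existing queue with different arguments, so queues that already exist would need to be handled separately.
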
> For now we can also increase that timeout in rabbitmq.conf to mitigate the
> problem.
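
As a minimal sketch of the rabbitmq.conf workaround: the RabbitMQ documentation linked in the description exposes the acknowledgement timeout as `consumer_timeout`, expressed in milliseconds. The one-hour value below is only an illustrative assumption and should be sized to the longest task expected to run.

```ini
# rabbitmq.conf
# Delivery acknowledgement timeout, in milliseconds (30 minutes = 1800000).
# 3600000 ms = 1 hour; illustrative value, not a recommendation.
consumer_timeout = 3600000
```
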
--
This message was sent by Atlassian Jira
(v8.20.10#820010)