[ 
https://issues.apache.org/jira/browse/JAMES-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoit Tellier closed JAMES-3955.
---------------------------------
    Resolution: Fixed

Related PRs merged.

IMO we might still need a solution for very long-running tasks, which remain a pain.

> James stops consuming sometimes RabbitMQ queue
> ----------------------------------------------
>
>                 Key: JAMES-3955
>                 URL: https://issues.apache.org/jira/browse/JAMES-3955
>             Project: James Server
>          Issue Type: Improvement
>          Components: rabbitmq
>            Reporter: René Cordier
>            Priority: Major
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> We have sometimes had trouble with RabbitMQ in production environments: 
> James would stop consuming some queues (like the mail queue), we never 
> really understood why, and we would just restart James.
> Recently I had a similar issue with TaskManagerWorkQueue, except that this 
> time we managed to reproduce the problem manually. We have a task that runs 
> at night and can take a long time. With some other planned tasks scheduled 
> as well, we could observe the following pattern:
> While the heavy task is being executed by James, the others pile up in the 
> TaskManagerWorkQueue. They are left unacked by James, meaning it is telling 
> RabbitMQ that it will consume them later (as James executes one task at a 
> time). However, 30 minutes after the first item in the queue became 
> unacked, we could see James stop consuming the queue, and all items go 
> back to the ready state.
> After looking at the RabbitMQ documentation: 
> [https://www.rabbitmq.com/consumers.html#acknowledgement-timeout]
> RabbitMQ closes the channel with a `PRECONDITION_FAILED` channel 
> exception when it detects that a delivery (here, the first unacked item) has 
> not been acknowledged within 30 minutes. This matches what we observed.
> From this I guess we can deduce that when we had a similar issue with the 
> mail queue, James may have failed to consume a message properly, or failed 
> to acknowledge it for some reason, and had the channel closed by RabbitMQ. 
> I guess this timeout is there to prevent messages from getting stuck when a 
> consumer fails to ack them correctly.
> From there, there are some actions we can take to prevent this:
>  * adding error logs when the channel gets closed on such an exception
>  * trying to reconnect when such an exception occurs
>  * doing so at least on important queues like the task manager queue, mail 
> queue, event bus
>  * potentially auditing as well whether in some cases we do not ack/nack the 
> message back
>  * making it possible to increase the consumer timeout of the above 
> queues with the `x-consumer-timeout` queue argument (would require running 
> RabbitMQ 3.12 at least)
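
The first two actions above could be sketched roughly as follows. This is only an illustration, not James code: the class, enum and method names are hypothetical, and the check on the reply text is an assumption about how the broker words the acknowledgement timeout. 406 is the AMQP 0-9-1 reply code for PRECONDITION_FAILED.

```java
import java.util.Locale;

/** Illustrative helper; class and method names are hypothetical, not James code. */
final class ChannelCloseClassifier {
    /** AMQP 0-9-1 reply code for precondition-failed. */
    static final int PRECONDITION_FAILED = 406;

    enum Action { LOG_AND_RESUBSCRIBE, LOG_ONLY }

    /**
     * Logs a broker-initiated channel close and decides whether the consumer
     * should be re-registered. replyCode and replyText are the values carried
     * by the channel close frame.
     */
    static Action onChannelClosed(String queue, int replyCode, String replyText) {
        String text = replyText == null ? "" : replyText.toLowerCase(Locale.ROOT);
        // Assumption: the reply text mentions a timeout when the delivery
        // acknowledgement timeout fired; the exact wording may vary by version.
        boolean ackTimeout = replyCode == PRECONDITION_FAILED && text.contains("timed out");
        // Error log so that the close no longer goes unnoticed.
        System.err.printf("Channel consuming %s closed by broker: %d %s%n",
            queue, replyCode, replyText);
        return ackTimeout ? Action.LOG_AND_RESUBSCRIBE : Action.LOG_ONLY;
    }
}
```

With the plain RabbitMQ Java client, the two values could presumably be read from the `AMQP.Channel.Close` method returned by `ShutdownSignalException.getReason()` inside a shutdown listener. How this would be wired into James' reactive consumers is left open here.
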
> For now we can also increase that timeout in rabbitmq.conf to minimize the 
> problems.
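
To illustrate the per-queue option, here is a minimal sketch of building the queue arguments. The helper class and method names are hypothetical; only the `x-consumer-timeout` argument name comes from the issue, and the value is assumed to be expressed in milliseconds, like the broker-wide `consumer_timeout` key in rabbitmq.conf.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Illustrative helper; class and method names are hypothetical, not James code. */
final class ConsumerTimeoutArguments {
    /** Builds queue arguments carrying a per-queue consumer timeout in milliseconds. */
    static Map<String, Object> withConsumerTimeout(Duration timeout) {
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("consumer timeout must be positive");
        }
        Map<String, Object> arguments = new HashMap<>();
        // Argument name taken from the issue; per the issue it needs RabbitMQ 3.12+.
        arguments.put("x-consumer-timeout", timeout.toMillis());
        return arguments;
    }
}
```

With the RabbitMQ Java client, such a map would be passed as the last parameter of `channel.queueDeclare(queue, durable, exclusive, autoDelete, arguments)`. Note that redeclaring an already existing queue with different arguments is normally rejected by the broker, so existing queues would likely need some migration step.
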



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org
