[ https://issues.apache.org/jira/browse/JAMES-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851662#comment-17851662 ]
Benoit Tellier commented on JAMES-3955:
---------------------------------------

Looking at the code, I have the impression that the timeout is only applied
to the queue (great!)

But the task is never wrapped in a Reactor timeout that fires before the
consumer timeout.

This means that if I submit a 2-day-long task, it would effectively KILL the
consumer => bug.

A nicer behaviour would be:

 * to put a 10% margin onto the RabbitMQ consumer timeout
 * to specify .timeout(consumerTimeout) somewhere in the reactive chain
 * to catch the Reactor timeout error and explicitly CANCEL the task.

Thoughts?

> James stops consuming sometimes RabbitMQ queue
> ----------------------------------------------
>
>                 Key: JAMES-3955
>                 URL: https://issues.apache.org/jira/browse/JAMES-3955
>             Project: James Server
>          Issue Type: Improvement
>          Components: rabbitmq
>            Reporter: René Cordier
>            Priority: Major
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> We sometimes had trouble with RabbitMQ in some production environments, where
> James would stop consuming some queues (like the mail queue). We never
> really understood why, and we would just restart James in this case.
> Recently I had similar issues, but with TaskManagerWorkQueue. Except that
> this time we managed to reproduce the problem manually. We have a task we
> run at night that can take a long time. With some other planned tasks
> scheduled as well, we could observe the following pattern:
> While the heavy task is being executed by James, the others are piling up in
> the TaskManagerWorkQueue. They are delivered but left unacked by James,
> meaning it is telling RabbitMQ that it will consume them later (as James
> executes one task at a time). Except that 30 minutes after the first unacked
> item appeared in the queue, we could see James stop consuming the queue, and
> all items coming back to the ready state.
> After looking around the RabbitMQ configuration:
> [https://www.rabbitmq.com/consumers.html#acknowledgement-timeout]
> RabbitMQ will close the channel with a `PRECONDITION_FAILED` channel
> exception when it detects that an item (here, the first unacked one) has
> not been acknowledged within 30 minutes. This matches what we observed.
> From this I guess we could deduce that when we had a similar issue with the
> mail queue, maybe James failed to consume a message properly or failed to
> acknowledge it for some reason, and got the channel closed by RabbitMQ.
> I guess this timeout is there to prevent messages from being stuck when the
> consumer has trouble acking them correctly.
> From there, there are some actions we can take to prevent this:
> * adding error logs when we get the channel closed on such an exception
> * trying to reconnect to the channel when such an exception occurs
> * on at least important queues like task manager queue, mail queue, event bus
> * potentially try to audit as well if in some cases we do not ack/nack the
> message back
> * giving the possibility to increase the consumer timeout of the above
> queues with the `x-consumer-timeout` queue argument (would require running
> RabbitMQ 3.12 at least)
> For now we can also increase that timeout in rabbitmq.conf to minimize the
> problems.
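
To make the behaviour proposed in the comment concrete, here is a minimal sketch of the three steps: take a 10% margin on the consumer timeout, bound the task execution by it, and cancel the task when the bound is hit. It is only an illustration under assumptions. The comment suggests doing this with a Reactor `.timeout(...)` in the reactive chain; to stay dependency-free, this sketch swaps in plain `java.util.concurrent` instead. The class `TaskTimeoutSketch`, the `Outcome` enum, and the methods `withMargin` and `runWithTimeout` are hypothetical names and do not come from the James code base.

```java
import java.time.Duration;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names exist in James.
final class TaskTimeoutSketch {

    enum Outcome { COMPLETED, CANCELLED, FAILED }

    // Step 1: keep a 10% margin below the RabbitMQ consumer timeout,
    // so the task is stopped before the broker closes the channel.
    static Duration withMargin(Duration consumerTimeout) {
        return consumerTimeout.multipliedBy(9).dividedBy(10);
    }

    // Steps 2 and 3: bound the execution time, and cancel on timeout.
    static Outcome runWithTimeout(Runnable task, Duration consumerTimeout) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> running = executor.submit(task);
        try {
            running.get(withMargin(consumerTimeout).toMillis(), TimeUnit.MILLISECONDS);
            return Outcome.COMPLETED;
        } catch (TimeoutException e) {
            // Explicitly cancel the task (interrupts the worker thread).
            running.cancel(true);
            return Outcome.CANCELLED;
        } catch (ExecutionException e) {
            // The task itself threw an exception.
            return Outcome.FAILED;
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting.
            running.cancel(true);
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With the 30 minute consumer timeout described in the issue, `withMargin` yields 27 minutes, which would leave the consumer some time to ack or nack the delivery before RabbitMQ closes the channel. Note that `cancel(true)` only interrupts the worker thread; a task that ignores interruption would keep running, so a real implementation would presumably need the task itself to cooperate with cancellation.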