Benoit Tellier created JAMES-3810:
-------------------------------------
Summary: Performance issues with RabbitMQMailQueue browsing
Key: JAMES-3810
URL: https://issues.apache.org/jira/browse/JAMES-3810
Project: James Server
Issue Type: Improvement
Components: Queue, rabbitmq
Affects Versions: 3.7.0
Reporter: Benoit Tellier
Fix For: 3.8.0
h3. What
During incidents, the browse start update process can stall if some old mails are
stuck in the mail queue.
Given enough time, browsing and browse start updates will fail (too much data
loaded at once).
Periodic browse start updates then pile up until they create performance issues.
h3. The fix
- Allow better resilience when browsing large amounts of data.
- Handle Cassandra exceptions correctly upon dequeue (nack the items).
- And finally provide a health check to WARN when the browse start is out of date
(a minimal sketch follows below).
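Below is a minimal sketch of what such a health check could look like. It is not
the actual James HealthCheck API: how the browse start is read is left out, and
the 7-day threshold is an arbitrary placeholder.
{code:java}
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

// Illustrative sketch only: the 7-day threshold and the way the browse start is
// obtained are assumptions, not the actual James implementation.
public class BrowseStartHealthCheck {
    private static final Duration MAX_BROWSE_START_AGE = Duration.ofDays(7);

    // Returns a WARN message when the browse start lags more than a week behind now,
    // which is the symptom of stuck browse start updates described above.
    public Optional<String> check(String queueName, Instant browseStart) {
        if (browseStart.isBefore(Instant.now().minus(MAX_BROWSE_START_AGE))) {
            return Optional.of("Browse start of " + queueName + " is older than "
                + MAX_BROWSE_START_AGE + ": browse start updates are likely stuck");
        }
        return Optional.empty();
    }
}
{code}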
h3. What to do when the browse start is out of date
- Identify the items still present in the queue
- Remove them from the queue
- And let everything catch up.
Alternatively, if the old content can be discarded, an aggressive manual update of
the browse start is an option (see the sketch below).
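A rough sketch of such a manual jump using the DataStax Java driver follows. The
keyspace, table, and column names are assumptions inferred from the discussion
quoted further down (check them against the deployed schema first), and the target
date is simply the value used in that conversation.
{code:java}
import java.time.Instant;

import com.datastax.oss.driver.api.core.CqlSession;

// Rough sketch: the keyspace ("apache_james"), the table ("browsestart") and its
// columns ("queuename", "browsestart") are assumptions; verify them against the
// deployed Cassandra mail queue view before running anything like this.
public class JumpBrowseStart {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                .withKeyspace("apache_james")
                .build()) {
            // Move the browse start of the 'outgoing' queue to a recent date (~1 week ago)
            // and leave contentstart untouched so the cleanup catches up on its own.
            session.execute(
                "UPDATE browsestart SET browsestart = ? WHERE queuename = ?",
                Instant.parse("2022-08-12T06:00:00Z"),
                "outgoing");
        }
    }
}
{code}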
h3. Alternative
Disabling the Cassandra part of the RabbitMQ mail queue is OK for an MDA, and the
RabbitMQ mailQueue is unfit for an MX.
Thus it would sound legitimate to just disable it.
h3. Interesting Gitter discussion
I leave it here as this is a very interesting discussion, explaining causes,
effects, solutions, quick fixes, etc.
{code:java}
@iljaweis
Hello, I'm currently investigating some performance problems around the Mail
Queue in RabbitMQ and Cassandra, especially enqueuedmailsv4. From reading
https://github.com/apache/james-project/blob/master/src/adr/0031-distributed-mail-queue.md
and
https://github.com/apache/james-project/blob/master/src/adr/0032-distributed-mail-queue-cleanup.md
it is my understanding that the content of the tables "contentstart" and
"browsestart" indicate the oldest dates where we should start processing
entries from enqueuedmailsv4. However, the content of both contentstart and
browsestart for all queues is stuck at "2022-07-12 14:00:00" and does not
change. The result is that we're querying lots of "old" data from
enqueuedmailsv4, but our queues (judging from RabbitMQ) appear to be basically
empty all the time. The value of updateBrowseStartPace is 1000, the default.
We're using distributed-pop3-app. Is this behaviour expected, or if not, is
there something I could look at? Thanks.
@iljaweis
Hmm. Looks like James uses LIMIT 5000, but we're already at 17435 rows for that
timerangestart. Looks like we need to increase bucketCount, but is there a way
to deal with these old rows?
@chibenwa
Hello.
Emails between contentStart and browseStart are eligible for cleanup (meaning
deletion from the objectStore, then from the Cassandra mailQueue view). BrowseStart
matches (modulo the duration of slices) the oldest mail still in the queue.
ContentStart matches the first email whose content is not deleted. So
contentStart <= browseStart.
> Looks like James uses LIMIT 5000
I bet you are looking at logs on the Cassandra server, most likely upon massive
query rates. That limit does not correspond to any code on the James side (cf.
EnqueuedMailsDAO), but it is nothing to worry about as this is the default page
size (the driver requests large reads in pages of 5000).
> I'm currently investigating some performance problems around the Mail Queue
> in RabbitMQ und Cassandra, especially enqueuedmailsv4.
This is rather insufficient as a description for me to understand what happens,
but (knowing the design of the underlying solution) let me take a bet:
You are having performance issues because massive load queries are done by the
RabbitMQ mail queue. This massive load is due to a full read of the mail queue
content every time the browse start is read (it reads everything since 12/07, so
slice * bucket, and returns loads of data).
The root cause is the browseStart being "stuck" on the 12/07. This could be
because an email from the 12/07 had not been properly processed.
Things I would check include:
> Dead letter queues in the RabbitMQ management API. If they are not empty, audit
> them, and consider re-routing their content to resume delivery.
(Deserialization failures, or a missing object in S3, can be the root cause; thus
messages that will always fail to get dequeued are discarded into a dead letter
queue to prevent an infinite loop while not losing data.)
> Check the content of the mail queue by browsing it. If this is not doable,
> audit the content of the oldest slice in enqueuedMailV4 and check it against
> deletedMailsV2. We are looking for content within enqueuedMailV4 that is not
> being marked as deleted.
> The value of updateBrowseStartPace is 1000
During the audit phase, until you know why "browseStart" is stuck, I recommend
increasing it to an unreasonably high value to effectively disable browseStart
updates. This will prevent the high query rates that are hurting you from happening.
You could consider a higher value based on your traffic too. Are you delivering
more than 1000 emails per slicing period?
> Looks like we need to increase bucketCount
Won't help. Unless you scale out your Cassandra cluster and want to spread the
write load of the mail queue.
Here the way to go is to audit why browse start updates are stuck.
> Is this behaviour expected, or if not, is there something I could look at?
> Thanks.
As explained before, that is a "not surprising" edge case. I'm interested to
confirm the diagnosis, and also to find the root cause.
Regards
Benoit
@chibenwa
Had a look. By default the browse start update does 256 slice reads, which is
likely too much. We could likely decrease this number to make it less aggressive.
@chibenwa
Also we could think of a healthcheck warning when the browseStart is older than
say 1 week.
@iljaweis
Hello, yes, you guessed exactly right what the situation is. Sorry for being
slightly unclear. Browsing the queue (using the web admin endpoint) timed out,
so I checked for enqueueids from enqueuedmailsv4 that are not in deletedmailsv2
and found a lot of them. That day we actually had a problem with a backend, and
a lot of mails suffered exceptions during queue handling. The mails were moved
to /var/mail/error, but also a bounce was created. The mails are still sitting
in enqueuedmailsv4 with state=error.
@iljaweis
There isn't anything in any of the dead letter queues
@iljaweis
Right now I'm looking for any exceptions that point to one of the possible
failures you suggested.
@chibenwa
Ok, so... I would manually "jump" the browse start past them (say to ~1 week
ago).
Don't touch the contentStart.
Then the content cleanup will kick in and clean up dangling data.
@chibenwa
> Right now I'm looking for any exceptions that point to one of the possible
> failures you suggested.
Tell me more about this outage...
Like: a massive disturbance in the force on the Cassandra cluster side?
```
Flux<? extends MailQueue.MailQueueItem> deQueue() {
    return Flux.using(receiverProvider::createReceiver,
        receiver -> receiver.consumeManualAck(this.name.toWorkQueueName().asString(),
            new ConsumeOptions().qos(this.prefetchCount.asInt())),
        Receiver::close)
        .filter(getResponse -> getResponse.getBody() != null)
        .flatMapSequential(this::loadItem)
        .concatMap(this::filterIfDeleted);
}

private Mono<RabbitMQMailQueueItem> filterIfDeleted(RabbitMQMailQueueItem item) {
    return mailQueueView.isPresent(item.getEnqueueId())
        .handle((isPresent, sink) -> {
            if (isPresent) {
                sink.next(item);
            } else {
                item.done(true);
                sink.complete();
            }
        });
}
```
That would cause the "filterIfDeleted" method to fail?
And if this fails, it does not look like the error handling is in place to nack
the corresponding message correctly. That message will thus sit unacknowledged,
and once the unacknowledged count exceeds the prefetch of 20, no subsequent
deliveries are made until James is restarted....
Sounds like a known story.
That does not explain "emails being dropped" though.
@iljaweis
Would it be possible to just "skip" the bad timeslice by manually adjusting the
browsestart/contentstart? We actually don't care about any of those old mails
any more.
@iljaweis
Ah, sorry, didn't see you already answered.
@iljaweis
The outage was a problem with our S3 backend. During exactly that period we got a
large number of "Exception calling LocalDelivery:
org.apache.james.queue.api.MailQueue$MailQueueException: Timeout enqueueing"
errors. I checked some of the mails from enqueuedmailsv4 not in deletedmailsv2 and
they ran into exactly that.
@iljaweis
I will skip the nasty day by adjusting the browsestart table.
@iljaweis
About updateBrowseStartPace: Slice is still the default of 1 hour, but we're
processing more than 1000 mails during that period.
@chibenwa
> I will skip the nasty day by adjusting the browsestart table.
Update it to something reasonably recent or you won't be able to update the browse
start automatically: 1 week to one month.
> About updateBrowseStartPace: Slice is still the default of 1 hour, but we're
> processing more than 1000 mails during that period.
That should be ok: you just wasted resources trying to update the browse start
too many times during that period. No other harmful consequences.
@chibenwa
WIP PR apache/james-project#1142
@iljaweis
I've set the values in browsestart to '2022-08-12 06:00:00.000000+0000' and let
it run over night.
@iljaweis
Spool there is now at '2022-08-23 05:00:00' but outgoing still at '2022-08-12
06:00:00'. No progress for contentstart
@iljaweis
Can't see any obvious exception. I could try and add some glowroot
instrumentation, would org.apache.james.queue.rabbitmq.view.cassandra be the
correct package?
@iljaweis
Ah, now both browsestart and contentstart for "spool" have caught up, but no
movement for "outgoing". Perhaps it's caused by
https://issues.apache.org/jira/browse/JAMES-3791, because we have ongoing
problems with syntactically broken recipient addresses. We'll deploy a fix for
that in the next few days.
@chibenwa
The contentStart catch-up on browseStart has a 1/updatePace chance of happening on
every mail.
- It is random.
- Spool likely has more traffic than outgoing.
I'd just take a coffee and relax: if it worked for your spool, when triggered it
will work for your outgoing queue too.
@iljaweis
Will do that coffee and relax thing, and wait for our bugfix deployment anyway.
Thank you for your help! :-)
@iljaweis
That coffee thing worked. We're up to date with contentstart.
@chibenwa
I'm opening a JIRA to preserve this discussion and contextualize the above
proposed PR.
{code}