[
https://issues.apache.org/jira/browse/JAMES-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514516#comment-17514516
]
Andreas Joseph Krogh commented on JAMES-3740:
---------------------------------------------
Note that UIDs need not start at 0 for any given mailbox. We use "unique IDs"
for all emails in a Mailbox, but those are Long because the IDs are not only
unique within a given mailbox (Folder, as we call them) but unique across all
"entities in the system", hence they quickly become quite large. So UIDs
above 2 billion are not at all unlikely in our scenario.
For this issue we use the `org.mapdb` library and switch to disk-backed
collections when the number of messages in a Mailbox exceeds 5000.
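The "spill to disk above a threshold" pattern described above could be sketched as follows. This is a hypothetical illustration only: the actual implementation uses org.mapdb, but here a plain temp file stands in for mapdb's disk-backed collections so the sketch stays stdlib-only, and all names are invented.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;

// Hypothetical sketch: keep UIDs on the heap up to a threshold, then move
// them to disk-backed storage (the real code uses org.mapdb instead of a
// RandomAccessFile).
public class SpillingUidList {
    private static final int THRESHOLD = 5000; // in-memory limit from the comment

    private final ArrayList<Long> heap = new ArrayList<>();
    private RandomAccessFile file; // non-null once we have spilled to disk

    public void add(long uid) throws IOException {
        if (file == null && heap.size() >= THRESHOLD) {
            spill();
        }
        if (file == null) {
            heap.add(uid);
        } else {
            file.seek(file.length()); // append at the end of the file
            file.writeLong(uid);
        }
    }

    public long get(int index) throws IOException {
        if (file == null) {
            return heap.get(index);
        }
        file.seek((long) index * Long.BYTES);
        return file.readLong();
    }

    public long size() throws IOException {
        return file == null ? heap.size() : file.length() / Long.BYTES;
    }

    private void spill() throws IOException {
        Path tmp = Files.createTempFile("uids", ".bin");
        tmp.toFile().deleteOnExit();
        file = new RandomAccessFile(tmp.toFile(), "rw");
        for (long uid : heap) {
            file.writeLong(uid);
        }
        heap.clear(); // heap usage drops back to near zero
    }
}
```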
> IMAP UID <-> MSN mapping occupies too much memory
> -------------------------------------------------
>
> Key: JAMES-3740
> URL: https://issues.apache.org/jira/browse/JAMES-3740
> Project: James Server
> Issue Type: Improvement
> Components: IMAPServer
> Affects Versions: 3.7.0
> Reporter: Benoit Tellier
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h3. What is UID <-> MSN mapping?
> In IMAP (RFC 3501) there are two ways to address a message:
> - By its UID (Unique ID), which is unique (until UIDVALIDITY changes...)
> - By its MSN (Message Sequence Number), which is the (mutable) position of a
> message in the mailbox.
> We then need to:
> - Given a UID, return its MSN, which is for instance compulsory upon EXPUNGE
> notifications when QRESYNC is not enabled.
> - Given an MSN-based request, convert it back to a UID (rare).
> We store the list of UIDs, sorted, in RAM and perform binary searches to
> resolve those.
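A minimal stdlib-only sketch of that resolution scheme (an illustration of the idea, not James' actual UidMsnConverter):

```java
import java.util.Arrays;

// Sketch: UIDs kept sorted in memory; MSN is the 1-based position of a UID
// in that sorted list, resolved by binary search.
public class UidMsnMapping {
    private final long[] sortedUids;

    public UidMsnMapping(long[] sortedUids) {
        this.sortedUids = sortedUids;
    }

    // UID -> MSN: binary search, then convert the 0-based index to 1-based MSN.
    public int getMsn(long uid) {
        int idx = Arrays.binarySearch(sortedUids, uid);
        return idx < 0 ? -1 : idx + 1; // -1: UID not present in the mailbox
    }

    // MSN -> UID: the reverse direction is a simple array access.
    public long getUid(int msn) {
        return sortedUids[msn - 1];
    }
}
```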
> h3. What is the impact on heap?
> Each uid is wrapped in a MessageUid object. This object wrapping comes with
> an overhead of at least 12 bytes in addition to the 8 bytes payload (long).
> Quick benchmarks show it's actually worse: 10 million uids took up to
> 275 MB.
> {code:java}
> @Test
> void measureHeapUsage() throws InterruptedException {
>     int count = 10000000;
>     testee.addAll(IntStream.range(0, count)
>         .mapToObj(i -> MessageUid.of(i + 1))
>         .collect(Collectors.toList()));
>     Thread.sleep(1000);
>     System.out.println("GCing");
>     System.gc();
>     Thread.sleep(1000);
>
>     System.out.println(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
> }
> {code}
> Now, taking a classical production deployment, I observe:
> - Some users have up to 2.5 million messages in their INBOX
> - An average of roughly 100,000 messages per user
> So for a small-scale deployment, we are already "consuming" ~300 MB of memory
> just for the UID <-> MSN mapping.
> Scaling to 1,000 users on a single James instance, we clearly see that heap
> consumption will start being a problem (~3GB), without even speaking of the
> target of 10,000 users per James instance that I have in mind.
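The back-of-the-envelope arithmetic behind those figures, taking the ~27.5 bytes per boxed uid implied by the benchmark above (275 MB / 10 million):

```java
// Rough heap estimate for the boxed UID <-> MSN mapping, derived from the
// 275 MB / 10 million uids measurement quoted above.
public class HeapEstimate {
    static final double BYTES_PER_BOXED_UID = 275_000_000.0 / 10_000_000; // ~27.5 B

    // users * messages/user * bytes/uid, expressed in GB
    static double estimateGb(int users, int messagesPerUser) {
        return users * (long) messagesPerUser * BYTES_PER_BOXED_UID / 1e9;
    }
}
```

With 1,000 users at 100,000 messages each, this yields ~2.75 GB, matching the ~3GB figure in the issue.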
> It's worth mentioning that IMAP being stateful, and the UID <-> MSN mapping
> being attached to a selected mailbox, such a mapping is long-lived:
> - Multiple small objects need to be copied individually by the GC, putting
> pressure on young-generation collections
> - Those long-lived objects will eventually be promoted to the old generation;
> the more there are, the longer the resulting stop-the-world GC pauses will be.
> h3. Temporary fix?
> We can get rid of the object boxing in UidMsnConverter by using primitive-type
> collections, for instance those provided by the fastutil project.
> The same benchmark was down to 84MB.
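What a primitive collection such as fastutil's LongArrayList does internally can be sketched with the stdlib alone; the point is that one backing long[] replaces millions of boxed Long objects:

```java
import java.util.Arrays;

// Sketch of a primitive long list: a single growable long[] backing array,
// so the GC manages one object instead of one per element.
public class PrimitiveLongList {
    private long[] elements = new long[16];
    private int size;

    public void add(long value) {
        if (size == elements.length) {
            // Grow the single backing array; no per-element allocation.
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    public long get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return elements[index];
    }

    public int size() {
        return size;
    }
}
```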
> Also, we could make things more compact by using an INT representation of
> UIDs. (Those are in most cases below 2 billion; to get above that, more than
> 2 billion emails would need to transit through one's mailbox, which is highly
> unlikely.) A fallback to "long" storage can be set up if a UID above 2 billion
> is observed.
> With such a compact int storage we are down to 46MB.
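The int-with-long-fallback idea could be sketched as follows; names and structure are illustrative assumptions, not James' actual implementation:

```java
import java.util.Arrays;

// Sketch: store UIDs as ints (4 bytes each) while they all fit, and promote
// the whole store to long[] the first time a UID above Integer.MAX_VALUE
// shows up.
public class CompactUidStore {
    private int[] compact = new int[16];  // primary, compact representation
    private long[] wide;                  // non-null once we have fallen back
    private int size;

    public void add(long uid) {
        if (wide == null && uid > Integer.MAX_VALUE) {
            fallbackToLongs();
        }
        ensureCapacity();
        if (wide != null) {
            wide[size++] = uid;
        } else {
            compact[size++] = (int) uid;
        }
    }

    public long get(int index) {
        return wide != null ? wide[index] : compact[index];
    }

    public int size() {
        return size;
    }

    private void fallbackToLongs() {
        wide = new long[Math.max(16, compact.length)];
        for (int i = 0; i < size; i++) {
            wide[i] = compact[i]; // widen all existing entries
        }
        compact = null;
    }

    private void ensureCapacity() {
        if (wide != null && size == wide.length) {
            wide = Arrays.copyOf(wide, wide.length * 2);
        } else if (wide == null && size == compact.length) {
            compact = Arrays.copyOf(compact, compact.length * 2);
        }
    }
}
```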
> So taking the previously mentioned numbers, we could expect a 1,000-user
> deployment to require ~400 MB and a larger-scale 10,000-user deployment on
> a single James to consume up to 4GB. Not that enjoyable, but definitely more
> manageable.
> Please note that primitive collections are also more GC friendly, as their
> elements are managed together, as a single object (the backing array).
> h3. What other mail servers do
> I found references to Dovecot, which uses a similar algorithm to ours:
> a binary search on a list of uids. The noticeable difference is that this
> list of UIDs is held on disk, not in memory as we do.
> References:
> https://doc.dovecot.org/developer_manual/design/indexes/mail_index_api/?highlight=time
> Of course, such a solution would be attractive... We could imagine keeping
> the last 1,000 uids in memory, which would most of the time be the ones used
> for MSN resolution, locating the rest on disk and using it only when needed,
> thus dramatically reducing heap pressure.
> Making UidMsnConverter an interface with a backing factory would enable
> different implementations to co-exist and allow some experimentation ;-)
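The suggested interface-plus-factory split could look like the following sketch; apart from the UidMsnConverter name, everything here is invented for illustration, and only the in-memory strategy is shown:

```java
import java.util.Arrays;

// Hypothetical sketch of a UidMsnConverter interface with a factory, so
// in-memory and disk-backed implementations could co-exist and be swapped
// for experimentation.
public interface UidMsnConverter {
    int getMsn(long uid);  // -1 when the UID is unknown
    long getUid(int msn);

    // Factory choosing a strategy; a real one might consult configuration.
    static UidMsnConverter of(long[] sortedUids, boolean preferDiskBacked) {
        // Only the in-memory strategy is sketched; a disk-backed one would
        // plug in here.
        return new InMemoryConverter(sortedUids);
    }

    class InMemoryConverter implements UidMsnConverter {
        private final long[] sortedUids;

        InMemoryConverter(long[] sortedUids) {
            this.sortedUids = sortedUids;
        }

        @Override
        public int getMsn(long uid) {
            int idx = Arrays.binarySearch(sortedUids, uid);
            return idx < 0 ? -1 : idx + 1;
        }

        @Override
        public long getUid(int msn) {
            return sortedUids[msn - 1];
        }
    }
}
```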
--
This message was sent by Atlassian Jira
(v8.20.1#820001)