[
https://issues.apache.org/jira/browse/JAMES-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514516#comment-17514516
]
Andreas Joseph Krogh commented on JAMES-3740:
---------------------------------------------
Note that UIDs need not start at 0 for any given mailbox. We use "unique IDs"
for all emails in a Mailbox, but those are Long because the IDs are not only
unique within a given mailbox (Folder, as we call them) but unique across all
"entities in the system", hence they quickly become quite large. So UIDs
above 2 billion are not at all unlikely in our scenario.
For this issue we use the `org.mapdb` library and switch to disk-backed
collections when the number of messages in a Mailbox exceeds 5000.
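The "spill to disk above a threshold" pattern described above could be sketched as follows. This is a hypothetical illustration only: the actual implementation uses org.mapdb, but here a plain temp file stands in for mapdb's disk-backed collections so the sketch stays stdlib-only, and all names are invented.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;

// Hypothetical sketch: keep UIDs on the heap up to a threshold, then move
// them to disk-backed storage (the real code uses org.mapdb instead of a
// RandomAccessFile).
public class SpillingUidList {
    private static final int THRESHOLD = 5000; // in-memory limit from the comment

    private final ArrayList<Long> heap = new ArrayList<>();
    private RandomAccessFile file; // non-null once we have spilled to disk

    public void add(long uid) throws IOException {
        if (file == null && heap.size() >= THRESHOLD) {
            spill();
        }
        if (file == null) {
            heap.add(uid);
        } else {
            file.seek(file.length()); // append at the end of the file
            file.writeLong(uid);
        }
    }

    public long get(int index) throws IOException {
        if (file == null) {
            return heap.get(index);
        }
        file.seek((long) index * Long.BYTES);
        return file.readLong();
    }

    public long size() throws IOException {
        return file == null ? heap.size() : file.length() / Long.BYTES;
    }

    private void spill() throws IOException {
        Path tmp = Files.createTempFile("uids", ".bin");
        tmp.toFile().deleteOnExit();
        file = new RandomAccessFile(tmp.toFile(), "rw");
        for (long uid : heap) {
            file.writeLong(uid);
        }
        heap.clear(); // heap usage drops back to near zero
    }
}
```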
> IMAP UID <-> MSN mapping occupies too much memory
> -------------------------------------------------
>
> Key: JAMES-3740
> URL: https://issues.apache.org/jira/browse/JAMES-3740
> Project: James Server
> Issue Type: Improvement
> Components: IMAPServer
> Affects Versions: 3.7.0
> Reporter: Benoit Tellier
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h3. What is UID <-> MSN mapping?
> In IMAP (RFC 3501) there are two ways to address a message:
> - By its UID (Unique ID), which is unique (until UIDVALIDITY changes...)
> - By its MSN (Message Sequence Number), which is the (mutable) position of a
> message in the mailbox.
> We then need to:
> - Given a UID, return its MSN, which is for instance compulsory upon EXPUNGE
> notifications when QRESYNC is not enabled.
> - Given an MSN-based request, convert it back to a UID (rare).
> We store the list of UIDs, sorted, in RAM and perform binary searches to
> resolve those.
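A minimal stdlib-only sketch of that resolution scheme (an illustration of the idea, not James' actual UidMsnConverter):

```java
import java.util.Arrays;

// Sketch: UIDs kept sorted in memory; MSN is the 1-based position of a UID
// in that sorted list, resolved by binary search.
public class UidMsnMapping {
    private final long[] sortedUids;

    public UidMsnMapping(long[] sortedUids) {
        this.sortedUids = sortedUids;
    }

    // UID -> MSN: binary search, then convert the 0-based index to 1-based MSN.
    public int getMsn(long uid) {
        int idx = Arrays.binarySearch(sortedUids, uid);
        return idx < 0 ? -1 : idx + 1; // -1: UID not present in the mailbox
    }

    // MSN -> UID: the reverse direction is a simple array access.
    public long getUid(int msn) {
        return sortedUids[msn - 1];
    }
}
```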
> h3. What is the impact on heap?
> Each uid is wrapped in a MessageUid object. This object wrapping comes with
> an overhead of at least 12 bytes in addition to the 8 bytes payload (long).
> Quick benchmarks show it's actually worse: 10 million uids took up to
> 275 MB.
> {code:java}
> @Test
> void measureHeapUsage() throws InterruptedException {
>     int count = 10000000;
>     testee.addAll(IntStream.range(0, count)
>         .mapToObj(i -> MessageUid.of(i + 1))
>         .collect(Collectors.toList()));
>     Thread.sleep(1000);
>     System.out.println("GCing");
>     System.gc();
>     Thread.sleep(1000);
>
>     System.out.println(ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
> }
> {code}
> Now, taking a classical production deployment, I observe:
> - Some users have up to 2.5 million messages in their INBOX
> - An average of roughly 100,000 messages per user
> So for a small-scale deployment, we are already "consuming" ~300 MB of memory
> just for the UID <-> MSN mapping.
> Scaling to 1,000 users on a single James instance, we clearly see that heap
> consumption will start being a problem (~3GB), without even speaking of the
> target of 10,000 users per James instance that I have in mind.
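The back-of-the-envelope arithmetic behind those figures, taking the ~27.5 bytes per boxed uid implied by the benchmark above (275 MB / 10 million):

```java
// Rough heap estimate for the boxed UID <-> MSN mapping, derived from the
// 275 MB / 10 million uids measurement quoted above.
public class HeapEstimate {
    static final double BYTES_PER_BOXED_UID = 275_000_000.0 / 10_000_000; // ~27.5 B

    // users * messages/user * bytes/uid, expressed in GB
    static double estimateGb(int users, int messagesPerUser) {
        return users * (long) messagesPerUser * BYTES_PER_BOXED_UID / 1e9;
    }
}
```

With 1,000 users at 100,000 messages each, this yields ~2.75 GB, matching the ~3GB figure in the issue.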
> It's worth mentioning that IMAP being stateful, and the UID <-> MSN mapping
> being attached to a selected mailbox, such a mapping is long-lived:
> - Multiple small objects need to be copied individually by the GC, putting
> pressure on young-generation collections
> - Those long-lived objects will eventually be promoted to the old generation;
> the more there are, the longer the resulting stop-the-world GC pauses will be.
> h3. Temporary fix?
> We can get rid of the object boxing in UidMsnConverter by using primitive-type
> collections, for instance those provided by the fastutil project.
> The same benchmark was down to 84MB.
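What a primitive collection such as fastutil's LongArrayList does internally can be sketched with the stdlib alone; the point is that one backing long[] replaces millions of boxed Long objects:

```java
import java.util.Arrays;

// Sketch of a primitive long list: a single growable long[] backing array,
// so the GC manages one object instead of one per element.
public class PrimitiveLongList {
    private long[] elements = new long[16];
    private int size;

    public void add(long value) {
        if (size == elements.length) {
            // Grow the single backing array; no per-element allocation.
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    public long get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return elements[index];
    }

    public int size() {
        return size;
    }
}
```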
> Also, we could make things more compact by using an INT representation of
> UIDs. (Those are in most cases below 2 billion; to get above that, more than
> 2 billion emails would need to transit through one's mailbox, which is highly
> unlikely.) A fallback to "long" storage can be set up if a UID above 2 billion
> is observed.
> With such a compact int storage we are down to 46MB.
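The int-with-long-fallback idea could be sketched as follows; names and structure are illustrative assumptions, not James' actual implementation:

```java
import java.util.Arrays;

// Sketch: store UIDs as ints (4 bytes each) while they all fit, and promote
// the whole store to long[] the first time a UID above Integer.MAX_VALUE
// shows up.
public class CompactUidStore {
    private int[] compact = new int[16];  // primary, compact representation
    private long[] wide;                  // non-null once we have fallen back
    private int size;

    public void add(long uid) {
        if (wide == null && uid > Integer.MAX_VALUE) {
            fallbackToLongs();
        }
        ensureCapacity();
        if (wide != null) {
            wide[size++] = uid;
        } else {
            compact[size++] = (int) uid;
        }
    }

    public long get(int index) {
        return wide != null ? wide[index] : compact[index];
    }

    public int size() {
        return size;
    }

    private void fallbackToLongs() {
        wide = new long[Math.max(16, compact.length)];
        for (int i = 0; i < size; i++) {
            wide[i] = compact[i]; // widen all existing entries
        }
        compact = null;
    }

    private void ensureCapacity() {
        if (wide != null && size == wide.length) {
            wide = Arrays.copyOf(wide, wide.length * 2);
        } else if (wide == null && size == compact.length) {
            compact = Arrays.copyOf(compact, compact.length * 2);
        }
    }
}
```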
> So taking the previously mentioned numbers, we could expect a 1,000-user
> deployment to require ~400 MB and a larger-scale 10,000-user deployment on
> a single James to consume up to 4GB. Not that enjoyable, but definitely more
> manageable.
> Please note that primitive collections are also more GC friendly, as their
> elements are managed together, as a single object (the backing array).
> h3. What other mail servers do
> I found references to Dovecot, which uses a similar algorithm to ours:
> a binary search on a list of uids. The noticeable difference is that this
> list of UIDs is held on disk, not in memory as we do.
> References:
> https://doc.dovecot.org/developer_manual/design/indexes/mail_index_api/?highlight=time
> Of course, such a solution would be attractive... We could imagine keeping
> the last 1,000 uids in memory, which would most of the time be the ones used
> for MSN resolution, locating the rest on disk and using it only when needed,
> thus dramatically reducing heap pressure.
> Making UidMsnConverter an interface with a backing factory would enable
> different implementations to co-exist and allow some experimentation ;-)
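The suggested interface-plus-factory split could look like the following sketch; apart from the UidMsnConverter name, everything here is invented for illustration, and only the in-memory strategy is shown:

```java
import java.util.Arrays;

// Hypothetical sketch of a UidMsnConverter interface with a factory, so
// in-memory and disk-backed implementations could co-exist and be swapped
// for experimentation.
public interface UidMsnConverter {
    int getMsn(long uid);  // -1 when the UID is unknown
    long getUid(int msn);

    // Factory choosing a strategy; a real one might consult configuration.
    static UidMsnConverter of(long[] sortedUids, boolean preferDiskBacked) {
        // Only the in-memory strategy is sketched; a disk-backed one would
        // plug in here.
        return new InMemoryConverter(sortedUids);
    }

    class InMemoryConverter implements UidMsnConverter {
        private final long[] sortedUids;

        InMemoryConverter(long[] sortedUids) {
            this.sortedUids = sortedUids;
        }

        @Override
        public int getMsn(long uid) {
            int idx = Arrays.binarySearch(sortedUids, uid);
            return idx < 0 ? -1 : idx + 1;
        }

        @Override
        public long getUid(int msn) {
            return sortedUids[msn - 1];
        }
    }
}
```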
--
This message was sent by Atlassian Jira
(v8.20.1#820001)