[
https://issues.apache.org/jira/browse/JAMES-3793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benoit Tellier closed JAMES-3793.
---------------------------------
Resolution: Fixed
> OOM when loading a very large object from S3?
> ---------------------------------------------
>
> Key: JAMES-3793
> URL: https://issues.apache.org/jira/browse/JAMES-3793
> Project: James Server
> Issue Type: Bug
> Reporter: Benoit Tellier
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h2. What?
> We encountered recurring OutOfMemory exception on one of our production
> deployment.
> Memory dump analysis was inconclusive, which tends to disqualify an
> explanation based on a memory leak (only 300MB of objects on the heap a few
> minutes after the OOM).
> A careful log analysis led us to what seems to be the "original OOM":
> {code:java}
> java.lang.OutOfMemoryError: Java heap space
>     at java.base/java.util.Arrays.copyOf(Unknown Source)
>     at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
>     at org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown Source)
>     at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
>     at reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
>     at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
>     at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
>     at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
>     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>
> Following this OOM the application is in a zombie state: unresponsive,
> throwing OOMs without stack traces, with Cassandra queries that never
> finish, unable to obtain a RabbitMQ connection, and having issues within the
> S3 driver... This sounds like a limitation of reactive programming, which
> prevents the Java platform from handling the OOM like it should (crash the
> app, take a dump, etc.).
> We quickly audited our dataset and found several emails/attachments
> exceeding 100MB, and that was only a partial and quick audit (we might very
> well have some larger data!).
> Thus the current explanation is that we somehow successfully saved a very
> big mail in S3 and now get OOMs when one tries to read it (as the S3 blob
> store DAO does defensive copies).
> h2. Possible actions
> This is an ongoing event, so our understanding of it may evolve; yet as it
> raises interesting fixes that are hard to understand without the related
> context, I decided to share it here anyway. I will report upcoming
> developments here.
> Our first action is to confirm the current diagnosis:
> - Further audit our datasets to find large items
> - Deploy a patched version of James that rejects and logs S3 objects larger
> than 50MB
> Yet our current understanding leads to interesting questions...
> *Is it a good idea to load big objects from S3 into our memory?*
> As a preliminary answer: upon email reads we are using `byte[]` for
> simplicity (no resource management, full view of the data). Changing this is
> not in the scope of this ticket, as it is likely a major rework with many
> unthought-of impacts. (I don't want to open that Pandora's box...)
> SMTP, IMAP, JMAP, and the mailet container all have configuration preventing
> sending/saving/receiving/uploading too big a mail/attachment/blob, so we
> likely have a convincing line of defense at the protocol level. Yet this can
> be defeated by bad configuration (in our case JMAP was not checking
> the size of sent emails...), by history (rules were not the same in the
> past, so we ingested mails back then that would be too big today), or by
> 'malicious action' (if all it takes to crash James is to replace a 1 MB mail
> by a 1 GB mail...). It thus sounds interesting to me to have additional
> protection at the data access layer, and be able to (optionally) configure
> S3 to not load objects of, say, more than 50 MB. This could be added to the
> blob.properties file.
> Something like:
> Something like:
> {code:java}
> # Maximum size of blobs allowed to be loaded as a byte array. Allows preventing
> # the loading of too large objects into memory (which can cause OutOfMemoryError).
> # Optional, defaults to no limit being enforced. This is a size in bytes.
> # Supported units are B, K, M, G, T, defaulting to B.
> max.blob.inmemory.size=50M
> {code}
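> For illustration, a minimal sketch in plain Java of how such a B/K/M/G/T size
> value could be parsed (the `parseSize` helper is hypothetical; James' actual
> configuration parsing utilities may differ):

```java
// Sketch: parse a size string like "50M" into a byte count.
// Hypothetical helper for illustration only.
public class SizeParser {
    public static long parseSize(String value) {
        String v = value.trim().toUpperCase();
        long multiplier;
        char unit = v.charAt(v.length() - 1);
        switch (unit) {
            case 'T': multiplier = 1L << 40; break;
            case 'G': multiplier = 1L << 30; break;
            case 'M': multiplier = 1L << 20; break;
            case 'K': multiplier = 1L << 10; break;
            case 'B': multiplier = 1L; break;
            default:
                // No unit suffix: the whole string is a plain byte count.
                return Long.parseLong(v);
        }
        return Long.parseLong(v.substring(0, v.length() - 1)) * multiplier;
    }

    public static void main(String[] args) {
        System.out.println(parseSize("50M")); // 52428800
        System.out.println(parseSize("1G"));  // 1073741824
        System.out.println(parseSize("512")); // 512
    }
}
```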
> As an operator, this would give me some peace of mind, knowing that James
> won't attempt to load GB-large emails into memory and would fail early,
> without heading into OOM territory and all the related stability issues it
> brings.
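> A minimal sketch of the kind of guard that could sit at the data access
> layer (plain Java; the class name is hypothetical and it throws
> IllegalStateException for illustration, whereas the real check would live in
> S3BlobStoreDAO against the S3 object's advertised content length):

```java
// Sketch: fail early when an object exceeds the configured in-memory limit,
// before any bytes are materialized on the heap.
// Hypothetical class for illustration only.
public class InMemorySizeGuard {
    private final long maxBytes; // e.g. parsed from max.blob.inmemory.size

    public InMemorySizeGuard(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Throws before loading when the advertised content length is too big. */
    public void check(String blobId, long contentLength) {
        if (contentLength > maxBytes) {
            throw new IllegalStateException(
                "Blob " + blobId + " is " + contentLength
                + " bytes, exceeding the configured in-memory limit of "
                + maxBytes + " bytes");
        }
    }
}
```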
> Also, the incriminated code path (`BytesWrapper::asByteArray`) does a
> defensive copy, but there is an alternative: `BytesWrapper::asByteArrayUnsafe`.
> The S3 driver guarantees not to mutate the byte[], which sounds good enough
> given that James doesn't mutate it either. Avoiding needless copies of
> MB-large mails won't solve the core issue but will definitely give a nice
> performance boost as well as reduce the impact of handling very large
> emails...
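> To illustrate the difference with a toy stand-in (this is NOT the SDK class,
> just a sketch of the two access patterns): asByteArray copies the buffer,
> while asByteArrayUnsafe hands back the backing array, so the caller must
> promise not to mutate it:

```java
import java.util.Arrays;

// Toy stand-in for software.amazon.awssdk.core.BytesWrapper, for
// illustration only: asByteArray() returns a defensive copy (briefly
// doubling the memory held), asByteArrayUnsafe() returns the backing
// array itself (safe only as long as nobody mutates it).
public class ToyBytesWrapper {
    private final byte[] bytes;

    public ToyBytesWrapper(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Defensive copy: allocates a second array of the same size. */
    public byte[] asByteArray() {
        return Arrays.copyOf(bytes, bytes.length);
    }

    /** No copy: returns the backing array; the caller must not mutate it. */
    public byte[] asByteArrayUnsafe() {
        return bytes;
    }
}
```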
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]