Benoit Tellier created JAMES-3793:
-------------------------------------
Summary: OOM when loading a very large object from S3?
Key: JAMES-3793
URL: https://issues.apache.org/jira/browse/JAMES-3793
Project: James Server
Issue Type: Bug
Reporter: Benoit Tellier
h2. What?
We encountered recurring OutOfMemory exceptions on one of our production
deployments.
Memory dump analysis was inconclusive, which tends to rule out an explanation
based on a memory leak (only 300MB of objects on the heap a few minutes after
the OOM).
A careful log analysis led to what seems to be the "original OOM":
{code:java}
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Unknown Source)
    at software.amazon.awssdk.core.BytesWrapper.asByteArray(BytesWrapper.java:64)
    at org.apache.james.blob.objectstorage.aws.S3BlobStoreDAO$$Lambda$4237/0x00000008019f5ad8.apply(Unknown Source)
    at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
    at reactor.core.publisher.MonoPublishOn$PublishOnSubscriber.run(MonoPublishOn.java:181)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
{code}
Following this OOM the application is in a zombie state: unresponsive,
throwing OOMs without stacktraces, with Cassandra queries that never finish,
unable to obtain a RabbitMQ connection, and having issues within the S3
driver... This sounds like a limitation of reactive programming that prevents
the Java platform from handling the OOM like it should (crash the app, take a
dump, etc.).
A quick, partial audit of our dataset found several emails/attachments
exceeding 100MB (we might very well have some larger data!).
Thus the current explanation is that somehow we successfully saved a very big
mail in S3 and now get OOMs whenever something tries to read it (as the S3
blob store DAO does defensive copies).
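To make the suspected mechanism concrete, here is a minimal sketch of this
kind of read path (illustrative only, not the actual S3BlobStoreDAO code;
class and method names are made up). The AWS SDK first buffers the whole
object body, then `asByteArray()` copies it a second time:
{code:java}
import java.util.concurrent.CompletableFuture;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.core.async.AsyncResponseTransformer;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class S3ReadSketch {
    private final S3AsyncClient client;

    public S3ReadSketch(S3AsyncClient client) {
        this.client = client;
    }

    public Mono<byte[]> readBytes(String bucket, String key) {
        CompletableFuture<ResponseBytes<GetObjectResponse>> futureBytes = client.getObject(
            GetObjectRequest.builder().bucket(bucket).key(key).build(),
            // Buffers the entire object body in memory
            AsyncResponseTransformer.toBytes());

        return Mono.fromFuture(futureBytes)
            // asByteArray() performs a defensive copy of the already buffered body,
            // so a big object is briefly held in memory twice at this point.
            .map(ResponseBytes::asByteArray);
    }
}
{code}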
h2. Possible actions
This is an ongoing event, so our understanding of it may still evolve. Yet as
it raises interesting fixes that are hard to understand without the related
context, I decided to share it here anyway. I will report upcoming
developments here.
Our first actions are to confirm the current diagnosis:
- Further audit our datasets to find large items
- Deploy a patched version of James that rejects and logs S3 objects larger than 50MB
Yet our current understanding leads to interesting questions...
*Is it a good idea to load big objects from S3 into our memory?*
As a preliminary answer: upon email reads we use `byte[]` for simplicity (no
resource management, full view of the data). Changing this is not in the scope
of this ticket, as it is likely a major rework with many unforeseen impacts.
(I don't want to open that Pandora's box...)
SMTP, IMAP, JMAP, and the mailet container all have configuration preventing
sending/saving/receiving/uploading too big a mail/attachment/blob, so we
likely have a convincing defense line at the protocol level. Yet this can be
defeated by bad configuration (in our case JMAP was not checking the size of
sent emails...), by history (rules were not the same in the past, so we
ingested oversized mails back then), or by 'malicious action' (if all it takes
to crash James is to replace a 1 MB mail by a 1 GB mail...). It thus sounds
interesting to me to have additional protection at the data access layer, and
to be able to (optionally) configure S3 to not load objects of, say, more than
50 MB. This could be added within the blob.properties file.
Something like:
{code:java}
# Maximum size of blobs allowed to be loaded as a byte array. Allows
# preventing the loading of too large objects into memory (which can cause an
# OutOfMemoryException).
# Optional, defaults to no limit being enforced. This is a size in bytes.
# Supported units are B, K, M, G, T, defaulting to B.
max.blob.inmemory.size=50M
{code}
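To sketch where such a limit could be enforced, one option is to check the
object size with a HEAD request before issuing the GET that buffers the body.
This is only a rough illustration under assumptions: the property name above,
the class and method names below, and the extra HEAD round-trip are all
hypothetical, not the actual S3BlobStoreDAO behaviour:
{code:java}
import java.util.function.Supplier;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

public class BlobSizeGuard {
    private final S3AsyncClient client;
    // Parsed from the (hypothetical) max.blob.inmemory.size property of blob.properties
    private final long maxInMemorySizeBytes;

    public BlobSizeGuard(S3AsyncClient client, long maxInMemorySizeBytes) {
        this.client = client;
        this.maxInMemorySizeBytes = maxInMemorySizeBytes;
    }

    // Fails fast on oversized blobs instead of buffering them into memory.
    // Costs one extra HEAD round-trip per read.
    public Mono<byte[]> readGuarded(String bucket, String key, Supplier<Mono<byte[]>> actualRead) {
        return Mono.fromFuture(client.headObject(
                HeadObjectRequest.builder().bucket(bucket).key(key).build()))
            .flatMap(head -> {
                if (head.contentLength() > maxInMemorySizeBytes) {
                    return Mono.error(new IllegalStateException(
                        "Blob " + key + " is " + head.contentLength()
                            + " bytes, exceeding the configured in-memory limit of "
                            + maxInMemorySizeBytes + " bytes"));
                }
                return actualRead.get();
            });
    }
}
{code}
An alternative avoiding the extra round-trip would be to abort from within a
custom streaming AsyncResponseTransformer once more than the limit has been
received, at the cost of more involved code.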
As an operator this would give me some peace of mind, knowing that James won't
attempt to load GB-sized emails into memory and would instead fail early,
without heading into OOM territory and all the related stability issues that
brings.
Also, the incriminated code path (`BytesWrapper::asByteArray`) does a
defensive copy, but there is an alternative: `BytesWrapper::asByteArrayUnsafe`.
The S3 driver guarantees not to mutate the byte[], which sounds good enough
given that James does not mutate it either. Preventing needless copies of
multi-MB mails won't solve the core issue but would definitely give a nice
performance boost as well as decrease the impact of handling very large
emails...
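As a sketch of what that change would look like in a Reactor pipeline
(illustrative only; the real S3BlobStoreDAO code may differ):
{code:java}
import java.util.concurrent.CompletableFuture;

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

public class UnsafeReadSketch {
    // Current behaviour: copies the SDK's internal buffer into a fresh byte[].
    static Mono<byte[]> copying(CompletableFuture<ResponseBytes<GetObjectResponse>> futureBytes) {
        return Mono.fromFuture(futureBytes).map(ResponseBytes::asByteArray);
    }

    // Alternative: returns the SDK's internal byte[] without copying. Only safe
    // as long as neither the SDK nor James mutates the returned array.
    static Mono<byte[]> zeroCopy(CompletableFuture<ResponseBytes<GetObjectResponse>> futureBytes) {
        return Mono.fromFuture(futureBytes).map(ResponseBytes::asByteArrayUnsafe);
    }
}
{code}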