[
https://issues.apache.org/jira/browse/JAMES-3676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benoit Tellier closed JAMES-3676.
---------------------------------
Resolution: Fixed
> S3 blobStore leaks HTTP connections when reading streams partially (even when
> closing them)
> -------------------------------------------------------------------------------------------
>
> Key: JAMES-3676
> URL: https://issues.apache.org/jira/browse/JAMES-3676
> Project: James Server
> Issue Type: Improvement
> Components: Blob
> Affects Versions: 3.6.0
> Reporter: Benoit Tellier
> Priority: Major
> Labels: bug
> Fix For: 3.7.0, 3.6.1
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h3. Symptoms
> Recently I found these delightful logs on one of my production environments:
> {code:java}
>
> {
> "timestamp": "2021-11-25T04:19:56.897Z",
> "level": "ERROR",
> "thread": "imapserver-executor-560",
> "mdc": {
> "fetchedData": "FetchData{flags=false, uid=true, internalDate=false,
> size=true, envelope=false, body=false, bodyStructure=false, setSeen=false,
> bodyElements=[BodyFetchElement{firstOctet=null, numberOfOctets=null,
> name=BODY[], sectionType=CONTENT, fieldNames=Optional.empty}], modSeq=false,
> changedSince=-1, vanished=false}",
> "protocol": "IMAP",
> "useUid": "true",
> "action": "FETCH",
> "sessionId": "SID-cdkbuyectzcc",
> "user": "[email protected]"
> },
> "logger": "org.apache.james.imap.processor.AbstractMailboxProcessor",
> "message": "Unexpected error during IMAP processing",
> "context": "default",
> "exception": "software.amazon.awssdk.core.exception.SdkClientException:
> Unable to execute HTTP request: Acquire operation took longer than the
> configured maximum time. This indicates that a request cannot get a
> connection from the pool within the specified maximum time. This can be due
> to high request rate.\nConsider taking any of the following actions to
> mitigate the issue: increase max connections, increase acquire timeout, or
> slowing the request rate.\nIncreasing the max connections can increase client
> throughput (unless the network interface is already fully utilized), but can
> eventually start to hit operation system limitations on the number of file
> descriptors used by the process. If you already are fully utilizing your
> network interface or cannot further increase your connection count,
> increasing the acquire timeout gives extra time for requests to acquire a
> connection before timing out. If the connections doesn't free up, the
> subsequent requests will still timeout.\nIf the above mechanisms are not able
> to fix the issue, try smoothing out your requests so that large traffic
> bursts cannot overload the client, being more efficient with the number of
> times you need to call AWS, or by increasing the number of hosts sending
> requests.\n\tat
> software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)\n\tat
>
> software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)\n\tat
>
> software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)\n\tat
>
> software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)\n\tat
>
> software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)\n\tat
>
> software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)\n\tat
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown
> Source)\n\tat
> software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)\n\tat
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown
> Source)\n\tat
> software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:104)\n\tat
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown
> Source)\n\tat
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown
> Source)\n\tat
> software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onError(MakeAsyncHttpRequestStage.java:158)\n\tat
>
> software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:300)\n\tat
>
> software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:172)\n\tat
>
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\tat
>
> io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)\n\tat
>
> io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)\n\tat
> io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)\n\tat
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)\n\tat
>
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\tat
> io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)\n\tat
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat
>
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat
> java.base/java.lang.Thread.run(Unknown Source)\nCaused by:
> java.lang.Throwable: Acquire operation took longer than the configured
> maximum time. This indicates that a request cannot get a connection from the
> pool within the specified maximum time. This can be due to high request
> rate.\nConsider taking any of the following actions to mitigate the issue:
> increase max connections, increase acquire timeout, or slowing the request
> rate.\nIncreasing the max connections can increase client throughput (unless
> the network interface is already fully utilized), but can eventually start to
> hit operation system limitations on the number of file descriptors used by
> the process. If you already are fully utilizing your network interface or
> cannot further increase your connection count, increasing the acquire timeout
> gives extra time for requests to acquire a connection before timing out. If
> the connections doesn't free up, the subsequent requests will still
> timeout.\nIf the above mechanisms are not able to fix the issue, try
> smoothing out your requests so that large traffic bursts cannot overload the
> client, being more efficient with the number of times you need to call AWS,
> or by increasing the number of hosts sending requests.\n\tat
> software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.decorateException(NettyRequestExecutor.java:306)\n\tat
>
> software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:299)\n\t...
> 11 common frames omitted\nCaused by: java.util.concurrent.TimeoutException:
> Acquire operation took longer than 10000 milliseconds.\n\tat
> software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)\n\tat
>
> software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)\n\tat
>
> org.glowroot.agent.plugin.executor.RunnableWrapper.run(RunnableWrapper.java:35)\n\tat
> io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)\n\t...
> 6 common frames omitted\n"
> }
> {code}
> TLDR: the S3 driver cannot get an HTTP connection from the pool and fails
> randomly, which is very inconvenient.
> I attempted a reboot, which temporarily solved the issue, but we eventually
> ran out of HTTP connections again after some time...
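> The SDK error message above suggests raising the pool size or the acquire
> timeout. For reference, here is a minimal sketch of where those knobs live on
> the AWS SDK v2 Netty client (values and region are purely illustrative, and
> this is not James's own configuration mechanism); with a genuine leak these
> settings only postpone the exhaustion:
> {code:java}
> import java.time.Duration;
>
> import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
> import software.amazon.awssdk.regions.Region;
> import software.amazon.awssdk.services.s3.S3AsyncClient;
>
> public class S3PoolTuning {
>     public static S3AsyncClient client() {
>         return S3AsyncClient.builder()
>             .region(Region.EU_WEST_1) // illustrative
>             .httpClientBuilder(NettyNioAsyncHttpClient.builder()
>                 // A bigger pool and a longer acquire timeout only delay the
>                 // "Acquire operation took longer than..." failure when
>                 // connections are leaked; they do not fix the leak itself.
>                 .maxConcurrency(100)
>                 .connectionAcquisitionTimeout(Duration.ofSeconds(30)))
>             .build();
>     }
> }
> {code}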
> h3. Diagnostic
> This sounds like a leak to me!
> I quickly wrote a unit test reproducing the exact same issue by partially
> reading many blobs; closing the resulting streams or not proved to have no
> effect.
> This pointed me to a defect within ReactorUtils.toInputStream: it was not
> cancelling the related publisher when closing the stream, and only full
> consumption was releasing the subscription! Once again, a simple unit test
> properly demonstrates the issue (see the sketch below).
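> A minimal sketch of such a test (the signature of ReactorUtils.toInputStream
> is assumed here to take a Flux<ByteBuffer>, and the org.apache.james.util
> package is assumed; adjust to the actual API):
> {code:java}
> import static org.assertj.core.api.Assertions.assertThat;
>
> import java.io.InputStream;
> import java.nio.ByteBuffer;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> import org.apache.james.util.ReactorUtils;
> import org.junit.jupiter.api.Test;
>
> import reactor.core.publisher.Flux;
>
> class ToInputStreamCancellationTest {
>     @Test
>     void closeShouldCancelTheUnderlyingPublisher() throws Exception {
>         AtomicBoolean cancelled = new AtomicBoolean(false);
>
>         // Large source of small chunks: only cancellation can release it early
>         Flux<ByteBuffer> chunks = Flux.range(0, 1_000_000)
>             .map(i -> ByteBuffer.wrap(new byte[] {1, 2, 3}))
>             .doOnCancel(() -> cancelled.set(true));
>
>         try (InputStream in = ReactorUtils.toInputStream(chunks)) {
>             in.read(); // partial read only
>         }
>
>         // Closing without full consumption should cancel the subscription,
>         // otherwise the backing resource (here, an S3 HTTP connection) leaks
>         assertThat(cancelled.get()).isTrue();
>     }
> }
> {code}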
> Propagating the "close" from the SequencedInputStream back to the subscribed
> publisher cannot be done with an iterator intermediate representation, but
> adopting a Java 8 Stream instead allows propagating closure and cancelling the
> subscription if need be. This approach solves both failing tests, as sketched
> below.
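> A minimal sketch of the idea (not the actual James patch): Flux#toStream()
> yields a java.util.stream.Stream whose close() cancels the underlying
> subscription, so the wrapping InputStream only has to forward its own close():
> {code:java}
> import java.io.InputStream;
> import java.nio.ByteBuffer;
> import java.util.Iterator;
> import java.util.stream.Stream;
>
> import reactor.core.publisher.Flux;
>
> class StreamBackedInputStream extends InputStream {
>     private final Stream<ByteBuffer> chunks; // closing this cancels the Reactor subscription
>     private final Iterator<ByteBuffer> iterator;
>     private ByteBuffer current = ByteBuffer.allocate(0);
>
>     StreamBackedInputStream(Flux<ByteBuffer> source) {
>         this.chunks = source.toStream();
>         this.iterator = chunks.iterator();
>     }
>
>     @Override
>     public int read() {
>         while (!current.hasRemaining()) {
>             if (!iterator.hasNext()) {
>                 return -1;
>             }
>             current = iterator.next();
>         }
>         return current.get() & 0xFF;
>     }
>
>     @Override
>     public void close() {
>         // Propagating close() cancels the subscription even after a partial
>         // read, which releases the pooled S3 HTTP connection
>         chunks.close();
>     }
> }
> {code}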
> Possible causes that "revealed" this issue include:
> - S3 driver update <3
> - Additional usage of the BlobStore stream API for JMAP blob uploads?
> h3. Complementary work
> We might benefit from investing in tools for automatic leak detection.
> I found some Java agents for this; https://file-leak-detector.kohsuke.org/
> could do the job!
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]