Benoit Tellier created JAMES-3676:
-------------------------------------

             Summary: S3 blobStore leaks HTTP connections when reading streams 
partially (even when closing them)
                 Key: JAMES-3676
                 URL: https://issues.apache.org/jira/browse/JAMES-3676
             Project: James Server
          Issue Type: Improvement
          Components: Blob
    Affects Versions: 3.6.0
            Reporter: Benoit Tellier
             Fix For: 3.7.0, 3.6.1


h3. Symptoms

Recently I found these delightful logs on one of my production environments:

{code:json}
        
{
  "timestamp": "2021-11-25T04:19:56.897Z",
  "level": "ERROR",
  "thread": "imapserver-executor-560",
  "mdc": {
    "fetchedData": "FetchData{flags=false, uid=true, internalDate=false, 
size=true, envelope=false, body=false, bodyStructure=false, setSeen=false, 
bodyElements=[BodyFetchElement{firstOctet=null, numberOfOctets=null, 
name=BODY[], sectionType=CONTENT, fieldNames=Optional.empty}], modSeq=false, 
changedSince=-1, vanished=false}",
    "protocol": "IMAP",
    "useUid": "true",
    "action": "FETCH",
    "sessionId": "SID-cdkbuyectzcc",
    "user": "[email protected]"
  },
  "logger": "org.apache.james.imap.processor.AbstractMailboxProcessor",
  "message": "Unexpected error during IMAP processing",
  "context": "default",
  "exception": "software.amazon.awssdk.core.exception.SdkClientException: 
Unable to execute HTTP request: Acquire operation took longer than the 
configured maximum time. This indicates that a request cannot get a connection 
from the pool within the specified maximum time. This can be due to high 
request rate.\nConsider taking any of the following actions to mitigate the 
issue: increase max connections, increase acquire timeout, or slowing the 
request rate.\nIncreasing the max connections can increase client throughput 
(unless the network interface is already fully utilized), but can eventually 
start to hit operation system limitations on the number of file descriptors 
used by the process. If you already are fully utilizing your network interface 
or cannot further increase your connection count, increasing the acquire 
timeout gives extra time for requests to acquire a connection before timing 
out. If the connections doesn't free up, the subsequent requests will still 
timeout.\nIf the above mechanisms are not able to fix the issue, try smoothing 
out your requests so that large traffic bursts cannot overload the client, 
being more efficient with the number of times you need to call AWS, or by 
increasing the number of hosts sending requests.\n\tat 
software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:98)\n\tat
 
software.amazon.awssdk.core.exception.SdkClientException.create(SdkClientException.java:43)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:204)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.utils.RetryableStageHelper.setLastException(RetryableStageHelper.java:200)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.maybeRetryExecute(AsyncRetryableStage.java:179)\n\tat
 
software.amazon.awssdk.core.internal.http.pipeline.stages.AsyncRetryableStage$RetryingExecutor.lambda$attemptExecute$1(AsyncRetryableStage.java:159)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown 
Source)\n\tat 
software.amazon.awssdk.utils.CompletableFutureUtils.lambda$forwardExceptionTo$0(CompletableFutureUtils.java:76)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown 
Source)\n\tat 
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage.lambda$null$0(MakeAsyncHttpRequestStage.java:104)\n\tat
 java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown
 Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown 
Source)\n\tat 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown 
Source)\n\tat 
software.amazon.awssdk.core.internal.http.pipeline.stages.MakeAsyncHttpRequestStage$WrappedErrorForwardingResponseHandler.onError(MakeAsyncHttpRequestStage.java:158)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:300)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener(NettyRequestExecutor.java:172)\n\tat
 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)\n\tat
 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)\n\tat
 
io.netty.util.concurrent.DefaultPromise.access$200(DefaultPromise.java:35)\n\tat
 io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:502)\n\tat 
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)\n\tat
 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)\n\tat
 io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)\n\tat 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)\n\tat
 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat 
java.base/java.lang.Thread.run(Unknown Source)\nCaused by: java.lang.Throwable: 
Acquire operation took longer than the configured maximum time. This indicates 
that a request cannot get a connection from the pool within the specified 
maximum time. This can be due to high request rate.\nConsider taking any of the 
following actions to mitigate the issue: increase max connections, increase 
acquire timeout, or slowing the request rate.\nIncreasing the max connections 
can increase client throughput (unless the network interface is already fully 
utilized), but can eventually start to hit operation system limitations on the 
number of file descriptors used by the process. If you already are fully 
utilizing your network interface or cannot further increase your connection 
count, increasing the acquire timeout gives extra time for requests to acquire 
a connection before timing out. If the connections doesn't free up, the 
subsequent requests will still timeout.\nIf the above mechanisms are not able 
to fix the issue, try smoothing out your requests so that large traffic bursts 
cannot overload the client, being more efficient with the number of times you 
need to call AWS, or by increasing the number of hosts sending requests.\n\tat 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.decorateException(NettyRequestExecutor.java:306)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure(NettyRequestExecutor.java:299)\n\t...
 11 common frames omitted\nCaused by: java.util.concurrent.TimeoutException: 
Acquire operation took longer than 10000 milliseconds.\n\tat 
software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire(HealthCheckedChannelPool.java:77)\n\tat
 
software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda$acquire$0(HealthCheckedChannelPool.java:67)\n\tat
 
org.glowroot.agent.plugin.executor.RunnableWrapper.run(RunnableWrapper.java:35)\n\tat
 io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)\n\tat 
io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)\n\t...
 6 common frames omitted\n"
}
{code}

TL;DR: the S3 driver cannot get an HTTP connection from the pool and fails 
randomly, which is very inconvenient.

I attempted a reboot, which temporarily solved the issue, but we eventually ran 
out of HTTP connections again after some time...

h3. Diagnostic

This sounds like a leak to me!

I quickly wrote a unit test reproducing the exact same issue by partially 
reading many blobs. Closing the resulting streams or not proved to have no 
effect.
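
The failure mode can be sketched with a pure-JDK toy (no S3 or Reactor 
involved; the pool counter and helper names below are hypothetical, for 
illustration only): when an InputStream is bridged over an Iterator, close() 
has no way to reach the producer, so the acquired resource is only released on 
full consumption.

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LeakSketch {
    // Counts resources still held by a hypothetical connection pool.
    static final AtomicInteger acquired = new AtomicInteger();

    // Bridges an Iterator of chunks into an InputStream. close() is the
    // inherited no-op: there is no way to cancel the underlying iterator.
    static InputStream overIterator(Iterator<byte[]> chunks) {
        acquired.incrementAndGet(); // simulate acquiring a pooled connection
        return new InputStream() {
            private byte[] current = new byte[0];
            private int pos;

            @Override
            public int read() {
                while (pos >= current.length) {
                    if (!chunks.hasNext()) {
                        acquired.decrementAndGet(); // released on full consumption only
                        return -1;
                    }
                    current = chunks.next();
                    pos = 0;
                }
                return current[pos++] & 0xFF;
            }
        };
    }

    // Partially reads one blob, closes the stream, and reports how many
    // "connections" are still held afterwards.
    static int leakAfterPartialRead() {
        acquired.set(0);
        try (InputStream in = overIterator(
                List.of(new byte[]{1, 2}, new byte[]{3}).iterator())) {
            in.read(); // read a single byte, then close
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        System.out.println(leakAfterPartialRead()); // prints 1: the "connection" leaked
    }
}
{code}

Despite the try-with-resources, the counter never goes back to zero: exactly 
what the S3 pool exhaustion above looks like at scale.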

This pointed me to a defect within ReactorUtils.toInputStream, which was not 
cancelling the related publisher when closing the stream: only full consumption 
released the subscription! Once again, an easy unit test properly demonstrates 
the issue.

Propagating the "close" from the SequencedInputStream back to the subscribed 
publisher cannot be done with an iterator intermediate representation, but 
adopting a Java 8 Stream instead allows us to propagate closure and cancel the 
subscription if need be. This approach solves both failing tests.
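
The direction of the fix can be illustrated with plain JDK types: 
java.util.stream.Stream, unlike Iterator, carries an onClose hook, so the 
bridge can run a cancellation callback when the consumer closes early. In the 
sketch below the AtomicBoolean stands in for cancelling the upstream 
subscription; the class and method names are illustrative, not the actual 
patch.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

public class StreamCloseSketch {
    // Returns true when the close hook fires despite partial consumption.
    static boolean closePropagates() {
        AtomicBoolean cancelled = new AtomicBoolean(false);
        // onClose registers the cancellation callback on the Stream itself.
        Stream<byte[]> chunks = Stream.of(new byte[]{1}, new byte[]{2}, new byte[]{3})
                .onClose(() -> cancelled.set(true));
        chunks.findFirst(); // consume only the first chunk
        chunks.close();     // reaches the producer through the onClose hook
        return cancelled.get();
    }

    public static void main(String[] args) {
        System.out.println(closePropagates()); // prints "true"
    }
}
{code}

With an Iterator there is simply no equivalent of onClose to hang the 
subscription cancellation on, which is why the intermediate representation had 
to change.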

h3. Complementary work

We might benefit from investing in tools for automatic leak detection.

I found some Java agents to do so: https://file-leak-detector.kohsuke.org/ 
could do the job!




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
