[ 
https://issues.apache.org/jira/browse/RATIS-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918030#comment-16918030
 ] 

Josh Elser commented on RATIS-485:
----------------------------------

I think the issue is that the {{TimeoutScheduler}} keeps submitting another 
shutdown task, even when one is already pending.

[https://github.com/apache/incubator-ratis/blob/087652d47704dc8b9edf7d28f8e6268af1364b1b/ratis-common/src/main/java/org/apache/ratis/util/TimeoutScheduler.java#L108-L113]

We get into a loop where we do the following:
 * Try to talk to the quorum
 * The call fails
 * Cancel that call
 * Schedule a task to shut down the scheduler with a delay of 1 minute
 * Repeat

Because we spin so fast, we pile up shutdown tasks that haven't yet run, and 
the pool keeps adding threads for them. The {{ScheduledThreadPoolExecutor}} we 
create in {{TimeoutScheduler}} has no upper bound, so we effectively DoS 
ourselves. If this is right, there are two things we can do:
 * Don't schedule another shutdown when we've already scheduled one that hasn't 
yet run
 * Apply an upper bound on the thread pool size (big, but not 
{{Integer.MAX_VALUE}})

It should be pretty easy to test this and prove whether or not this is what's 
happening.
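
To make the two suggestions above concrete, here is a minimal sketch. It is 
not the actual {{TimeoutScheduler}} code: the class name 
{{GuardedShutdownScheduler}}, its methods, and the pool size of 4 are all made 
up for illustration, and only the {{java.util.concurrent}} calls are real. It 
assumes a single flag is enough to tell whether a shutdown task is still 
pending.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; this is not the Ratis TimeoutScheduler. */
public class GuardedShutdownScheduler {
  private final ScheduledThreadPoolExecutor executor;
  private final long graceMillis;
  // True while a shutdown task is queued but has not run yet.
  private final AtomicBoolean shutdownPending = new AtomicBoolean(false);
  // Counts how many shutdown tasks were actually handed to the executor.
  private final AtomicInteger scheduled = new AtomicInteger();

  public GuardedShutdownScheduler(int maxThreads, long graceMillis) {
    // For a ScheduledThreadPoolExecutor, the core pool size is the effective
    // upper bound on the number of worker threads.
    this.executor = new ScheduledThreadPoolExecutor(maxThreads);
    this.graceMillis = graceMillis;
  }

  /** Schedules a delayed shutdown task unless one is already pending. */
  public boolean scheduleShutdown() {
    if (!shutdownPending.compareAndSet(false, true)) {
      return false; // one is already queued, so do not add another
    }
    scheduled.incrementAndGet();
    executor.schedule(() -> {
      // Clear the flag so a later idle period can schedule a new task.
      shutdownPending.set(false);
      // Real code would decide here whether the scheduler is still idle.
    }, graceMillis, TimeUnit.MILLISECONDS);
    return true;
  }

  public int scheduledCount() {
    return scheduled.get();
  }

  public int poolSize() {
    return executor.getPoolSize();
  }

  public void close() {
    executor.shutdownNow(); // drops the pending delayed task
  }

  public static void main(String[] args) {
    GuardedShutdownScheduler s = new GuardedShutdownScheduler(4, 60_000);
    // Simulate the tight retry loop: many shutdown requests in a row.
    for (int i = 0; i < 100_000; i++) {
      s.scheduleShutdown();
    }
    System.out.println("shutdown tasks scheduled: " + s.scheduledCount());
    System.out.println("worker threads: " + s.poolSize());
    s.close();
  }
}
```

With the guard in place, repeated calls made while a shutdown task is still 
pending are dropped rather than queued, and the fixed pool size caps the number 
of threads even if the guard were bypassed.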

> Load Generator OOMs if Ratis Unavailable
> ----------------------------------------
>
>                 Key: RATIS-485
>                 URL: https://issues.apache.org/jira/browse/RATIS-485
>             Project: Ratis
>          Issue Type: Bug
>          Components: examples
>            Reporter: Clay B.
>            Priority: Trivial
>         Attachments: loadgen.log, r485_20190827.patch
>
>
> Running the load generator without a Ratis cluster (e.g. spurious node IPs) 
> results in an OOM.
> If one has a single Ratis server it tries seemingly indefinitely:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1{code}
> If one has two Ratis servers it OOMs:
> {code:java}
> vagrant@ratis-server:~/incubator-ratis$ 
> ./ratis-examples/src/main/bin/client.sh filestore loadgen --size 1048576 
> --numFiles 100 --peers n0:127.0.0.1:1,n1:127.0.0.1:2
> [...]
> 1/787867107@5e5792a0 with java.util.concurrent.CompletionException: 
> java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:417 - client-272A2E13A5DD: suggested new 
> leader: null. Failed 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
> 2019-02-14 07:47:22 DEBUG RaftClient:437 - client-272A2E13A5DD: change Leader 
> from n1 to n0
> 2019-02-14 07:47:22 DEBUG RaftClient:291 - schedule attempt #10740 with 
> policy RetryForeverNoSleep for 
> RaftClientRequest:client-272A2E13A5DD->n1@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:323 - client-272A2E13A5DD: send* 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
> 2019-02-14 07:47:22 DEBUG RaftClient:338 - client-272A2E13A5DD: Failed 
> RaftClientRequest:client-272A2E13A5DD->n0@group-6F7570313233, cid=0, seq=0 
> RW, 
> org.apache.ratis.examples.filestore.FileStoreClient$$Lambda$41/787867107@5e5792a0
>  with java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: 
> unable to create new native thread
> Exception in thread "main" java.util.concurrent.CompletionException: 
> java.lang.OutOfMemoryError: unable to create new native thread
>         at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$sendRequestAsync$14(RaftClientImpl.java:349)
>         at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>         at 
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
>         at 
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
>         at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:334)
>         at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestWithRetryAsync(RaftClientImpl.java:286)
>         at 
> org.apache.ratis.util.SlidingWindow$Client.sendOrDelayRequest(SlidingWindow.java:243)
>         at 
> org.apache.ratis.util.SlidingWindow$Client.retry(SlidingWindow.java:259)
>         at 
> org.apache.ratis.client.impl.RaftClientImpl.lambda$null$10(RaftClientImpl.java:293)
>         at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:85)
>         at 
> org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:104)
>         at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50)
>         at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at 
> java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at 
> java.util.concurrent.ThreadPoolExecutor.ensurePrestart(ThreadPoolExecutor.java:1603)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:334)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
>         at 
> org.apache.ratis.util.TimeoutScheduler.schedule(TimeoutScheduler.java:117)
>         at 
> org.apache.ratis.util.TimeoutScheduler.onTimeout(TimeoutScheduler.java:104)
>         at 
> org.apache.ratis.util.TimeoutScheduler.onTimeout(TimeoutScheduler.java:82)
>         at 
> org.apache.ratis.util.TimeoutScheduler.onTimeout(TimeoutScheduler.java:134)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.onNext(GrpcClientProtocolClient.java:234)
>         at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsync(GrpcClientRpc.java:71)
>         at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequestAsync(RaftClientImpl.java:324)
>         ... 15 more
> {code}



