[ https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939296#comment-16939296 ]
Li Cheng commented on HDDS-2186:
--------------------------------

[~ljain] After some investigation, it turns out that MiniOzoneCluster is overusing resources when it creates pipelines. This was not a problem before because each datanode could be assigned to only one pipeline, so the quota ran out quickly. Now that limit has been removed, and nothing stops the cluster from creating more pipelines until Ratis reports that a resource such as memory is exhausted. I'm adding logic to prevent this. Unfortunately, factor ONE and factor THREE pipelines need to be handled differently, so the logic keeps getting more complex.

> Fix tests using MiniOzoneCluster for its memory related exceptions
> ------------------------------------------------------------------
>
>                 Key: HDDS-2186
>                 URL: https://issues.apache.org/jira/browse/HDDS-2186
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>    Affects Versions: HDDS-1564
>            Reporter: Li Cheng
>            Priority: Major
>              Labels: flaky-test
>
> After the switch to multi-raft, MiniOzoneCluster misbehaves and reports a
> number of 'out of memory' exceptions in Ratis. Sample stack traces follow.
>
> 2019-09-26 15:12:22,824 [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker] ERROR segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:run(323)) - 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker hit exception
> java.lang.OutOfMemoryError: Direct buffer memory
>     at java.nio.Bits.reserveMemory(Bits.java:694)
>     at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>     at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>     at org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>     at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>     at java.lang.Thread.run(Thread.java:748)
>
> which leads to:
>
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990 for c1f4d375-683b-42fe-983b-428a63aa8803
> org.apache.ratis.protocol.TimeoutIOException: deadline exceeded after 2999881264ns
>     at org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82)
>     at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>     at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94)
>     at org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>     at org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205)
>     at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>     at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>     at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
>     at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>     at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
>     at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
>     at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
>     at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
>     at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>     at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
>     at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
>     at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>     at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>     at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>     at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 2999881264ns
>     at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
>     at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
>     at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
>     at org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:274)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$3(GrpcClientProtocolClient.java:149)
>     at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:176)
>     ... 25 more
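For illustration only, the comment at the top of this message mentions adding a limit on pipeline creation and notes that factor ONE and factor THREE pipelines behave differently under it. A minimal sketch of one way such a cap might look is below. This is not the HDDS-2186 patch and uses no Ozone or Ratis API: the class `PipelineQuota`, its methods, and the per-datanode limit parameter are all hypothetical names made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not part of Ozone or Ratis): caps how many
 * pipelines a single datanode may participate in.
 */
class PipelineQuota {
  // Assumed configuration value; the real setting, if any, may differ.
  private final int maxPipelinesPerDatanode;
  // Number of pipelines each datanode currently belongs to.
  private final Map<String, Integer> engaged = new HashMap<>();

  PipelineQuota(int maxPipelinesPerDatanode) {
    if (maxPipelinesPerDatanode < 1) {
      throw new IllegalArgumentException("limit must be at least 1");
    }
    this.maxPipelinesPerDatanode = maxPipelinesPerDatanode;
  }

  /** True if every candidate datanode is still below the cap. */
  boolean canCreate(List<String> datanodes) {
    for (String dn : datanodes) {
      if (engaged.getOrDefault(dn, 0) >= maxPipelinesPerDatanode) {
        return false;
      }
    }
    return true;
  }

  /**
   * Records a new pipeline on the given datanodes (one node for
   * factor ONE, three for factor THREE). Returns false and changes
   * nothing if any node is already at the cap.
   */
  boolean tryCreate(List<String> datanodes) {
    if (!canCreate(datanodes)) {
      return false;
    }
    for (String dn : datanodes) {
      engaged.merge(dn, 1, Integer::sum);
    }
    return true;
  }

  /**
   * Upper bound on the number of pipelines of one replication factor
   * (1 or 3) if the whole cluster hosted only that factor.
   */
  static int maxPipelines(int datanodeCount, int perDatanodeLimit, int factor) {
    return (datanodeCount * perDatanodeLimit) / factor;
  }
}
```

Under this sketch, a three-node cluster with a per-datanode limit of 2 could host at most 6 factor ONE pipelines but only 2 factor THREE pipelines, which is one reason the two factors cannot share a single cluster-wide count.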