[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954002#comment-16954002 ] Lokesh Jain commented on RATIS-706: --- [~elek] Thanks for reporting the issue! [~szetszwo] Thanks for the contribution! I have committed the patch to the master branch. > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. > {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at >
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953937#comment-16953937 ] Hadoop QA commented on RATIS-706: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 55s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 12s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 50s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 5s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 18m 2s{color} | {color:red} root in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 26m 25s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | ratis.logservice.TestLogServiceWithNetty | | | ratis.logservice.TestLogServiceWithGrpc | | | ratis.logservice.server.TestMetaServer | | | ratis.grpc.TestServerRestartWithGrpc | | | ratis.grpc.TestRaftReconfigurationWithGrpc | | | ratis.server.simulation.TestRaftReconfigurationWithSimulatedRpc | | | ratis.netty.TestRaftReconfigurationWithNetty | | | ratis.server.raftlog.TestRaftLogMetrics | | | ratis.grpc.TestRaftAsyncWithGrpc | | | ratis.grpc.TestWatchRequestWithGrpc | | | ratis.examples.filestore.TestFileStoreWithGrpc | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.3 Server=19.03.3 Image:yetus/ratis:date2019-10-17 | | JIRA Issue | RATIS-706 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12982735/r706_20191011b.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs checkstyle compile | | uname | Linux 51181d5e7d30 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh | | git revision | master / 9cbf1ef | | maven | version: Apache Maven 3.6.2 (40f52333136460af0dc0d7232c0dc0bcf0d9e117; 2019-08-27T15:06:16Z) | | Default Java | 1.8.0_222 | | unit | https://builds.apache.org/job/PreCommit-RATIS-Build/1075/artifact/out/patch-unit-root.txt | | Test Results | https://builds.apache.org/job/PreCommit-RATIS-Build/1075/testReport/ | | Max. process+thread count | 3368 (vs. ulimit of 5000) | | modules | C: ratis-common ratis-test U: . | | Console output | https://builds.apache.org/job/PreCommit-RATIS-Build/1075/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953915#comment-16953915 ] Lokesh Jain commented on RATIS-706: --- [~szetszwo] Thanks for working on this! The changes look good to me. +1. > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. > {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at > org.apache.ratis.util.SlidingWindow$Client.resetFirstSeqNum(SlidingWindow.java:348) > - waiting to lock
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951107#comment-16951107 ] Marton Elek commented on RATIS-706: --- Thanks the quick fix [~szetszwo]. I deployed it and worked well, couldn't see the dead lock any more. I am not sure If I can judge the code but will try if nobody else will review it... > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. > {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at >
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949196#comment-16949196 ] Hadoop QA commented on RATIS-706: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 8s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 12s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 46s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 7s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 39m 39s{color} | {color:red} root in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 50m 3s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | ratis.examples.filestore.TestFileStoreWithNetty | | | ratis.netty.TestRaftSnapshotWithNetty | | | ratis.server.simulation.TestRaftExceptionWithSimulation | | | ratis.grpc.TestGroupInfoWithGrpc | | | ratis.server.simulation.TestRaftWithSimulatedRpc | | | ratis.grpc.TestRaftAsyncWithGrpc | | | ratis.netty.TestRetryCacheWithNettyRpc | | | ratis.grpc.TestRaftWithGrpc | | | ratis.grpc.TestLeaderElectionWithGrpc | | | ratis.grpc.TestServerRestartWithGrpc | | | ratis.netty.TestGroupManagementWithNetty | | | ratis.netty.TestRaftWithNetty | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.3 Server=19.03.3 Image:yetus/ratis:date2019-10-11 | | JIRA Issue | RATIS-706 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12982735/r706_20191011b.patch | | Optional Tests | dupname asflicense javac javadoc unit findbugs checkstyle compile | | uname | Linux 8f427da481b7 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-RATIS-Build/yetus-personality.sh | | git revision | master / 699792d | | maven | version: Apache Maven 3.6.2 (40f52333136460af0dc0d7232c0dc0bcf0d9e117; 2019-08-27T15:06:16Z) | | Default Java | 1.8.0_222 | | unit | https://builds.apache.org/job/PreCommit-RATIS-Build/1056/artifact/out/patch-unit-root.txt | | Test Results | https://builds.apache.org/job/PreCommit-RATIS-Build/1056/testReport/ | | Max. process+thread count | 2689 (vs. ulimit of 5000) | | modules | C: ratis-common ratis-test U: . | | Console output | https://builds.apache.org/job/PreCommit-RATIS-Build/1056/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949150#comment-16949150 ] Tsz-wo Sze commented on RATIS-706: -- r706_20191011b.patch: adds a test. > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. > {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at > org.apache.ratis.util.SlidingWindow$Client.resetFirstSeqNum(SlidingWindow.java:348) > - waiting to lock <0x00033eb99b10> (a >
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949096#comment-16949096 ] Tsz-wo Sze commented on RATIS-706: -- r706_20191011.patch: close proxy without holding lock. Will add a test. > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt, r706_20191011.patch > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. > {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at > org.apache.ratis.util.SlidingWindow$Client.resetFirstSeqNum(SlidingWindow.java:348) > - waiting to lock <0x00033eb99b10> (a >
[jira] [Commented] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949072#comment-16949072 ] Tsz-wo Sze commented on RATIS-706: -- Thanks for filing this bug. It is a real deadlock. One easy fix is to change PeerProxyMap to close a proxy without holding the lock. Will post a patch. > Dead lock in GrpcClientRpc > -- > > Key: RATIS-706 > URL: https://issues.apache.org/jira/browse/RATIS-706 > Project: Ratis > Issue Type: Bug > Components: gRPC >Reporter: Marton Elek >Assignee: Tsz-wo Sze >Priority: Major > Attachments: jstack.txt > > > I started an Ozone cluster on Kubernetes and started a freon test (ozone > freon ockg -n1) > After a while I found that the one freon instance is not creating keys any > more. I checked the om RPC endpoint with ozone insight and no RPC messages > has been arrived. > Based on the jstack output we have a deadlock between > PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync. > I am not sure (yet) what is the exact problem, but based on the stack traces > It seems to be Ratis related. > {code} > Found one Java-level deadlock: > = > "pool-2-thread-6": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a > org.apache.ratis.util.SlidingWindow$Client), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]" > "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty > queue]": > waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a > java.lang.Object), > which is held by > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]" > Java stack information for the threads listed above: > === > "pool-2-thread-6": > at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103) > - waiting to lock <0x00033eb70a00> (a java.lang.Object) > at > org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78) > at > org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75) > at > org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59) > at > org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139) > at > org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198) > at > org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482) > at > org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496) > at > org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435) > at > org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473) > at > org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60) > - locked <0x0003f2ba4240> (a > org.apache.hadoop.ozone.client.io.OzoneOutputStream) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88) > at > org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615) > at > java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515) > at > java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628) > at java.lang.Thread.run(java.base@11.0.3/Thread.java:834) > "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty > queue]": > at > org.apache.ratis.util.SlidingWindow$Client.resetFirstSeqNum(SlidingWindow.java:348) >