[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze updated RATIS-706:
-----------------------------
    Attachment: r706_20191011b.patch

> Dead lock in GrpcClientRpc
> --------------------------
>
>                 Key: RATIS-706
>                 URL: https://issues.apache.org/jira/browse/RATIS-706
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>            Reporter: Marton Elek
>            Assignee: Tsz-wo Sze
>            Priority: Major
>         Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch
>
> I started an Ozone cluster on Kubernetes and started a freon test (ozone freon ockg -n1).
> After a while I found that the freon instance was not creating keys any more. I checked
> the OM RPC endpoint with ozone insight and no RPC messages had arrived.
> Based on the jstack output we have a deadlock between PeerProxyMap.handleException
> and GrpcClientRpc.sendRequestAsync.
> I am not sure (yet) what the exact problem is, but based on the stack traces it seems
> to be Ratis related.
> {code}
> Found one Java-level deadlock:
> =============================
> "pool-2-thread-6":
>   waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a java.lang.Object),
>   which is held by "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]"
> "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]":
>   waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a org.apache.ratis.util.SlidingWindow$Client),
>   which is held by "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty queue]"
> "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty queue]":
>   waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a java.lang.Object),
>   which is held by "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]"
>
> Java stack information for the threads listed above:
> ===================================================
> "pool-2-thread-6":
>         at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103)
>         - waiting to lock <0x00033eb70a00> (a java.lang.Object)
>         at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78)
>         at org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75)
>         at org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59)
>         at org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139)
>         at org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282)
>         at org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198)
>         at org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161)
>         at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346)
>         at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482)
>         at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496)
>         at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143)
>         at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435)
>         at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473)
>         at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
>         - locked <0x0003f2ba4240> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)
>         at org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710)
>         at org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88)
>         at org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615)
>         at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515)
>         at java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628)
>         at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)
> "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]":
>         at org.apache.ratis.util.SlidingWindow$Client.resetFirstSeqNum(SlidingWindow.java:348)
>         - waiting to lock <0x00033eb99b10> (a org.apache.ratis.util.SlidingWindow$Client)
>         at org.apache.ratis.client.impl.OrderedAsync.resetSlidingWindow(OrderedAsync.java:121)
>         at org.apache.ratis.client.impl.OrderedAsync$$Lambda$496/0x000840498440.accept(Unknown Source)
>         at org.apache.ratis.client.impl.RaftClientImpl.lambda$handleIOException$6(RaftClientImpl.java:349)
>         at org.apache.ratis.client.impl.RaftClientImpl$$Lambda$427/0x0008402e1840.accept(Unknown Source)
>         at java.util.Optional.ifPresent(java.base@11.0.3/Optional.java:183)
>         at org.apache.ratis.client.impl.RaftClientImpl.handleIOException(R
> {code}
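The trace is a classic lock-ordering inversion: one thread holds the SlidingWindow$Client monitor and waits for the PeerProxyMap's internal java.lang.Object lock, while another thread holds the proxy lock and, from the exception-handling path, calls back into the sliding window. The following self-contained program is a minimal sketch of that pattern; it is hypothetical illustration code, not Ratis internals, and the names LockOrderDeadlock, proxyLock and windowLock are invented stand-ins. Run it and take jstack <pid> to see the same kind of "Found one Java-level deadlock" report as in the attachment.

{code:java}
// Hypothetical minimal reproduction of the lock-ordering cycle in the trace.
public class LockOrderDeadlock {
  private final Object proxyLock = new Object();   // stands in for PeerProxyMap's Object lock
  private final Object windowLock = new Object();  // stands in for the SlidingWindow$Client monitor

  // Widen the race window so the deadlock shows up reliably on every run.
  private static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }

  // Request path: window lock first, then proxy lock (cf. sendRequestAsync -> getProxy).
  void sendRequest() {
    synchronized (windowLock) {
      pause();
      synchronized (proxyLock) { // blocks while handleException() holds proxyLock
        System.out.println("request sent");
      }
    }
  }

  // Error path: proxy lock first, then window lock (cf. handleException -> resetSlidingWindow).
  void handleException() {
    synchronized (proxyLock) {
      pause();
      synchronized (windowLock) { // blocks while sendRequest() holds windowLock
        System.out.println("window reset");
      }
    }
  }

  public static void main(String[] args) {
    LockOrderDeadlock d = new LockOrderDeadlock();
    new Thread(d::sendRequest, "request-thread").start();
    new Thread(d::handleException, "error-thread").start();
    // Both threads now block on each other forever; jstack reports the cycle.
  }
}
{code}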
[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze updated RATIS-706:
-----------------------------
    Attachment: r706_20191011.patch
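The attached patches are not inlined in this thread, so for orientation only: a common way to break this class of cycle is to stop calling out to other components (here, the sliding-window reset) while the map's lock is held, by copying the work under the lock and running it after release. The sketch below assumes invented names (ProxyMapSketch, registerResetCallback) and should not be read as the actual patch content.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "invoke callbacks outside the lock" remedy.
class ProxyMapSketch {
  private final Object lock = new Object();
  private final List<Runnable> resetCallbacks = new ArrayList<>();

  void registerResetCallback(Runnable callback) {
    synchronized (lock) {
      resetCallbacks.add(callback);
    }
  }

  void handleException(Throwable t) {
    List<Runnable> toRun;
    synchronized (lock) {
      // Mutate internal state under the lock, but only *copy* the callbacks.
      toRun = new ArrayList<>(resetCallbacks);
    }
    // Run the callbacks (e.g. a sliding-window reset) after the lock is
    // released, so a callback that takes another monitor cannot close a cycle.
    for (Runnable r : toRun) {
      r.run();
    }
  }
}
{code}

The alternative remedy is to impose a single global acquisition order on the two locks, which is harder to retrofit when the callback is registered by a different component.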
[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukul Kumar Singh updated RATIS-706:
------------------------------------
    Description: (same text and deadlock trace as quoted above)
[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc
[ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marton Elek updated RATIS-706:
------------------------------
    Attachment: jstack.txt

--
This message was sent by Atlassian Jira
(v8.3.4#803005)