[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc

2019-10-10 Thread Tsz-wo Sze (Jira)


 [ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze updated RATIS-706:
-----------------------------
Attachment: r706_20191011b.patch

> Dead lock in GrpcClientRpc
> --------------------------
>
>                 Key: RATIS-706
>                 URL: https://issues.apache.org/jira/browse/RATIS-706
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>            Reporter: Marton Elek
>            Assignee: Tsz-wo Sze
>            Priority: Major
>         Attachments: jstack.txt, r706_20191011.patch, r706_20191011b.patch
>
>
> I started an Ozone cluster on Kubernetes and ran a freon test (ozone freon ockg -n1).
> After a while I found that the single freon instance was no longer creating keys. I checked the OM RPC endpoint with ozone insight, and no RPC messages had arrived.
> Based on the jstack output, we have a deadlock between PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync.
> I am not sure (yet) what the exact problem is, but based on the stack traces it seems to be Ratis related.

[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc

2019-10-10 Thread Tsz-wo Sze (Jira)


 [ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze updated RATIS-706:
-----------------------------
Attachment: r706_20191011.patch


[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc

2019-10-10 Thread Mukul Kumar Singh (Jira)


 [ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mukul Kumar Singh updated RATIS-706:
------------------------------------
Description: 
I started an Ozone cluster on Kubernetes and ran a freon test (ozone freon ockg -n1).

After a while I found that the single freon instance was no longer creating keys. I checked the OM RPC endpoint with ozone insight, and no RPC messages had arrived.

Based on the jstack output, we have a deadlock between PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync.

I am not sure (yet) what the exact problem is, but based on the stack traces it seems to be Ratis related.
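
To make the inversion concrete before the raw traces: one path takes the SlidingWindow$Client monitor and then needs PeerProxyMap's per-peer lock to send, while the failure-handling path takes the per-peer lock and then needs the SlidingWindow$Client monitor to reset it. A minimal, hypothetical sketch of that shape (all names below are illustrative stand-ins, not actual Ratis code); running it almost always deadlocks, and jstack then reports a Java-level deadlock like the excerpt below:

{code}
// Hypothetical two-lock inversion mirroring the shape suggested by the traces;
// "proxyLock" stands in for PeerProxyMap's per-peer lock and "windowLock" for
// the SlidingWindow$Client monitor. Not Ratis code.
public class LockInversionDemo {
  private final Object proxyLock = new Object();
  private final Object windowLock = new Object();

  // Send path: window monitor first, then the proxy lock
  // (compare SlidingWindow -> GrpcClientRpc -> PeerProxyMap.getProxy).
  void sendRequest() {
    synchronized (windowLock) {
      pause();
      synchronized (proxyLock) {
        // build and send the request
      }
    }
  }

  // Failure path: proxy lock first, then the window monitor
  // (compare PeerProxyMap.handleException -> OrderedAsync.resetSlidingWindow
  //  -> SlidingWindow$Client.resetFirstSeqNum).
  void handleException() {
    synchronized (proxyLock) {
      pause();
      synchronized (windowLock) {
        // reset the sliding window
      }
    }
  }

  private static void pause() {
    try {
      Thread.sleep(100); // widen the race window so the deadlock is reliable
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    LockInversionDemo d = new LockInversionDemo();
    new Thread(d::sendRequest, "send-path").start();
    new Thread(d::handleException, "failure-path").start();
    // Both threads now block forever; "jstack <pid>" prints
    // "Found one Java-level deadlock" naming the two monitors.
  }
}
{code}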

{code}
Found one Java-level deadlock:
==============================
"pool-2-thread-6":
  waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a java.lang.Object),
  which is held by "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]"
"java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]":
  waiting to lock monitor 0x01170980 (object 0x00033eb99b10, a org.apache.ratis.util.SlidingWindow$Client),
  which is held by "java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty queue]"
"java.util.concurrent.ThreadPoolExecutor$Worker@df368f8[State = -1, empty queue]":
  waiting to lock monitor 0x7f80356c8800 (object 0x00033eb70a00, a java.lang.Object),
  which is held by "java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]"

Java stack information for the threads listed above:
====================================================
"pool-2-thread-6":
    at org.apache.ratis.util.PeerProxyMap.getProxy(PeerProxyMap.java:103)
    - waiting to lock <0x00033eb70a00> (a java.lang.Object)
    at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequestAsyncUnordered(GrpcClientRpc.java:78)
    at org.apache.ratis.client.impl.UnorderedAsync.sendRequestWithRetry(UnorderedAsync.java:75)
    at org.apache.ratis.client.impl.UnorderedAsync.send(UnorderedAsync.java:59)
    at org.apache.ratis.client.impl.RaftClientImpl.sendWatchAsync(RaftClientImpl.java:139)
    at org.apache.hadoop.hdds.scm.XceiverClientRatis.watchForCommit(XceiverClientRatis.java:282)
    at org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchForCommit(CommitWatcher.java:198)
    at org.apache.hadoop.hdds.scm.storage.CommitWatcher.watchOnLastIndex(CommitWatcher.java:161)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.watchForCommit(BlockOutputStream.java:346)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.handleFlush(BlockOutputStream.java:482)
    at org.apache.hadoop.hdds.scm.storage.BlockOutputStream.close(BlockOutputStream.java:496)
    at org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.close(BlockOutputStreamEntry.java:143)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.handleFlushOrClose(KeyOutputStream.java:435)
    at org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:473)
    at org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
    - locked <0x0003f2ba4240> (a org.apache.hadoop.ozone.client.io.OzoneOutputStream)
    at org.apache.hadoop.ozone.freon.RandomKeyGenerator.createKey(RandomKeyGenerator.java:710)
    at org.apache.hadoop.ozone.freon.RandomKeyGenerator.access$1100(RandomKeyGenerator.java:88)
    at org.apache.hadoop.ozone.freon.RandomKeyGenerator$ObjectCreator.run(RandomKeyGenerator.java:615)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.3/Executors.java:515)
    at java.util.concurrent.FutureTask.run(java.base@11.0.3/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.3/ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.3/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(java.base@11.0.3/Thread.java:834)
"java.util.concurrent.ThreadPoolExecutor$Worker@77329f41[State = -1, empty queue]":
    at org.apache.ratis.util.SlidingWindow$Client.resetFirstSeqNum(SlidingWindow.java:348)
    - waiting to lock <0x00033eb99b10> (a org.apache.ratis.util.SlidingWindow$Client)
    at org.apache.ratis.client.impl.OrderedAsync.resetSlidingWindow(OrderedAsync.java:121)
    at org.apache.ratis.client.impl.OrderedAsync$$Lambda$496/0x000840498440.accept(Unknown Source)
    at org.apache.ratis.client.impl.RaftClientImpl.lambda$handleIOException$6(RaftClientImpl.java:349)
    at org.apache.ratis.client.impl.RaftClientImpl$$Lambda$427/0x0008402e1840.accept(Unknown Source)
    at java.util.Optional.ifPresent(java.base@11.0.3/Optional.java:183)
    at org.apache.ratis.client.impl.RaftClientImpl.handleIOException(R
{code}
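
The content of the attached patches is not restated here, but the usual ways to break such a cycle are either to impose one global acquisition order on the two monitors or to stop nesting them at all, e.g. by capturing the reset work while holding the proxy lock and running it only after release. A sketch of the latter idea, with illustrative names only (not necessarily what r706_20191011b.patch does):

{code}
// Illustrative mitigation, not the actual patch: do the proxy bookkeeping
// under proxyLock, but run the sliding-window reset after releasing it,
// so no thread ever holds both monitors at once.
public class NoNestedLocks {
  private final Object proxyLock = new Object();
  private final Object windowLock = new Object();

  void handleException() {
    Runnable reset;
    synchronized (proxyLock) {
      // proxy-only bookkeeping here; just capture the follow-up work
      reset = this::resetWindow;
    }
    reset.run(); // lock-free call: takes windowLock alone, so no cycle can form
  }

  private void resetWindow() {
    synchronized (windowLock) {
      // reset sliding-window state
    }
  }
}
{code}

With no thread ever holding both monitors, the wait-for graph in the traces above cannot contain a cycle.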

[jira] [Updated] (RATIS-706) Dead lock in GrpcClientRpc

2019-10-10 Thread Marton Elek (Jira)


 [ https://issues.apache.org/jira/browse/RATIS-706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marton Elek updated RATIS-706:
------------------------------
Attachment: jstack.txt

> Dead lock in GrpcClientRpc
> --------------------------
>
>                 Key: RATIS-706
>                 URL: https://issues.apache.org/jira/browse/RATIS-706
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>            Reporter: Marton Elek
>            Priority: Major
>         Attachments: jstack.txt
>
>
> I started an Ozone cluster on Kubernetes and ran a freon test (ozone freon ockg -n1).
> After a while I found that the single freon instance was no longer creating keys. I checked the OM RPC endpoint with ozone insight, and no RPC messages had arrived.
> Based on the jstack output, we have a deadlock between PeerProxyMap.handleException and GrpcClientRpc.sendRequestAsync.
> I am not sure (yet) what the exact problem is, but based on the stack traces it seems to be Ratis related.
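
As a diagnosis aside: what the attached jstack.txt captures can also be checked from inside the JVM with the standard java.lang.management API; a minimal sketch using only JDK calls:

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
  public static void main(String[] args) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    // Ids of threads deadlocked on monitors or ownable synchronizers, or null.
    long[] ids = threads.findDeadlockedThreads();
    if (ids == null) {
      System.out.println("No deadlock detected.");
      return;
    }
    // true, true => include locked monitors and synchronizers, as jstack does.
    for (ThreadInfo info : threads.getThreadInfo(ids, true, true)) {
      System.out.print(info);
    }
  }
}
{code}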



--
This message was sent by Atlassian Jira
(v8.3.4#803005)