[ https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512135#comment-16512135 ]
Yongjun Zhang commented on HADOOP-15538: ---------------------------------------- All other 6 BLOCKED threads (there are 8 BLOCKED threads in total in each jstack frame) look like {code:java} "Thread-52" #82 prio=5 os_prio=0 tid=0x000000001a2c1000 nid=0xf189f waiting for monitor entry [0x00007f697a77f000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1053) - waiting to lock <0x00000006215c1e08> (a java.lang.Object) at org.apache.hadoop.ipc.Client.call(Client.java:1483) at org.apache.hadoop.ipc.Client.call(Client.java:1441) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:266) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1323) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1310) at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1298) at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:309) at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:275) - locked <0x0000000618947838> (a java.lang.Object) at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:267) at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1629) at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:338) at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:334) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334) {code} These threads all blocked due to RPC serialization at {code:java} public void sendRpcRequest(final Call call) throws InterruptedException, IOException { if (shouldCloseConnection.get()) { return; } // Serialize the call to be sent. This is done from the actual // caller thread, rather than the sendParamsExecutor thread, // so that if the serialization throws an error, it is reported // properly. This also parallelizes the serialization. // // Format of a call on the wire: // 0) Length of rest below (1 + 2) // 1) RpcRequestHeader - is serialized Delimited hence contains length // 2) RpcRequest // // Items '1' and '2' are prepared here. final DataOutputBuffer d = new DataOutputBuffer(); RpcRequestHeaderProto header = ProtoUtil.makeRpcRequestHeader( call.rpcKind, OperationProto.RPC_FINAL_PACKET, call.id, call.retry, clientId); header.writeDelimitedTo(d); call.rpcRequest.write(d); synchronized (sendRpcRequestLock) { <===================Client.java, line 1053 Future<?> senderFuture = sendParamsExecutor.submit(new Runnable() { {code} These 6 threads are all blocked due to the single RPC thread holding the sendRpcRequestLock ("IPC Parameter Sending Thread #294" in the jira description), which is blocked due to the reported deadlock. So what is the "UNKNOWN_owner_addr=0x00007f68332e2800" thread that caused the "deadlock" is hidden. > Possible deadlock in Client > --------------------------- > > Key: HADOOP-15538 > URL: https://issues.apache.org/jira/browse/HADOOP-15538 > Project: Hadoop Common > Issue Type: Bug > Components: common > Reporter: Yongjun Zhang > Priority: Major > > We have a jstack collection that spans 13 minutes. Once frame per ~1.5 > minutes. And for each of the frame, I observed the following: > {code} > Found one Java-level deadlock: > ============================= > "IPC Parameter Sending Thread #294": > waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a > java.lang.Object), > which is held by UNKNOWN_owner_addr=0x00007f68332e2800 > Java stack information for the threads listed above: > =================================================== > "IPC Parameter Sending Thread #294": > at > sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268) > - waiting to lock <0x0000000621745390> (a java.lang.Object) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461) > - locked <0x0000000621745380> (a java.lang.Object) > at > org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159) > at > org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117) > at > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) > at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) > - locked <0x0000000621749850> (a java.io.BufferedOutputStream) > at java.io.DataOutputStream.flush(DataOutputStream.java:123) > at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072) > - locked <0x000000062174b878> (a java.io.DataOutputStream) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Found one Java-level deadlock: > ============================= > "IPC Client (297602875) connection to x.y.z.p:8020 from impala": > waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a > java.lang.Object), > which is held by UNKNOWN_owner_addr=0x00007f68332e2800 > Java stack information for the threads listed above: > =================================================== > "IPC Client (297602875) connection to x.y.z.p:8020 from impala": > at > sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279) > - waiting to lock <0x0000000621745390> (a java.lang.Object) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390) > - locked <0x0000000621745370> (a java.lang.Object) > at > org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at > org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read(BufferedInputStream.java:265) > - locked <0x00000006217476f0> (a java.io.BufferedInputStream) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006) > Found 2 deadlocks. > {code} > This happens with jdk1.8.0_162 on 2.6.32-696.18.7.el6.x86_64. > The code appears to match > https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/tree/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java. > The first thread is blocked at: > https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=268 > The second thread is blocked at: > https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=279 > There are two issues here: > # There seems to be a real deadlock because the stacks remain the same even > if the first an last jstack frames captured is 13 minutes apart. > # Java deadlock report seems to be problematic, two threads that have > deadlock should not be blocked on the same lock, but they appear to be in > this case: the same SocketChannelImpl's stateLock. > I found a relevant jdk jira https://bugs.openjdk.java.net/browse/JDK-8007476, > it explains where two deadlocks are reported and they are really for the same > deadlock. > I don't see a similar report about this issue in jdk jira database, and I'm > thinking about filing a jdk jira for that, but would like to throw some > discussion here before that. > Issue#1 is important, because the client is hanging, which indicate a real > problem; however, without a correct report before issue#2 is fixed, it's not > clear how the deadlock really looks like. > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org