[jira] [Comment Edited] (HADOOP-15538) Possible deadlock in Client

2018-06-14 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513183#comment-16513183
 ] 

Yongjun Zhang edited comment on HADOOP-15538 at 6/15/18 12:45 AM:
--

Thanks a lot [~daryn] and [~billyean].

Please see the two attached jstack files: t1.jstack is from the beginning, and 
t1+13min.jstack is from 13 minutes later.

My read is that JDK-8007476 fixed a JDK-internal error that crashed the jstack 
dump itself: it introduced "which is held by UNKNOWN_owner_addr=" so that jstack 
prints a reasonable value for the other party holding the lock, and can dump the 
pair of deadlocked threads instead of crashing. In the example provided there, 
the two threads point to each other because they are the two threads involved in 
the same deadlock.

I think the Java version here has the JDK-8007476 fix, which is why we also see 
"which is held by UNKNOWN_owner_addr=" reported. However, the other party of the 
deadlock is not reported here.

Problems here:
 # Did the deadlock finder miss some threads involved in the deadlock?
 # If it did not, why are the two listed threads blocked on the same lock 
(stateLock), and why do they point to the same culprit address reported in 
"which is held by UNKNOWN_owner_addr="?
 # Who holds the stateLock? (One way to check this at runtime is sketched below.)
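
A minimal sketch of that runtime check via the java.lang.management API. The 
class name and output format are mine, not anything from this issue, and it 
would have to run inside the affected JVM (e.g. via an attached agent), but it 
asks the JVM directly who owns the monitor each deadlocked thread is blocked on, 
independent of jstack's deadlock report:
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical probe: report, for each deadlocked thread, the lock it is
// blocked on, the owner of that lock, and the monitors it itself holds.
public class DeadlockProbe {
  public static void main(String[] args) {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    long[] ids = mx.findDeadlockedThreads();   // null when no deadlock is found
    if (ids == null) {
      System.out.println("ThreadMXBean sees no deadlock");
      return;
    }
    // true, true => also report locked monitors and ownable synchronizers
    for (ThreadInfo ti : mx.getThreadInfo(ids, true, true)) {
      if (ti == null) continue;                // thread exited in the meantime
      System.out.printf("\"%s\" blocked on %s held by \"%s\" (tid=%d)%n",
          ti.getThreadName(),
          ti.getLockName(),                    // e.g. java.lang.Object@...
          ti.getLockOwnerName(),               // null if owner is not a Java thread
          ti.getLockOwnerId());
      for (MonitorInfo m : ti.getLockedMonitors()) {
        System.out.println("  holds " + m);
      }
    }
  }
}
{code}
Notably, ThreadInfo.getLockOwnerName() coming back null would itself be a useful 
signal: it would mean even the management API cannot map the owner to a live 
Java thread, suggesting something like a JNI-attached or exiting thread rather 
than a thread the deadlock finder merely skipped.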

Haibo's suggestion of getting a heap dump seems a good direction to dive into. 
Unfortunately the problem is intermittent and it takes time to see it again, but 
I will look into it. One way to capture the dump when it does recur is sketched 
below.
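
Since the window is hard to catch by hand, one option (my sketch, not something 
already discussed in this issue; the class and file names are made up) is to 
have a watchdog in the affected process trigger the dump itself through the 
standard HotSpot diagnostic MXBean:
{code:java}
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Hypothetical watchdog action: capture a heap dump from inside the JVM.
public class HeapDumper {
  public static void dump() throws IOException {
    HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
        ManagementFactory.getPlatformMBeanServer(),
        "com.sun.management:type=HotSpotDiagnostic",
        HotSpotDiagnosticMXBean.class);
    // live=true dumps only reachable objects, keeping the .hprof much smaller
    bean.dumpHeap("client-deadlock.hprof", true);
  }
}
{code}
The same dump can of course be taken externally with 
jmap -dump:live,format=b,file=client-deadlock.hprof <pid> once the hang is 
observed.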

If any of you have more comments or insights to share, I would really 
appreciate it.

Thanks.


> Possible deadlock in Client
> ---
>
> Key: HADOOP-15538
> URL: https://issues.apache.org/jira/browse/HADOOP-15538
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: common
>Reporter: Yongjun Zhang
>Priority: Major
> Attachments: t1+13min.jstack, t1.jstack
>
>
> We have a jstack collection that spans 13 minutes, one frame per ~1.5
> minutes. In each frame, I observed the following:
> {code}
> Found one Java-level deadlock:
> =============================
> "IPC Parameter Sending Thread #294":
>   waiting to lock monitor 0x7f68f21f3188 (object 0x000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x7f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Parameter Sending Thread #294":
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
>         - waiting to lock <0x000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>         - locked <0x000621745380> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         - locked <0x000621749850> (a java.io.BufferedOutputStream)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072)
>         - locked <0x00062174b878> (a java.io.DataOutputStream)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Found one Java-level deadlock:
> =============================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>   waiting to lock monitor 0x7f68f21f3188 (object 0x000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x7f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>         at sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279)
>         - waiting to lock <0x000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390)
>         - locked <0x000621745370> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x0006217476f0> (a java.io.BufferedInputStream)

[jira] [Comment Edited] (HADOOP-15538) Possible deadlock in Client

2018-06-14 Thread Daryn Sharp (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512923#comment-16512923
 ] 

Daryn Sharp edited comment on HADOOP-15538 at 6/14/18 8:02 PM:
---

The JDK-8007476 bug appears to have simply obscured the thread names in the 
deadlock output. Notice that the stacktraces still include the 
"- waiting to lock <addr>" and "- locked <addr>" lines, which makes it easy to 
see the deadlock offenders despite the "UNKNOWN_owner_addr".
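
That cross-referencing can even be mechanized. A rough sketch (the class name 
and file-format assumptions are mine): index every "- locked <addr>" line by the 
thread whose stack it appears in, then resolve each "- waiting to lock <addr>" 
against that index:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical jstack cross-referencer: who holds the lock each thread waits on?
public class LockXref {
  private static final Pattern THREAD = Pattern.compile("^\"([^\"]+)\"");
  private static final Pattern LOCKED = Pattern.compile("- locked <(0x[0-9a-f]+)>");
  private static final Pattern WAITING = Pattern.compile("- waiting to lock <(0x[0-9a-f]+)>");

  public static void main(String[] args) throws IOException {
    Map<String, String> lockOwner = new HashMap<>();  // lock addr -> owning thread
    Map<String, String> waiter = new HashMap<>();     // waiting thread -> lock addr
    String current = null;
    for (String line : Files.readAllLines(Paths.get(args[0]))) {
      Matcher t = THREAD.matcher(line);
      if (t.find()) { current = t.group(1); continue; }
      Matcher l = LOCKED.matcher(line);
      if (l.find()) { lockOwner.put(l.group(1), current); continue; }
      Matcher w = WAITING.matcher(line);
      if (w.find()) { waiter.put(current, w.group(1)); }
    }
    waiter.forEach((thread, addr) -> System.out.printf(
        "\"%s\" waits on %s held by \"%s\"%n", thread, addr, lockOwner.get(addr)));
  }
}
{code}
If <0x000621745390> never appears in any "- locked" line anywhere in the dump, 
that would suggest the holder is not among the dumped Java threads at all, which 
fits the UNKNOWN_owner_addr symptom.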

Do you have the full stack trace that you can link as an attachment?


[jira] [Comment Edited] (HADOOP-15538) Possible deadlock in Client

2018-06-14 Thread Yongjun Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512076#comment-16512076
 ] 

Yongjun Zhang edited comment on HADOOP-15538 at 6/14/18 7:39 AM:
-

Hm, I noticed that in [https://bugs.openjdk.java.net/browse/JDK-8007476], the 
"which is held by UNKNOWN_owner_addr" line shows two different values for the 
two threads involved in the deadlock, one corresponding to each of the two 
threads.
{code:java}
Found one Java-level deadlock:
=============================
"Worker-1":
  waiting to lock monitor 0x02f60e54 (object 0x1026ce00, a java.lang.Object),
  which is held by UNKNOWN_owner_addr=0x0352fc54
..
Found one Java-level deadlock:
=============================
"Worker-0":
  waiting to lock monitor 0x02f601bc (object 0x1026ce08, a java.lang.Object),
  which is held by UNKNOWN_owner_addr=0x0357fbd4
{code}
However, in the case I reported here, they point to the same value:
{code:java}
Found one Java-level deadlock:
=============================
"IPC Parameter Sending Thread #294":
  waiting to lock monitor 0x7f68f21f3188 (object 0x000621745390, a java.lang.Object),
  which is held by UNKNOWN_owner_addr=0x7f68332e2800
..
Found one Java-level deadlock:
=============================
"IPC Client (297602875) connection to x.y.z.p:8020 from impala":
  waiting to lock monitor 0x7f68f21f3188 (object 0x000621745390, a java.lang.Object),
  which is held by UNKNOWN_owner_addr=0x7f68332e2800
{code}
and this value could just be a third thread, giving two deadlock pairs, each 
consisting of one of the two listed threads plus that same unknown third thread. 
This explains issue #2 above. However, I wish the Java stack dump could be 
clearer about why the owner is reported as UNKNOWN_owner_addr here and what 
thread it really is.
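
To make that check mechanical, here is a rough sketch (class name and parsing 
assumptions are mine) that groups the blocked threads in a jstack file by the 
reported UNKNOWN_owner_addr value, so a shared third-party owner stands out 
immediately:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical check: do separate deadlock reports share one UNKNOWN owner?
public class UnknownOwnerGrouper {
  private static final Pattern BLOCKED = Pattern.compile("^\"([^\"]+)\":");
  private static final Pattern OWNER =
      Pattern.compile("held by UNKNOWN_owner_addr=(0x[0-9a-f]+)");

  public static void main(String[] args) throws IOException {
    Map<String, List<String>> byOwner = new LinkedHashMap<>();
    String lastThread = null;
    for (String line : Files.readAllLines(Paths.get(args[0]))) {
      Matcher t = BLOCKED.matcher(line.trim());
      if (t.find()) { lastThread = t.group(1); continue; }
      Matcher o = OWNER.matcher(line);
      if (o.find()) {
        byOwner.computeIfAbsent(o.group(1), k -> new ArrayList<>()).add(lastThread);
      }
    }
    byOwner.forEach((addr, threads) -> System.out.println(
        addr + " blocks " + threads + (threads.size() > 1 ? "  <-- shared owner" : "")));
  }
}
{code}
On the dumps quoted here, it should report a single address, 0x7f68332e2800, 
blocking both threads, i.e. one unknown owner shared across the two reported 
"pairs".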

