[ 
https://issues.apache.org/jira/browse/FLINK-12547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848633#comment-16848633
 ] 

Haibo Sun edited comment on FLINK-12547 at 5/27/19 6:08 AM:
------------------------------------------------------------

[~QiLuo],

Because the blob client has a retry mechanism, "The TM hangs for over an hour 
(longer than the SO_TIMEOUT)" is possible, but that does not mean SO_TIMEOUT 
is not working: each attempt can block for up to the timeout before the next 
retry starts. In addition, other causes of the hang cannot be ruled out.
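
As a rough illustration of why retries can push the total hang well past a 
single SO_TIMEOUT, here is a minimal sketch. The class name, the method name 
and the retry count of 2 are hypothetical, not taken from Flink; it also 
assumes each attempt stalls on one blocking read, since SO_TIMEOUT bounds a 
single read() rather than a whole download.

```java
import java.time.Duration;

public class BlobRetryBudget {

    // Rough worst case: the first attempt plus every retry each block
    // for the full SO_TIMEOUT before giving up.
    static Duration worstCaseBlocking(Duration soTimeout, int retries) {
        return soTimeout.multipliedBy(retries + 1L);
    }

    public static void main(String[] args) {
        Duration soTimeout = Duration.ofMinutes(30); // value discussed in this thread
        int retries = 2;                             // hypothetical retry count
        System.out.println(
                worstCaseBlocking(soTimeout, retries).toMinutes() + " minutes");
    }
}
```

Under these assumptions the thread could stay blocked for 90 minutes even 
though every individual read timed out as configured, which is why a smaller 
SO_TIMEOUT shortens the overall hang.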

 

`30 minutes` is too long; I suggest setting SO_TIMEOUT to a smaller value.


was (Author: sunhaibotb):
[~QiLuo], because the blob client has a retry mechanism, I understand that "The 
TM hangs for over an hour (longer than the SO_TIMEOUT)" is possible, but it 
does not mean that SO_TIMEOUT does not work. In addition, `30 minutes` is too 
long, and I think you should set SO_TIMEOUT to a smaller value.

> Deadlock when the task thread downloads jars using BlobClient
> -------------------------------------------------------------
>
>                 Key: FLINK-12547
>                 URL: https://issues.apache.org/jira/browse/FLINK-12547
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>            Reporter: Haibo Sun
>            Assignee: Haibo Sun
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The jstack is as follows (this jstack is from an old Flink version, but the 
> master branch has the same problem).
> {code:java}
> "Source: Custom Source (76/400)" #68 prio=5 os_prio=0 tid=0x00007f8139cd3000 nid=0xe2 runnable [0x00007f80da5fd000]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
> at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
> at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:164)
> at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
> at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
> at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
> - locked <0x000000062cf2a188> (a java.lang.Object)
> at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:968)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:604)
> at java.lang.Thread.run(Thread.java:834)
> Locked ownable synchronizers:
> - None
> {code}
>  
> The reason is that SO_TIMEOUT is not set on the socket connection of the blob 
> client. When network packet loss is severe because of high CPU load on the 
> machine, the blob client connection fails to detect that the server has 
> disconnected, so the read blocks indefinitely in the native method. 
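
To illustrate the effect of SO_TIMEOUT described above, the following is a 
minimal, self-contained sketch using plain java.net sockets. It is not Flink's 
BlobClient code or the actual fix; the class and method names are hypothetical. 
A local server accepts the connection but never sends data, which stands in 
for a peer that has silently gone away.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {

    /** Returns true if the read timed out instead of blocking indefinitely. */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        // The server socket completes the TCP handshake through its backlog
        // but never writes anything, so the client read has nothing to return.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(
                     silentServer.getInetAddress(), silentServer.getLocalPort())) {

            // Without this call the timeout is 0, meaning read() may block forever.
            client.setSoTimeout(soTimeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived; not expected here
            } catch (SocketTimeoutException e) {
                return true;  // the blocking read was bounded by SO_TIMEOUT
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With the timeout set, the stalled read surfaces as a SocketTimeoutException 
that the caller can handle or retry, instead of leaving the thread stuck in 
socketRead0 as in the jstack above.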



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
