Haibo Sun created FLINK-12547:
---------------------------------

             Summary: Deadlock when the task thread downloads jars using BlobClient
                 Key: FLINK-12547
                 URL: https://issues.apache.org/jira/browse/FLINK-12547
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Operators
    Affects Versions: 1.8.0
            Reporter: Haibo Sun
            Assignee: Haibo Sun


The jstack is as follows (this jstack is from an old Flink version, but the 
master branch has the same problem).
{code:java}
"Source: Custom Source (76/400)" #68 prio=5 os_prio=0 tid=0x00007f8139cd3000 nid=0xe2 runnable [0x00007f80da5fd000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:164)
at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
- locked <0x000000062cf2a188> (a java.lang.Object)
at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:968)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:604)
at java.lang.Thread.run(Thread.java:834)

Locked ownable synchronizers:
- None
{code}
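The cause is that SO_TIMEOUT is not set on the blob client's socket 
connection. When packet loss becomes severe because of high CPU load on the 
machine, the blob client cannot detect that the connection to the server has 
been lost, so the read blocks indefinitely in the native method (socketRead0 
in the trace above). Because the thread is still holding the lock taken in 
BlobLibraryCacheManager.registerTask, any other thread that needs that lock 
is blocked as well.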
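
To illustrate the effect of SO_TIMEOUT, here is a minimal, self-contained 
sketch. It is not Flink code and not necessarily how the fix will be 
implemented; the class SoTimeoutSketch, the method readTimesOut, and the 
silent local server are hypothetical stand-ins for the blob client and an 
unresponsive blob server. Only standard java.net calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Flink.
class SoTimeoutSketch {

    /**
     * Connects to a local server that never sends any data and reports
     * whether the blocking read was aborted by SO_TIMEOUT.
     */
    static boolean readTimesOut(int soTimeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 lets the OS pick a free port. The server socket never
        // accepts or writes, which mimics a peer that has gone silent.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            client.connect(
                new InetSocketAddress(loopback, silentServer.getLocalPort()), 1000);

            // The default SO_TIMEOUT is 0, meaning read() may block forever.
            // A positive value makes read() throw SocketTimeoutException.
            client.setSoTimeout(soTimeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived, no timeout
            } catch (SocketTimeoutException e) {
                return true;  // the read gave up instead of blocking forever
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

With a timeout in place, the thread gets an exception it can handle (for 
example by retrying or failing the task) and can release the lock, instead of 
staying in socketRead0 indefinitely.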

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
