Hello,
This issue occurred again, and this time we took a thread dump of the TM. The
task thread was indeed hung on a socket read while downloading the jar from the
Blob server:
"DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (1999/2000)" #72 prio=5 os_prio=0 tid=0x00007fb9a1521000 nid=0xa0994 runnable [0x00007fb97cfbf000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
        at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
        at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:170)
        at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
        at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
        at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
        - locked <0x000000078ab60ba8> (a java.lang.Object)
        at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:893)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:585)
        at java.lang.Thread.run(Thread.java:748)
I checked the latest master code, and there is still no socket timeout in the
Blob client. Should I create an issue to add one?
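
For reference, here is a minimal sketch of the kind of timeout I mean. It is
not BlobClient code: the class name, method names, and timeout values are made
up for illustration, and it only uses standard java.net calls
(Socket.connect with a timeout and Socket.setSoTimeout). It connects to a local
server that never sends anything, which roughly imitates the stalled download.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical sketch, not Flink's BlobClient. */
public class BlobSocketTimeoutSketch {

    /**
     * Connects with a connect timeout and reads one byte with a read timeout.
     * Throws SocketTimeoutException if either limit is exceeded.
     */
    static int readFirstByte(InetSocketAddress address, int connectTimeoutMs, int soTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Bound the time spent establishing the connection.
            socket.connect(address, connectTimeoutMs);
            // Bound each blocking read(); with the default of 0, read() can block forever.
            socket.setSoTimeout(soTimeoutMs);
            InputStream in = socket.getInputStream();
            return in.read();
        }
    }

    /** Starts a local server that never sends data and reports whether the read timed out. */
    static boolean readTimesOutAgainstSilentServer(int soTimeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 50, loopback)) {
            InetSocketAddress address =
                    new InetSocketAddress(loopback, silentServer.getLocalPort());
            try {
                readFirstByte(address, 1000, soTimeoutMs);
                return false; // data or EOF arrived, so no timeout happened
            } catch (SocketTimeoutException e) {
                return true; // the read gave up instead of hanging
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOutAgainstSilentServer(200));
    }
}
```

With a read timeout set, a stalled download would presumably fail with a
SocketTimeoutException instead of blocking in socketRead0, so the task could
fail and be retried rather than staying in DEPLOYING.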
Regards,
Qi
> On Apr 19, 2019, at 7:49 PM, qi luo <[email protected]> wrote:
>
> Hi all,
>
> We run Flink 1.5 batch jobs and start thousands of them per day. Occasionally
> we observe stuck jobs, caused by some TMs hanging in the “DEPLOYING” state.
>
> The TM log shows that it is stuck downloading jars in BlobClient:
>
> ————
> ...
> INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000).
> INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) switched from CREATED to DEPLOYING.
> INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING]
> INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task DataSource (at createInput(ExecutionEnvironment.java:548) (our.code)) (184/2000) [DEPLOYING].
> INFO org.apache.flink.runtime.blob.BlobClient - Downloading 19e65c0caa41f264f9ffe4ca2a48a434/p-3ecd6341bf97d5512b14c93f6c9f51f682b6db26-37d5e69d156ee00a924c1ebff0c0d280 from some-host-ip-port
>
> no more logs...
> ————
>
> It seems that the TM is calling BlobClient to download jars from the
> JM/BlobServer. Under the hood it calls Socket.connect() and then
> Socket.read() to retrieve the results.
>
> Should we add a timeout to the socket operations in BlobClient to resolve
> this issue?
>
> Thanks,
> Qi