Hi all! Today we started noticing that deploying our jobs takes over 3 minutes when done from some machines, while it takes the normal few seconds from the others.
Looking at the logs, it seems that in this case the client can't find the job ID for a few minutes:

...
2017-11-21 15:23:00,880 DEBUG org.apache.flink.yarn.YarnJobManager - Job with ID 179d67bfab7c4c0b9f00ea772f6e4f0c not found in JobManager
2017-11-21 15:23:04,528 DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x25eb8e005b7971b after 0ms
2017-11-21 15:23:04,636 DEBUG org.apache.hadoop.ipc.Client - IPC Client (937277082) connection to splat13.sto.midasplayer.com/172.26.87.155:8030 from splat sending #38
2017-11-21 15:23:04,636 DEBUG org.apache.hadoop.ipc.Client - IPC Client (937277082) connection to splat13.sto.midasplayer.com/172.26.87.155:8030 from splat got value #38
2017-11-21 15:23:04,651 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: allocate took 16ms
2017-11-21 15:23:05,880 DEBUG org.apache.flink.yarn.YarnJobManager - Job with ID 179d67bfab7c4c0b9f00ea772f6e4f0c not found in JobManager
2017-11-21 15:23:06,409 DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp:// fl...@splat33.sto.midasplayer.com:56045]
2017-11-21 15:23:06,413 DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp:// fl...@splat33.sto.midasplayer.com:56045]
2017-11-21 15:23:07,665 DEBUG akka.serialization.Serialization(akka://flink) - Using serializer[akka.serialization.JavaSerializer] for message [org.apache.flink.runtime.clusterframework.messages.GetClusterStatusResponse]
2017-11-21 15:23:07,824 INFO org.apache.flink.yarn.YarnJobManager - Submitting job 179d67bfab7c4c0b9f00ea772f6e4f0c (event-bifrost-log).
2017

Interestingly enough, nothing like this shows up when deploying from the other servers. We suspect there might be some strange network issue (which doesn't seem to affect jar upload times) that interferes with Akka in some way.

Any idea how to debug this?

Thank you!
Gyula