[ https://issues.apache.org/jira/browse/FLINK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836032#comment-16836032 ]
Abdul Qadeer commented on FLINK-12437:
--------------------------------------

[~rmetzger] I don't expect a fix for this to be made available in 1.4.0. I would like to know whether this is a known issue that has been fixed in newer versions.

> Taskmanager doesn't initiate registration after jobmanager marks it terminated
> ------------------------------------------------------------------------------
>
> Key: FLINK-12437
> URL: https://issues.apache.org/jira/browse/FLINK-12437
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Abdul Qadeer
> Priority: Major
>
> This issue was observed in Standalone cluster deployment mode with ZooKeeper
> HA enabled in Flink 1.4.0. A few taskmanagers restarted after running out of
> Metaspace.
> The offending taskmanager `pipelineruntime-taskmgr-6789dd578b-dcp4r` first
> successfully registers with the jobmanager, and the remote watcher marks it
> terminated soon after, as seen in the logs. Other taskmanagers were
> terminated around the same time, but they were quarantined by the jobmanager
> with a message similar to:
> {noformat}
> Association to [akka.tcp://flink@10.60.5.121:8070] having UID [864976677] is
> irrecoverably failed. UID is now quarantined and all messages to this UID
> will be delivered to dead letters. Remote actorsystem must be restarted to
> recover from this situation.
> {noformat}
> They came back up and successfully registered with the jobmanager. This
> didn't happen for the offending taskmanager:
>
> At JobManager:
> {noformat}
> {"timeMillis":1557073368155,"thread":"flink-akka.actor.default-dispatcher-49","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Registered
> TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r
> (akka.tcp://flink@10.60.5.85:8070/user/taskmanager) as
> ae61ac607f0ab35ab5066f7dc221e654. Current number of registered hosts is 8.
> Current number of alive task slots is > 51.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":125,"threadPriority":5} > ... > ... > {"timeMillis":1557073391386,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered > task manager /10.60.5.85. Number of registered task managers 7. Number of > available slots > 45.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5} > ... > ... > {"timeMillis":1557073391483,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered > task manager /10.60.5.85. Number of registered task managers 6. Number of > available slots > 39.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5} > ... > ... > {"timeMillis":1557073370389,"thread":"flink-akka.actor.default-dispatcher-35","level":"INFO","loggerName":"akka.actor.LocalActorRef","message":"Message > [akka.remote.ReliableDeliverySupervisor$Ungate$] from > Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260] > to > Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260] > was not delivered. [22] dead letters encountered. 
This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":98,"threadPriority":5} > {noformat} > At TaskManager: > {noformat} > {"timeMillis":1557073366068,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting > > TaskManager","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073366073,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting > TaskManager actor system at > 10.60.5.85:8070.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073366077,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying > to start actor system at > 10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073366510,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.event.slf4j.Slf4jLogger","message":"Slf4jLogger > > started","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5} > {"timeMillis":1557073366694,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Starting > > remoting","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5} > {"timeMillis":1557073367049,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting > started; listening on addresses > 
:[akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5} > {"timeMillis":1557073367051,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting > now listens on addresses: > [akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5} > {"timeMillis":1557073367089,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Actor > system started at > akka.tcp://flink@10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367138,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Configuring > FlinkMetricsReporter with > {class=com.pipeline.processor.flink.metrics.FlinkMetricsReporter}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"com.pipeline.processor.flink.metrics.FlinkMetricsReporter","message":"Metrics > Reporter > Open","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Reporting > metrics of type > com.pipeline.processor.flink.metrics.FlinkMetricsReporter.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367142,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting > TaskManager > 
actor","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367176,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyConfig","message":"NettyConfig > [server address: /10.60.5.85, server port: 0, ssl enabled: false, memory > segment size (bytes): 32768, transport type: NIO, number of server threads: 3 > (manual), number of client threads: 3 (manual), server connect backlog: 0 > (use Netty's default), client connect timeout (sec): 120, send/receive buffer > size (bytes): 0 (use Netty's > default)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367187,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration","message":"Messages > have a max timeout of 100000 > ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367198,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Temporary > file directory '/tmp': total 373 GB, usable 295 GB (79.09% > usable)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367608,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.buffer.NetworkBufferPool","message":"Allocated > 639 MB for network buffer pool (number of memory segments: 20467, bytes per > segment: > 32768).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367710,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could > not load Queryable State Client Proxy. 
Probable reason: > flink-queryable-state-runtime is not in the classpath. Please put the > corresponding jar from the opt to the lib > folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367711,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could > not load Queryable State Server. Probable reason: > flink-queryable-state-runtime is not in the classpath. Please put the > corresponding jar from the opt to the lib > folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367712,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.NetworkEnvironment","message":"Starting > the network environment and its > components.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367753,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyClient","message":"Successful > initialization (took 34 > ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367805,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyServer","message":"Successful > initialization (took 51 ms). 
Listening on SocketAddress > /10.60.5.85:38873.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367808,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Limiting > managed memory to 0.7 of the currently free heap space (4005 MB), memory > will be allocated > lazily.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367819,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.disk.iomanager.IOManager","message":"I/O > manager uses directory /tmp/flink-io-5f657721-13dd-40aa-9c00-2a15d5666280 > for spill > files.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367826,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User > file cache uses directory > /tmp/flink-dist-cache-30b1f2fd-9457-435b-a601-ae0b4e37dc6d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5} > {"timeMillis":1557073367862,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User > file cache uses directory > /tmp/flink-dist-cache-3dfb3cd5-b261-4df3-a662-a1cd91047c72","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073367888,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting > TaskManager actor at > akka://flink/user/taskmanager#1157564383.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > 
{"timeMillis":1557073367889,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager > data connection information: > pipelineruntime-taskmgr-6789dd578b-dcp4r-57b5f60d8144eb16425ec5bd9666768f @ > pipelineruntime-taskmgr-6789dd578b-dcp4r > (dataPort=38873)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073367890,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager > has 6 task > slot(s).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Memory > usage stats: [HEAP: 842/6554/6554 MB, NON HEAP: 62/64/1776 MB > (used/committed/max)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Starting > > ZooKeeperLeaderRetrievalService.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073367965,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Leader > node has changed with > Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, session > ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5} > 
{"timeMillis":1557073367966,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"New > leader information: Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, > session > ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5} > {"timeMillis":1557073367975,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying > to register at JobManager akka.tcp://flink@10.60.5.53:6123/user/jobmanager > (attempt 1, timeout: 500 > milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073368168,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Successful > registration at JobManager > (akka.tcp://flink@10.60.5.53:6123/user/jobmanager), starting network stack > and library > cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073368177,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Determined > BLOB server address to be /10.60.5.53:43987. 
Starting BLOB > cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073368184,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.PermanentBlobCache","message":"Created > BLOB cache storage directory > /tmp/blobStore-ffdc49ba-e86f-4240-93ad-7566c43e9b0d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073368189,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.TransientBlobCache","message":"Created > BLOB cache storage directory > /tmp/blobStore-764277b6-6e46-4c8f-b7ee-80f746edefab","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391398,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$R4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [1] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$S4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [2] dead > letters encountered. 
This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$T4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [3] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$U4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [4] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$V4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [5] dead > letters encountered. 
This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$W4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [6] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$X4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [7] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391474,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Y4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [8] dead > letters encountered. 
This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391475,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Z4] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [9] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > {"timeMillis":1557073391477,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$04] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [10] dead > letters encountered. This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5} > ... > ... > ... > {"timeMillis":1557073691534,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message > [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] > from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$sab] to > Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [316] > dead letters encountered. 
This logging can be turned off or adjusted with > configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":49,"threadPriority":5} > {noformat} > TCP dump at taskmanager: > {noformat} > 19:55:58.214944 IP 10.60.5.85.45008 > 10.60.5.53.6123: tcp 715 > 0x0000: 4500 02ff 2809 4000 4006 f0ee 0a3c 0555 E...(.@.@....<.U > 0x0010: 0a3c 0535 afd0 17eb a107 10ac 0270 79da .<.5.........py. > 0x0020: 8018 ce96 21f3 0000 0101 080a f2c0 c93f ....!..........? > 0x0030: b74c ec05 0000 02c7 0ac4 0512 c105 0a3d .L.............= > 0x0040: 0a3b 616b 6b61 2e74 6370 3a2f 2f66 6c69 .;akka.tcp://fli > 0x0050: 6e6b 4031 302e 3630 2e35 2e35 333a 3631 nk@10.60.5.53:61 > 0x0060: 3233 2f75 7365 722f 6a6f 626d 616e 6167 23/user/jobmanag > 0x0070: 6572 2331 3231 3433 3237 3831 3312 bf04 er#1214327813... > 0x0080: 0aba 04ac ed00 0573 7200 3f6f 7267 2e61 .......sr.?org.a > 0x0090: 7061 6368 652e 666c 696e 6b2e 7275 6e74 pache.flink.runt > 0x00a0: 696d 652e 6d65 7373 6167 6573 2e54 6173 ime.messages.Tas > 0x00b0: 6b4d 616e 6167 6572 4d65 7373 6167 6573 kManagerMessages > 0x00c0: 2448 6561 7274 6265 6174 1fb7 fffd 259b $Heartbeat....%. > 0x00d0: c539 0200 024c 000c 6163 6375 6d75 6c61 .9...L..accumula > 0x00e0: 746f 7273 7400 164c 7363 616c 612f 636f torst..Lscala/co > 0x00f0: 6c6c 6563 7469 6f6e 2f53 6571 3b4c 000a llection/Seq;L.. > 0x0100: 696e 7374 616e 6365 4944 7400 2e4c 6f72 instanceIDt..Lor > 0x0110: 672f 6170 6163 6865 2f66 6c69 6e6b 2f72 g/apache/flink/r > 0x0120: 756e 7469 6d65 2f69 6e73 7461 6e63 652f untime/instance/ > 0x0130: 496e 7374 616e 6365 4944 3b78 7073 7200 InstanceID;xpsr. > 0x0140: 2473 6361 6c61 2e63 6f6c 6c65 6374 696f $scala.collectio > 0x0150: 6e2e 6d75 7461 626c 652e 4172 7261 7942 n.mutable.ArrayB > 0x0160: 7566 6665 7215 38b0 5383 828e 7302 0003 uffer.8.S...s... > 0x0170: 4900 0b69 6e69 7469 616c 5369 7a65 4900 I..initialSizeI. 
> 0x0180: 0573 697a 6530 5b00 0561 7272 6179 7400 .size0[..arrayt. > 0x0190: 135b 4c6a 6176 612f 6c61 6e67 2f4f 626a .[Ljava/lang/Obj > 0x01a0: 6563 743b 7870 0000 0010 0000 0000 7572 ect;xp........ur > 0x01b0: 0013 5b4c 6a61 7661 2e6c 616e 672e 4f62 ..[Ljava.lang.Ob > 0x01c0: 6a65 6374 3b90 ce58 9f10 7329 6c02 0000 ject;..X..s)l... > 0x01d0: 7870 0000 0010 7070 7070 7070 7070 7070 xp....pppppppppp > 0x01e0: 7070 7070 7070 7372 002c 6f72 672e 6170 ppppppsr.,org.ap > 0x01f0: 6163 6865 2e66 6c69 6e6b 2e72 756e 7469 ache.flink.runti > 0x0200: 6d65 2e69 6e73 7461 6e63 652e 496e 7374 me.instance.Inst > 0x0210: 616e 6365 4944 0000 0000 0000 0001 0200 anceID.......... > 0x0220: 0078 7200 206f 7267 2e61 7061 6368 652e .xr..org.apache. > 0x0230: 666c 696e 6b2e 7574 696c 2e41 6273 7472 flink.util.Abstr > 0x0240: 6163 7449 4400 0000 0000 0000 0102 0003 actID........... > 0x0250: 4a00 096c 6f77 6572 5061 7274 4a00 0975 J..lowerPartJ..u > 0x0260: 7070 6572 5061 7274 4c00 0874 6f53 7472 pperPartL..toStr > 0x0270: 696e 6774 0012 4c6a 6176 612f 6c61 6e67 ingt..Ljava/lang > 0x0280: 2f53 7472 696e 673b 7870 ae61 ac60 7f0a /String;xp.a.`.. > 0x0290: b35a b506 6f7d c221 e654 7400 2061 6536 .Z..o}.!.Tt..ae6 > 0x02a0: 3161 6336 3037 6630 6162 3335 6162 3530 1ac607f0ab35ab50 > 0x02b0: 3636 6637 6463 3232 3165 3635 3410 0122 66f7dc221e654.." > 0x02c0: 3e0a 3c61 6b6b 612e 7463 703a 2f2f 666c >.<akka.tcp://fl > 0x02d0: 696e 6b40 3130 2e36 302e 352e 3835 3a38 ink@10.60.5.85:8 > 0x02e0: 3037 302f 7573 6572 2f74 6173 6b6d 616e 070/user/taskman > 0x02f0: 6167 6572 2331 3135 3735 3634 3338 33 ager#1157564383 > 19:55:58.214996 IP 10.60.5.53.6123 > 10.60.5.85.45008: tcp 0 > 0x0000: 4500 0034 c1fe 4000 3f06 5ac4 0a3c 0535 E..4..@.?.Z..<.5 > 0x0010: 0a3c 0555 17eb afd0 0270 79da a107 1377 .<.U.....py....w > 0x0020: 8010 ce93 1f28 0000 0101 080a b74c ff8d .....(.......L.. > 0x0030: f2c0 c93f ...? > {noformat} > After this, the taskmanager never registers again at the jobmanager. 
> This run had the following akka configuration:
> akka.watch.heartbeat.pause: 60 s
> akka.ask.timeout: 100 s
> I noticed that akka.watch.heartbeat.interval defaults to ask.timeout if not
> specified in configuration. Is it possible for this kind of failure to
> happen because the heartbeat interval is greater than the heartbeat pause?
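The configuration question above can be made concrete with a small check. The following is only a sketch: it takes the reporter's observation (that akka.watch.heartbeat.interval falls back to akka.ask.timeout when unset) as an assumption rather than verified Flink behaviour, and the helper names `parse_duration`, `effective_watch_settings` and `interval_exceeds_pause` are hypothetical. It shows that, under that assumption, the reported settings give a heartbeat interval (100 s) larger than the acceptable pause (60 s). It does not establish that this is the cause of the missing re-registration.

```python
import re

# Seconds per unit for the few duration suffixes handled by this sketch.
_UNITS = {"ms": 0.001, "s": 1.0, "min": 60.0, "h": 3600.0}


def parse_duration(text):
    """Parse a duration string such as '60 s' or '500 ms' into seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*", text)
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def effective_watch_settings(config):
    """Return (heartbeat_interval, heartbeat_pause) in seconds.

    ASSUMPTION (from the report, not verified here): when
    akka.watch.heartbeat.interval is not set, it falls back to the
    value of akka.ask.timeout.
    """
    interval_text = config.get(
        "akka.watch.heartbeat.interval", config["akka.ask.timeout"]
    )
    pause_text = config["akka.watch.heartbeat.pause"]
    return parse_duration(interval_text), parse_duration(pause_text)


def interval_exceeds_pause(config):
    """True if heartbeats are sent less often than the tolerated pause."""
    interval, pause = effective_watch_settings(config)
    return interval > pause


# The two settings quoted in the report; the interval itself was not set.
reported = {
    "akka.watch.heartbeat.pause": "60 s",
    "akka.ask.timeout": "100 s",
}

if __name__ == "__main__":
    interval, pause = effective_watch_settings(reported)
    print(f"interval={interval:.0f}s pause={pause:.0f}s "
          f"interval_exceeds_pause={interval_exceeds_pause(reported)}")
```

If the fallback assumption holds, setting akka.watch.heartbeat.interval explicitly to a value below the pause would be one way to test whether the behaviour changes.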