[ 
https://issues.apache.org/jira/browse/FLINK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836032#comment-16836032
 ] 

Abdul Qadeer commented on FLINK-12437:
--------------------------------------

[~rmetzger] I don't expect a fix to be made available for this in 1.4.0. I 
would like to know if this is a known issue fixed in newer versions. 

> Taskmanager doesn't initiate registration after jobmanager marks it terminated
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-12437
>                 URL: https://issues.apache.org/jira/browse/FLINK-12437
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Abdul Qadeer
>            Priority: Major
>
> This issue was observed in Standalone cluster deployment mode with ZooKeeper 
> HA enabled in Flink 1.4.0. A few taskmanagers restarted after running out of 
> Metaspace.
>  The offending taskmanager `pipelineruntime-taskmgr-6789dd578b-dcp4r` first 
> registers successfully with the jobmanager, and the remote watcher marks it 
> terminated soon after, as seen in the logs. Other taskmanagers were 
> terminated around the same time, but they had been quarantined by the 
> jobmanager with a message similar to:
> {noformat}
> Association to [akka.tcp://flink@10.60.5.121:8070] having UID [864976677] is 
> irrecoverably failed. UID is now quarantined and all messages to this UID 
> will be delivered to dead letters. Remote actorsystem must be restarted to 
> recover from this situation.
> {noformat}
> They came back up and successfully registered with the jobmanager. This 
> didn't happen for the offending taskmanager:
>   
>  At JobManager:
> {noformat}
> {"timeMillis":1557073368155,"thread":"flink-akka.actor.default-dispatcher-49","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Registered
>  TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r 
> (akka.tcp://flink@10.60.5.85:8070/user/taskmanager) as 
> ae61ac607f0ab35ab5066f7dc221e654. Current number of registered hosts is 8. 
> Current number of alive task slots is 
> 51.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":125,"threadPriority":5}
> ...
> ...
> {"timeMillis":1557073391386,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered
>  task manager /10.60.5.85. Number of registered task managers 7. Number of 
> available slots 
> 45.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
> ...
> ...
> {"timeMillis":1557073391483,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered
>  task manager /10.60.5.85. Number of registered task managers 6. Number of 
> available slots 
> 39.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
> ...
> ...
> {"timeMillis":1557073370389,"thread":"flink-akka.actor.default-dispatcher-35","level":"INFO","loggerName":"akka.actor.LocalActorRef","message":"Message
>  [akka.remote.ReliableDeliverySupervisor$Ungate$] from 
> Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260]
>  to 
> Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260]
>  was not delivered. [22] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":98,"threadPriority":5}
> {noformat}
> At TaskManager:
> {noformat}
> {"timeMillis":1557073366068,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
>  
> TaskManager","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073366073,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
>  TaskManager actor system at 
> 10.60.5.85:8070.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073366077,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying
>  to start actor system at 
> 10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073366510,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.event.slf4j.Slf4jLogger","message":"Slf4jLogger
>  
> started","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073366694,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Starting
>  
> remoting","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073367049,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting
>  started; listening on addresses 
> :[akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073367051,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting
>  now listens on addresses: 
> [akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073367089,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Actor
>  system started at 
> akka.tcp://flink@10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367138,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Configuring
>  FlinkMetricsReporter with 
> {class=com.pipeline.processor.flink.metrics.FlinkMetricsReporter}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"com.pipeline.processor.flink.metrics.FlinkMetricsReporter","message":"Metrics
>  Reporter 
> Open","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Reporting
>  metrics of type 
> com.pipeline.processor.flink.metrics.FlinkMetricsReporter.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367142,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
>  TaskManager 
> actor","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367176,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyConfig","message":"NettyConfig
>  [server address: /10.60.5.85, server port: 0, ssl enabled: false, memory 
> segment size (bytes): 32768, transport type: NIO, number of server threads: 3 
> (manual), number of client threads: 3 (manual), server connect backlog: 0 
> (use Netty's default), client connect timeout (sec): 120, send/receive buffer 
> size (bytes): 0 (use Netty's 
> default)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367187,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration","message":"Messages
>  have a max timeout of 100000 
> ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367198,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Temporary
>  file directory '/tmp': total 373 GB, usable 295 GB (79.09% 
> usable)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367608,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.buffer.NetworkBufferPool","message":"Allocated
>  639 MB for network buffer pool (number of memory segments: 20467, bytes per 
> segment: 
> 32768).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367710,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could
>  not load Queryable State Client Proxy. Probable reason: 
> flink-queryable-state-runtime is not in the classpath. Please put the 
> corresponding jar from the opt to the lib 
> folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367711,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could
>  not load Queryable State Server. Probable reason: 
> flink-queryable-state-runtime is not in the classpath. Please put the 
> corresponding jar from the opt to the lib 
> folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367712,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.NetworkEnvironment","message":"Starting
>  the network environment and its 
> components.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367753,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyClient","message":"Successful
>  initialization (took 34 
> ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367805,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyServer","message":"Successful
>  initialization (took 51 ms). Listening on SocketAddress 
> /10.60.5.85:38873.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367808,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Limiting
>  managed memory to 0.7 of the currently free heap space (4005 MB), memory 
> will be allocated 
> lazily.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367819,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.disk.iomanager.IOManager","message":"I/O
>  manager uses directory /tmp/flink-io-5f657721-13dd-40aa-9c00-2a15d5666280 
> for spill 
> files.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367826,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User
>  file cache uses directory 
> /tmp/flink-dist-cache-30b1f2fd-9457-435b-a601-ae0b4e37dc6d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367862,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User
>  file cache uses directory 
> /tmp/flink-dist-cache-3dfb3cd5-b261-4df3-a662-a1cd91047c72","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367888,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
>  TaskManager actor at 
> akka://flink/user/taskmanager#1157564383.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367889,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager
>  data connection information: 
> pipelineruntime-taskmgr-6789dd578b-dcp4r-57b5f60d8144eb16425ec5bd9666768f @ 
> pipelineruntime-taskmgr-6789dd578b-dcp4r 
> (dataPort=38873)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367890,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager
>  has 6 task 
> slot(s).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Memory
>  usage stats: [HEAP: 842/6554/6554 MB, NON HEAP: 62/64/1776 MB 
> (used/committed/max)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Starting
>  
> ZooKeeperLeaderRetrievalService.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367965,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Leader
>  node has changed with 
> Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, session 
> ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
> {"timeMillis":1557073367966,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"New
>  leader information: Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, 
> session 
> ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
> {"timeMillis":1557073367975,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying
>  to register at JobManager akka.tcp://flink@10.60.5.53:6123/user/jobmanager 
> (attempt 1, timeout: 500 
> milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368168,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Successful
>  registration at JobManager 
> (akka.tcp://flink@10.60.5.53:6123/user/jobmanager), starting network stack 
> and library 
> cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368177,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Determined
>  BLOB server address to be /10.60.5.53:43987. Starting BLOB 
> cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368184,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.PermanentBlobCache","message":"Created
>  BLOB cache storage directory 
> /tmp/blobStore-ffdc49ba-e86f-4240-93ad-7566c43e9b0d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368189,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.TransientBlobCache","message":"Created
>  BLOB cache storage directory 
> /tmp/blobStore-764277b6-6e46-4c8f-b7ee-80f746edefab","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391398,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$R4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [1] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$S4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [2] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$T4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [3] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$U4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [4] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$V4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [5] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$W4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [6] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$X4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [7] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391474,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Y4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [8] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391475,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Z4] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [9] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391477,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$04] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [10] dead 
> letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> ...
> ...
> ...
> {"timeMillis":1557073691534,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
>  [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] 
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$sab] to 
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [316] 
> dead letters encountered. This logging can be turned off or adjusted with 
> configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":49,"threadPriority":5}
> {noformat}
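
The JobManager and TaskManager excerpts above are easier to compare once merged into a single timeline, since some records appear out of chronological order. The following is a minimal sketch, assuming each record has been rejoined onto one line (the mail wrapping above splits them) and that `timeMillis` is epoch milliseconds. The `SAMPLE` records are trimmed stand-ins with shortened messages, not verbatim log lines, and `to_utc` and `timeline` are hypothetical helper names.

```python
import json
from datetime import datetime, timedelta, timezone


def to_utc(time_millis):
    """Render an epoch-milliseconds value as an ISO-8601 UTC timestamp."""
    # Split seconds and milliseconds so no float rounding is involved.
    seconds, millis = divmod(time_millis, 1000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    moment += timedelta(milliseconds=millis)
    return moment.isoformat(timespec="milliseconds")


def timeline(lines):
    """Parse one-JSON-object-per-line records and sort them by timeMillis."""
    records = [json.loads(line) for line in lines if line.lstrip().startswith("{")]
    records.sort(key=lambda record: record["timeMillis"])
    return [
        (to_utc(r["timeMillis"]), r["loggerName"].rsplit(".", 1)[-1], r["message"])
        for r in records
    ]


# Trimmed stand-ins for the records above: three fields kept, messages shortened.
SAMPLE = [
    '{"timeMillis":1557073391386,"loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered task manager /10.60.5.85."}',
    '{"timeMillis":1557073368155,"loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Registered TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r"}',
    '{"timeMillis":1557073368168,"loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Successful registration at JobManager"}',
]

for stamp, logger, message in timeline(SAMPLE):
    print(stamp, logger, message)
```

Sorting on `timeMillis` rather than on file order keeps the result independent of how the excerpts were pasted together. It still assumes the clocks on both hosts are reasonably in sync.
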
> TCP dump at taskmanager:
> {noformat}
> 19:55:58.214944 IP 10.60.5.85.45008 > 10.60.5.53.6123: tcp 715
>       0x0000:  4500 02ff 2809 4000 4006 f0ee 0a3c 0555  E...(.@.@....<.U
>       0x0010:  0a3c 0535 afd0 17eb a107 10ac 0270 79da  .<.5.........py.
>       0x0020:  8018 ce96 21f3 0000 0101 080a f2c0 c93f  ....!..........?
>       0x0030:  b74c ec05 0000 02c7 0ac4 0512 c105 0a3d  .L.............=
>       0x0040:  0a3b 616b 6b61 2e74 6370 3a2f 2f66 6c69  .;akka.tcp://fli
>       0x0050:  6e6b 4031 302e 3630 2e35 2e35 333a 3631  nk@10.60.5.53:61
>       0x0060:  3233 2f75 7365 722f 6a6f 626d 616e 6167  23/user/jobmanag
>       0x0070:  6572 2331 3231 3433 3237 3831 3312 bf04  er#1214327813...
>       0x0080:  0aba 04ac ed00 0573 7200 3f6f 7267 2e61  .......sr.?org.a
>       0x0090:  7061 6368 652e 666c 696e 6b2e 7275 6e74  pache.flink.runt
>       0x00a0:  696d 652e 6d65 7373 6167 6573 2e54 6173  ime.messages.Tas
>       0x00b0:  6b4d 616e 6167 6572 4d65 7373 6167 6573  kManagerMessages
>       0x00c0:  2448 6561 7274 6265 6174 1fb7 fffd 259b  $Heartbeat....%.
>       0x00d0:  c539 0200 024c 000c 6163 6375 6d75 6c61  .9...L..accumula
>       0x00e0:  746f 7273 7400 164c 7363 616c 612f 636f  torst..Lscala/co
>       0x00f0:  6c6c 6563 7469 6f6e 2f53 6571 3b4c 000a  llection/Seq;L..
>       0x0100:  696e 7374 616e 6365 4944 7400 2e4c 6f72  instanceIDt..Lor
>       0x0110:  672f 6170 6163 6865 2f66 6c69 6e6b 2f72  g/apache/flink/r
>       0x0120:  756e 7469 6d65 2f69 6e73 7461 6e63 652f  untime/instance/
>       0x0130:  496e 7374 616e 6365 4944 3b78 7073 7200  InstanceID;xpsr.
>       0x0140:  2473 6361 6c61 2e63 6f6c 6c65 6374 696f  $scala.collectio
>       0x0150:  6e2e 6d75 7461 626c 652e 4172 7261 7942  n.mutable.ArrayB
>       0x0160:  7566 6665 7215 38b0 5383 828e 7302 0003  uffer.8.S...s...
>       0x0170:  4900 0b69 6e69 7469 616c 5369 7a65 4900  I..initialSizeI.
>       0x0180:  0573 697a 6530 5b00 0561 7272 6179 7400  .size0[..arrayt.
>       0x0190:  135b 4c6a 6176 612f 6c61 6e67 2f4f 626a  .[Ljava/lang/Obj
>       0x01a0:  6563 743b 7870 0000 0010 0000 0000 7572  ect;xp........ur
>       0x01b0:  0013 5b4c 6a61 7661 2e6c 616e 672e 4f62  ..[Ljava.lang.Ob
>       0x01c0:  6a65 6374 3b90 ce58 9f10 7329 6c02 0000  ject;..X..s)l...
>       0x01d0:  7870 0000 0010 7070 7070 7070 7070 7070  xp....pppppppppp
>       0x01e0:  7070 7070 7070 7372 002c 6f72 672e 6170  ppppppsr.,org.ap
>       0x01f0:  6163 6865 2e66 6c69 6e6b 2e72 756e 7469  ache.flink.runti
>       0x0200:  6d65 2e69 6e73 7461 6e63 652e 496e 7374  me.instance.Inst
>       0x0210:  616e 6365 4944 0000 0000 0000 0001 0200  anceID..........
>       0x0220:  0078 7200 206f 7267 2e61 7061 6368 652e  .xr..org.apache.
>       0x0230:  666c 696e 6b2e 7574 696c 2e41 6273 7472  flink.util.Abstr
>       0x0240:  6163 7449 4400 0000 0000 0000 0102 0003  actID...........
>       0x0250:  4a00 096c 6f77 6572 5061 7274 4a00 0975  J..lowerPartJ..u
>       0x0260:  7070 6572 5061 7274 4c00 0874 6f53 7472  pperPartL..toStr
>       0x0270:  696e 6774 0012 4c6a 6176 612f 6c61 6e67  ingt..Ljava/lang
>       0x0280:  2f53 7472 696e 673b 7870 ae61 ac60 7f0a  /String;xp.a.`..
>       0x0290:  b35a b506 6f7d c221 e654 7400 2061 6536  .Z..o}.!.Tt..ae6
>       0x02a0:  3161 6336 3037 6630 6162 3335 6162 3530  1ac607f0ab35ab50
>       0x02b0:  3636 6637 6463 3232 3165 3635 3410 0122  66f7dc221e654.."
>       0x02c0:  3e0a 3c61 6b6b 612e 7463 703a 2f2f 666c  >.<akka.tcp://fl
>       0x02d0:  696e 6b40 3130 2e36 302e 352e 3835 3a38  ink@10.60.5.85:8
>       0x02e0:  3037 302f 7573 6572 2f74 6173 6b6d 616e  070/user/taskman
>       0x02f0:  6167 6572 2331 3135 3735 3634 3338 33    ager#1157564383
> 19:55:58.214996 IP 10.60.5.53.6123 > 10.60.5.85.45008: tcp 0
>       0x0000:  4500 0034 c1fe 4000 3f06 5ac4 0a3c 0535  E..4..@.?.Z..<.5
>       0x0010:  0a3c 0555 17eb afd0 0270 79da a107 1377  .<.U.....py....w
>       0x0020:  8010 ce93 1f28 0000 0101 080a b74c ff8d  .....(.......L..
>       0x0030:  f2c0 c93f                                ...?
> {noformat}
> After this, the taskmanager never registers with the jobmanager again.
> This run had the following akka configuration:
> akka.watch.heartbeat.pause: 60 s
> akka.ask.timeout: 100 s
> I noticed that akka.watch.heartbeat.interval defaults to akka.ask.timeout if 
> not specified in the configuration. Is it possible for this kind of failure 
> to happen because the heartbeat interval is larger than the heartbeat pause?
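
To make the arithmetic behind that question explicit, here is a minimal sketch that resolves the effective heartbeat interval and compares it with the pause. The fallback of `akka.watch.heartbeat.interval` to `akka.ask.timeout` is taken from the reporter's observation above and is an assumption here, not something this sketch verifies against Flink. `parse_duration_ms` and `interval_exceeds_pause` are hypothetical helpers, and the duration parser only handles the simple `<number> <unit>` forms used in this report.

```python
import re

# Units handled by this sketch only; not a full duration grammar.
_UNIT_TO_MS = {"ms": 1, "s": 1000, "min": 60_000, "h": 3_600_000}


def parse_duration_ms(text):
    """Parse strings such as '60 s' or '500 ms' into milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|min|h)\s*", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_TO_MS[match.group(2)]


def interval_exceeds_pause(config):
    """Return (interval_ms, pause_ms, interval_ms > pause_ms).

    Assumption (from the report): an unset heartbeat interval falls back
    to the ask timeout.
    """
    ask_timeout = config["akka.ask.timeout"]
    interval = config.get("akka.watch.heartbeat.interval", ask_timeout)
    pause = config["akka.watch.heartbeat.pause"]
    interval_ms = parse_duration_ms(interval)
    pause_ms = parse_duration_ms(pause)
    return interval_ms, pause_ms, interval_ms > pause_ms


# The two settings quoted in this report; the interval is left unset.
reported = {
    "akka.watch.heartbeat.pause": "60 s",
    "akka.ask.timeout": "100 s",
}

print(interval_exceeds_pause(reported))
```

Under that assumption the effective interval (100 s) is longer than the acceptable pause (60 s). This only restates the configuration; it does not show that the mismatch caused the failure described above.
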



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
