Re: TM get killed/disconnected after a while
Hi,

Can you provide a bit more info about your setup, such as which Kubernetes resources you are using (Deployments, Services)?

Is the pod running the taskmanager killed by Kubernetes, or does it fail?

Can you provide the output of kubectl describe pod and kubectl logs for the taskmanager pod that exited?

--
Patrick Lucas

On Fri, Oct 6, 2017 at 8:16 PM, Hao Sun wrote:

> Hi, I am running Flink 1.3.2 on kubernetes, I am not sure why sometime one
> of my TM is killed, is there a way to debug this? Thanks
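The diagnostics requested in the reply above could be gathered in one go. What follows is only a minimal sketch: the helper names diagnostic_commands and collect are made up for illustration, the namespace default is an assumption, and it presumes kubectl is installed and pointed at the right cluster. The --previous flag asks kubectl for the log of the prior container instance, which only helps if the container restarted inside a pod that still exists.

```python
import subprocess


def diagnostic_commands(pod, namespace="default"):
    """Build the kubectl invocations for one pod.

    `pod` and `namespace` are placeholders supplied by the caller.
    """
    base = ["kubectl", "--namespace", namespace]
    return [
        base + ["describe", "pod", pod],
        base + ["logs", pod],
        # Log of the previous container instance, if there was a restart.
        base + ["logs", pod, "--previous"],
    ]


def collect(pod, namespace="default"):
    """Run each command and return {command string: stdout + stderr}."""
    results = {}
    for cmd in diagnostic_commands(pod, namespace):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[" ".join(cmd)] = proc.stdout + proc.stderr
    return results
```

Calling collect("fps-flink-taskmanager-2384273947-9n4kc") would return the three outputs keyed by command. If Kubernetes killed the container, the describe output would typically show a last state with a termination reason such as OOMKilled. If the pod has already been deleted and replaced, these commands will simply report that it was not found.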
TM get killed/disconnected after a while
Hi,

I am running Flink 1.3.2 on Kubernetes, and I am not sure why one of my TMs is sometimes killed. Is there a way to debug this? Thanks

= Logs

*2017-10-05 22:36:42,631 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at fps-flink-taskmanager-2384273947-9n4kc (akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274/user/taskmanager) as 330ff7eeaabfe2b7289fee4a0e36c4b2. Current number of registered hosts is 2. Current number of alive task slots is 2.*
2017-10-05 22:37:04,974 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: KafkaSource(maxwell.users) -> MaxwellFilter->Maxwell(maxwell.users) -> FixedDelayWatermark(maxwell.users) -> MaxwellFPSEvent->InfluxDBData(maxwell.users) -> (Sink: influxdbSink(maxwell.users), Sink: PrintSink(maxwell.users)) (1/1) (attempt #0) to fps-flink-taskmanager-2384273947-9n4kc
*2017-10-06 06:08:55,657 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Disassociated]*
2017-10-06 06:08:55,832 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
2017-10-06 06:09:01,232 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by: [fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:09:03,416 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by: [fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:11,174 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by: [fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:11,440 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
2017-10-06 06:09:21,232 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by: [fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:09:27,460 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
2017-10-06 06:09:31,173 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by: [fps-flink-taskmanager-2384273947-9n4kc]
2017-10-06 06:09:41,179 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274]] Caused by: [fps-flink-taskmanager-2384273947-9n4kc: Name does not resolve]
2017-10-06 06:09:51,174 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-2384273947-9n4kc:40274] has failed, address is now gated for [5000] ms. R
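Since the warnings above repeat every few seconds, it can help to find the moment each TaskManager address first started failing and compare that time against Kubernetes events. Below is a minimal sketch, assuming the JobManager log uses the timestamp and level layout shown in this post; the function name first_failures and both regular expressions are illustrative and not part of Flink or Akka.

```python
import re

# Start of a log entry: timestamp followed by the level, as in the log above.
ENTRY_START = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(INFO|WARN|ERROR)"
)
# An Akka remote address such as akka.tcp://flink@host:40274
AKKA_ADDR = re.compile(r"akka\.tcp://[\w.@-]+:\d+")


def first_failures(log_text):
    """Return {akka address: timestamp of the first WARN entry naming it}."""
    starts = list(ENTRY_START.finditer(log_text))
    seen = {}
    for i, match in enumerate(starts):
        # An entry runs until the next timestamp, so wrapped lines stay together.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(log_text)
        if match.group(2) != "WARN":
            continue
        addr = AKKA_ADDR.search(log_text[match.start():end])
        if addr and addr.group(0) not in seen:
            seen[addr.group(0)] = match.group(1)
    return seen
```

Run over the log in this post, the earliest WARN entry for the TaskManager address is the Disassociated one at 2017-10-06 06:08:55,657, which is the time to look up in the pod's events. The sketch only reports when the JobManager lost contact; it cannot say why the pod went away.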