[ https://issues.apache.org/jira/browse/FLINK-17933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123540#comment-17123540 ]
Roman Khachatryan commented on FLINK-17933:
-------------------------------------------

The error was caused by external reasons (misconfiguration and not enough space).

I was not able to reproduce it after reconfiguring Flink and re-scaling the cluster.

> TaskManager was terminated on Yarn - investigate
> ------------------------------------------------
>
>                 Key: FLINK-17933
>                 URL: https://issues.apache.org/jira/browse/FLINK-17933
>             Project: Flink
>          Issue Type: Task
>          Components: Deployment / YARN, Runtime / Task
>    Affects Versions: 1.11.0
>            Reporter: Roman Khachatryan
>            Assignee: Roman Khachatryan
>            Priority: Major
>             Fix For: 1.11.0
>
>
> When running a job on a Yarn cluster (load testing), some jobs result in failures.
> Initial symptoms are no bytes written/transferred in CSV and failures in logs:
> {code:java}
> 2020-05-17 10:02:32,858 WARN org.apache.flink.runtime.taskmanager.Task [] - Map -> Flat Map (138/160) (e49f7ea26b633c8035f2a919b1c580c8) switched from RUNNING to FAILED.
> {code}
>
> It turned out that all such failures were caused by "Connection reset" from a single IP, except for one "Leadership lost" error (another IP).
> The connection reset was likely caused by the TM receiving SIGTERM (container_1589453804748_0118_01_000004 and 5, both on ip-172-31-42-229):
> {code:java}
> 2020-05-17 10:02:31,362 INFO org.apache.flink.yarn.YarnTaskExecutorRunner [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> {code}
>
> Other TMs received SIGTERM one minute later (all logs were uploaded at the same time, though).
>
> From the JM it looked like this:
> {code:java}
> 2020-05-17 10:02:23,583 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Trigger heartbeat request.
> 2020-05-17 10:02:23,587 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000005.
> 2020-05-17 10:02:23,590 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000006.
> 2020-05-17 10:02:23,592 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000004.
> 2020-05-17 10:02:23,595 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000003.
> 2020-05-17 10:02:23,598 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [] - Received heartbeat from container_1589453804748_0118_01_000002.
> 2020-05-17 10:02:23,725 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received acknowledge message for checkpoint 12 from task 459efd2ad8fe2ffe7fffe28530064fe1 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at container_1589453804748_0118_01_000002 @ ip-172-31-43-69.eu-central-1.compute.internal (dataPort=44625).
> 2020-05-17 10:02:29,103 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received acknowledge message for checkpoint 12 from task 266a9326be7e3ec669cce2e6a97ae5b0 of job 5d4d8c88de23b1361fe0dce6ba8443f8 at container_1589453804748_0118_01_000005 @ ip-172-31-42-229.eu-central-1.compute.internal (dataPort=37329).
> 2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@ip-172-31-42-229.eu-central-1.compute.internal:39999] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2020-05-17 10:02:32,862 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://fl...@ip-172-31-42-229.eu-central-1.compute.internal:42567] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2020-05-17 10:02:32,900 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Map -> Flat Map (87/160) (cb77c7002503baa74baf73a3a100c2f2) switched from RUNNING to FAILED.
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: readAddress(..) failed: Connection reset by peer (connection to 'ip-172-31-42-229.eu-central-1.compute.internal/172.31.42.229:37329')
> {code}
>
> There are also JobManager heartbeat timeouts but they don't correlate with the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)