[ https://issues.apache.org/jira/browse/FLINK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809421#comment-16809421 ]
Guowei Ma commented on FLINK-12106:
-----------------------------------

AFAIK, the community is working on it. [FLINK-10941|https://issues.apache.org/jira/browse/FLINK-10941] has the same problem. This issue is related to the lifecycle control of shuffle resources. There are some related discussions and designs [1][2].

[1] [https://docs.google.com/document/d/13vAJJxfRXAwI4MtO8dux8hHnNMw2Biu5XRrb_hvGehA/edit#heading=h.v7vhb7w01d61]
[2] [https://cwiki.apache.org/confluence/display/FLINK/FLIP-31%3A+Pluggable+Shuffle+Manager]

> Jobmanager is killing FINISHED taskmanger containers, causing exception in still running Taskmanagers an
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12106
>                 URL: https://issues.apache.org/jira/browse/FLINK-12106
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Hadoop: hdp/2.5.6.0-40
> Flink: 2.7.2
>            Reporter: John
>            Priority: Major
>
> When running a single Flink job on YARN, some of the taskmanager containers reach the FINISHED state before others. It appears that, after receiving the final execution state FINISHED from a taskmanager, the jobmanager waits ~68 seconds and then frees the associated slot in that taskmanager. After an additional 60 seconds, the jobmanager stops the same taskmanager because "TaskExecutor exceeded the idle timeout".
> Meanwhile, other taskmanagers are still working to complete the job. Within 10 seconds after the taskmanager container above is stopped, the remaining taskmanagers receive an exception due to loss of connection to the stopped taskmanager. These exceptions result in job failure.
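The intervals in the description can be checked against the timestamps in the quoted log excerpts. Below is a minimal Python sketch that does the arithmetic. The timestamps are copied from the logs for container_1553017480503_0158_01_000077 and container_1553017480503_0158_01_000059; the event labels and the helper name `gaps` are made up here for illustration and do not come from Flink.

```python
from datetime import datetime

# Log4j-style timestamp used in the quoted logs, e.g. "2019-04-03 14:58:15,188".
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the log excerpts quoted in this issue.
# The labels are informal descriptions, not Flink identifiers.
EVENTS = [
    ("final state FINISHED sent to JobManager", "2019-04-03 14:58:15,188"),
    ("SlotPool releases idle slot", "2019-04-03 14:59:23,049"),
    ("YarnResourceManager stops container", "2019-04-03 15:00:02,641"),
    ("first RemoteTransportException", "2019-04-03 15:00:08,671"),
]


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair of events."""
    parsed = [(label, datetime.strptime(ts, FMT)) for label, ts in events]
    return [
        (a_label, b_label, (b_time - a_time).total_seconds())
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for start, end, seconds in gaps(EVENTS):
        print(f"{start} -> {end}: {seconds:.3f} s")
```

With these timestamps the first gap comes out at about 68 seconds, matching the description. The gap between the slot release and the container stop is closer to 40 seconds than 60 in this excerpt, and the first connection error follows about 6 seconds after the stop.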
>
> Relevant logs:
> 2019-04-03 13:49:00,013 INFO org.apache.flink.yarn.YarnResourceManager - Registering TaskManager with ResourceID container_1553017480503_0158_01_000038 (akka.tcp://flink@hadoop4:42745/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:05,900 INFO org.apache.flink.yarn.YarnResourceManager - Registering TaskManager with ResourceID container_1553017480503_0158_01_000059 (akka.tcp://flink@hadoop9:55042/user/taskmanager_0) at ResourceManager
>
> 2019-04-03 13:48:51,132 INFO org.apache.flink.yarn.YarnResourceManager - Received new container: container_1553017480503_0158_01_000077 - Remaining pending container requests: 6
> 2019-04-03 13:48:52,862 INFO org.apache.flink.yarn.YarnTaskExecutorRunner - -Dlog.file=/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000077/taskmanager.log
> 2019-04-03 13:48:57,490 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 202 ms). Listening on SocketAddress /192.168.230.69:40140.
> 2019-04-03 13:49:12,575 INFO org.apache.flink.yarn.YarnResourceManager - Registering TaskManager with ResourceID container_1553017480503_0158_01_000077 (akka.tcp://flink@hadoop9:51525/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:12,631 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Allocated slot for AllocationID\{42fed3e5a136240c23cc7b394e3249e9}.
> 2019-04-03 14:58:15,188 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task DataSink (com.anovadata.alexflinklib.sinks.bucketing.BucketingOutputFormat@26874f2c) a4b5fb32830d4561147b2714828109e2.
> 2019-04-03 14:59:23,049 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing idle slot [AllocationID\{42fed3e5a136240c23cc7b394e3249e9}].
> 2019-04-03 14:59:23,058 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile\{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647}, allocationId: AllocationID\{42fed3e5a136240c23cc7b394e3249e9}, jobId: a6c4e367698c15cdf168d19a89faff1d).
> 2019-04-03 15:00:02,641 INFO org.apache.flink.yarn.YarnResourceManager - Stopping container container_1553017480503_0158_01_000077.
> 2019-04-03 15:00:02,646 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_1553017480503_0158_01_000077 because: TaskExecutor exceeded the idle timeout.
>
> 2019-04-03 13:48:48,902 INFO org.apache.flink.yarn.YarnTaskExecutorRunner - -Dlog.file=/data1/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000059/taskmanager.log
> 2019-04-03 14:59:24,677 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem columnStore to file. allocated memory: 109479981
> 2019-04-03 15:00:05,696 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter - mem size 135014409 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 15:00:05,696 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem columnStore to file. allocated memory: 102677684
> 2019-04-03 15:00:08,671 ERROR org.apache.flink.runtime.operators.BatchTask - Error in task code: CHAIN Partition -> FlatMap
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,714 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FAILED to JobManager for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,812 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,812 INFO org.apache.flink.runtime.taskmanager.Task - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,812 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code DataSink ()
>
> 2019-04-03 13:48:44,562 INFO org.apache.flink.yarn.YarnTaskExecutorRunner - -Dlog.file=/data8/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000038/taskmanager.log
> 2019-04-03 14:59:18,620 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem columnStore to file. allocated memory: 0
> 2019-04-03 14:59:48,088 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter - mem size 136179972 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 14:59:48,088 INFO org.apache.parquet.hadoop.InternalParquetRecordWriter - Flushing mem columnStore to file. allocated memory: 103333893
> 2019-04-03 15:00:08,692 ERROR org.apache.flink.runtime.operators.BatchTask - Error in task code: CHAIN Partition -> FlatMap
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,741 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FAILED to JobManager for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,817 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,817 INFO org.apache.flink.runtime.taskmanager.Task - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,817 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code DataSink ()
>
> 2019-04-03 15:00:09,196 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher - Job a6c4e367698c15cdf168d19a89faff1d reached globally terminal state FAILED.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)