[ 
https://issues.apache.org/jira/browse/FLINK-12106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809421#comment-16809421
 ] 

Guowei Ma commented on FLINK-12106:
-----------------------------------

AFAIK, the community is working on it.  
[FLINK-10941|https://issues.apache.org/jira/browse/FLINK-10941] has the same 
problem. 

This issue is related to lifecycle control of shuffle resources. There are 
some related discussions and design documents [1][2].

[1] 
[https://docs.google.com/document/d/13vAJJxfRXAwI4MtO8dux8hHnNMw2Biu5XRrb_hvGehA/edit#heading=h.v7vhb7w01d61]

[2] 
[https://cwiki.apache.org/confluence/display/FLINK/FLIP-31%3A+Pluggable+Shuffle+Manager]

 

> Jobmanager is killing FINISHED taskmanger containers, causing exception in 
> still running Taskmanagers an
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12106
>                 URL: https://issues.apache.org/jira/browse/FLINK-12106
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Hadoop:  hdp/2.5.6.0-40
> Flink: 1.7.2
>            Reporter: John
>            Priority: Major
>
> When running a single Flink job on YARN, some of the taskmanager containers 
> reach the FINISHED state before others.  It appears that, after receiving 
> final execution state FINISHED from a taskmanager, jobmanager is waiting ~68 
> seconds and then freeing the associated slot in the taskmanager.  After an 
> additional 60 seconds, jobmanager is stopping the same taskmanager because 
> TaskExecutor exceeded the idle timeout.
> Meanwhile, other taskmanagers are still working to complete the job.  Within 
> 10 seconds after the taskmanager container above is stopped, the remaining 
> task managers receive an exception due to loss of connection to the stopped 
> taskmanager.  These exceptions result in job failure.
>  
> Relevant logs:
> 2019-04-03 13:49:00,013 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Registering TaskManager with ResourceID 
> container_1553017480503_0158_01_000038 
> (akka.tcp://flink@hadoop4:42745/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:05,900 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Registering TaskManager with ResourceID 
> container_1553017480503_0158_01_000059 
> (akka.tcp://flink@hadoop9:55042/user/taskmanager_0) at ResourceManager
>  
>  
> 2019-04-03 13:48:51,132 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Received new container: 
> container_1553017480503_0158_01_000077 - Remaining pending container 
> requests: 6
> 2019-04-03 13:48:52,862 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner    
>               -     
> -Dlog.file=/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000077/taskmanager.log
> 2019-04-03 13:48:57,490 INFO  
> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful 
> initialization (took 202 ms). Listening on SocketAddress 
> /192.168.230.69:40140.
> 2019-04-03 13:49:12,575 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Registering TaskManager with ResourceID 
> container_1553017480503_0158_01_000077 
> (akka.tcp://flink@hadoop9:51525/user/taskmanager_0) at ResourceManager
> 2019-04-03 13:49:12,631 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Allocated 
> slot for AllocationID\{42fed3e5a136240c23cc7b394e3249e9}.
> 2019-04-03 14:58:15,188 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - 
> Un-registering task and sending final execution state FINISHED to JobManager 
> for task DataSink 
> (com.anovadata.alexflinklib.sinks.bucketing.BucketingOutputFormat@26874f2c) 
> a4b5fb32830d4561147b2714828109e2.
> 2019-04-03 14:59:23,049 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Releasing 
> idle slot [AllocationID\{42fed3e5a136240c23cc7b394e3249e9}].
> 2019-04-03 14:59:23,058 INFO  
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable      - Free slot 
> TaskSlot(index:0, state:ACTIVE, resource profile: 
> ResourceProfile\{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
> directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
> networkMemoryInMB=2147483647}, allocationId: 
> AllocationID\{42fed3e5a136240c23cc7b394e3249e9}, jobId: 
> a6c4e367698c15cdf168d19a89faff1d).
> 2019-04-03 15:00:02,641 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Stopping container container_1553017480503_0158_01_000077.
> 2019-04-03 15:00:02,646 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Closing TaskExecutor connection 
> container_1553017480503_0158_01_000077 because: TaskExecutor exceeded the 
> idle timeout.
>  
>  
> 2019-04-03 13:48:48,902 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner    
>               -     
> -Dlog.file=/data1/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000059/taskmanager.log
> 2019-04-03 14:59:24,677 INFO  
> org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
> columnStore to file. allocated memory: 109479981
> 2019-04-03 15:00:05,696 INFO  
> org.apache.parquet.hadoop.InternalParquetRecordWriter         - mem size 
> 135014409 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 15:00:05,696 INFO  
> org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
> columnStore to file. allocated memory: 102677684
> 2019-04-03 15:00:08,671 ERROR org.apache.flink.runtime.operators.BatchTask    
>               - Error in task code:  CHAIN Partition -> FlatMap 
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This 
> indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,714 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - 
> Un-registering task and sending final execution state FAILED to JobManager 
> for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,812 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Triggering cancellation of task code DataSink ()
>  
>  
> 2019-04-03 13:48:44,562 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner    
>               -     
> -Dlog.file=/data8/hadoop/yarn/log/application_1553017480503_0158/container_1553017480503_0158_01_000038/taskmanager.log
> 2019-04-03 14:59:18,620 INFO  
> org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
> columnStore to file. allocated memory: 0
> 2019-04-03 14:59:48,088 INFO  
> org.apache.parquet.hadoop.InternalParquetRecordWriter         - mem size 
> 136179972 > 134217728: flushing 1930100 records to disk.
> 2019-04-03 14:59:48,088 INFO  
> org.apache.parquet.hadoop.InternalParquetRecordWriter         - Flushing mem 
> columnStore to file. allocated memory: 103333893
> 2019-04-03 15:00:08,692 ERROR org.apache.flink.runtime.operators.BatchTask    
>               - Error in task code:  CHAIN Partition -> FlatMap
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
> Lost connection to task manager 'hadoop9/192.168.230.69:40140'. This 
> indicates that the remote task manager was lost.
> 2019-04-03 15:00:08,741 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - 
> Un-registering task and sending final execution state FAILED to JobManager 
> for task CHAIN Partition -> FlatMap
> 2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Attempting to cancel task DataSink ()
> 2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - DataSink () switched from RUNNING to CANCELING.
> 2019-04-03 15:00:08,817 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Triggering cancellation of task code DataSink ()
>  
>  
> 2019-04-03 15:00:09,196 INFO  
> org.apache.flink.runtime.dispatcher.MiniDispatcher            - Job 
> a6c4e367698c15cdf168d19a89faff1d reached globally terminal state FAILED.
>  
>   
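The timeline in the quoted logs can be checked with a small script. The following is only a sketch: the timestamps are copied from the log lines above for container container_1553017480503_0158_01_000077 and from the first RemoteTransportException, while the event labels and the helper name gaps are made up for illustration. By these timestamps the first interval is about 68 seconds, as described, and the second comes out closer to 40 seconds than 60.

```python
from datetime import datetime

# Log timestamp layout used by the quoted Flink logs, e.g. "2019-04-03 14:58:15,188".
FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the quoted logs; the labels are illustrative only.
events = [
    ("task FINISHED", "2019-04-03 14:58:15,188"),
    ("idle slot released", "2019-04-03 14:59:23,049"),
    ("container stopped", "2019-04-03 15:00:02,641"),
    ("first lost-connection error", "2019-04-03 15:00:08,671"),
]


def gaps(evts):
    """Return (start_label, end_label, seconds) for each consecutive pair of events."""
    parsed = [(name, datetime.strptime(ts, FMT)) for name, ts in evts]
    return [
        (a[0], b[0], (b[1] - a[1]).total_seconds())
        for a, b in zip(parsed, parsed[1:])
    ]


for start, end, seconds in gaps(events):
    print(f"{start} -> {end}: {seconds:.3f} s")
```

The last interval, from the container stop to the first error on a remaining taskmanager, is about 6 seconds, consistent with the "within 10 seconds" observation in the description.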



