Jin Xing created FLINK-22676:
--------------------------------

             Summary: The partition tracker should support remote shuffle properly
                 Key: FLINK-22676
                 URL: https://issues.apache.org/jira/browse/FLINK-22676
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Network
            Reporter: Jin Xing


Currently in Flink, a data partition is bound to the ResourceID of a TM in 
Execution#startTrackingPartitions, and the partition tracker stops tracking 
the corresponding partitions when that TM disconnects 
(JobMaster#disconnectTaskManager). In other words, the lifecycle of shuffle 
data is bound to the computing resource (TM). This works fine for the internal 
shuffle service, but not for a remote shuffle service. When shuffle data is 
stored remotely, the lifecycle of a completed partition can be decoupled from 
the TM: a TM can safely be released once no computing task is running on it, 
and further shuffle read requests can be directed to the remote shuffle 
cluster. In addition, when a TM is lost, its completed data partitions on the 
remote shuffle cluster do not need to be reproduced.
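
To make the proposed behaviour concrete, here is a minimal, self-contained 
sketch. It is not Flink's actual partition tracker: the class 
PartitionTrackerSketch, the enum DataLocation, and the method names are all 
hypothetical and exist only for illustration. The sketch assumes the tracker 
knows where each partition's data is stored and, on TM disconnect, stops 
tracking only the partitions whose data lived on that TM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of a partition tracker (not Flink's API). */
public class PartitionTrackerSketch {

    /** Assumption: the tracker is told where a partition's data is stored. */
    public enum DataLocation { TASK_MANAGER, REMOTE_SHUFFLE_CLUSTER }

    /** Bookkeeping for one tracked partition. */
    private static final class Tracked {
        final String producerTmId;
        final DataLocation location;

        Tracked(String producerTmId, DataLocation location) {
            this.producerTmId = producerTmId;
            this.location = location;
        }
    }

    // partition id -> producer TM and storage location
    private final Map<String, Tracked> tracked = new HashMap<>();

    public void startTracking(String producerTmId, String partitionId, DataLocation location) {
        tracked.put(partitionId, new Tracked(producerTmId, location));
    }

    /**
     * Stops tracking only the partitions whose data was stored on the
     * disconnected TM; remotely stored partitions stay tracked.
     * Returns the ids of the partitions that were dropped.
     */
    public List<String> onTaskManagerDisconnected(String tmId) {
        List<String> stopped = new ArrayList<>();
        Iterator<Map.Entry<String, Tracked>> it = tracked.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Tracked> entry = it.next();
            Tracked t = entry.getValue();
            if (t.producerTmId.equals(tmId) && t.location == DataLocation.TASK_MANAGER) {
                stopped.add(entry.getKey());
                it.remove();
            }
        }
        return stopped;
    }

    public boolean isTracked(String partitionId) {
        return tracked.containsKey(partitionId);
    }

    public static void main(String[] args) {
        PartitionTrackerSketch tracker = new PartitionTrackerSketch();
        tracker.startTracking("tm-1", "p-local", DataLocation.TASK_MANAGER);
        tracker.startTracking("tm-1", "p-remote", DataLocation.REMOTE_SHUFFLE_CLUSTER);

        System.out.println("stopped: " + tracker.onTaskManagerDisconnected("tm-1"));
        System.out.println("p-remote still tracked: " + tracker.isTracked("p-remote"));
    }
}
```

In this sketch the partition stored on the remote shuffle cluster survives the 
loss of tm-1, so consumers could still read it and it would not have to be 
reproduced. How the real tracker should learn the storage location is left 
open here.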


