Zhenqiu Huang created FLINK-15416:
-------------------------------------

             Summary: Add Retry Mechanism for 
PartitionRequestClientFactory.ConnectingChannel
                 Key: FLINK-15416
                 URL: https://issues.apache.org/jira/browse/FLINK-15416
             Project: Flink
          Issue Type: Wish
          Components: Runtime / Network
    Affects Versions: 1.10.0
            Reporter: Zhenqiu Huang


We run a flink with 256 TMs in production. The job internally has keyby logic. 
Thus, it builds a 256 * 256 communication channels. An outage happened when 
there is a chip internal link of one of the network switchs broken that 
connecting these machines. During the outage, the flink can't restart 
successfully as there is always an exception like  "Connecting the channel 
failed: Connecting to remote task manager + '****/10.14.139.6:41300' has 
failed. This might indicate that the remote task manager has been lost. 

After deep investigation with the network infrastructure team, we found there 
are 6 switchs connecting with these machines. Each switch has 32 physcal links. 
Every socket is round-robin assigned to each of links for load balances. Thus, 
there is always average 256 * 256 / 6 * 32  * 2 = 170 channels will be assigned 
to the broken link. The issue lasted for 4 hours until we found the broken link 
and restart the problematic switch. 

Given this, we found that the retry of creating channel will help to resolve 
this issue. For our networking topology, we can set retry to 2. As 170 / (132 * 
132) < 1, which means after retry twice no channel in 170 channels will be 
assigned to the broken link in the average case.

I think it is valuable fix for this kind of partial network partition.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to