[ 
https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-15416.
----------------------------

Merged in master: 0175bf5d69013a48eb501d78ac7a231c9fa335c1

> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
>                 Key: FLINK-15416
>                 URL: https://issues.apache.org/jira/browse/FLINK-15416
>             Project: Flink
>          Issue Type: Wish
>          Components: Runtime / Network
>    Affects Versions: 1.10.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We run a flink with 256 TMs in production. The job internally has keyby 
> logic. Thus, it builds a 256 * 256 communication channels. An outage happened 
> when there is a chip internal link of one of the network switchs broken that 
> connecting these machines. During the outage, the flink can't restart 
> successfully as there is always an exception like  "Connecting the channel 
> failed: Connecting to remote task manager + '****/10.14.139.6:41300' has 
> failed. This might indicate that the remote task manager has been lost. 
> After deep investigation with the network infrastructure team, we found there 
> are 6 switchs connecting with these machines. Each switch has 32 physcal 
> links. Every socket is round-robin assigned to each of links for load 
> balances. Thus, there is always average 256 * 256 / 6 * 32  * 2 = 170 
> channels will be assigned to the broken link. The issue lasted for 4 hours 
> until we found the broken link and restart the problematic switch. 
> Given this, we found that the retry of creating channel will help to resolve 
> this issue. For our networking topology, we can set retry to 2. As 170 / (132 
> * 132) < 1, which means after retry twice no channel in 170 channels will be 
> assigned to the broken link in the average case.
> I think it is valuable fix for this kind of partial network partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to