[ https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071859#comment-17071859 ]
Piotr Nowojski commented on FLINK-15416:
----------------------------------------

As I have responded [in the ticket|https://github.com/apache/flink/pull/11541#issuecomment-606051120], in the end we decided to go ahead with this improvement as a first step towards a full, general retry mechanism for network connection loss. A full implementation (not just for establishing the connection) would require follow-up work and a FLIP document, as it would probably introduce some data buffering and acknowledgement mechanisms.

> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
> Key: FLINK-15416
> URL: https://issues.apache.org/jira/browse/FLINK-15416
> Project: Flink
> Issue Type: Wish
> Components: Runtime / Network
> Affects Versions: 1.10.0
> Reporter: Zhenqiu Huang
> Assignee: Zhenqiu Huang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We run a Flink job with 256 TMs in production. The job has keyBy logic
> internally, so it builds 256 * 256 communication channels. An outage
> happened when an internal chip link broke in one of the network switches
> connecting these machines. During the outage, the job could not restart
> successfully because there was always an exception like "Connecting the
> channel failed: Connecting to remote task manager + '****/10.14.139.6:41300'
> has failed. This might indicate that the remote task manager has been lost."
> After a deep investigation with the network infrastructure team, we found
> that 6 switches connect these machines. Each switch has 32 physical links,
> and every socket is assigned round-robin to one of the links for load
> balancing. Thus, on average 256 * 256 / (6 * 32 * 2) ≈ 170 channels are
> assigned to the broken link. The issue lasted for 4 hours until we found the
> broken link and restarted the problematic switch.
> Given this, we found that retrying channel creation would help resolve this
> issue. For our network topology, we can set the retry count to 2, since
> 170 / (132 * 132) < 1, which means that after two retries, on average none
> of the 170 channels will be assigned to the broken link.
> I think this is a valuable fix for this kind of partial network partition.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
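
The back-of-the-envelope estimate in the issue description can be reproduced with a short sketch. It is only an illustration, not part of the Flink change. The function and parameter names are hypothetical, and the unexplained factor of 2 from the description is kept as a plain parameter. The sketch also assumes that each retry lands on an independently chosen link, which is a simplification of round-robin assignment. The description's second estimate divides by 132 * 132, whose derivation is not given; the sketch instead reuses the 6 * 32 * 2 denominator, which is an assumption.

```python
def expected_channels_on_broken_link(task_managers, switches, links_per_switch, factor=2):
    """Average number of channels that land on one broken link.

    `factor` mirrors the extra "* 2" in the issue description (assumption:
    its meaning is not explained there).
    """
    channels = task_managers * task_managers          # 256 * 256 in the report
    link_slots = switches * links_per_switch * factor  # 6 * 32 * 2 in the report
    return channels / link_slots


def expected_failures_after_retries(initial_failures, link_slots, retries):
    """Expected channels still hitting the broken link after `retries` retries.

    Assumes each retry is assigned to a link uniformly and independently,
    so each retry hits the broken link with probability 1 / link_slots.
    """
    return initial_failures * (1.0 / link_slots) ** retries


if __name__ == "__main__":
    slots = 6 * 32 * 2
    first_attempt = expected_channels_on_broken_link(256, 6, 32)
    after_two = expected_failures_after_retries(first_attempt, slots, retries=2)
    print(f"channels on broken link at first attempt: {first_attempt:.1f}")
    print(f"expected channels still failing after 2 retries: {after_two:.5f}")
```

Under these assumptions the first-attempt figure matches the roughly 170 channels quoted in the description, and the expected number of channels still failing after two retries is well below 1, which is consistent with the reporter's conclusion that a retry count of 2 is enough for this topology.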