[jira] [Commented] (FLINK-15416) Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel

Piotr Nowojski (Jira) Tue, 31 Mar 2020 07:56:18 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-15416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071859#comment-17071859
 ]


Piotr Nowojski commented on FLINK-15416:
----------------------------------------

As I have responded [in the 
ticket|https://github.com/apache/flink/pull/11541#issuecomment-606051120], in 
the end we decided to go on with this improvement, as a first step towards 
full/proper/general retry mechanism in case of network connection loss. 

Full implementation (not just for establishing connection) would require a 
follow up work and a FLIP document, as it would probably introduce some data 
buffering mechanism and some acknowledgement mechanism.

> Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel
> -----------------------------------------------------------------------
>
>                 Key: FLINK-15416
>                 URL: https://issues.apache.org/jira/browse/FLINK-15416
>             Project: Flink
>          Issue Type: Wish
>          Components: Runtime / Network
>    Affects Versions: 1.10.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We run a flink with 256 TMs in production. The job internally has keyby 
> logic. Thus, it builds a 256 * 256 communication channels. An outage happened 
> when there is a chip internal link of one of the network switchs broken that 
> connecting these machines. During the outage, the flink can't restart 
> successfully as there is always an exception like  "Connecting the channel 
> failed: Connecting to remote task manager + '****/10.14.139.6:41300' has 
> failed. This might indicate that the remote task manager has been lost. 
> After deep investigation with the network infrastructure team, we found there 
> are 6 switchs connecting with these machines. Each switch has 32 physcal 
> links. Every socket is round-robin assigned to each of links for load 
> balances. Thus, there is always average 256 * 256 / 6 * 32  * 2 = 170 
> channels will be assigned to the broken link. The issue lasted for 4 hours 
> until we found the broken link and restart the problematic switch. 
> Given this, we found that the retry of creating channel will help to resolve 
> this issue. For our networking topology, we can set retry to 2. As 170 / (132 
> * 132) < 1, which means after retry twice no channel in 170 channels will be 
> assigned to the broken link in the average case.
> I think it is valuable fix for this kind of partial network partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-15416) Add Retry Mechanism for PartitionRequestClientFactory.ConnectingChannel

Reply via email to