[ 
https://issues.apache.org/jira/browse/FLINK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dizhou cao updated FLINK-32191:
-------------------------------
    Description: We encountered a case in our production environment where the 
Netty client was unable to send data to the server because of a fault on a 
switch link. However, the client can only detect the fault after 
retransmission fails on RTO timeout, which takes about 15 minutes in our 
production environment. This can leave the job unavailable for 15 minutes. We 
would like to fail over and reschedule the job more quickly. Flink already 
enables keepalive, but the default keepalive idle time is 2 hours. We can 
shorten the TCP keepalive timeout by configuring TCP_KEEPIDLE, 
TCP_KEEPINTERVAL, and TCP_KEEPCOUNT. These options are already supported by Netty.  (was: 
We encountered a case in our production environment where netty client was 
unable to send data downstream due to an abnormality in the switch link. 
However, client can only detect the abnormality after RTO timeout 
retransmission failure, which takes about 15 minutes in our production 
environment. This may result in a 15-minute job unavailability. We hope to 
perform failover and reschedule job more quickly. Flink has already enabled 
keepalive, but the default keepalive idle time is 2 hours. We can adjust the 
timeout of TCP keepalive by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and 
TCP_KEEPCOUNT. These configurations are already supported at the Netty.)

> Support for configuring keepalive related parameters.
> -----------------------------------------------------
>
>                 Key: FLINK-32191
>                 URL: https://issues.apache.org/jira/browse/FLINK-32191
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>            Reporter: dizhou cao
>            Priority: Minor
>
> We encountered a case in our production environment where the Netty client 
> was unable to send data to the server because of a fault on a switch link. 
> However, the client can only detect the fault after retransmission fails on 
> RTO timeout, which takes about 15 minutes in our production environment. 
> This can leave the job unavailable for 15 minutes. We would like to fail 
> over and reschedule the job more quickly. Flink already enables keepalive, 
> but the default keepalive idle time is 2 hours. We can shorten the TCP 
> keepalive timeout by configuring TCP_KEEPIDLE, TCP_KEEPINTERVAL, and 
> TCP_KEEPCOUNT. These options are already supported by Netty.
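
To illustrate the three parameters named above, here is a minimal sketch 
using the JDK's jdk.net.ExtendedSocketOptions (Java 11+) on a plain 
SocketChannel. It is not Flink or Netty code, and it does not show any Flink 
configuration key. The class name KeepaliveSketch, the helper names, and the 
values 60/10/3 are assumptions chosen for illustration. On an idle 
connection, a dead peer is declared after roughly idle + interval * count 
seconds. With the 2-hour idle time mentioned above plus the usual Linux 
defaults of a 75-second interval and 9 probes, that is 7875 seconds.

```java
import java.io.IOException;
import java.net.SocketOption;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import java.util.Set;
import jdk.net.ExtendedSocketOptions;

public class KeepaliveSketch {

    /** Approximate seconds before an idle, dead peer is detected: idle + interval * count. */
    static int detectionSeconds(int idleSec, int intervalSec, int count) {
        return idleSec + intervalSec * count;
    }

    /**
     * Enables SO_KEEPALIVE and, if the platform supports them, applies the
     * three tuning options. Returns true only if all three were applied.
     */
    static boolean applyKeepalive(SocketChannel ch, int idleSec, int intervalSec, int count)
            throws IOException {
        ch.setOption(StandardSocketOptions.SO_KEEPALIVE, true);

        // The extended options are platform dependent, so check before setting them.
        Set<SocketOption<?>> supported = ch.supportedOptions();
        if (!supported.contains(ExtendedSocketOptions.TCP_KEEPIDLE)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPINTERVAL)
                || !supported.contains(ExtendedSocketOptions.TCP_KEEPCOUNT)) {
            return false;
        }
        ch.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, idleSec);         // idle time before the first probe
        ch.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, intervalSec); // gap between probes
        ch.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, count);          // unanswered probes before giving up
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative values only: 60 s idle, 10 s interval, 3 probes.
        try (SocketChannel ch = SocketChannel.open()) {
            boolean tuned = applyKeepalive(ch, 60, 10, 3);
            System.out.println("tuned=" + tuned
                    + ", detection after about " + detectionSeconds(60, 10, 3) + " s");
        }
    }
}
```

With these illustrative values the detection time drops to about 90 seconds. 
The support check is there because the extended options are not available on 
every platform, in which case only plain SO_KEEPALIVE is enabled.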



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
