Dong Lin created FLINK-31681:
--------------------------------

             Summary: Network connection timeout between operators should 
trigger either network re-connection or job failover
                 Key: FLINK-31681
                 URL: https://issues.apache.org/jira/browse/FLINK-31681
             Project: Flink
          Issue Type: Bug
            Reporter: Dong Lin


If a network connection error occurs between two operators, the upstream 
operator may log the error message quoted below in the method 
PartitionRequestQueue#handleException and then close the connection. 
When this happens, the Flink job may hang indefinitely, neither completing 
nor failing. 

To avoid this issue, we can either let the upstream operator reconnect 
to the downstream operator, or trigger job failover so that users can take 
corrective action promptly.

org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered 
error while consuming partitions 
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors#NativeIOException: 
writeAccess(...) failed: Connection timed out.
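
The two options proposed above (reconnect, or fail the job) could be 
combined into one policy. The following is only a rough sketch of that idea, 
not Flink code: ConnectionErrorPolicy, Outcome, onConnectionError, and the 
reconnect and failJob callbacks are all hypothetical names made up for 
illustration, and the retry limit is an assumed parameter.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names exist in Flink. */
final class ConnectionErrorPolicy {

    /** Result of handling a single connection error. */
    enum Outcome { RECONNECTED, JOB_FAILED }

    private final int maxReconnectAttempts; // assumed, user-chosen limit

    ConnectionErrorPolicy(int maxReconnectAttempts) {
        if (maxReconnectAttempts < 0) {
            throw new IllegalArgumentException("maxReconnectAttempts must be >= 0");
        }
        this.maxReconnectAttempts = maxReconnectAttempts;
    }

    /**
     * Calls reconnect up to maxReconnectAttempts times. If no attempt
     * succeeds, passes the original cause to failJob so that the error
     * surfaces instead of the job hanging silently.
     */
    Outcome onConnectionError(
            Throwable cause, BooleanSupplier reconnect, Consumer<Throwable> failJob) {
        for (int attempt = 0; attempt < maxReconnectAttempts; attempt++) {
            if (reconnect.getAsBoolean()) {
                return Outcome.RECONNECTED;
            }
        }
        failJob.accept(cause);
        return Outcome.JOB_FAILED;
    }
}
```

With maxReconnectAttempts set to 0 the sketch degenerates to the second 
option alone: every connection error is reported to failJob immediately.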



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
