Dong Lin created FLINK-31681:
--------------------------------

             Summary: Network connection timeout between operators should trigger either network re-connection or job failover
                 Key: FLINK-31681
                 URL: https://issues.apache.org/jira/browse/FLINK-31681
             Project: Flink
          Issue Type: Bug
            Reporter: Dong Lin

If a network connection error occurs between two operators, the upstream operator may log the following error in the method PartitionRequestQueue#handleException and then close the connection:

org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors#NativeIOException: writeAccess(...) failed: Connection timed out.

When this happens, the Flink job may become stuck, neither completing nor failing. To avoid this, we can either let the upstream operator reconnect to the downstream operator, or trigger job failover so that users can take corrective action promptly.