[ https://issues.apache.org/jira/browse/FLINK-17992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-17992:
-----------------------------------
    Labels: pull-request-available  (was: )

> Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-17992
>                 URL: https://issues.apache.org/jira/browse/FLINK-17992
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.10.0, 1.10.1
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> RemoteInputChannel#onBuffer is invoked by
> CreditBasedPartitionRequestClientHandler while it receives and decodes
> network data. #onBuffer can throw an exception, which tags the client
> handler with an error and fails all input channels registered with that
> handler. This can cause the following subtle issue.
> If the RemoteInputChannel is being canceled by the canceler thread, the task
> thread might exit earlier than the canceler thread terminates. That means the
> PartitionRequestClient might not yet be closed (closing is triggered by the
> canceler thread) when the new task attempt is already deployed to this
> TaskManager. The new task might therefore reuse the previous
> PartitionRequestClient when requesting partitions, even though the respective
> client handler was already tagged with an error by the earlier
> RemoteInputChannel#onBuffer call. This causes another, unnecessary failover.
> This issue is hard to detect in production because the job eventually
> recovers after one or more additional failovers. We found it through
> UnalignedCheckpointITCase, because that test expects an exact number of
> restarts for the configured failures.
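The difference between failing the whole handler and failing only the affected channel can be illustrated with a small, self-contained model. This is only a sketch and not Flink's code: `Channel`, `Handler`, `failAllOnChannelError` and `newAttemptFails` are hypothetical stand-ins for RemoteInputChannel, the client handler and the two behaviours discussed in this issue, and the sketch assumes that a handler tagged with an error fails every channel added to it afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the issue; none of these classes are Flink's.
final class ClientHandlerSketch {

    /** Stand-in for RemoteInputChannel. */
    static final class Channel {
        final String name;
        final boolean failsOnBuffer;
        Throwable error; // non-null once this channel (and thus its task) has failed

        Channel(String name, boolean failsOnBuffer) {
            this.name = name;
            this.failsOnBuffer = failsOnBuffer;
        }

        void onBuffer(String data) {
            if (failsOnBuffer) {
                throw new IllegalStateException("onBuffer failed in " + name);
            }
        }

        void onError(Throwable cause) {
            if (error == null) {
                error = cause;
            }
        }
    }

    /** Stand-in for the client handler shared by all channels on one connection. */
    static final class Handler {
        final boolean failAllOnChannelError;
        final List<Channel> channels = new ArrayList<>();
        Throwable handlerError; // the "tagged" error of the handler, if any

        Handler(boolean failAllOnChannelError) {
            this.failAllOnChannelError = failAllOnChannelError;
        }

        void addChannel(Channel channel) {
            channels.add(channel);
            // Assumption of this sketch: a tagged handler fails channels added later.
            if (handlerError != null) {
                channel.onError(handlerError);
            }
        }

        void receive(Channel channel, String data) {
            try {
                channel.onBuffer(data);
            } catch (RuntimeException e) {
                if (failAllOnChannelError) {
                    // Behaviour described as problematic: tag the handler, fail every channel.
                    handlerError = e;
                    for (Channel c : channels) {
                        c.onError(e);
                    }
                } else {
                    // Behaviour proposed in the issue title: fail only the affected channel.
                    channel.onError(e);
                }
            }
        }
    }

    /** Returns true if a channel of a new task attempt fails when it reuses the handler. */
    static boolean newAttemptFails(boolean failAllOnChannelError) {
        Handler handler = new Handler(failAllOnChannelError);
        Channel oldAttempt = new Channel("old-attempt", true);
        handler.addChannel(oldAttempt);
        handler.receive(oldAttempt, "buffer"); // onBuffer throws here

        // The client has not been closed yet, so the new attempt reuses the same handler.
        Channel newAttempt = new Channel("new-attempt", false);
        handler.addChannel(newAttempt);
        return newAttempt.error != null;
    }

    public static void main(String[] args) {
        System.out.println("fail whole handler -> new attempt fails: " + newAttemptFails(true));
        System.out.println("fail only channel  -> new attempt fails: " + newAttemptFails(false));
    }
}
```

In this model the new attempt is failed only when the handler itself was tagged with the error, which corresponds to the extra failover round described above.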
> The solution is to fail only the respective task when its
> RemoteInputChannel#onBuffer throws an exception, instead of failing all
> channels in the client handler. The client then remains healthy and can
> still be reused by other input channels as long as it has not been released.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)