[ https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538931#comment-17538931 ]
zlzhang0122 commented on FLINK-27608:
-------------------------------------

[~Thesharing] First, thanks for your quick response and really detailed explanation. And yes, I agree with you: there is only one scenario here, because it is a distributed environment. The reason it takes such a long time to deploy the upstream tasks is that they have a large state to restore, and this may happen very frequently. So the problem comes back to where it started: the value of taskmanager.network.request-backoff.max is not easy to decide. Can we have a better solution to deal with it? Thanks again!!

> Flink may throw PartitionNotFound Exception if the downstream task reached
> Running state earlier than its upstream task
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27608
>                 URL: https://issues.apache.org/jira/browse/FLINK-27608
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.14.2
>            Reporter: zlzhang0122
>            Priority: Major
>         Attachments: exception.txt
>
>
> Flink streaming job deployment may throw PartitionNotFound Exception if the
> downstream task reached Running state earlier than its upstream task and
> after maximum backoff for partition requests passed. But the config of
> taskmanager.network.request-backoff.max is not easy to decide. Can we use a
> loop that waits until the upstream task partition is ready?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)