[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task
[ https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538931#comment-17538931 ] zlzhang0122 commented on FLINK-27608: - [~Thesharing] First, thanks for your quickly response and really detailed explanation. And yes, I agree with you, there is only one scenario here because it is a distributed environment. The reason why it takes such a long time to deploy the upstream tasks is the upstream tasks has a large state to restore. And sometimes this may be happen very frequently. So the problem comes back to the beginning that the config of taskmanager.network.request-backoff.max is not easy to decide and can we have some better solution to deal with it? Thanks again!! > Flink may throw PartitionNotFound Exception if the downstream task reached > Running state earlier than it's upstream task > > > Key: FLINK-27608 > URL: https://issues.apache.org/jira/browse/FLINK-27608 > Project: Flink > Issue Type: Bug > Components: Runtime / Network >Affects Versions: 1.14.2 >Reporter: zlzhang0122 >Priority: Major > Attachments: exception.txt > > > Flink streaming job deployment may throw PartitionNotFound Exception if the > downstream task reached Running state earlier than its upstream task and > after maximum backoff for partition requests passed.But the config of > taskmanager.network.request-backoff.max is not eay to decide. Can we use a > loop awaiting the upstream task partition be ready? > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task
[ https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537721#comment-17537721 ] Zhilong Hong commented on FLINK-27608: -- I've read the log you uploaded. In fact, there's only one scenario here. The retry mechanism is always working. In your log, you can see that the exception happens in the method {{retriggerPartitionRequest}} located in the SingleInputGate. The exception is thrown out because there are too many failed retries for the partition request, and the backoff time has reached the maximum value. The partition request shouldn't failed that many times. Thus, you should find out why it takes such a long time to deploy the upstream tasks. Or you could increase the value of the config {{taskmanager.network.request-backoff.max}}, which allows more retries for the partition request. > Flink may throw PartitionNotFound Exception if the downstream task reached > Running state earlier than it's upstream task > > > Key: FLINK-27608 > URL: https://issues.apache.org/jira/browse/FLINK-27608 > Project: Flink > Issue Type: Bug > Components: Runtime / Network >Affects Versions: 1.14.2 >Reporter: zlzhang0122 >Priority: Major > Attachments: exception.txt > > > Flink streaming job deployment may throw PartitionNotFound Exception if the > downstream task reached Running state earlier than its upstream task and > after maximum backoff for partition requests passed.But the config of > taskmanager.network.request-backoff.max is not eay to decide. Can we use a > loop awaiting the upstream task partition be ready? > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task
[ https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537500#comment-17537500 ] zlzhang0122 commented on FLINK-27608: - [~Thesharing] Thanks for your detailed reply. I think the scenario you have mentioned is very useful and is one of the scenarios. The case I have met is another scenario, in that case, the akka message maybe miss or timeout, and I have upload a log about that to describe it.Correct me if I'm wrong? > Flink may throw PartitionNotFound Exception if the downstream task reached > Running state earlier than it's upstream task > > > Key: FLINK-27608 > URL: https://issues.apache.org/jira/browse/FLINK-27608 > Project: Flink > Issue Type: Bug > Components: Runtime / Network >Affects Versions: 1.14.2 >Reporter: zlzhang0122 >Priority: Major > Attachments: exception.txt > > > Flink streaming job deployment may throw PartitionNotFound Exception if the > downstream task reached Running state earlier than its upstream task and > after maximum backoff for partition requests passed.But the config of > taskmanager.network.request-backoff.max is not eay to decide. Can we use a > loop awaiting the upstream task partition be ready? > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task
[ https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536840#comment-17536840 ] Zhilong Hong commented on FLINK-27608: -- As a {{PartitionNotFoundException}} is thrown, it will be handled in by the logic located at {{{}org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler:298{}}}. The {{Task}} will try to {{requestPartitionProducerState}} from the JobManager. If the upstream task is not ready (i.e. in the DEPLOYING or INITIALIZING state), the {{SingleInputGate}} will try to retrigger another partition request until the partition is consumable. > Flink may throw PartitionNotFound Exception if the downstream task reached > Running state earlier than it's upstream task > > > Key: FLINK-27608 > URL: https://issues.apache.org/jira/browse/FLINK-27608 > Project: Flink > Issue Type: Bug > Components: Runtime / Network >Affects Versions: 1.14.2 >Reporter: zlzhang0122 >Priority: Major > Fix For: 1.16.0 > > > Flink streaming job deployment may throw PartitionNotFound Exception if the > downstream task reached Running state earlier than its upstream task and > after maximum backoff for partition requests passed.But the config of > taskmanager.network.request-backoff.max is not eay to decide. Can we use a > loop awaiting the upstream task partition be ready? > -- This message was sent by Atlassian Jira (v8.20.7#820007)