[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task

2022-05-18 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538931#comment-17538931
 ] 

zlzhang0122 commented on FLINK-27608:
-

[~Thesharing] First, thanks for your quickly response and really detailed 
explanation. And yes, I agree with you, there is only one scenario here because 
it is a distributed environment. The reason why it takes such a long time to 
deploy the upstream tasks is the upstream tasks has a large state to restore. 
And sometimes this may be happen very frequently. So the problem comes back to 
the beginning that  the config of taskmanager.network.request-backoff.max is 
not easy to decide and can we have some better solution to deal with it? Thanks 
again!!

> Flink may throw PartitionNotFound Exception if the downstream task reached 
> Running state earlier than it's upstream task
> 
>
> Key: FLINK-27608
> URL: https://issues.apache.org/jira/browse/FLINK-27608
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.14.2
>Reporter: zlzhang0122
>Priority: Major
> Attachments: exception.txt
>
>
> Flink streaming job deployment may throw PartitionNotFound Exception if the 
> downstream task reached Running state earlier than its upstream task and 
> after maximum backoff for partition requests passed.But the config of 
> taskmanager.network.request-backoff.max is not eay to decide. Can we use a 
> loop awaiting the upstream task partition be ready?
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task

2022-05-16 Thread Zhilong Hong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537721#comment-17537721
 ] 

Zhilong Hong commented on FLINK-27608:
--

I've read the log you uploaded. In fact, there's only one scenario here. The 
retry mechanism is always working. In your log, you can see that the exception 
happens in the method {{retriggerPartitionRequest}} located in the 
SingleInputGate. The exception is thrown out because there are too many failed 
retries for the partition request, and the backoff time has reached the maximum 
value. The partition request shouldn't failed that many times. Thus, you should 
find out why it takes such a long time to deploy the upstream tasks. Or you 
could increase the value of the config 
{{taskmanager.network.request-backoff.max}}, which allows more retries for the 
partition request.

> Flink may throw PartitionNotFound Exception if the downstream task reached 
> Running state earlier than it's upstream task
> 
>
> Key: FLINK-27608
> URL: https://issues.apache.org/jira/browse/FLINK-27608
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.14.2
>Reporter: zlzhang0122
>Priority: Major
> Attachments: exception.txt
>
>
> Flink streaming job deployment may throw PartitionNotFound Exception if the 
> downstream task reached Running state earlier than its upstream task and 
> after maximum backoff for partition requests passed.But the config of 
> taskmanager.network.request-backoff.max is not eay to decide. Can we use a 
> loop awaiting the upstream task partition be ready?
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task

2022-05-16 Thread zlzhang0122 (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537500#comment-17537500
 ] 

zlzhang0122 commented on FLINK-27608:
-

[~Thesharing] Thanks for your detailed reply. I think the scenario you have 
mentioned is very useful and is one of the scenarios. The case I have met is 
another scenario, in that case, the akka message maybe miss or timeout, and I 
have upload a log about that to describe it.Correct me if I'm wrong?

> Flink may throw PartitionNotFound Exception if the downstream task reached 
> Running state earlier than it's upstream task
> 
>
> Key: FLINK-27608
> URL: https://issues.apache.org/jira/browse/FLINK-27608
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.14.2
>Reporter: zlzhang0122
>Priority: Major
> Attachments: exception.txt
>
>
> Flink streaming job deployment may throw PartitionNotFound Exception if the 
> downstream task reached Running state earlier than its upstream task and 
> after maximum backoff for partition requests passed.But the config of 
> taskmanager.network.request-backoff.max is not eay to decide. Can we use a 
> loop awaiting the upstream task partition be ready?
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27608) Flink may throw PartitionNotFound Exception if the downstream task reached Running state earlier than it's upstream task

2022-05-13 Thread Zhilong Hong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536840#comment-17536840
 ] 

Zhilong Hong commented on FLINK-27608:
--

As a {{PartitionNotFoundException}} is thrown, it will be handled in by the 
logic located at 
{{{}org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler:298{}}}.
 The {{Task}} will try to {{requestPartitionProducerState}} from the 
JobManager. If the upstream task is not ready (i.e. in the DEPLOYING or 
INITIALIZING state), the {{SingleInputGate}} will  try to retrigger another 
partition request until the partition is consumable.

> Flink may throw PartitionNotFound Exception if the downstream task reached 
> Running state earlier than it's upstream task
> 
>
> Key: FLINK-27608
> URL: https://issues.apache.org/jira/browse/FLINK-27608
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.14.2
>Reporter: zlzhang0122
>Priority: Major
> Fix For: 1.16.0
>
>
> Flink streaming job deployment may throw PartitionNotFound Exception if the 
> downstream task reached Running state earlier than its upstream task and 
> after maximum backoff for partition requests passed.But the config of 
> taskmanager.network.request-backoff.max is not eay to decide. Can we use a 
> loop awaiting the upstream task partition be ready?
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)