[jira] [Commented] (FLINK-25055) Support listen and notify mechanism for PartitionRequest

Zhilong Hong (Jira) Mon, 07 Feb 2022 22:31:38 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-25055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488616#comment-17488616
 ]


Zhilong Hong commented on FLINK-25055:
--------------------------------------

I think this improvement will also solve FLINK-24295. In FLINK-24295, we find 
that too many {{requestPartitionState}} checks will jam the JobManager during 
task deployment. Once the notify mechanism is introduced, 
{{PartitionNotFoundException}} will not be raised for scenario Till mentioned 
above (for A->B, B gets deployed before A has registered its partition on the 
TaskManager). There will be less {{requestPartitionState}} checks during task 
deployment. I'm wondering what the progress of this issue is, [~zjureel]?

> Support listen and notify mechanism for PartitionRequest
> --------------------------------------------------------
>
>                 Key: FLINK-25055
>                 URL: https://issues.apache.org/jira/browse/FLINK-25055
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Network
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Shammon
>            Assignee: Shammon
>            Priority: Major
>
> We submit batch jobs to flink session cluster with eager scheduler for olap. 
> JM deploys subtasks to TaskManager independently, and the downstream subtasks 
> may start before the upstream ones are ready. The downstream subtask sends 
> PartitionRequest to upstream ones, and may receive PartitionNotFoundException 
> from them. Then it will retry to send PartitionRequest after a few ms until 
> timeout.
> The current approach raises two problems. First, there will be too many retry 
> PartitionRequest messages. Each downstream subtask will send PartitionRequest 
> to all its upstream subtasks and the total number of messages will be O(N*N), 
> where N is the parallelism of subtasks. Secondly, the interval between 
> polling retries will increase the delay for upstream and downstream tasks to 
> confirm PartitionRequest.
> We want to support listen and notify mechanism for PartitionRequest when the 
> job needs no failover. Upstream TaskManager will add the PartitionRequest to 
> a listen list with a timeout checker, and notify the request when the task 
> register its partition in the TaskManager.
> [~nkubicek] I noticed that your scenario of using flink is similar to ours. 
> What do you think?  And hope to hear from you [~trohrmann] THX



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (FLINK-25055) Support listen and notify mechanism for PartitionRequest

Reply via email to