[ 
https://issues.apache.org/jira/browse/FLINK-25338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461226#comment-17461226
 ] 

Xintong Song commented on FLINK-25338:
--------------------------------------

In general, I like the idea that the JM process maintains connections to all 
TMs in one place and forwards only the necessary messages to the JobMasters.

My gut feeling is that the component responsible for managing JM-TM connections 
should be the dispatcher. I'd prefer not to introduce a new component at the 
cluster entry point level. Maybe we can split the effort into 2 steps? We can 
first make this improvement with a single-threaded dispatcher, and introduce a 
thread-pool-based approach later if the dispatcher indeed turns out to be a 
bottleneck. WDYT?
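
To make the first step concrete, here is a rough sketch of single-threaded 
routing: every TM event arrives at one place and is forwarded to the handler 
registered for its job. All names (`TmEvent`, `SingleThreadRouter`, 
`registerJobMaster`, `onTaskManagerEvent`) are hypothetical and are not 
existing Flink classes or APIs; a plain Java executor stands in for the 
dispatcher's main thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class DispatcherRoutingSketch {

    /** Hypothetical event a TaskManager sends over its single connection. */
    static final class TmEvent {
        final String taskManagerId;
        final String jobId;
        final String payload;

        TmEvent(String taskManagerId, String jobId, String payload) {
            this.taskManagerId = taskManagerId;
            this.jobId = jobId;
            this.payload = payload;
        }
    }

    /** Hypothetical router; one thread plays the dispatcher's main thread. */
    static final class SingleThreadRouter {
        private final ExecutorService mainThread =
                Executors.newSingleThreadExecutor();
        // Only accessed from mainThread, so a plain HashMap is sufficient.
        private final Map<String, Consumer<TmEvent>> jobMasters = new HashMap<>();

        void registerJobMaster(String jobId, Consumer<TmEvent> handler) {
            mainThread.execute(() -> jobMasters.put(jobId, handler));
        }

        void onTaskManagerEvent(TmEvent event) {
            mainThread.execute(() -> {
                Consumer<TmEvent> handler = jobMasters.get(event.jobId);
                if (handler != null) {
                    handler.accept(event);
                }
                // Events for unknown jobs are silently dropped in this sketch.
            });
        }

        /** Stops accepting work and waits for queued events to be routed. */
        void shutdownAndAwait() {
            mainThread.shutdown();
            try {
                mainThread.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Routes three events and returns what the handler for job-a saw. */
    static List<String> demo() {
        List<String> seenByJobA = new ArrayList<>();
        SingleThreadRouter router = new SingleThreadRouter();
        router.registerJobMaster("job-a", e -> seenByJobA.add(e.payload));
        router.onTaskManagerEvent(new TmEvent("tm-1", "job-a", "heartbeat"));
        router.onTaskManagerEvent(new TmEvent("tm-1", "job-b", "heartbeat"));
        router.onTaskManagerEvent(new TmEvent("tm-2", "job-a", "task-state-update"));
        router.shutdownAndAwait();
        return seenByJobA;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [heartbeat, task-state-update]
    }
}
```

In this sketch, moving to step two would mean replacing the single-thread 
executor with a pool for the receive/deserialize work only, which is why it 
seems reasonable to defer that until the single thread is shown to be a 
bottleneck.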

Moreover, I think this improvement changes the architecture of how RPC 
endpoints talk to each other, and thus may deserve a FLIP discussion and a 
formal vote.

> Improvement of connection from TM to JM in session cluster
> ----------------------------------------------------------
>
>                 Key: FLINK-25338
>                 URL: https://issues.apache.org/jira/browse/FLINK-25338
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.7, 1.13.5, 1.14.2
>            Reporter: Shammon
>            Priority: Major
>
> When a taskmanager receives a slot request from the resourcemanager for a 
> specific job, it connects to the jobmaster at the given job address. The 
> taskmanager registers itself, monitors the heartbeat of the job and updates 
> task state over this connection. There's no need to create a connection in 
> a taskmanager for each job, and when the taskmanager is busy, this 
> increases job latency. 
> One idea is that the taskmanager manages a connection to the `Dispatcher`, 
> sends events such as heartbeats and state updates to the `Dispatcher`, and 
> the `Dispatcher` tells the local `JobMaster`. The main problem is that the 
> `Dispatcher` is an actor and can only execute in one thread, so it may 
> become a performance bottleneck when deserializing events.
> The other idea is to create a netty service in `SessionClusterEntrypoint` 
> that can receive and deserialize events from taskmanagers in a thread pool 
> and send them to the `Dispatcher` or `JobMaster`. Taskmanagers manage their 
> connection to the netty service from startup. Such a service could also 
> receive the result of a job from a taskmanager later.
> [~xtsong] What do you think? THX



