Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Xintong Song Sat, 27 Aug 2022 23:50:17 -0700

Sorry for joining the discussion late. And thanks Zhengyu for bringing this
discussion up.


I think this is an interesting topic. I actually had something similar in
mind for a long time. I haven't carried it out for the same concerns as
others already mentioned, that the benefit for this effort may not be that
significant compared to the complexity it introduces. However, I think
Matthias has come up with an inspiring idea, to simply offload some of the
JobMasters to standby JobManager processes, which potentially reduces the
complexity significantly and sounds promising to me.

# My understanding on the session mode

In the early days, a standalone session cluster was probably the most
straightforward way to deploy Flink on a bunch of machines / VMs. Nowadays,
as K8s / Yarn deployments become more and more convenient and application
mode becomes available for the standalone deployment, why (or whether) do
users still need the session mode?

>From my experiences, I see users still want to use session mode in
production mainly for 2 reasons:
- To reduce the job bootstrap time. Leveraging an existing session cluster
would save the time for requesting resources from K8s / Yarn and launching
/ initializing the processes, which in many cases is the major time
consumption in launching a job. This is valued typically for interactive
workloads.
- To improve resource efficiency. Session cluster allows multiple jobs to
share the same JobManager & TaskManager processes. This reduces the
framework resource overhead for small workloads.

# Concrete use cases

I've seen 2 use cases which may benefit from this proposal.

- AFAIK, ByteDance builds a large scale OLAP query service with Flink. They
have a large Flink session cluster that runs >100 sql queries per seconds,
where the JobManager process becomes the single-pointed performance
bottleneck. I asked the same question as Chesnay did, that whether this can
be fulfilled by having multiple session clusters plus a load-balancing
service. Seems that's exactly the approach they were using, but with many
limitations. Cc-ed @Shammon, would you like to provide more details?
- One of our users at Alibaba has hundreds of Flink jobs for event
processing. The workload of each job is typically low (tens of records per
hour), meanwhile there's a high requirement on the timeliness (records upon
received must be processed immediately). Consequently, they have a large
session cluster that runs hundreds of long-running jobs. The reason they
don't like multiple session clusters is that, as business develops, they
may frequently create new jobs and retire old ones. It is inconvenient to
maintain / migrate the long-running jobs across multiple clusters.

# My opinion on this proposal

I do see a value behind this proposal. Admittedly, I haven't seen any use
cases that absolutely cannot be solved by having multiple session clusters.
But it would definitely reduce the complexity on the user side if Flink can
support larger scale session clusters. Additionally, there's a chance that
spreading the JobMasters across multiple JobManager processes may also help
with the high availability, that only part of the jobs are affected upon
JobManager failures.

On the other hand, I share the concern that the current design complicates
the coordination & deployment components way more than the benefit it
brings.

I would suggest looking a bit more along the direction of leveraging
standby JobManager processes as Matthias pointed out, see if the benefit is
worth the price.

Best,

Xintong



On Fri, Aug 26, 2022 at 5:55 PM Zheng Yu Chen <[email protected]> wrote:

> Hi Chesnay ,
> I have also considered the method you mentioned. If we deploy some
> load balancing or intelligent scheduling in front of multiple
> SessionClusters, this may cause the following problems
> ● Insufficient resource utilization. When we distribute these
> resources on each cluster, the job cannot make full use of the overall
> TM resources. Some clusters may have very high workload and some are
> idle, resulting in wasted resources.
> ● The user's usage cost increases, and the user introduces additional
> components to adapt to the SessionCluster. The problem is caused by
> the overload of the JobManager. If there is a solution on the Flink
> side, it will be better.
> Maybe there is a better way to deal with it, I am sorting it out, and
> I will reply with new ideas in the emails later.
>
> Chesnay Schepler <[email protected]> 于2022年8月17日周三 22:31写道：
> >
> > To be honest I'm terrified at the idea of splitting the Dispatcher into
> > several processes, even more so if this is supposed to be opt-in and
> > specific to session mode.
> > It would fragment the coordination layer even more than it already is,
> > and make ops more complicated (yet another set of processes to monitor,
> > configure etc.).
> >
> > I'm not convinced that this proposal really gets us a lot of benefits;
> > and would rather propose that you split your single session cluster into
> > multiple session clusters (with the scheduling component in front of it
> > to distribute jobs) to even the load.
> >
> >  > The currently idling JobManagers could be utilized to take over some
> > of the workload from the leader.
> >
> > This would also be the path I would go down if we'd try to tackle this.
> >
> > On 17/08/2022 16:22, Matthias Pohl wrote:
> > > Hi Conrad,
> > > thanks for reaching out to the community with your proposal. I looked
> > > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > > elaborate a bit more on the concrete use-cases you have in mind? How
> > > do these match the user-cases of the two favored execution modes, i.e.
> > > Flink's Session and Application mode?
> > >
> > > As mentioned in [2], Application Mode should be preferred for single
> > > long-running jobs to isolate the resources of each of those jobs from
> > > each other. In contrast, Session Mode is the natural choice when
> > > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > > deploying some kind of dev environment for testing out job
> > > implementations. It feels like your use-case is somewhere in between a
> > > bit? It would be interesting to get a better understanding of where
> > > you're coming from. Maybe, you could provide some workload statistics?
> > >
> > > That considered, I guess it's a topic worth looking into. Here are a
> > > few thoughts after looking into FLIP-257:
> > > - As far as I can see, the BlobServer is used for sharing
> > > configuration information (e.g. Classpath information) as part of the
> > > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > > through the BlobServer but rather stored in the JobGraphStore backed
> > > by the HighAvailabilityServices implementation. The HA side is not
> > > really covered in FLIP-257, yet.
> > > - The approach of having the current Dispatcher living next to the new
> > > JobMasterDispatcher (that encapsulates the logic around distributing
> > > the workload onto multiple runners) leaves me with some doubt whether
> > > there wouldn't be a better separation of concerns here. What about
> > > leaving the Dispatcher as is but adding some abstraction between
> > > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > > around whether these instances are "deployed" on the same machine or
> > > somewhere else.
> > > - About distributing JobManager workload: The JobManager already
> > > utilizes leader election for faster recovery. Hence, one can set up
> > > multiple JobManagers in idle mode which wait to gain leadership and
> > > pick up the work (i.e. recovering the jobs) of the previously failed
> > > JobManager leader. What about utilizing this setup: The currently
> > > idling JobManagers could be utilized to take over some of the workload
> > > from the leader. I haven't thought this through, yet. But I'm
> > > wondering whether that would be a path we could go down. This would
> > > enable us to still stick to the JobManager/TaskManager setup which
> > > users are already used to rather than introducing another type of
> > > cluster node.
> > > - The JobManager initialization logic is kind of tricky to get your
> > > head around. There is some overhead, I hope, we could clean up as part
> > > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > > with it for some time (i.e. it's not going to be removed in 1.16)
> > > since it's a quite basic feature users might rely on. This shouldn't
> > > be a blocker. I just wanted to mention it to have it in the back of
> > > our minds when looking into ways to come up with a solid proposal for
> > > FLIP-257.
> > > - My concern is that this FLIP might turn out to be larger than
> > > expected and that it might be worth cutting it down into smaller
> > > chunks with each being covered in a separate FLIP down the road if we
> > > have some agreement and a clearer picture on how this should be
> tackled.
> > >
> > > I'm gonna add Chesnay and David to this discussion.
> > >
> > > Best,
> > > Matthias
> > >
> > >
> > > [1]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > > [2]
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > > [3]
> > >
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> > >
> > >
> > > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <[email protected]>
> > > wrote:
> > >
> > >     Hi community ~
> > >
> > >     I think this title should be quite interesting. The idea is to
> > >     reduce the
> > >     workload of the JobManager and make the SessionCluster [2] more
> > >     stable in
> > >     the process of running jobs. I designed a plan for splitting the
> > >     JobManager
> > >     on FLIP-257 [1]:
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> > >
> > >     This proposal proposes a splitting scheme for the current process
> > >     and a new
> > >     process implementation idea that is compatible with the original
> > >     process
> > >     model: splitting the internal JobMaster component of the
> > >     JobManager, and
> > >     controlling whether to enable this new process through a parameter
> > >     In the
> > >     split scheme, when the user configures, the JobMaster will make it
> > >     run as
> > >     an independent service, reducing the workload of the JobManager. By
> > >     implementing a new Dispatcher to communicate and interact with a
> > >     single
> > >     split JobMaster or multiple JobMasters, to achieve job management
> > >
> > >     The main features that it provides is:
> > >
> > >        - After the user submits the job, the JobMaster thread was
> > >     split into
> > >        other processes to run in the past. They no longer run in the
> > >     JobManager,
> > >        but in other processes.
> > >        - Users can deploy multiple components mentioned above, which
> run
> > >        multiple JobMaster threads, thereby reducing the workload of
> > >     the JobManager
> > >
> > >     Some of the challenging use cases that these features solve are:
> > >
> > >        - Compatible with the original job running mode (run JobMaster
> > >     Thread on
> > >        JobManager)
> > >        - Implement a new Dispatcher that forwards client operations
> > >     related to
> > >        jobs
> > >
> > >
> > >      I would love to hear and address your thoughts and feedback , and
> if
> > >     possible drive a FLIP-257 ！
> > >
> > >
> > >     [1]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> > >
> > >     [2]
> > >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> > >
> > >
> > >     --
> > >
> > >     Have a nice day ~
> > >
> > >     ConradJam
> > >
>
> --
> Best
>
> ConradJam
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Reply via email to