Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

David Morávek Tue, 23 Aug 2022 02:05:23 -0700

Hi Zheng,

Thanks for the write-up! I tend to agree with Chesnay that this introduces
additional complexity to an already complex deployment model.


One of the main focuses in this area is to reduce feature sparsity and to
have fewer, higher-quality options. Example efforts are the deprecation (and
eventual removal) of per-job mode, the removal of the Mesos RM, ...

Let's discuss your points:

> This can save some JVM resources and reduce server costs

If so, the savings would IMO be negligible. Why?

- JobMaster is by far the most resource-intensive component inside the
JobManager
- The CPU / memory ratio of the underlying hypervisor remains the same (or
you'd have unused resources on the machine that you still need to pay for)
- Most of the overhead of the JobMaster comes from the JVM itself, not from
the RM / Dispatcher

> More adequate resource utilization

Can you elaborate? Is this about sharing TMs between multiple jobs (I'd
discourage that for long-running mission-critical workloads)?

> Starting Application Mode has a long resource application and waiting
> (because SessionCluster has already applied for fixed TM and JM resources
> at startup)

This means you have to overprovision your SessionCluster. This goes against
the resource utilization efforts from the previous point (you're shaving off
a small amount of resources from the JM, but you have spare TMs instead, which
are an order of magnitude more resource-intensive).
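
A back-of-the-envelope sketch of that trade-off. All numbers are purely
hypothetical (none of them come from the FLIP or from measurements), and
net_memory_gb is an illustrative helper, not anything in Flink:

```python
# Back-of-the-envelope comparison; every number below is a made-up assumption.
JM_SAVING_GB = 0.5   # memory hypothetically saved by slimming down the JobManager
SPARE_TMS = 2        # TaskManagers kept idle so that submissions start instantly
TM_SIZE_GB = 8.0     # memory reserved per idle TaskManager


def net_memory_gb(jm_saving_gb, spare_tms, tm_size_gb):
    """Memory held by idle TMs minus memory saved on the JM (positive = net cost)."""
    return spare_tms * tm_size_gb - jm_saving_gb


print(net_memory_gb(JM_SAVING_GB, SPARE_TMS, TM_SIZE_GB))  # 15.5
```

With these assumed values the idle TMs cost far more than the JM split saves;
the conclusion only flips if no spare TMs are kept around at all.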

If you're able to start TMs upfront with the session cluster, you already
know you're going to need them. If this is a concern, you could just as well
start the TMs that will eventually connect to your JM once it starts
(i.e. once you've decided to submit your job). Some enhancements to
ApplicationMode might be needed to make this robust, but efforts in this
direction are where things should IMO be headed.

As for resource utilization, the session cluster actually blocks you
from leveraging reactive scaling efforts and eventually auto-scaling,
because we'd need to extend Flink's surface area with multi-job scheduling
capabilities (queues, preemption, priorities between jobs). I don't
think we should ever go in that direction; that's outside Flink's scope.

> Poor isolation between JobMaster threads in JobManager: When there are
> too many jobs, the JobManager is under great pressure.

Session mode is mainly designed for interactive workloads, but agreed
that JM threads might interfere. Still, I fail to see this as a reason for
introducing additional complexity, because this could be mitigated on the
user side (smarter job scheduling, multiple clusters, Application Mode for
streaming jobs).
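
To illustrate what "smarter job scheduling" across multiple session clusters
could look like on the user side, here is a minimal sketch. SessionCluster,
max_jobs and pick_cluster are hypothetical names, not Flink APIs; the job
counts are assumed to come from whatever monitoring the user already has.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SessionCluster:
    name: str          # hypothetical label for a session cluster
    running_jobs: int  # assumed to be polled by the component in front
    max_jobs: int      # hypothetical per-cluster cap chosen by the operator


def pick_cluster(clusters: List[SessionCluster]) -> Optional[SessionCluster]:
    """Return the cluster with the fewest running jobs that still has capacity."""
    candidates = [c for c in clusters if c.running_jobs < c.max_jobs]
    if not candidates:
        return None  # every cluster is full; the caller has to queue or reject
    return min(candidates, key=lambda c: c.running_jobs)


clusters = [
    SessionCluster("a", running_jobs=5, max_jobs=10),
    SessionCluster("b", running_jobs=2, max_jobs=10),
    SessionCluster("c", running_jobs=10, max_jobs=10),
]
print(pick_cluster(clusters).name)  # b
```

This keeps the routing decision outside of Flink, so no single JobManager has
to host all JobMasters and no new process type is needed.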

> there will inevitably be more rich functions running on JobMaster.

This is a separate discussion. So far we have mostly been pushing against
running any user code on the JM (there are a few exceptions already, but
any enhancement should be carefully considered).

> JobManager's functional responsibilities are too large

From the "architecture perspective", it's just a bundle of independent
components with clearly defined responsibilities, which makes their
coordination simpler and more resource-efficient (networking, fewer JVMs -
each comes with a significant overhead).

--

So far I'm under the impression that this actually introduces more issues
than it tries to solve.

Best,
D.


On Thu, Aug 18, 2022 at 12:10 PM Zheng Yu Chen <jam.gz...@gmail.com> wrote:

> You're right, this does add to the complexity of their communication and
> coordination.
> I understand that what you mean is similar to nginx: load balancing across
> different SessionClusters in the front, rather than one more component. In
> fact, I have tried this myself, and it seems to solve the problem of high
> load on the cluster JM, but it cannot fundamentally solve the following
> problems:
>
> Deploying the components is complicated and requires one more nginx and
> related configuration. You also need to make sure that your jobs are not
> assigned to a busy JobManager.
> As mentioned in my previous reply, this is a trade-off
> solution (after all, you can choose Application Mode, so there will be no
> such problem), when we need to use SessionCluster for long-running jobs, we
> Can you think like this?
>
> what do you think ~
>
>
> Chesnay Schepler <ches...@apache.org> wrote on Wed, Aug 17, 2022 at 22:31:
>
> > To be honest I'm terrified at the idea of splitting the Dispatcher into
> > several processes, even more so if this is supposed to be opt-in and
> > specific to session mode.
> > It would fragment the coordination layer even more than it already is,
> > and make ops more complicated (yet another set of processes to monitor,
> > configure etc.).
> >
> > I'm not convinced that this proposal really gets us a lot of benefits;
> > and would rather propose that you split your single session cluster into
> > multiple session clusters (with the scheduling component in front of it
> > to distribute jobs) to even the load.
> >
> >  > The currently idling JobManagers could be utilized to take over some
> > of the workload from the leader.
> >
> > This would also be the path I would go down if we'd try to tackle this.
> >
> > On 17/08/2022 16:22, Matthias Pohl wrote:
> > > Hi Conrad,
> > > thanks for reaching out to the community with your proposal. I looked
> > > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > > elaborate a bit more on the concrete use-cases you have in mind? How
> > > do these match the use-cases of the two favored execution modes, i.e.
> > > Flink's Session and Application mode?
> > >
> > > As mentioned in [2], Application Mode should be preferred for single
> > > long-running jobs to isolate the resources of each of those jobs from
> > > each other. In contrast, Session Mode is the natural choice when
> > > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > > deploying some kind of dev environment for testing out job
> > > implementations. It feels like your use-case is somewhere in between a
> > > bit? It would be interesting to get a better understanding of where
> > > you're coming from. Maybe, you could provide some workload statistics?
> > >
> > > That considered, I guess it's a topic worth looking into. Here are a
> > > few thoughts after looking into FLIP-257:
> > > - As far as I can see, the BlobServer is used for sharing
> > > configuration information (e.g. Classpath information) as part of the
> > > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > > through the BlobServer but rather stored in the JobGraphStore backed
> > > by the HighAvailabilityServices implementation. The HA side is not
> > > really covered in FLIP-257, yet.
> > > - The approach of having the current Dispatcher living next to the new
> > > JobMasterDispatcher (that encapsulates the logic around distributing
> > > the workload onto multiple runners) leaves me with some doubt whether
> > > there wouldn't be a better separation of concerns here. What about
> > > leaving the Dispatcher as is but adding some abstraction between
> > > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > > around whether these instances are "deployed" on the same machine or
> > > somewhere else.
> > > - About distributing JobManager workload: The JobManager already
> > > utilizes leader election for faster recovery. Hence, one can set up
> > > multiple JobManagers in idle mode which wait to gain leadership and
> > > pick up the work (i.e. recovering the jobs) of the previously failed
> > > JobManager leader. What about utilizing this setup: The currently
> > > idling JobManagers could be utilized to take over some of the workload
> > > from the leader. I haven't thought this through, yet. But I'm
> > > wondering whether that would be a path we could go down. This would
> > > enable us to still stick to the JobManager/TaskManager setup which
> > > users are already used to rather than introducing another type of
> > > cluster node.
> > > - The JobManager initialization logic is kind of tricky to get your
> > > head around. There is some overhead, I hope, we could clean up as part
> > > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > > with it for some time (i.e. it's not going to be removed in 1.16)
> > > since it's a quite basic feature users might rely on. This shouldn't
> > > be a blocker. I just wanted to mention it to have it in the back of
> > > our minds when looking into ways to come up with a solid proposal for
> > > FLIP-257.
> > > - My concern is that this FLIP might turn out to be larger than
> > > expected and that it might be worth cutting it down into smaller
> > > chunks with each being covered in a separate FLIP down the road if we
> > > have some agreement and a clearer picture on how this should be
> > > tackled.
> > >
> > > I'm gonna add Chesnay and David to this discussion.
> > >
> > > Best,
> > > Matthias
> > >
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > > [2]
> > > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > > [3]
> > > https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> > >
> > >
> > > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <jam.gz...@gmail.com>
> > > wrote:
> > >
> > >     Hi community ~
> > >
> > >     I think this title should be quite interesting. The idea is to
> > >     reduce the workload of the JobManager and make the SessionCluster [2]
> > >     more stable in the process of running jobs. I designed a plan for
> > >     splitting the JobManager in FLIP-257 [1]:
> > >
> > >     https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
> > >
> > >
> > >     This proposal describes a scheme for splitting the current process,
> > >     together with a new process implementation that stays compatible
> > >     with the original process model: the internal JobMaster component is
> > >     split out of the JobManager, and a parameter controls whether the new
> > >     process is enabled. When the user enables it, the JobMaster runs as
> > >     an independent service, reducing the workload of the JobManager. A
> > >     new Dispatcher communicates and interacts with a single split
> > >     JobMaster or with multiple JobMasters to achieve job management.
> > >
> > >     The main features that it provides are:
> > >
> > >        - After the user submits a job, the JobMaster, which in the past
> > >        ran as a thread inside the JobManager, is split out to run in
> > >        other processes.
> > >        - Users can deploy multiple instances of the components mentioned
> > >        above, which run multiple JobMaster threads, thereby reducing the
> > >        workload of the JobManager.
> > >
> > >     Some of the challenging use cases that these features solve are:
> > >
> > >        - Compatible with the original job running mode (run JobMaster
> > >        Thread on JobManager)
> > >        - Implement a new Dispatcher that forwards client operations
> > >        related to jobs
> > >
> > >
> > >     I would love to hear and address your thoughts and feedback, and if
> > >     possible drive FLIP-257!
> > >
> > >
> > >     [1]
> > >     https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
> > >
> > >     [2]
> > >     https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> > >
> > >
> > >     --
> > >
> > >     Have a nice day ~
> > >
> > >     ConradJam
> > >
> >
>
>
> --
> Best
>
> ConradJam
>
