Hi Gyula, Samrat and Shammon,

My team is also looking forward to the autoscaler being compatible with YARN.

Currently, all of our Flink jobs are running on YARN. The autoscaler is
a great feature for Flink users: it can greatly simplify the process of
tuning parallelism.

If the autoscaler is to support YARN, I propose dividing the work into two
stages:
1. Only collect and evaluate scaling-related performance metrics,
 without triggering any job upgrades.
2. Support automatic upgrades of YARN jobs.
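
To make the split concrete, here is a rough sketch of how one evaluation
round could serve both stages. It is only an illustration under
assumptions: `Recommendation`, `autoscale_once`, `toy_evaluate` and the
`upgrade` hook are hypothetical names, not part of Flink or the
operator, and the toy evaluator stands in for the real metric
evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record: one vertex whose parallelism should change."""
    vertex: str
    current: int
    recommended: int


def autoscale_once(
    parallelism: Dict[str, int],
    evaluate: Callable[[Dict[str, int]], Dict[str, int]],
    upgrade: Optional[Callable[[List[Recommendation]], None]] = None,
) -> List[Recommendation]:
    """Run one evaluation round.

    Stage 1: no upgrade hook is passed, so recommendations are only returned.
    Stage 2: an upgrade hook is passed and receives the recommendations.
    """
    targets = evaluate(parallelism)
    recs = [
        Recommendation(vertex, current, targets[vertex])
        for vertex, current in parallelism.items()
        if targets.get(vertex, current) != current
    ]
    if upgrade is not None and recs:
        upgrade(recs)
    return recs


def toy_evaluate(parallelism: Dict[str, int]) -> Dict[str, int]:
    # Placeholder rule (an assumption): double the source, keep the rest.
    return {v: p * 2 if v == "source" else p for v, p in parallelism.items()}


job = {"source": 2, "sink": 4}
stage1 = autoscale_once(job, toy_evaluate)                  # report only
applied: List[Recommendation] = []
stage2 = autoscale_once(job, toy_evaluate, applied.extend)  # report and apply
```

With this shape, stage 1 can ship without any upgrade path for YARN,
and stage 2 only has to supply the `upgrade` hook.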

I would also like to join this effort and help improve it together.

And I'm very happy that Gyula can help with the review.

Best,
Rui Fan

On Mon, Feb 20, 2023 at 8:56 AM Shammon FY <zjur...@gmail.com> wrote:

> Hi Samrat
>
> My team is also looking into this area. Once you share your proposal, we
> would like to join you if possible. I hope we can improve this together
> for use in our production environment too, thanks :)
>
> Best,
> Shammon
>
> On Fri, Feb 17, 2023 at 9:27 PM Samrat Deb <decordea...@gmail.com> wrote:
>
> > @Gyula
> > Thank you
> > We will work on this and try to come up with an approach.
> >
> >
> >
> >
> > On Fri, Feb 17, 2023 at 6:12 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> >
> > > In case you guys feel strongly about this, I suggest you try to fork
> > > the autoscaler implementation and make a version that works with both
> > > the Kubernetes operator and YARN.
> > > If your solution is generic and works well, we can discuss the way
> > > forward.
> > >
> > > Unfortunately, neither I nor my team really have the resources to
> > > assist you with the YARN effort, as we are mostly invested in
> > > Kubernetes, but of course we are happy to review your work.
> > >
> > > Gyula
> > >
> > >
> > > On Fri, Feb 17, 2023 at 1:09 PM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
> > >
> > > > @Gyula
> > > >
> > > > >> It is easier to make the operator work with jobs running in
> > > > >> different types of clusters than to take the autoscaler module
> > > > >> itself and plug that in somewhere else.
> > > >
> > > > Our (part of Samrat's team) main problem is how to leverage the
> > > > AutoScaler Recommendation Engine part of Flink-Kubernetes-Operator
> > > > for our Flink jobs running on YARN.
> > > > Currently, this is not feasible as the autoscaler module is tightly
> > > > coupled with the operator. We agree that the operator serves the two
> > > > core requirements, but the operator itself cannot be used for Flink
> > > > jobs running on YARN. Those core requirements are solved through
> > > > other mechanisms in the case of YARN. But the main problem for us is
> > > > *how to use the AutoScaler Recommendation Engine for Flink jobs on
> > > > YARN.*
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zjur...@gmail.com> wrote:
> > > >
> > > > > Hi Gyula, Samrat
> > > > >
> > > > > Thanks for your input. I totally agree with you that it's really a
> > > > > big piece of work. As @Samrat mentioned above, I think making the
> > > > > autoscaler completely independent will not be quick either. But I
> > > > > still see some valuable points in a `completely independent
> > > > > autoscaler`, and I think this may be the goal we need to achieve in
> > > > > the future.
> > > > >
> > > > > 1. A large k8s cluster may manage thousands of machines, and users
> > > > > may run tens of thousands of Flink jobs in one k8s cluster. If the
> > > > > autoscaler manages all these jobs, it needs to scale horizontally.
> > > > >
> > > > > 2. As you mentioned, "execute the job stateful upgrades safely" is
> > > > > indeed complex work, but I think we should decouple it from the k8s
> > > > > operator:
> > > > >
> > > > > a) In addition to k8s, there may be other resource management
> > > > > systems.
> > > > >
> > > > > b) Flink may support more scaling operations through the REST API,
> > > > > such as FLIP-291 [1].
> > > > >
> > > > > c) In our production environment, there's a 'Job Submission
> > > > > Gateway' which stores job info and config, and monitors the status
> > > > > of running jobs. After the autoscaler upgrades a job, it must
> > > > > update the config in the Gateway so that users can restart their
> > > > > jobs with the updated config and avoid resource conflicts.
> > > > > Under these circumstances, having the autoscaler send upgrade
> > > > > requests to the gateway may be a good choice.
> > > > >
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > >
> > > > >
> > > > > Best,
> > > > > Shammon
> > > > >
> > > > >
> > > > > On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> > > > >
> > > > > > @Shammon, Samrat:
> > > > > >
> > > > > > I appreciate the enthusiasm, and I wish this were only a matter
> > > > > > of intention, but making the autoscaler work without the operator
> > > > > > may be a pretty big task.
> > > > > > You must not forget 2 core requirements here.
> > > > > >
> > > > > > 1. The autoscaler logic itself has to run somewhere (in this case
> > > > > > on k8s within the operator).
> > > > > > 2. Something has to execute the job stateful upgrades safely
> > > > > > based on the scaling decisions (in this case the operator does
> > > > > > that).
> > > > > >
> > > > > > 1. can be solved almost anywhere easily; however, you need
> > > > > > resiliency etc. for this to be a prod application. 2. is the
> > > > > > really tricky part. The operator was actually built to execute
> > > > > > job upgrades; if you look at the code you will appreciate the
> > > > > > complexity of the task.
> > > > > >
> > > > > > As I said in the earlier thread, it is easier to make the
> > > > > > operator work with jobs running in different types of clusters
> > > > > > than to take the autoscaler module itself and plug that in
> > > > > > somewhere else.
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <decordea...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Shammon,
> > > > > > >
> > > > > > > Thank you for your input; I am completely aligned with you.
> > > > > > >
> > > > > > > We are fine with either of the options, but IMO, to start
> > > > > > > with, it will be easier to have it in the
> > > > > > > flink-kubernetes-operator as a module instead of a separate
> > > > > > > repo, which requires additional effort.
> > > > > > >
> > > > > > > Given that we would be incrementally working on making the
> > > > > > > autoscaling recommendation framework generic enough, once it
> > > > > > > reaches a point where the community feels it needs to be moved
> > > > > > > to a separate repo, we can take a call.
> > > > > > >
> > > > > > > Bests,
> > > > > > >
> > > > > > > Samrat
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <decordea...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Max,
> > > > > > > > If you are fine and aligned with the same thought, since
> > > > > > > > this is going to be very useful to us, we are ready to help
> > > > > > > > with / contribute the additional work required.
> > > > > > > >
> > > > > > > > Bests,
> > > > > > > > Samrat
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zjur...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Hi Samrat
> > > > > > > >>
> > > > > > > > >> Do you mean creating an independent module for Flink
> > > > > > > > >> scaling in flink-k8s-operator? How about creating a
> > > > > > > > >> project such as `flink-auto-scaling` which is completely
> > > > > > > > >> independent? Besides supporting resource managers such as
> > > > > > > > >> k8s and YARN, we could do more things in the project, for
> > > > > > > > >> example, updating the config in the user's `job
> > > > > > > > >> submission system` after scaling Flink jobs. WDYT?
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Shammon
> > > > > > > >>
> > > > > > > >>
> > > > > > > > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <m...@apache.org> wrote:
> > > > > > > >>
> > > > > > > >> > Hi Samrat,
> > > > > > > >> >
> > > > > > > > >> > The autoscaling module is now pluggable but it is still
> > > > > > > > >> > tightly coupled with Kubernetes. It will take additional
> > > > > > > > >> > work for the logic to work independently of the cluster
> > > > > > > > >> > manager.
> > > > > > > >> >
> > > > > > > >> > -Max
> > > > > > > >> >
> > > > > > > > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <decordea...@gmail.com> wrote:
> > > > > > > >> > >
> > > > > > > >> > > Oh! yesterday it got merged.
> > > > > > > > >> > > Apologies, I missed the recent commit @Gyula.
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the update
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> > > > > > > >> > >
> > > > > > > > >> > > > Max recently moved the autoscaler logic into a separate
> > > > > > > > >> > > > submodule, did you see that?
> > > > > > > > >> > > >
> > > > > > > > >> > > > https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > > > > > >> > > >
> > > > > > > >> > > > Gyula
> > > > > > > >> > > >
> > > > > > > > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <decordea...@gmail.com> wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi ,
> > > > > > > >> > > > >
> > > > > > > > >> > > > > *Context:*
> > > > > > > > >> > > > > Autoscaling was introduced in Flink as part of FLIP-271 [1].
> > > > > > > > >> > > > > It discusses one of the important aspects: providing a robust
> > > > > > > > >> > > > > default scaling algorithm that should
> > > > > > > > >> > > > >       a. Ensure scaling yields effective usage of assigned task
> > > > > > > > >> > > > >          slots.
> > > > > > > > >> > > > >       b. Ramp up in case of any backlog to ensure it gets
> > > > > > > > >> > > > >          processed in a timely manner.
> > > > > > > > >> > > > >       c. Minimize the number of scaling decisions to prevent
> > > > > > > > >> > > > >          costly rescale operations.
> > > > > > > > >> > > > > The FLIP intends to add an autoscaling framework based on 6
> > > > > > > > >> > > > > major metrics and contains different types of thresholds to
> > > > > > > > >> > > > > trigger the scaling.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > Thread [2] discusses a different problem: why the autoscaler is
> > > > > > > > >> > > > > part of the operator instead of the JobManager at runtime.
> > > > > > > > >> > > > > The community decided to keep the autoscaling logic in the
> > > > > > > > >> > > > > flink-kubernetes-operator.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > *Proposal:*
> > > > > > > > >> > > > > In this discussion, I want to put forward the idea of
> > > > > > > > >> > > > > extracting the autoscaling logic into a new submodule in the
> > > > > > > > >> > > > > flink-kubernetes-operator repository [3], which will be
> > > > > > > > >> > > > > independent of any resource manager/operator.
> > > > > > > > >> > > > > Currently the autoscaling algorithm is very tightly coupled
> > > > > > > > >> > > > > with the Kubernetes API.
> > > > > > > > >> > > > > This makes the core autoscaling algorithm hard to extend to
> > > > > > > > >> > > > > other available resource managers like YARN, Mesos, etc.
> > > > > > > > >> > > > > A separate autoscaling module inside the
> > > > > > > > >> > > > > flink-kubernetes-operator will help other resource managers
> > > > > > > > >> > > > > leverage the autoscaling logic.
> > > > > > > >> > > > >
> > > > > > > >> > > > > [1]
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > > > > >> > > > > [2]
> > > > > > > >>
> > > https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > > > > >> > > > > [3]
> > https://github.com/apache/flink-kubernetes-operator
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > Bests,
> > > > > > > >> > > > > Samrat
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
