Hi Rui,

Thanks for the proposal. I think it makes a lot of sense to decouple
the autoscaler from Kubernetes-related dependencies. A couple of notes
when I read the proposal:

1. You propose AutoScalerStateStore, AutoScalerStateStoreFactory, and
AutoScalerEventHandler. AutoScalerStateStore is a generic key/value
database (methods: "get"/"put"/"delete"). I would propose to refine
this interface and make it less general-purpose, e.g. by adding
dedicated methods for persisting scaling decisions as well as any
metrics gathered for the current metric window. For simplicity, I'd
even go so far as to remove the state store entirely and instead
handle state in the AutoScalerEventHandler, which will receive all
related scaling and metric collection events and can keep track of
any state.
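To make this more concrete, here is a rough Java sketch (Java 16+, for records). All type and method names in it are hypothetical and chosen only for illustration; they are not the FLIP's proposed signatures. KeyValueStateStore mirrors the generic get/put/delete shape, ScalingStateStore shows a narrower, purpose-specific alternative, and StateTrackingEventHandler shows the option of dropping the store and keeping state inside the event handler.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Generic key/value shape, as described for the proposed state store.
// Hypothetical signatures, for illustration only.
interface KeyValueStateStore {
    Optional<String> get(String key);
    void put(String key, String value);
    void delete(String key);
}

// Hypothetical record of one scaling decision for a job vertex.
record ScalingDecision(String vertexId, int oldParallelism, int newParallelism, Instant time) {}

// Hypothetical metric sample collected within the current metric window.
record MetricSample(String vertexId, double busyTimeMsPerSecond, Instant time) {}

// Hypothetical refined store: purpose-specific methods instead of raw keys.
interface ScalingStateStore {
    void storeScalingDecision(ScalingDecision decision);
    List<ScalingDecision> getScalingHistory(String vertexId);
    void storeMetricSample(MetricSample sample);
    List<MetricSample> getMetricWindow();
    void clearMetricWindow();
}

// Alternative: no separate store. The handler receives every scaling and
// metric-collection event and keeps whatever state it needs itself.
interface ScalingEventListener {
    void onScalingDecision(ScalingDecision decision);
    void onMetricsCollected(MetricSample sample);
}

// In-memory handler that tracks state purely from the events it receives.
class StateTrackingEventHandler implements ScalingEventListener {
    private final Map<String, List<ScalingDecision>> history = new HashMap<>();
    private final List<MetricSample> metricWindow = new ArrayList<>();

    @Override
    public void onScalingDecision(ScalingDecision decision) {
        history.computeIfAbsent(decision.vertexId(), k -> new ArrayList<>()).add(decision);
        // Assumption for this sketch only: a new scaling decision starts a fresh metric window.
        metricWindow.clear();
    }

    @Override
    public void onMetricsCollected(MetricSample sample) {
        metricWindow.add(sample);
    }

    List<ScalingDecision> scalingHistory(String vertexId) {
        return Collections.unmodifiableList(history.getOrDefault(vertexId, List.of()));
    }

    List<MetricSample> metricWindow() {
        return Collections.unmodifiableList(metricWindow);
    }
}
```

With the event-handler variant, the autoscaler core only emits events; where and how the state ends up being persisted (Kubernetes ConfigMap, a database, memory) becomes an implementation detail of the handler.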

2. You propose to make the current autoscaler module
Kubernetes-agnostic by moving the Kubernetes parts into the main
operator module. I think that makes sense since the Kubernetes
implementation will continue to be tightly coupled with Kubernetes.
The goal of the separate module was to make the autoscaler logic
pluggable, but this will continue to be possible with the new
"flink-autoscaler" module which contains the autoscaling logic and
interfaces. In the long run, the autoscaling logic can move to a
separate repository, although this will complicate the release
process, so I would defer this unless there is strong interest.
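Roughly, the split could look like the following Java sketch. ParallelismApplier, SimpleAutoscaler, and InMemoryApplier are hypothetical names, and the scaling rule (scale parallelism by observed over target utilization) is a deliberately simplistic placeholder, not the actual autoscaler algorithm. The point is only that the logic in "flink-autoscaler" depends on an interface, while the Kubernetes (or YARN) implementation is provided elsewhere.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical platform-neutral interface that would live in "flink-autoscaler".
// A Kubernetes implementation would live in the operator module; a YARN
// implementation could live elsewhere.
interface ParallelismApplier {
    // Current parallelism per job vertex.
    Map<String, Integer> currentParallelism();

    // Apply new parallelism per job vertex (e.g. via in-place rescaling).
    void apply(Map<String, Integer> newParallelism);
}

// Platform-neutral scaling logic that depends only on ParallelismApplier.
// The rule below is a simplistic placeholder, NOT the real autoscaler algorithm.
class SimpleAutoscaler {
    private final double targetUtilization;

    SimpleAutoscaler(double targetUtilization) {
        if (targetUtilization <= 0 || targetUtilization > 1) {
            throw new IllegalArgumentException("targetUtilization must be in (0, 1]");
        }
        this.targetUtilization = targetUtilization;
    }

    void scale(ParallelismApplier applier, Map<String, Double> utilization) {
        Map<String, Integer> next = new HashMap<>();
        for (Map.Entry<String, Integer> e : applier.currentParallelism().entrySet()) {
            Double observed = utilization.get(e.getKey());
            if (observed == null) {
                // No metrics for this vertex: keep its current parallelism.
                next.put(e.getKey(), e.getValue());
                continue;
            }
            // Scale proportionally to observed / target utilization, at least 1.
            int p = (int) Math.ceil(e.getValue() * observed / targetUtilization);
            next.put(e.getKey(), Math.max(1, p));
        }
        applier.apply(next);
    }
}

// In-memory implementation, e.g. for tests that do not need Kubernetes.
class InMemoryApplier implements ParallelismApplier {
    private final Map<String, Integer> parallelism = new HashMap<>();

    InMemoryApplier(Map<String, Integer> initial) {
        parallelism.putAll(initial);
    }

    @Override
    public Map<String, Integer> currentParallelism() {
        return Map.copyOf(parallelism);
    }

    @Override
    public void apply(Map<String, Integer> newParallelism) {
        parallelism.putAll(newParallelism);
    }
}
```

Because the scaling logic only sees the interface, an in-memory implementation like InMemoryApplier is enough to exercise it without any Kubernetes dependency.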

3. The proposal mentions some removal of tests. It is critical for us
that all test coverage of the current implementation remains active.
It is OK if some of the tests only cover the Kubernetes
implementation. Over time, we can move tests that are not
Kubernetes-specific into the implementation-agnostic autoscaler tests.

-Max

On Tue, Aug 1, 2023 at 9:46 AM Rui Fan <fan...@apache.org> wrote:
>
> Hi all,
>
> Samrat (cc'ed) and I created FLIP-334[1] to decouple the autoscaler
> from Kubernetes.
>
> Currently, the flink-autoscaler is tightly integrated with Kubernetes.
> There are compelling reasons to extend the use of flink-autoscaler to
> more types of Flink jobs:
> 1. With the recent merge of the Externalized Declarative Resource
> Management (FLIP-291[2]), in-place scaling is now supported
> across all types of Flink jobs. This development has made scaling Flink on
> YARN a straightforward process.
> 2. Several discussions[3] within the Flink user community, as observed on
> the mailing list, have emphasized the need for flink-autoscaler to
> support Flink on YARN.
>
> Please refer to the FLIP[1] document for more details about the proposed
> design and implementation. We welcome any feedback and opinions on
> this proposal.
>
> [1] https://cwiki.apache.org/confluence/x/x4qzDw
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [3] https://lists.apache.org/thread/pr0r8hq8kqpzk3q1zrzkl3rp1lz24v7v
