Thanks to Gyula and Max for a great start. I'll try this feature out and
raise it on the issue.
Maximilian Michels wrote on Thu, Dec 15, 2022 at 02:37:
A heads-up: Gyula just opened a PR with the code contribution based on the
design: https://github.com/apache/flink-kubernetes-operator/pull/484
We have run some tests based on the current state and achieved very good
results thus far. We were able to cut the resources of some of the
deployments by
Thanks for the reply, Gyula and Max.
Prasanna
On Sat, 26 Nov 2022, 00:24 Maximilian Michels, wrote:
Hi John, hi Prasanna, hi Rui,
Gyula already gave great answers to your questions, just adding to it:
>What's the reason to add auto scaling to the Operator instead of to the
JobManager?
As Gyula mentioned, the JobManager is not the ideal place, at least not
until Flink supports in-place autoscal
Hi Gyula
Thanks for the clarification!
Best
Rui Fan
On Fri, Nov 25, 2022 at 1:50 PM Gyula Fóra wrote:
Rui, Prasanna:
I am afraid that creating a completely independent autoscaler process that
works with any type of Flink clusters is out of scope right now due to the
following reasons:
If we were to create a new general process, we would have to implement high
availability and a pluggable mechanis
Thanks for this answer, Gyula!
-John
On Thu, Nov 24, 2022, at 14:53, Gyula Fóra wrote:
Hi Max,
This is a great initiative and a good discussion going on.
We have set up our Flink cluster using Amazon ECS, so it would be good to
design this in such a way that we can deploy the autoscaler as a separate
Docker image which could observe the JM and jobs and emit outputs that can
be used to trigger the E
Hi Gyula, Max, John!
Thanks for the great FLIP, it's very useful for flink users.
> Ideally the autoscaler is a separate process (an outside observer)
Could we eventually use the autoscaler as an outside tool, or run it as a
separate Java process? If that is complex, can the part that detects
the job
Hi John!
Thank you for the excellent question.
There are a few reasons why we felt that the operator is the right place
for this component:
- Ideally the autoscaler is a separate process (an outside observer), and
the jobmanager is very much tied to the lifecycle of the job. The operator
is a pe
Hi Max,
Thanks for the FLIP!
I've been curious about one point. I can imagine some good reasons for it
but wonder what you have in mind. What's the reason to add auto scaling to
the Operator instead of to the JobManager?
It seems like adding that capability to the JobManager would be a big
Thanks for your comments @Dong and @Chen. It is true that not all the
details are contained in the FLIP. The document is meant as a general
design concept.
As for the rescaling time, this is going to be a configurable setting for
now but it is foreseeable that we will provide auto-tuning of this
c
>Do we think the scaler could be a plugin or hard coded ?
+1 For pluggable scaling logic.
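To make "pluggable scaling logic" concrete, a minimal plug-in seam might look like the following sketch. This is purely illustrative: the class and method names are hypothetical and not the operator's actual API.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ScalingStrategy(ABC):
    """Hypothetical plug-in point: given observed metrics for a job vertex,
    decide a new parallelism, or return None to keep the current one."""

    @abstractmethod
    def decide(self, current_parallelism: int, metrics: dict) -> Optional[int]:
        ...


class ConservativeScaleIn(ScalingStrategy):
    """Example custom strategy: only scale in when utilization is very low,
    and never below a parallelism of 1."""

    def decide(self, current_parallelism: int, metrics: dict) -> Optional[int]:
        # busy_time_ms_per_second is in the range 0..1000.
        busy_ratio = metrics.get("busy_time_ms_per_second", 1000) / 1000.0
        if busy_ratio < 0.3 and current_parallelism > 1:
            return current_parallelism - 1
        return None  # no change
```

A custom strategy like this could address cases such as short spikes that are not worth a job restart, which is the kind of customization discussed below.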
On Mon, Nov 21, 2022 at 3:38 AM Chen Qin wrote:
On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra wrote:
Sounds good. More specifically, it would be great if it
Hi Chen!
I think in the long term it makes sense to provide some pluggable
mechanisms but it's not completely trivial where exactly you would plug in
your custom logic at this point.
In any case the problems you mentioned should be solved robustly by the
algorithm itself without any customization
Hi Gyula,
Do we think the scaler could be a plugin or hard coded ?
We observed some cases the scaler can't address (e.g. async IO dependency
service degradation, or a small spike that isn't worth restarting the job for).
Thanks,
Chen
On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra wrote:
Hi Gyula!
Thanks for all the explanations!
Personally, I would like to see a full story of how the algorithm works
(e.g. how it determines the estimated time for scale), how users can get
the basic information needed to monitor the health/effectiveness of
autoscaler (e.g. metrics), and how the al
Hi Dong!
Could you please confirm that your main concerns have been addressed?
Some other minor details that might not have been fully clarified:
- Yes, the prototype has been validated on some production workloads
- We are only planning to use metrics that are generally available and are
previo
Hi Dong!
This is not an experimental feature proposal. The implementation of the
prototype is still in an experimental phase but by the time the FLIP,
initial prototype and review is done, this should be in a good stable first
version.
This proposal is pretty general as autoscalers/tuners get as f
Hi Gyula,
If I understand correctly, this autopilot proposal is an experimental
feature and its configs/metrics are not mature enough to provide backward
compatibility yet. And the proposal provides high-level ideas of the
algorithm but it is probably too complicated to explain it end-to-end.
On
Hi Dong,
Let me address your comments.
Time for scale / backlog processing time derivation:
We can add some more details to the FLIP, but at this point the
implementation is actually much simpler than the algorithm that describes
it. I would not like to add more equations etc. because it just overcomp
Thanks for the update! Please see comments inline.
On Tue, Nov 15, 2022 at 11:46 PM Maximilian Michels wrote:
Of course! Let me know if your concerns are addressed. The wiki page has
been updated.
>It will be great to add this in the FLIP so that reviewers can understand
how the source parallelisms are computed and how the algorithm works
end-to-end.
I've updated the FLIP page to add more details on how
Hi Maximilian,
It seems that the following comments from the previous discussions have not
been addressed yet. Any chance we can have them addressed before starting
the voting thread?
Thanks,
Dong
On Mon, Nov 7, 2022 at 2:33 AM Gyula Fóra wrote:
I agree we should start the vote.
On a separate (but related) small discussion we could also decide
backporting https://issues.apache.org/jira/browse/FLINK-29501 for 1.16.1 so
that the autoscaler could be more efficiently developed and tested and to
make it 1.16 compatible.
Cheers,
Gyula
On Tue,
+1 If there are no further comments, I'll start a vote thread in the next
few days.
-Max
On Tue, Nov 15, 2022 at 2:06 PM Zheng Yu Chen wrote:
@Gyula Good news: FLIP-256 is now finished and merged.
The FLIP-271 discussion seems to have stopped and I wonder if there are any
other comments. Can we start the vote and get going on this exciting
feature? Maybe I can get involved in developing this feature.
Gyula Fóra wrote on Tue, Nov 8, 2022 at 18
>> # Horizontal scaling V.S. Vertical scaling
>
>True. We left out vertical scaling intentionally. For now we assume CPU /
memory is set up by the user. While definitely useful, vertical scaling
>adds another dimension to the scaling problem which we wanted to tackle
later. I'll update the FLIP to
I had 2 extra comments to Max's reply:
1. About pre-allocating resources:
This could be done through the operator when the standalone deployment mode
is used relatively easily as there we have better control of pods/resources.
2. Session jobs:
There is a FLIP (
https://cwiki.apache.org/confluence
@Yang
>Since the current auto-scaling needs to fully redeploy the application, it
may fail to start due to lack of resources.
Great suggestions. I agree that we will have to preallocate / reserve
resources to ensure the rescaling doesn't take longer than expected.
This is not only a problem
Thanks for the fruitful discussion. I am really excited to see that
auto-scaling is really happening for the Flink Kubernetes operator. It will
be a very important step to make long-running Flink jobs run more smoothly.
I just have some immature ideas and want to share them here.
# Resource Reserv
Thanks for all the interest here and for the great remarks! Gyula
already did a great job addressing the questions here. Let me try to
add additional context:
@Biao Geng:
>1. For source parallelisms, if the user configure a much larger value than
>normal, there should be very little pending rec
@Dong:
Looking at the busyTime metrics in the TaskIOMetricGroup it seems that busy
time is actually defined as "not idle or (soft) backpressured". So I think
it would give us the correct reading based on what you said about the Kafka
sink.
In any case we have to test this and if something is not
Thanks for the explanation Gyula. Please see my reply inline.
BTW, has the proposed solution been deployed and evaluated with any
production workload? If yes, I am wondering if you could share the
experience, e.g. what is the likelihood of having regression and
improvement respectively after enabl
@Gyula,
Thanks for the explanation and the follow up actions. That sounds good to
me.
Thanks,
JunRui Lee
Yanfei Lei wrote on Mon, Nov 7, 2022 at 12:20:
Hi Max,
Thanks for the proposal. This proposal makes Flink better adapted to
cloud-native applications!
After reading the FLIP, I'm curious about some points:
1) It's said that "The first step is collecting metrics for all JobVertices
by combining metrics from all the runtime subtasks and comput
Hi Dong!
Let me try to answer the questions :)
1: busyTimeMsPerSecond is not specific to CPU; it measures the time spent
in the main record processing loop for an operator, if I
understand correctly. This includes IO operations too.
2: We should add this to the FLIP I agree. It would be a Durat
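As a quick illustrative sketch (not the FLIP's exact algorithm; the function and parameter names here are hypothetical), a busyTimeMsPerSecond reading could be turned into a parallelism suggestion roughly like this:

```python
import math


def suggest_parallelism(current_parallelism: int,
                        busy_time_ms_per_second: float,
                        target_utilization: float) -> int:
    """Illustrative only: pick a parallelism so that each subtask would sit
    at roughly `target_utilization` busy time, assuming load is evenly
    distributed and scales linearly with parallelism.

    busy_time_ms_per_second is the average per-subtask busy time in the
    range 0..1000, i.e. how many milliseconds per second the main record
    processing loop (including IO) is occupied.
    """
    busy_ratio = busy_time_ms_per_second / 1000.0
    # Capacity needed relative to the utilization target.
    return max(1, math.ceil(current_parallelism * busy_ratio / target_utilization))
```

For example, 4 subtasks that are 90% busy against a 60% utilization target would yield a suggestion of 6.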
Hi Max,
Thank you for the proposal. The proposal tackles a very important issue for
Flink users and the design looks promising overall!
I have some questions to better understand the proposed public interfaces
and the algorithm.
1) The proposal seems to assume that the operator's busyTimeMsPerSe
@Pedro:
The current design focuses on record processing time metrics. In most cases
when we need to scale (such as too much state per operator), record
processing time actually slows down, so it would detect that. Of course in the
future we can add new logic if we see something missing.
@ConradJam:
We
Hi Max,
Thank you for driving this FLIP! I have some advice for it:
could we have not only the (on/off) switch, but also one more option,
(advise)? After the user enables (advise), it does not actually perform
autoscaling; it only outputs tuning suggestions in notification form for t
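The advisory mode suggested above could be modeled as a three-way mode instead of a boolean switch. This is an illustrative sketch; the mode names are hypothetical, not actual operator config values.

```python
from enum import Enum


class AutoscalerMode(Enum):
    DISABLED = "disabled"  # autoscaler off entirely
    ADVISE = "advise"      # evaluate metrics and emit suggestions, never rescale
    ENABLED = "enabled"    # evaluate metrics and apply rescaling decisions


def should_apply_scaling(mode: AutoscalerMode) -> bool:
    """Only ENABLED actually triggers a rescale; ADVISE would merely report
    the suggested parallelism, e.g. as a notification or event."""
    return mode is AutoscalerMode.ENABLED
```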
possible to have a different strategy for "scaling in" to make it more
conservative. Or, more eagerly, allow a custom autoscaling strategy (e.g. a
time-based strategy).
Another side thought is that to recover a job from a checkpoint/savepoint,
the new parallelism cannot be larger than the max parallelism defined in the
checkpoint (see
https://github.com/apache/flink/blob/17a782c202c93343b8884cb52f4562f9c4ba593f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L128
).
Not sure if this limit should be mentioned in the FLIP.
Again, thanks for the great work and looking forward to using the flink k8s
operator with it!
Best,
Biao Geng
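The max-parallelism limit Biao mentions can be sketched as a simple clamp. This is illustrative Python with hypothetical names; in Flink the limit would come from the max parallelism recorded in the checkpoint metadata.

```python
def clamp_to_max_parallelism(desired_parallelism: int, max_parallelism: int) -> int:
    """A job restored from a checkpoint/savepoint cannot use a parallelism
    larger than the max parallelism recorded in that checkpoint, so any
    autoscaler decision has to be capped at that value (and floored at 1)."""
    return min(max(1, desired_parallelism), max_parallelism)
```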
From: Maximilian Michels
Date: Saturday, November 5, 2022 at 2:37 AM
To: dev
Cc: Gyula Fóra, Thomas Weise, Marton Balassi, Őrhidi Mátyás
Subject: [DISCUSS] FLIP-271: Autoscaling
Hi,
I would like to kick off the discussion on implementing autoscaling for
Flink as p
Thanks for preparing the FLIP and kicking off the discussion, Max. Looking
forward to this. :-)
On Sat, Nov 5, 2022 at 9:27 AM Niels Basjes wrote:
I'm really looking forward to seeing this in action.
Niels
On Fri, 4 Nov 2022, 19:37 Maximilian Michels, wrote:
Thank you Max, Gyula!
This is definitely an exciting one :)
Cheers,
Matyas
On Fri, Nov 4, 2022 at 1:16 PM Gyula Fóra wrote:
Hi!
Thank you for the proposal Max! It is great to see this highly desired
feature finally take shape.
I think we have all the right building blocks to make this successful.
Cheers,
Gyula
On Fri, Nov 4, 2022 at 7:37 PM Maximilian Michels wrote:
Hi,
I would like to kick off the discussion on implementing autoscaling for
Flink as part of the Flink Kubernetes operator. I've outlined an approach
here which I find promising:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
I've been discussing this approach with some