Re: [DISCUSS] FLIP-271: Autoscaling

2022-12-15 Thread ConradJam
Thanks to Gyula and Max for a great start. I'll try this feature out and report any problems I find as issues 😀 Maximilian Michels wrote on Thu, Dec 15, 2022 at 02:37: > A heads-up: Gyula just opened a PR with the code contribution based on the > design: https://github.com/apache/flink-kubernetes-operator/pull/484 > > We

Re: [DISCUSS] FLIP-271: Autoscaling

2022-12-14 Thread Maximilian Michels
A heads-up: Gyula just opened a PR with the code contribution based on the design: https://github.com/apache/flink-kubernetes-operator/pull/484 We have run some tests based on the current state and achieved very good results thus far. We were able to cut the resources of some of the deployments by

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-25 Thread Prasanna kumar
Thanks for the reply, Gyula and Max. Prasanna On Sat, 26 Nov 2022, 00:24 Maximilian Michels, wrote: > Hi John, hi Prasanna, hi Rui, > > Gyula already gave great answers to your questions, just adding to it: > > >What’s the reason to add auto scaling to the Operator instead of to the > JobMana

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-25 Thread Maximilian Michels
Hi John, hi Prasanna, hi Rui, Gyula already gave great answers to your questions, just adding to it: >What’s the reason to add auto scaling to the Operator instead of to the JobManager? As Gyula mentioned, the JobManager is not the ideal place, at least not until Flink supports in-place autoscal

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread Rui Fan
Hi Gyula Thanks for the clarification! Best Rui Fan On Fri, Nov 25, 2022 at 1:50 PM Gyula Fóra wrote: > Rui, Prasanna: > > I am afraid that creating a completely independent autoscaler process that > works with any type of Flink clusters is out of scope right now due to the > following reasons

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread Gyula Fóra
Rui, Prasanna: I am afraid that creating a completely independent autoscaler process that works with any type of Flink clusters is out of scope right now due to the following reasons: If we were to create a new general process, we would have to implement high availability and a pluggable mechanis

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread John Roesler
Thanks for this answer, Gyula! -John On Thu, Nov 24, 2022, at 14:53, Gyula Fóra wrote: > Hi John! > > Thank you for the excellent question. > > There are a few reasons why we felt that the operator is the right place for > this component: > > - Ideally the autoscaler is a separate process (an outsi

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread Prasanna kumar
Hi Max, This is a great initiative and a good discussion. We have set up a Flink cluster on Amazon ECS, so it would be good to design this in such a way that we can deploy the autoscaler in a separate Docker image that observes the JM and jobs and emits outputs that can be used to trigger the E

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread Rui Fan
Hi Gyula, Max, John! Thanks for the great FLIP, it's very useful for Flink users. > Ideally the autoscaler is a separate process (an outside observer) Could we eventually use the autoscaler as an outside tool, or run it as a separate Java process? If it's complex, can the part that detects the job

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread Gyula Fóra
Hi John! Thank you for the excellent question. There are a few reasons why we felt that the operator is the right place for this component: - Ideally the autoscaler is a separate process (an outside observer), and the jobmanager is very much tied to the lifecycle of the job. The operator is a pe

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-24 Thread John Roesler
Hi Max, Thanks for the FLIP! I’ve been curious about one point. I can imagine some good reasons for it but wonder what you have in mind. What’s the reason to add auto scaling to the Operator instead of to the JobManager? It seems like adding that capability to the JobManager would be a big

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-23 Thread Maximilian Michels
Thanks for your comments @Dong and @Chen. It is true that not all the details are contained in the FLIP. The document is meant as a general design concept. As for the rescaling time, this is going to be a configurable setting for now but it is foreseeable that we will provide auto-tuning of this c

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-21 Thread Maximilian Michels
>Do we think the scaler could be a plugin or hard coded? +1 For pluggable scaling logic. On Mon, Nov 21, 2022 at 3:38 AM Chen Qin wrote: > On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra wrote: > > > Hi Chen! > > > > I think in the long term it makes sense to provide some pluggable > > mechanisms

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-20 Thread Chen Qin
On Sun, Nov 20, 2022 at 7:25 AM Gyula Fóra wrote: > Hi Chen! > > I think in the long term it makes sense to provide some pluggable > mechanisms but it's not completely trivial where exactly you would plug in > your custom logic at this point. > sounds good, more specifically would be great if it

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-20 Thread Gyula Fóra
Hi Chen! I think in the long term it makes sense to provide some pluggable mechanisms but it's not completely trivial where exactly you would plug in your custom logic at this point. In any case the problems you mentioned should be solved robustly by the algorithm itself without any customization

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-19 Thread Chen Qin
Hi Gyula, Do we think the scaler could be a plugin, or would it be hard-coded? We observed some cases the scaler can't address (e.g. degradation of a service that async I/O depends on, or a small spike that isn't worth restarting the job for). Thanks, Chen On Fri, Nov 18, 2022 at 1:03 AM Gyula Fóra wrote: > Hi Dong! > > Could you

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-18 Thread Dong Lin
Hi Gyula! Thanks for all the explanations! Personally, I would like to see a full story of how the algorithm works (e.g. how it determines the estimated time for scale), how users can get the basic information needed to monitor the health/effectiveness of autoscaler (e.g. metrics), and how the al

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-18 Thread Gyula Fóra
Hi Dong! Could you please confirm that your main concerns have been addressed? Some other minor details that might not have been fully clarified: - The prototype has been validated on some production workloads yes - We are only planning to use metrics that are generally available and are previo

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-17 Thread Gyula Fóra
Hi Dong! This is not an experimental feature proposal. The implementation of the prototype is still in an experimental phase but by the time the FLIP, initial prototype and review is done, this should be in a good stable first version. This proposal is pretty general as autoscalers/tuners get as f

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-17 Thread Dong Lin
Hi Gyula, If I understand correctly, this autopilot proposal is an experimental feature and its configs/metrics are not mature enough to provide backward compatibility yet. And the proposal provides high-level ideas of the algorithm but it is probably too complicated to explain it end-to-end. On

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-16 Thread Gyula Fóra
Hi Dong, Let me address your comments. Time for scale / backlog processing time derivation: We can add some more details to the Flip but at this point the implementation is actually much simpler than the algorithm to describe it. I would not like to add more equations etc because it just overcomp

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-16 Thread Dong Lin
Thanks for the update! Please see comments inline. On Tue, Nov 15, 2022 at 11:46 PM Maximilian Michels wrote: > Of course! Let me know if your concerns are addressed. The wiki page has > been updated. > > >It will be great to add this in the FLIP so that reviewers can understand > how the source

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-15 Thread Maximilian Michels
Of course! Let me know if your concerns are addressed. The wiki page has been updated. >It will be great to add this in the FLIP so that reviewers can understand how the source parallelisms are computed and how the algorithm works end-to-end. I've updated the FLIP page to add more details on how

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-15 Thread Dong Lin
Hi Maximilian, It seems that the following comments from the previous discussions have not been addressed yet. Any chance we can have them addressed before starting the voting thread? Thanks, Dong On Mon, Nov 7, 2022 at 2:33 AM Gyula Fóra wrote: > Hi Dong! > > Let me try to answer the question

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-15 Thread Gyula Fóra
I agree we should start the vote. On a separate (but related) small discussion we could also decide backporting https://issues.apache.org/jira/browse/FLINK-29501 for 1.16.1 so that the autoscaler could be more efficiently developed and tested and to make it 1.16 compatible. Cheers, Gyula On Tue,

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-15 Thread Maximilian Michels
+1 If there are no further comments, I'll start a vote thread in the next few days. -Max On Tue, Nov 15, 2022 at 2:06 PM Zheng Yu Chen wrote: > @Gyula Good news: FLIP-256 is now finished and merged. > The FLIP-271 discussion seems to have stopped and I wonder if there are any > othe

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-15 Thread Zheng Yu Chen
@Gyula Good news: FLIP-256 is now finished and merged. The FLIP-271 discussion seems to have stopped and I wonder if there are any other comments. Can we move to a vote and get started on this exciting feature? 😀 Maybe I can get involved in developing this feature. Gyula Fóra wrote on Tue, Nov 8, 2022 at 18

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-08 Thread Maximilian Michels
>> # Horizontal scaling V.S. Vertical scaling > >True. We left out vertical scaling intentionally. For now we assume CPU / memory is set up by the user. While definitely useful, vertical scaling >adds another dimension to the scaling problem which we wanted to tackle later. I'll update the FLIP to

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-08 Thread Gyula Fóra
I had 2 extra comments to Max's reply: 1. About pre-allocating resources: This could be done through the operator when the standalone deployment mode is used relatively easily as there we have better control of pods/resources. 2. Session jobs: There is a FLIP ( https://cwiki.apache.org/confluence

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-08 Thread Maximilian Michels
@Yang >Since the current auto-scaling needs to fully redeploy the application, it may fail to start due to lack of resources. Great suggestions. I agree that we will have to preallocate / reserve resources to ensure the rescaling doesn't take longer than expected. This is not only a problem

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-07 Thread Yang Wang
Thanks for the fruitful discussion. I am really excited to see auto-scaling become a reality for the Flink Kubernetes operator. It will be a very important step toward making long-running Flink jobs run more smoothly. I just have some early ideas and want to share them here. # Resource Reserv

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-07 Thread Maximilian Michels
Thanks for all the interest here and for the great remarks! Gyula already did a great job addressing the questions here. Let me try to add additional context: @Biao Geng: >1. For source parallelisms, if the user configure a much larger value than >normal, there should be very little pending rec

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-07 Thread Gyula Fóra
@Dong: Looking at the busyTime metrics in the TaskIOMetricGroup it seems that busy time is actually defined as "not idle or (soft) backpressured". So I think it would give us the correct reading based on what you said about the Kafka sink. In any case we have to test this and if something is not
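
The definition quoted in this message ("not idle or (soft) backpressured") can be illustrated with a small sketch. This is only an illustration of that definition, not the code in Flink or in the operator: the class name, method name, and the clamping to the range 0..1000 are assumptions made for the example.

```java
// Hypothetical illustration: busy time per second derived as the part of each
// second that is neither idle nor back-pressured. All names are made up.
final class BusyTimeSketch {

    private static final long MS_PER_SECOND = 1000L;

    // Returns the busy milliseconds per second given the idle and
    // back-pressured milliseconds per second, clamped to [0, 1000].
    static long busyMsPerSecond(long idleMsPerSecond, long backPressuredMsPerSecond) {
        long busy = MS_PER_SECOND - idleMsPerSecond - backPressuredMsPerSecond;
        return Math.max(0L, Math.min(MS_PER_SECOND, busy));
    }

    public static void main(String[] args) {
        // A task that is idle 200 ms and back-pressured 300 ms of every second.
        System.out.println(busyMsPerSecond(200, 300)); // prints 500
    }
}
```

Under this reading, a sink that waits on a slow external system without being idle or back-pressured would still count as busy, which is the behaviour the message expects for the Kafka sink case.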

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-07 Thread Dong Lin
Thanks for the explanation Gyula. Please see my reply inline. BTW, has the proposed solution been deployed and evaluated with any production workload? If yes, I am wondering if you could share the experience, e.g. what is the likelihood of having regression and improvement respectively after enabl

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-07 Thread JunRui Lee
@Gyula, Thanks for the explanation and the follow up actions. That sounds good to me. Thanks, JunRui Lee Yanfei Lei wrote on Mon, Nov 7, 2022 at 12:20: > Hi Max, > > Thanks for the proposal. This proposal makes Flink better adapted to > cloud-native applications! > > After reading the FLIP, I'm curious abo

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-06 Thread Yanfei Lei
Hi Max, Thanks for the proposal. This proposal makes Flink better adapted to cloud-native applications! After reading the FLIP, I'm curious about some points: 1) It's said that "The first step is collecting metrics for all JobVertices by combining metrics from all the runtime subtasks and comput

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-06 Thread Gyula Fóra
Hi Dong! Let me try to answer the questions :) 1: busyTimeMsPerSecond is not specific to CPU; it measures the time spent in the main record processing loop for an operator, if I understand correctly. This includes IO operations too. 2: We should add this to the FLIP, I agree. It would be a Durat

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-06 Thread Dong Lin
Hi Max, Thank you for the proposal. The proposal tackles a very important issue for Flink users and the design looks promising overall! I have some questions to better understand the proposed public interfaces and the algorithm. 1) The proposal seems to assume that the operator's busyTimeMsPerSe

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-06 Thread Gyula Fóra
@Pedro: The current design focuses on record processing time metrics. In most cases when we need to scale (such as too much state per operator), record processing time actually slows, so it would detect that. Of course in the future we can add new logic if we see something missing. @ConradJam: We

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Zheng Yu Chen
Hi Max, Thank you for driving this FLIP. I have a suggestion: in addition to the on/off switch, could we have one more option, an advice mode? When the user enables advice mode, the autoscaler does not actually perform autoscaling; it only outputs tuning suggestions as notifications for t

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Pedro Silva
>>>> possible >>>>> to have different strategy for “scaling in” to make it more >> conservative. >>>>> Or more eagerly, allow custom autoscaling strategy(e.g. time-based >>>>> strategy). >>>>> Another side thought is that

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Gyula Fóra
rategy for “scaling in” to make it more > conservative. > >>> Or more eagerly, allow custom autoscaling strategy(e.g. time-based > >>> strategy). > >>> Another side thought is that to recover a job from > checkpoint/savepoint, > >>> the new parall

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Pedro Silva
the new parallelism cannot be larger than max parallelism defined in the >>> checkpoint(see this< >> https://github.com/apache/flink/blob/17a782c202c93343b8884cb52f4562f9c4ba593f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L128 >>> ).
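
The limit mentioned in this quote (a job restored from a checkpoint or savepoint cannot use a parallelism larger than the max parallelism recorded in it) suggests that any computed target parallelism needs an upper bound. A minimal sketch of such a bound follows; the class and method names are hypothetical and this is not taken from the operator's implementation.

```java
// Hypothetical illustration: cap a desired parallelism at the max parallelism
// stored with the checkpoint/savepoint the job will be restored from.
final class ParallelismClampSketch {

    static int clampToMaxParallelism(int desiredParallelism, int maxParallelism) {
        if (desiredParallelism < 1 || maxParallelism < 1) {
            throw new IllegalArgumentException("parallelism values must be >= 1");
        }
        return Math.min(desiredParallelism, maxParallelism);
    }

    public static void main(String[] args) {
        // A scaling decision asking for 200 subtasks with max parallelism 128.
        System.out.println(clampToMaxParallelism(200, 128)); // prints 128
    }
}
```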

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread JunRui Lee
runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L128 > >). > > Not sure if this limit should be mentioned in the FLIP. > > > > Again, thanks for the great work and looking forward to using flink k8s > > operator with it! > > > > Best

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Gyula Fóra
Biao Geng > > From: Maximilian Michels > Date: Saturday, November 5, 2022 at 2:37 AM > To: dev > Cc: Gyula Fóra , Thomas Weise , > Marton Balassi , Őrhidi Mátyás < > matyas.orh...@gmail.com> > Subject: [DISCUSS] FLIP-271: Autoscaling > Hi, > > I would like

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Biao Geng
r with it! Best, Biao Geng From: Maximilian Michels Date: Saturday, November 5, 2022 at 2:37 AM To: dev Cc: Gyula Fóra , Thomas Weise , Marton Balassi , Őrhidi Mátyás Subject: [DISCUSS] FLIP-271: Autoscaling Hi, I would like to kick off the discussion on implementing autoscaling for Flink as p

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Márton Balassi
Thanks for preparing the FLIP and kicking off the discussion, Max. Looking forward to this. :-) On Sat, Nov 5, 2022 at 9:27 AM Niels Basjes wrote: > I'm really looking forward to seeing this in action. > > Niels > > On Fri, 4 Nov 2022, 19:37 Maximilian Michels, wrote: > >> Hi, >> >> I would lik

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-05 Thread Niels Basjes
I'm really looking forward to seeing this in action. Niels On Fri, 4 Nov 2022, 19:37 Maximilian Michels, wrote: > Hi, > > I would like to kick off the discussion on implementing autoscaling for > Flink as part of the Flink Kubernetes operator. I've outlined an approach > here which I find promi

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-04 Thread Őrhidi Mátyás
Thank you Max, Gyula! This is definitely an exciting one :) Cheers, Matyas On Fri, Nov 4, 2022 at 1:16 PM Gyula Fóra wrote: > Hi! > > Thank you for the proposal Max! It is great to see this highly desired > feature finally take shape. > > I think we have all the right building blocks to make t

Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-04 Thread Gyula Fóra
Hi! Thank you for the proposal Max! It is great to see this highly desired feature finally take shape. I think we have all the right building blocks to make this successful. Cheers, Gyula On Fri, Nov 4, 2022 at 7:37 PM Maximilian Michels wrote: > Hi, > > I would like to kick off the discussio

[DISCUSS] FLIP-271: Autoscaling

2022-11-04 Thread Maximilian Michels
Hi, I would like to kick off the discussion on implementing autoscaling for Flink as part of the Flink Kubernetes operator. I've outlined an approach here which I find promising: https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling I've been discussing this approach with some