+1 (binding)

Thanks for driving this proposal; it will be useful for rescaling.
I'm preparing FLIP-334 [1], which will decouple the autoscaler from Kubernetes. In the end, we hope all kinds of Flink jobs work well with rescaling and the autoscaler.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424711

Best,
Rui Fan

On Fri, 28 Jul 2023 at 19:21, Martijn Visser <martijnvis...@apache.org> wrote:

> +1 (binding)
>
> On Fri, Jul 14, 2023 at 11:59 AM Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:
>
> > *+1 (non-binding)*
> >
> > Thanks for working on this. We have seen good improvement during the
> > cooldown period with this feature.
> > Below are details on the test results from one of our clusters:
> >
> > On a scale-out operation, 8 new nodes were added one by one with a gap of
> > ~30 seconds. There were 8 restarts within 4 minutes with the default
> > behaviour, whereas there was only one with this feature (cooldown period
> > of 4 minutes).
> >
> > The number of records processed by the job during the restart window is
> > higher with this feature (2909764), whereas it is only 1323960 with the
> > default behaviour. With multiple restarts, the job spends most of the
> > time recovering, and any progress the tasks made after the last
> > successfully completed checkpoint is lost.
> >
> > Metrics              Default Adaptive Scheduler                          Adaptive Scheduler With Cooldown Period
> > NumRecordsProcessed  1323960                                             2909764
> > Job Parallelism      13 -> 20 -> 27 -> 34 -> 41 -> 48 -> 55 -> 62 -> 69  13 -> 69
> > NumRestarts          8                                                   1
> >
> > Remarks:
> > 1. The NumRecordsProcessed metric indicates the difference the cooldown
> > period makes. When the job is doing multiple restarts, the task spends
> > most of the time recovering, and the progress the task made is lost
> > during the restart.
> > 2. There is only one restart with the cooldown period, which happened
> > when the 8th node got added back.
> >
> > On Wed, Jul 12, 2023 at 8:03 PM Etienne Chauchot <echauc...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > I'm going on vacation tonight for 3 weeks.
> > >
> > > Even if the vote is not finished, as the implementation is rather quick
> > > and the design discussion had settled, I preferred to implement
> > > FLIP-322 [1] to allow people to take a look while I'm off.
> > >
> > > [1] https://github.com/apache/flink/pull/22985
> > >
> > > Best
> > >
> > > Etienne
> > >
> > > On 12/07/2023 at 09:56, Etienne Chauchot wrote:
> > >
> > > > Hi all,
> > > >
> > > > Would you mind casting your vote in this second vote thread (opened
> > > > after new discussions) so that the subject can move forward?
> > > >
> > > > @David, @Chesnay, @Robert, you took part in the discussions, can you
> > > > please send your vote?
> > > >
> > > > Thank you very much
> > > >
> > > > Best
> > > >
> > > > Etienne
> > > >
> > > > On 06/07/2023 at 13:02, Etienne Chauchot wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Thanks for your feedback about the FLIP-322: Cooldown period for
> > > > > adaptive scheduler [1].
> > > > >
> > > > > This FLIP was discussed in [2].
> > > > >
> > > > > I'd like to start a vote for it. The vote will be open for at least
> > > > > 72 hours (until July 9th 15:00 GMT) unless there is an objection or
> > > > > insufficient votes.
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
> > > > > [2] https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6
> > > > >
> > > > > Best,
> > > > >
> > > > > Etienne
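
As a rough illustration of the restart counts in Prabhu Joseph's test report, the toy model below replays 8 node arrivals 30 seconds apart, with and without a cooldown. It is not Flink code: `simulate_rescales` and its parameters are hypothetical names. It assumes that the job start counts as the last rescale at t=0, that nodes arrive exactly 30 s apart starting at t=30 s, that each node contributes 7 slots (the step visible in the reported parallelism sequence), and that slots arriving during the cooldown are applied together in one rescale.

```python
from typing import List, Tuple


def simulate_rescales(
    arrivals: List[float],
    slots_per_node: int,
    initial_parallelism: int,
    cooldown: float,
) -> Tuple[int, List[int]]:
    """Toy model (not Flink code): return (restart count, parallelism history)."""
    parallelism = initial_parallelism
    history = [parallelism]
    last_rescale = 0.0  # assumption: job start counts as the last rescale
    pending = 0         # slots that have arrived but are not used yet
    restarts = 0

    def rescale(at: float) -> None:
        nonlocal parallelism, pending, last_rescale, restarts
        parallelism += pending
        pending = 0
        last_rescale = at
        restarts += 1
        history.append(parallelism)

    for t in sorted(arrivals):
        # Slots deferred by the cooldown are applied once it has expired.
        if pending and last_rescale + cooldown < t:
            rescale(last_rescale + cooldown)
        pending += slots_per_node
        # Outside the cooldown window, a new node triggers a rescale at once.
        if t >= last_rescale + cooldown:
            rescale(t)

    if pending:  # apply whatever is still waiting when the last cooldown ends
        rescale(last_rescale + cooldown)
    return restarts, history


node_arrivals = [30.0 * i for i in range(1, 9)]  # 8 nodes, 30 s apart
print(simulate_rescales(node_arrivals, 7, 13, cooldown=0.0))
# (8, [13, 20, 27, 34, 41, 48, 55, 62, 69])
print(simulate_rescales(node_arrivals, 7, 13, cooldown=240.0))
# (1, [13, 69])
```

Under these assumptions the model gives 8 restarts without a cooldown and a single 13 -> 69 rescale with a 4-minute cooldown, which matches the reported table. It says nothing about records processed, which depends on recovery time and checkpointing.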