*+1 (non-binding)*

Thanks for working on this. We have seen a clear improvement with the
cooldown period that this feature adds.
Below are the test results from one of our clusters:

In a scale-out operation, 8 new nodes were added one by one with a gap of
~30 seconds. With the default behaviour there were 8 restarts within
4 minutes, whereas there was only one with this feature (cooldown period
of 4 minutes).
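
To make the arithmetic concrete, here is a toy model of the scenario above. It
is only a sketch and not Flink's scheduler logic: the function name, the 7
slots per node (inferred from the parallelism steps reported further down), and
the assumption that the previous restart happened 30 seconds before the first
node arrived are all hypothetical, chosen so the model matches our observation.

```python
def simulate_rescales(event_times, cooldown, last_restart,
                      start_parallelism=13, slots_per_node=7):
    """Toy model: one node arrives at each time in event_times (seconds).

    A restart is allowed only if at least `cooldown` seconds have passed
    since the previous restart; otherwise the new slots stay pending and
    are picked up by the next allowed restart.
    """
    restarts = []          # (time, parallelism after the restart)
    parallelism = start_parallelism
    pending = 0            # slots added but not yet used by the job
    for t in event_times:
        pending += slots_per_node
        if t - last_restart >= cooldown:
            parallelism += pending
            pending = 0
            last_restart = t
            restarts.append((t, parallelism))
    return restarts


# 8 nodes, one every 30 seconds (assumed exact for simplicity).
events = [30 * i for i in range(8)]

# Assumption: the job last restarted 30 s before the first node arrived.
default = simulate_rescales(events, cooldown=0, last_restart=-30)
cooled = simulate_rescales(events, cooldown=240, last_restart=-30)

print(len(default), [p for _, p in default])
print(len(cooled), [p for _, p in cooled])
```

Under these assumptions the model restarts 8 times without a cooldown and
once with a 240-second cooldown, ending at a parallelism of 69 in both cases.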

The job also processed more records during the restart window with this
feature: 2909764, versus only 1323960 with the default behaviour. With
multiple restarts, the job spends most of its time recovering, and any
progress the tasks made after the last successfully completed checkpoint
is lost.

Metrics              Default Adaptive Scheduler                          Adaptive Scheduler With Cooldown Period
NumRecordsProcessed  1323960                                             2909764
Job Parallelism      13 -> 20 -> 27 -> 34 -> 41 -> 48 -> 55 -> 62 -> 69  13 -> 69
NumRestarts          8                                                   1

Remarks:

1. The NumRecordsProcessed metric indicates the difference the cooldown
period brings in. When the job is doing multiple restarts, the task spends
most of the time recovering, and the progress the task made is lost during
the restart.

2. There is only one restart with the cooldown period, which happened when
the 8th node got added back.

On Wed, Jul 12, 2023 at 8:03 PM Etienne Chauchot <echauc...@apache.org>
wrote:

> Hi all,
>
> I'm going on vacation tonight for 3 weeks.
>
> Even though the vote is not finished, as the implementation is rather
> quick and the design discussion has settled, I preferred to implement
> FLIP-322 [1] so that people can take a look while I'm off.
>
> [1] https://github.com/apache/flink/pull/22985
>
> Best
>
> Etienne
>
> On 12/07/2023 at 09:56, Etienne Chauchot wrote:
> >
> > Hi all,
> >
> > Would you mind casting your vote in this second vote thread (opened
> > after new discussions) so that the subject can move forward?
> >
> > @David, @Chesnay, @Robert, you took part in the discussions; could you
> > please send your vote?
> >
> > Thank you very much
> >
> > Best
> >
> > Etienne
> >
> > On 06/07/2023 at 13:02, Etienne Chauchot wrote:
> >>
> >> Hi all,
> >>
> >> Thanks for your feedback about the FLIP-322: Cooldown period for
> >> adaptive scheduler [1].
> >>
> >> This FLIP was discussed in [2].
> >>
> >> I'd like to start a vote for it. The vote will be open for at least 72
> >> hours (until July 9th 15:00 GMT) unless there is an objection or
> >> insufficient votes.
> >>
> >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
> >> [2] https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6
> >>
> >> Best,
> >>
> >> Etienne
