Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
Thanks Max!

Etienne

On 09/08/2023 at 12:53, Maximilian Michels wrote:

+1 (binding)

-Max

On Tue, Aug 8, 2023 at 10:56 AM Etienne Chauchot wrote:

Hi all,

As part of the Flink bylaws, binding votes for FLIP changes are active committer votes.

Up to now, we have only 2 binding votes. Can one of the committers/PMC members vote on this FLIP?

Thanks

Etienne

On 08/08/2023 at 10:19, Etienne Chauchot wrote:

Hi Joseph,

Thanks for the detailed review!

Best

Etienne

On 14/07/2023 at 11:57, Prabhu Joseph wrote:

*+1 (non-binding)*

Thanks for working on this. We have seen good improvement during the cool down period with this feature. Below are details on the test results from one of our clusters:

On a scale-out operation, 8 new nodes were added one by one with a gap of ~30 seconds. There were 8 restarts within 4 minutes with the default behaviour, whereas there was only one with this feature (cooldown period of 4 minutes).

The number of records processed by the job during the restart window is higher with this feature (2909764), whereas it is only 1323960 with the default behaviour. With multiple restarts, the job spends most of the time recovering, and whatever progress the tasks made after the last successfully completed checkpoint is lost.

Metrics              Default Adaptive Scheduler                           Adaptive Scheduler With Cooldown Period
NumRecordsProcessed  1323960                                              2909764
Job Parallelism      13 -> 20 -> 27 -> 34 -> 41 -> 48 -> 55 -> 62 -> 69   13 -> 69
NumRestarts          8                                                    1

Remarks:

1. The NumRecordsProcessed metric indicates the difference the cool down period brings in. When the job is doing multiple restarts, the task spends most of the time recovering, and the progress the task made will be lost during the restart.

2. There is only one restart with Cool Down Period, which happened when the 8th node got added back.

On Wed, Jul 12, 2023 at 8:03 PM Etienne Chauchot wrote:

Hi all,

I'm going on vacation tonight for 3 weeks.

Even though the vote is not finished, as the implementation is rather quick and the design discussion had settled, I preferred implementing FLIP-322 [1] to allow people to take a look while I'm off.

[1] https://github.com/apache/flink/pull/22985

Best

Etienne

On 12/07/2023 at 09:56, Etienne Chauchot wrote:

Hi all,

Would you mind casting your vote on this second vote thread (opened after new discussions) so that the subject can move forward?

@David, @Chesnay, @Robert, you took part in the discussions, can you please send your vote?

Thank you very much

Best

Etienne

On 06/07/2023 at 13:02, Etienne Chauchot wrote:

Hi all,

Thanks for your feedback about the FLIP-322: Cooldown period for adaptive scheduler [1].

This FLIP was discussed in [2].

I'd like to start a vote for it. The vote will be open for at least 72 hours (until July 9th 15:00 GMT) unless there is an objection or insufficient votes.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
[2] https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6

Best,

Etienne
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
+1 (binding)

-Max
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
Hi all,

As part of the Flink bylaws, binding votes for FLIP changes are active committer votes.

Up to now, we have only 2 binding votes. Can one of the committers/PMC members vote on this FLIP?

Thanks

Etienne
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
Hi Joseph,

Thanks for the detailed review!

Best

Etienne
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
+1 (binding)

Thanks for driving this proposal; it will be useful for rescaling.

I'm preparing FLIP-334 [1], which will decouple the autoscaler from Kubernetes. In the end, we hope all kinds of Flink jobs work well with rescaling and the autoscaler.

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=263424711

Best
Rui Fan

On Fri, 28 Jul 2023 at 19:21, Martijn Visser wrote:

> +1 (binding)
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
+1 (binding)
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
*+1 (non-binding)*

Thanks for working on this. We have seen good improvement during the cool down period with this feature. Below are details on the test results from one of our clusters:

On a scale-out operation, 8 new nodes were added one by one with a gap of ~30 seconds. There were 8 restarts within 4 minutes with the default behaviour, whereas there was only one with this feature (cooldown period of 4 minutes).

The number of records processed by the job during the restart window is higher with this feature (2909764), whereas it is only 1323960 with the default behaviour. With multiple restarts, the job spends most of the time recovering, and whatever progress the tasks made after the last successfully completed checkpoint is lost.

Metrics              Default Adaptive Scheduler                           Adaptive Scheduler With Cooldown Period
NumRecordsProcessed  1323960                                              2909764
Job Parallelism      13 -> 20 -> 27 -> 34 -> 41 -> 48 -> 55 -> 62 -> 69   13 -> 69
NumRestarts          8                                                    1

Remarks:

1. The NumRecordsProcessed metric indicates the difference the cool down period brings in. When the job is doing multiple restarts, the task spends most of the time recovering, and the progress the task made will be lost during the restart.

2. There is only one restart with Cool Down Period, which happened when the 8th node got added back.
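The scale-out test reported by Prabhu Joseph in this thread (8 nodes added about 30 seconds apart, job parallelism starting at 13) can be reproduced with a toy model. The sketch below is not Flink's scheduler code. It assumes that each node contributes 7 slots (inferred from the 13 -> 20 -> 27 steps), that the previous rescale happened at t = 0, and that a rescale is only considered at the moment a node arrives. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RescaleSimulation:
    """Toy model of a cooldown between rescales (not Flink's implementation)."""

    cooldown_s: float            # minimum time between two rescales, in seconds
    parallelism: int             # parallelism the job starts with
    last_rescale_s: float = 0.0  # assumption: the previous rescale was at t = 0
    pending_slots: int = 0       # slots that arrived but are not yet used
    history: list = field(default_factory=list)

    def __post_init__(self):
        # Record the initial parallelism as the first point of the history.
        self.history.append(self.parallelism)

    def on_node_added(self, now_s, slots):
        """Register new slots; rescale only if the cooldown has elapsed."""
        self.pending_slots += slots
        if now_s - self.last_rescale_s >= self.cooldown_s:
            self.parallelism += self.pending_slots
            self.pending_slots = 0
            self.last_rescale_s = now_s
            self.history.append(self.parallelism)

    @property
    def restarts(self):
        # Every rescale in this model implies one job restart.
        return len(self.history) - 1


def run(cooldown_s):
    """Replay the scenario: 8 nodes, one every 30 s, 7 slots each (assumed)."""
    sim = RescaleSimulation(cooldown_s=cooldown_s, parallelism=13)
    for i in range(1, 9):
        sim.on_node_added(now_s=30 * i, slots=7)
    return sim


if __name__ == "__main__":
    for cooldown in (0, 240):
        sim = run(cooldown)
        path = " -> ".join(str(p) for p in sim.history)
        print(f"cooldown={cooldown}s restarts={sim.restarts} parallelism: {path}")
```

Under these assumptions the model gives 8 restarts without a cooldown and a single restart (13 -> 69) with a 240-second cooldown, in line with the table in the message. A real scheduler would also need to act once the cooldown expires without a new node arriving, which this sketch deliberately leaves out.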
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
Hi all,

I'm going on vacation tonight for 3 weeks.

Even though the vote is not finished, as the implementation is rather quick and the design discussion had settled, I preferred implementing FLIP-322 [1] to allow people to take a look while I'm off.

[1] https://github.com/apache/flink/pull/22985

Best

Etienne
Re: [VOTE] FLIP-322 Cooldown period for adaptive scheduler. Second vote.
Hi all,

Would you mind casting your vote on this second vote thread (opened after new discussions) so that the subject can move forward?

@David, @Chesnay, @Robert, you took part in the discussions, can you please send your vote?

Thank you very much

Best

Etienne

On 06/07/2023 at 13:02, Etienne Chauchot wrote:

Hi all,

Thanks for your feedback about the FLIP-322: Cooldown period for adaptive scheduler [1].

This FLIP was discussed in [2].

I'd like to start a vote for it. The vote will be open for at least 72 hours (until July 9th 15:00 GMT) unless there is an objection or insufficient votes.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-322+Cooldown+period+for+adaptive+scheduler
[2] https://lists.apache.org/thread/qvgxzhbp9rhlsqrybxdy51h05zwxfns6

Best,

Etienne