Hi, community.

FYI: Because the design of the query interface for the rescale history was moved to FLIP-487[1] during the discussion, we have changed the title of this FLIP to:
FLIP-495: Support AdaptiveScheduler record and store the rescale history.

[1] https://cwiki.apache.org/confluence/x/vZCMEw

Best regards,
Yuepeng Pan

On 2025/08/19 09:13:22 Yuepeng Pan wrote:
> Bumping this thread kindly. Thanks!
>
> Best,
> Yuepeng Pan
>
> At 2025-08-13 14:52:26, "Yuepeng Pan" <[email protected]> wrote:
>
> Hi, Matthias,
> Thank you very much for your comments!
> I have carefully read your reply and made some changes in the hope of making
> improvements.
> Please help take a look.
>
> For your comments:
>
> > 1. You mention a few options when it comes to storing the data, which is
> > good. The FLIP doesn't point out, though, what option you're going to go
> > for as part of this FLIP (as far as I can see). It would be good to only
> > outline the option to go for in the FLIP and list the other options as
> > rejected alternatives (with the pros and cons). I think it makes sense to
> > go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> > for now). The other options can be considered as a follow-up.
>
> This is very meaningful. Based on this comment, I have kept option 3 in its
> original place and moved the other candidate options to [1].
>
> > 2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> > COMPLETED): Can we clarify in the FLIP under what conditions the
> > rescaling transitions into each of the three terminal states?
>
> Yes, this is a reasonable request for understanding and explaining the logic
> of transitions to terminated states.
> A new subsection [2] has been added to address this.
>
> > 3. The section "The information to record in a rescale event" could be
> > restructured into four sections (to remove redundancy):
> > a) The IDs (Rescale ID, resourceRequirementsEpochID,
> > subRescaleIdOfResourceRequirementsEpochID). What about making these names
> > easier to read: GlobalRescaleID, RescaleUUID, RescaleAttemptId?
> > b) Per-vertex data which includes: JobVertexID, JobVertexName,
> > SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
> > desired, post-rescale)
> > c) The SlotSharingGroup information: SlotSharingGroupId, name,
> > ResourceProfile
> > d) Other information: Timestamps of state transitions, etc. as laid out in
> > the FLIP already
>
> That makes sense to me. Please check [3] for the latest updates in this part.
>
> > 4. The FLIP doesn't explain how the data is passed through the
> > AdaptiveScheduler states. We should be handling some kind of
> > RescaleSnapshot that is passed through the different states and updated and
> > its final state is stored somewhere within AdaptiveScheduler in the end, I
> > guess. Can we clarify that in the FLIP?
>
> Indeed, this was missing in the original FLIP. To address this, I have added
> [4], which focuses on describing how a Rescale is represented,
> and how we can quickly pass and maintain the Rescale history.
>
> > 5. You mention the config parameters for the cache in the public interface
> > section. But there's no mention of any caching and how that is used
> > within the FLIP.
>
> Sorry for the rough description in the previous version.
> Since this part belongs to the REST API acceleration mechanism for rescaling,
> and Option 6 seems reasonable to me,
> I plan to add it to FLIP-487 once the design of FLIP-495 has reached
> consensus.
> Of course, if needed, I'd be happy to clarify the usage and purpose of this
> parameter in the current email thread.
>
> > 6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> > be about the actual implementation details and how the data is stored
> > internally whereas FLIP-487 is about exposing the information to the
> > outside through the REST API and the Flink UI. That would be a way to
> > decrease the scope of FLIP-495. WDYT?
>
> That sounds nice to me. Therefore, I have moved all REST API-related changes
> to FLIP-487.
> BTW, to avoid repetitive changes in FLIP-487, I'll start organizing FLIP-487
> after FLIP-495 has been finalized.
>
> Looking forward to your next review!
>
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Aboutrescaleeventsstorage.1
> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-ThemainscenarioswhereRescalestatusswitchestoterminated
> [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Theinformationtorecordinarescaleevent
> [4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-InternalInterfaces
>
> Best regards,
> Yuepeng Pan
>
> At 2025-08-10 23:54:37, "Matthias Pohl" <[email protected]> wrote:
> > Hi Yuepeng,
> > thanks for reminding me of this FLIP. I went over it and have a few items
> > which we might need to address before we can actually finalize the vote:
> >
> > 1. You mention a few options when it comes to storing the data, which is
> > good. The FLIP doesn't point out, though, what option you're going to go
> > for as part of this FLIP (as far as I can see). It would be good to only
> > outline the option to go for in the FLIP and list the other options as
> > rejected alternatives (with the pros and cons). I think it makes sense to
> > go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> > for now). The other options can be considered as a follow-up.
> > 2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> > COMPLETED): Can we clarify in the FLIP under what conditions the
> > rescaling transitions into each of the three terminal states?
> > 3. The section "The information to record in a rescale event" could be
> > restructured into four sections (to remove redundancy):
> > a) The IDs (Rescale ID, resourceRequirementsEpochID,
> > subRescaleIdOfResourceRequirementsEpochID). What about making these names
> > easier to read: GlobalRescaleID, RescaleUUID, RescaleAttemptId?
> > b) Per-vertex data which includes: JobVertexID, JobVertexName,
> > SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
> > desired, post-rescale)
> > c) The SlotSharingGroup information: SlotSharingGroupId, name,
> > ResourceProfile
> > d) Other information: Timestamps of state transitions, etc. as laid out in
> > the FLIP already
> > 4. The FLIP doesn't explain how the data is passed through the
> > AdaptiveScheduler states. We should be handling some kind of
> > RescaleSnapshot that is passed through the different states and updated and
> > its final state is stored somewhere within AdaptiveScheduler in the end, I
> > guess. Can we clarify that in the FLIP?
> > 5. You mention the config parameters for the cache in the public interface
> > section. But there's no mention of any caching and how that is used
> > within the FLIP.
> > 6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> > be about the actual implementation details and how the data is stored
> > internally whereas FLIP-487 is about exposing the information to the
> > outside through the REST API and the Flink UI. That would be a way to
> > decrease the scope of FLIP-495. WDYT?
> >
> > Best,
> > Matthias
> >
> > On Mon, Mar 24, 2025 at 11:37 AM Yuepeng Pan <[email protected]> wrote:
> >
> > > Hi, Community,
> > >
> > > There haven’t been any further responses to this email over the past few
> > > days.
> > > I'd like to initiate a vote on the current proposal[1] in the next few
> > > days.
> > > Please rest assured that I’m proceeding cautiously and not rushing the
> > > process.
> > > If there are any concerns about this FLIP-495[1],
> > > I will gladly pause and make the adjustments.
> > >
> > > Best regards,
> > > Yuepeng Pan
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > >
> > > On 2024/12/17 15:18:45 Yuepeng Pan wrote:
> > > > Hi community,
> > > >
> > > > We discussed several aspects of FLIP-487[1] 'Show history of rescales in
> > > > Web UI for AdaptiveScheduler'
> > > > and received a lot of valuable feedback. Based on the suggestions from
> > > > the email thread[2],
> > > > we plan to split the original proposal for FLIP-487[1].
> > > >
> > > > The current email thread and the FLIP-495[3] wiki will be used to
> > > > discuss 'Support AdaptiveScheduler in recording and querying the rescale
> > > > history',
> > > > while FLIP-487[1] will primarily focus on display-related design
> > > > content.
> > > >
> > > > Looking forward to any feedback and opinions on FLIP-495[3].
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > > >
> > > > [2] https://lists.apache.org/thread/f4md4btkf006mxcxf66bng1kfz0rsn8c
> > > >
> > > > [3] https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > > >
> > > > Thank you very much.
> > > >
> > > > Best regards,
> > > > Yuepeng Pan
