Hi, community.

FYI: To ensure that the rescale history recorded and stored by FLIP-495 can be accessed by external systems and users, we plan to release the FLIP-495 functionality together with at least two sub-tasks[1][2] of FLIP-487[3].
These two sub-tasks will respectively support:
- retrieving all current rescale history records
- retrieving the detailed record of a specific rescale by its rescale UUID

[1] https://issues.apache.org/jira/browse/FLINK-38894
[2] https://issues.apache.org/jira/browse/FLINK-38895
[3] https://issues.apache.org/jira/browse/FLINK-22258

Best,
Yuepeng Pan

On 2025/09/18 04:03:22 Yuepeng Pan wrote:
> Hi, community.
>
> FYI:
> Since the design work of the query interface of rescale history was separated
> into FLIP-487[1] during the discussion, we have therefore changed the title
> of the FLIP to:
>
> FLIP-495: Support AdaptiveScheduler record and store the rescale history.
>
> [1] https://cwiki.apache.org/confluence/x/vZCMEw
>
> Best regards,
> Yuepeng Pan
>
> On 2025/08/19 09:13:22 Yuepeng Pan wrote:
> > Bumping this thread kindly. Thanks!
> >
> > Best,
> > Yuepeng Pan
> >
> > At 2025-08-13 14:52:26, "Yuepeng Pan" <[email protected]> wrote:
> >
> > Hi, Matthias,
> > Thank you very much for your comments!
> > I have carefully read your reply and made some changes in the hope of
> > making improvements.
> > Please help take a look.
> >
> > For your comments:
> >
> > > 1. You mention a few options for when it comes to storing the data which is
> > > good. The FLIP doesn't point out, though, what option you're going to go
> > > for as part of this FLIP (as far as I can see). It would be good to only
> > > outline the option to go for in the FLIP and list the other options as
> > > rejected alternatives (with the pros and cons). I think it makes sense to
> > > go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> > > for now). The other options can be considered as a follow-up.
> >
> > This is very meaningful. Based on this comment, I have kept option 3 in its
> > original place and moved the other candidate options to [1].
> >
> > > 2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> > > COMPLETED): Can we clarify in the FLIP under what conditions the
> > > rescaling transitions into each of the three terminal states?
> >
> > Yes, this is a reasonable request for understanding and explaining the
> > logic of transitions to terminated states.
> > A new subsection [2] has been added to address this.
> >
> > > 3. The section "The information to record in a rescale event" could be
> > > restructured in four sections (to remove redundancy):
> > > a) The IDs (Rescale ID, resourceRequirementsEpochID,
> > > subRescaleIdOfResourceRequirementsEpochID):
> > > What about making these names easier to read: GlobalRescaleID, RescaleUUID,
> > > RescaleAttemptId)
> > > b) Per-vertex data which includes: JobVertexID, JobVertexName,
> > > SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
> > > desired, post-rescale)
> > > c) The SlotSharingGroup information: SlotSharingGroupId, name,
> > > ResourceProfile
> > > d) Other information: Timestamps of state transitions, etc. as laid out in
> > > the FLIP already
> >
> > That makes sense to me. Please check [3] for the latest updates in this
> > part.
> >
> > > 4. The FLIP doesn't explain how the data is passed through the
> > > AdaptiveScheduler states. We should be handling some kind of
> > > RescaleSnapshot that is passed through the different states and updated and
> > > its final state is stored somewhere within AdaptiveScheduler in the end, I
> > > guess. Can we clarify that in the FLIP?
> >
> > Indeed, this was missing in the original FLIP. To address this, I have
> > added [4], which focuses on describing how a Rescale is represented,
> > and how we can quickly pass and maintain the Rescale history.
> >
> > > 5. You mention the config parameters for the cache in the public interface
> > > section. But there's no mention of any caching and how that is used
> > > within the FLIP.
> >
> > Sorry for the rough description in the previous version.
> > Since this part belongs to the REST API acceleration mechanism for
> > rescaling, and Option 6 seems reasonable to me,
> > I plan to add it to FLIP-487 once the design of FLIP-495 has reached
> > consensus.
> > Of course, if needed, I'd be happy to clarify the usage and purpose of this
> > parameter in the current email thread.
> >
> > > 6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> > > be about the actual implementation details and how the data is stored
> > > internally whereas FLIP-487 is about exposing the information to the
> > > outside through the REST API and the Flink UI. That would be a way to
> > > decrease the scope of FLIP-495. WDYT?
> >
> > That sounds nice to me. Therefore, I have moved all REST API-related
> > changes to FLIP-487.
> > BTW, to avoid repetitive changes in FLIP-487, I'll start organizing
> > FLIP-487 after FLIP-495 has been finalized.
> >
> > Looking forward to your next review!
> >
> > [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Aboutrescaleeventsstorage.1
> > [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-ThemainscenarioswhereRescalestatusswitchestoterminated
> > [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Theinformationtorecordinarescaleevent
> > [4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-InternalInterfaces
> >
> > Best regards,
> > Yuepeng Pan
> >
> > At 2025-08-10 23:54:37, "Matthias Pohl" <[email protected]> wrote:
> > >Hi Yuepeng,
> > >thanks for reminding me of this FLIP.
> > >I went over it and have a few items
> > >which we might need to address before we can actually finalize the vote:
> > >
> > >1. You mention a few options for when it comes to storing the data which is
> > >good. The FLIP doesn't point out, though, what option you're going to go
> > >for as part of this FLIP (as far as I can see). It would be good to only
> > >outline the option to go for in the FLIP and list the other options as
> > >rejected alternatives (with the pros and cons). I think it makes sense to
> > >go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> > >for now). The other options can be considered as a follow-up.
> > >2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> > >COMPLETED): Can we clarify in the FLIP under what conditions the
> > >rescaling transitions into each of the three terminal states?
> > >3. The section "The information to record in a rescale event" could be
> > >restructured in four sections (to remove redundancy):
> > > a) The IDs (Rescale ID, resourceRequirementsEpochID,
> > >subRescaleIdOfResourceRequirementsEpochID):
> > >What about making these names easier to read: GlobalRescaleID, RescaleUUID,
> > >RescaleAttemptId)
> > > b) Per-vertex data which includes: JobVertexID, JobVertexName,
> > >SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
> > >desired, post-rescale)
> > > c) The SlotSharingGroup information: SlotSharingGroupId, name,
> > >ResourceProfile
> > > d) Other information: Timestamps of state transitions, etc. as laid out in
> > >the FLIP already
> > >4. The FLIP doesn't explain how the data is passed through the
> > >AdaptiveScheduler states. We should be handling some kind of
> > >RescaleSnapshot that is passed through the different states and updated and
> > >its final state is stored somewhere within AdaptiveScheduler in the end, I
> > >guess. Can we clarify that in the FLIP?
> > >5. You mention the config parameters for the cache in the public interface
> > >section. But there's no mention of any caching and how that is used
> > >within the FLIP.
> > >6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> > >be about the actual implementation details and how the data is stored
> > >internally whereas FLIP-487 is about exposing the information to the
> > >outside through the REST API and the Flink UI. That would be a way to
> > >decrease the scope of FLIP-495. WDYT?
> > >
> > >Best,
> > >Matthias
> > >
> > >On Mon, Mar 24, 2025 at 11:37 AM Yuepeng Pan <[email protected]> wrote:
> > >
> > >> Hi, Community,
> > >>
> > >> There haven’t been any further responses to this email over the past few
> > >> days.
> > >> I'd like to initiate a vote on the current proposal[1] in the next few
> > >> days.
> > >> Please rest assured that I’m proceeding cautiously and not rushing the
> > >> process.
> > >> If there are any concerns about this FLIP-495[1],
> > >> I will gladly pause and make the adjustments.
> > >>
> > >> Best regards,
> > >> Yuepeng Pan
> > >>
> > >> [1]
> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > >>
> > >> On 2024/12/17 15:18:45 Yuepeng Pan wrote:
> > >> > Hi community,
> > >> >
> > >> > We discussed several aspects of FLIP-487[1] 'Show history of rescales in
> > >> > Web UI for AdaptiveScheduler'
> > >> > and received a lot of valuable feedback. Based on the suggestions from
> > >> > the email thread[2],
> > >> > we plan to split the original proposal for FLIP-487[1].
> > >> >
> > >> > The current email thread and the FLIP-495[3] wiki will be used to
> > >> > discuss 'Support AdaptiveScheduler in recording and querying the rescale
> > >> > history',
> > >> > while FLIP-487[1] will primarily focus on display-related design
> > >> > content.
> > >> >
> > >> > Looking forward to any feedback and opinions on FLIP-495[3].
> > >> >
> > >> > [1] https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > >> >
> > >> > [2] https://lists.apache.org/thread/f4md4btkf006mxcxf66bng1kfz0rsn8c
> > >> >
> > >> > [3] https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > >> >
> > >> > Thank you very much.
> > >> >
> > >> > Best regards,
> > >> > Yuepeng Pan
