Hi, community.

FYI:
Since the design of the query interface for the rescale history was moved
to FLIP-487[1] during the discussion, we have changed the title of
this FLIP to:

FLIP-495: Support AdaptiveScheduler record and store the rescale history.

[1] https://cwiki.apache.org/confluence/x/vZCMEw

Best regards,
Yuepeng Pan

On 2025/08/19 09:13:22 Yuepeng Pan wrote:
> Gently bumping this thread. Thanks!
> 
> Best,
> Yuepeng Pan 
> 
> At 2025-08-13 14:52:26, "Yuepeng Pan" <[email protected]> wrote:
> 
> Hi, Matthias,
> Thank you very much for your comments!
> I have read your reply carefully and updated the FLIP accordingly.
> Please take a look.
> 
> For your comments:
> 
> > 1. You mention a few options for when it comes to storing the data which is
> > good. The FLIP doesn't point out, though, what option you're going to go
> > for as part of this FLIP (as far as I can see). It would be good to only
> > outline the option to go for in the FLIP and list the other options as
> > rejected alternatives (with the pros and cons). I think it makes sense to
> > go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> > for now). The other options can be considered as a follow-up.
> 
> That is a helpful suggestion. Following it, I have kept option 3 in its
> original place and moved the other candidate options to [1].
> 
> > 2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> > COMPLETED): Can we clarify in the FLIP under what conditions the
> > rescaling transitions into each of the three terminal states?
> 
> Yes, that is a reasonable request; it helps explain when a rescale
> transitions into each terminal state.
> A new subsection [2] has been added to address this.
> 
> > 3. The section "The information to record in a rescale event" could be
> > restructured in four sections (to remove redundancy):
> > a) The IDs (Rescale
> > ID, resourceRequirementsEpochID, subRescaleIdOfResourceRequirementsEpochID):
> > What about making these names easier to read: GlobalRescaleID, RescaleUUID,
> > RescaleAttemptId)
> > b) Per-vertex data which includes: JobVertexID, JobVertexName,
> > SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
> > desired, post-rescale)
> > c) The SlotSharingGroup information: SlotSharingGroupId, name,
> > ResourceProfile
> > d) Other information: Timestamps of state transitions, etc. as laid out in
> > the FLIP already
> 
> That makes sense to me. Please check [3] for the latest updates in this part.
> 
> > 4. The FLIP doesn't explain how the data is passed through the
> > AdaptiveScheduler states. We should be handling some kind of
> > RescaleSnapshot that is passed through the different states and updated and
> > its final state is stored somewhere within AdaptiveScheduler in the end, I
> > guess. Can we clarify that in the FLIP?
> 
> Indeed, this was missing in the original FLIP. To address it, I have added
> [4], which describes how a Rescale is represented
> and how the Rescale history is passed along and maintained.
> 
> > 5. You mention the config parameters for the cache in the public interface
> > section. But there's no mention of any caching or how it is used
> > within the FLIP.
> 
> Sorry for the rough description in the previous version.
> Since this part belongs to the REST API acceleration mechanism for rescaling, 
> and Option 6 seems reasonable to me,
> I plan to add it to FLIP-487 once the design of FLIP-495 has reached 
> consensus.
> Of course, if needed, I'd be happy to clarify the usage and purpose of this 
> parameter in the current email thread.
> 
> > 6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> > be about the actual implementation details and how the data is stored
> > internally whereas FLIP-487 is about exposing the information to the
> > outside through the REST API and the Flink UI. That would be a way to
> > decrease the scope of FLIP-495. WDYT?
> 
> That sounds good to me. I have therefore moved all REST API–related changes
> to FLIP-487.
> BTW, to avoid repeated changes to FLIP-487, I'll start reworking it
> after FLIP-495 has been finalized.
> 
> Looking forward to your next review!
> 
> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Aboutrescaleeventsstorage.1
> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-ThemainscenarioswhereRescalestatusswitchestoterminated
> [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-Theinformationtorecordinarescaleevent
> [4] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-InternalInterfaces
> 
> Best regards,
> Yuepeng Pan
> 
> At 2025-08-10 23:54:37, "Matthias Pohl" <[email protected]> wrote:
> >Hi Yuepeng,
> >thanks for reminding me of this FLIP. I went over it and have a few items
> >which we might need to address before we can actually finalize the vote:
> >
> >1. You mention a few options for when it comes to storing the data which is
> >good. The FLIP doesn't point out, though, what option you're going to go
> >for as part of this FLIP (as far as I can see). It would be good to only
> >outline the option to go for in the FLIP and list the other options as
> >rejected alternatives (with the pros and cons). I think it makes sense to
> >go for option 3 (i.e. following what's done for the ExecutionGraphInfoStore
> >for now). The other options can be considered as a follow-up.
> >2. About the terminal states of a rescaling (i.e. IGNORED, FAILED,
> >COMPLETED): Can we clarify in the FLIP under what conditions the
> >rescaling transitions into each of the three terminal states?
> >3. The section "The information to record in a rescale event" could be
> >restructured in four sections (to remove redundancy):
> > a) The IDs (Rescale
> >ID, resourceRequirementsEpochID, subRescaleIdOfResourceRequirementsEpochID):
> >What about making these names easier to read: GlobalRescaleID, RescaleUUID,
> >RescaleAttemptId)
> > b) Per-vertex data which includes: JobVertexID, JobVertexName,
> >SlotSharingGroupId, the different parallelisms (pre-rescale, sufficient,
> >desired, post-rescale)
> > c) The SlotSharingGroup information: SlotSharingGroupId, name,
> >ResourceProfile
> > d) Other information: Timestamps of state transitions, etc. as laid out in
> >the FLIP already
> >4. The FLIP doesn't explain how the data is passed through the
> >AdaptiveScheduler states. We should be handling some kind of
> >RescaleSnapshot that is passed through the different states and updated and
> >its final state is stored somewhere within AdaptiveScheduler in the end, I
> >guess. Can we clarify that in the FLIP?
> >5. You mention the config parameters for the cache in the public interface
> >section. But there's no mention of any caching or how it is used
> >within the FLIP.
> >6. The REST endpoint is probably better suited in FLIP-487. FLIP-495 should
> >be about the actual implementation details and how the data is stored
> >internally whereas FLIP-487 is about exposing the information to the
> >outside through the REST API and the Flink UI. That would be a way to
> >decrease the scope of FLIP-495. WDYT?
> >
> >Best,
> >Matthias
> >
> >
> >On Mon, Mar 24, 2025 at 11:37 AM Yuepeng Pan <[email protected]> wrote:
> >
> >> Hi, Community,
> >>
> >> There haven’t been any further responses to this email over the past few
> >> days.
> >> I'd like to initiate a vote on the current proposal[1] in the next few
> >> days.
> >> Please rest assured that I’m proceeding cautiously and not rushing the
> >> process.
> >> If there are any concerns about this FLIP-495[1],
> >> I will gladly pause and make the adjustments.
> >>
> >> Best regards,
> >> Yuepeng Pan
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> >>
> >>
> >> On 2024/12/17 15:18:45 Yuepeng Pan wrote:
> >> > Hi community,
> >> >
> >> > We discussed several aspects of FLIP-487[1] 'Show history of rescales in
> >> > Web UI for AdaptiveScheduler'
> >> > and received a lot of valuable feedback. Based on the suggestions from
> >> > the email thread[2],
> >> > we plan to split the original proposal for FLIP-487[1].
> >> >
> >> > The current email thread and the FLIP-495[3] wiki will be used to
> >> > discuss 'Support AdaptiveScheduler in recording and querying the rescale
> >> > history',
> >> > while FLIP-487[1] will primarily focus on display-related design
> >> > content.
> >> >
> >> > Looking forward to any feedback and opinions on FLIP-495[3].
> >> >
> >> > [1]
> >> > https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> >> >
> >> > [2] https://lists.apache.org/thread/f4md4btkf006mxcxf66bng1kfz0rsn8c
> >> >
> >> > [3]
> >> > https://cwiki.apache.org/confluence/display/FLINK/%5BWIP%5D+FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> >> >
> >> > Thank you very much.
> >> >
> >> > Best regards,
> >> > Yuepeng Pan
> >>
> 
