Bumping this thread.
Thanks!

Best regards,
Yuepeng Pan

On 2025/09/02 15:41:07 Yuepeng Pan wrote:
> Hi, community.
>
> At present, FLIP-495[1][2] has gone through a new round of discussions and a
> preliminary general consensus has been reached, which provides the necessary
> premise for the discussion of the current FLIP-487[3].
>
> Therefore, I would like to resume the discussion on the current FLIP.
>
> The version of the current FLIP mainly covers and has completed the following
> two aspects of design:
> - The REST API design for querying rescale history information
> - The Web UI design for showing rescale history information
>
> Looking forward to your comments and suggestions.
>
> [1] https://lists.apache.org/thread/t3r9wdd5gpbqnvzw35kb3wb3d9brpnon
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
>
> Best regards,
> Yuepeng Pan
>
> ---- Replied Message ----
> | From | Matthias Pohl<[email protected]> |
> | Date | 12/2/2024 16:59 |
> | To | <[email protected]> |
> | Subject | Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler |
>
> Hi Yuepeng,
> thanks for the proposal. Having a way to see the history of rescales is a
> nice feature, I guess. I went over the draft and have a few questions:
>
> Can we reorganize the draft? Right now, we have some (for RescaleEvent,
> Required/AcquiredParallelism) schema defined in the "Proposed Changes"
> section and some other schema under "Public Interfaces". It would be nice
> to have this more organized.
> Just as a suggestion: In the end the proposed changes should list the
> different REST endpoints you want to introduce (including the corresponding
> schemas for request and response).
> ---
> I'm also wondering whether it would make sense to focus on the REST
> endpoints in this FLIP and put the UI work in a separate FLIP. WDYT?
> Decreasing the scope would probably help handling the required changes.
> ---
> Have you considered adding the onChange event timestamp for a rescale event
> as well? We introduced a separation of the job requirements change event
> and the actual rescale execution in FLIP-461 [1]. It might be worth
> documenting the time when a change was monitored for the first time that
> triggered the rescale. WDYT?
> ---
> You're mentioning "comments" as a field of the RescaleEvent in your
> proposal. What's the use-case here? Where are these comments from?
>
> (update)
> A brief talk with Yuepeng on that topic revealed that the field is supposed
> to be used for errors that occurred during the rescale operation. My take
> on that one:
> - We might want to reconsider the field name in that case (maybe
>   errors_during_rescale?). "comments" seems to be quite generic.
> - Additionally, shouldn't we make this a list of errors rather than a
>   String field?
> - How certain are we that we can associate errors with the actual rescale
>   operation, rather than the error being caused by something else?
> ---
> In the schema of the RescaleEvent you describe the three different
> ID/numbers in the following way:
>
> The ‘id’ is automatically incremental, The rescaleAttemptId is generated
> based on one specified resource-requirement and the attempt number is
> generated based on rescaleAttemptId.
>
> But there is no "attempt number" mentioned in the RescaleEvent schema.
> Additionally, what is the ID based on? Do we start from 0 and just
> increment? Or do we want to have a mechanism that ensures that the IDs are
> also unique/monotonically increasing after JobManager failovers?
> ---
> For the parallelism schema: I might be misreading the draft here but you're
> proposing to use the subtask name as the ID to refer to the JobVertex?
> That way, the name might become quite long. What about using the JobVertexID
> here? That would also be more aligned with how the parallelism is represented
> by the /jobs/<job-id>/resource-requirements endpoint. If we want to add the
> task name for readability purposes, we can still add this one as a taskName
> field to the Required/AcquiredParallelism schema.
> ---
> Status field:
> - What is the meaning of "TRYING"? I guess, we're more or less using the
>   AdaptiveScheduler states here, aren't we? Can't we align/stick to the
>   naming that's defined in the AdaptiveScheduler state?
> ---
> Do we really need a new REST endpoint for the configuration? Can't we get
> the provided information already from the existing configuration endpoint?
> That said, I still find it useful to have a config tab in the UI at the end.
> ---
> For the summary endpoint: I see similarities to the checkpoint summary
> here. Not sure whether you already considered that but would it make sense
> to align the field names in some way to have a consistent look-and-feel?
> I'm also wondering whether it makes sense to align the schema to have
> something like latest rescale, failed rescale, ...
>
> Best,
> Matthias
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
>
> On Mon, Nov 25, 2024 at 11:24 AM yuanfeng hu <[email protected]> wrote:
>
> > +1, I think this feature is very useful for the adaptive scheduler.
> >
> > Yuepeng Pan <[email protected]> wrote on Fri, Nov 22, 2024 at 18:38:
> >
> > Hi community,
> >
> > Currently, the Adaptive Scheduler already supports the REST API
> > to manually adjust[1] the parallelism of jobs, which enhances the
> > functionality of the Adaptive Scheduler.
> > However, the Adaptive Scheduler doesn't support displaying or tracing the
> > rescale history yet[2].
> > This makes it inconvenient for users/devs to quickly obtain internal
> > information about the rescale history of the Adaptive Scheduler.
> > Showing the history of rescale events of the AdaptiveScheduler in the web
> > UI would also help users decide on the next step for their jobs.
> >
> > Therefore, I created the FLIP-487[3] doc to support
> > 'Show history of rescales in Web UI for AdaptiveScheduler'.
> > Please refer to the Google document[3] for more details
> > about the proposed design and implementation.
> >
> > Looking forward to any feedback and opinions on this proposal.
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > [2] https://issues.apache.org/jira/browse/FLINK-22258
> > [3] https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?tab=t.0
> >
> > Thank you very much.
> >
> > Best,
> > Regards.
> > Yuepeng Pan
>
> --
> Best,
> Yuanfeng
