Bumping this thread. Thanks!

Best regards,
Yuepeng Pan



On 2025/09/02 15:41:07 Yuepeng Pan wrote:
> Hi, community.
> 
> 
> FLIP-495[1][2] has now gone through a new round of discussion and reached a 
> preliminary consensus, which is the prerequisite for discussing the 
> current FLIP-487[3].
> 
> 
> Therefore, I would like to resume the discussion on the current FLIP.
> 
> The current version of the FLIP covers the following two design aspects, 
> both of which are complete:
> - The REST API design for querying rescale history information
> - The Web UI design for showing rescale history information
> 
> 
> Looking forward to your comments and suggestions.
> 
> 
> [1] https://lists.apache.org/thread/t3r9wdd5gpbqnvzw35kb3wb3d9brpnon
> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> 
> 
> Best regards,
> Yuepeng Pan
> 
> 
> ---- Replied Message ----
> | From | Matthias Pohl<[email protected]> |
> | Date | 12/2/2024 16:59 |
> | To | <[email protected]> |
> | Subject | Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler |
> Hi Yuepeng,
> thanks for the proposal. Having a way to see the history of rescales is a
> nice feature, I guess. I went over the draft and have a few questions:
> 
> Can we reorganize the draft? Right now, we have some (for RescaleEvent,
> Required/AcquiredParallelism) schema defined in the "Proposed Changes"
> section and some other schema under "Public Interfaces". It would be nice
> to have this more organized.
> Just as a suggestion: In the end the proposed changes should list the
> different REST endpoints you want to introduce (including the corresponding
> schemas for request and response).
> ---
> I'm also wondering whether it would make sense to focus on the REST
> endpoints in this FLIP and put the UI work in a separate FLIP. WDYT?
> Decreasing the scope would probably help handling the required changes.
> ---
> Have you considered adding the onChange event timestamp for a rescale event
> as well? We introduced a separation of the job requirements change event
> and the actual rescale execution in FLIP-461 [1]. It might be worth
> documenting the time when a change was monitored for the first time that
> triggered the rescale. WDYT?
> ---
> You're mentioning "comments" as a field of the RescaleEvent in your
> proposal. What's the use-case here? Where are these comments from?
> 
> (update)
> A brief talk with Yuepeng on that topic revealed that the field is supposed
> to be used for errors that occurred during the rescale operation. My take
> on that one:
> - We might want to reconsider the field name in that case (maybe
> errors_during_rescale?). "comments" seems to be quite generic.
> - Additionally, shouldn't we make this a list of errors rather than a
> String field?
> - How certain are we that we can attribute errors to the actual rescale
> operation rather than to something else?
> ---
> In the schema of the RescaleEvent you describe the three different
> ID/numbers in the following way:
> 
> The ‘id’ is automatically incremental, The rescaleAttemptId is generated
> based on one specified resource-requirement and the attempt number is
> generated based on rescaleAttemptId.
> 
> But there is no "attempt number" mentioned in the RescaleEvent schema.
> Additionally, what is the ID based on? Do we start from 0 and just
> increment? Or do we want to have a mechanism that ensures that the IDs are
> also unique/monotonically increasing after JobManager failovers?
> ---
> For the parallelism schema: I might be misreading the draft here but you're
> proposing to use the subtask name as the ID to refer to the JobVertex? The
> name might become quite long. What about using the JobVertexID here?
> That would also be more aligned with how the parallelism is represented by
> the /jobs/<job-id>/resource-requirements endpoint. If we want to add the
> task name for readability purposes, we can still add this one as a taskName
> field to the Required/AcquiredParallelism schema.
> ---
> Status field:
> - What is the meaning of "TRYING"? I guess, we're more or less using the
> AdaptiveScheduler states here, aren't we? Can't we align/stick to the
> naming that's defined in the AdaptiveScheduler state?
> ---
> Do we really need a new REST endpoint for the configuration? Can't we get
> the provided information already from the existing configuration endpoint?
> That said, I still find it useful to have a config tab in the UI at the end.
> ---
> For the summary endpoint: I see similarities to the checkpoint summary
> here. Not sure whether you already considered that but would it make sense
> to align the field names in some way to have a consistent look-and-feel?
> I'm also wondering whether it makes sense to align the schema to have
> something like latest rescale, failed rescale, ...
> 
> Best,
> Matthias
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
> 
> On Mon, Nov 25, 2024 at 11:24 AM yuanfeng hu <[email protected]> wrote:
> 
> +1, I think this feature is very useful for adaptive scheduler.
> 
> Yuepeng Pan <[email protected]> wrote on Fri, Nov 22, 2024 at 18:38:
> 
> Hi community,
> 
> Currently, the Adaptive Scheduler already supports the REST API
> to manually adjust[1] the parallelism of jobs, which enhances the
> functionality of the Adaptive Scheduler.
> However, Adaptive Scheduler doesn't support displaying or tracing the
> rescale history yet[2].
> This makes it inconvenient for users/devs to quickly obtain some internal
> information about the rescale history of the Adaptive Scheduler.
> And showing the history of rescale events of AdaptiveScheduler in the web
> UI is very useful for users to make the next step for jobs.
> 
> Therefore, I created the FLIP-487[3] doc to support
> 'Show history of rescales in Web UI for AdaptiveScheduler'.
> Please refer to the google document[3] for more details
> about the proposed design and implementation.
> 
> Looking forward to any feedback and opinions on this proposal.
> 
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [2] https://issues.apache.org/jira/browse/FLINK-22258
> [3] https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?tab=t.0
> 
> Thank you very much.
> 
> Best,
> Regards.
> Yuepeng Pan
> 
> 
> 
> --
> Best,
> Yuanfeng
> 
> 
