Hi Lei,

Thank you for the feedback! I really appreciate you sharing these great questions, and I would like to clarify my thinking:

1. Handling FINISHED jobs in FAILING state

The FAILING state is designed to close active components, so already-FINISHED jobs are intentionally left untouched. This keeps the state transitions clean and simple.

2. Application HA and RESTARTING state

This is a very interesting point. Application HA in the follow-up tasks is primarily centered around recovering from a JobManager failure (e.g., due to a machine crash). In that scenario, the JobManager itself is unavailable, making it impossible to update or query the application's status.

However, you've brought up another excellent use case: automatically restarting an application in response to a failed job (or other errors in the main execution logic). This would be a powerful mechanism to build resilience against transient issues like network instability.

For this scenario, you are absolutely right. Introducing a RESTARTING state for the application would be both reasonable and necessary to clearly indicate to the user that a recovery attempt is in progress.

This capability seems like an important enhancement to application management and may involve significant work. To keep the scope of the current FLIP focused, I propose we don't include this functionality for now. If you are interested, I would be very happy to discuss this feature further in a separate thread. I think it's a great direction for future work.

Best Regards,
Yi

At 2025-09-25 17:32:10, "Lei Yang" <[email protected]> wrote:
>Hi Yi, thanks for creating this FLIP!
>
>I'm trying to understand your FLIP. By introducing the Application entity,
>you're able to organically organize jobs, making them easier to observe
>and manage. This is great work!
>
>I'd like to share some questions with you, and hope you could help me
>clarify them:
>
>1. When an application is in the FAILING state, how are the jobs that have
>already reached the FINISHED state handled? Will they simply be ignored,
>or will there be other actions taken?
>
>2. In the "Follow-up Tasks", you mentioned high availability for the
>application, which will restart failed jobs to restore the application.
>However, I didn't see the description of the application's status during
>such restarts in the FLIP. I think we might need to introduce a RESTARTING
>status to explicitly indicate the application is in the process of
>restarting?
>
>Best,
>Lei
>
>Yi Zhang <[email protected]> wrote on Tue, Sep 23, 2025 at 11:24:
>
>> Hi everyone,
>>
>> I would like to start a discussion about FLIP-549: Support Application
>> Management [1].
>>
>> Despite Flink’s widespread adoption, the existing model for running user
>> logic limits observability and execution flexibility, which affects user
>> experience. This FLIP introduces a new application management framework
>> designed to close these gaps and provide a foundation for future
>> improvements.
>>
>> Looking forward to your feedback and suggestions.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
>>
>> Best regards,
>>
>> Yi Zhang
