Hi Lei,

Thank you for the feedback! I really appreciate these questions, and I would
like to clarify my thinking:


1. Handling FINISHED jobs in FAILING state
The FAILING state exists to shut down components that are still active, so
jobs that have already reached FINISHED are intentionally left untouched. This
keeps the state transitions simple.

2. Application HA and RESTARTING state
This is a very interesting point. Application HA in the follow-up tasks is 
primarily centered around recovering from a JobManager failure (e.g., due to a 
machine crash). In that scenario, the JobManager itself is unavailable, making 
it impossible to update or query the application's status. 


However, you've brought up another excellent use case: automatically restarting 
an application in response to a failed job (or other errors in the main 
execution logic). This would be a powerful mechanism to build resilience 
against transient issues like network instability. For this scenario, you are
absolutely right: introducing a RESTARTING state for the application would be
both reasonable and necessary to show the user clearly that a recovery attempt
is in progress.
This capability seems like an important enhancement to application management 
and may involve significant work. To keep the scope of the current FLIP 
focused, I propose we don't include this functionality for now. 
If you are interested, I would be very happy to discuss this feature further in 
a separate thread. I think it's a great direction for future work.
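
To make point 1 concrete, here is a minimal sketch of the intended behavior.
All names in it (FailingStateSketch, JobState, onApplicationFailing) are
hypothetical and are not the FLIP's actual API. It also assumes that "closing"
an active job means canceling it, and that CANCELED and FAILED are terminal
like FINISHED:

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not the FLIP-549 API.
final class FailingStateSketch {

    // Simplified job states for illustration only.
    enum JobState { RUNNING, FINISHED, CANCELED, FAILED }

    // Assumption: these states are terminal and are never changed again.
    private static final Set<JobState> TERMINAL =
            EnumSet.of(JobState.FINISHED, JobState.CANCELED, JobState.FAILED);

    // Computes the job states after the application enters FAILING:
    // active jobs are canceled (an assumption about what "closing" means),
    // while jobs already in a terminal state keep their state.
    static Map<String, JobState> onApplicationFailing(Map<String, JobState> jobs) {
        Map<String, JobState> result = new LinkedHashMap<>();
        for (Map.Entry<String, JobState> entry : jobs.entrySet()) {
            JobState state = entry.getValue();
            result.put(entry.getKey(),
                    TERMINAL.contains(state) ? state : JobState.CANCELED);
        }
        return result;
    }
}
```

With one RUNNING job and one FINISHED job as input, the sketch cancels the
first and leaves the second as FINISHED.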




Best Regards,

Yi


At 2025-09-25 17:32:10, "Lei Yang" <[email protected]> wrote:
>Hi Yi, thanks for creating this FLIP!
>
>I'm trying to understand your FLIP. By introducing the Application entity,
>you're able to organically organize jobs, making them easier to observe
>and manage. This is great work!
>
>I'd like to share some questions with you, and hope you could help me
>clarify them:
>
>1. When an application is in the FAILING state, how are the jobs that have
>already reached the FINISHED state handled? Will they simply be ignored,
>or will there be other actions taken?
>
>2. In the "Follow-up Tasks", you mentioned high availability for the
>application,
>which will restart failed jobs to restore the application. However, I
>didn't see the
>description of the application's status during such restarts in the FLIP. I
>think
>we might need to introduce a RESTARTING status to explicitly indicate the
>application is in the process of restarting?
>
>Best,
>Lei
>
>Yi Zhang <[email protected]> wrote on Tue, Sep 23, 2025 at 11:24:
>
>> Hi everyone,
>>
>>
>> I would like to start a discussion about FLIP-549: Support Application
>> Management [1].
>>
>>
>> Despite Flink’s widespread adoption, the existing model for running user
>> logic limits observability and execution flexibility, which affects user
>> experience. This FLIP introduces a new application management framework
>> designed to close these gaps and provide a foundation for future
>> improvements.
>>
>>
>> Looking forward to your feedback and suggestions.
>>
>>
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
>>
>>
>> Best regards,
>>
>> Yi Zhang
