Re: [DISCUSS] FLIP-549: Support Application Management

Yi Zhang Fri, 17 Oct 2025 20:57:15 -0700

Hi Shengkai,


Just a quick note: the FLIP has been updated to reflect the latest 
changes to the REST API and the Application status description.


Thanks,
Yi


At 2025-09-25 11:47:41, "Yi Zhang" <[email protected]> wrote:
>Hi Shengkai,
>
>
>
>
>Thanks for taking the time to review the FLIP and for your thoughtful and 
>constructive feedback! I would like to share the planned updates based on your 
>points:
>
>
>
>1. Asynchronous REST API for Application Submission
>
>This is an excellent point. To maintain compatibility, the existing 
>/jars/:jarid/run API will retain its synchronous behavior, returning a Job ID 
>only after the user's main method has completed and the job has been 
>submitted. However, as you pointed out, this can lead to long response times 
>and poor user experience.
>
>Therefore, we plan to introduce a new asynchronous REST API, likely 
>/jars/:jarid/run-application. This API will submit the application and return 
>an Application ID for status polling immediately, without waiting for the main 
>method or job submission to finish. I'll add this to the proposal.
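
A minimal client-side sketch of the polling flow described above. Only the tentative /jars/:jarid/run-application path comes from this thread; the status lookup, the state names, and every function name below are hypothetical placeholders, since the proposal had not fixed them at this point.

```python
import time
from typing import Callable

# Assumed terminal state names; the FLIP defines the real Application statuses.
TERMINAL_STATES = frozenset({"FINISHED", "FAILED", "CANCELED"})


def poll_until_terminal(
    fetch_status: Callable[[], str],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"application still {status} after {timeout_s}s")
        sleep(interval_s)
```

In practice fetch_status would wrap an HTTP GET against whatever status endpoint the FLIP ends up defining, keyed by the Application ID that the submit call returns. Passing it in as a callable keeps the loop independent of that choice.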
>
>
>
>2. Clarification on "Pre-termination Cleanup"
>
>Thank you for pointing out the ambiguity. I will update the document to 
>clarify that "pre-termination cleanup" refers to the process where an 
>application, before transitioning to a terminal state, will actively cancel 
>all the jobs it manages and wait for them to reach their own terminal states. 
>This ensures that the application's lifecycle is cohesively tied to the 
>lifecycle of the jobs it owns.
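
The ordering described above (cancel every owned job, wait for each to become terminal, then let the application terminate) can be sketched with a toy in-memory model. The classes, method names, and state names here are illustrative assumptions, not Flink code.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "RUNNING"
    CANCELLING = "CANCELLING"
    CANCELED = "CANCELED"
    FINISHED = "FINISHED"
    FAILED = "FAILED"

    @property
    def is_terminal(self) -> bool:
        return self in (JobState.CANCELED, JobState.FINISHED, JobState.FAILED)


class Job:
    """Toy stand-in for a job owned by an application."""

    def __init__(self, job_id: str, state: JobState = JobState.RUNNING):
        self.job_id = job_id
        self.state = state

    def cancel(self) -> None:
        # Jobs that already finished or failed are left as they are.
        if not self.state.is_terminal:
            self.state = JobState.CANCELLING

    def tick(self) -> None:
        # Simulates the cluster completing an in-flight cancellation.
        if self.state is JobState.CANCELLING:
            self.state = JobState.CANCELED


class Application:
    """Toy application that owns a list of jobs."""

    def __init__(self, jobs):
        self.jobs = list(jobs)
        self.status = "RUNNING"

    def terminate(self, final_status: str) -> None:
        # Step 1: actively cancel every job the application manages.
        for job in self.jobs:
            job.cancel()
        # Step 2: wait until each job has reached its own terminal state.
        while not all(job.state.is_terminal for job in self.jobs):
            for job in self.jobs:
                job.tick()
        # Step 3: only now does the application itself become terminal.
        self.status = final_status
```

The point of the sketch is the ordering: the application's status is written last, so an observer never sees a terminal application that still has live jobs.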
>
>
>
>3. Potential Job Leak Prevention
>
>You've raised a critical concern here. As described in the point above, the 
>primary mechanism is that the application itself ensures its jobs are 
>terminated before it shuts down, which should prevent leaks in normal 
>circumstances.
>
>The question then becomes how to handle the exceptional case where a job fails 
>to respond to a cancellation request. Upon further reflection, I believe that 
>if a job is unresponsive to a cancellation initiated by the application, a 
>background monitor issuing the same request would likely face the same problem.
>
>Therefore, triggering a fatal error may be a more appropriate action in this 
>scenario. While a fatal error in a session cluster could affect other running 
>applications, an unresponsive job indicates a severe underlying issue that 
>warrants such a drastic measure to prevent an inconsistent and unpredictable 
>system state. I will update the proposal to detail this fault-tolerance 
>strategy and the reasoning behind it.
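
One way to picture the escalation described above: wait a bounded time for cancelled jobs to reach a terminal state, and raise a fatal error if any remain. The timeout is an assumption about how "unresponsive" would be detected (the thread does not say), and FatalError and await_cancellation are hypothetical names.

```python
import time
from typing import Callable, Iterable


class FatalError(RuntimeError):
    """Hypothetical stand-in for the cluster-level fatal error handler."""


def await_cancellation(
    pending_job_ids: Iterable[str],
    is_terminal: Callable[[str], bool],
    timeout_s: float,
    poll_interval_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Wait for cancelled jobs to terminate; raise FatalError on timeout."""
    pending = set(pending_job_ids)
    deadline = clock() + timeout_s
    while True:
        # Drop every job that has reached a terminal state since the last check.
        pending = {job_id for job_id in pending if not is_terminal(job_id)}
        if not pending:
            return
        if clock() >= deadline:
            raise FatalError(f"jobs ignored cancellation: {sorted(pending)}")
        sleep(poll_interval_s)
```

Re-sending the cancel request from a separate monitor would fit the same loop, which is the reasoning above for escalating rather than retrying.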
>
>
>
>4. API Compatibility Considerations
>
>Ensuring a smooth transition for existing users is a top priority and I can 
>confirm that all existing Job ID-based REST APIs will remain fully functional.
>
>Users will still be able to query and cancel jobs launched via this new 
>application framework using Job IDs. I will add a specific section in the 
>document to explicitly state this, reassuring users that their existing tools 
>and scripts will continue to work as expected.
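
For reference, the Job ID-based calls in question look roughly like this with the existing REST API. This is a sketch that only builds the requests; the base URL is an assumed local session cluster address and the helper names are made up for illustration.

```python
from urllib.request import Request

# Assumed address of a local session cluster's REST endpoint.
BASE_URL = "http://localhost:8081"


def job_status_request(job_id: str) -> Request:
    """GET /jobs/:jobid returns the job's details, including its state."""
    return Request(f"{BASE_URL}/jobs/{job_id}", method="GET")


def job_cancel_request(job_id: str) -> Request:
    """PATCH /jobs/:jobid?mode=cancel asks the cluster to cancel the job."""
    return Request(f"{BASE_URL}/jobs/{job_id}?mode=cancel", method="PATCH")
```

Passing either request to urllib.request.urlopen would send it. Under the compatibility statement above, the same calls should keep working for jobs launched through the new application framework.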
>
>
>
>Once again, thank you for your invaluable input. I will incorporate these 
>changes into the document shortly. Please let me know if you have any further 
>questions or suggestions.
>
>
>
>Best regards,
>
>Yi
>
>At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
>>Hi, Yi.
>>
>>The FLIP is both interesting and highly promising for Flink users. Once
>>implemented, it will enable powerful use cases—such as running a Jupyter
>>Notebook kernel or SQL Gateway as a first-class application within the
>>JobManager. This represents a significant step forward in usability and
>>integration.
>>
>>I’d like to share a few suggestions and clarifications that could help
>>strengthen the proposal:
>>
>>*Asynchronous REST API for Application Submission*
>>
>>Given that launching such applications may involve complex initialization
>>and take considerable time to complete, it would be beneficial to support
>>an asynchronous submission mechanism via REST. A synchronous endpoint might
>>lead to timeouts or poor user experience. An async API could return an
>>application ID immediately, allowing users to poll or query the status of
>>the deployment using that identifier.
>>
>>*Clarification on "Pre-termination Cleanup"*
>>
>>The term pre-termination cleanup is mentioned several times in the
>>document. Could you please elaborate on what this entails? Specifically,
>>which resources are expected to be released, and at what point in the
>>lifecycle does this occur? A clearer definition would help ensure consistent
>>implementation and improve reliability.
>>
>>*Potential Job Leak Prevention*
>>
>>There appears to be a risk of job leaks if an application fails to properly
>>cancel its associated Flink job upon termination. To mitigate this, we
>>might consider introducing a background daemon thread (or a monitoring
>>service) that periodically checks for orphaned jobs whose parent
>>applications have already terminated, and automatically triggers cleanup.
>>Alternatively, integrating with Flink’s existing lifecycle management
>>mechanisms could help ensure robust resource cleanup.
>>
>>*API Compatibility Considerations*
>>
>>It would be helpful to clarify how the new application model aligns with
>>existing APIs. Many external systems currently rely on job IDs to monitor
>>or cancel jobs. Will these operations still be supported under the new
>>model? For example, can users continue to use the existing REST endpoints
>>to cancel a job or check its status using the job ID, even when the job was
>>launched through this new application framework?
>>
>>
>>Best,
>>Shengkai
>>
>>Yi Zhang <[email protected]> wrote on Tue, Sep 23, 2025 at 11:23:
>>
>>> Hi everyone,
>>>
>>>
>>> I would like to start a discussion about FLIP-549: Support Application
>>> Management [1].
>>>
>>>
>>> Despite Flink’s widespread adoption, the existing model for running user
>>> logic limits observability and execution flexibility, which affects user
>>> experience. This FLIP introduces a new application management framework
>>> designed to close these gaps and provide a foundation for future
>>> improvements.
>>>
>>>
>>> Looking forward to your feedback and suggestions.
>>>
>>>
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
>>>
>>>
>>> Best regards,
>>>
>>> Yi Zhang
