Hi Shengkai,



Thanks for taking the time to review the FLIP and for your thoughtful and 
constructive feedback! I would like to share the planned updates based on your 
points:



1. Asynchronous REST API for Application Submission

This is an excellent point. To maintain compatibility, the existing 
/jars/:jarid/run API will retain its synchronous behavior, returning a Job ID 
only after the user's main method has completed and the job has been submitted. 
However, as you pointed out, this can lead to long response times and a poor 
user experience.

Therefore, we plan to introduce a new asynchronous REST API, likely 
/jars/:jarid/run-application. This API will submit the application and 
immediately return an Application ID for status polling, without waiting for 
the main method or job submission to finish. I'll add this to the proposal.
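
As a rough illustration of the submit-then-poll behavior described above, here 
is a minimal in-memory Python sketch. It does not call any Flink API: the 
ApplicationRegistry class, its method names, and the RUNNING / FINISHED / 
FAILED status strings are all hypothetical stand-ins, and the final design in 
the FLIP may differ.

```python
import threading
import uuid


class ApplicationRegistry:
    """Hypothetical in-memory stand-in for the cluster side of an async submit."""

    def __init__(self):
        self._status = {}
        self._threads = {}
        self._lock = threading.Lock()

    def submit(self, main_method):
        """Return an application ID right away; run main_method in the background."""
        app_id = uuid.uuid4().hex
        with self._lock:
            # The status is recorded before the worker starts, so a caller
            # can poll as soon as it holds the ID.
            self._status[app_id] = "RUNNING"
        worker = threading.Thread(
            target=self._run, args=(app_id, main_method), daemon=True
        )
        self._threads[app_id] = worker
        worker.start()
        return app_id

    def _run(self, app_id, main_method):
        # Execute the user's main method and record how it ended.
        try:
            main_method()
            final = "FINISHED"
        except Exception:
            final = "FAILED"
        with self._lock:
            self._status[app_id] = final

    def status(self, app_id):
        """Polling call: report the last recorded status for the application."""
        with self._lock:
            return self._status[app_id]

    def wait(self, app_id, timeout=None):
        """Block until the main method returns (or the timeout expires)."""
        self._threads[app_id].join(timeout)
        return self.status(app_id)
```

In this sketch, submit() returns while the main method is still running, and 
the caller learns the outcome later through status() or wait().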



2. Clarification on "Pre-termination Cleanup"

Thank you for pointing out the ambiguity. I will update the document to clarify 
that "pre-termination cleanup" refers to the process where an application, 
before transitioning to a terminal state, will actively cancel all the jobs it 
manages and wait for them to reach their own terminal states. This ensures that 
the application's lifecycle is cohesively tied to the lifecycle of the jobs it 
owns.
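
The ordering described above (cancel the jobs, wait for them to become 
terminal, then let the application become terminal) can be sketched roughly as 
follows. This is only an illustration in Python: the Job and Application 
classes and the state names are hypothetical, and cancellation is modeled as 
synchronous for brevity, whereas a real job would reach its terminal state 
asynchronously.

```python
from dataclasses import dataclass, field

# Hypothetical set of terminal job states used only by this sketch.
TERMINAL_JOB_STATES = {"FINISHED", "FAILED", "CANCELED"}


@dataclass
class Job:
    job_id: str
    state: str = "RUNNING"

    def cancel(self):
        # Stand-in for a real cancellation: an active job becomes CANCELED
        # at once; a job that is already terminal is left untouched.
        if self.state not in TERMINAL_JOB_STATES:
            self.state = "CANCELED"


@dataclass
class Application:
    jobs: list = field(default_factory=list)
    state: str = "RUNNING"

    def terminate(self, final_state="FINISHED"):
        # Step 1: actively cancel every job the application manages.
        for job in self.jobs:
            job.cancel()
        # Step 2: refuse to proceed unless every job is in a terminal state.
        if not all(job.state in TERMINAL_JOB_STATES for job in self.jobs):
            raise RuntimeError("jobs still active; application must not terminate")
        # Step 3: only now does the application itself become terminal.
        self.state = final_state
```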



3. Potential Job Leak Prevention

You've raised a critical concern here. As described in the point above, the 
primary mechanism is that the application itself ensures its jobs are 
terminated before it shuts down, which should prevent leaks in normal 
circumstances.

The question then becomes how to handle the exceptional case where a job fails 
to respond to a cancellation request. Upon further reflection, I believe that 
if a job is unresponsive to a cancellation initiated by the application, a 
background monitor issuing the same request would likely face the same problem.

Therefore, triggering a fatal error may be a more appropriate action in this 
scenario. While a fatal error in a session cluster could affect other running 
applications, an unresponsive job indicates a severe underlying issue that 
warrants such a drastic measure to prevent an inconsistent and unpredictable 
system state. I will update the proposal to detail this fault-tolerance 
strategy and the reasoning behind it.
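
To make the escalation idea concrete, here is a small Python sketch. Note the 
assumptions: the text above does not say how unresponsiveness is detected, so 
the timeout used here is my own placeholder, and the Job class, 
cancel_or_escalate function, and on_fatal_error callback are hypothetical 
names rather than anything from Flink.

```python
import threading


class Job:
    """Hypothetical job whose terminal state is signaled through an Event."""

    def __init__(self, responsive=True):
        self.responsive = responsive
        self.terminated = threading.Event()

    def cancel(self):
        # A responsive job reaches a terminal state; an unresponsive one
        # silently ignores the request.
        if self.responsive:
            self.terminated.set()


def cancel_or_escalate(job, timeout_seconds, on_fatal_error):
    """Cancel the job; escalate if it is not terminal within the timeout.

    The timeout is an assumption of this sketch, not part of the proposal.
    Returns True if the job terminated, False if the error was escalated.
    """
    job.cancel()
    if job.terminated.wait(timeout_seconds):
        return True
    on_fatal_error(
        f"job did not terminate within {timeout_seconds}s of cancellation"
    )
    return False
```

The point of the sketch is that a second component retrying the same cancel() 
call would hit the same non-response, so the only remaining option is to 
escalate.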



4. API Compatibility Considerations

Ensuring a smooth transition for existing users is a top priority, and I can 
confirm that all existing Job ID-based REST APIs will remain fully functional.

Users will still be able to query and cancel jobs launched via this new 
application framework using Job IDs. I will add a specific section in the 
document to explicitly state this, reassuring users that their existing tools 
and scripts will continue to work as expected.
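
For reference, the kind of Job ID-based calls that should keep working look 
roughly like the Python sketch below. It only builds the requests and does not 
send them. GET /jobs/:jobid and PATCH /jobs/:jobid with mode=cancel are the 
existing Flink REST endpoints for querying and cancelling a job; the helper 
function names are hypothetical, and the base URL and job ID are placeholders 
supplied by the caller.

```python
import urllib.request


def job_status_request(base_url, job_id):
    """Build (but do not send) the request that queries a job by its ID."""
    return urllib.request.Request(f"{base_url}/jobs/{job_id}", method="GET")


def job_cancel_request(base_url, job_id):
    """Build (but do not send) the request that cancels a job by its ID."""
    return urllib.request.Request(
        f"{base_url}/jobs/{job_id}?mode=cancel", method="PATCH"
    )
```

Either request can be sent with urllib.request.urlopen(request) against a 
running cluster.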



Once again, thank you for your invaluable input. I will incorporate these 
changes into the document shortly. Please let me know if you have any further 
questions or suggestions.



Best regards,

Yi

At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
>Hi, Yi.
>
>The FLIP is both interesting and highly promising for Flink users. Once
>implemented, it will enable powerful use cases—such as running a Jupyter
>Notebook kernel or SQL Gateway as a first-class application within the
>JobManager. This represents a significant step forward in usability and
>integration.
>
>I’d like to share a few suggestions and clarifications that could help
>strengthen the proposal:
>
>*Asynchronous REST API for Application Submission*
>
>Given that launching such applications may involve complex initialization
>and take considerable time to complete, it would be beneficial to support
>an asynchronous submission mechanism via REST. A synchronous endpoint might
>lead to timeouts or poor user experience. An async API could return an
>application ID immediately, allowing users to poll or query the status of
>the deployment using that identifier.
>
>*Clarification on "Pre-termination Cleanup"*
>The term pre-termination cleanup is mentioned several times in the
>document. Could you please elaborate on what this entails? Specifically,
>which resources are expected to be released, and at what point in the life
>cycle does this occur? A clearer definition would help ensure consistent
>implementation and improve reliability.
>
>*Potential Job Leak Prevention*
>
>There appears to be a risk of job leaks if an application fails to properly
>cancel its associated Flink job upon termination. To mitigate this, we
>might consider introducing a background daemon thread (or a monitoring
>service) that periodically checks for orphaned jobs whose parent
>applications have already terminated, and automatically triggers cleanup.
>Alternatively, integrating with Flink’s existing lifecycle management
>mechanisms could help ensure robust resource cleanup.
>
>*API Compatibility Considerations*
>
>It would be helpful to clarify how the new application model aligns with
>existing APIs. Many external systems currently rely on job IDs to monitor
>or cancel jobs. Will these operations still be supported under the new
>model? For example, can users continue to use the existing REST endpoints
>to cancel a job or check its status using the job ID, even when the job was
>launched through this new application framework?
>
>
>Best,
>Shengkai
>
>Yi Zhang <[email protected]> wrote on Tue, Sep 23, 2025 at 11:23:
>
>> Hi everyone,
>>
>>
>> I would like to start a discussion about FLIP-549: Support Application
>> Management [1].
>>
>>
>> Despite Flink’s widespread adoption, the existing model for running user
>> logic limits observability and execution flexibility, which affects user
>> experience. This FLIP introduces a new application management framework
>> designed to close these gaps and provide a foundation for future
>> improvements.
>>
>>
>> Looking forward to your feedback and suggestions.
>>
>>
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
>>
>>
>> Best regards,
>>
>> Yi Zhang
