Hi Shengkai,

Thanks for taking the time to review the FLIP and for your thoughtful and constructive feedback! I would like to share the planned updates based on your points:

1. Asynchronous REST API for Application Submission

This is an excellent point. To maintain compatibility, the existing /jars/:jarid/run API will retain its synchronous behavior, returning a Job ID only after the user's main method has completed and the job has been submitted. However, as you pointed out, this can lead to long response times and a poor user experience. Therefore, we plan to introduce a new asynchronous REST API, likely /jars/:jarid/run-application. This API will submit the application and immediately return an Application ID for status polling, without waiting for the main method or job submission to finish. I'll add this to the proposal.

2. Clarification on "Pre-termination Cleanup"

Thank you for pointing out the ambiguity. I will update the document to clarify that "pre-termination cleanup" refers to the process where an application, before transitioning to a terminal state, actively cancels all the jobs it manages and waits for them to reach their own terminal states. This ensures that the application's lifecycle is cohesively tied to the lifecycle of the jobs it owns.

3. Potential Job Leak Prevention

You've raised a critical concern here. As described in the point above, the primary mechanism is that the application itself ensures its jobs are terminated before it shuts down, which should prevent leaks in normal circumstances. The question then becomes how to handle the exceptional case where a job fails to respond to a cancellation request. Upon further reflection, I believe that if a job is unresponsive to a cancellation initiated by the application, a background monitor issuing the same request would likely face the same problem. Triggering a fatal error may therefore be a more appropriate action in this scenario.
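
As a rough illustration of points 2 and 3, the cleanup could be sketched as follows. This is only a sketch under stated assumptions: ManagedJob, cancelAndAwaitTerminal, cleanUp, the timeout and the fatal error handler are all hypothetical names chosen for this example, not part of FLIP-549 or of Flink's API. Treating a failed cancellation the same way as a timeout is likewise an assumption.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

/** Illustrative only: none of these names come from FLIP-549 or Flink. */
final class PreTerminationCleanupSketch {

    /** Hypothetical handle to a job owned by the application. */
    interface ManagedJob {
        /** Requests cancellation; the future completes once the job is terminal. */
        CompletableFuture<Void> cancelAndAwaitTerminal();
    }

    /**
     * Cancels every managed job and waits until all of them are terminal.
     * Returns true if that happened within the timeout. Otherwise the
     * (hypothetical) fatal error handler is invoked and false is returned.
     */
    static boolean cleanUp(
            List<ManagedJob> jobs, Duration timeout, Consumer<Throwable> fatalErrorHandler) {
        // Issue all cancellations first so the jobs shut down in parallel.
        CompletableFuture<?>[] pending = jobs.stream()
                .map(ManagedJob::cancelAndAwaitTerminal)
                .toArray(CompletableFuture[]::new);
        try {
            CompletableFuture.allOf(pending).get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            // Assumption: an unresponsive or failed cancellation is escalated.
            fatalErrorHandler.accept(e);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            fatalErrorHandler.accept(e);
            return false;
        }
    }

    public static void main(String[] args) {
        ManagedJob responsive = () -> CompletableFuture.completedFuture(null);
        ManagedJob unresponsive = () -> new CompletableFuture<>(); // never completes

        Consumer<Throwable> handler = t -> System.out.println("fatal error: " + t);
        System.out.println(cleanUp(List.of(responsive), Duration.ofMillis(200), handler));
        System.out.println(cleanUp(List.of(responsive, unresponsive), Duration.ofMillis(50), handler));
    }
}
```

The sketch issues every cancellation before waiting, so one slow job does not delay the cancellation requests for the others; only the combined wait is bounded by the timeout.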
While a fatal error in a session cluster could affect other running applications, an unresponsive job indicates a severe underlying issue that warrants such a drastic measure to prevent an inconsistent and unpredictable system state. I will update the proposal to detail this fault-tolerance strategy and the reasoning behind it.

4. API Compatibility Considerations

Ensuring a smooth transition for existing users is a top priority, and I can confirm that all existing Job ID-based REST APIs will remain fully functional. Users will still be able to query and cancel jobs launched via this new application framework using Job IDs. I will add a specific section in the document to explicitly state this, reassuring users that their existing tools and scripts will continue to work as expected.

Once again, thank you for your invaluable input. I will incorporate these changes into the document shortly. Please let me know if you have any further questions or suggestions.

Best regards,
Yi

At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
>Hi, Yi.
>
>The FLIP is both interesting and highly promising for Flink users. Once
>implemented, it will enable powerful use cases, such as running a Jupyter
>Notebook kernel or SQL Gateway as a first-class application within the
>JobManager. This represents a significant step forward in usability and
>integration.
>
>I’d like to share a few suggestions and clarifications that could help
>strengthen the proposal:
>
>*Asynchronous REST API for Application Submission*
>
>Given that launching such applications may involve complex initialization
>and take considerable time to complete, it would be beneficial to support
>an asynchronous submission mechanism via REST. A synchronous endpoint might
>lead to timeouts or poor user experience. An async API could return an
>application ID immediately, allowing users to poll or query the status of
>the deployment using that identifier.
>
>*Clarification on "Pre-termination Cleanup"*
>
>The term pre-termination cleanup is mentioned several times in the
>document. Could you please elaborate on what this entails? Specifically,
>which resources are expected to be released, and at what point in the life
>cycle does this occur? A clearer definition would help ensure consistent
>implementation and improve reliability.
>
>*Potential Job Leak Prevention*
>
>There appears to be a risk of job leaks if an application fails to properly
>cancel its associated Flink job upon termination. To mitigate this, we
>might consider introducing a background daemon thread (or a monitoring
>service) that periodically checks for orphaned jobs whose parent
>applications have already terminated, and automatically triggers cleanup.
>Alternatively, integrating with Flink’s existing lifecycle management
>mechanisms could help ensure robust resource cleanup.
>
>*API Compatibility Considerations*
>
>It would be helpful to clarify how the new application model aligns with
>existing APIs. Many external systems currently rely on job IDs to monitor
>or cancel jobs. Will these operations still be supported under the new
>model? For example, can users continue to use the existing REST endpoints
>to cancel a job or check its status using the job ID, even when the job was
>launched through this new application framework?
>
>Best,
>Shengkai
>
>Yi Zhang <[email protected]> wrote on Tue, Sep 23, 2025 at 11:23:
>
>> Hi everyone,
>>
>> I would like to start a discussion about FLIP-549: Support Application
>> Management [1].
>>
>> Despite Flink’s widespread adoption, the existing model for running user
>> logic limits observability and execution flexibility, which affects user
>> experience. This FLIP introduces a new application management framework
>> designed to close these gaps and provide a foundation for future
>> improvements.
>>
>> Looking forward to your feedback and suggestions.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
>>
>> Best regards,
>>
>> Yi Zhang
