This is a very interesting FLIP. We have also felt the pain of not having better application management within Flink, especially in the Flink HistoryServer. For example, it indexes by `jobs` instead of `application` even when we run in `application mode`, which makes user-side observability a bit clunky for finished jobs. This is particularly important for Flink batch apps, but it is also relevant for stream processing apps.
+1 and thanks for working on this FLIP, Yi Zhang.

Regards
Venkata krishnan

On Mon, Sep 29, 2025 at 11:22 PM Yi Zhang <[email protected]> wrote:

> Hi Shengkai,
>
> Just a quick note, the FLIP has been updated to reflect the latest
> changes in the REST API and Application status description.
>
> Thanks,
> Yi
>
> At 2025-09-25 11:47:41, "Yi Zhang" <[email protected]> wrote:
> >Hi Shengkai,
> >
> >Thanks for taking the time to review the FLIP and for your thoughtful
> >and constructive feedback! I would like to share the planned updates
> >based on your points:
> >
> >1. Asynchronous REST API for Application Submission
> >
> >This is an excellent point. To maintain compatibility, the existing
> >/jars/:jarid/run API will retain its synchronous behavior, returning a
> >Job ID only after the user's main method has completed and the job has
> >been submitted. However, as you pointed out, this can lead to long
> >response times and poor user experience.
> >
> >Therefore, we plan to introduce a new asynchronous REST API, likely
> >/jars/:jarid/run-application. This API will submit the application and
> >return an Application ID for status polling immediately, without
> >waiting for the main method or job submission to finish. I'll add this
> >to the proposal.
> >
> >2. Clarification on "Pre-termination Cleanup"
> >
> >Thank you for pointing out the ambiguity. I will update the document to
> >clarify that "pre-termination cleanup" refers to the process where an
> >application, before transitioning to a terminal state, will actively
> >cancel all the jobs it manages and wait for them to reach their own
> >terminal states. This ensures that the application's lifecycle is
> >cohesively tied to the lifecycle of the jobs it owns.
> >
> >3. Potential Job Leak Prevention
> >
> >You've raised a critical concern here. As described in the point above,
> >the primary mechanism is that the application itself ensures its jobs
> >are terminated before it shuts down, which should prevent leaks in
> >normal circumstances.
> >
> >The question then becomes how to handle the exceptional case where a
> >job fails to respond to a cancellation request. Upon further
> >reflection, I believe that if a job is unresponsive to a cancellation
> >initiated by the application, a background monitor issuing the same
> >request would likely face the same problem.
> >
> >Therefore, maybe triggering a fatal error is a more appropriate action
> >in this scenario. While a fatal error in a session cluster could affect
> >other running applications, an unresponsive job indicates a severe
> >underlying issue that warrants such a drastic measure to prevent an
> >inconsistent and unpredictable system state. I will update the proposal
> >to detail this fault-tolerance strategy and the reasoning behind it.
> >
> >4. API Compatibility Considerations
> >
> >Ensuring a smooth transition for existing users is a top priority and I
> >can confirm that all existing Job ID-based REST APIs will remain fully
> >functional.
> >
> >Users will still be able to query and cancel jobs launched via this new
> >application framework using Job IDs. I will add a specific section in
> >the document to explicitly state this, reassuring users that their
> >existing tools and scripts will continue to work as expected.
> >
> >Once again, thank you for your invaluable input. I will incorporate
> >these changes into the document shortly. Please let me know if you have
> >any further questions or suggestions.
> >
> >Best regards,
> >
> >Yi
> >
> >At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
> >>Hi, Yi.
> >>
> >>The FLIP is both interesting and highly promising for Flink users.
> >>Once implemented, it will enable powerful use cases, such as running a
> >>Jupyter Notebook kernel or SQL Gateway as a first-class application
> >>within the JobManager. This represents a significant step forward in
> >>usability and integration.
> >>
> >>I’d like to share a few suggestions and clarifications that could help
> >>strengthen the proposal:
> >>
> >>*Asynchronous REST API for Application Submission*
> >>
> >>Given that launching such applications may involve complex
> >>initialization and take considerable time to complete, it would be
> >>beneficial to support an asynchronous submission mechanism via REST. A
> >>synchronous endpoint might lead to timeouts or poor user experience.
> >>An async API could return an application ID immediately, allowing
> >>users to poll or query the status of the deployment using that
> >>identifier.
> >>
> >>*Clarification on "Pre-termination Cleanup"*
> >>
> >>The term pre-termination cleanup is mentioned several times in the
> >>document. Could you please elaborate on what this entails?
> >>Specifically, which resources are expected to be released, and at what
> >>point in the life cycle does this occur? A clearer definition would
> >>help ensure consistent implementation and improve reliability.
> >>
> >>*Potential Job Leak Prevention*
> >>
> >>There appears to be a risk of job leaks if an application fails to
> >>properly cancel its associated Flink job upon termination. To mitigate
> >>this, we might consider introducing a background daemon thread (or a
> >>monitoring service) that periodically checks for orphaned jobs whose
> >>parent applications have already terminated, and automatically
> >>triggers cleanup. Alternatively, integrating with Flink’s existing
> >>lifecycle management mechanisms could help ensure robust resource
> >>cleanup.
> >>
> >>*API Compatibility Considerations*
> >>
> >>It would be helpful to clarify how the new application model aligns
> >>with existing APIs. Many external systems currently rely on job IDs to
> >>monitor or cancel jobs. Will these operations still be supported under
> >>the new model? For example, can users continue to use the existing
> >>REST endpoints to cancel a job or check its status using the job ID,
> >>even when the job was launched through this new application framework?
> >>
> >>Best,
> >>Shengkai
> >>
> >>On Tue, Sep 23, 2025 at 11:23, Yi Zhang <[email protected]> wrote:
> >>
> >>> Hi everyone,
> >>>
> >>> I would like to start a discussion about FLIP-549: Support
> >>> Application Management [1].
> >>>
> >>> Despite Flink’s widespread adoption, the existing model for running
> >>> user logic limits observability and execution flexibility, which
> >>> affects user experience. This FLIP introduces a new application
> >>> management framework designed to close these gaps and provide a
> >>> foundation for future improvements.
> >>>
> >>> Looking forward to your feedback and suggestions.
> >>>
> >>> [1]
> >>> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-549*3A*Support*Application*Management__;JSsrKw!!IKRxdwAv5BmarQ!aRc3tkhpxkAknCHX0KrIgjvhovPNi-8cAXrwPzndtcVGGoRnJb9cP4v5RY7Qz2TmGbBz2LJA4OxnclVlSDLX4OH9$
> >>>
> >>> Best regards,
> >>>
> >>> Yi Zhang
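
For readers following the thread, here is a rough, self-contained sketch of the submit-then-poll pattern discussed under "Asynchronous REST API for Application Submission". It is not Flink code and does not call any Flink REST endpoint: the class `AsyncSubmitSketch`, the `Status` values, and the methods `submit` and `awaitTerminal` are all hypothetical names made up for illustration, and an in-memory map stands in for whatever the server would actually track.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: an in-memory stand-in for an asynchronous submission API. */
public class AsyncSubmitSketch {

    /** Hypothetical status values; the real FLIP defines its own application states. */
    enum Status { SUBMITTED, RUNNING, FINISHED, FAILED }

    /** Hypothetical status store, keyed by application ID. */
    static final Map<String, Status> STATUSES = new ConcurrentHashMap<>();

    /** Returns an ID right away; the user's main logic runs in the background. */
    static String submit(Runnable userMain) {
        String id = UUID.randomUUID().toString();
        STATUSES.put(id, Status.SUBMITTED);
        CompletableFuture.runAsync(() -> {
            STATUSES.put(id, Status.RUNNING);
            try {
                userMain.run();
                STATUSES.put(id, Status.FINISHED);
            } catch (RuntimeException e) {
                STATUSES.put(id, Status.FAILED);
            }
        });
        return id;
    }

    static boolean isTerminal(Status s) {
        return s == Status.FINISHED || s == Status.FAILED;
    }

    /** Polls the status until it is terminal or the timeout elapses; returns the last status seen. */
    static Status awaitTerminal(String id, long pollMillis, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        Status s = STATUSES.get(id);
        while (!isTerminal(s) && System.nanoTime() < deadline) {
            Thread.sleep(pollMillis);
            s = STATUSES.get(id);
        }
        return s;
    }

    /** Simulates a slow user main method. */
    static void slowMain() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String id = submit(AsyncSubmitSketch::slowMain);
        System.out.println("submitted " + id + ", polling for status");
        System.out.println("final status: " + awaitTerminal(id, 50, 5_000));
    }
}
```

The point of the sketch is only that `submit` returns before the user's main logic finishes, so a client never blocks on a long initialization and instead checks the status by ID at its own pace.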
