Hi Venkata,
Thanks for the support! I'm really glad to see it resonates with your
experience.

Best,
Yi

At 2025-10-01 06:14:57, "Venkatakrishnan Sowrirajan" <[email protected]> wrote:
>This is a very interesting FLIP. We also felt the pain of not having better
>application management within Flink. Especially with respect to the Flink
>HistoryServer: e.g., it indexes by `jobs` instead of `application` even
>though we run it in `application mode`, which made user-side
>observability a bit clunky for finished jobs. This is particularly
>important for Flink batch apps but is also relevant to stream processing
>apps.
>
>+1 and thanks for working on this FLIP, Yi Zhang.
>
>Regards
>Venkata krishnan
>
>
>On Mon, Sep 29, 2025 at 11:22 PM Yi Zhang <[email protected]> wrote:
>
>> Hi Shengkai,
>>
>> Just a quick note, the FLIP has been updated to reflect the latest
>> changes in the REST API and Application status description.
>>
>> Thanks,
>> Yi
>>
>> At 2025-09-25 11:47:41, "Yi Zhang" <[email protected]> wrote:
>> >Hi Shengkai,
>> >
>> >Thanks for taking the time to review the FLIP and for your thoughtful
>> >and constructive feedback! I would like to share the planned updates
>> >based on your points:
>> >
>> >1. Asynchronous REST API for Application Submission
>> >
>> >This is an excellent point. To maintain compatibility, the existing
>> >/jars/:jarid/run API will retain its synchronous behavior, returning a
>> >Job ID only after the user's main method has completed and the job has
>> >been submitted. However, as you pointed out, this can lead to long
>> >response times and a poor user experience.
>> >
>> >Therefore, we plan to introduce a new asynchronous REST API, likely
>> >/jars/:jarid/run-application. This API will submit the application and
>> >immediately return an Application ID for status polling, without
>> >waiting for the main method or job submission to finish. I'll add this
>> >to the proposal.
>> >
>> >2. Clarification on "Pre-termination Cleanup"
>> >
>> >Thank you for pointing out the ambiguity. I will update the document to
>> >clarify that "pre-termination cleanup" refers to the process where an
>> >application, before transitioning to a terminal state, will actively
>> >cancel all the jobs it manages and wait for them to reach their own
>> >terminal states. This ensures that the application's lifecycle is
>> >cohesively tied to the lifecycle of the jobs it owns.
>> >
>> >3. Potential Job Leak Prevention
>> >
>> >You've raised a critical concern here. As described in the point above,
>> >the primary mechanism is that the application itself ensures its jobs
>> >are terminated before it shuts down, which should prevent leaks in
>> >normal circumstances.
>> >
>> >The question then becomes how to handle the exceptional case where a
>> >job fails to respond to a cancellation request. Upon further
>> >reflection, I believe that if a job is unresponsive to a cancellation
>> >initiated by the application, a background monitor issuing the same
>> >request would likely face the same problem.
>> >
>> >Therefore, maybe triggering a fatal error is a more appropriate action
>> >in this scenario. While a fatal error in a session cluster could affect
>> >other running applications, an unresponsive job indicates a severe
>> >underlying issue that warrants such a drastic measure to prevent an
>> >inconsistent and unpredictable system state. I will update the proposal
>> >to detail this fault-tolerance strategy and the reasoning behind it.
>> >
>> >4. API Compatibility Considerations
>> >
>> >Ensuring a smooth transition for existing users is a top priority and I
>> >can confirm that all existing Job ID-based REST APIs will remain fully
>> >functional.
>> >
>> >Users will still be able to query and cancel jobs launched via this new
>> >application framework using Job IDs. I will add a specific section in
>> >the document to explicitly state this, reassuring users that their
>> >existing tools and scripts will continue to work as expected.
>> >
>> >Once again, thank you for your invaluable input. I will incorporate
>> >these changes into the document shortly. Please let me know if you have
>> >any further questions or suggestions.
>> >
>> >Best regards,
>> >
>> >Yi
>> >
>> >At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
>> >>Hi, Yi.
>> >>
>> >>The FLIP is both interesting and highly promising for Flink users. Once
>> >>implemented, it will enable powerful use cases, such as running a
>> >>Jupyter Notebook kernel or SQL Gateway as a first-class application
>> >>within the JobManager. This represents a significant step forward in
>> >>usability and integration.
>> >>
>> >>I’d like to share a few suggestions and clarifications that could help
>> >>strengthen the proposal:
>> >>
>> >>*Asynchronous REST API for Application Submission*
>> >>
>> >>Given that launching such applications may involve complex
>> >>initialization and take considerable time to complete, it would be
>> >>beneficial to support an asynchronous submission mechanism via REST. A
>> >>synchronous endpoint might lead to timeouts or poor user experience. An
>> >>async API could return an application ID immediately, allowing users to
>> >>poll or query the status of the deployment using that identifier.
>> >>
>> >>*Clarification on "Pre-termination Cleanup"*
>> >>The term pre-termination cleanup is mentioned several times in the
>> >>document. Could you please elaborate on what this entails?
>> >>Specifically, which resources are expected to be released, and at what
>> >>point in the life cycle does this occur? A clearer definition would
>> >>help ensure consistent implementation and improve reliability.
>> >>
>> >>*Potential Job Leak Prevention*
>> >>
>> >>There appears to be a risk of job leaks if an application fails to
>> >>properly cancel its associated Flink job upon termination. To mitigate
>> >>this, we might consider introducing a background daemon thread (or a
>> >>monitoring service) that periodically checks for orphaned jobs whose
>> >>parent applications have already terminated, and automatically
>> >>triggers cleanup. Alternatively, integrating with Flink’s existing
>> >>lifecycle management mechanisms could help ensure robust resource
>> >>cleanup.
>> >>
>> >>*API Compatibility Considerations*
>> >>
>> >>It would be helpful to clarify how the new application model aligns
>> >>with existing APIs. Many external systems currently rely on job IDs to
>> >>monitor or cancel jobs. Will these operations still be supported under
>> >>the new model? For example, can users continue to use the existing REST
>> >>endpoints to cancel a job or check its status using the job ID, even
>> >>when the job was launched through this new application framework?
>> >>
>> >>
>> >>Best,
>> >>Shengkai
>> >>
>> >>On Tue, Sep 23, 2025 at 11:23, Yi Zhang <[email protected]> wrote:
>> >>
>> >>> Hi everyone,
>> >>>
>> >>> I would like to start a discussion about FLIP-549: Support
>> >>> Application Management [1].
>> >>>
>> >>> Despite Flink’s widespread adoption, the existing model for running
>> >>> user logic limits observability and execution flexibility, which
>> >>> affects user experience. This FLIP introduces a new application
>> >>> management framework designed to close these gaps and provide a
>> >>> foundation for future improvements.
>> >>>
>> >>> Looking forward to your feedback and suggestions.
>> >>> >> >>> >> >>> >> >>> [1] >> >>> >> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-549*3A*Support*Application*Management__;JSsrKw!!IKRxdwAv5BmarQ!aRc3tkhpxkAknCHX0KrIgjvhovPNi-8cAXrwPzndtcVGGoRnJb9cP4v5RY7Qz2TmGbBz2LJA4OxnclVlSDLX4OH9$ >> >>> >> >>> >> >>> Best regards, >> >>> >> >>> Yi Zhang >>
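
For readers following the asynchronous submission point in this thread, here is a minimal sketch of the submit-then-poll pattern it describes: the submit call hands back an application ID right away while the user's main method keeps running in the background, and the client polls the status with that ID. This is a toy in-memory model in Python, not Flink code and not the FLIP's API; `ApplicationRegistry`, `AppStatus`, `poll_until_terminal`, and the status names are all hypothetical and chosen only for illustration.

```python
import threading
import time
import uuid
from enum import Enum


class AppStatus(Enum):
    """Hypothetical status names; the FLIP defines its own."""
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"


class ApplicationRegistry:
    """Toy in-memory stand-in for the server side of an async submission API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def submit(self, main_fn):
        """Start main_fn in the background and return an application ID at once."""
        app_id = uuid.uuid4().hex
        with self._lock:
            self._status[app_id] = AppStatus.RUNNING
        worker = threading.Thread(target=self._run, args=(app_id, main_fn), daemon=True)
        worker.start()
        return app_id

    def _run(self, app_id, main_fn):
        # Run the user's main method and record how it ended.
        try:
            main_fn()
            final = AppStatus.FINISHED
        except Exception:
            final = AppStatus.FAILED
        with self._lock:
            self._status[app_id] = final

    def status(self, app_id):
        with self._lock:
            return self._status[app_id]


def poll_until_terminal(registry, app_id, interval=0.01, timeout=5.0):
    """Client-side loop: query the status until it is no longer RUNNING."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = registry.status(app_id)
        if status is not AppStatus.RUNNING:
            return status
        time.sleep(interval)
    raise TimeoutError(f"application {app_id} still running after {timeout}s")
```

A caller would do `app_id = registry.submit(main)` and then `poll_until_terminal(registry, app_id)`. The contrast with the synchronous behavior described for /jars/:jarid/run is that `submit` never blocks on `main`.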

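
The "pre-termination cleanup" and fatal-error points in the thread can also be sketched in a few lines: before an application reaches a terminal state it cancels every job it owns and waits for each to terminate, and a job that never responds is escalated rather than retried by a background monitor. The code below is a toy Python model under stated assumptions, not Flink's implementation. `Job`, `FatalError`, `terminate_application`, the `"CANCELED"` return value, and in particular the use of a timeout to decide that a job is unresponsive are hypothetical; the thread does not say how unresponsiveness would be detected.

```python
import threading
import time


class FatalError(Exception):
    """Hypothetical stand-in for escalating to a cluster-level fatal error."""


class Job:
    """Toy job; responsive=False models a job that ignores cancellation."""

    def __init__(self, job_id, responsive=True):
        self.job_id = job_id
        self.responsive = responsive
        self.terminated = threading.Event()

    def cancel(self):
        # A responsive job reaches its terminal state; an unresponsive one never does.
        if self.responsive:
            self.terminated.set()


def terminate_application(jobs, cancel_timeout=1.0):
    """Cancel all owned jobs, wait for them, and escalate if any is stuck."""
    # Step 1: request cancellation of every job the application manages.
    for job in jobs:
        job.cancel()

    # Step 2: wait for each job to terminate, sharing one overall deadline
    # (the timeout is an assumption of this sketch).
    deadline = time.monotonic() + cancel_timeout
    stuck = []
    for job in jobs:
        remaining = max(0.0, deadline - time.monotonic())
        if not job.terminated.wait(remaining):
            stuck.append(job.job_id)

    # Step 3: an unresponsive job is treated as a fatal condition.
    if stuck:
        raise FatalError(f"jobs did not respond to cancellation: {stuck}")
    return "CANCELED"  # hypothetical terminal state name
```

With only responsive jobs, `terminate_application` returns after every job has terminated, so no job outlives its application. With an unresponsive job it raises `FatalError` instead of leaving the application in a terminal state with a live job behind it.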