Hi Yi,

Interesting FLIP, thanks for putting it together. Overall, it would be good to unify the dispatcher for the different modes, although this will be a big lift.
One small question I had was in relation to the hierarchy in the UI, specifically the left navigation. At first glance it wasn't clear to me as a user that Applications and Jobs are tied together from the navigation. I saw that you can click on an application to get its corresponding jobs, but the navigation still gives me pause. Maybe there is a UI change we can make to clear up any potential confusion?

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform
Shopify

On Wed, Sep 24, 2025 at 11:48 PM Yi Zhang <[email protected]> wrote:

> Hi Shengkai,
>
> Thanks for taking the time to review the FLIP and for your thoughtful and
> constructive feedback! I would like to share the planned updates based on
> your points:
>
> 1. Asynchronous REST API for Application Submission
>
> This is an excellent point. To maintain compatibility, the existing
> /jars/:jarid/run API will retain its synchronous behavior, returning a Job
> ID only after the user's main method has completed and the job has been
> submitted. However, as you pointed out, this can lead to long response
> times and poor user experience.
>
> Therefore, we plan to introduce a new asynchronous REST API, likely
> /jars/:jarid/run-application. This API will submit the application and
> return an Application ID for status polling immediately, without waiting
> for the main method or job submission to finish. I'll add this to the
> proposal.
>
> 2. Clarification on "Pre-termination Cleanup"
>
> Thank you for pointing out the ambiguity. I will update the document to
> clarify that "pre-termination cleanup" refers to the process where an
> application, before transitioning to a terminal state, will actively cancel
> all the jobs it manages and wait for them to reach their own terminal
> states. This ensures that the application's lifecycle is cohesively tied to
> the lifecycle of the jobs it owns.
>
> 3. Potential Job Leak Prevention
>
> You've raised a critical concern here. As described in the point above,
> the primary mechanism is that the application itself ensures its jobs are
> terminated before it shuts down, which should prevent leaks in normal
> circumstances.
>
> The question then becomes how to handle the exceptional case where a job
> fails to respond to a cancellation request. Upon further reflection, I
> believe that if a job is unresponsive to a cancellation initiated by the
> application, a background monitor issuing the same request would likely
> face the same problem.
>
> Therefore, maybe triggering a fatal error is a more appropriate action in
> this scenario. While a fatal error in a session cluster could affect other
> running applications, an unresponsive job indicates a severe underlying
> issue that warrants such a drastic measure to prevent an inconsistent and
> unpredictable system state. I will update the proposal to detail this
> fault-tolerance strategy and the reasoning behind it.
>
> 4. API Compatibility Considerations
>
> Ensuring a smooth transition for existing users is a top priority and I
> can confirm that all existing Job ID-based REST APIs will remain fully
> functional.
>
> Users will still be able to query and cancel jobs launched via this new
> application framework using Job IDs. I will add a specific section in the
> document to explicitly state this, reassuring users that their existing
> tools and scripts will continue to work as expected.
>
> Once again, thank you for your invaluable input. I will incorporate these
> changes into the document shortly. Please let me know if you have any
> further questions or suggestions.
>
> Best regards,
> Yi
>
> At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
>
> > Hi, Yi.
> >
> > The FLIP is both interesting and highly promising for Flink users. Once
> > implemented, it will enable powerful use cases, such as running a Jupyter
> > Notebook kernel or SQL Gateway as a first-class application within the
> > JobManager. This represents a significant step forward in usability and
> > integration.
> >
> > I’d like to share a few suggestions and clarifications that could help
> > strengthen the proposal:
> >
> > *Asynchronous REST API for Application Submission*
> >
> > Given that launching such applications may involve complex initialization
> > and take considerable time to complete, it would be beneficial to support
> > an asynchronous submission mechanism via REST. A synchronous endpoint
> > might lead to timeouts or poor user experience. An async API could return
> > an application ID immediately, allowing users to poll or query the status
> > of the deployment using that identifier.
> >
> > *Clarification on "Pre-termination Cleanup"*
> >
> > The term pre-termination cleanup is mentioned several times in the
> > document. Could you please elaborate on what this entails? Specifically,
> > which resources are expected to be released, and at what point in the
> > life cycle does this occur? A clearer definition would help ensure
> > consistent implementation and improve reliability.
> >
> > *Potential Job Leak Prevention*
> >
> > There appears to be a risk of job leaks if an application fails to
> > properly cancel its associated Flink job upon termination. To mitigate
> > this, we might consider introducing a background daemon thread (or a
> > monitoring service) that periodically checks for orphaned jobs whose
> > parent applications have already terminated, and automatically triggers
> > cleanup. Alternatively, integrating with Flink’s existing lifecycle
> > management mechanisms could help ensure robust resource cleanup.
> >
> > *API Compatibility Considerations*
> >
> > It would be helpful to clarify how the new application model aligns with
> > existing APIs. Many external systems currently rely on job IDs to monitor
> > or cancel jobs. Will these operations still be supported under the new
> > model? For example, can users continue to use the existing REST endpoints
> > to cancel a job or check its status using the job ID, even when the job
> > was launched through this new application framework?
> >
> > Best,
> > Shengkai
> >
> > On Tue, Sep 23, 2025 at 11:23, Yi Zhang <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > I would like to start a discussion about FLIP-549: Support Application
> > > Management [1].
> > >
> > > Despite Flink’s widespread adoption, the existing model for running user
> > > logic limits observability and execution flexibility, which affects user
> > > experience. This FLIP introduces a new application management framework
> > > designed to close these gaps and provide a foundation for future
> > > improvements.
> > >
> > > Looking forward to your feedback and suggestions.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
> > >
> > > Best regards,
> > >
> > > Yi Zhang
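
For reference, the asynchronous flow described in point 1 of Yi's reply (submit, get an Application ID back immediately, then poll for status) might look roughly like this from the client side. This is only a sketch under stated assumptions: the thread does not name a status endpoint or the application states, so `fetch_status`, the state names in `TERMINAL_STATES`, and the function `wait_for_application` are all hypothetical placeholders, not part of the FLIP. The status lookup is passed in as a callable so the polling logic stays independent of whatever REST path is eventually chosen.

```python
import time
from typing import Callable

# Hypothetical terminal state names; the FLIP may define different ones.
TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELED"}


def wait_for_application(
    fetch_status: Callable[[str], str],
    application_id: str,
    poll_interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the application reaches a terminal state and return it.

    fetch_status stands in for a GET against a status endpoint, which the
    thread does not specify. It takes an Application ID and returns the
    current state as a string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status(application_id)
        if status in TERMINAL_STATES:
            return status
        # Give up once the deadline has passed instead of polling forever.
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"application {application_id} still in state {status!r} "
                f"after {timeout_s} seconds"
            )
        sleep(poll_interval_s)
```

In a real client, `fetch_status` would wrap an HTTP call using the Application ID returned by the submission request. Bounding the wait with a timeout reflects the concern raised in the thread that initialization can take a long time: the caller decides how long to wait rather than blocking on one synchronous request.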
