Re: [DISCUSS] FLIP-549: Support Application Management

Venkatakrishnan Sowrirajan Tue, 30 Sep 2025 15:15:38 -0700

This is a very interesting FLIP. We have also felt the pain of lacking
better application management in Flink, especially with the Flink
HistoryServer. For example, it indexes by `jobs` instead of `application`
even though we run in `application mode`, which makes user-side
observability clunky for finished jobs. This is particularly important for
Flink batch applications, but it is also relevant to stream processing
applications.


+1 and thanks for working on this FLIP, Yi Zhang.

Regards
Venkata krishnan


On Mon, Sep 29, 2025 at 11:22 PM Yi Zhang <[email protected]> wrote:

> Hi Shengkai,
>
>
> Just a quick note: the FLIP has been updated to reflect the latest
> changes to the REST API and the application status description.
>
>
> Thanks,
> Yi
>
>
> At 2025-09-25 11:47:41, "Yi Zhang" <[email protected]> wrote:
> >Hi Shengkai,
> >
> >
> >
> >
> >Thanks for taking the time to review the FLIP and for your thoughtful and
> >constructive feedback! I would like to share the planned updates based on
> >your points:
> >
> >
> >
> >1. Asynchronous REST API for Application Submission
> >
> >This is an excellent point. To maintain compatibility, the existing
> >/jars/:jarid/run API will retain its synchronous behavior, returning a Job
> >ID only after the user's main method has completed and the job has been
> >submitted. However, as you pointed out, this can lead to long response
> >times and a poor user experience.
> >
> >Therefore, we plan to introduce a new asynchronous REST API, likely
> >/jars/:jarid/run-application. This API will submit the application and
> >immediately return an Application ID that can be used for status polling,
> >without waiting for the main method or job submission to finish. I'll add
> >this to the proposal.
> >
> >
> >
> >2. Clarification on "Pre-termination Cleanup"
> >
> >Thank you for pointing out the ambiguity. I will update the document to
> >clarify that "pre-termination cleanup" refers to the process where an
> >application, before transitioning to a terminal state, will actively cancel
> >all the jobs it manages and wait for them to reach their own terminal
> >states. This ensures that the application's lifecycle is cohesively tied to
> >the lifecycle of the jobs it owns.
> >
> >
> >
> >3. Potential Job Leak Prevention
> >
> >You've raised a critical concern here. As described in the point above,
> >the primary mechanism is that the application itself ensures its jobs are
> >terminated before it shuts down, which should prevent leaks in normal
> >circumstances.
> >
> >The question then becomes how to handle the exceptional case where a job
> >fails to respond to a cancellation request. Upon further reflection, I
> >believe that if a job is unresponsive to a cancellation initiated by the
> >application, a background monitor issuing the same request would likely
> >face the same problem.
> >
> >Therefore, triggering a fatal error may be a more appropriate action in
> >this scenario. While a fatal error in a session cluster could affect other
> >running applications, an unresponsive job indicates a severe underlying
> >issue that warrants such a drastic measure to prevent an inconsistent and
> >unpredictable system state. I will update the proposal to detail this
> >fault-tolerance strategy and the reasoning behind it.
> >
> >
> >
> >4. API Compatibility Considerations
> >
> >Ensuring a smooth transition for existing users is a top priority and I
> >can confirm that all existing Job ID-based REST APIs will remain fully
> >functional.
> >
> >Users will still be able to query and cancel jobs launched via this new
> >application framework using Job IDs. I will add a specific section in the
> >document to explicitly state this, reassuring users that their existing
> >tools and scripts will continue to work as expected.
> >
> >
> >
> >Once again, thank you for your invaluable input. I will incorporate these
> >changes into the document shortly. Please let me know if you have any
> >further questions or suggestions.
> >
> >
> >
> >Best regards,
> >
> >Yi
> >
> >At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
> >>Hi, Yi.
> >>
> >>The FLIP is both interesting and highly promising for Flink users. Once
> >>implemented, it will enable powerful use cases—such as running a Jupyter
> >>Notebook kernel or SQL Gateway as a first-class application within the
> >>JobManager. This represents a significant step forward in usability and
> >>integration.
> >>
> >>I’d like to share a few suggestions and clarifications that could help
> >>strengthen the proposal:
> >>
> >>*Asynchronous REST API for Application Submission*
> >>
> >>Given that launching such applications may involve complex initialization
> >>and take considerable time to complete, it would be beneficial to support
> >>an asynchronous submission mechanism via REST. A synchronous endpoint might
> >>lead to timeouts or poor user experience. An async API could return an
> >>application ID immediately, allowing users to poll or query the status of
> >>the deployment using that identifier.
> >>
> >>*Clarification on "Pre-termination Cleanup"*
> >>The term pre-termination cleanup is mentioned several times in the
> >>document. Could you please elaborate on what this entails? Specifically,
> >>which resources are expected to be released, and at what point in the life
> >>cycle does this occur? A clearer definition would help ensure consistent
> >>implementation and improve reliability.
> >>
> >>*Potential Job Leak Prevention*
> >>
> >>There appears to be a risk of job leaks if an application fails to properly
> >>cancel its associated Flink job upon termination. To mitigate this, we
> >>might consider introducing a background daemon thread (or a monitoring
> >>service) that periodically checks for orphaned jobs whose parent
> >>applications have already terminated, and automatically triggers cleanup.
> >>Alternatively, integrating with Flink’s existing lifecycle management
> >>mechanisms could help ensure robust resource cleanup.
> >>
> >>*API Compatibility Considerations*
> >>
> >>It would be helpful to clarify how the new application model aligns with
> >>existing APIs. Many external systems currently rely on job IDs to monitor
> >>or cancel jobs. Will these operations still be supported under the new
> >>model? For example, can users continue to use the existing REST endpoints
> >>to cancel a job or check its status using the job ID, even when the job was
> >>launched through this new application framework?
> >>
> >>
> >>Best,
> >>Shengkai
> >>
> >>On Tue, Sep 23, 2025 at 11:23, Yi Zhang <[email protected]> wrote:
> >>
> >>> Hi everyone,
> >>>
> >>>
> >>> I would like to start a discussion about FLIP-549: Support Application
> >>> Management [1].
> >>>
> >>>
> >>> Despite Flink’s widespread adoption, the existing model for running user
> >>> logic limits observability and execution flexibility, which affects user
> >>> experience. This FLIP introduces a new application management framework
> >>> designed to close these gaps and provide a foundation for future
> >>> improvements.
> >>>
> >>>
> >>> Looking forward to your feedback and suggestions.
> >>>
> >>>
> >>>
> >>> [1]
> >>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
> >>>
> >>>
> >>> Best regards,
> >>>
> >>> Yi Zhang
>
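
For readers following the thread, the asynchronous submission flow from
point 1 of Yi's reply could be sketched roughly as below. This is a minimal
Python sketch against an in-memory stand-in, not Flink's REST API:
`FakeCluster`, `run_application`, `get_status`, `wait_for_terminal`, and the
status names are all hypothetical. Only the idea of returning an Application
ID immediately and polling for status comes from the discussion.

```python
import itertools

# Hypothetical terminal status names; the FLIP defines the real ones.
TERMINAL_STATUSES = {"FINISHED", "FAILED", "CANCELED"}


class FakeCluster:
    """In-memory stand-in for the REST endpoints; not Flink's API."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._status_feeds = {}

    def run_application(self, jar_id):
        # Asynchronous submit: hand back an ID without waiting for main().
        app_id = f"app-{next(self._ids)}"
        # Scripted statuses that successive polls will observe.
        self._status_feeds[app_id] = iter(
            ["CREATED", "RUNNING", "RUNNING", "FINISHED"]
        )
        return app_id

    def get_status(self, app_id):
        return next(self._status_feeds[app_id])


def wait_for_terminal(cluster, app_id, max_polls=10):
    """Poll the application status until it is terminal or polls run out."""
    seen = []
    for _ in range(max_polls):
        status = cluster.get_status(app_id)
        seen.append(status)
        if status in TERMINAL_STATUSES:
            return status, seen
        # A real client would sleep or back off between polls here.
    raise TimeoutError(f"{app_id} not terminal after {max_polls} polls")


cluster = FakeCluster()
app_id = cluster.run_application("example-jar-id")
final_status, history = wait_for_terminal(cluster, app_id)
print(app_id, final_status, history)
```

The bounded poll count stands in for a client-side timeout, so a caller
never blocks on the submission request itself.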

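Similarly, the "pre-termination cleanup" from points 2 and 3 of Yi's reply
(cancel every managed job, wait for terminal states, and escalate to a fatal
error if a job stays unresponsive) might look something like the sketch
below. The `Job` class, `FatalError`, `pre_termination_cleanup`, the state
names, and the fixed number of checks are assumptions for illustration, not
the FLIP's design.

```python
# Hypothetical terminal job states for this sketch.
JOB_TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELED"}


class FatalError(RuntimeError):
    """Hypothetical stand-in for escalating to a cluster-level fatal error."""


class Job:
    def __init__(self, job_id, responsive=True):
        self.job_id = job_id
        self.responsive = responsive
        self.state = "RUNNING"

    def cancel(self):
        # An unresponsive job ignores the cancellation request.
        if self.responsive:
            self.state = "CANCELED"


def pre_termination_cleanup(jobs, max_checks=3):
    """Cancel all non-terminal jobs, then verify they reached a terminal state."""
    for job in jobs:
        if job.state not in JOB_TERMINAL_STATES:
            job.cancel()

    pending = []
    for _ in range(max_checks):
        pending = [j.job_id for j in jobs if j.state not in JOB_TERMINAL_STATES]
        if not pending:
            # All jobs are terminal; the application may now terminate.
            return True
        # A real implementation would wait asynchronously between checks.

    # Retrying the same cancel would likely fail the same way, so escalate.
    raise FatalError(f"jobs did not respond to cancellation: {pending}")


healthy = [Job("job-a"), Job("job-b")]
print(pre_termination_cleanup(healthy), [j.state for j in healthy])
```

Raising instead of retrying mirrors the reasoning in the reply that a
background monitor issuing the same cancellation would likely hit the same
problem.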