Hi Yi,

Interesting FLIP, thanks for putting it together. Overall, it would be good
to unify the dispatcher for the different modes, although this will be a
big lift.

One small question I had concerns the hierarchy in the UI, specifically the
left navigation. At first glance, it wasn't clear to me as a user that
Applications and Jobs are tied together. I saw that you can click on an
application to get its corresponding jobs, but the navigation still gives
me pause. Maybe there is a UI change we can make to clear up any potential
confusion?

Ryan van Huuksloot
Staff Engineer, Infrastructure | Streaming Platform


On Wed, Sep 24, 2025 at 11:48 PM Yi Zhang <[email protected]> wrote:

> Hi Shengkai,
>
>
>
>
> Thanks for taking the time to review the FLIP and for your thoughtful and
> constructive feedback! I would like to share the planned updates based on
> your points:
>
>
>
> 1. Asynchronous REST API for Application Submission
>
> This is an excellent point. To maintain compatibility, the existing
> /jars/:jarid/run API will retain its synchronous behavior, returning a Job
> ID only after the user's main method has completed and the job has been
> submitted. However, as you pointed out, this can lead to long response
> times and poor user experience.
>
> Therefore, we plan to introduce a new asynchronous REST API, likely
> /jars/:jarid/run-application. This API will submit the application and
> return an Application ID for status polling immediately, without waiting
> for the main method or job submission to finish. I'll add this to the
> proposal.
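
The submit-then-poll flow described above could look roughly like the
following from a client's point of view. This is only a minimal sketch: the
`submit` callable stands in for a POST to the proposed
/jars/:jarid/run-application endpoint, while the `get_status` callable, the
state names, and the `FakeCluster` helper are assumptions made for
illustration and are not part of the FLIP.

```python
import time

# Assumed terminal state names; the FLIP may define different ones.
TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELED"}


def submit_and_poll(submit, get_status, interval_s=0.0, max_polls=100):
    """Submit asynchronously, then poll until the application is terminal."""
    app_id = submit()  # returns right away with an Application ID
    for _ in range(max_polls):
        status = get_status(app_id)
        if status in TERMINAL_STATES:
            return app_id, status
        time.sleep(interval_s)
    raise TimeoutError(f"application {app_id} not terminal after {max_polls} polls")


class FakeCluster:
    """In-memory stand-in for the cluster so the sketch runs without a server."""

    def __init__(self, states):
        self._states = iter(states)

    def submit(self):
        # Stands in for POST /jars/:jarid/run-application
        return "app-1"

    def status(self, app_id):
        # Stands in for a hypothetical status endpoint
        return next(self._states)


cluster = FakeCluster(["CREATED", "RUNNING", "RUNNING", "FINISHED"])
result = submit_and_poll(cluster.submit, cluster.status)
```

The point of the design is that `submit` never blocks on the user's main
method; all waiting happens in the client's polling loop, where the client
controls the interval and the timeout.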
>
>
>
> 2. Clarification on "Pre-termination Cleanup"
>
> Thank you for pointing out the ambiguity. I will update the document to
> clarify that "pre-termination cleanup" refers to the process where an
> application, before transitioning to a terminal state, will actively cancel
> all the jobs it manages and wait for them to reach their own terminal
> states. This ensures that the application's lifecycle is cohesively tied to
> the lifecycle of the jobs it owns.
>
>
>
> 3. Potential Job Leak Prevention
>
> You've raised a critical concern here. As described in the point above,
> the primary mechanism is that the application itself ensures its jobs are
> terminated before it shuts down, which should prevent leaks in normal
> circumstances.
>
> The question then becomes how to handle the exceptional case where a job
> fails to respond to a cancellation request. Upon further reflection, I
> believe that if a job is unresponsive to a cancellation initiated by the
> application, a background monitor issuing the same request would likely
> face the same problem.
>
> Therefore, triggering a fatal error may be a more appropriate action in
> this scenario. While a fatal error in a session cluster could affect other
> running applications, an unresponsive job indicates a severe underlying
> issue that warrants such a drastic measure to prevent an inconsistent and
> unpredictable system state. I will update the proposal to detail this
> fault-tolerance strategy and the reasoning behind it.
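
Points 2 and 3 together could be sketched as follows: cancel every
non-terminal job, wait for all jobs to become terminal, and escalate if
some job never gets there. This is a rough illustration only. The timeout
used to decide that a job is unresponsive, the state names, the
`FatalError` class, and the `cancel` / `get_state` callables are all
assumptions and do not come from the FLIP.

```python
import time

# Assumed terminal state names.
TERMINAL = {"FINISHED", "FAILED", "CANCELED"}


class FatalError(RuntimeError):
    """Stands in for escalating to a cluster-level fatal error."""


def pre_termination_cleanup(job_ids, cancel, get_state,
                            timeout_s=1.0, interval_s=0.01):
    # Step 1: ask every job that is still running to cancel.
    for job_id in job_ids:
        if get_state(job_id) not in TERMINAL:
            cancel(job_id)

    # Step 2: wait until all jobs are terminal, or give up at the deadline.
    deadline = time.monotonic() + timeout_s
    pending = set(job_ids)
    while pending:
        pending = {j for j in pending if get_state(j) not in TERMINAL}
        if not pending:
            break
        if time.monotonic() >= deadline:
            # Step 3: an unresponsive job is treated as fatal (assumed policy).
            raise FatalError(f"jobs did not terminate: {sorted(pending)}")
        time.sleep(interval_s)


# Tiny in-memory demo: one running job, one already finished.
states = {"job-a": "RUNNING", "job-b": "FINISHED"}


def cancel(job_id):
    states[job_id] = "CANCELED"


pre_termination_cleanup(list(states), cancel, states.get)
```

Only after this function returns would the application itself move to a
terminal state, which is what ties its lifecycle to the jobs it owns.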
>
>
>
> 4. API Compatibility Considerations
>
> Ensuring a smooth transition for existing users is a top priority, and I
> can confirm that all existing Job ID-based REST APIs will remain fully
> functional.
>
> Users will still be able to query and cancel jobs launched via this new
> application framework using Job IDs. I will add a specific section in the
> document to explicitly state this, reassuring users that their existing
> tools and scripts will continue to work as expected.
>
>
>
> Once again, thank you for your invaluable input. I will incorporate these
> changes into the document shortly. Please let me know if you have any
> further questions or suggestions.
>
>
>
> Best regards,
>
> Yi
>
> At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
> >Hi, Yi.
> >
> >The FLIP is both interesting and highly promising for Flink users. Once
> >implemented, it will enable powerful use cases—such as running a Jupyter
> >Notebook kernel or SQL Gateway as a first-class application within the
> >JobManager. This represents a significant step forward in usability and
> >integration.
> >
> >I’d like to share a few suggestions and clarifications that could help
> >strengthen the proposal:
> >
> >*Asynchronous REST API for Application Submission*
> >
> >Given that launching such applications may involve complex initialization
> >and take considerable time to complete, it would be beneficial to support
> >an asynchronous submission mechanism via REST. A synchronous endpoint
> >might lead to timeouts or poor user experience. An async API could return an
> >application ID immediately, allowing users to poll or query the status of
> >the deployment using that identifier.
> >
> >*Clarification on "Pre-termination Cleanup"*
> >The term pre-termination cleanup is mentioned several times in the
> >document. Could you please elaborate on what this entails? Specifically,
> >which resources are expected to be released, and at what point in the
> >lifecycle does this occur? A clearer definition would help ensure consistent
> >implementation and improve reliability.
> >
> >*Potential Job Leak Prevention*
> >
> >There appears to be a risk of job leaks if an application fails to
> >properly cancel its associated Flink job upon termination. To mitigate
> >this, we
> >might consider introducing a background daemon thread (or a monitoring
> >service) that periodically checks for orphaned jobs whose parent
> >applications have already terminated, and automatically triggers cleanup.
> >Alternatively, integrating with Flink’s existing lifecycle management
> >mechanisms could help ensure robust resource cleanup.
> >
> >*API Compatibility Considerations*
> >
> >It would be helpful to clarify how the new application model aligns with
> >existing APIs. Many external systems currently rely on job IDs to monitor
> >or cancel jobs. Will these operations still be supported under the new
> >model? For example, can users continue to use the existing REST endpoints
> >to cancel a job or check its status using the job ID, even when the job
> >was launched through this new application framework?
> >
> >
> >Best,
> >Shengkai
> >
> >On Tue, Sep 23, 2025 at 11:23, Yi Zhang <[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >>
> >> I would like to start a discussion about FLIP-549: Support Application
> >> Management [1].
> >>
> >>
> >> Despite Flink’s widespread adoption, the existing model for running user
> >> logic limits observability and execution flexibility, which affects user
> >> experience. This FLIP introduces a new application management framework
> >> designed to close these gaps and provide a foundation for future
> >> improvements.
> >>
> >>
> >> Looking forward to your feedback and suggestions.
> >>
> >>
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
> >>
> >>
> >> Best regards,
> >>
> >> Yi Zhang
>
