Re: [DISCUSS] FLIP-549: Support Application Management

Yi Zhang Sat, 18 Oct 2025 09:47:21 -0700

Hi Venkata,


Thanks for the support! I'm really glad to see it resonates 
with your experience.


Best,
Yi

At 2025-10-01 06:14:57, "Venkatakrishnan Sowrirajan" <[email protected]> wrote:
>This is a very interesting FLIP. We have also felt the pain of not having
>better application management within Flink, especially with respect to the
>Flink HistoryServer. For example, it indexes by `jobs` instead of
>`application` even though we run it in `application mode`, which makes
>user-side observability a bit clunky for finished jobs. This is particularly
>important for Flink batch apps but is also relevant to stream processing
>apps.
>
>+1 and thanks for working on this FLIP, Yi Zhang.
>
>Regards
>Venkata krishnan
>
>
>On Mon, Sep 29, 2025 at 11:22 PM Yi Zhang <[email protected]> wrote:
>
>> Hi Shengkai,
>>
>>
>> Just a quick note, the FLIP has been updated to reflect the latest
>> changes in the REST API and Application status description.
>>
>>
>> Thanks,
>> Yi
>>
>>
>> At 2025-09-25 11:47:41, "Yi Zhang" <[email protected]> wrote:
>> >Hi Shengkai,
>> >
>> >
>> >
>> >
>> >Thanks for taking the time to review the FLIP and for your thoughtful and
>> >constructive feedback! I would like to share the planned updates based on
>> >your points:
>> >
>> >
>> >
>> >1. Asynchronous REST API for Application Submission
>> >
>> >This is an excellent point. To maintain compatibility, the existing
>> >/jars/:jarid/run API will retain its synchronous behavior, returning a Job
>> >ID only after the user's main method has completed and the job has been
>> >submitted. However, as you pointed out, this can lead to long response
>> >times and poor user experience.
>> >
>> >Therefore, we plan to introduce a new asynchronous REST API, likely
>> >/jars/:jarid/run-application. This API will submit the application and
>> >immediately return an Application ID for status polling, without waiting
>> >for the main method or job submission to finish. I'll add this to the
>> >proposal.
>> >
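
A minimal sketch of the client-side flow this implies: submit, receive an Application ID right away, then poll until the application reaches a terminal status. The thread does not name a status endpoint or the status values, so `fetch_status` and the status names below are hypothetical stand-ins rather than the FLIP's actual API.

```python
import time
from typing import Callable

# Hypothetical terminal status names; the FLIP defines the real ones.
TERMINAL_STATUSES = frozenset({"FINISHED", "FAILED", "CANCELED"})


def wait_for_application(
    fetch_status: Callable[[str], str],
    application_id: str,
    interval_s: float = 1.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the application reaches a terminal status, then return it.

    fetch_status is an assumed callable that looks up the current status for
    an Application ID (for example, by calling a REST status endpoint).
    """
    for _ in range(max_polls):
        status = fetch_status(application_id)
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)
    raise TimeoutError(f"{application_id} not terminal after {max_polls} polls")
```

Passing the status lookup in as a callable keeps the sketch independent of whatever endpoint the FLIP finally specifies.
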
>> >
>> >
>> >2. Clarification on "Pre-termination Cleanup"
>> >
>> >Thank you for pointing out the ambiguity. I will update the document to
>> >clarify that "pre-termination cleanup" refers to the process where an
>> >application, before transitioning to a terminal state, will actively cancel
>> >all the jobs it manages and wait for them to reach their own terminal
>> >states. This ensures that the application's lifecycle is cohesively tied to
>> >the lifecycle of the jobs it owns.
>> >
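
The ordering described above can be sketched as follows: cancel every owned job, confirm each has reached a terminal state, and only then move the application to its own terminal state. The `Job` and `Application` classes and the status names are illustrative assumptions, not Flink classes, and cancellation is modeled as instantaneous for brevity.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical terminal status names, used for both jobs and applications.
TERMINAL = frozenset({"FINISHED", "FAILED", "CANCELED"})


@dataclass
class Job:
    job_id: str
    status: str = "RUNNING"

    def cancel(self) -> None:
        # Jobs that already finished or failed keep their status.
        if self.status not in TERMINAL:
            self.status = "CANCELED"


@dataclass
class Application:
    jobs: List[Job] = field(default_factory=list)
    status: str = "RUNNING"

    def terminate(self, final_status: str) -> None:
        """Run pre-termination cleanup, then enter the terminal state."""
        # Step 1: actively cancel every job the application manages.
        for job in self.jobs:
            job.cancel()
        # Step 2: refuse to terminate while any owned job is still live.
        if not all(job.status in TERMINAL for job in self.jobs):
            raise RuntimeError("owned jobs have not reached a terminal state")
        # Step 3: only now does the application itself become terminal.
        self.status = final_status
```
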
>> >
>> >
>> >3. Potential Job Leak Prevention
>> >
>> >You've raised a critical concern here. As described in the point above,
>> >the primary mechanism is that the application itself ensures its jobs are
>> >terminated before it shuts down, which should prevent leaks in normal
>> >circumstances.
>> >
>> >The question then becomes how to handle the exceptional case where a job
>> >fails to respond to a cancellation request. Upon further reflection, I
>> >believe that if a job is unresponsive to a cancellation initiated by the
>> >application, a background monitor issuing the same request would likely
>> >face the same problem.
>> >
>> >Therefore, maybe triggering a fatal error is a more appropriate action in
>> >this scenario. While a fatal error in a session cluster could affect other
>> >running applications, an unresponsive job indicates a severe underlying
>> >issue that warrants such a drastic measure to prevent an inconsistent and
>> >unpredictable system state. I will update the proposal to detail this
>> >fault-tolerance strategy and the reasoning behind it.
>> >
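
One way to picture this strategy is a cancellation that escalates to a fatal error when the job never becomes terminal. The thread does not say how unresponsiveness would be detected, so the timeout, the polling loop, and every name below are assumptions made purely for illustration.

```python
import time
from typing import Callable


def cancel_or_fail_fatally(
    request_cancel: Callable[[], None],
    is_terminal: Callable[[], bool],
    on_fatal_error: Callable[[Exception], None],
    timeout_s: float = 30.0,          # assumed deadline, not from the FLIP
    poll_interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Cancel a job; escalate to a fatal error if it stays unresponsive.

    Returns True if the job reached a terminal state before the deadline,
    False if the fatal error handler was invoked instead.
    """
    request_cancel()
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_terminal():
            return True
        sleep(poll_interval_s)
    # One last check in case the job finished during the final sleep.
    if is_terminal():
        return True
    on_fatal_error(TimeoutError("job did not respond to cancellation"))
    return False
```

The clock and sleep functions are injectable only so the escalation path can be exercised without real waiting.
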
>> >
>> >
>> >4. API Compatibility Considerations
>> >
>> >Ensuring a smooth transition for existing users is a top priority and I
>> >can confirm that all existing Job ID-based REST APIs will remain fully
>> >functional.
>> >
>> >Users will still be able to query and cancel jobs launched via this new
>> >application framework using Job IDs. I will add a specific section in the
>> >document to explicitly state this, reassuring users that their existing
>> >tools and scripts will continue to work as expected.
>> >
>> >
>> >
>> >Once again, thank you for your invaluable input. I will incorporate these
>> >changes into the document shortly. Please let me know if you have any
>> >further questions or suggestions.
>> >
>> >
>> >
>> >Best regards,
>> >
>> >Yi
>> >
>> >At 2025-09-24 16:59:52, "Shengkai Fang" <[email protected]> wrote:
>> >>Hi, Yi.
>> >>
>> >>The FLIP is both interesting and highly promising for Flink users. Once
>> >>implemented, it will enable powerful use cases—such as running a Jupyter
>> >>Notebook kernel or SQL Gateway as a first-class application within the
>> >>JobManager. This represents a significant step forward in usability and
>> >>integration.
>> >>
>> >>I’d like to share a few suggestions and clarifications that could help
>> >>strengthen the proposal:
>> >>
>> >>*Asynchronous REST API for Application Submission*
>> >>
>> >>Given that launching such applications may involve complex initialization
>> >>and take considerable time to complete, it would be beneficial to support
>> >>an asynchronous submission mechanism via REST. A synchronous endpoint might
>> >>lead to timeouts or poor user experience. An async API could return an
>> >>application ID immediately, allowing users to poll or query the status of
>> >>the deployment using that identifier.
>> >>
>> >>*Clarification on "Pre-termination Cleanup"*
>> >>The term pre-termination cleanup is mentioned several times in the
>> >>document. Could you please elaborate on what this entails? Specifically,
>> >>which resources are expected to be released, and at what point in the life
>> >>cycle does this occur? A clearer definition would help ensure consistent
>> >>implementation and improve reliability.
>> >>
>> >>*Potential Job Leak Prevention*
>> >>
>> >>There appears to be a risk of job leaks if an application fails to properly
>> >>cancel its associated Flink job upon termination. To mitigate this, we
>> >>might consider introducing a background daemon thread (or a monitoring
>> >>service) that periodically checks for orphaned jobs whose parent
>> >>applications have already terminated, and automatically triggers cleanup.
>> >>Alternatively, integrating with Flink’s existing lifecycle management
>> >>mechanisms could help ensure robust resource cleanup.
>> >>
>> >>*API Compatibility Considerations*
>> >>
>> >>It would be helpful to clarify how the new application model aligns with
>> >>existing APIs. Many external systems currently rely on job IDs to monitor
>> >>or cancel jobs. Will these operations still be supported under the new
>> >>model? For example, can users continue to use the existing REST endpoints
>> >>to cancel a job or check its status using the job ID, even when the job was
>> >>launched through this new application framework?
>> >>
>> >>
>> >>Best,
>> >>Shengkai
>> >>
>> >>On Tue, Sep 23, 2025 at 11:23, Yi Zhang <[email protected]> wrote:
>> >>
>> >>> Hi everyone,
>> >>>
>> >>>
>> >>> I would like to start a discussion about FLIP-549: Support Application
>> >>> Management [1].
>> >>>
>> >>>
>> >>> Despite Flink’s widespread adoption, the existing model for running user
>> >>> logic limits observability and execution flexibility, which affects user
>> >>> experience. This FLIP introduces a new application management framework
>> >>> designed to close these gaps and provide a foundation for future
>> >>> improvements.
>> >>>
>> >>>
>> >>> Looking forward to your feedback and suggestions.
>> >>>
>> >>>
>> >>>
>> >>> [1]
>> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-549%3A+Support+Application+Management
>> >>>
>> >>>
>> >>> Best regards,
>> >>>
>> >>> Yi Zhang
>>
