Hi Fabian,

could you explain a bit how you are cancelling the job with a savepoint and
then trying to retrieve the savepoint path?

When running Flink in per-job mode, the system should not shut down if you
have an asynchronous operation running whose result you have not yet
queried. I believe that this feature was introduced with FLINK-10309 [1].
The semantics are that Flink waits until the result has been queried (by
any client) or until 5 minutes have passed, whichever comes first [2]. If
this is not working, then this is clearly a bug.
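
To make the expected client behaviour concrete, here is a minimal sketch of
triggering stop-with-savepoint and then polling for the result within that
5 minute window. It is not taken from the Flink code base. The endpoint
paths and JSON field names are written from my recollection of the REST API
and should be checked against the docs for your version; base_url, job_id
and target_dir are hypothetical placeholders.

```python
import json
import time
import urllib.request


def fetch_json(url, data=None):
    """GET the URL (or POST when data is given) and decode the JSON reply."""
    body = None if data is None else json.dumps(data).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_savepoint(get_status, timeout_s=300.0, interval_s=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until the operation completes.

    Returns the savepoint location. The default timeout of 300 s mirrors
    the 5 minutes mentioned above (an assumption, not read from the config).
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status.get("status", {}).get("id") == "COMPLETED":
            operation = status.get("operation", {})
            if "failure-cause" in operation:
                raise RuntimeError(
                    "savepoint failed: %s" % operation["failure-cause"]
                )
            return operation["location"]
        if clock() >= deadline:
            raise TimeoutError("savepoint result not available in time")
        sleep(interval_s)


def stop_with_savepoint(base_url, job_id, target_dir):
    """Trigger stop-with-savepoint, then poll for the savepoint path."""
    # Assumed endpoint: POST /jobs/<jobid>/stop returns {"request-id": ...}
    trigger = fetch_json(
        f"{base_url}/jobs/{job_id}/stop",
        {"targetDirectory": target_dir, "drain": False},
    )
    request_id = trigger["request-id"]
    # Assumed endpoint: GET /jobs/<jobid>/savepoints/<request-id>
    return wait_for_savepoint(
        lambda: fetch_json(f"{base_url}/jobs/{job_id}/savepoints/{request_id}")
    )
```

If the REST endpoint goes away before the status request returns COMPLETED,
the polling call fails with a connection error, which is the symptom
described in the original report.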

FLINK-18663 [3] fixed a bug where the cluster would hang while being shut
down. That hang was obviously a bug as well.

[1] https://issues.apache.org/jira/browse/FLINK-10309
[2]
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L141
[3] https://issues.apache.org/jira/browse/FLINK-18663

Cheers,
Till

On Fri, Aug 7, 2020 at 5:58 PM Eleanore Jin <eleanore....@gmail.com> wrote:

> +1 Thank you Fabian!
>
> On Fri, Aug 7, 2020 at 6:58 AM Fabian Paul <fabianp...@data-artisans.com>
> wrote:
>
> > Hi all,
> >
> > Due to recent changes in the shutdown mechanism of Flink [1] it is not
> > conveniently possible anymore to suspend a job running on a jobcluster
> > with a savepoint and retrieve the savepoint location via the Flink API
> > programmatically.
> >
> > With the introduced changes the REST endpoint shuts down immediately
> > and rejects new requests, which makes the information inaccessible.
> >
> > Before the changes it was possible to stop the job and query the
> > savepoint info endpoint until the location was shown.
> > Admittedly, this was never a safe solution because it expected that the
> > rest endpoint stays alive long enough.
> >
> > I would like to see what the community thinks about this and whether it
> > is worth implementing a different solution to retrieve this information.
> >
> > Best,
> > Fabian
> > [1] https://issues.apache.org/jira/browse/FLINK-18663
> >
>
