Hi Fabian,

could you explain a bit how you are cancelling a job with a savepoint and then trying to retrieve the savepoint path?

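To make sure we are talking about the same flow, here is a rough sketch of what I assume a client does: trigger a savepoint with cancellation through the REST API, then poll the asynchronous operation until the savepoint location shows up. The endpoint paths and JSON field names are the ones I remember from the REST API and may differ between Flink versions. The function names, base URL, polling interval and timeout are made up for illustration.

```python
import json
import time
import urllib.request


def http_json(method, url, body=None):
    """Send an HTTP request with an optional JSON body and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def cancel_with_savepoint(base_url, job_id, target_dir,
                          request=http_json, poll_interval=1.0, timeout=300.0):
    """Trigger a savepoint with cancel-job=true and poll for its location.

    Assumed endpoints (check them against your Flink version):
      POST /jobs/<job_id>/savepoints               -> {"request-id": ...}
      GET  /jobs/<job_id>/savepoints/<request-id>  -> status and operation result
    """
    trigger = request(
        "POST", f"{base_url}/jobs/{job_id}/savepoints",
        {"target-directory": target_dir, "cancel-job": True},
    )
    trigger_id = trigger["request-id"]

    # Poll until the asynchronous operation reports completion or we give up.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = request("GET", f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}")
        if status["status"]["id"] == "COMPLETED":
            operation = status["operation"]
            if "failure-cause" in operation:
                raise RuntimeError(f"savepoint failed: {operation['failure-cause']}")
            return operation["location"]
        time.sleep(poll_interval)
    raise TimeoutError("savepoint result was not available before the timeout")
```

A call would look like `cancel_with_savepoint("http://localhost:8081", "<job-id>", "file:///tmp/savepoints")`, where both the address and the target directory are placeholders. If the REST endpoint goes away before the GET request succeeds, the polling loop fails with a connection error instead of returning the location.
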
When running Flink in per-job mode, the system should not shut down if you have an asynchronous operation running whose result you have not yet queried. I believe that this feature was introduced with FLINK-10309 [1]. The semantics is that Flink waits 5 minutes or until the result has been queried (by any client) [2]. If this is not working, then this is clearly a bug.

FLINK-18663 [3] solved a bug where the cluster would hang while trying to shut it down. This was also a bug obviously.

[1] https://issues.apache.org/jira/browse/FLINK-10309
[2] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L141
[3] https://issues.apache.org/jira/browse/FLINK-18663

Cheers,
Till

On Fri, Aug 7, 2020 at 5:58 PM Eleanore Jin <eleanore....@gmail.com> wrote:

> +1 Thank you Fabian!
>
> On Fri, Aug 7, 2020 at 6:58 AM Fabian Paul <fabianp...@data-artisans.com> wrote:
>
> > Hi all,
> >
> > Due to recent changes in the shutdown mechanism of Flink [1] it is not
> > conveniently possible anymore to suspend a job running on a jobcluster
> > with a savepoint and retrieve the savepoint location via the Flink API
> > programmatically.
> >
> > With the introduced changes the rest endpoint shutdowns immediately
> > and rejects new request which makes the information inaccessible.
> >
> > Before the changes it was possible to stop the job and query the savepoint
> > info endpoint until the location was shown.
> > Admittedly, this was never a safe solution because it expected that the
> > rest endpoint stays alive long enough.
> >
> > I would like to see what the community thinks about this and whether it is
> > worth to implement a different solution to retrieve those information.
> >
> > Best,
> > Fabian
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-18663