[
https://issues.apache.org/jira/browse/FLINK-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chesnay Schepler closed FLINK-24053.
------------------------------------
Resolution: Not A Problem
This is not a bug.
When a stop-with-savepoint operation is triggered, the cluster delays its
shutdown until the result of the savepoint operation (i.e., the final
path the savepoint was written to) has been consumed through the [REST
API|https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/rest_api/#jobs-jobid-savepoints-triggerid].
This ensures that users have a time frame in which they are guaranteed
to be able to consume said result.
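
For illustration, consuming the result could look roughly like the sketch
below. It polls the GET /jobs/:jobid/savepoints/:triggerid endpoint from the
linked REST docs until the operation is reported as completed. The class and
method names are hypothetical, the response is assumed to carry a
"status": {"id": ...} field, the regex check is a simplistic stand-in for a
real JSON parser, and java.net.http requires Java 11 or newer.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink.
final class SavepointResultPoller {

    // Assumed response shape: {"status":{"id":"COMPLETED"},"operation":{...}}
    private static final Pattern STATUS_ID =
            Pattern.compile("\"status\"\\s*:\\s*\\{\\s*\"id\"\\s*:\\s*\"([A-Z_]+)\"");

    // Builds the savepoint status URL for a given job and trigger id.
    static String statusUrl(String restBase, String jobId, String triggerId) {
        return restBase + "/jobs/" + jobId + "/savepoints/" + triggerId;
    }

    // Returns true if the response body reports the operation as COMPLETED.
    static boolean isCompleted(String json) {
        Matcher m = STATUS_ID.matcher(json);
        return m.find() && "COMPLETED".equals(m.group(1));
    }

    // Polls once per second until the savepoint result is available,
    // then returns the raw response body (which contains the result).
    static String awaitResult(String restBase, String jobId, String triggerId)
            throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(statusUrl(restBase, jobId, triggerId)))
                .GET()
                .build();
        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            if (isCompleted(body)) {
                return body;
            }
            Thread.sleep(1000L);
        }
    }
}
```

A client that triggered the stop-with-savepoint would call
SavepointResultPoller.awaitResult("http://localhost:8081", jobId, triggerId)
with the trigger id it received, where the address is only a placeholder for
the actual REST endpoint. Once the result has been fetched, the cluster no
longer has to wait for it.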
> stop with savepoint timeout
> ---------------------------
>
> Key: FLINK-24053
> URL: https://issues.apache.org/jira/browse/FLINK-24053
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / REST
> Affects Versions: 1.11.0, 1.12.0, 1.13.0
> Reporter: 刘方奇
> Priority: Major
>
> Hello, when we use the "stop with savepoint" feature, we always run into a bug.
> The application always takes 5 minutes to shut down, and then it
> throws a timeout exception.
>
> {code:java}
> java.util.concurrent.TimeoutException: null
>     at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[classes/:?]
>     at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) ~[classes/:?]
>     at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$14(FutureUtils.java:445) ~[classes/:?]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_251]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_251]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_251]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_251]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_251]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_251]
>     at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_251]
> {code}
> We found that the call to
> org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers.SavepointStatusHandler.closeHandlerAsync()
> always times out, and that its timeout is set to 5 minutes.
> Our question is whether closing this handler matters at all, because it only
> serves another handler,
> org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers.StopWithSavepointHandler,
> which is already closed at that point. Should we skip this close?
> PS: We saw no problems when we tested code that skips this handler's
> close.
>
>