[jira] [Comment Edited] (FLINK-32104) stop-with-savepoint fails and times out with simple reproducible example

Weijie Guo (Jira) Mon, 15 May 2023 22:50:06 -0700


    [ https://issues.apache.org/jira/browse/FLINK-32104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17722993#comment-17722993 ]


Weijie Guo edited comment on FLINK-32104 at 5/16/23 5:49 AM:
-------------------------------------------------------------

Hi Kurt, 

I have tested the uploaded job without any code changes, and the native 
savepoint completed successfully on my own MacBook. I suspect that the 
timeout comes from the prolonged sleep in {{DemoMapFunction}}.

{{NumberSequenceSource}} finishes soon after the job is submitted, and the 
records (i.e. 1...20) accumulate in the downstream task's receive buffers. 
The savepoint barrier is inserted after the last record, so the records ahead 
of it can take up to {{20 * 5 = 100 s}} to process, but {{client.timeout}} 
defaults to only 60 s. If the savepoint is triggered before most of the data 
has been processed, there is a high probability that the request will time 
out. Have you tried increasing the timeout here?
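
To make the arithmetic above concrete, here is a rough sketch of the estimate. It is not part of Flink or of the uploaded job: the class and method names are hypothetical, and it assumes a single map subtask that sleeps 5 s per record, a savepoint barrier queued behind all remaining records, and the default {{client.timeout}} of 60 s. It also ignores the time the savepoint itself needs.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not a Flink API. */
final class SavepointTimeoutEstimate {

    /** Time one subtask needs to work through the records buffered ahead of the barrier. */
    static Duration drainTime(long bufferedRecords, Duration sleepPerRecord) {
        return sleepPerRecord.multipliedBy(bufferedRecords);
    }

    /** True if draining the buffered records alone already exceeds the client timeout. */
    static boolean wouldTimeOut(long bufferedRecords, Duration sleepPerRecord, Duration clientTimeout) {
        return drainTime(bufferedRecords, sleepPerRecord).compareTo(clientTimeout) > 0;
    }

    public static void main(String[] args) {
        Duration sleepPerRecord = Duration.ofSeconds(5);   // assumed sleep in DemoMapFunction
        Duration clientTimeout = Duration.ofSeconds(60);   // default client.timeout

        // Number of records still unprocessed when the stop command is issued.
        for (long remaining : new long[] {20, 12, 8}) {
            System.out.printf(
                    "remaining=%d drain=%ds timesOut=%b%n",
                    remaining,
                    drainTime(remaining, sleepPerRecord).getSeconds(),
                    wouldTimeOut(remaining, sleepPerRecord, clientTimeout));
        }
    }
}
```

Under these assumptions the request only has a chance of succeeding once 12 or fewer records remain, which would explain why the outcome depends on when the stop command is issued.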



was (Author: weijie guo):
Hi Kurt, 

I have tested the uploaded job without any code changes, and the native 
savepoint completed successfully on my own MacBook. I suspect that the 
timeout comes from the prolonged sleep in {{DemoMapFunction}}.

{{NumberSequenceSource}} finishes soon after the job is submitted, and the 
records (i.e. 1...20) accumulate in the downstream task's receive buffers. 
The savepoint barrier is inserted after the last record, so the records ahead 
of it can take up to {{20 * 5 = 100 s}} to process, but {{client.timeout}} 
defaults to only 60 s. If the savepoint is triggered before most of the data 
has been processed, there is a high probability that the request will time 
out.


> stop-with-savepoint fails and times out with simple reproducible example
> ------------------------------------------------------------------------
>
>                 Key: FLINK-32104
>                 URL: https://issues.apache.org/jira/browse/FLINK-32104
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream
>    Affects Versions: 1.17.0
>            Reporter: Kurt Ostfeld
>            Priority: Major
>
> I've put together a simple demo app that reproduces the issue with 
> instructions on how to reproduce:
> [https://github.com/kurtostfeld/flink-stop-issue]
>  
> The issue is that with a very simple Flink DataStream API application, the 
> `stop-with-savepoint` fails and times out like this:
>  
> {code:java}
> ./bin/flink stop --type native --savepointPath ../savepoints d69a952625497cca0665dfdcdb9f4718
> Suspending job "d69a952625497cca0665dfdcdb9f4718" with a NATIVE savepoint.
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.util.FlinkException: Could not stop with a savepoint job "d69a952625497cca0665dfdcdb9f4718".
>     at org.apache.flink.client.cli.CliFrontend.lambda$stop$4(CliFrontend.java:595)
>     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:1041)
>     at org.apache.flink.client.cli.CliFrontend.stop(CliFrontend.java:578)
>     at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1110)
>     at org.apache.flink.client.cli.CliFrontend.lambda$mainInternal$9(CliFrontend.java:1189)
>     at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
>     at org.apache.flink.client.cli.CliFrontend.mainInternal(CliFrontend.java:1189)
>     at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1157)
> Caused by: java.util.concurrent.TimeoutException
>     at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
>     at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
>     at org.apache.flink.client.cli.CliFrontend.lambda$stop$4(CliFrontend.java:591)
>     ... 7 more {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
