The logs would have helped us better understand what you were doing.

The stacktrace you shared indicates one of two things: either you asked
for the status of a savepoint creation that had already completed and was,
therefore, removed from the operations cache, or you used a job ID/request
ID pair that was not connected to any savepoint creation operation.
Operations are only cached for 300 seconds before being removed from the
cache. You can verify in the logs [1] that the specific operation expired
and was evicted by looking for a line like: "Evicted result with trigger
id {} because its TTL of {}s has expired."

But you should also be able to verify the completion of the savepoint in
the logs.
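
For illustration, the trigger-and-poll sequence against the REST API looks
roughly like this (the job ID, request id, and bucket/folder below are
placeholders; the status has to be queried within that 300s window):

curl -X POST -H "Content-Type: application/json" \
  -d '{"target-directory": "s3://<bucket>/<folder>", "cancel-job": false}' \
  http://localhost:8081/jobs/<job-id>/savepoints
# returns {"request-id":"<request-id>"}

curl http://localhost:8081/jobs/<job-id>/savepoints/<request-id>
# {"status":{"id":"IN_PROGRESS"},"operation":null} while running; once
# completed, the savepoint path appears under "operation" -> "location"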

[1]
https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L104

On Wed, Mar 31, 2021 at 4:46 PM Claude M <claudemur...@gmail.com> wrote:

> Thanks for your reply.  I'm using the Flink Docker
> image flink:1.12.2-scala_2.11-java8.  Yes, the folder was created in S3.  I
> took a look at the UI and it showed the following:
>
> *Latest Restore ID: 49*
> *Restore Time: 2021-03-31 09:37:43*
> *Type: Checkpoint*
> *Path: s3://<bucket>/<folder>/fcc82deebb4565f31a7f63989939c463/chk-49*
>
> However, this is different from the savepoint path I specified.  I
> specified the following:
>
> *s3://<bucket>/<folder>/savepoint2/savepoint-9fe457-504c312ffabe*
>
> Is there anything specific you're looking for in the logs?  I did not find
> any exceptions, and there is a lot of sensitive information I would have to
> extract from them.
>
> Also, this morning, I tried creating another savepoint.  It first showed
> a status of IN_PROGRESS:
>
> curl 
> http://localhost:8081/jobs/fcc82deebb4565f31a7f63989939c463/savepoints/4d19307dd99337257c4738871b1c63d8
> {"status":{"id":"IN_PROGRESS"},"operation":null}
>
> Then later when I tried to check the status, I saw the attached
> exception.
>
> In the UI, I see the following:
>
> *Latest Failed Checkpoint ID: 50*
> *Failure Time: 2021-03-31 09:34:43*
> *Cause: Asynchronous task checkpoint failed.*
>
> What does this failure mean?
>
>
> On Wed, Mar 31, 2021 at 9:22 AM Matthias Pohl <matth...@ververica.com>
> wrote:
>
>> Hi Claude,
>> thanks for reaching out to the Flink community. Could you provide the
>> Flink logs for this run to get a better understanding of what's going on?
>> Additionally, what exact Flink 1.12 version are you using? Did you also
>> verify that the snapshot was created by checking the actual folder?
>>
>> Best,
>> Matthias
>>
>> On Wed, Mar 31, 2021 at 4:56 AM Claude M <claudemur...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have Flink set up as an Application Cluster in Kubernetes, using Flink
>>> version 1.12.  I created a savepoint using the curl command, and the status
>>> indicated it was completed.  I then tried to relaunch the job from that
>>> savepoint using the following arguments, as indicated in the doc found
>>> here:
>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes
>>>
>>> args: ["standalone-job", "--job-classname", "<class-name>", "--job-id",
>>> "<job-id>", "--fromSavepoint", "s3://<bucket>/<folder>",
>>> "--allowNonRestoredState"]
>>>
>>> After the job launched, I checked the offsets, and they were not the same
>>> as when the savepoint was created.  The job ID passed in also does not
>>> match the job ID of the job that was launched.  I even put in an incorrect
>>> savepoint path to see what would happen: there were no errors in the logs,
>>> and the job still launched.  It seems these arguments are not even being
>>> evaluated.  Any ideas about this?
>>>
>>>
>>> Thanks
>>>
>>
