Re: Restoring from Flink Savepoint in Kubernetes not working

2021-04-01 Thread Matthias Pohl
The logs would have helped me better understand what you were doing.

The stacktrace you shared indicates one of two things: either you asked for
the status of a savepoint creation that had already completed and was,
therefore, removed from the operations cache, or you used a job ID/request ID
pair that was not connected to any savepoint creation operation.
Operations are only cached for 300 seconds before being evicted. You can
verify in the logs [1] whether the specific operation expired and was removed
from the cache; look for a message like: "Evicted result with trigger id {}
because its TTL of {}s has expired."

You should also be able to verify the completion of the savepoint in the
logs.

[1]
https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/async/CompletedOperationCache.java#L104
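As a side note, the response handling for that status endpoint can be sketched
like this (a minimal sketch; the helper names are mine, and the base URL, job
ID, and trigger ID are placeholders):

```python
import json

def savepoint_status(body: str) -> str:
    """Extract the status id from a savepoint-status REST response,
    e.g. '{"status":{"id":"IN_PROGRESS"},"operation":null}'."""
    return json.loads(body)["status"]["id"]

def status_url(base: str, job_id: str, trigger_id: str) -> str:
    # GET this URL to poll the operation; do it within the cache TTL
    # (300 seconds by default), or the trigger id will have been evicted
    # and the handler responds with the NotFoundException you saw.
    return f"{base}/jobs/{job_id}/savepoints/{trigger_id}"

# Example with the response shape from this thread:
print(savepoint_status('{"status":{"id":"IN_PROGRESS"},"operation":null}'))
```
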

On Wed, Mar 31, 2021 at 4:46 PM Claude M  wrote:

> Thanks for your reply.  I'm using the Flink Docker
> image flink:1.12.2-scala_2.11-java8.  Yes, the folder was created in S3.  I
> took a look at the UI and it showed the following:
>
> *Latest Restore ID: 49*
> *Restore Time: 2021-03-31 09:37:43*
> *Type: Checkpoint*
> *Path: s3:fcc82deebb4565f31a7f63989939c463/chk-49*
>
> However, this is different from the savepoint path I specified.  I
> specified the following:
>
> *s3:savepoint2/savepoint-9fe457-504c312ffabe*
>
> Is there anything specific you're looking for in the logs?  I did not find
> any exceptions, and there is a lot of sensitive information I would have to
> redact first.
>
> Also, this morning, I tried creating another savepoint.  It first showed
> it was In Progress.
>
> curl 
> http://localhost:8081/jobs/fcc82deebb4565f31a7f63989939c463/savepoints/4d19307dd99337257c4738871b1c63d8
> {"status":{"id":"IN_PROGRESS"},"operation":null}
>
> Then later when I tried to check the status, I saw the attached
> exception.
>
> In the UI, I see the following:
>
> *Latest Failed Checkpoint ID: 50*
> *Failure Time: 2021-03-31 09:34:43*
> *Cause: Asynchronous task checkpoint failed.*
>
> What does this failure mean?
>
>
> On Wed, Mar 31, 2021 at 9:22 AM Matthias Pohl 
> wrote:
>
>> Hi Claude,
>> thanks for reaching out to the Flink community. Could you provide the
>> Flink logs for this run to get a better understanding of what's going on?
>> Additionally, what exact Flink 1.12 version are you using? Did you also
>> verify that the snapshot was created by checking the actual folder?
>>
>> Best,
>> Matthias
>>
>> On Wed, Mar 31, 2021 at 4:56 AM Claude M  wrote:
>>
>>> Hello,
>>>
>>> I have Flink set up as an Application Cluster in Kubernetes, using Flink
>>> version 1.12.  I created a savepoint using the curl command and the status
>>> indicated it was completed.  I then tried to relaunch the job from that
>>> savepoint using the following arguments, as indicated in the doc found
>>> here:
>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes
>>>
>>> args: ["standalone-job", "--job-classname", "", "--job-id",
>>> "", "--fromSavepoint", "s3:///",
>>> "--allowNonRestoredState"]
>>>
>>> After the job launches, I check the offsets and they are not the same as
>>> when the savepoint was created.  The job id passed in also does not match
>>> the job id that was launched.  I even put an incorrect savepoint path to
>>> see what happens and there were no errors in the logs and the job still
>>> launches.  It seems these arguments are not even being evaluated.  Any
>>> ideas about this?
>>>
>>>
>>> Thanks
>>>
>>


Re: Restoring from Flink Savepoint in Kubernetes not working

2021-03-31 Thread Claude M
Thanks for your reply.  I'm using the Flink Docker
image flink:1.12.2-scala_2.11-java8.  Yes, the folder was created in S3.  I
took a look at the UI and it showed the following:

*Latest Restore ID: 49*
*Restore Time: 2021-03-31 09:37:43*
*Type: Checkpoint*
*Path: s3:fcc82deebb4565f31a7f63989939c463/chk-49*

However, this is different from the savepoint path I specified.  I
specified the following:

*s3:savepoint2/savepoint-9fe457-504c312ffabe*

Is there anything specific you're looking for in the logs?  I did not find
any exceptions, and there is a lot of sensitive information I would have to
redact first.

Also, this morning, I tried creating another savepoint.  It first showed it
was In Progress.

curl 
http://localhost:8081/jobs/fcc82deebb4565f31a7f63989939c463/savepoints/4d19307dd99337257c4738871b1c63d8
{"status":{"id":"IN_PROGRESS"},"operation":null}

Then later when I tried to check the status, I saw the attached exception.

In the UI, I see the following:

*Latest Failed Checkpoint ID: 50*
*Failure Time: 2021-03-31 09:34:43*
*Cause: Asynchronous task checkpoint failed.*

What does this failure mean?
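For reference, the trigger-then-poll sequence I'm attempting can be sketched
as below (a minimal sketch; only the request/response handling is shown so it
can be checked offline, and the helper names are mine). It assumes the
standard REST endpoints, where POST /jobs/<job-id>/savepoints returns a
request id that the status endpoint then expects:

```python
import json
# Against a live cluster, urllib.request would issue the POST and GET calls;
# here only the response handling is shown.

def trigger_request_id(body: str) -> str:
    """The savepoint trigger response has the shape {"request-id": "..."};
    that id is the one to use when polling the status endpoint."""
    return json.loads(body)["request-id"]

def is_done(status_body: str) -> bool:
    # The operation stays queryable only until the server-side cache TTL
    # expires, so polling should happen promptly after triggering.
    return json.loads(status_body)["status"]["id"] == "COMPLETED"

# Example with the request id from this thread:
print(trigger_request_id('{"request-id":"4d19307dd99337257c4738871b1c63d8"}'))
```
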


On Wed, Mar 31, 2021 at 9:22 AM Matthias Pohl 
wrote:

> Hi Claude,
> thanks for reaching out to the Flink community. Could you provide the
> Flink logs for this run to get a better understanding of what's going on?
> Additionally, what exact Flink 1.12 version are you using? Did you also
> verify that the snapshot was created by checking the actual folder?
>
> Best,
> Matthias
>
> On Wed, Mar 31, 2021 at 4:56 AM Claude M  wrote:
>
>> Hello,
>>
>> I have Flink set up as an Application Cluster in Kubernetes, using Flink
>> version 1.12.  I created a savepoint using the curl command and the status
>> indicated it was completed.  I then tried to relaunch the job from that
>> savepoint using the following arguments, as indicated in the doc found
>> here:
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes
>>
>> args: ["standalone-job", "--job-classname", "", "--job-id",
>> "", "--fromSavepoint", "s3:///",
>> "--allowNonRestoredState"]
>>
>> After the job launches, I check the offsets and they are not the same as
>> when the savepoint was created.  The job id passed in also does not match
>> the job id that was launched.  I even put an incorrect savepoint path to
>> see what happens and there were no errors in the logs and the job still
>> launches.  It seems these arguments are not even being evaluated.  Any
>> ideas about this?
>>
>>
>> Thanks
>>
>
{"errors":["org.apache.flink.runtime.rest.NotFoundException: Operation not
found under key:
org.apache.flink.runtime.rest.handler.job.AsynchronousJobOperationKey@4b261c41
	at org.apache.flink.runtime.rest.handler.async.AbstractAsynchronousOperationHandlers$StatusHandler.handleRequest(AbstractAsynchronousOperationHandlers.java:182)
	at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.handleRequest(SavepointHandlers.java:219)
	at org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
	at org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195)
	at org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
	at org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
	at org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
	at org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at org.apache.flink.runtime.rest.handler.router.RouterHandler.routed(RouterHandler.java:115)
	at org.apache.flink.runtime.rest.handler.router.RouterHandler.channelRead0(RouterHandler.java:94)
	at org.apache.flink.runtime.rest.handler.router.RouterHandler.channelRead0(RouterHandler.java:55)
	at org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)

Re: Restoring from Flink Savepoint in Kubernetes not working

2021-03-31 Thread Matthias Pohl
Hi Claude,
thanks for reaching out to the Flink community. Could you provide the Flink
logs for this run to get a better understanding of what's going on?
Additionally, what exact Flink 1.12 version are you using? Did you also
verify that the snapshot was created by checking the actual folder?

Best,
Matthias

On Wed, Mar 31, 2021 at 4:56 AM Claude M  wrote:

> Hello,
>
> I have Flink set up as an Application Cluster in Kubernetes, using Flink
> version 1.12.  I created a savepoint using the curl command and the status
> indicated it was completed.  I then tried to relaunch the job from that
> savepoint using the following arguments, as indicated in the doc found
> here:
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes
>
> args: ["standalone-job", "--job-classname", "", "--job-id",
> "", "--fromSavepoint", "s3:///",
> "--allowNonRestoredState"]
>
> After the job launches, I check the offsets and they are not the same as
> when the savepoint was created.  The job id passed in also does not match
> the job id that was launched.  I even put an incorrect savepoint path to
> see what happens and there were no errors in the logs and the job still
> launches.  It seems these arguments are not even being evaluated.  Any
> ideas about this?
>
>
> Thanks
>


Restoring from Flink Savepoint in Kubernetes not working

2021-03-30 Thread Claude M
Hello,

I have Flink set up as an Application Cluster in Kubernetes, using Flink
version 1.12.  I created a savepoint using the curl command and the status
indicated it was completed.  I then tried to relaunch the job from that
savepoint using the following arguments, as indicated in the doc found
here:
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes

args: ["standalone-job", "--job-classname", "", "--job-id",
"", "--fromSavepoint", "s3:///",
"--allowNonRestoredState"]

After the job launches, I check the offsets and they are not the same as
when the savepoint was created.  The job id passed in also does not match
the job id that was launched.  I even put an incorrect savepoint path to
see what happens and there were no errors in the logs and the job still
launches.  It seems these arguments are not even being evaluated.  Any
ideas about this?
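For context, the container spec I'm working from looks roughly like the
following (a sketch only; the class name, job ID, and bucket are placeholders
standing in for my real values, which the archive stripped above):

```yaml
spec:
  containers:
    - name: jobmanager
      image: flink:1.12.2-scala_2.11-java8
      args: ["standalone-job",
             "--job-classname", "com.example.MyJob",
             "--job-id", "00000000000000000000000000000001",
             "--fromSavepoint", "s3://my-bucket/savepoints/savepoint-9fe457-504c312ffabe",
             "--allowNonRestoredState"]
```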


Thanks