Hi Jeffrey

Thanks for your willingness to try this un-merged PR. Would you please share 
your exception stack trace and your proto file? If you could also put together 
a small job that reproduces this bug and share it with me, that would make the 
problem much easier to fix.


Best
Yun Tang

From: Yu Li <car...@gmail.com>
Date: Friday, November 15, 2019 at 1:29 AM
To: dev <dev@flink.apache.org>, Yun Tang <myas...@live.com>
Subject: Re: flink checkpoint state data corruption

@Yun Tang: As the author of the referenced PR, it would be great if you could 
help take a look here. Thanks.

@Jeffrey:
For your second question: with an officially released Flink, once a checkpoint 
has completed successfully, it is safe to restore from [1]. The issue you 
encountered was probably caused by a bug in the un-merged PR, so let's wait 
for the author's answer.

Hope this helps.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/internals/stream_checkpointing.html

Best Regards,
Yu


On Thu, 14 Nov 2019 at 06:27, Jeffrey Martin <jeffrey.martin...@gmail.com> 
wrote:
Hi all,

I'm using protobufs as keys in a Flink stream, with code copied from this pull 
request <https://github.com/apache/flink/pull/7598>, but deserialization is 
failing after checkpoint restore due to missing data.

I'm using HDFS and the RocksDB backend. I tried providing the path to a 
previous retained checkpoint (i.e., 
.../flink-checkpoints/{jobIdHex}/chk-14/_metadata). The proto deserializer 
failed on a serialized record that had been truncated and was missing its last 
20 bytes out of 77 total.

The same serializers work fine if I don't try restoring from a checkpoint, 
worked fine for a different job, are fairly well unit-tested, and mostly just 
delegate to the protobuf serde code, so I'm pretty certain my serializer is not 
the issue. That means I'm probably doing something else wrong.

Questions:
1. Have others encountered issues like this?
2. How do I know when a checkpoint has been completed and is safe to restore 
from? (Is checkpoint completion atomic?)

Thanks,

Jeff Martin
