[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789409#comment-17789409 ]

Rui Fan commented on FLINK-27681:
---------------------------------

Thanks for the quick response.
{quote}Strictly speaking, I think it is possible for file corruption to occur 
during the process of uploading and downloading to local. It might be better if 
Flink can add the file verification mechanism during Checkpoint upload and 
download processes. But as far as I know, most DFSs have data verification 
mechanisms, so at least we have not encountered this situation in our 
production environment. Most file corruption occurs before being uploaded to 
HDFS.
{quote}
Sorry, my previous explanation may not have been clear.

Yeah, most DFSs have data verification mechanisms. My concern is described in this 
comment: 
[https://github.com/apache/flink/pull/23765#discussion_r1404136470]
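
To make the concern concrete, the kind of verification I have in mind looks 
roughly like the sketch below. This is only an illustration; ChecksummedUpload / 
uploadWithChecksum are made-up names, not existing Flink classes, and where the 
checksum would be stored is left open:
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

/** Hypothetical helper, not an existing Flink class. */
public final class ChecksummedUpload {

    /**
     * Streams the local SST file to the remote stream and returns the CRC32 of
     * the bytes that were sent, so the same value can be checked again after a
     * later download (or compared against a checksum taken at SST creation time).
     */
    public static long uploadWithChecksum(Path localSst, OutputStream remote) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = new CheckedInputStream(Files.newInputStream(localSst), crc)) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            while ((read = in.read(buffer)) != -1) {
                remote.write(buffer, 0, read);
            }
        }
        return crc.getValue();
    }
}
{code}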

 
{quote}Under the default Rocksdb option, after a damaged SST is created, if 
there is no Compaction or Get/Iterator to access this file, DB can always run 
normally. 
{quote}
If so, when such corruption is discovered, is it reasonable to only let the 
checkpoint fail?

From a technical perspective, a solution with less impact on the job might be:
 * If this file was uploaded to HDFS before, Flink can try to re-download it so 
that RocksDB becomes healthy again (see the sketch after this list).
 * If this file hasn't been uploaded to HDFS yet, the Flink job should fail 
directly, right?
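
A purely illustrative sketch of that decision logic (none of these names exist in 
Flink; RemoteStore just stands in for the shared-state / HDFS access):
{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative only; not existing Flink code. */
public final class CorruptedSstHandler {

    /** Minimal stand-in for "was this file uploaded before?" / "download it again". */
    public interface RemoteStore {
        boolean contains(String fileName) throws IOException;
        InputStream open(String fileName) throws IOException;
    }

    /**
     * If a healthy copy of the corrupted SST already exists remotely, restore it
     * locally so RocksDB becomes healthy again; otherwise fail fast instead of
     * letting a later read of the bad block crash the job much later.
     */
    public static void handleCorruptedSst(RemoteStore store, Path localDbDir, String fileName)
            throws IOException {
        if (store.contains(fileName)) {
            try (InputStream healthyCopy = store.open(fileName)) {
                Files.copy(healthyCopy, localDbDir.resolve(fileName),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        } else {
            throw new IOException(
                    "Corrupted SST " + fileName + " was never uploaded; fail the job directly.");
        }
    }
}
{code}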

If we only fail the current checkpoint and 
execution.checkpointing.tolerable-failed-checkpoints > 0, the job will continue 
to run. The Flink job will then fail later (when the corrupted file is actually 
read), recover from the latest checkpoint, and consume more duplicate data than 
if it had failed directly.
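
For reference, this is the setting I mean (these are real Flink APIs, only the 
interval and tolerance values are arbitrary):
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableFailuresExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // checkpoint every 60 seconds
        // With a tolerance > 0, a checkpoint that fails because of a corrupted SST
        // is simply counted as a failed checkpoint and the job keeps running on the
        // bad local state, only to fail later when the corrupted file is read.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
{code}
The same value can also be set via execution.checkpointing.tolerable-failed-checkpoints 
in the configuration.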

Please correct me if I misunderstood anything, thanks~

Also, I'd like to cc [~pnowojski] , he might also be interested in this JIRA.

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match or 
> block verification fails when the job is restored. The cause is generally a 
> problem with the machine where the task runs, which results in incorrect files 
> being uploaded to HDFS, but by the time we discover the problem a long time 
> (a dozen minutes to half an hour) has already passed. I'm not sure if anyone 
> else has had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time, 
> once the maximum number of retained checkpoints is exceeded we have no choice 
> but to keep using this file until it is no longer referenced. When the job 
> fails, it cannot be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and find the 
> problem in time?
> 2. Can Flink automatically roll back to the previous checkpoint when there is 
> a problem with the checkpoint data? Even with manual intervention, we can only 
> try to recover from an existing checkpoint or discard the entire state.
> 3. Can we limit the number of references to a file based on the maximum number 
> of retained checkpoints? When the number of references exceeds the maximum 
> number of retained checkpoints - 1, the Task side would be required to upload 
> a new copy of the file. I'm not sure whether this would guarantee that the 
> newly uploaded file is correct.
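
Regarding proposal 1 in the quoted description: RocksDB can re-verify the 
checksums of all live SST files on demand, so a periodic check could look roughly 
like the sketch below. This is only an illustration, not Flink code, and it 
assumes the RocksJava binding exposes verifyChecksum() the same way the C++ 
DB::VerifyChecksum() does:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

/** Illustrative sketch only; not part of Flink. */
public final class PeriodicSstVerifier implements AutoCloseable {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public PeriodicSstVerifier(RocksDB db, Consumer<Throwable> onCorruption, long intervalMinutes) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Scans the live SST files and validates their block checksums.
                db.verifyChecksum();
            } catch (RocksDBException e) {
                // Report the corruption right away (e.g. fail the task or decline the
                // checkpoint) instead of waiting for a compaction or read to hit it.
                onCorruption.accept(e);
            }
        }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
{code}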



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
