ming li created FLINK-27681:
-------------------------------

             Summary: Improve the availability of Flink when the RocksDB file is corrupted.
                 Key: FLINK-27681
                 URL: https://issues.apache.org/jira/browse/FLINK-27681
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / State Backends
            Reporter: ming li


We have encountered several cases where the RocksDB checksum does not match or 
block verification fails when a job is restored. The usual cause is some problem 
on the machine where the task runs, which results in corrupted files being 
uploaded to HDFS; by the time we discovered the problem, a long time (a dozen 
minutes to half an hour) had already passed. I'm not sure if anyone else has run 
into a similar problem.

Because such a file is referenced by incremental checkpoints for a long time, 
even after the maximum number of retained checkpoints is exceeded, we have no 
choice but to keep relying on it until it is no longer referenced. As a result, 
once the job fails, it cannot be recovered from any of the retained checkpoints.
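
To make the sharing behaviour concrete, below is a minimal, simplified sketch of 
the reference counting that keeps a shared SST file alive across retained 
incremental checkpoints. It is only an illustration; the class and method names 
are made up for this example and are not Flink's actual shared-state registry:

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Simplified illustration of why a shared SST file stays alive across retained
 * incremental checkpoints. All names are made up for this example; Flink's real
 * shared-state handling is more involved.
 */
public class SharedSstRegistry {

    /** Reference count per shared SST file key (e.g. "005123.sst"). */
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Each new incremental checkpoint registers every SST file it reuses or uploads. */
    public void register(String sstFileKey) {
        refCounts.merge(sstFileKey, 1, Integer::sum);
    }

    /** When an old checkpoint is subsumed, its references are released. */
    public void unregister(String sstFileKey) {
        Integer count = refCounts.get(sstFileKey);
        if (count == null) {
            return;
        }
        if (count <= 1) {
            // Only when the count drops to zero may the file on HDFS be discarded;
            // until then every retained checkpoint depends on the same
            // (possibly corrupted) bytes.
            refCounts.remove(sstFileKey);
        } else {
            refCounts.put(sstFileKey, count - 1);
        }
    }

    public int referenceCount(String sstFileKey) {
        return refCounts.getOrDefault(sstFileKey, 0);
    }
}
{code}

With, say, 3 retained checkpoints, a file that every new incremental checkpoint 
keeps re-registering never drops to zero references, so one corrupted upload 
silently poisons every retained checkpoint.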

Therefore we are considering:
1. Could RocksDB periodically check whether all files are correct, so that the 
problem is found in time? (A sketch of such a check is given after this list.)
2. Could Flink automatically roll back to the previous checkpoint when there is 
a problem with the checkpoint data? Even with manual intervention, all we can do 
is try to recover from an existing checkpoint or discard the entire state.
3. Could we cap the number of references to a file based on the maximum number 
of retained checkpoints? Once the number of references exceeds (maximum retained 
checkpoints - 1), the Task side would be required to upload a new copy of the 
file for that reference. We are not sure whether this would guarantee that the 
newly uploaded file is correct.
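
Regarding point 1, here is a minimal sketch of what such a check could look like 
on the Task side, assuming the RocksJava SstFileReader#verifyChecksum API is 
available in the RocksDB build bundled with Flink (scheduling, directory layout 
and error handling are left out; all names here are illustrative only):

{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDBException;
import org.rocksdb.SstFileReader;

import java.io.File;

/**
 * Illustrative check for point 1: walk the local SST files of a RocksDB
 * instance and let RocksDB verify the block checksums, so corruption is
 * detected before the files are shared with further incremental checkpoints.
 */
public class SstFileVerifier {

    /** Returns true if every .sst file under dbPath passes checksum verification. */
    public static boolean verifyAllSstFiles(File dbPath) {
        File[] sstFiles = dbPath.listFiles((dir, name) -> name.endsWith(".sst"));
        if (sstFiles == null) {
            return true; // nothing to verify
        }
        try (Options options = new Options()) {
            for (File sst : sstFiles) {
                try (SstFileReader reader = new SstFileReader(options)) {
                    reader.open(sst.getAbsolutePath());
                    reader.verifyChecksum(); // throws RocksDBException on corruption
                } catch (RocksDBException e) {
                    // The corrupted file is found right away instead of at restore time.
                    System.err.println("Corrupted SST file " + sst + ": " + e.getMessage());
                    return false;
                }
            }
        }
        return true;
    }
}
{code}

Such a check could for example run before the files are handed to the checkpoint 
upload, so a bad local file is never referenced by later checkpoints. This is 
only meant to illustrate the idea, not a concrete design.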



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
