[ https://issues.apache.org/jira/browse/FLINK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790010#comment-17790010 ]

Rui Fan commented on FLINK-27681:
---------------------------------

{quote} * That's why we'd like to focus on not uploading the corrupted files
(and also on simply failing the job, to make it restore from the last complete
checkpoint).{quote}
Failing the job directly is fine with me, but I believe the PR doesn't fail the
job, it just fails the current checkpoint, right?
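
For context: whether a failed checkpoint also fails the job is governed by
{{CheckpointConfig#setTolerableCheckpointFailureNumber}}. A minimal sketch
(checkpoint interval chosen arbitrarily):
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableFailureDemo {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // checkpoint every 60s (arbitrary)

        // With 0 tolerable failures (the default), a failed checkpoint triggers
        // a job failover; a value N > 0 lets the job survive N consecutive
        // checkpoint failures, failing only the checkpoints themselves.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
    }
}
{code}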

 
{quote}File corruption will not affect the read path because the checksum will
be checked when reading a RocksDB block. The job will fail over when it reads
the corrupted one.
{quote}
If the checksum is verified on every read, can we assume the check is very
cheap? If so, could we enable it directly, without any option? Hey
[~mayuehappy], could you provide a simple benchmark here?
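
For reference, the per-block check on the read path corresponds to RocksDB's
{{verify_checksums}} read option; a minimal RocksJava sketch (DB path
hypothetical):
{code:java}
import org.rocksdb.Options;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class ChecksumReadDemo {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options opts = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(opts, "/tmp/rocksdb-demo"); // hypothetical path
             // verify_checksums is already on by default in RocksDB
             ReadOptions readOpts = new ReadOptions().setVerifyChecksums(true)) {
            db.put("k".getBytes(), "v".getBytes());
            // Each block read under these options re-validates the block checksum;
            // a corrupted block surfaces as a RocksDBException instead of bad data.
            byte[] value = db.get(readOpts, "k".getBytes());
            System.out.println(new String(value));
        }
    }
}
{code}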

> Improve the availability of Flink when the RocksDB file is corrupted.
> ---------------------------------------------------------------------
>
>                 Key: FLINK-27681
>                 URL: https://issues.apache.org/jira/browse/FLINK-27681
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Ming Li
>            Assignee: Yue Ma
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: image-2023-08-23-15-06-16-717.png
>
>
> We have encountered several cases where the RocksDB checksum does not match
> or block verification fails when the job is restored. The cause is generally
> a problem with the machine where the task is located, which makes the files
> uploaded to HDFS incorrect, but by the time we discovered the problem a long
> time had passed (from a dozen minutes to half an hour). I'm not sure if
> anyone else has had a similar problem.
> Since such a file is referenced by incremental checkpoints for a long time,
> once the maximum number of retained checkpoints is exceeded, we can only keep
> using this file until it is no longer referenced. When the job fails, it
> cannot be recovered.
> Therefore we consider:
> 1. Can RocksDB periodically check whether all files are correct and find the
> problem in time (see the sketch after this description)?
> 2. Can Flink automatically roll back to the previous checkpoint when there is
> a problem with the checkpoint data? Even with manual intervention, one can
> only try to recover from an existing checkpoint or discard the entire state.
> 3. Can we cap the number of references to a file based on the maximum number
> of retained checkpoints? When the number of references exceeds the maximum
> number of retained checkpoints minus one, the Task side would be required to
> upload a new file for this reference. I'm not sure whether this guarantees
> that the newly uploaded file will be correct.
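
Regarding proposal 1 in the quoted description: RocksDB exposes a whole-DB
checksum scan that a background task could invoke periodically; a minimal
sketch, assuming the RocksJava {{verifyChecksum()}} binding and an arbitrary
interval:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PeriodicChecksumVerifier {
    // Periodically runs a full checksum scan of all live SST files so that
    // corruption is caught early rather than on some later restore.
    public static ScheduledExecutorService start(RocksDB db, Consumer<Throwable> onCorruption) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                db.verifyChecksum(); // re-reads and validates every block of every live file
            } catch (RocksDBException e) {
                // e.g. fail the task so it restores from an older, complete checkpoint
                onCorruption.accept(e);
            }
        }, 10, 10, TimeUnit.MINUTES); // interval is an arbitrary choice for the sketch
        return scheduler;
    }
}
{code}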



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
