Hi devs,

I'm opening this thread to discuss FLIP-414: Support Retry Mechanism in RocksDBStateDataTransfer [1].

Currently, there is no retry mechanism for downloading and uploading RocksDB state files, so any transient instability in the remote filesystem can cause a checkpoint to fail. By supporting a retry mechanism in `RocksDBStateDataTransfer`, we can significantly reduce the checkpoint failure rate during the asynchronous phase.

To make the retry mechanism configurable, this FLIP introduces two options: `state.backend.rocksdb.checkpoint.transfer.retry.times` and `state.backend.rocksdb.checkpoint.transfer.retry.interval`. By default, no retry is performed, which keeps the behavior consistent with the current one.

Looking forward to your feedback, thanks.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-414%3A+Support+Retry+Mechanism+in+RocksDBStateDataTransfer

Best regards,
Xiangyu Feng
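
For illustration only, here is a minimal sketch of the kind of retry loop the two options could drive. It is not the FLIP's implementation: the class `TransferRetrySketch` and the helper `runWithRetry` are hypothetical names, and the sketch assumes that `retry.times` counts retries after the first attempt (so 0 means no retry), that `retry.interval` is a fixed wait between attempts, and that only `IOException` is treated as retryable.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical sketch, not part of Flink or of the FLIP's actual code.
class TransferRetrySketch {

    /**
     * Runs a transfer (upload or download) and retries it on IOException.
     *
     * @param transfer      the transfer action to run
     * @param retryTimes    assumed meaning: retries after the first attempt (0 = no retry)
     * @param retryInterval assumed meaning: fixed wait between two attempts
     */
    static <T> T runWithRetry(Callable<T> transfer, int retryTimes, Duration retryInterval)
            throws Exception {
        if (retryTimes < 0) {
            throw new IllegalArgumentException("retryTimes must be >= 0");
        }
        IOException last = null;
        // One initial attempt plus up to retryTimes retries.
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                return transfer.call();
            } catch (IOException e) {
                last = e;
                // Wait only if another attempt will follow.
                if (attempt < retryTimes) {
                    Thread.sleep(retryInterval.toMillis());
                }
            }
        }
        // All attempts failed: surface the last I/O failure to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated transfer that fails once before succeeding.
        String handle = runWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new IOException("simulated remote filesystem hiccup");
            }
            return "state-handle";
        }, 3, Duration.ofMillis(10));
        System.out.println(handle + " after " + calls[0] + " attempts");
    }
}
```

With `retryTimes` set to 0 the loop makes exactly one attempt and rethrows the failure, which matches the no-retry default described above.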