[ https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878970#comment-17878970 ]

Tsz-wo Sze commented on RATIS-2148:
-----------------------------------

That's great, thanks!

> GRPC streaming snapshot transfer may cause followers to trigger
> reloadStateMachine incorrectly
> ----------------------------------------------------------------------------------------------
>
>                 Key: RATIS-2148
>                 URL: https://issues.apache.org/jira/browse/RATIS-2148
>             Project: Ratis
>          Issue Type: Bug
>          Components: snapshot
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: yuuka
>            Priority: Major
>         Attachments: image-2024-09-03-14-24-25-652.png,
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png,
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png,
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>
> The gRPC streaming snapshot sender sends all requests at once, performs
> error handling only after everything has been sent, and uses the last
> snapshot request as the completion flag. As a result, the last request can
> be received successfully even though an earlier request has failed. The
> sender handles the failure event when it retransmits the snapshot. The
> receiver, however, triggers state.reloadStateMachine because it received
> the last request successfully, even though the snapshot is incomplete.
>
> An MD5 mismatch exception occurred before the last SnapshotRequest was
> received:
> !image-2024-09-03-14-27-39-406.png!
>
> The last snapshot request then arrived, was received successfully, and
> updated the index:
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>
> However, the snapshot was received incompletely, and reloadStateMachine was
> still triggered:
> !image-2024-09-03-14-33-49-573.png!
>
> I suggest using a flag to record whether the snapshot request as a whole
> has hit an error. Once an exception occurs, the remaining content of the
> request is not processed.
> Alternatively, the sender could wait for the receiver's reply and resend
> if the reply reports an error.
>
> Finally, the current retry granularity is the entire snapshot directory
> rather than a single chunk, so one failure causes a large number of
> snapshot files to be sent again. This can be optimized later.
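
The flag suggested in the description could look roughly like the receiver-side sketch below. It is only an illustration under stated assumptions: SnapshotStreamReceiver, Chunk, Result, onChunk and reset are hypothetical names, not Ratis APIs, and the checksumOk field stands in for the MD5 check mentioned in the report. The point it shows is that once any chunk fails, later chunks (including the last one) are skipped, so the final request can no longer act as a completion flag for an incomplete snapshot.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical receiver-side sketch; none of these names are Ratis APIs. */
class SnapshotStreamReceiver {

    /** Hypothetical stand-in for one snapshot chunk request. */
    static class Chunk {
        final int index;
        final boolean checksumOk; // stands in for the result of the MD5 check
        final boolean last;       // true only on the final request of a transfer

        Chunk(int index, boolean checksumOk, boolean last) {
            this.index = index;
            this.checksumOk = checksumOk;
            this.last = last;
        }
    }

    enum Result { ACCEPTED, FAILED, SKIPPED, COMPLETED }

    private boolean failed = false;   // the per-transfer error flag
    private boolean reloaded = false; // stand-in for "reloadStateMachine was called"
    private final List<Integer> accepted = new ArrayList<>();

    /** Handles one chunk; after the first failure every later chunk is skipped. */
    Result onChunk(Chunk chunk) {
        if (failed) {
            return Result.SKIPPED; // even the last chunk must not complete the transfer
        }
        if (!chunk.checksumOk) {
            failed = true;
            return Result.FAILED;
        }
        accepted.add(chunk.index);
        if (chunk.last) {
            reloaded = true; // only reached when no earlier chunk failed
            return Result.COMPLETED;
        }
        return Result.ACCEPTED;
    }

    /** Clears the flag and partial state before the sender retransmits the snapshot. */
    void reset() {
        failed = false;
        reloaded = false;
        accepted.clear();
    }

    boolean isReloaded() {
        return reloaded;
    }

    public static void main(String[] args) {
        SnapshotStreamReceiver receiver = new SnapshotStreamReceiver();
        System.out.println(receiver.onChunk(new Chunk(0, true, false)));
        System.out.println(receiver.onChunk(new Chunk(1, false, false)));
        System.out.println(receiver.onChunk(new Chunk(2, true, true)));
        System.out.println("reloaded: " + receiver.isReloaded());
    }
}
```

In this sketch a retransmission starts with reset(), which matches the current whole-directory retry described above; retrying a single chunk would need additional per-chunk bookkeeping that is not shown here.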