[ 
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878970#comment-17878970
 ] 

Tsz-wo Sze commented on RATIS-2148:
-----------------------------------

That's great, thanks!

> GRPC streaming snapshot transfer may cause followers to trigger 
> reloadStateMachine incorrectly
> ----------------------------------------------------------------------------------------------
>
>                 Key: RATIS-2148
>                 URL: https://issues.apache.org/jira/browse/RATIS-2148
>             Project: Ratis
>          Issue Type: Bug
>          Components: snapshot
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: yuuka
>            Priority: Major
>         Attachments: image-2024-09-03-14-24-25-652.png, 
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, 
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, 
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>
> GRPC streaming snapshot transfer sends all requests at once and performs
> error handling only after everything has been sent, and the last snapshot
> request is used as the completion flag. As a result, the last request may be
> received successfully even though an earlier request has failed. The sender
> handles the failure event when it retransmits the snapshot. The receiver,
> however, triggers state.reloadStateMachine because it successfully received
> the last request, even though the snapshot reception is incomplete.
>  
> An MD5 mismatch exception occurred before the last SnapshotRequest was
> received:
> !image-2024-09-03-14-27-39-406.png!
>  
> The last snapshot request then arrived, was received successfully, and the
> index was updated.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>  
> However, the snapshot reception is incomplete, and reloadStateMachine is
> still triggered.
> !image-2024-09-03-14-33-49-573.png!
>  
> I suggest using a flag to record whether the snapshot transfer as a whole
> has failed.
> If an exception occurs, the remaining requests of that transfer will not be
> processed.
> Alternatively, the sender could wait for the receiver's reply. If the reply
> reports an error, the sender resends the request.
>  
> Finally, retries currently happen at the level of the entire snapshot
> directory rather than a single chunk, so a large number of snapshot files are
> sent repeatedly. This can be optimized later.
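
The flag-based suggestion in the description could look roughly like the
following receiver-side sketch. It is only an illustration under stated
assumptions: SnapshotTransferSketch, Chunk, Receiver, Result, and the
transferId field are hypothetical names, not Ratis classes, and the corrupt
boolean stands in for a real MD5 check.

```java
/** Hypothetical sketch of a per-transfer failure flag; not the Ratis API. */
final class SnapshotTransferSketch {

  /** One streamed snapshot request. All fields are illustrative assumptions. */
  static final class Chunk {
    final String transferId; // assumed id shared by all chunks of one attempt
    final int index;
    final boolean corrupt;   // stands in for an MD5 mismatch
    final boolean last;      // completion flag carried by the final chunk

    Chunk(String transferId, int index, boolean corrupt, boolean last) {
      this.transferId = transferId;
      this.index = index;
      this.corrupt = corrupt;
      this.last = last;
    }
  }

  enum Result { ACCEPTED, FAILED, SKIPPED, COMPLETE }

  static final class Receiver {
    private String currentTransferId;
    private boolean failed;   // true once any chunk of this transfer failed
    private int reloadCount;  // counts simulated reloadStateMachine calls

    Result onChunk(Chunk c) {
      if (!c.transferId.equals(currentTransferId)) {
        // A new attempt (for example the sender's retry) starts clean.
        currentTransferId = c.transferId;
        failed = false;
      }
      if (failed) {
        // Ignore everything after a failure, including the last chunk.
        return Result.SKIPPED;
      }
      if (c.corrupt) {
        failed = true;
        return Result.FAILED;
      }
      if (c.last) {
        // Only a transfer with no failed chunk may trigger the reload.
        reloadCount++;
        return Result.COMPLETE;
      }
      return Result.ACCEPTED;
    }

    int getReloadCount() {
      return reloadCount;
    }
  }
}
```

With this shape, a last chunk that arrives after a failed chunk is skipped
instead of completing the snapshot, so the reload happens only after a retry
in which every chunk was accepted.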



