[ 
https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878848#comment-17878848
 ] 

yuuka commented on RATIS-2147:
------------------------------

[~szetszwo], I have submitted a pull request.

> MD5 mismatch when accept snapshot
> ---------------------------------
>
>                 Key: RATIS-2147
>                 URL: https://issues.apache.org/jira/browse/RATIS-2147
>             Project: Ratis
>          Issue Type: Bug
>          Components: snapshot
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: yuuka
>            Priority: Major
>         Attachments: image-2024-09-03-10-35-08-315.png, 
> image-2024-09-03-10-35-28-617.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We encountered an MD5 mismatch in IoTDB. After several rounds of 
> investigation, we found that the digester had been contaminated.
>  
> We have verified that neither the network nor the disk is at fault.
>  
> In the implementation, a received snapshot is first written to a temporary 
> file. On an MD5 mismatch, we read the data back from this temporary file 
> and compute its MD5 with a new digest. The result of this calculation is 
> the same as the MD5 hash value that was sent.
> !image-2024-09-03-10-35-28-617.png!
>  
> !image-2024-09-03-10-35-08-315.png!
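
The check described above can be sketched roughly as follows. This is a minimal illustration, not the Ratis or IoTDB code: the class name `RecomputeMd5` and the methods `md5OfFile` and `matchesSentDigest` are hypothetical, and only the standard `java.security.MessageDigest` API is assumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Ratis: re-hash a temporary file with a new digest.
public class RecomputeMd5 {

  /** Reads the whole file through a brand-new MD5 digest and returns the hash. */
  static byte[] md5OfFile(Path file) throws IOException, NoSuchAlgorithmException {
    // A new instance carries no state from previously received chunks.
    MessageDigest fresh = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192];
    try (InputStream in = Files.newInputStream(file)) {
      int n;
      while ((n = in.read(buffer)) != -1) {
        fresh.update(buffer, 0, n);
      }
    }
    return fresh.digest();
  }

  /** Returns true if the file on disk hashes to the MD5 reported by the sender. */
  static boolean matchesSentDigest(Path file, byte[] sentMd5)
      throws IOException, NoSuchAlgorithmException {
    return MessageDigest.isEqual(md5OfFile(file), sentMd5);
  }

  public static void main(String[] args) throws Exception {
    Path tmp = Files.createTempFile("snapshot", ".tmp");
    try {
      // Stand-in for the snapshot data written to the temporary file.
      byte[] data = "snapshot chunk data".getBytes(StandardCharsets.UTF_8);
      Files.write(tmp, data);
      // Stand-in for the hash computed on the sender side.
      byte[] sentMd5 = MessageDigest.getInstance("MD5").digest(data);
      System.out.println("recomputed digest matches sent digest: "
          + matchesSentDigest(tmp, sentMd5));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```

If the file content equals what the sender hashed, a recomputation of this kind agrees with the sent hash, which is the observation reported above.
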
>  
>  
> Use the name of the saved corrupt file to locate the relevant log. Take 
> tlog.txt.snapshot.snapshot.corrupt20240831-094107_735 as an example:
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=YjM4MWY1MTA2Y2EyYWU4MmZlNDE0Mzg3MDRjYTBjMjRfU0dPbEpVbWFNalV1V1lSUVllOGFISUdWbUhqanRFdFdfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzMzE2MDk6MTcyNTMzNTIwOV9WNA!
> Before the corruption was encountered, the sender sent several consecutive 
> snapshot installation requests to the receiver.
>  
> The receiver successfully received some of the requests, then encountered a 
> corrupt request and began printing "recompute again" as it started 
> recalculating.
>  
> After the recalculation, the ERROR log for the rename is printed, and the 
> data is read back from the file and compared with the received chunk data.
>  
> If any byte did not match, the corresponding information would be printed. 
> No such log was printed, which means that the content written to the disk 
> is the same as the content sent.
> !https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=YmZlYjk1YjAwOWE4MDJlYTEzZjkxMjljODU1MzQxMTZfMkU0NmlPRWpidDBweGNzWXY4cHNJZG14b1o3Z1BZMzhfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzMzE2MDk6MTcyNTMzNTIwOV9WNA!
> This makes the problem very clear: the fault lies in the MD5 calculation 
> class. The reasoning is as follows:
>  
>      If a byte in the middle of the data part were incorrect due to network 
> reasons, the recalculated result and the hash sent would be different.
>  
>     If there were a problem with the part that stores the hash value, the 
> recalculated result would also be different.
>  
> Neither is the case here, because the MD5 recomputed from the file matches 
> the hash sent.
>  
> I suggest creating a new digest every time a follower receives a snapshot, 
> so as to avoid contamination. Under normal network and disk conditions, 
> corruption will then not occur.
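
The suggestion above can be illustrated with a small sketch. It is not the Ratis implementation: the class `FreshDigestPerSnapshot` and its methods are hypothetical, and the way the shared digest is polluted here (leftover bytes from an earlier transfer that was never finished or reset) is an assumption chosen only to show one way a reused digest can go wrong.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Ratis code: reused digest vs. a new digest per snapshot.
public class FreshDigestPerSnapshot {

  static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 is not available", e);
    }
  }

  /** Hashes one snapshot's chunks with a digest created only for this snapshot. */
  static byte[] md5WithFreshDigest(List<byte[]> chunks) {
    MessageDigest digest = newMd5();
    for (byte[] chunk : chunks) {
      digest.update(chunk);
    }
    return digest.digest();
  }

  /** Hashes chunks with a caller-supplied digest, which may carry leftover state. */
  static byte[] md5WithSharedDigest(MessageDigest shared, List<byte[]> chunks) {
    for (byte[] chunk : chunks) {
      shared.update(chunk);
    }
    return shared.digest();
  }

  public static void main(String[] args) {
    List<byte[]> snapshot = Arrays.asList(
        "chunk-1".getBytes(StandardCharsets.UTF_8),
        "chunk-2".getBytes(StandardCharsets.UTF_8));

    // Stand-in for the hash the sender computed over the same chunks.
    byte[] sent = md5WithFreshDigest(snapshot);

    // Assumed pollution: bytes from an earlier transfer remain in the shared digest.
    MessageDigest shared = newMd5();
    shared.update("leftover".getBytes(StandardCharsets.UTF_8));
    byte[] polluted = md5WithSharedDigest(shared, snapshot);

    byte[] clean = md5WithFreshDigest(snapshot);

    System.out.println("shared digest matches: " + MessageDigest.isEqual(polluted, sent));
    System.out.println("fresh digest matches:  " + MessageDigest.isEqual(clean, sent));
  }
}
```

With a digest created per snapshot, the receiver's result depends only on the chunks of that snapshot, so state left over from earlier requests cannot affect it.
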



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
