Re: Checksum mismatch in single repo
On Fri, Oct 27, 2023 at 6:42 AM Pierre Fourès wrote:

> Hi Felix,
>
> Your SMART data looks good to me, except for the hard drive temperature.
> 53°C seems quite high to me. Still, this should not be the cause of your
> corrupted data.
>
> Two data-corruption problems on the same server that look independent of
> each other, and that occurred a long time apart, remind me of a server
> that caused me lots of trouble until I discovered it had memory defects.
> I suspected hard disk failure and/or hard drive data corruption, but
> couldn't nail it down with smartctl or with the badblocks utility. I
> eventually nailed the problem down by running extensive tests with the
> stress utility, which showed that in some runs the memory was corrupting
> data (which ended up corrupting data on disk). I had to run the tests
> many times to spot the defect. Subtle defects are really hard to spot.
>
> IMO, I would advise you to do a full scan of this server to find where
> the problem is, so that you can file this trail of problems as
> definitively solved. In my situation, similar to yours, the problems
> occurred too far apart to justify committing resources to a thorough
> investigation. This period of uncertainty and intuitive distrust of the
> server caused us hidden costs like stress and fatigue. Having experienced
> it, if that happened again, I would prefer to rule out this situation
> quickly instead of knowing it lies dormant.
>
> Here are some links which might be relevant to you:
> - https://en.wikipedia.org/wiki/Badblocks
> - https://wiki.archlinux.org/title/Badblocks
> - https://man.archlinux.org/man/stress.1
> - https://wiki.archlinux.org/title/Stress_testing
> - https://www.memtest.org/
>
> Best Regards,
> Pierre.

I can speak to RAM corruption as well. In one instance, we were experiencing the strangest problems and blamed just about everything until I ran the above memtest utility and it showed tremendous numbers of memory errors.

When I opened up the hardware, I found dust on and around the memory. I cleaned that very thoroughly, put the system back together, and ran memtest overnight or over a weekend with zero errors. Evidently, dust can be conductive enough to act like a bunch of resistors across pins that shouldn't have resistors across them.

As trivial as that sounds, I recommend checking for things like dust, and since heat was mentioned, I'd check for fans that don't spin very freely. I also recommend running memtest over a weekend, and finally, I am in the camp that believes ECC RAM is a good idea, so I'd suggest checking whether you are using ECC RAM.

Hope this helps,
Nathan
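One possible way to follow up on the ECC suggestion, offered as a minimal sketch rather than a definitive check: on Linux, `dmidecode -t memory` (run as root) normally prints an "Error Correction Type" field for each physical memory array. The function names `ecc_types` and `uses_ecc` below are hypothetical, and the parsing assumes that field label appears in the output. Since the svn server in this thread is a virtual machine, the command would have to be run on the physical virtualization host, not inside the guest.

```python
import re
from typing import List


def ecc_types(dmidecode_output: str) -> List[str]:
    """Return every 'Error Correction Type' value found in dmidecode text.

    Assumption: the text was captured from `dmidecode -t memory` on the
    physical host and uses the usual 'Error Correction Type: <value>' lines.
    """
    pattern = r"^[ \t]*Error Correction Type:[ \t]*(.+?)[ \t]*$"
    return re.findall(pattern, dmidecode_output, flags=re.MULTILINE)


def uses_ecc(dmidecode_output: str) -> bool:
    """True if at least one memory array reports a type that mentions ECC."""
    return any("ecc" in value.lower() for value in ecc_types(dmidecode_output))
```

To use it, save the output of `dmidecode -t memory` to a file on the host and pass the file's contents to `uses_ecc`. A result of `False` only means no ECC type was reported; it says nothing about whether the installed memory is healthy.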
Re: Checksum mismatch in single repo
Hi Felix,

Your SMART data looks good to me, except for the hard drive temperature. 53°C seems quite high to me. Still, this should not be the cause of your corrupted data.

Two data-corruption problems on the same server that look independent of each other, and that occurred a long time apart, remind me of a server that caused me lots of trouble until I discovered it had memory defects. I suspected hard disk failure and/or hard drive data corruption, but couldn't nail it down with smartctl or with the badblocks utility. I eventually nailed the problem down by running extensive tests with the stress utility, which showed that in some runs the memory was corrupting data (which ended up corrupting data on disk). I had to run the tests many times to spot the defect. Subtle defects are really hard to spot.

IMO, I would advise you to do a full scan of this server to find where the problem is, so that you can file this trail of problems as definitively solved. In my situation, similar to yours, the problems occurred too far apart to justify committing resources to a thorough investigation. This period of uncertainty and intuitive distrust of the server caused us hidden costs like stress and fatigue. Having experienced it, if that happened again, I would prefer to rule out this situation quickly instead of knowing it lies dormant.

Here are some links which might be relevant to you:
- https://en.wikipedia.org/wiki/Badblocks
- https://wiki.archlinux.org/title/Badblocks
- https://man.archlinux.org/man/stress.1
- https://wiki.archlinux.org/title/Stress_testing
- https://www.memtest.org/

Best Regards,
Pierre.
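The idea behind the repeated test runs described above can be illustrated with a small sketch: write known data into memory, copy it, and compare checksums, many times over. This is only an illustration under stated assumptions, not a replacement for the stress utility or for memtest. A userspace script can only touch the memory the operating system hands it, so a clean result proves very little. The function name `pattern_check` and its default sizes are hypothetical.

```python
import hashlib
import os


def pattern_check(size_bytes: int = 16 * 1024 * 1024, rounds: int = 3) -> int:
    """Copy random data in RAM and compare SHA-256 digests of both copies.

    Returns the number of rounds in which the copy did not match the source.
    On healthy hardware this is expected to be 0.
    """
    mismatches = 0
    for _ in range(rounds):
        source = os.urandom(size_bytes)            # random reference data
        expected = hashlib.sha256(source).hexdigest()
        copy = bytearray(source)                   # second buffer holding the same bytes
        actual = hashlib.sha256(copy).hexdigest()
        if actual != expected:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print(f"rounds with mismatching digests: {pattern_check()}")
```

Because intermittent defects may only show up in some runs, a loop like this would need a large buffer and many rounds before a mismatch appears, which is why the dedicated tools linked above are the better choice for a real investigation.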
Re: Checksum mismatch in single repo
hello Daniel,

thank you for your quick answer, I reply inline:

On 27.10.23 08:23, Daniel Sahlberg wrote:
> On Fri, 27 Oct 2023 at 07:30, Felix Natter wrote:
>
>> Dear svn experts,
>>
>> I do a daily dump+backup of my svn server. Without any known trigger
>> (no server crash, except about 2 months ago I had a single I/O error on
>> the ProxMox virtualization server), the dump of one repo failed with:
>>
>> svnadmin: E200046: LZ4 decompression failed
>>
>> The svnadmin verify I ran to double check that also failed for that one
>> repo:
>>
>> verifying /repos/X/Y...
>> * Error verifying repository metadata.
>> svnadmin: E160004: Checksum mismatch in item at offset 18983705 of length
>> 11921122 bytes in file /repos/X/Y/db/revs/0/221
>>
>> After I restored X/Y from the last backup, and ran a dump/backup/verify,
>> everything is fine for 4 days now.
>
> Good thing you did the dump/backup and verify steps!
>
> Do I understand the issue occurred about a week ago, you restored the
> backup and now it has been working fine for the last 4 days? As compared
> with the known I/O error 2 months ago (i.e., a lot earlier)?

Yes, the I/O error occurred earlier and did not have consequences for "svnadmin dump/verify". With the current (4 days ago) corruption, I did not see any I/O errors. SMART is also green (please see below).

>> I couldn't find an error in the system logs (especially no I/O errors).
>> The repos are on a HDD (in my experience they last longer than SSDs
>> with lots of write activity, i.e. daily dumps/backups/etc...).
>>
>> Question: Can I rule out software failure?
>
> It is difficult to rule out, but there are not many reports of this
> failure so I would guess it is more likely to be a corrupted bit of data
> on your HDD.

Ok, thanks.

>> I am running svn 1.14.1 on ALMA Linux 8.x. Shall I install on a new HDD?
>
> You should probably check the SMART stats on the drive (on the
> virtualisation host!) or any other indications you might have on an
> upcoming failure to see if the HDD is indeed the issue.

I do not see a single problem in "smartctl -a /dev/sda" (I started a long test with -t long earlier this week):
https://pastebin.com/7hi31CUg

But then I never identified a failing HDD using SMART...

Many Thanks and Best Regards,
Felix

>> No action needed?
>> Any other advice?
>>
>> Many Thanks in Advance and Best Regards,
>> Felix
>
> Kind regards,
> Daniel Sahlberg

--
SIDACT GmbH
Simulation Data Analysis and Compression Technologies

Felix Natter
Software Developer

Auguststraße 29
53229 Bonn
Germany

Phone : +49 228 5348 0430
Direct : +49 228 4097 7118
Email : felix.nat...@sidact.com
Web : http://www.sidact.com/
Re: Checksum mismatch in single repo
On Fri, 27 Oct 2023 at 07:30, Felix Natter wrote:

> Dear svn experts,
>
> I do a daily dump+backup of my svn server. Without any known trigger
> (no server crash, except about 2 months ago I had a single I/O error on
> the ProxMox virtualization server), the dump of one repo failed with:
>
> svnadmin: E200046: LZ4 decompression failed
>
> The svnadmin verify I ran to double check that also failed for that one
> repo:
>
> verifying /repos/X/Y...
> * Error verifying repository metadata.
> svnadmin: E160004: Checksum mismatch in item at offset 18983705 of length
> 11921122 bytes in file /repos/X/Y/db/revs/0/221
>
> After I restored X/Y from the last backup, and ran a dump/backup/verify,
> everything is fine for 4 days now.

Good thing you did the dump/backup and verify steps!

Do I understand the issue occurred about a week ago, you restored the backup and now it has been working fine for the last 4 days? As compared with the known I/O error 2 months ago (i.e., a lot earlier)?

> I couldn't find an error in the system logs (especially no I/O errors).
> The repos are on a HDD (in my experience they last longer than SSDs
> with lots of write activity, i.e. daily dumps/backups/etc...).
>
> Question: Can I rule out software failure?

It is difficult to rule out, but there are not many reports of this failure so I would guess it is more likely to be a corrupted bit of data on your HDD.

> I am running svn 1.14.1 on ALMA Linux 8.x. Shall I install on a new HDD?

You should probably check the SMART stats on the drive (on the virtualisation host!) or any other indications you might have on an upcoming failure to see if the HDD is indeed the issue.

> No action needed?
> Any other advice?
>
> Many Thanks in Advance and Best Regards,
> Felix

Kind regards,
Daniel Sahlberg
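Since the corruption in this thread was caught by `svnadmin verify`, one option is to run that check on every repository as part of the daily backup. The following is a minimal sketch, assuming all repositories sit directly under one parent directory and that a directory counts as a repository when it contains a `format` file and a `db` directory. The helper names (`find_repos`, `run_verify`, `verify_all`) are hypothetical; only the `svnadmin verify --quiet` call itself is a real Subversion command.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def find_repos(parent: Path) -> List[Path]:
    """List subdirectories of `parent` that look like Subversion repositories."""
    return sorted(
        p for p in parent.iterdir()
        if (p / "format").is_file() and (p / "db").is_dir()
    )


def run_verify(repo: Path) -> int:
    """Run `svnadmin verify --quiet` on one repository and return its exit code."""
    return subprocess.run(["svnadmin", "verify", "--quiet", str(repo)]).returncode


def verify_all(
    repos: Iterable[Path],
    verify: Callable[[Path], int] = run_verify,
) -> Dict[str, bool]:
    """Map each repository path to True (verified) or False (verify failed)."""
    return {str(repo): verify(repo) == 0 for repo in repos}


# Example (assumed layout): results = verify_all(find_repos(Path("/repos/X")))
```

The `verify` parameter is injectable so the loop can be exercised without a Subversion installation. A failed entry says which repository to restore from backup; it does not say whether the disk, the memory, or the software caused the damage.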