AWS - UFS corrupted when restoring from AWS Backup service

bogdan-lists Sat, 23 Jul 2022 01:33:11 -0700

Hello,

TL;DR: We have a bunch of EC2 machines in AWS running FreeBSD. The AMI is from 
the Marketplace and the file system is UFS. The AWS Backup service takes hourly 
snapshots of these machines (AMI + EBS snapshots, I believe). After a few 
months of snapshots we had to restore one of them and found that the file 
system was corrupted and fsck was not able to recover it. We are going to 
enable sync in fstab and see if that helps, but it’s hard to tell, because the 
problem is hard to reproduce and the details of how everything works are fuzzy 
to me.


Longer version:

We use FreeBSD on web servers in AWS. Until January we were doing weekly AMI 
snapshots by running a script that would shut down the machine, create the AMI, 
then start the machine back up. That worked for a long time, but it is less 
than ideal, and shutting down production more often than weekly is rude.

At the start of this year we switched to running AWS Backup hourly. It takes 
snapshots of a running machine without stopping it. I believe it’s the same as 
creating an AMI and checking the “No reboot” checkbox. It should use the same 
API call, but I wouldn’t know. We ran a few recovery tests, read the docs, and 
confirmed with support; everything looked like it should work with no issues.

A couple of weeks ago the EBS disk on one of the machines failed and we needed 
to restore it. When we did, the restored machine ran fsck on boot (which it 
didn’t in our previous tests) and fsck failed to recover the file system, so 
the machine was effectively dead. I know we can mount the disk on a different 
machine and recover (some) data; that’s not the point. We tried a few backups 
going back two weeks, with the same result. We then tried restoring a few more 
instances, about 5, and all of them ran fsck on boot. A couple were recovered, 
but that doesn’t matter: it still means backups are not working as we thought. 
So now we’re effectively running without backups on EC2 instances.

I’m not sure why it happens. Information is sparse and I’m making a lot of 
assumptions. Basically, I believe that the snapshot process is equivalent to 
cutting off power to the machine, and that happens every hour for months. The 
docs on UFS soft updates say that there’s a small chance of data loss, but 
since that power-cutting snapshot happens every hour over a period of months, 
that chance isn’t that small any more. Still, apparently Linux doesn’t have 
this problem, and everything I read says that data might be lost but the file 
system itself should not be corrupted. And yet fsck isn’t always able to 
recover it.
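
To put rough numbers on the "small chance, repeated hourly" reasoning above, 
here is a minimal sketch. It assumes every snapshot has the same, independent 
chance of catching the file system in a bad state; the per-snapshot risk 
values are made up for illustration (the real figure is unknown), and the 
function name is hypothetical.

```python
def chance_of_at_least_one_bad(per_snapshot_risk: float, snapshots: int) -> float:
    """Probability that at least one of `snapshots` snapshots is bad,
    assuming each one independently goes bad with `per_snapshot_risk`."""
    return 1.0 - (1.0 - per_snapshot_risk) ** snapshots


if __name__ == "__main__":
    # Hourly snapshots for roughly six months (30-day months).
    hourly_for_six_months = 24 * 30 * 6

    # Hypothetical per-snapshot risks; none of these are measured values.
    for risk in (0.0001, 0.001, 0.01):
        p = chance_of_at_least_one_bad(risk, hourly_for_six_months)
        print(f"per-snapshot risk {risk:.2%}: "
              f"at least one bad snapshot with probability {p:.1%}")
```

Under these assumptions, even a one-in-a-thousand risk per snapshot makes at 
least one bad snapshot over six months very likely. Note that this only says 
some snapshot is affected, not that any particular restore will fail.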

As far as I understand, with soft updates and “noasync” in fstab (the default), 
data is flushed to disk about every 30 seconds (according to the syncer man 
page), asynchronously, while metadata is written synchronously. I’m thinking 
that maybe that’s the issue and that turning on sync in fstab might help. On 
the other hand, the man page for syncer says “It is possible on some systems 
that a sync(2) occurring simultaneously with a crash may cause file system 
damage”, which means it might make things worse? I don’t know.
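
For reference, "turning on sync in fstab" means adding the sync mount option 
to the options field (the fourth column) of the root file system entry. The 
sketch below only shows what that edit looks like on a single line; the device 
name in the example is an assumption, not taken from our machines, and the 
helper name is hypothetical. Whether the option actually helps with snapshots 
is exactly the open question.

```python
def add_mount_option(fstab_line: str, option: str) -> str:
    """Return the fstab line with `option` added to its options field.

    fstab fields are: device, mount point, fs type, options, dump, pass.
    Comment lines and lines with fewer than four fields are returned as-is.
    """
    fields = fstab_line.split()
    if len(fields) < 4 or fields[0].startswith("#"):
        return fstab_line

    options = fields[3].split(",")
    if option not in options:  # do not add the option twice
        options.append(option)
    fields[3] = ",".join(options)
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical root entry; the device name is an assumption.
    before = "/dev/gpt/rootfs\t/\tufs\trw\t1\t1"
    print(add_mount_option(before, "sync"))
```

The options field changes from rw to rw,sync, and the change takes effect the 
next time the file system is mounted.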

We were not able to reproduce the problem reliably, so we have no good way to 
test a fix. I’m not sure if or how anyone can help. I just wanted to send this 
message so that at least some other people are aware that AWS Backup doesn’t 
play nice with FreeBSD.
