> Date: Thu, 23 Jul 2020 07:45:08 +0200
> From: Michał Górny <mgo...@gentoo.org>
>
> On Thu, 2020-07-23 at 05:17 +0000, David Holland wrote:
> > The problem is that because it still doesn't do anything about
> > journaling or preserving file contents, but runs a lot faster, it
> > loses more data when interrupted.
>
> How does that compare to the level of damage non-journaled FFS takes?
> My VM was just bricked a second time because /etc/passwd was turned to
> junk.  I dare say that a proper metadata journaling + proper writes
> (i.e. using rename() -- haven't verified whether that's done correctly)
> should prevent that from happening again.
Metadata journaling doesn't do anything about that, and it never has. It is a common misconception that metadata journaling has anything to do with making a system more robust against data corruption. Metadata journaling is primarily about making it _faster_ to pick up after an interruption such as a crash or power failure, and faster to issue writes in the first place, at the cost of doubling the number of metadata writes.

- In traditional ffs, every operation issues metadata writes synchronously in a particular order. This way, if an operation is interrupted, then on reboot, `fsck -p' can reliably identify what state the file system was in, and either roll back to undo the operation or roll forward to complete it. Of course, identifying that state requires doing a global analysis of the file system structure, so it's slow, and the larger the file system is, the slower it gets. (Note: `fsck -p' is part of the file system design; fsck _without_ `-p' is pray-to-recover from `unexpected inconsistencies' arising either from bugs or from hardware failures.)

- With wapbl, every operation issues metadata writes in order _twice_: first to a sequential log, and then -- once all the writes to the log for the operation have been committed to disk -- to the locations where the metadata blocks actually live. This way, if an operation is interrupted, then on reboot, log replay can reliably roll forward operations whose metadata writes were committed in the log, and discard the rest to roll back operations whose metadata writes were not committed. Log replay takes time roughly proportional to the number of in-flight operations rather than to the size of the file system, so it's much cheaper than the global analysis of `fsck -p' for large disks.
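The log-then-apply ordering described above can be sketched in miniature. This is a toy model in Python, not NetBSD code: the record format and names are invented, and a real journal works on disk blocks, not JSON lines. The point is only the ordering -- an operation's writes hit the in-place locations only after its commit record is durable in the log, and replay rolls forward committed operations and drops an uncommitted tail.

```python
import json
import os

LOG = "journal.log"   # hypothetical sequential log file

def log_op(op_writes):
    """Append one operation's metadata writes to the sequential log,
    then fsync so the commit record is durable *before* any of the
    writes go to the blocks' real locations."""
    with open(LOG, "a") as f:
        for blkno, data in op_writes:
            f.write(json.dumps({"blk": blkno, "data": data}) + "\n")
        f.write(json.dumps({"commit": True}) + "\n")
        f.flush()
        os.fsync(f.fileno())   # commit point: log records are on disk
    # Only after this may the in-place metadata writes be issued.

def replay(log_lines, disk):
    """On reboot: roll forward operations whose commit record made it
    to the log; discard a trailing uncommitted tail (roll back)."""
    pending = []
    for line in log_lines:
        rec = json.loads(line)
        if rec.get("commit"):
            for blkno, data in pending:
                disk[blkno] = data   # apply the committed writes in place
            pending = []
        else:
            pending.append((rec["blk"], rec["data"]))
    # Anything left in `pending` had no commit record: dropped.

# Simulate a crash mid-operation: the second op's commit never landed.
disk = {}
crash_log = [
    json.dumps({"blk": 1, "data": "inode update"}),
    json.dumps({"commit": True}),
    json.dumps({"blk": 2, "data": "half-written"}),  # no commit follows
]
replay(crash_log, disk)
# disk holds block 1 only; the torn second operation is rolled back.
```

Replay here walks a list proportional to the in-flight operations, which is the whole point: no global scan of the file system is needed.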
wapbl only requires the metadata writes to be serialized -- not synchronous -- so even though it issues every metadata write twice, it tends to have much higher write throughput (especially on spinning rust), since metadata writes don't happen in lock-step with the disk write latency.

Of course, the devil is in the details: wapbl is actually more complicated than that, and we screwed up the on-disk format ages ago. So wapbl has various shortcomings, like crashing when the number of metadata writes needed to atomically truncate a large file exceeds the free space left in the log on disk, because we failed to guarantee that every operation runs in (small) constant log space and to preallocate enough space up front.

ffs also has a long-standing bug I call the `garbage data appended after crash' bug: when you append data to a file, ffs will _synchronously_ allocate data blocks and update the inode length, and _asynchronously_ write the data to the new blocks. If interrupted, the new blocks may be allocated and the inode length updated, but the new blocks may contain garbage because the asynchronous data writes haven't completed yet. The result is that it's as if you had appended garbage data to the end of the file. You can work around it by writing to a temporary file, fsyncing the temporary file, and renaming it to the permanent location, but it's a bug nevertheless.

wapbl makes this bug _worse_ by issuing the metadata writes much faster -- since they only need to be serialized, not synchronous -- so the bug can apply to many more files and much more data.

All of this is to say: wapbl -- and journaling generally -- doesn't do anything more than ffs to change the `level of damage' in any qualitative way; but both traditional ffs and ffs+wapbl have something you might call a `data loss' bug (more accurately, file corruption), and it's quantitatively _worse_ for wapbl.
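The write-temporary/fsync/rename workaround mentioned above looks roughly like this. A generic POSIX sketch in Python (the function name and the demo file name are mine, and error handling is minimal); the crucial ordering is that fsync(2) pushes the data blocks to disk before the atomic rename(2) swings the name over, so a crash leaves either the old file or the complete new one, never a garbage-extended file.

```python
import os

def atomic_replace(path, data):
    """Replace `path` with `data` such that a crash leaves either the
    old contents or the new contents -- never a torn mixture."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)          # data on disk *before* the rename
    finally:
        os.close(fd)
    os.replace(tmp, path)     # atomic rename(2) over the old name
    # To make the rename itself durable, fsync the containing directory:
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

# Hypothetical demo file standing in for something like /etc/passwd:
atomic_replace("passwd.new-demo", b"root:*:0:0:...\n")
```

Skipping the fsync -- as many programs do for speed -- is exactly what exposes the garbage-appended-data window the paragraph above describes.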
So I'm not clear on where kamil gets the idea that wapbl is less prone to data loss, and the symptom you (mgorny) described is consistent with the bug that wapbl makes worse.

(There are various ways we _could_ approach the shortcomings of ffs and wapbl: impose ordering constraints on data writes to fix the garbage data appended after crash bug (`soft updates'); create new types of logical log entries to atomically truncate inodes, so that truncation can run in constant log space; do the bookkeeping for wapbl transactions better so we never run out of space. But some of these require changes to the on-disk format, and overall it's a lot of work... which is why I used to use ffs+sync on my laptop, and these days I avoid ffs altogether in favour of zfs and lfs, except on install images written to USB media.)