We had a report of corruption on NixOS, in tests that build a system
image; it bisected to the patch that enabled buffered writes without
taking the inode lock:

https://evilpiepirate.org/git/bcachefs.git/commit/?id=7e64c86cdc6c

It appears that dirty folios are being dropped somehow; corrupt files,
when checked against good copies, have ranges of 0s that are 4k aligned
(modulo 2k, likely a misaligned partition).
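
For anyone who wants to check a corrupt file against a good copy
themselves, here is a minimal sketch of that kind of comparison. It is
not the tool used for the report above; the function names, the 512 byte
sector granularity and the 4096 byte page size are assumptions made for
illustration.

```python
import sys

SECTOR = 512   # assumed comparison granularity
PAGE = 4096    # assumed page size

def zeroed_ranges(good, bad, sector=SECTOR):
    """Return (offset, length) runs where `bad` is all zeros but `good` is not.

    Sectors that are legitimately zero in the good copy do not count as
    dropped, so they will split a run.
    """
    ranges = []
    start = None
    n = min(len(good), len(bad))
    for off in range(0, n, sector):
        g = good[off:off + sector]
        b = bad[off:off + sector]
        dropped = b != g and b == bytes(len(b))
        if dropped and start is None:
            start = off
        elif not dropped and start is not None:
            ranges.append((start, off - start))
            start = None
    if start is not None:
        ranges.append((start, n - start))
    return ranges

def describe(ranges, page=PAGE):
    """Attach (offset mod page, length mod page) to each run."""
    return [(off, length, off % page, length % page) for off, length in ranges]

if __name__ == "__main__" and len(sys.argv) == 3:
    # usage: script.py GOOD_FILE CORRUPT_FILE
    with open(sys.argv[1], "rb") as f:
        good_data = f.read()
    with open(sys.argv[2], "rb") as f:
        bad_data = f.read()
    for off, length, off_mod, len_mod in describe(zeroed_ranges(good_data, bad_data)):
        print(f"offset={off} len={length} offset%{PAGE}={off_mod} len%{PAGE}={len_mod}")
```

Runs whose length is a multiple of 4096 and whose offsets all share the
same remainder would match the pattern described above. This reads both
files into memory, so it is only suitable for files that fit in RAM.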

Interestingly, it only triggers under QEMU. The test fails pretty
consistently, and we have a lot of NixOS users; we'd notice (via nix
store verifies) if the corruption were more widespread. We believe it
only triggers with QEMU's snapshot mode (but don't quote me on that).

Further digging implicates CONFIG_COMPACTION or CONFIG_MIGRATION.

Testing with COMPACTION=n, MIGRATION=n and TRANSPARENT_HUGEPAGE=y passes
reliably.

On the bcachefs side, I've been testing with that patch reduced to just
"don't take inode lock if not extending"; i.e. killing the fancy stuff
to preserve write atomicity. It really does appear to be "don't take
inode lock -> dirty folios get dropped".

It's not a race with truncate, or anything silly like that; bcachefs has
the pagecache add lock, which serves here for locking vs. truncate.

So - this is a real head scratcher. The inode lock really doesn't do
much in IO paths, it's there for synchronization with truncate and write
vs. write atomicity - the mm paths know nothing about it. Page
fault/mkwrite paths don't take it at all; a buffered non-extending write
should be able to work similarly: the folio lock should be entirely
sufficient here.

Anyone got any bright ideas?
