We've been hunting the root cause of data crc errors here at FB for a while.
We'd find one or two corrupted files, usually displaying crc errors without any
corresponding IO errors from the storage.  The bug was rare enough that we'd
need to watch a large number of machines for a few days just to catch it
happening.

We're still running these patches through testing, but the fixup worker bug
seems to account for the vast majority of crc errors we're seeing in the fleet.
It's cleaning pages that were dirty, and creating a window where they can be
reclaimed before we finish processing the page.

btrfs_file_write() has a similar bug when copy_from_user catches a page fault
and we're writing to a page that was already dirty when file_write started.
This one is much harder to trigger, and I haven't confirmed yet that we're
seeing it in the fleet.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to