We've been hunting the root cause of data crc errors here at FB for a while. We'd find one or two corrupted files, usually displaying crc errors without any corresponding IO errors from the storage. The bug was rare enough that we'd need to watch a large number of machines for a few days just to catch it happening.
We're still running these patches through testing, but the fixup worker bug seems to account for the vast majority of crc errors we're seeing in the fleet. It's cleaning pages that were dirty, and creating a window where they can be reclaimed before we finish processing the page. btrfs_file_write() has a similar bug when copy_from_user catches a page fault and we're writing to a page that was already dirty when file_write started. This one is much harder to trigger, and I haven't confirmed yet that we're seeing it in the fleet.
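To make the fixup worker window concrete, here is a toy userspace model in Python. It is not kernel code and does not mirror the btrfs implementation: `Page`, `PageCache`, `fixup`, `run`, and the `clean_early` flag are all made-up names for illustration. The only thing it tries to capture is the ordering problem described above, where a page that has been marked clean can be reclaimed before processing finishes.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    data: bytes
    dirty: bool = True


@dataclass
class PageCache:
    pages: dict = field(default_factory=dict)  # index -> Page held in memory
    disk: dict = field(default_factory=dict)   # index -> bytes written back

    def reclaim(self):
        """Drop every clean page, as memory reclaim is free to do."""
        for index in [i for i, p in self.pages.items() if not p.dirty]:
            del self.pages[index]

    def write_back(self, index):
        """Finish processing: persist the page if it is still cached."""
        page = self.pages.get(index)
        if page is None:
            return False  # the page vanished before we finished with it
        self.disk[index] = page.data
        page.dirty = False
        return True


def fixup(cache, index, clean_early):
    """Hypothetical fixup step with reclaim firing inside the window."""
    if clean_early:
        # Buggy ordering: the page is marked clean before processing is done.
        cache.pages[index].dirty = False
    cache.reclaim()  # memory pressure arrives in the middle of the window
    return cache.write_back(index)


def run(clean_early):
    """Run one fixup on a single dirty page and report what reached disk."""
    cache = PageCache(pages={0: Page(b"new data")})
    ok = fixup(cache, 0, clean_early)
    return ok, cache.disk.get(0)
```

In this model, `run(True)` loses the page because reclaim sees it as clean and drops it, while `run(False)` keeps the page dirty until processing completes, so the data reaches the simulated disk. Keeping the page dirty for the whole window is the assumption this sketch makes about the shape of a fix; the actual patches may differ.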