On 08/14/2017 09:28 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli <kreij...@inwind.it>
> wrote:
>> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>>> [2]
>>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>>> not sure, sadly I don't read the code well enough. But if all Btrfs
>>> raid56 writes are full stripe CoW writes, and if the prescribed order
>>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>>> superblock update, then I don't see how the write hole happens. Write
>>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>>> and a crash during the modification of the stripe making that stripe
>>> inconsistent as well as still pointed to by metadata.
>>
>>
>> RAID5 is *single* failure proof. And in order to have the write hole bug we
>> need two failures:
>> 1) a transaction is aborted (e.g. due to a power failure) and the result is
>> that data and parity are misaligned
>> 2) a disk disappears
>>
>> These two events may even happen at different moments.
>>
>> The key is that when a disk disappears, all the remaining ones are used to
>> rebuild the missing one. So if data and parity are misaligned, the rebuilt
>> disk is wrong.
>>
>> Let me show an example
>>
>> Disk 1    Disk 2    Disk 3 (parity)
>> AAAAAA    BBBBBB    CCCCCC
>>
>> where CCCCCC = AAAAAA ^ BBBBBB
>>
>> Note 1: AAAAAA is valid data
>>
>> Suppose you update B and, due to a power failure, you can't update the
>> parity. You have:
>>
>>
>> Disk 1    Disk 2    Disk 3 (parity)
>> AAAAAA    DDDDDD    CCCCCC
>>
>> Of course CCCCCC != AAAAAA ^ DDDDDD (data and parity are misaligned).
>>
>>
>> Note that AAAAAA is still valid data.
>>
>> Now suppose you lose disk 1. If you want to read from it, you have to
>> read disk 2 and disk 3 to compute disk 1.
>>
>> However, disk 2 and disk 3 are misaligned, so computing DDDDDD ^ CCCCCC you
>> don't get AAAAAA anymore.
>>
>>
>> Note that it does not matter whether DDDDDD or BBBBBB are valid or invalid data.
>
>
> Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
> results in csum mismatch. This I've tested.

I never argued about that. The write hole is about the *loss* of "valid data"
due to a misalignment between data and parity. The fact that BTRFS is able to
detect the problem and return -EIO doesn't mitigate the loss of valid data.
Note that in my example AAAAAA reached the disk before the "failure events".

>
> I vaguely remember a while ago doing a dd conv=notrunc modification of
> a file that's raid5, and there was no RMW, what happened is the whole
> stripe was CoW'd and had the modification. So that would, hardware
> behaving correctly, mean that the raid5 data CoW succeeds, then there
> is a metadata CoW to point to it, then the super block is updated to
> point to the new tree.
>
> At any point, if there's an interruption, we have the old super
> pointing to the old tree which points to premodified data.
>
> Anyway, I do wish I read the code better, so I knew exactly where, if
> at all, the RMW code was happening on disk rather than just in memory.
> There very clearly is RMW in memory code as a performance optimizer,
> before a stripe gets written out it's possible to RMW it to add in
> more changes or new files, that way raid56 isn't dog slow CoW'ing
> literally a handful of 16KiB leaves each time, that then translate
> into a minimum of 384K of writes.

In the case of a full stripe write there is no RMW cycle, so no "write hole".
Unfortunately, not all writes are full stripe size. I never checked the code,
but I hope that during a transaction commit the writes are grouped into
full stripe writes as far as possible.

Just out of curiosity, what is the "minimum of 384k"? In a 3-disk raid5, the
minimum data is 64k * 2 (+ 64k of parity).

> But yeah, Qu just said in another thread that Liu is working on a
> journal for the raid56 write hole problem. Thing is I don't see when
> it happens in the code or in practice (so far, it's really tedious to
> poke a file system with a stick).
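
To make the example quoted earlier concrete, here is a minimal Python sketch.
It is not btrfs code: the names xor_bytes and write_hole_demo are made up for
illustration, and zlib.crc32 is only a stand-in for the filesystem's data
checksum. It replays the interrupted update and then rebuilds disk 1 from
disk 2 and the stale parity.

```python
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (RAID5-style parity)."""
    return bytes(x ^ y for x, y in zip(a, b))


def write_hole_demo():
    # Consistent stripe: parity = disk1 ^ disk2.
    disk1 = b"AAAAAA"
    disk2 = b"BBBBBB"
    parity = xor_bytes(disk1, disk2)
    csum_disk1 = zlib.crc32(disk1)  # stand-in for the stored data checksum

    # Failure 1: disk2 is overwritten, but the power fails before the
    # parity is updated, so parity still describes the old disk2.
    disk2 = b"DDDDDD"

    # Failure 2 (possibly much later): disk1 disappears and has to be
    # rebuilt from the surviving data block and the stale parity.
    rebuilt_disk1 = xor_bytes(disk2, parity)

    # The checksum flags the bad reconstruction, but cannot recover
    # the original block.
    detected = zlib.crc32(rebuilt_disk1) != csum_disk1
    return disk1, rebuilt_disk1, detected


if __name__ == "__main__":
    original, rebuilt, detected = write_hole_demo()
    print("original:", original)
    print("rebuilt: ", rebuilt)
    print("mismatch detected:", detected)
```

The rebuilt block differs from AAAAAA even though AAAAAA was never touched by
the interrupted write. The checksum mismatch turns silent corruption into an
error on read, but the valid data is still gone.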

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5