On 08/14/2017 09:28 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli <kreij...@inwind.it> 
> wrote:
>> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>>> [2]
>>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>>> not sure, sadly I don't read the code well enough. But if all Btrfs
>>> raid56 writes are full stripe CoW writes, and if the prescribed order
>>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>>> superblock update, then I don't see how the write hole happens. Write
>>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>>> and a crash during the modification of the stripe making that stripe
>>> inconsistent as well as still pointed to by metadata.
>>
>>
>> RAID5 is *single* failure proof. And in order to have the write hole bug we 
>> need two failures:
>> 1) a transaction is aborted (e.g. due to a power failure) and the result is 
>> that data and parity are misaligned
>> 2) a disk disappears
>>
>> These two events may happen at different moments.
>>
>> The key is that when a disk disappears, all the remaining ones are used to 
>> rebuild the missing one. So if data and parity are misaligned, the rebuilt 
>> disk is wrong.
>>
>> Let me show an example
>>
>> Disk 1            Disk 2         Disk 3  (parity)
>> AAAAAA            BBBBBB         CCCCCC
>>
>> where CCCCCC = AAAAAA ^ BBBBBB
>>
>> Note1: AAAAAA is valid data
>>
>> Suppose you update B and, due to a power failure, the parity is not updated; 
>> you have:
>>
>>
>> Disk 1            Disk 2         Disk 3  (parity)
>> AAAAAA            DDDDDD         CCCCCC
>>
>> Of course CCCCCC != AAAAAA ^ DDDDDD  (data and parity are misaligned).
>>
>>
>> Pay attention that AAAAAA is still valid data.
>>
>> Now suppose you lose disk1. If you want to read from it, you have to perform 
>> a read of disk2 and disk3 to compute disk1.
>>
>> However, disk2 and disk3 are misaligned, so computing DDDDDD ^ CCCCCC no 
>> longer gives you AAAAAA.
>>
>>
>> Note that it is not important whether DDDDDD or BBBBBB is valid or invalid data.
> 
> 
> Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
> results in csum mismatch. This I've tested.

I never argued about that. The write hole is about the *loss* of "valid data" 
due to a misalignment between data and parity.
The fact that BTRFS is able to detect the problem and return -EIO 
doesn't mitigate the loss of valid data. Note that in my example AAAAAA 
reached the disk before the "failure events".
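
The example quoted above can be reproduced with a few lines of Python. This is 
only a sketch under stated assumptions: parity is a plain byte-wise XOR, the 
blocks are six bytes long, and xor_blocks is a made-up helper, not something 
taken from the Btrfs code.

```python
def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks (RAID5-style parity)."""
    if len(x) != len(y):
        raise ValueError("blocks must have the same length")
    return bytes(a ^ b for a, b in zip(x, y))


disk1 = b"AAAAAA"                    # valid data, already on disk
disk2 = b"BBBBBB"
parity = xor_blocks(disk1, disk2)    # plays the role of CCCCCC

# Consistent stripe: if disk1 is lost, it can be rebuilt from disk2 and parity.
assert xor_blocks(disk2, parity) == disk1

# Interrupted update: disk2 is rewritten, but the parity is never updated.
disk2 = b"DDDDDD"

# Later disk1 disappears and is rebuilt from the misaligned disk2 and parity.
rebuilt_disk1 = xor_blocks(disk2, parity)
write_hole = rebuilt_disk1 != disk1  # the untouched, valid data is not recovered
```

A checksum can flag rebuilt_disk1 as bad, but nothing in the remaining two 
blocks allows the original disk1 content to be recovered.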

> 
> I vaguely remember a while ago doing a dd conv=notrunc modification of
> a file that's raid5, and there was no RMW, what happened is the whole
> stripe was CoW'd and had the modification. So that would, hardware
> behaving correctly, mean that the raid5 data CoW succeeds, then there
> is a metadata CoW to point to it, then the super block is updated to
> point to the new tree.
> 
> At any point, if there's an interruption, we have the old super
> pointing to the old tree which points to premodified data.
> 
> Anyway, I do wish I read the code better, so I knew exactly where, if
> at all, the RMW code was happening on disk rather than just in memory.
> There very clearly is RMW in memory code as a performance optimizer,
> before a stripe gets written out it's possible to RMW it to add in
> more changes or new files, that way raid56 isn't dog slow CoW'ing
> literally a handful of 16KiB leaves each time, that then translate
> into a minimum of 384K of writes.

In the case of a full stripe write, there is no RMW cycle, so no "write hole". 
Unfortunately, not all writes are full stripe size. I never checked the code, 
but I hope that during a transaction commit the writes are grouped 
into "full stripe writes" as far as possible.
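
To make the distinction concrete, here is a hedged sketch of how a write could 
be classified. It is not how Btrfs decides this (I have not checked the code); 
the function name, the 64 KiB stripe element default and the alignment rule 
are assumptions for illustration only.

```python
def is_full_stripe_write(offset: int, length: int, data_disks: int,
                         stripe_element: int = 64 * 1024) -> bool:
    """Return True if the byte range [offset, offset + length) covers only
    whole stripes, so parity can be computed without reading old data."""
    if length <= 0 or data_disks < 1:
        raise ValueError("length and data_disks must be positive")
    full_stripe = data_disks * stripe_element  # data bytes per full stripe
    return offset % full_stripe == 0 and length % full_stripe == 0


# 3-disk raid5 -> 2 data disks, 128 KiB of data per full stripe (assumed sizes)
aligned = is_full_stripe_write(0, 128 * 1024, data_disks=2)
small = is_full_stripe_write(0, 16 * 1024, data_disks=2)
shifted = is_full_stripe_write(64 * 1024, 128 * 1024, data_disks=2)
```

Only the first write avoids a read-modify-write; the other two touch part of 
a stripe and are the ones exposed to the write hole.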

Just out of curiosity, what is the "minimum of 384K"? In a 3-disk raid5 case, the 
minimum is 64K * 2 of data (+ 64K of parity).
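
The arithmetic behind my numbers, as a small sketch. It assumes a 64 KiB 
stripe element per disk and one parity disk; full_stripe_kib is a hypothetical 
helper, not a Btrfs function.

```python
from typing import Tuple

STRIPE_ELEMENT_KIB = 64  # assumed per-disk stripe element size


def full_stripe_kib(num_disks: int, parity_disks: int = 1) -> Tuple[int, int, int]:
    """Return (data KiB, parity KiB, total KiB) written by one full stripe."""
    data_disks = num_disks - parity_disks
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    data = data_disks * STRIPE_ELEMENT_KIB
    parity = parity_disks * STRIPE_ELEMENT_KIB
    return data, parity, data + parity


data_kib, parity_kib, total_kib = full_stripe_kib(3)  # 3-disk raid5
```

Under these assumptions a full stripe is 128 KiB of data plus 64 KiB of 
parity, 192 KiB in total.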

> But yeah, Qu just said in another thread that Liu is working on a
> journal for the raid56 write hole problem. Thing is I don't see when
> it happens in the code or in practice (so far, it's really tedious to
> poke a file system with a stick).

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5