Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
> On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
>  wrote:
> > On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>  btrfs has no optimization like mdadm write-intent bitmaps; recovery
>  is always a full-device operation.  In theory btrfs could track
>  modifications at the chunk level but this isn't even specified in the
>  on-disk format, much less implemented.
> >>> It could go even further; it would be sufficient to track which
> >>> *partial* stripe updates will be performed before a commit, in one
> >>> of the btrfs logs. Then, when mounting an unclean filesystem,
> >>> a scrub of those stripes would be sufficient.
> >
> >> A scrub cannot fix a raid56 write hole--the data is already lost.
> >> The damaged stripe updates must be replayed from the log.
> >
> > Your statement is correct, but it doesn't consider the COW nature of btrfs.
> >
> > The key is that if a data write is interrupted, the whole transaction is
> > interrupted and aborted. And due to the COW nature of btrfs, the "old
> > state" is restored at the next reboot.
> >
> > What is needed in any case is a rebuild of parity to avoid the "write-hole"
> > bug.
> 
> Write hole happens on disk in Btrfs, but the ensuing corruption on
> rebuild is detected. Corrupt data never propagates. 

Data written with nodatasum or nodatacow is corrupted without detection
(same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
journal device).

Metadata always has csums, and files have checksums if they are created
with default attributes and mount options.  Those cases are covered:
any corrupted data will give EIO on reads (except roughly once per
4 billion blocks, where the corrupted CRC happens to match at random).
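As a rough sketch of what that coverage means in practice (a hypothetical helper, not btrfs code; the kernel uses crc32c, and Python's zlib.crc32 is only a stand-in here):

```python
import zlib

def read_block(data: bytes, stored_csum: int) -> bytes:
    """Return the block only if its checksum matches; otherwise raise EIO.

    Hypothetical helper: btrfs uses crc32c in the kernel, zlib.crc32 is a
    stand-in.  Random corruption slips through only when the corrupted
    block's CRC collides with the stored one (~1 in 2**32).
    """
    if zlib.crc32(data) != stored_csum:
        raise IOError("EIO: checksum mismatch")  # corruption detected
    return data

good = b"hello" * 100
csum = zlib.crc32(good)
assert read_block(good, csum) == good   # clean read passes

corrupted = b"X" + good[1:]
try:
    read_block(corrupted, csum)
except IOError:
    pass  # corrupted data is never returned to the caller
```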

> The problem is that Btrfs gives up when it's detected.

Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
combinations of recovery blocks for raid6, and earlier kernels than
those would not recover correctly for raid5 either.  I think this has
all been fixed in recent kernels but I haven't tested these myself so
don't quote me on that.

Other than that, btrfs doesn't give up in the write hole case.
It rebuilds the data according to the raid5/6 parity algorithm, but
the algorithm doesn't produce correct data for interrupted RMW writes
when there is no stripe update journal.  There is nothing else to try
at that point.  By the time the error is detected the opportunity to
recover the data has long passed.

The data that comes out of the recovery algorithm is a mixture of old
and new data from the filesystem.  The "new" data is something that
was written just before a failure, but the "old" data could be data
of any age, even a block of free space, that previously existed on the
filesystem.  If you bypass the EIO from the failing csums (e.g. by using
btrfs rescue) it will appear as though someone took the XOR of pairs of
random blocks from the disk and wrote it over one of the data blocks
at random.  When this happens to btrfs metadata, it is effectively a
fuzz tester for tools like 'btrfs check' which will often splat after
a write hole failure happens.
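A toy simulation of that failure mode (an assumed 3-disk raid5 with one stripe of two data blocks plus parity; illustration only, not btrfs code) shows how the rebuilt block ends up as exactly that XOR of old and new data:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR, standing in for raid5 parity math."""
    return bytes(x ^ y for x, y in zip(a, b))

# Toy 3-disk raid5 stripe: two 8-byte data blocks plus parity.
d0_old, d1_old = b"OLDdata0", b"OLDdata1"
parity = xor(d0_old, d1_old)

# RMW update of d0 is interrupted: d0 is rewritten on disk, but the
# matching parity write never happens (crash / power loss).
d0_new = b"NEWdata0"
# parity still equals d0_old ^ d1_old -- now out of sync with d0_new.

# Later the disk holding d1 fails; reconstruct d1 from d0 and parity.
d1_rebuilt = xor(d0_new, parity)

assert d1_rebuilt != d1_old  # committed old data is gone
# What comes out is old XOR new -- the "mixture" described above:
assert d1_rebuilt == xor(xor(d0_new, d0_old), d1_old)
```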

> If it assumes just a bit flip (not always a correct assumption, but
> reasonable most of the time), it could iterate very quickly.

That is not how write hole works (or csum recovery for that matter).
Write hole producing a single bit flip would occur extremely rarely
outside of contrived test cases.

Recall that in a write hole, one or more 4K blocks are updated on some
of the disks in a stripe, but other blocks retain their original values
from prior to the update.  This is OK as long as all disks are online,
since the parity can be ignored or recomputed from the data blocks.  It is
also OK if the writes on all disks are completed without interruption,
since the data and parity eventually become consistent when all writes
complete as intended.  It is also OK if the entire stripe is written at
once, since then there is only one transaction referring to the stripe,
and if that transaction is not committed then the content of the stripe
is irrelevant.

The write hole error event is when all of the following occur:

- a stripe containing committed data from one or more btrfs
  transactions is modified by a raid5/6 RMW update in a new
  transaction.  This is the usual case on a btrfs filesystem
  with the default, 'nossd', or 'ssd' mount options.

- the write is not completed (due to crash, power failure, disk
  failure, bad sector, SCSI timeout, bad cable, firmware bug, etc.),
  so the parity block is out of sync with the modified data blocks
  (before or after, order doesn't matter).

- the array is already degraded, or later becomes degraded before
  the parity block can be recomputed by a scrub.

Users can run a scrub immediately after _every_ unclean shutdown to
reduce the risk of inconsistent parity and unrecoverable data.

Re: bug? fstrim only trims unallocated space, not unused in bg's

2018-03-31 Thread Chris Murphy
On Mon, Nov 20, 2017 at 11:10 PM, Qu Wenruo  wrote:
>
>
> On 2017-11-21 13:58, Chris Murphy wrote:
>> On Mon, Nov 20, 2017 at 9:58 PM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2017-11-21 12:49, Chris Murphy wrote:
 On Mon, Nov 20, 2017 at 9:43 PM, Qu Wenruo  wrote:
>
>
>>
>> Apply in addition to previous patch? Or apply to clean v4.14?
>
> On previous patch.

 Refuses to apply with or without previous patch.

 $ git apply -v ~/qufstrim3.patch
 Checking patch fs/btrfs/extent-tree.c...
 error: while searching for:
int dev_ret = 0;
int ret = 0;

/*
 * try to trim all FS space, our block group may start from non-zero.
 */

 error: patch failed: fs/btrfs/extent-tree.c:10972
 error: fs/btrfs/extent-tree.c: patch does not apply

>>>
>>> Please try this branch.
>>>
>>> It's just previous patch and diff merged together and applied on v4.14
>>> tag from torvalds.
>>>
>>> https://github.com/adam900710/linux/tree/tmp
>>
>> # fstrim -v /
>> /: 38 GiB (40767586304 bytes) trimmed
>> # dmesg
>>
>> ..snip...
>> [   46.408792] BTRFS info (device nvme0n1p8): trimming btrfs, start=0
>> len=75161927680 minlen=512
>> [   46.408800] BTRFS info (device nvme0n1p8): bg start=140882477056
>> len=1073741824
>> [   46.433867] BTRFS info (device nvme0n1p8): trimming done
>
> Great (for the output, not for the trimming failure).
>
> And the problem is very obvious now.
> 140882477056 << First chunk start
> 75161927680  << length of fstrim_range passed in
>
> Obviously, the fstrim_range passed in is using the filesystem size the
> caller assumes.
>
> Meanwhile we stupidly use the range in fstrim_range without considering
> the fact that we're dealing with the *btrfs logical address space*,
> where a chunk can start at any bytenr (well, at least aligned to
> sectorsize).
> When I read the code I also thought the range check was perfectly fine.
>
> So the truth here is, we should never try to check the range from
> fstrim_range.
>
> And the problem means that, on a normal btrfs with some usage and after
> several full balances, fstrim will only trim the unallocated space.
>
> Now the fix should not be hard to craft.
>
> Great thanks for all your help to locate the problem.

I still see this problem in 4.16.0-0.rc7.git1.1.fc29.x86_64. Are there
any patches to test?

[chris@f27h mnt]$ sudo btrfs fi us /
Overall:
Device size:  70.00GiB
Device allocated:  60.06GiB
Device unallocated:   9.94GiB
Device missing: 0.00B
Used:  36.89GiB
Free (estimated):  29.75GiB(min: 24.78GiB)
Data ratio:  1.00
Metadata ratio:  1.00
Global reserve: 107.77MiB(used: 12.38MiB)

Data,single: Size:55.00GiB, Used:35.19GiB
   /dev/nvme0n1p9  55.00GiB

Metadata,single: Size:5.00GiB, Used:1.70GiB
   /dev/nvme0n1p9   5.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB
   /dev/nvme0n1p9  64.00MiB

Unallocated:
   /dev/nvme0n1p9   9.94GiB
[chris@f27h mnt]$ sudo fstrim -v /
[sudo] password for chris:
/: 10 GiB (10669260800 bytes) trimmed
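The mismatch Qu describes above can be sketched as a toy range check (the helper names are hypothetical, not btrfs code; the numbers are taken from Qu's analysis):

```python
# Numbers from Qu's analysis above.
fstrim_len = 75161927680          # range length fstrim passes in (~ fs size)
first_chunk_start = 140882477056  # btrfs *logical* address of the first chunk

chunks = [(first_chunk_start, 1073741824)]  # (logical start, length)

def trimmed_chunks_buggy(range_start, range_len, chunks):
    """Buggy idea: clamp to the caller's range as if it were logical space.

    A chunk whose logical start lies beyond range_start + range_len is
    skipped -- which is every allocated chunk after a few balances.
    """
    end = range_start + range_len
    return [c for c in chunks if c[0] < end]

def trimmed_chunks_fixed(range_start, range_len, chunks):
    """Fixed idea: walk all block groups, ignoring the caller's range
    when it is interpreted against the logical address space."""
    return list(chunks)

assert trimmed_chunks_buggy(0, fstrim_len, chunks) == []      # no bg trimmed
assert trimmed_chunks_fixed(0, fstrim_len, chunks) == chunks  # all bgs trimmed
```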




-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dead or dying SDXC card fsck's OK but mount hangs indefinitely

2018-03-31 Thread Chris Murphy
On Sat, Mar 31, 2018 at 4:40 PM, Adam Borowski  wrote:
> On Sat, Mar 31, 2018 at 03:50:03PM -0600, Chris Murphy wrote:
>> This is perhaps a novelty problem report. And it's also a throw away
>> card data wise, and is a $12 Samsung EVO+ SDXC card used in an Intel
>> NUC.
>>
>> It contains FAT, ext4, and Btrfs file systems. They can all be fsck'd,
>> but none can be mounted. Even if I use blockdev --setro on the entire
>> mmc device and then mount -o ro, it still fails. Kinda weird huh?
>
>> [69107.488123] mmc0: Card stuck in programming state! mmcblk0 
>> card_busy_detect
>> [69107.539128] mmc0: Tuning timeout, falling back to fixed sampling clock
>
>> The corrupt 10 was fixed with a scrub months ago, and I never reset
>> the counter so that's not a current corruption. The most recent scrub
>> was maybe a week ago. And even offline scrub works. So literally every
>> single block related to this Btrfs file system is readable, and yet
>> it's not mountable? Very weird.
>
>> Basically the card is in some sort of state that the kernel code is
>> not going to ever second guess. So there's no further error or reset
>> attempt.
>
> As the SD card is tiny compared to any hard disks (be they spinning or
> solid), I'd dd it as an image as the first step.
>
> This in general is a wise step for any kind of recovery, but in your case
> it's especially easy.  It also lets you separate hardware issues from
> filesystem corruption.

It's definitely not file system corruption. The FAT and ext4 volumes
have the same issue, they will not mount, presumably because they
immediately want to update a superblock.

Anyway, recovery not required. Just a spectacle that it fails in a
readable state (it's completely readable once mounted with
ro,nologreplay) but not a single write can happen.

Also, the blkdiscard command succeeds but does nothing. All the data is
still present.

-- 
Chris Murphy


Re: Status of RAID5/6

2018-03-31 Thread Chris Murphy
On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
 wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
 btrfs has no optimization like mdadm write-intent bitmaps; recovery
 is always a full-device operation.  In theory btrfs could track
 modifications at the chunk level but this isn't even specified in the
 on-disk format, much less implemented.
>>> It could go even further; it would be sufficient to track which
>>> *partial* stripe updates will be performed before a commit, in one
>>> of the btrfs logs. Then, when mounting an unclean filesystem,
>>> a scrub of those stripes would be sufficient.
>
>> A scrub cannot fix a raid56 write hole--the data is already lost.
>> The damaged stripe updates must be replayed from the log.
>
> Your statement is correct, but it doesn't consider the COW nature of btrfs.
>
> The key is that if a data write is interrupted, the whole transaction is
> interrupted and aborted. And due to the COW nature of btrfs, the "old state"
> is restored at the next reboot.
>
> What is needed in any case is a rebuild of parity to avoid the "write-hole" bug.

Write hole happens on disk in Btrfs, but the ensuing corruption on
rebuild is detected. Corrupt data never propagates. The problem is
that Btrfs gives up when it's detected.

If it assumes just a bit flip (not always a correct assumption, but
reasonable most of the time), it could iterate very quickly: flip a
bit, then recompute and compare the checksum. It doesn't have to
iterate across 64KiB times the number of devices. It really only has
to iterate bit flips on the particular 4KiB block that has failed csum
(or in the case of metadata, 16KiB for the default leaf size, up to a
max of 64KiB).

That's a maximum of 32768 iterations (one per bit of a 4KiB block) and
comparisons. It'd be quite fast. And going for two bit flips, while a
lot slower, is probably not all that bad either.
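The single-bit search described here can be sketched as follows (illustration only; btrfs does not do this, and zlib.crc32 is a stand-in for the kernel's crc32c):

```python
import zlib

def recover_single_bitflip(block: bytes, want_csum: int):
    """Brute-force every single-bit flip until the checksum matches.

    A 4KiB block has 4096 * 8 = 32768 candidate flips, so the loop is
    cheap.  Returns the repaired block, or None if no single flip works.
    """
    for i in range(len(block) * 8):
        flipped = bytearray(block)
        flipped[i // 8] ^= 1 << (i % 8)
        if zlib.crc32(bytes(flipped)) == want_csum:
            return bytes(flipped)
    return None  # not a single-bit error

good = bytes(range(256)) * 16            # a 4096-byte block
csum = zlib.crc32(good)
bad = bytearray(good)
bad[100] ^= 0x04                         # flip one bit

assert recover_single_bitflip(bytes(bad), csum) == good
```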

Now if it's the kind of corruption you get from a torn or misdirected
write, there's enough corruption that you're now trying to find a
collision on crc32c with a partial match as a guide. That'd take a
while, and, who knows, you might actually get corrupted data anyway,
since crc32c isn't cryptographically secure.


-- 
Chris Murphy


Re: dead or dying SDXC card fsck's OK but mount hangs indefinitely

2018-03-31 Thread Chris Murphy
On Sat, Mar 31, 2018 at 3:50 PM, Chris Murphy  wrote:
> This is perhaps a novelty problem report. And it's also a throw away
> card data wise, and is a $12 Samsung EVO+ SDXC card used in an Intel
> NUC.
>
> Kernel is 4.15.14-300.fc27.x86_64
>
> It contains FAT, ext4, and Btrfs file systems. They can all be fsck'd,
> but none can be mounted. Even if I use blockdev --setro on the entire
> mmc device and then mount -o ro, it still fails. Kinda weird huh?
>
> [68498.521260] mmc0: new ultra high speed SDR104 SDHC card at address 59b4
> [68498.521990] mmcblk0: mmc0:59b4 EB2MW 29.8 GiB
> [68498.530899]  mmcblk0: p1 p2 p3 p4
> [68507.152842] BTRFS info (device mmcblk0p4): using free space tree
> [68507.152919] BTRFS info (device mmcblk0p4): has skinny extents
> [68507.165855] BTRFS info (device mmcblk0p4): bdev /dev/mmcblk0p4
> errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
> [68507.192935] BTRFS info (device mmcblk0p4): enabling ssd optimizations
> [69107.488123] mmc0: Card stuck in programming state! mmcblk0 card_busy_detect
> [69107.539128] mmc0: Tuning timeout, falling back to fixed sampling clock
>
>
> The corrupt 10 was fixed with a scrub months ago, and I never reset
> the counter so that's not a current corruption. The most recent scrub
> was maybe a week ago. And even offline scrub works. So literally every
> single block related to this Btrfs file system is readable, and yet
> it's not mountable? Very weird.
>
> Anyway, that's it. No further errors. No timeouts. User space is hung
> on the mount command. mmcqd/0 is using 7% CPU, the stack for that
> process is:
>
> [root@f27s ~]# cat /proc/2865/stack
> [<0>] mmc_wait_for_req_done+0x7b/0x130 [mmc_core]
> [<0>] mmc_wait_for_cmd+0x66/0x90 [mmc_core]
> [<0>] __mmc_send_status+0x70/0xa0 [mmc_core]
> [<0>] card_busy_detect+0x59/0x160 [mmc_block]
> [<0>] mmc_blk_err_check+0x170/0x640 [mmc_block]
> [<0>] mmc_start_areq+0xc6/0x3c0 [mmc_core]
> [<0>] mmc_blk_issue_rw_rq+0xcf/0x3b0 [mmc_block]
> [<0>] mmc_blk_issue_rq+0x298/0x7c0 [mmc_block]
> [<0>] mmc_queue_thread+0xce/0x160 [mmc_block]
> [<0>] kthread+0x113/0x130
> [<0>] ret_from_fork+0x35/0x40
> [<0>] 0x
> [root@f27s ~]#
>
> Basically the card is in some sort of state that the kernel code is
> not going to ever second guess. So there's no further error or reset
> attempt.
>
> # cat /sys/block/mmcblk0/queue/scheduler
> noop [deadline] cfq
>
>
> This part of the stack trace is interesting:
>
> mmc_blk_issue_rw_rq
>
> Something wants to write something, even though I'm using mount -o ro,
> and also blockdev --setro should prevent any writes from being
> attempted? Almost sounds like user error...

OK starting over:

[root@f27s ~]# blockdev --setro /dev/mmcblk0
[root@f27s ~]# blockdev --report /dev/mmcblk0
RORA   SSZ   BSZ   StartSecSize   Device
ro   256   512  4096  0 32026656768   /dev/mmcblk0
[root@f27s ~]# mount -o ro,nologreplay /dev/mmcblk0p4 /mnt/sd

This works. It was doing log replay and expected to write something
even when mounted ro, I guess. I expect log replay when mounting ro,
but I wasn't expecting it to also write something; apparently it's not
an in-memory-only kind of log replay.




-- 
Chris Murphy


dead or dying SDXC card fsck's OK but mount hangs indefinitely

2018-03-31 Thread Chris Murphy
This is perhaps a novelty problem report. And it's also a throw away
card data wise, and is a $12 Samsung EVO+ SDXC card used in an Intel
NUC.

Kernel is 4.15.14-300.fc27.x86_64

It contains FAT, ext4, and Btrfs file systems. They can all be fsck'd,
but none can be mounted. Even if I use blockdev --setro on the entire
mmc device and then mount -o ro, it still fails. Kinda weird huh?

[68498.521260] mmc0: new ultra high speed SDR104 SDHC card at address 59b4
[68498.521990] mmcblk0: mmc0:59b4 EB2MW 29.8 GiB
[68498.530899]  mmcblk0: p1 p2 p3 p4
[68507.152842] BTRFS info (device mmcblk0p4): using free space tree
[68507.152919] BTRFS info (device mmcblk0p4): has skinny extents
[68507.165855] BTRFS info (device mmcblk0p4): bdev /dev/mmcblk0p4
errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[68507.192935] BTRFS info (device mmcblk0p4): enabling ssd optimizations
[69107.488123] mmc0: Card stuck in programming state! mmcblk0 card_busy_detect
[69107.539128] mmc0: Tuning timeout, falling back to fixed sampling clock


The corrupt 10 was fixed with a scrub months ago, and I never reset
the counter so that's not a current corruption. The most recent scrub
was maybe a week ago. And even offline scrub works. So literally every
single block related to this Btrfs file system is readable, and yet
it's not mountable? Very weird.

Anyway, that's it. No further errors. No timeouts. User space is hung
on the mount command. mmcqd/0 is using 7% CPU, the stack for that
process is:

[root@f27s ~]# cat /proc/2865/stack
[<0>] mmc_wait_for_req_done+0x7b/0x130 [mmc_core]
[<0>] mmc_wait_for_cmd+0x66/0x90 [mmc_core]
[<0>] __mmc_send_status+0x70/0xa0 [mmc_core]
[<0>] card_busy_detect+0x59/0x160 [mmc_block]
[<0>] mmc_blk_err_check+0x170/0x640 [mmc_block]
[<0>] mmc_start_areq+0xc6/0x3c0 [mmc_core]
[<0>] mmc_blk_issue_rw_rq+0xcf/0x3b0 [mmc_block]
[<0>] mmc_blk_issue_rq+0x298/0x7c0 [mmc_block]
[<0>] mmc_queue_thread+0xce/0x160 [mmc_block]
[<0>] kthread+0x113/0x130
[<0>] ret_from_fork+0x35/0x40
[<0>] 0x
[root@f27s ~]#

Basically the card is in some sort of state that the kernel code is
not going to ever second guess. So there's no further error or reset
attempt.

# cat /sys/block/mmcblk0/queue/scheduler
noop [deadline] cfq


This part of the stack trace is interesting:

mmc_blk_issue_rw_rq

Something wants to write something, even though I'm using mount -o ro,
and also blockdev --setro should prevent any writes from being
attempted? Almost sounds like user error...



-- 
Chris Murphy


Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote:
> On 31.03.2018 11:16, Goffredo Baroncelli wrote:
> > On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
> >>> The key is that if a data write is interrupted, all the transaction
> >>> is interrupted and aborted. And due to the COW nature of btrfs, the
> >>> "old state" is restored at the next reboot.
> > 
> >> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> >> RMW operations which are not COW and don't provide any data integrity
> >> guarantee.  Old data (i.e. data from very old transactions that are not
> >> part of the currently written transaction) can be destroyed by this.
> > 
> > Could you elaborate a bit ?
> > 
> > Generally speaking, updating part of a stripe requires a RMW cycle, because:
> > - you need to read all the data stripes (and the parity, in case of a problem)
> > - then you write:
> >   - the new data
> >   - the new parity (calculated on the basis of the first read and the
> >     new data)
> > 
> > However the "old" data should be untouched; or are you saying that the
> > "old" data is rewritten with the same data?
> > 
> 
> If the old data block becomes unavailable, it can no longer be reconstructed,
> because the old contents of the "new data" and "new parity" blocks are lost.
> Fortunately, if checksums are in use this does not cause silent data
> corruption, but it effectively means data loss.
> 
> Writing data belonging to an unrelated transaction affects previous
> transactions precisely because of the RMW cycle. This fundamentally
> violates btrfs's claim of always having either the old or the new
> consistent state.

Correct.

To fix this, any RMW stripe update on raid56 has to be written to a
log first.  All RMW updates must be logged because a disk failure could
happen at any time.

Full stripe writes don't need to be logged because all the data in the
stripe belongs to the same transaction, so if a disk fails the entire
stripe is either committed or it is not.

One way to avoid the logging is to change the btrfs allocation parameters
so that the filesystem doesn't allocate data in RAID stripes that are
already occupied by data from older transactions.  This is similar to
what 'ssd_spread' does, although the ssd_spread option wasn't designed
for this and won't be effective on large arrays.  This avoids modifying
stripes that contain old committed data, but it also means the free space
on the filesystem will become heavily fragmented over time.  Users will
have to run balance *much* more often to defragment the free space.
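The logging rule described above can be sketched as a toy predicate (hypothetical names and a simplified stripe model, not btrfs code):

```python
STRIPE_DATA_BLOCKS = 4  # toy raid5 stripe: 4 data blocks

def needs_log(write_blocks: set, committed_blocks: set) -> bool:
    """Must this stripe write go through the (hypothetical) RMW log?

    Sketch of the rule above: a partial (RMW) update of a stripe that
    still holds committed data from older transactions must be logged;
    a full-stripe write, or a write into a stripe with no committed
    data (the ssd_spread-like allocation idea), need not be.
    """
    full_stripe = len(write_blocks) == STRIPE_DATA_BLOCKS
    touches_committed = bool(committed_blocks - write_blocks)
    return touches_committed and not full_stripe

# Full stripe write of fresh space: nothing to log.
assert not needs_log({0, 1, 2, 3}, committed_blocks=set())
# RMW update of one block in a stripe holding older committed data: log it.
assert needs_log({1}, committed_blocks={0, 2, 3})
# Partial write into an otherwise-empty stripe (allocation avoidance): no log.
assert not needs_log({1}, committed_blocks=set())
```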





Re: Status of RAID5/6

2018-03-31 Thread Goffredo Baroncelli
On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
>> The key is that if a data write is interrupted, all the transaction
>> is interrupted and aborted. And due to the COW nature of btrfs, the
>> "old state" is restored at the next reboot.

> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> RMW operations which are not COW and don't provide any data integrity
> guarantee.  Old data (i.e. data from very old transactions that are not
> part of the currently written transaction) can be destroyed by this.

Could you elaborate a bit?

Generally speaking, updating part of a stripe requires a RMW cycle, because:
- you need to read all the data stripes (and the parity, in case of a problem)
- then you write:
  - the new data
  - the new parity (calculated on the basis of the first read and the
    new data)

However the "old" data should be untouched; or are you saying that the
"old" data is rewritten with the same data?
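The RMW cycle described above, using the usual parity shortcut P_new = P_old xor D_old xor D_new, can be sketched as (toy model, not btrfs code):

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR, standing in for raid5 parity math."""
    return bytes(x ^ y for x, y in zip(a, b))

# Toy raid5 stripe: three data blocks and their parity.
d = [b"\x11" * 4, b"\x22" * 4, b"\x33" * 4]
p = xor(xor(d[0], d[1]), d[2])

# RMW update of d[1]: read the old data block and old parity, then
# write the new data block and the recomputed parity.
new_d1 = b"\x99" * 4
new_p = xor(xor(p, d[1]), new_d1)   # P_new = P_old ^ D_old ^ D_new

d[1] = new_d1
assert new_p == xor(xor(d[0], d[1]), d[2])  # parity consistent again
```

Note the other data blocks (d[0], d[2]) are never rewritten; they are only endangered if the two writes above are interrupted between them.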

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Status of RAID5/6

2018-03-31 Thread Zygo Blaxell
On Sat, Mar 31, 2018 at 08:57:18AM +0200, Goffredo Baroncelli wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>> is always a full-device operation.  In theory btrfs could track
> >>> modifications at the chunk level but this isn't even specified in the
> >>> on-disk format, much less implemented.
> >> It could go even further; it would be sufficient to track which
> >> *partial* stripe updates will be performed before a commit, in one
> >> of the btrfs logs. Then, when mounting an unclean filesystem,
> >> a scrub of those stripes would be sufficient.
> 
> > A scrub cannot fix a raid56 write hole--the data is already lost.
> > The damaged stripe updates must be replayed from the log.
> 
> Your statement is correct, but it doesn't consider the COW nature of btrfs.
> 
> The key is that if a data write is interrupted, the whole transaction
> is interrupted and aborted. And due to the COW nature of btrfs, the
> "old state" is restored at the next reboot.

This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
RMW operations which are not COW and don't provide any data integrity
guarantee.  Old data (i.e. data from very old transactions that are not
part of the currently written transaction) can be destroyed by this.

> What is needed in any case is a rebuild of parity to avoid the
> "write-hole" bug. And this is needed only for a partial stripe
> write. For a full stripe write, since the commit is not yet flushed,
> no scrub is needed at all.
> 
> Of course for NODATACOW files this is not entirely true; but I don't
> see the gain in switching from the cost of COW to the cost of a log.
> 
> The above sentences are correct (IMHO) if we don't consider the power
> failure + missing device case. However in that case even logging the
> "new data" would not be sufficient.
> 
> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

