Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 14:45, Chris Murphy wrote:

>> I think it's statistically 0 people changing this from default. It's
>> people with drives that have no SCT ERC support, used in raid1+, who
>> happen to stumble upon this very obscure workaround to avoid link
>> resets in the face of media defects. Rare.
>
> Not as much as you think; once someone has this issue, they usually put
> preventative measures in place on any system where it applies.  I'd be
> willing to bet that most sysadmins at big companies like Red Hat or Oracle
> are setting this.

SCT ERC yes. Changing the kernel's command timer? I think almost zero.
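For reference, the two knobs look like this; a minimal sketch assuming
smartctl from smartmontools and a hypothetical SATA disk at /dev/sda:

  # Query the current SCT ERC setting (values are tenths of a second)
  smartctl -l scterc /dev/sda

  # Set SCT ERC to 7.0 seconds for both reads and writes
  smartctl -l scterc,70,70 /dev/sda

  # The kernel's per-device command timer, in seconds (default 30)
  cat /sys/block/sda/device/timeout
  echo 180 > /sys/block/sda/device/timeout

The first is the drive-side setting sysadmins do change; the second is
the kernel-side timer almost nobody touches.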



>> Well they have link resets and their file system presumably face
>> plants as a result of a pile of commands in the queue returning as
>> unsuccessful. So they have premature death of their system, rather
>> than it getting sluggish. This is a long-standing indicator on Windows
>> to just reinstall the OS and restore data from backups -> the user has
>> an opportunity to freshen up user data backup, and the reinstallation
>> and restore from backup results in freshly written sectors which is
>> how bad sectors get fixed. The marginally bad sectors get new writes
>> and now read fast (or fast enough), and the persistently bad sectors
>> result in the drive firmware remapping to reserve sectors.
>>
>> The main thing in my opinion is less the extension of drive life than
>> that, as it is, the user gets to use the system, albeit sluggishly, to
>> make a backup of their data rather than possibly losing it.
>
> The extension of the drive's lifetime is a nice benefit, but not what my
> point was here.  For people in this particular case, it will almost
> certainly only make things better (although at first it may make performance
> worse).

I'm not sure why it makes performance worse. The options are slower
reads vs. a file system that almost certainly face plants upon a link
reset.




>> Basically it's:
>>
>> For SATA and USB drives:
>>
>> if data redundant, then enable short SCT ERC time if supported, if not
>> supported then extend SCSI command timer to 200;
>>
>> if data not redundant, then disable SCT ERC if supported, and extend
>> SCSI command timer to 200.
>>
>> For SCSI (SAS most likely these days), keep things the same as now.
>> But that's only because this is a rare enough configuration now I
>> don't know if we really know the problems there. It may be that their
>> error recovery in 7 seconds is massively better and more reliable than
>> consumer drives over 180 seconds.
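As a sketch, that policy is roughly this much shell. DEV and REDUNDANT
are assumptions (the device name, and whether some higher layer provides
redundancy), and the support check assumes smartctl exits non-zero when
SCT commands are unsupported:

  #!/bin/sh
  # Sketch only, not a tested implementation.
  DEV=sda
  REDUNDANT=yes

  if smartctl -l scterc "/dev/$DEV" >/dev/null 2>&1; then
      if [ "$REDUNDANT" = yes ]; then
          # Redundant: fail fast (7.0s) and let the upper layer repair
          smartctl -q silent -l scterc,70,70 "/dev/$DEV"
      else
          # Not redundant: let the drive recover for as long as it can
          smartctl -q silent -l scterc,0,0 "/dev/$DEV"
          echo 200 > "/sys/block/$DEV/device/timeout"
      fi
  else
      # No usable SCT ERC: extend the kernel command timer instead
      echo 200 > "/sys/block/$DEV/device/timeout"
  fi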
>
> I don't see why you would think this is not common.

I was not clear. Single device SAS is probably not common. They're
typically being used in arrays where data is redundant. Using such a
drive with short error recovery as a single boot drive? Probably not
that common.



> Separately, USB gets _really_ complicated if you want to cover everything:
> USB drives may or may not present as non-rotational, may or may not show up
> as SATA or SCSI bridges (there are some of the more expensive flash drives
> that actually use SSD controllers plus USB-SAT chips internally), if they do
> show up as such, may or may not support the required commands (most don't,
> but it's seemingly hit or miss which do).

Yup. Well, do what we can instead of just ignoring the problem? They
can still be polled for features including SCT ERC and if it's not
supported or configurable, then fall back to increasing the command
timer. I'm not sure what else can be done anyway.

The main obstacle is squaring the device capability (low level) with
storage stack redundancy 0 or 1 (high level). Something has to be
aware of both to get all devices ideally configured.



>> Yep it's imperfect unless there's the proper cross communication
>> between layers. There are some such things like hardware raid geometry
>> that optionally poke through (when supported by hardware raid drivers)
>> so that things like mkfs.xfs can automatically provide the right sunit
>> swidth for optimized layout; which the device mapper already does
>> automatically. So it could be done it's just a matter of how big of a
>> problem is this to build it, vs just going with a new one size fits
>> all default command timer?
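Done by hand, that geometry pass-through looks like the line below; the
su/sw values are hypothetical, for a 5-disk raid5 with 64KiB chunks,
and on md and dm devices mkfs.xfs detects them automatically:

  mkfs.xfs -d su=64k,sw=4 /dev/sdb1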
>
> The other problem though is that the existing things pass through
> _read-only_ data, while this requires writable data to be passed through,
> which leads to all kinds of complicated issues potentially.

I'm aware. There are also plenty of bugs even if writes were passed
through. I've encountered more drives than not which accept only one
SCT ERC change per poweron. A 2nd change causes the drive to go offline
and vanish off the bus. So no doubt this whole area is fragile enough
that not even the drive, controller, and enclosure vendors know where
all the bodies are buried.

What I think is fairly well established is that at least on Windows
their lower level stuff including kernel 

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 14:45, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
 wrote:

On 2016-07-06 12:43, Chris Murphy wrote:



So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.


Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.


I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure workaround to avoid link
resets in the face of media defects. Rare.
Not as much as you think; once someone has this issue, they usually put
preventative measures in place on any system where it applies.  I'd be
willing to bet that most sysadmins at big companies like Red Hat or
Oracle are setting this.




2. People using single disks which have lots of errors may or may not see an
apparent degradation of performance, but will likely have the life
expectancy of their device extended.


Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long-standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less the extension of drive life than
that, as it is, the user gets to use the system, albeit sluggishly, to
make a backup of their data rather than possibly losing it.
The extension of the drive's lifetime is a nice benefit, but not what my 
point was here.  For people in this particular case, it will almost 
certainly only make things better (although at first it may make 
performance worse).




3. Individuals who are not setting this but should be will on average be no
worse off than before other than seeing a bigger performance hit on a disk
error.
4. People with single disks which are new will see no functional change
until the disk has an error.


I follow.




In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give roughly
10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the
expected disk timeout to account for things like latency outside of the
disk).


Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enabling SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make 
judgements based on userspace usage.  Also, the first situation, while
not optimal, is still better than what happens now; at least there you
will get an I/O error in a reasonable amount of time (as opposed to
after a really long time, if ever).


Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.
I don't see why you would think this is not common.  If you count just 
by systems, then it's absolutely outnumbered at least 100 to 1 by 
regular ATA disks.  If you look at individual disks though, the reverse 
is true, because people who use SCSI drives tend to use _lots_ of disks 
(think big data centers, NAS and SAN systems and such).  OTOH, both are 
probably vastly outnumbered by stuff that doesn't use either standard 
for storage...


Separately, USB gets _really_ complicated if you want to cover
everything: USB drives may or may not present as non-rotational, may or
may not show up as SATA or SCSI bridges (there are some of the more
expensive flash drives that actually use SSD controllers plus USB-SAT
chips internally), if they do show up as such, may or may not support
the required commands (most don't, but it's seemingly hit or miss which
do).

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 12:43, Chris Murphy wrote:

>> So does it make sense to just set the default to 180? Or is there a
>> smarter way to do this? I don't know.
>
> Just thinking about this:
> 1. People who are setting this somewhere will be functionally unaffected.

I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure workaround to avoid link
resets in the face of media defects. Rare.


> 2. People using single disks which have lots of errors may or may not see an
> apparent degradation of performance, but will likely have the life
> expectancy of their device extended.

Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long-standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less the extension of drive life than
that, as it is, the user gets to use the system, albeit sluggishly, to
make a backup of their data rather than possibly losing it.


> 3. Individuals who are not setting this but should be will on average be no
> worse off than before other than seeing a bigger performance hit on a disk
> error.
> 4. People with single disks which are new will see no functional change
> until the disk has an error.

I follow.


>
> In an ideal situation, what I'd want to see is:
> 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
> percentage over that (probably something like 25%, which would give roughly
> 10 seconds for the normal 7 second ERC timer).
> 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
> this is reasonable for SCSI disks).
> 3. Otherwise, set the timer to 200 (we need a slight buffer over the
> expected disk timeout to account for things like latency outside of the
> disk).

Well if it's a non-redundant configuration, you'd want those long
> recoveries permitted, rather than enabling SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.

Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.




>
>>
>>
 I suspect, but haven't tested, that ZFS On Linux would be equally
 affected, unless they're completely reimplementing their own block
 layer (?) So there are quite a few parties now negatively impacted by
 the current default behavior.
>>>
>>>
>>> OTOH, I would not be surprised if the stance there is 'you get no support
>>> if
>>> you're not using enterprise drives', not because of the project itself, but
>>> because it's ZFS.  Part of their minimum recommended hardware
>>> requirements
>>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
>>> there too.
>>
>>
>> http://open-zfs.org/wiki/Hardware
>> "Consistent performance requires hard drives that support error
>> recovery control. "
>>
>> "Drives that lack such functionality can be expected to have
>> arbitrarily high limits. Several minutes is not impossible. Drives
>> with this functionality typically default to 7 seconds. ZFS does not
>> currently adjust this setting on drives. However, it is advisable to
>> write a script to set the error recovery time to a low value, such as
>> 0.1 seconds until ZFS is modified to control it. This must be done on
>> every boot. "
>>
>> They do not explicitly require enterprise drives, but they clearly
>> expect SCT ERC enabled to some sane value.
>>
>> At least for Btrfs and ZFS, the mkfs is in a position to know all
>> parameters for properly setting SCT ERC and the SCSI command timer for
>> every device.

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 12:43, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
 wrote:

On 2016-07-05 19:05, Chris Murphy wrote:


Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.


The thing I don't get about this is that while the per-device settings on a
given system are policy, the default value is not, and should be expected to
work correctly (but not necessarily optimally) on as many systems as
possible, so any claim that this should be fixed in udev is bogus by the
regular kernel rules.


Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, so the top-end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.

Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.
2. People using single disks which have lots of errors may or may not 
see an apparent degradation of performance, but will likely have the 
life expectancy of their device extended.
3. Individuals who are not setting this but should be will on average be 
no worse off than before other than seeing a bigger performance hit on a 
disk error.
4. People with single disks which are new will see no functional change 
until the disk has an error.


In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give 
roughly 10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer 
(IIRC, this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the 
expected disk timeout to account for things like latency outside of the 
disk).




I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.


OTOH, I would not be surprised if the stance there is 'you get no support if
> you're not using enterprise drives', not because of the project itself, but
because it's ZFS.  Part of their minimum recommended hardware requirements
is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
there too.


http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries; whereas raid1, 5, 6 need to set
things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
I"m pretty certain they don't create rules, or even try to check the 
drive for SCT ERC support.  The problem with doing this is that you 
can't be certain that your underlying device is actually a physical 
storage device or not, and thus you have to check more than just the SCT 
ERC commands, and many people (myself included) don't like tools doing 
things that modify the persistent functioning of their system that the 
tool itself is not intended to do (and messing with block layer settings 
falls into that category for a mkfs tool).



Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-05 19:05, Chris Murphy wrote:
>>
>> Related:
>> http://www.spinics.net/lists/raid/msg52880.html
>>
>> Looks like there is some traction to figuring out what to do about
>> this, whether it's a udev rule or something that happens in the kernel
>> itself. Pretty much the only hardware setup unaffected by this are
>> those with enterprise or NAS drives. Every configuration of a consumer
>> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
>> RAID Levels are adversely affected by this.
>
> The thing I don't get about this is that while the per-device settings on a
> given system are policy, the default value is not, and should be expected to
> work correctly (but not necessarily optimally) on as many systems as
> possible, so any claim that this should be fixed in udev is bogus by the
> regular kernel rules.

Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, so the top-end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.
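Querying it is harmless, though; on a drive like that, smartctl -l
scterc just reports it disabled (output abbreviated):

  SCT Error Recovery Control:
             Read: Disabled
            Write: Disabled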

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.


>> I suspect, but haven't tested, that ZFS On Linux would be equally
>> affected, unless they're completely reimplementing their own block
>> layer (?) So there are quite a few parties now negatively impacted by
>> the current default behavior.
>
> OTOH, I would not be surprised if the stance there is 'you get no support if
> you're not using enterprise drives', not because of the project itself, but
> because it's ZFS.  Part of their minimum recommended hardware requirements
> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
> there too.

http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.
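Their per-boot script amounts to something like this sketch; scterc
takes tenths of a second, so 0.1 seconds is 1, and the device glob is a
simplistic assumption:

  #!/bin/sh
  # Run at every boot, per the OpenZFS wiki advice
  for dev in /dev/sd[a-z]; do
      smartctl -q silent -l scterc,1,1 "$dev"
  done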

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries; whereas raid1, 5, 6 need to set
things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
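Such a generated rule could be as small as this sketch (file name
hypothetical); it naively raises the command timer for every whole
disk, which glosses over the hard part of matching only the right
member devices:

  # /etc/udev/rules.d/60-storage-timeout.rules
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
      RUN+="/bin/sh -c 'echo 180 > /sys/block/%k/device/timeout'"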


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-05 19:05, Chris Murphy wrote:

Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.
The thing I don't get about this is that while the per-device settings 
on a given system are policy, the default value is not, and should be 
expected to work correctly (but not necessarily optimally) on as many 
systems as possible, so any claim that this should be fixed in udev is
bogus by the regular kernel rules.


I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.
OTOH, I would not be surprised if the stance there is 'you get no 
support if you're not using enterprise drives', not because of the project
itself, but because it's ZFS.  Part of their minimum recommended 
hardware requirements is ECC RAM, so it wouldn't surprise me if 
enterprise storage devices are there too.




Re: Adventures in btrfs raid5 disk recovery

2016-07-05 Thread Chris Murphy
Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.

I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.


Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Steven Haigh
On 29/06/16 04:01, Chris Murphy wrote:
> Just wiping the slate clean to summarize:
> 
> 
> 1. We have a consistent ~1 in 3, maybe 1 in 2, reproducible corruption
> of *data extent* parity during a scrub with raid5. Goffredo and I have
> both reproduced it. It's a big bug. It might still be useful if
> someone else can reproduce it too.
> 
> Goffredo, can you file a bug at bugzilla.kernel.org and reference your
> bug thread?  I don't know if the key developers know about this, it
> might be worth pinging them on IRC once the bug is filed.
> 
> Unknown if it affects balance, or raid 6. And if it affects raid 6, is
> p or q corrupted, or both? Unknown how this manifests on metadata
> raid5 profile (only data raid5 was tested). Presumably if there is
> metadata corruption that's fixed during a scrub, and its parity is
> overwritten with corrupt parity, the next time there's a degraded
> state, the file system would face plant somehow. And we've seen quite
> a few degraded raid5's (and even 6's) face plant in inexplicable ways
> and we just kinda go, shit. Which is what the fs is doing when it
> encounters a pile of csum errors. It treats the csum errors as a
> signal to disregard the fs rather than maybe only being suspicious of
> the fs. Could it turn out that these file systems were recoverable,
> just that Btrfs wasn't tolerating any csum error and wouldn't proceed
> further?

I believe this is the same case for RAID6 based on my experiences. I
actually wondered if the system halts were the result of a TON of csum
errors - not the actual result of those errors. Just about every system
hang when to 100% CPU usage on all cores and the system just stopped was
after a flood of csum errors. If it was only one or two (or I copied
data off via a network connection where the read rate was slower), I
found I had a MUCH lower chance of the system locking up.

In fact, now that I think about it, when I was copying data to an
external USB drive (maxed out at ~30MB/sec), I still got csum errors -
but the system never hung.

Every crash ended with a last line along the lines of "Stopped
recurring error. Your system needs rebooting". I wonder whether, if this
error reporting were altered, the system wouldn't go down.

Of course I have no way of testing this.


-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Chris Murphy
Just wiping the slate clean to summarize:


1. We have a consistent ~1 in 3, maybe 1 in 2, reproducible corruption
of *data extent* parity during a scrub with raid5. Goffredo and I have
both reproduced it. It's a big bug. It might still be useful if
someone else can reproduce it too.

Goffredo, can you file a bug at bugzilla.kernel.org and reference your
bug thread?  I don't know if the key developers know about this, it
might be worth pinging them on IRC once the bug is filed.

Unknown if it affects balance, or raid 6. And if it affects raid 6, is
p or q corrupted, or both? Unknown how this manifests on metadata
raid5 profile (only data raid5 was tested). Presumably if there is
metadata corruption that's fixed during a scrub, and its parity is
overwritten with corrupt parity, the next time there's a degraded
state, the file system would face plant somehow. And we've seen quite
a few degraded raid5's (and even 6's) face plant in inexplicable ways
and we just kinda go, shit. Which is what the fs is doing when it
encounters a pile of csum errors. It treats the csum errors as a
signal to disregard the fs rather than maybe only being suspicious of
the fs. Could it turn out that these file systems were recoverable,
just that Btrfs wasn't tolerating any csum error and wouldn't proceed
further?

2. The existing scrub code computes parity on-the-fly, compares it
with what's on-disk, and overwrites if there's a mismatch. If there's
a mismatch, there's no message anywhere. It's a feature request to get
a message on parity mismatches. An additional feature request would be
to get a parity_error counter along the lines of the other error
counters we have for scrub stats and dev stats.

3. I think it's a more significant change to get parity checksums
stored somewhere. Right now the csum tree holds item type EXTENT_CSUM
but parity is not an extent, it's also not data, it's a variant of
data. So it seems to me we'd need a new item type PARITY_CSUM to get
it into the existing csum tree. And I'm not sure what incompatibility
that brings; presumably older kernels could mount such a volume ro
safely, but shouldn't write to it, including btrfs check --repair
should probably fail.


Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Steven Haigh
On 28/06/16 22:25, Austin S. Hemmelgarn wrote:
> On 2016-06-28 08:14, Steven Haigh wrote:
>> On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
>>> On 2016-06-27 17:57, Zygo Blaxell wrote:
 On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>  wrote:
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions
>> involving
>> highly situational co-processors, computing a checksum of a parity
>> block is
>> always going to be faster than computing parity for the stripe.  By
>> using
>> that to check parity, we can safely speed up the common case of near
>> zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

 A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
 one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
 array of SSDs vs. a slow CPU.
>>> OK, great for people who are using modern desktop or server CPU's.  Not
>>> everyone has that luxury, and even on many such CPU's, it's _still_
>>> faster to compute CRC32c checksums.  On top of that, we don't appear to
>>> be using the in-kernel parity-raid libraries (or if we are, I haven't
>>> been able to find where we are calling the functions for it), so we
>>> don't necessarily get assembly optimized or co-processor accelerated
>>> computation of the parity itself.  The other thing that I didn't mention
>>> above though, is that computing parity checksums will always take less
>>> time than computing parity, because you have to process significantly
>>> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
>>> much data to do the parity checksums instead of parity itself, which
>>> means that the parity computation would need to be 200% faster than the
>>> CRC32c computation to break even, and this margin gets bigger and bigger
>>> as you add more disks.
>>>
>>> On small arrays, this obviously won't have much impact.  Once you start
>>> to scale past a few TB though, even a few hundred MB/s faster processing
>>> means a significant decrease in processing time.  Say you have a CPU
>>> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
>>> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
>>> instruction and assembly optimized RAID5 parity computations on a modern
>>> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
>>> the Xeon E3 in my home server is closer to 5%)).  Assuming those
>>> numbers, and that we're already checking checksums on non-parity blocks,
>>> processing 120TB of data in a 4 disk array (which gives 40TB of parity
>>> data, so 160TB total) gives:
>>> For computing the parity to scrub:
>>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>>> regular data
>>> 120TB / 12GB    = 10000.0 seconds for processing parity of all stripes
>>> = 19795.9 seconds total
>>> ~ 5.4 hours total
>>>
>>> For computing csums of the parity:
>>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>>> regular data
>>> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
>>> parity data
>>> = 13061.2 seconds total
>>> ~ 3.6 hours total
>>>
>>> The checksum based computation is approximately 34% faster than the
>>> parity computation.  Much of this of course is that you have to process
>>> the regular data twice for the parity computation method (once for
>>> csums, once for parity).  You could probably do one pass computing both
>>> values, but that would need to be done carefully; and, without
>>> significant optimization, would likely not get you much benefit other
>>> than cutting the number of loads in half.
>>
>> And it all means jack shit because you don't get the data to disk that
>> quick. Who cares if it's 500% faster - if it still saturates the
>> throughput of the actual drives, what difference does it make?
> It has less impact on everything else running on the system at the time
> because it uses less CPU time and potentially less memory.  This is the
> exact same reason that you want your RAID parity computation performance
> as good as possible, the less time the CPU spends on that, the more it
> can spend on other things.

Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Austin S. Hemmelgarn

On 2016-06-28 08:14, Steven Haigh wrote:

On 28/06/16 22:05, Austin S. Hemmelgarn wrote:

On 2016-06-27 17:57, Zygo Blaxell wrote:

On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.


Except that's horribly inefficient.  With limited exceptions involving
highly situational co-processors, computing a checksum of a parity
block is
always going to be faster than computing parity for the stripe.  By
using
that to check parity, we can safely speed up the common case of near
zero
errors during a scrub by a pretty significant factor.


OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.


A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.

OK, great for people who are using modern desktop or server CPU's.  Not
everyone has that luxury, and even on many such CPU's, it's _still_
faster to compute CRC32c checksums.  On top of that, we don't appear to
be using the in-kernel parity-raid libraries (or if we are, I haven't
been able to find where we are calling the functions for it), so we
don't necessarily get assembly optimized or co-processor accelerated
computation of the parity itself.  The other thing that I didn't mention
above though, is that computing parity checksums will always take less
time than computing parity, because you have to process significantly
less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
much data to do the parity checksums instead of parity itself, which
means that the parity computation would need to be 200% faster than the
CRC32c computation to break even, and this margin gets bigger and bigger
as you add more disks.

On small arrays, this obviously won't have much impact.  Once you start
to scale past a few TB though, even a few hundred MB/s faster processing
means a significant decrease in processing time.  Say you have a CPU
which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
CRC32c (~2% is a conservative ratio assuming you use the CRC32c
instruction and assembly optimized RAID5 parity computations on a modern
x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
the Xeon E3 in my home server is closer to 5%)).  Assuming those
numbers, and that we're already checking checksums on non-parity blocks,
processing 120TB of data in a 4 disk array (which gives 40TB of parity
data, so 160TB total) gives:
For computing the parity to scrub:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
regular data
120TB / 12GB    = 10000.0 seconds for processing parity of all stripes
= 19795.9 seconds total
~ 5.4 hours total

For computing csums of the parity:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
regular data
40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
parity data
= 13061.2 seconds total
~ 3.6 hours total

The checksum based computation is approximately 34% faster than the
parity computation.  Much of this of course is that you have to process
the regular data twice for the parity computation method (once for
csums, once for parity).  You could probably do one pass computing both
values, but that would need to be done carefully; and, without
significant optimization, would likely not get you much benefit other
than cutting the number of loads in half.


And it all means jack shit because you don't get the data to disk that
quick. Who cares if it's 500% faster - if it still saturates the
throughput of the actual drives, what difference does it make?
It has less impact on everything else running on the system at the time 
because it uses less CPU time and potentially less memory.  This is the 
exact same reason that you want your RAID parity computation performance 
as good as possible, the less time the CPU spends on that, the more it 
can spend on other things.  On top of that, there are high-end systems 
that do have SSD's that can get multiple GB/s of data transfer per 
second, and NVDIMM's are starting to become popular in the server 
market, and those give you data transfer speeds equivalent to regular 
memory bandwidth (which can be well over 20GB/s on decent hardware (I've 
got a relatively inexpensive system using DDR3-1866 RAM that has 

Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Steven Haigh
On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
> On 2016-06-27 17:57, Zygo Blaxell wrote:
>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>>>  wrote:
 On 2016-06-25 12:44, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>  wrote:
>
> OK but hold on. During scrub, it should read data, compute checksums
> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
> the checksum tree, and the parity strip in the chunk tree. And if
> parity is wrong, then it should be replaced.

 Except that's horribly inefficient.  With limited exceptions involving
 highly situational co-processors, computing a checksum of a parity
 block is
 always going to be faster than computing parity for the stripe.  By
 using
 that to check parity, we can safely speed up the common case of near
 zero
 errors during a scrub by a pretty significant factor.
>>>
>>> OK I'm in favor of that. Although somehow md gets away with this by
>>> computing and checking parity for its scrubs, and still manages to
>>> keep drives saturated in the process - at least HDDs, I'm not sure how
>>> it fares on SSDs.
>>
>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
>> one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
>> array of SSDs vs. a slow CPU.
> OK, great for people who are using modern desktop or server CPU's.  Not
> everyone has that luxury, and even on many such CPU's, it's _still_
> faster to compute CRC32c checksums.  On top of that, we don't appear to
> be using the in-kernel parity-raid libraries (or if we are, I haven't
> been able to find where we are calling the functions for it), so we
> don't necessarily get assembly optimized or co-processor accelerated
> computation of the parity itself.  The other thing that I didn't mention
> above though, is that computing parity checksums will always take less
> time than computing parity, because you have to process significantly
> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
> much data to do the parity checksums instead of parity itself, which
> means that the parity computation would need to be 200% faster than the
> CRC32c computation to break even, and this margin gets bigger and bigger
> as you add more disks.
> 
> On small arrays, this obviously won't have much impact.  Once you start
> to scale past a few TB though, even a few hundred MB/s faster processing
> means a significant decrease in processing time.  Say you have a CPU
> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
> instruction and assembly optimized RAID5 parity computations on a modern
> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
> the Xeon E3 in my home server is closer to 5%)).  Assuming those
> numbers, and that we're already checking checksums on non-parity blocks,
> processing 120TB of data in a 4 disk array (which gives 40TB of parity
> data, so 160TB total) gives:
> For computing the parity to scrub:
> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
> regular data
> 120TB / 12GB    = 10000.0 seconds for processing parity of all stripes
> = 19795.9 seconds total
> ~ 5.4 hours total
> 
> For computing csums of the parity:
> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
> regular data
> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
> parity data
> = 13061.2 seconds total
> ~ 3.6 hours total
> 
> The checksum based computation is approximately 34% faster than the
> parity computation.  Much of this of course is that you have to process
> the regular data twice for the parity computation method (once for
> csums, once for parity).  You could probably do one pass computing both
> values, but that would need to be done carefully; and, without
> significant optimization, would likely not get you much benefit other
> than cutting the number of loads in half.

And it all means jack shit because you don't get the data to disk that
quick. Who cares if it's 500% faster - if it still saturates the
throughput of the actual drives, what difference does it make?

I'm all for actual solutions, but the nirvana fallacy seems to apply here...

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Austin S. Hemmelgarn

On 2016-06-27 17:57, Zygo Blaxell wrote:

On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.


Except that's horribly inefficient.  With limited exceptions involving
highly situational co-processors, computing a checksum of a parity block is
always going to be faster than computing parity for the stripe.  By using
that to check parity, we can safely speed up the common case of near zero
errors during a scrub by a pretty significant factor.


OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.


A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.
OK, great for people who are using modern desktop or server CPU's.  Not 
everyone has that luxury, and even on many such CPU's, it's _still_ 
faster to compute CRC32c checksums.  On top of that, we don't appear to
be using the in-kernel parity-raid libraries (or if we are, I haven't 
been able to find where we are calling the functions for it), so we 
don't necessarily get assembly optimized or co-processor accelerated 
computation of the parity itself.  The other thing that I didn't mention 
above though, is that computing parity checksums will always take less 
time than computing parity, because you have to process significantly 
less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as 
much data to do the parity checksums instead of parity itself, which 
means that the parity computation would need to be 200% faster than the 
CRC32c computation to break even, and this margin gets bigger and bigger 
as you add more disks.


On small arrays, this obviously won't have much impact.  Once you start 
to scale past a few TB though, even a few hundred MB/s faster processing 
means a significant decrease in processing time.  Say you have a CPU 
which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
CRC32c (~2% is a conservative ratio assuming you use the CRC32c 
instruction and assembly optimized RAID5 parity computations on a modern 
x86_64 processor (the ratio on both the mobile Core i5 in my laptop and 
the Xeon E3 in my home server is closer to 5%)).  Assuming those 
numbers, and that we're already checking checksums on non-parity blocks, 
processing 120TB of data in a 4 disk array (which gives 40TB of parity 
data, so 160TB total) gives:

For computing the parity to scrub:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the 
regular data

120TB / 12GB    = 10000.0 seconds for processing parity of all stripes
= 19795.9 seconds total
~ 5.4 hours total

For computing csums of the parity:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the 
regular data
40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the 
parity data

= 13061.2 seconds total
~ 3.6 hours total

The checksum based computation is approximately 34% faster than the 
parity computation.  Much of this of course is that you have to process 
the regular data twice for the parity computation method (once for 
csums, once for parity).  You could probably do one pass computing both 
values, but that would need to be done carefully; and, without 
significant optimization, would likely not get you much benefit other 
than cutting the number of loads in half.



Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Austin S. Hemmelgarn

On 2016-06-27 23:17, Zygo Blaxell wrote:

On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:

On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
 wrote:

On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:

Btrfs does have something of a workaround for when things get slow,
and that's balance, read and rewrite everything. The write forces
sector remapping by the drive firmware for bad sectors.


It's a crude form of "resilvering" as ZFS calls it.


In what manner is it crude?


Balance relocates extents, looks up backrefs, and rewrites metadata, all
of which are extra work above what is required by resilvering (and extra
work that is proportional to the number of backrefs and the (currently
extremely poor) performance of the backref walking code, so snapshots
and large files multiply the workload).

Resilvering should just read data, reconstruct it from a mirror if
necessary, and write it back to the original location (or read one
mirror and rewrite the other).  That's more like what scrub does, except
scrub rewrites only the blocks it couldn't read (or that failed csum).
It's worth pointing out that balance was not designed for resilvering;
it was designed for reshaping arrays, converting replication profiles,
and compaction at the chunk level.  Balance is not a resilvering tool;
that just happens to be a useful side effect of running a balance
(actually, so is the chunk-level compaction).
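In command terms the distinction is just this (mount point
hypothetical):

  # scrub: read everything, verify csums, rewrite only bad blocks
  btrfs scrub start -B /mnt

  # balance: relocate every chunk, rewriting all data and metadata
  btrfs balance start /mnt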





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> >> Btrfs does have something of a workaround for when things get slow,
> >> and that's balance, read and rewrite everything. The write forces
> >> sector remapping by the drive firmware for bad sectors.
> >
> > It's a crude form of "resilvering" as ZFS calls it.
> 
> In what manner is it crude?

Balance relocates extents, looks up backrefs, and rewrites metadata, all
of which are extra work above what is required by resilvering (and extra
work that is proportional to the number of backrefs and the (currently
extremely poor) performance of the backref walking code, so snapshots
and large files multiply the workload).

Resilvering should just read data, reconstruct it from a mirror if
necessary, and write it back to the original location (or read one
mirror and rewrite the other).  That's more like what scrub does, except
scrub rewrites only the blocks it couldn't read (or that failed csum).

> > Last time I checked all the RAID implementations on Linux (ok, so that's
> > pretty much just md-raid) had some sort of repair capability.
> 
> You can read man 4 md, and you can also look on linux-raid@, it's very
> clearly necessary for the drive to report a read or write error
> explicitly with LBA for md to do repairs. If there are link resets,
> bad sectors accumulate and the obvious inevitably happens.

I am looking at the md code.  It looks at ->bi_error, and nothing else as
far as I can tell.  It doesn't even care if the error is EIO--any non-zero
return value from the lower bio layer seems to trigger automatic recovery.





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
>>  wrote:
>> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>> > If anything, I want the timeout to be shorter so that upper layers with
>> > redundancy can get an EIO and initiate repair promptly, and admins can
>> > get notified to evict chronic offenders from their drive slots, without
>> > having to pay extra for hard disk firmware with that feature.
>>
>> The drive totally thwarts this. It doesn't report back to the kernel
>> what command is hung, as far as I'm aware. It just hangs and goes into
>> a so called "deep recovery" there is no way to know what sector is
>> causing the problem
>
> I'm proposing we just treat the link reset _as_ an EIO, unless transparent
> link resets are required for link speed negotiation or something.

That's not one EIO, that's possibly 31 items in the command queue that
get knocked over when the link is reset. I don't have the expertise to
know whether it's sane to interpret many EIO all at once as an
implicit indication of bad sectors. Off hand I think that's probably
specious.

> The drive wouldn't be thwarting anything, the host would just ignore it
> (unless the drive doesn't respond to a link reset until after its internal
> timeout, in which case nothing is saved by shortening the timeout).
>
>> until the drive reports a read error, which will
>> include the affected sector LBA.
>
> It doesn't matter which sector.  Chances are good that it was more than
> one of the outstanding requested sectors anyway.  Rewrite them all.

*shrug* even if valid, it only helps the raid 1+ cases. It does
nothing to help raid0, linear/concat, or single device deployments.
Those users also deserve to have access to their data, if the drive
can recover it by giving it enough time to do so.


> We know which sectors they are because somebody has an IO operation
> waiting for a status on each of them (unless they're using AIO or some
> other API where a request can be fired at a hard drive and the reply
> discarded).  Notify all of them that their IO failed and move on.

Dunno, maybe.


>
>> Btrfs does have something of a workaround for when things get slow,
>> and that's balance, read and rewrite everything. The write forces
>> sector remapping by the drive firmware for bad sectors.
>
> It's a crude form of "resilvering" as ZFS calls it.

In what manner is it crude?




> If btrfs sees EIO from a lower block layer it will try to reconstruct the
> missing data (but not repair it).  If that happens during a scrub,
> it will also attempt to rewrite the missing data over the original
> offending sectors.  This happens every few months in my server pool,
> and seems to be working even on btrfs raid5.
>
> Last time I checked all the RAID implementations on Linux (ok, so that's
> pretty much just md-raid) had some sort of repair capability.

You can read man 4 md, and you can also look on linux-raid@, it's very
clearly necessary for the drive to report a read or write error
explicitly with LBA for md to do repairs. If there are link resets,
bad sectors accumulate and the obvious inevitably happens.
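For reference, md's check/repair machinery is driven through sysfs; a
sketch, with the array name hypothetical:

  # 'check' reads all members and counts inconsistencies;
  # 'repair' additionally rewrites them
  echo check  > /sys/block/md0/md/sync_action
  cat /sys/block/md0/md/mismatch_cnt
  echo repair > /sys/block/md0/md/sync_action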



>
>> For single drives and RAID 0, the only possible solution is to not do
>> link resets for up to 3 minutes and hope the drive returns the single
>> copy of data.
>
> So perhaps the timeout should be influenced by higher layers, e.g. if a
> disk becomes part of a raid1, its timeout should be shortened by default,
> while the timeout for a disk that is not used by a redundant layer should
> be longer.

And there are a pile of reasons why link resets are necessary that
have nothing to do with bad sectors. So if you end up with a drive or
controller misbehaving, and the new behavior is to force a bunch of new
(corrective) writes to the drive right after a reset, it could actually
make its problems worse for all we know.

I think it's highly speculative to assume a hung block device means a
bad sector that should be treated as a bad sector, and that doing so
will cause no other side effects. That's a question for block
device/SCSI experts to opine on, whether this is at all sane to do. I'm
sure they're reasonably aware of this problem, and if it were that
simple they'd have done it already; but conversely, five years of
telling users to change the command timer or stop using the wrong kind
of drives for RAID really isn't sufficiently good advice either.

The reality is that drive manufacturers have handed us drives that, by
and large, either don't support SCT ERC or ship with it disabled, so
maybe the thing to do is have udev poll the drive for SCT ERC. If it's
already at 70,70, leave the SCSI command timer as is. If it reports
that it's disabled, then udev needs to know whether the drive is in
some kind of RAID 1+, and if so set SCT ERC to 70,70. If it's a single drive,
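
As a boot-time sketch of that polling policy (assuming smartctl signals
unsupported SCT ERC via a non-zero exit status; the RAID-membership
test is a stub a real udev rule would have to fill in):

#!/bin/sh
# Prefer short in-drive error recovery when a redundant layer can
# repair; otherwise give the drive a long command timer, per the
# single-drive reasoning later in this thread.
is_raid_member() { false; }   # stub: a real rule would ask udev/libblkid
for disk in /dev/sd[a-z]; do
    [ -b "$disk" ] || continue
    timeout=/sys/block/$(basename "$disk")/device/timeout
    if is_raid_member "$disk"; then
        # try 7.0s error recovery; fall back to a long command timer
        smartctl -q errorsonly -l scterc,70,70 "$disk" || echo 180 > "$timeout"
    else
        # single copy of the data: let the drive take its time
        smartctl -q errorsonly -l scterc,0,0 "$disk"
        echo 180 > "$timeout"
    fi
done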

Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
> 
> The drive totally thwarts this. It doesn't report back to the kernel
> what command is hung, as far as I'm aware. It just hangs and goes into
> a so-called "deep recovery"; there is no way to know what sector is
> causing the problem

I'm proposing we just treat the link reset _as_ an EIO, unless transparent
link resets are required for link speed negotiation or something.
The drive wouldn't be thwarting anything, the host would just ignore it
(unless the drive doesn't respond to a link reset until after its internal
timeout, in which case nothing is saved by shortening the timeout).

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector.  Chances are good that it was more than
one of the outstanding requested sectors anyway.  Rewrite them all.

We know which sectors they are because somebody has an IO operation
waiting for a status on each of them (unless they're using AIO or some
other API where a request can be fired at a hard drive and the reply
discarded).  Notify all of them that their IO failed and move on.

> Btrfs does have something of a workaround for when things get slow,
> and that's balance: read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read).  That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
> 
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees EIO from a lower block layer it will try to reconstruct the
missing data (but not repair it).  If that happens during a scrub,
it will also attempt to rewrite the missing data over the original
offending sectors.  This happens every few months in my server pool,
and seems to be working even on btrfs raid5.

Last time I checked all the RAID implementations on Linux (ok, so that's
pretty much just md-raid) had some sort of repair capability.  lvm uses
(or can use) the md-raid implementation.  ext4 and xfs on naked disk
partitions will have problems, but that's because they were designed in
the 1990s when we were young and naive and still believed hard disks
would one day become reliable devices without buggy firmware.

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a
disk becomes part of a raid1, its timeout should be shortened by default,
while the timeout for a disk that is not used by a redundant layer should
be longer.

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk.
Probably not practical.





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

>
>> It just came up again in a thread over the weekend on linux-raid@. I'm
>> going to ask while people are paying attention if a patch to change
>> the 30 second timeout to something a lot higher has ever been
>> floated, what the negatives might be, and where to get this fixed if
>> it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone.  30 seconds is about
> two minutes too short for an SMR drive's worst-case write latency, or
> 28 seconds too long for an OLTP system, or just right for an end-user's
> personal machine with a low-energy desktop drive and a long spin-up time.

The question is where the correct place is to change the default so it
broadly captures most use cases, because the current default is definitely
incompatible with consumer SATA drives, whether in an enclosure or not.

Maybe it's with the kernel teams at each distribution? Or maybe an
upstream udev rule?

In any case, something needs to give here, because it's been years of
bugging users about this misconfiguration and people constantly run
into it, which means user education is not working.


>
> Once a drive starts taking 30+ seconds to do I/O, I consider the drive
> failed in the sense that it's too slow to meet latency requirements.

Well, that is then a mismatch between use case and the drive purchasing
decision. Consumer drives do this. It's how they're designed to work.


> When the problem is that it's already taking too long, the solution is
> not waiting even longer.  To put things in perspective, consider that
> server hardware watchdog timeouts are typically 60 seconds by default
> (if not maximum).

If you want the data retrieved from that particular device, the only
solution is waiting longer. The alternative is what you get now: an IO
error (well, actually a link reset, which also means the entire
command queue is purged on SATA drives).


> If anything, I want the timeout to be shorter so that upper layers with
> redundancy can get an EIO and initiate repair promptly, and admins can
> get notified to evict chronic offenders from their drive slots, without
> having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel
what command is hung, as far as I'm aware. It just hangs and goes into
a so-called "deep recovery"; there is no way to know what sector is
causing the problem until the drive reports a read error, which will
include the affected sector LBA.

Btrfs does have something of a workaround for when things get slow,
and that's balance: read and rewrite everything. The write forces
sector remapping by the drive firmware for bad sectors.
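
In command form (mount point is a placeholder): balance rewrites every
extent, while scrub rewrites only blocks that fail csum or error out:

# full read-and-rewrite, forcing remaps of marginal sectors
# (--full-balance is spelled out in newer btrfs-progs; older ones
# take a bare "balance start")
btrfs balance start --full-balance /mnt
# scrub, by contrast, rewrites only what fails checksum or errors out
btrfs scrub start -B /mnt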


>> *Ideally* I think we'd want two timeouts. I'd like to see commands
>> have a timer that results in merely a warning that could be used by
>> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
>> write over those sectors". That's how bad sectors start out: they read
>> slower, eventually go beyond 30 seconds, and now it's all link
>> resets. If the problem could be fixed before then... that's the best
>> scenario.
>
> What's the downside of a link reset?  Can the driver not just return
> EIO for all the outstanding IOs in progress at reset, and let the upper
> layers deal with it?  Or is the problem that the upper layers are all
> horribly broken by EIOs, or drive firmware horribly broken by link resets?

Link reset clears the entire command queue on SATA drives, and it
wipes away any possibility of finding out what LBA, or even a range of
LBAs, is the source of the stall. So it pretty much gets you nothing.


> The upper layers could time the IOs, and make their own decisions based
> on the timing (e.g. btrfs or mdadm could proactively repair anything that
> took more than 10 seconds to read).  That might be a better approach,
> since shortening the time to an EIO is only useful when you have a
> redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what
work is needed, or even if it's possible, to track commands in this
manner, and fall back to some kind of repair mode as if it were a read
error.

For single drives and RAID 0, the only possible solution is to not do
link resets for up to 3 minutes and hope the drive returns the single
copy of data.

Even in the case of Btrfs DUP, it's thwarted without a read error
reported from the drive (or it returning bad data).



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2016-06-25 12:44, Chris Murphy wrote:
> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> >>  wrote:
> >>
> >> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
> >> the checksum tree, and the parity strip in the chunk tree. And if
> >> parity is wrong, then it should be replaced.
> >
> > Except that's horribly inefficient.  With limited exceptions involving
> > highly situational co-processors, computing a checksum of a parity block is
> > always going to be faster than computing parity for the stripe.  By using
> > that to check parity, we can safely speed up the common case of near zero
> > errors during a scrub by a pretty significant factor.
> 
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.

> It just came up again in a thread over the weekend on linux-raid@. I'm
> going to ask while people are paying attention if a patch to change
> the 30 second timeout to something a lot higher has ever been
> floated, what the negatives might be, and where to get this fixed if
> it wouldn't be accepted in the kernel code directly.

Defaults are defaults, they're not for everyone.  30 seconds is about
two minutes too short for an SMR drive's worst-case write latency, or
28 seconds too long for an OLTP system, or just right for an end-user's
personal machine with a low-energy desktop drive and a long spin-up time.

Once a drive starts taking 30+ seconds to do I/O, I consider the drive
failed in the sense that it's too slow to meet latency requirements.
When the problem is that it's already taking too long, the solution is
not waiting even longer.  To put things in perspective, consider that
server hardware watchdog timeouts are typically 60 seconds by default
(if not maximum).

If anything, I want the timeout to be shorter so that upper layers with
redundancy can get an EIO and initiate repair promptly, and admins can
get notified to evict chronic offenders from their drive slots, without
having to pay extra for hard disk firmware with that feature.

> *Ideally* I think we'd want two timeouts. I'd like to see commands
> have a timer that results in merely a warning that could be used by
> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
> write over those sectors". That's how bad sectors start out: they read
> slower, eventually go beyond 30 seconds, and now it's all link
> resets. If the problem could be fixed before then... that's the best
> scenario.

What's the downside of a link reset?  Can the driver not just return
EIO for all the outstanding IOs in progress at reset, and let the upper
layers deal with it?  Or is the problem that the upper layers are all
horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based
on the timing (e.g. btrfs or mdadm could proactively repair anything that
took more than 10 seconds to read).  That might be a better approach,
since shortening the time to an EIO is only useful when you have a
redundancy layer in place to do something about them.
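
A crude userspace sketch of that idea, assuming bash and direct reads;
device, chunk size, and threshold are placeholders, and a real
implementation would live in the md/btrfs layer with per-command timing:

DEV=/dev/sdX
CHUNK=$((64*1024*1024))              # bytes read per sample
SIZE=$(blockdev --getsize64 "$DEV")
THRESH=10                            # seconds, as in the example above
off=0
while [ "$off" -lt "$SIZE" ]; do
    t0=$(date +%s)
    dd if="$DEV" of=/dev/null bs=1M iflag=direct \
       skip=$((off / 1048576)) count=$((CHUNK / 1048576)) 2>/dev/null
    t1=$(date +%s)
    if [ $((t1 - t0)) -gt "$THRESH" ]; then
        echo "slow region at byte offset $off"
        # with redundancy, a repair (e.g. a scrub) could target this range
    fi
    off=$((off + CHUNK))
done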

> The 2nd timer would be: OK, the controller or drive just face-planted, reset.
> 
> -- 
> Chris Murphy
> 




Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy  wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>  wrote:
>>>
Well, the obvious major advantage that comes to mind for me of
checksumming parity is that it would let us scrub the parity data
itself and verify it.
>>>
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions involving
>> highly situational co-processors, computing a checksum of a parity block is
>> always going to be faster than computing parity for the stripe.  By using
>> that to check parity, we can safely speed up the common case of near zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

What I read in this thread clarifies the different flavors of errors I
saw when trying btrfs raid5 while corrupting 1 device or just
unexpectedly removing a device and replacing it with a fresh one.
I was especially not aware of the lack of parity csums, and I think
this is really wrong.

Consider a 4-disk btrfs raid10 and a 3-disk btrfs raid5. Both protect
against the loss of 1 device or bad blocks on 1 device. In the current
design (unoptimized for performance), raid10 reads from 2 disks and
raid5 does as well (as far as I remember), per task/process.
Which pair of strips raid10 uses is pseudo-random AFAIK, so one could
get low throughput if some device in the array is older/slower and
that one is picked. The mapping from device to fs logical layer is just
a simple function, namely a copy, so there is the option of keeping
data in place (zero-copy). The data is at least read by the csum check,
and in case of failure the btrfs code picks the alternative strip and
corrects, etc.

For raid5, assuming it avoids the parity in principle, it is also
a strip pair and a csum check. In case of csum failure, one needs the
parity strip and a parity calculation. To me, it looks like the 'fear'
of this calculation has made raid56 a sort of add-on, instead of a
more integral part.

Looking at the raid6 perf test at boot in dmesg, it reports 30GByte/s,
even higher than memory bandwidth. So although a calculation is needed
when data0strip+paritystrip is used instead of
data0strip+data1strip, I think that, looking at total cost, it can be
cheaper than spending time on seeks, at least on HDDs. If the parity
calculation is treated in a transparent way, the same as a copy, then
there is more flexibility in selecting disks (and strips), which
enables easier design and performance optimizations, I think.
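
That figure comes from the raid6_pq self-test the kernel logs at boot
(or module load); the output looks something like this, with
illustrative numbers that vary by kernel and CPU:

dmesg | grep raid6
# raid6: sse2x4   gen()  9046 MB/s
# raid6: avx2x4   gen() 30324 MB/s
# raid6: using algorithm avx2x4 gen() 30324 MB/s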

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just
>> saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.  If the parity just
>> computed matches what is there already, the checksum is bad and should be
>> rewritten (and we should probably recompute the whole block of checksums
>> it's in), otherwise, the parity was bad, write out the new parity and update
>> the checksum.

On this 3rd point: if the parity matches but the csum is not good, then
there is a btrfs design error or some hardware/CPU/memory problem.
Compare with btrfs raid10: if the copies match but the csum is wrong,
then something is fatally wrong. Just do the first step, the csum check,
and if it's wrong, regenerate the assumed-corrupt strip from the 3
others; and for a 3-disk raid5, from the 2 others, whether that is a
copy or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute
>> it.


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
For what it's worth, I found btrfs-map-logical can produce a mapping for
raid5 (I didn't test raid6) by specifying the extent block length. If
that's omitted, it only shows the device+mapping for the first strip.

This example is a 3 disk raid5, with a 128KiB file all in a single extent.

[root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c

[root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c

It's also possible to use -c and -o to copy the extent to a file and
more easily diff it with a control file, rather than using dd.
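
For example, assuming -c selects the copy and -o the output file, as
described above, and reusing the logical address from this example;
the control path is a placeholder:

# dump 128KiB of the extent as seen via copy 1, then diff it
btrfs-map-logical -l 14157742080 -b 131072 -c 1 -o /tmp/strip /dev/VG/a
cmp /tmp/strip /path/to/control && echo "matches control"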

Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>>
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>  wrote:
>>
>> Well, the obvious major advantage that comes to mind for me of
>> checksumming parity is that it would let us scrub the parity data
>> itself and verify it.
>>
>>
>> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient.  With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity block is
> always going to be faster than computing parity for the stripe.  By using
> that to check parity, we can safely speed up the common case of near zero
> errors during a scrub by a pretty significant factor.

OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.




> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we just
> saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity.  If the parity just
> computed matches what is there already, the checksum is bad and should be
> rewritten (and we should probably recompute the whole block of checksums
> it's in), otherwise, the parity was bad, write out the new parity and update
> the checksum.
> 4. Have an option to skip the csum check on the parity and always compute
> it.
>>
>>
>> Even check > md/sync_action does this. So no pun intended but Btrfs
>> isn't even at parity with mdadm on data integrity if it doesn't check
>> if the parity matches data.
>
> Except that MD and LVM don't have checksums to verify anything outside of
> the very high-level metadata.  They have to compute the parity during a
> scrub because that's the _only_ way they have to check data integrity.  Just
> because that's the only way for them to check it does not mean we have to
> follow their design, especially considering that we have other, faster ways
> to check it.

I'm not opposed to this optimization. But to retroactively better
qualify my previous "major advantage": what I meant was major in terms
of solving a functional deficiency.



>> The much bigger problem we have right now that affects Btrfs,
>> LVM/mdadm md raid, is this silly bad default with non-enterprise
>> drives having no configurable SCT ERC, with ensuing long recovery
>> times, and the kernel SCSI command timer at 30 seconds - which
>> actually also fucks over regular single disk users because it
>> means they don't get the "benefit" of long recovery times, which is
>> the whole g'd point of that feature. This itself causes so many
>> problems where bad sectors just get worse and don't get fixed up
>> because of all the link resets. So I still think it's a bullshit
>> default kernel side because it pretty much affects the majority use
>> case; it is only a non-problem with proprietary hardware raid, and
>> with software raid using enterprise (or NAS-specific) drives that
>> already have short recovery times by default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@. I'm
going to ask while people are paying attention if a patch to change
the 30 second timeout to something a lot higher has ever been
floated, what the negatives might be, and where to get this fixed if
it wouldn't be accepted in the kernel code directly.

*Ideally* I think we'd want two timeouts. I'd like to see commands
have a timer that results in merely a warning that could be used by
e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
write over those sectors". That's how bad sectors start out: they read
slower, eventually go beyond 30 seconds, and now it's all link
resets. If the problem could be fixed before then... that's the best
scenario.

The 2nd timer would be: OK, the controller or drive just face-planted, reset.

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Austin S. Hemmelgarn

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:


Well, the obvious major advantage that comes to mind for me of checksumming
parity is that it would let us scrub the parity data itself and verify it.


OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.
Except that's horribly inefficient.  With limited exceptions involving 
highly situational co-processors, computing a checksum of a parity block 
is always going to be faster than computing parity for the stripe.  By 
using that to check parity, we can safely speed up the common case of 
near zero errors during a scrub by a pretty significant factor.


The ideal situation that I'd like to see for scrub WRT parity is:
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we 
just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity.  If the parity 
just computed matches what is there already, the checksum is bad and 
should be rewritten (and we should probably recompute the whole block of 
checksums it's in), otherwise, the parity was bad, write out the new 
parity and update the checksum.
4. Have an option to skip the csum check on the parity and always 
compute it.


Even check > md/sync_action does this. So no pun intended but Btrfs
isn't even at parity with mdadm on data integrity if it doesn't check
if the parity matches data.
Except that MD and LVM don't have checksums to verify anything outside 
of the very high-level metadata.  They have to compute the parity during 
a scrub because that's the _only_ way they have to check data integrity. 
 Just because that's the only way for them to check it does not mean we 
have to follow their design, especially considering that we have other, 
faster ways to check it.




I'd personally much rather know my parity is bad before I need to use it
than after using it to reconstruct data and getting an error there, and I'd
be willing to bet that most seasoned sysadmins working for companies using
big storage arrays likely feel the same about it.


That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.
It does not require it, but it can make it significantly more efficient, 
and even a 1% increase in efficiency is a huge difference on a big array.


So that's also something to check in the code or poke a system with a
stick and see what happens.


I could see it being
practical to have an option to turn this off for performance reasons or
similar, but again, I have a feeling that most people would rather be able
to check if a rebuild will eat data before trying to rebuild (depending on
the situation in such a case, it will sometimes just make more sense to nuke
the array and restore from a backup instead of spending time waiting for it
to rebuild).


The much bigger problem we have right now that affects Btrfs,
LVM/mdadm md raid, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which
actually also fucks over regular single disk users because it
means they don't get the "benefit" of long recovery times, which is
the whole g'd point of that feature. This itself causes so many
problems where bad sectors just get worse and don't get fixed up
because of all the link resets. So I still think it's a bullshit
default kernel side because it pretty much affects the majority use
case; it is only a non-problem with proprietary hardware raid, and
with software raid using enterprise (or NAS-specific) drives that
already have short recovery times by default.

On this, we can agree.


Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Chris Murphy
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov  wrote:
> 26.06.2016 00:52, Chris Murphy wrote:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e.
>> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
>> is raid4.
>
> That's not what code suggests and what I see in practice - parity seems
> to be distributed across all disks; each new 128KiB file (extent) has
> parity on new disk. At least as long as we can trust btrfs-map-logical
> to always show parity as "mirror 2".


tl;dr: Andrei is correct, there's no raid4 behavior here.

Looks like mirror 2 is always parity; more on that below.


>
> Do you see consecutive full stripes in your tests? Or how do you
> determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen.
Not checking the consecutivity is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456128..   3456159: 32: last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456224..   3456255: 32: last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456320..   3456351: 32: last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456352..   3456383: 32: last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456384..   3456415: 32: last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456416..   3456447: 32: last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456448..   3456479: 32: last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456480..   3456511: 32: last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456096..   3456127: 32: last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file, then from the top, so they're in
4096-byte block order; the 2nd column is the difference from the
previous value:

3456096
3456128 32
3456224 96
3456320 96
3456352 32
3456384 32
3456416 32
3456448 32
3456480 32

So the first two files are consecutive full stripe writes. The next
two aren't. The next five are. They were all copied at the same time.
I don't know why they aren't always consecutive writes.


[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c

Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Duncan
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted:

> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

... and posts (including patches, which I guess are code as well, just 
not applied yet).  I've been noticing that in the "stripe length" 
patches, even when the patch comment says "stripe length", it's 
actually "strip length" they're talking about, using the "N strips, 
one per device, make a stripe" definition.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Andrei Borzenkov
26.06.2016 00:52, Chris Murphy wrote:
> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> is raid4.

That's not what code suggests and what I see in practice - parity seems
to be distributed across all disks; each new 128KiB file (extent) has
parity on new disk. At least as long as we can trust btrfs-map-logical
to always show parity as "mirror 2".

Do you see consecutive full stripes in your tests? Or how do you
determine which devid has parity for a given full stripe? This
information is not actually stored anywhere; it is computed based on
block group geometry and logical stripe offset.

P.S. usage of "stripe" to mean "stripe element" actually adds to
confusion when reading code :)


Re: Adventures in btrfs raid5 disk recovery

2016-06-25 Thread Chris Murphy
Interestingly enough, so far I'm finding with full stripe writes, i.e.
3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
is raid4. So...I wonder if some of these slow cases end up with a
bunch of stripes that are effectively raid4-like, and have a lot of
parity overwrites, which is where raid4 suffers due to disk
contention.

Totally speculative as the sample size is too small and distinctly non-random.


Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-25 Thread Chris Murphy
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

> Well, the obvious major advantage that comes to mind for me of checksumming
> parity is that it would let us scrub the parity data itself and verify it.

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.

Even check > md/sync_action does this. So no pun intended but Btrfs
isn't even at parity with mdadm on data integrity if it doesn't check
if the parity matches data.


> I'd personally much rather know my parity is bad before I need to use it
> than after using it to reconstruct data and getting an error there, and I'd
> be willing to bet that most seasoned sysadmins working for companies using
> big storage arrays likely feel the same about it.

That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.

So that's also something to check in the code or poke a system with a
stick and see what happens.

> I could see it being
> practical to have an option to turn this off for performance reasons or
> similar, but again, I have a feeling that most people would rather be able
> to check if a rebuild will eat data before trying to rebuild (depending on
> the situation in such a case, it will sometimes just make more sense to nuke
> the array and restore from a backup instead of spending time waiting for it
> to rebuild).

The much bigger problem we have right now that affects Btrfs,
LVM/mdadm md raid, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which
actually also fucks over regular single disk users because it
means they don't get the "benefit" of long recovery times, which is
the whole g'd point of that feature. This itself causes so many
problems where bad sectors just get worse and don't get fixed up
because of all the link resets. So I still think it's a bullshit
default kernel side because it pretty much affects the majority use
case; it is only a non-problem with proprietary hardware raid, and
with software raid using enterprise (or NAS-specific) drives that
already have short recovery times by default.

This has been true for a very long time, maybe a decade. And it's such
complete utter crap that this hasn't been dealt with properly by any
party. No distribution has fixed this for their users. Upstream udev
hasn't dealt with it. And kernel folks haven't dealt with it. It's a
perverse joke on the user to do this out of the box.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Austin S. Hemmelgarn

On 2016-06-24 13:52, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov  wrote:

24.06.2016 20:06, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  wrote:

On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
eta)data and RAID56 parity is not data.


   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).



Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.


I don't understand this. Whether the failed disk means a stripe is
missing a data strip or parity strip, if any other strip is damaged of
course the reconstruction isn't going to match checksum. This does not
make raid5 pointless.



Yes, you are right. We have a double failure here. Still, in the current
situation we apparently may end up with btrfs reconstructing the missing
block using wrong information. As was mentioned elsewhere, btrfs does not
verify the checksum of the reconstructed block, meaning data corruption.


Well that'd be bad, but also good in that it would explain a lot of
problems people have when metadata is also raid5. In this whole thread
the premise is that the metadata is raid1, so the fs doesn't totally face
plant; we just get a bunch of weird data corruptions. The metadata
raid5 cases were sorta "WTF happened?" and not much was really said
about them other than telling the user to scrape off what they can and
start over.

Anyway, while not good, I still think this is not super problematic:
at least *do* check EXTENT_CSUM after reconstruction from parity
rather than assuming that reconstruction happened correctly. The data
needed to pass/fail the rebuild is already on the disk. It just needs
to be checked.

Better would be to get parity csummed and put into the csum tree. But
I don't know how much that helps. Think about always computing and
writing csums for parity, which almost never get used, vs. keeping
things the way they are now and just *checking our work* after
reconstruction from parity. If there's some obvious major advantage to
checksumming the parity, I'm all ears, but I'm not thinking of it at
the moment.

Well, the obvious major advantage that comes to mind for me of 
checksumming parity is that it would let us scrub the parity data itself 
and verify it.  I'd personally much rather know my parity is bad before 
I need to use it than after using it to reconstruct data and getting an 
error there, and I'd be willing to bet that most seasoned sysadmins 
working for companies using big storage arrays likely feel the same 
about it.  I could see it being practical to have an option to turn this 
off for performance reasons or similar, but again, I have a feeling that 
most people would rather be able to check if a rebuild will eat data 
before trying to rebuild (depending on the situation in such a case, it 
will sometimes just make more sense to nuke the array and restore from a 
backup instead of spending time waiting for it to rebuild).




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Zygo Blaxell
On Fri, Jun 24, 2016 at 11:40:56AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills  wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> 
> The last sentence is hugely problematic. Parity doesn't appear to be
> either CoW'd or checksummed. If it is used for reconstruction and the
> reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
> that entry is rather recomputed and written, that's just like blindly
> trusting the parity is correct and then authenticating it with a csum.

I think what happens is the data is recomputed, but the csum on the
data is _not_ updated (the csum does not reside in the raid56 code).
A read of the reconstructed data would get a csum failure (of course,
every 4 billionth time this happens the csum is correct by random chance,
so you wouldn't want to be reading parity blocks from a drive full of
garbage, but that's a different matter).

> It's not difficult to test. Corrupt one byte of parity. Yank a drive.
> Add a new one. Start a reconstruction with scrub or balance (or both
> to see if they differ) and find out what happens. What should happen
> is the reconstruct should work for everything except that one file. If
> it's reconstructed silently, it should contain visible corruption and
> we all collectively raise our eyebrows.

I've done something like that test:  write random data to 1000 random
blocks on one disk, then run scrub.  It reconstructs the data without
problems (except for the minor wart that 'scrub status -d' counts the
errors randomly against every device, while 'dev stats' counts all the
errors on the disk that was corrupted).
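
A sketch of that kind of test on a scratch array (bash; the victim
device and mount point are placeholders, and a naive version like this
can also clobber superblock copies, so don't run it anywhere you care
about):

DEV=/dev/loop2                                # victim device
BLOCKS=$(( $(blockdev --getsize64 "$DEV") / 4096 ))
for i in $(seq 1000); do
    blk=$(( (RANDOM * 32768 + RANDOM) % BLOCKS ))
    dd if=/dev/urandom of="$DEV" bs=4096 count=1 \
       seek="$blk" conv=notrunc 2>/dev/null
done
btrfs scrub start -B /mnt/test                # -B waits and prints totals
btrfs dev stats /mnt/test                     # per-device error counters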

Disk-side data corruption is a thing I have to deal with a few times each
year, so I tested the btrfs raid5 implementation for that case before I
started using it.  As far as I can tell so far, everything in btrfs raid5
works properly if a disk fails _while the filesystem is not mounted_.

The problem I see in the field is not *silent* corruption.  It's a whole
lot of very *noisy* corruption detected under circumstances where I'd
expect to see no corruption at all (silent or otherwise).





Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov  wrote:
> 24.06.2016 20:06, Chris Murphy wrote:
>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  
>> wrote:
>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>>> eta)data and RAID56 parity is not data.

Checksums are not parity, correct. However, every data block
 (including, I think, the parity) is checksummed and put into the csum
 tree. This allows the FS to determine where damage has occurred,
 rather than simply detecting that it has occurred (which would be the
 case if the parity doesn't match the data, or if the two copies of a
 RAID-1 array don't match).

>>>
>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>> degraded disk won't be able to reconstruct data on this degraded disk
>>> because reconstructed extent content won't match checksum. Which kinda
>>> makes RAID5 pointless.
>>
>> I don't understand this. Whether the failed disk means a stripe is
>> missing a data strip or parity strip, if any other strip is damaged of
>> course the reconstruction isn't going to match checksum. This does not
>> make raid5 pointless.
>>
>
> Yes, you are right. We have a double failure here. Still, in the current
> situation we apparently may end up with btrfs reconstructing the missing
> block using wrong information. As was mentioned elsewhere, btrfs does not
> verify the checksum of the reconstructed block, meaning data corruption.

Well that'd be bad, but also good in that it would explain a lot of
problems people have when metadata is also raid5. In this whole thread
the premise is that the metadata is raid1, so the fs doesn't totally face
plant; we just get a bunch of weird data corruptions. The metadata
raid5 cases were sorta "WTF happened?" and not much was really said
about them other than telling the user to scrape off what they can and
start over.

Anyway, while not good, I still think this is not super problematic:
at least *do* check EXTENT_CSUM after reconstruction from parity
rather than assuming that reconstruction happened correctly. The data
needed to pass/fail the rebuild is already on the disk. It just needs
to be checked.

Better would be to get parity csummed and put into the csum tree. But
I don't know how much that helps. Think about always computing and
writing csums for parity, which almost never get used, vs. keeping
things the way they are now and just *checking our work* after
reconstruction from parity. If there's some obvious major advantage to
checksumming the parity, I'm all ears, but I'm not thinking of it at
the moment.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills  wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:

>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
>Eh? How do you come to that conclusion?
>
>For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest
> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.

The last sentence is hugely problematic. Parity doesn't appear to be
either CoW'd or checksummed. If it is used for reconstruction and the
reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
that entry is rather recomputed and written, that's just like blindly
trusting the parity is correct and then authenticating it with a csum.

It's not difficult to test. Corrupt one byte of parity. Yank a drive.
Add a new one. Start a reconstruction with scrub or balance (or both
to see if they differ) and find out what happens. What should happen
is the reconstruct should work for everything except that one file. If
it's reconstructed silently, it should contain visible corruption and
we all collectively raise our eyebrows.
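
A sketch of that experiment on loop devices; the parity offset has to
be found first with btrfs-map-logical as in the examples elsewhere in
this thread, and all device names, the devid, and $PARITY_OFF are
placeholders:

mkdir -p /mnt/test
truncate -s 2G /tmp/d1 /tmp/d2 /tmp/d3 /tmp/dnew
for f in /tmp/d1 /tmp/d2 /tmp/d3 /tmp/dnew; do losetup -f "$f"; done
mkfs.btrfs -f -d raid5 -m raid1 /dev/loop0 /dev/loop1 /dev/loop2
mount /dev/loop0 /mnt/test
dd if=/dev/urandom of=/mnt/test/file bs=128K count=1 conv=fsync
cp /mnt/test/file /tmp/control
umount /mnt/test
# flip one byte of the parity strip on a surviving device
printf '\xff' | dd of=/dev/loop2 bs=1 seek="$PARITY_OFF" conv=notrunc
# "yank" one data device and rebuild onto the fresh one
wipefs -a /dev/loop1
mount -o degraded /dev/loop0 /mnt/test
btrfs replace start -B 2 /dev/loop3 /mnt/test   # 2 = the missing devid
# silent mis-reconstruction would show up as a mismatch here
cmp /mnt/test/file /tmp/control && echo "file intact"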

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 4:16 AM, Andrei Borzenkov  wrote:
> On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy  wrote:
>
>> [root@f24s ~]# filefrag -v /mnt/5/*
>> Filesystem type is: 9123683e
>> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>>  ext: logical_offset:physical_offset: length:   expected: flags:
>>0:0..   3:2931712..   2931715:  4: 
>> last,eof
>
> Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)
>
> nohostname:~ # filefrag -v /mnt/1
> Filesystem type is: 9123683e
> File size of /mnt/1 is 3072 (1 block of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   0: 269376..269376:  1: last,eof
> /mnt/1: 1 extent found
>
> But!
>
> nohostname:~ # filefrag -v /etc/passwd
> Filesystem type is: 9123683e
> File size of /etc/passwd is 1527 (1 block of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..4095:  0..  4095:   4096:
> last,not_aligned,inline,eof
> /etc/passwd: 1 extent found
> nohostname:~ #
>
> Why does it work for one filesystem but not for another one?
> ...
>>
>> So at the old address, it shows the "a..." is still there. And the
>> added single block for this file, at new logical and physical
>> addresses, is the modification substituting "g" for the first "a".
>>
>> In this case, no rmw, no partial stripe modification, and no data
>> already on-disk is at risk.
>
> You misunderstand the nature of the problem. What is put at risk is data
> that is already on disk and "shares" parity with new data.
>
> As example, here are the first 64K in several extents on 4 disk RAID5
> with so far single data chunk
>
> item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff
> 15491 itemsize 176
> chunk length 3221225472 owner 2 stripe_len 65536
> type DATA|RAID5 num_stripes 4
> stripe 0 devid 4 offset 9437184
> dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
> stripe 1 devid 3 offset 9437184
> dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
> stripe 2 devid 2 offset 1074790400
> dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
> stripe 3 devid 1 offset 1094713344
> dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
>
>
> item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
> item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
> item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
> item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
>
> where devid 1 = sdb1, 2 = sdc1 etc.
>
> Now let's write some data (I created several files) up to 64K in size:
>
> mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
> mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
> mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
> mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
> mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
> mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
> mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
> mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
> mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
> mirror 2 logical 1103380480 physical 9519104 device /dev/sde1
>
> Note that btrfs allocates 64K on the same device before switching to
> the next one. What is a bit misleading here: sdc1 is data and sde1 is
> parity (you can see it in the checksum tree, where only items for sdc1
> exist).
>
> Now let's write next 64k and see what happens
>
> nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
> mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
> mirror 2 logical 1103429632 physical 9502720 device /dev/sde1
>
> See? btrfs now allocates a new stripe on sdb1; this stripe is at the
> same offset as the previous one on sdc1 (64K) and so shares the same
> parity stripe on sde1.

Yep, I've seen this also. What's not clear is if there's any
optimization where it's doing partial strip writes, i.e. only a

Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
26.06.2016 00:52, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>> eta)data and RAID56 parity is not data.
>>>
>>>Checksums are not parity, correct. However, every data block
>>> (including, I think, the parity) is checksummed and put into the csum
>>> tree. This allows the FS to determine where damage has occurred,
>>> rather than simply detecting that it has occurred (which would be the
>>> case if the parity doesn't match the data, or if the two copies of a
>>> RAID-1 array don't match).
>>>
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
> 
> I don't understand this. Whether the failed disk means a stripe is
> missing a data strip or parity strip, if any other strip is damaged of
> course the reconstruction isn't going to match checksum. This does not
> make raid5 pointless.
> 

Yes, you are right. We have a double failure here. Still, in the current
situation we apparently may end up with btrfs reconstructing the missing
block using wrong information. As was mentioned elsewhere, btrfs does not
verify the checksum of the reconstructed block, meaning data corruption.


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>>
>>Checksums are not parity, correct. However, every data block
>> (including, I think, the parity) is checksummed and put into the csum
>> tree. This allows the FS to determine where damage has occurred,
>> rather than simply detecting that it has occurred (which would be the
>> case if the parity doesn't match the data, or if the two copies of a
>> RAID-1 array don't match).
>>
>
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

I don't understand this. Whether the failed disk means a stripe is
missing a data strip or parity strip, if any other strip is damaged of
course the reconstruction isn't going to match checksum. This does not
make raid5 pointless.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills  wrote:

>Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree.

I don't see how parity is checksummed. It definitely is not in the
csum tree. Two file systems, one raid5, one single, each with a single
identical file:

raid5
item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
extent csum item

single

item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
extent csum item

That's the only entry in the csum tree. The raid5 one is not 33.33%
bigger to account for the extra parity being checksummed.

Now, if parity is used for reconstruction of data, that data *is*
checksummed so if it fails checksum after reconstruction the
information is available to determine it was incorrectly
reconstructed. The notes in btrfs/raid56.c recognize the possibility
of parity corruption and how to handle it. But I think that corruption
is inferred. Maybe the parity csums are in some other metadata item,
but I don't see how it's in the csum tree.


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 10:52:53AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills  wrote:
> 
> >Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree.
> 
> I don't see how parity is checksummed. It definitely is not in the
> csum tree. Two file systems, one raid5, one single, each with a single
> identical file:

   It isn't -- I was wrong up there, and corrected myself in a later
message after investigation. (Although in this case, I regard reality
as being at fault ;) ).

   Hugo.

> raid5
> item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 
> 128
> extent csum item
> 
> single
> 
> item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
> extent csum item
> 
> That's the only entry in the csum tree. The raid5 one is not 33.33%
> bigger to account for the extra parity being checksummed.
> 
> Now, if parity is used for reconstruction of data, that data *is*
> checksummed so if it fails checksum after reconstruction the
> information is available to determine it was incorrectly
> reconstructed. The notes in btrfs/raid56.c recognize the possibility
> of parity corruption and how to handle it. But I think that corruption
> is inferred. Maybe the parity csums are in some other metadata item,
> but I don't see how it's in the csum tree.
> 
> 

-- 
Hugo Mills | Great oxymorons of the world, no. 2:
hugo@... carfax.org.uk | Common Sense
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Zygo Blaxell
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> 
> Well, the problem is that the parity block cannot be redirected on write
> as data blocks are, which makes it impossible to version control it. The
> only solution I see is to always use full stripe writes, by either wasting
> time in a fixed width stripe or using variable width, so that every stripe
> always gets a new version of parity. This makes it possible to keep parity
> checksums like data checksums.

The allocator could try harder to avoid partial stripe writes.  We can
write multiple small extents to the same stripe as long as we always do
it all within one transaction, and then later treat the entire stripe
as read-only until every extent is removed.  It would be possible to do
that by fudging extent lengths (effectively adding a bunch of prealloc-ish
space if we have a partial write after all the delalloc stuff is done),
but it could also waste some blocks on every single transaction, or
create a bunch of "free but unavailable" space that makes df/statvfs
output even more wrong than it usually is.

The raid5 rmw code could try to relocate the other extents sharing a
stripe, but I fear that with the current state of backref walking code
that would make raid5 spectacularly slow if a filesystem is anywhere
near full.

We could also write rmw parity block updates to a journal (like another
log tree).  That would enable us to at least fix up the parity blocks
after a crash, and close the write hole.  That's an on-disk format
change though.
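
As a toy model of what such a journal would buy (not btrfs code:
single-byte strips, and a redo-style record covering both the new data
and the new parity, so replay is safe wherever the crash lands):

/* parity_journal.c: journalled rmw update of one raid5 stripe. */
#include <stdio.h>
#include <stdint.h>

static uint8_t d1, d2, p;                       /* the stripe on "disk" */
static struct { int valid; uint8_t d2, p; } jrnl;

static void rmw_d2(uint8_t nd2, int crash)
{
    uint8_t np = d1 ^ nd2;    /* new parity for the updated stripe */
    jrnl.d2 = nd2;            /* 1. journal the whole update first */
    jrnl.p = np;
    jrnl.valid = 1;
    d2 = nd2;                 /* 2. overwrite the data strip */
    if (crash)
        return;               /*    simulated power loss */
    p = np;                   /* 3. overwrite the parity strip */
    jrnl.valid = 0;           /* 4. retire the journal record */
}

static void replay(void)      /* mount-time journal replay */
{
    if (jrnl.valid) {
        d2 = jrnl.d2;         /* idempotent: redo both writes */
        p = jrnl.p;
        jrnl.valid = 0;
    }
}

int main(void)
{
    d1 = 0xAA; d2 = 0x00; p = d1 ^ d2;  /* committed, consistent stripe */
    rmw_d2(0x55, 1);                    /* crash between data and parity */
    printf("no replay: d1 rebuilds as %02X (want AA)\n", d2 ^ p);
    replay();
    printf("replayed:  d1 rebuilds as %02X (want AA)\n", d2 ^ p);
    return 0;
}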





Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 11:20:40PM -0600, Chris Murphy wrote:
> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   0:2931732..   2931732:  1:
>1:1..   3:2931713..   2931715:  3:2931733: last,eof
> /mnt/5/a.txt: 2 extents found
> File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931716..   2931719:  4: last,eof
> /mnt/5/b.txt: 1 extent found
> File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931720..   2931723:  4: last,eof
> /mnt/5/c.txt: 1 extent found
> File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931724..   2931727:  4: last,eof
> /mnt/5/d.txt: 1 extent found
> File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931728..   2931731:  4: last,eof
> /mnt/5/e.txt: 1 extent found

> So at the old address, it shows the "a..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
> 
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk. Even the metadata leaf/node is cow'd, it
> has a new logical and physical address as well, which contains
> information for all five files.

Well, of course not.  You're not setting up the conditions for failure.
The extent at 2931712..2931715 is 4 blocks long, so when you overwrite
part of the extent all 4 blocks remain occupied.

You need extents that are shorter than the stripe width, and you need to
write to the same stripe in two different btrfs transactions (i.e. you
need to delete an extent and then have a new extent mapped in the old
location).




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Austin S. Hemmelgarn

On 2016-06-24 06:59, Hugo Mills wrote:

On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:

On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills  wrote:

On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:

On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:

On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:

24.06.2016 04:47, Zygo Blaxell wrote:

On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:

On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  wrote:

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.


Yeah I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs. But defines it in terms of
parity possibly being stale after a crash. I think the term comes not
from merely parity being wrong but parity being wrong *and* then being
used to wrongly reconstruct data because it's blindly trusted.


I think the opposite is more likely, as the layers above raid56
seem to check the data against sums before raid56 ever sees it.
(If those layers seem inverted to you, I agree, but OTOH there are
probably good reasons to do it that way).



Yes, that's how I read code as well. btrfs layer that does checksumming
is unaware of parity blocks at all; for all practical purposes they do
not exist. What happens is approximately

1. logical extent is allocated and checksum computed
2. it is mapped to physical area(s) on disks, skipping over what would
be parity blocks
3. when these areas are written out, RAID56 parity is computed and filled in

IOW btrfs checksums are for (meta)data and RAID56 parity is not data.


   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).



Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.


   Eh? How do you come to that conclusion?

   For data, say you have n-1 good devices, with n-1 blocks on them.
Each block has a checksum in the metadata, so you can read that
checksum, read the blocks, and verify that they're not damaged. From
those n-1 known-good blocks (all data, or one parity and the rest


We do not know whether parity is good or not because as far as I can
tell parity is not checksummed.


   I was about to write a devastating rebuttal of this... then I
actually tested it, and holy crap you're right.

   I've just closed the terminal in question by accident, so I can't
copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

 item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
  extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum,
that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
nothing for the parity.

   This is fundamentally broken, and I think we need to change the
wiki to indicate that the parity RAID implementation is not
recommended, because it doesn't actually do the job it's meant to in a
reliable way. :(


So that's item 4 now, together with:
1. Rebuilds seemingly randomly decide based on the filesystem whether or 
not to take an insanely long time (always happens on some arrays, never 
happens on others, I have yet to see a report where it happens 
intermittently).

2. Failed disks seem to occasionally cause irreversible data corruption.
3. Classic erasure-code write-hole, just slightly different because of COW.

TBH, as much as I hate to say this, it looks like the raid5/6 code needs
to be redone from scratch.  At an absolute minimum, we need to put a
warning in mkfs for people using raid5/6 to tell them they shouldn't be
using it outside of testing.



Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Austin S. Hemmelgarn

On 2016-06-24 01:20, Chris Murphy wrote:

On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell
 wrote:


With simple files changing one character with vi and gedit,
I get completely different logical and physical numbers with each
change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
raid5).


You are COWing the entire file because vi and gedit do truncate followed
by full-file write.


I'm seeing the file inode changes with either a vi or gedit
modification, even when file size is exactly the same, just character
substitute. So as far as VFS and Btrfs are concerned, it's an entirely
different file, so it's like faux-CoW that would have happened on any
file system, not an overwrite.
Yes, at least Vim (which is what most Linux systems use for vi) writes 
to a temporary file then does a replace by rename.  The idea is that 
POSIX implies that this should be atomic (except it's not actually 
required by POSIX, and even on some journaled and COW filesystems, it 
isn't actually atomic).
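
For reference, a minimal sketch of that replace-by-rename pattern in C,
with the fsyncs needed for it to survive a power cut (rename(2) gives
the atomic swap; durability is the part editors often skip). The file
name, .tmp suffix and current-directory assumption are all invented:

/* rename_replace.c: write a temp file, fsync it, rename over the
 * target, then fsync the directory so the rename itself is durable.
 * Assumes the target lives in the current directory; error handling
 * is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int replace_file(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);                /* new contents must reach disk... */
        return -1;
    }
    close(fd);

    if (rename(tmp, path) != 0)   /* ...before the atomic swap... */
        return -1;

    int dfd = open(".", O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    fsync(dfd);                   /* ...and the rename must too */
    close(dfd);
    return 0;
}

int main(void)
{
    const char *text = "new contents\n";
    return replace_file("demo.txt", text, strlen(text)) ? 1 : 0;
}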



Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills  wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> >> 24.06.2016 04:47, Zygo Blaxell wrote:
> >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
> >> >> >>  wrote:
> >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> >> >> >>> checksum.
> >> >> >>
> >> >> >> Yeah I'm kinda confused on this point.
> >> >> >>
> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >> >>
> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> >> from merely parity being wrong but parity being wrong *and* then 
> >> >> >> being
> >> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >> >
> >> >> > I think the opposite is more likely, as the layers above raid56
> >> >> > seem to check the data against sums before raid56 ever sees it.
> >> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> >> > probably good reasons to do it that way).
> >> >> >
> >> >>
> >> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> >> is unaware of parity blocks at all; for all practical purposes they do
> >> >> not exist. What happens is approximately
> >> >>
> >> >> 1. logical extent is allocated and checksum computed
> >> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> >> be parity blocks
> >> >> 3. when these areas are written out, RAID56 parity is computed and 
> >> >> filled in
> >> >>
> >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >> >
> >> >Checksums are not parity, correct. However, every data block
> >> > (including, I think, the parity) is checksummed and put into the csum
> >> > tree. This allows the FS to determine where damage has occurred,
> >> > rather than simply detecting that it has occurred (which would be the
> >> > case if the parity doesn't match the data, or if the two copies of a
> >> > RAID-1 array don't match).
> >> >
> >>
> >> Yes, that is what I wrote below. But that means that RAID5 with one
> >> degraded disk won't be able to reconstruct data on this degraded disk
> >> because reconstructed extent content won't match checksum. Which kinda
> >> makes RAID5 pointless.
> >
> >Eh? How do you come to that conclusion?
> >
> >For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> 
> We do not know whether parity is good or not because as far as I can
> tell parity is not checksummed.

   I was about to write a devastating rebuttal of this... then I
actually tested it, and holy crap you're right.

   I've just closed the terminal in question by accident, so I can't
copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

 item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
  extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum,
that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
nothing for the parity.

   This is fundamentally broken, and I think we need to change the
wiki to indicate that the parity RAID implementation is not
recommended, because it doesn't actually do the job it's meant to in a
reliable way. :(

   Hugo.

> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> >
> 
> So we have silent corruption. I fail to understand how it is an improvement :)
> 
> >Hugo.
> >
> >> ...
> >> >
> >> >> > It looks like uncorrectable failures might occur because parity is
> >> >> > correct, but the parity checksum is out of date, so the parity 
> >> >> > checksum
> >> >> > doesn't match even though data blindly reconstructed from the parity
> >> >> > *would* match the data.
> >> >> >
> >> >>
> >> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> >> should at least avoid silent corruption.
> >> >>
> >

-- 
Hugo Mills | Debugging is like hitting yourself in the head with
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: E2AB1DE4  |PotatoEngineer

Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills  wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> >> 24.06.2016 04:47, Zygo Blaxell wrote:
>> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
>> >> >>  wrote:
>> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>> >> >>> checksum.
>> >> >>
>> >> >> Yeah I'm kinda confused on this point.
>> >> >>
>> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >> >>
>> >> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> >> parity possibly being stale after a crash. I think the term comes not
>> >> >> from merely parity being wrong but parity being wrong *and* then being
>> >> >> used to wrongly reconstruct data because it's blindly trusted.
>> >> >
>> >> > I think the opposite is more likely, as the layers above raid56
>> >> > seem to check the data against sums before raid56 ever sees it.
>> >> > (If those layers seem inverted to you, I agree, but OTOH there are
>> >> > probably good reasons to do it that way).
>> >> >
>> >>
>> >> Yes, that's how I read code as well. btrfs layer that does checksumming
>> >> is unaware of parity blocks at all; for all practical purposes they do
>> >> not exist. What happens is approximately
>> >>
>> >> 1. logical extent is allocated and checksum computed
>> >> 2. it is mapped to physical area(s) on disks, skipping over what would
>> >> be parity blocks
>> >> 3. when these areas are written out, RAID56 parity is computed and filled 
>> >> in
>> >>
>> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>> >
>> >Checksums are not parity, correct. However, every data block
>> > (including, I think, the parity) is checksummed and put into the csum
>> > tree. This allows the FS to determine where damage has occurred,
>> > rather than simply detecting that it has occurred (which would be the
>> > case if the parity doesn't match the data, or if the two copies of a
>> > RAID-1 array don't match).
>> >
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
>Eh? How do you come to that conclusion?
>
>For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest

We do not know whether parity is good or not because as far as I can
tell parity is not checksummed.

> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.
>

So we have silent corruption. I fail to understand how it is an improvement :)

>Hugo.
>
>> ...
>> >
>> >> > It looks like uncorrectable failures might occur because parity is
>> >> > correct, but the parity checksum is out of date, so the parity checksum
>> >> > doesn't match even though data blindly reconstructed from the parity
>> >> > *would* match the data.
>> >> >
>> >>
>> >> Yep, that is how I read it too. So if your data is checksummed, it
>> >> should at least avoid silent corruption.
>> >>
>
> --
> Hugo Mills | Debugging is like hitting yourself in the head with
> hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
> http://carfax.org.uk/  | you're allowed to stop debugging.
> PGP: E2AB1DE4  |PotatoEngineer


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> 24.06.2016 04:47, Zygo Blaxell wrote:
> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
> >> >>  wrote:
> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> >> >>> checksum.
> >> >>
> >> >> Yeah I'm kinda confused on this point.
> >> >>
> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >>
> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >
> >> > I think the opposite is more likely, as the layers above raid56
> >> > seem to check the data against sums before raid56 ever sees it.
> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> > probably good reasons to do it that way).
> >> >
> >>
> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> is unaware of parity blocks at all; for all practical purposes they do
> >> not exist. What happens is approximately
> >>
> >> 1. logical extent is allocated and checksum computed
> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> be parity blocks
> >> 3. when these areas are written out, RAID56 parity is computed and filled 
> >> in
> >>
> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >
> >Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree. This allows the FS to determine where damage has occurred,
> > rather than simply detecting that it has occurred (which would be the
> > case if the parity doesn't match the data, or if the two copies of a
> > RAID-1 array don't match).
> >
> 
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

   Eh? How do you come to that conclusion?

   For data, say you have n-1 good devices, with n-1 blocks on them.
Each block has a checksum in the metadata, so you can read that
checksum, read the blocks, and verify that they're not damaged. From
those n-1 known-good blocks (all data, or one parity and the rest
data) you can reconstruct the remaining block. That reconstructed
block won't be checked against the csum for the missing block -- it'll
just be written and a new csum for it written with it.
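
   As a toy illustration of that rebuild, and of where the check
against the stored csum *could* happen, here is the XOR reconstruction
in C with a one-byte stand-in checksum (btrfs uses crc32c; everything
here is invented for the example):

/* reconstruct.c: rebuild a missing strip from the n-1 survivors,
 * then verify the result -- the step this thread says btrfs skips. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NBLK 3                  /* 3-disk raid5: 2 data strips + parity */
#define BS   8                  /* toy block size */

/* rebuild strip `missing` as the XOR of the other NBLK-1 strips */
static void rebuild(uint8_t blk[NBLK][BS], int missing)
{
    memset(blk[missing], 0, BS);
    for (int i = 0; i < NBLK; i++)
        if (i != missing)
            for (int j = 0; j < BS; j++)
                blk[missing][j] ^= blk[i][j];
}

static uint8_t toy_csum(const uint8_t *b) /* stand-in for crc32c */
{
    uint8_t s = 0;
    for (int j = 0; j < BS; j++)
        s += b[j];
    return s;
}

int main(void)
{
    uint8_t blk[NBLK][BS] = { "datadat1", "datadat2", { 0 } };
    for (int j = 0; j < BS; j++)           /* compute the parity strip */
        blk[2][j] = blk[0][j] ^ blk[1][j];
    uint8_t stored = toy_csum(blk[0]);     /* "csum tree" entry, strip 0 */

    blk[2][0] ^= 0xFF;                     /* stale/corrupt parity... */
    rebuild(blk, 0);                       /* ...used to rebuild strip 0 */

    printf("rebuilt csum %02X vs stored %02X -> %s\n",
           toy_csum(blk[0]), stored,
           toy_csum(blk[0]) == stored ? "good" : "reject the rebuild");
    return 0;
}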

   Hugo.

> ...
> >
> >> > It looks like uncorrectable failures might occur because parity is
> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> > doesn't match even though data blindly reconstructed from the parity
> >> > *would* match the data.
> >> >
> >>
> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> should at least avoid silent corruption.
> >>

-- 
Hugo Mills | Debugging is like hitting yourself in the head with
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: E2AB1DE4  |PotatoEngineer




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy  wrote:

> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931712..   2931715:  4: last,eof

Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)

nohostname:~ # filefrag -v /mnt/1
Filesystem type is: 9123683e
File size of /mnt/1 is 3072 (1 block of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0: 269376..269376:  1: last,eof
/mnt/1: 1 extent found

But!

nohostname:~ # filefrag -v /etc/passwd
Filesystem type is: 9123683e
File size of /etc/passwd is 1527 (1 block of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..4095:  0..  4095:   4096:
last,not_aligned,inline,eof
/etc/passwd: 1 extent found
nohostname:~ #

Why does it work for one filesystem but not for the other?
...
>
> So at the old address, it shows the "a..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk.

You misunderstand the nature of the problem. What is put at risk is data
that is already on disk and "shares" parity with new data.

As example, here are the first 64K in several extents on 4 disk RAID5
with so far single data chunk

item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff
15491 itemsize 176
chunk length 3221225472 owner 2 stripe_len 65536
type DATA|RAID5 num_stripes 4
stripe 0 devid 4 offset 9437184
dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
stripe 1 devid 3 offset 9437184
dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
stripe 2 devid 2 offset 1074790400
dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
stripe 3 devid 1 offset 1094713344
dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824


item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824

where devid 1 = sdb1, 2 = sdc1 etc.

Now let's write some data (I created several files) up to 64K in size:

mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
mirror 2 logical 1103380480 physical 9519104 device /dev/sde1

Note that btrfs allocates 64K on the same device before switching to
the next one. What is a bit misleading here is that sdc1 holds data and
sde1 holds parity (you can see this in the checksum tree, where only
items for sdc1 exist).

Now let's write next 64k and see what happens

nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1

See? btrfs now allocates a new stripe on sdb1; this stripe is at the
same offset as the previous one on sdc1 (64K) and so shares the same
parity stripe on sde1. If you compare the 64K on sde1 at offset 9502720
before and after, you will see that it has changed. IN PLACE. Without
CoW. This is exactly what puts existing data on sdc1 at risk - if sdb1
is updated but sde1 is not, an attempt to reconstruct data on sdc1 will
either fail (if we have checksums) or result in silent corruption.
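
For what it's worth, the numbers above can be reproduced from the chunk
item alone. In the sketch below the rotation rule -- data strip i of
full stripe f lands on chunk stripe (i + f) % num_stripes, parity on
(nr_data + f) % num_stripes -- is inferred from this example rather
than quoted from volumes.c, and devid 3 = sdd1 is assumed:

/* map_demo.c: recompute the btrfs-map-logical output shown above. */
#include <stdio.h>
#include <stdint.h>

#define NUM_STRIPES 4
#define NR_DATA     (NUM_STRIPES - 1)
#define STRIPE_LEN  65536ULL
static const uint64_t CHUNK_START = 1103101952ULL;
/* dev extent starts for chunk stripes 0..3 (devids 4, 3, 2, 1) */
static const uint64_t DEV_START[NUM_STRIPES] =
    { 9437184ULL, 9437184ULL, 1074790400ULL, 1094713344ULL };
static const char *DEV_NAME[NUM_STRIPES] =
    { "/dev/sde1", "/dev/sdd1", "/dev/sdc1", "/dev/sdb1" };

static void map(uint64_t logical)
{
    uint64_t delta     = logical - CHUNK_START;
    uint64_t stripe_nr = delta / STRIPE_LEN;    /* which 64K strip */
    uint64_t in_strip  = delta % STRIPE_LEN;
    uint64_t full      = stripe_nr / NR_DATA;   /* full-stripe number */
    unsigned rot       = full % NUM_STRIPES;    /* parity rotation */
    unsigned dev       = (stripe_nr % NR_DATA + rot) % NUM_STRIPES;
    unsigned pdev      = (NR_DATA + rot) % NUM_STRIPES;
    uint64_t off       = full * STRIPE_LEN + in_strip;

    printf("logical %llu: data %llu on %s, parity %llu on %s\n",
           (unsigned long long)logical,
           (unsigned long long)(DEV_START[dev] + off), DEV_NAME[dev],
           (unsigned long long)(DEV_START[pdev] + off), DEV_NAME[pdev]);
}

int main(void)
{
    map(1103364096ULL);  /* -> 1074855936 on sdc1, 9502720 on sde1 */
    map(1103429632ULL);  /* -> 1094778880 on sdb1, 9502720 on sde1 */
    return 0;
}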

Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> 24.06.2016 04:47, Zygo Blaxell wrote:
>> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
>> >> wrote:
>> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>> >>> checksum.
>> >>
>> >> Yeah I'm kinda confused on this point.
>> >>
>> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >>
>> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> parity possibly being stale after a crash. I think the term comes not
>> >> from merely parity being wrong but parity being wrong *and* then being
>> >> used to wrongly reconstruct data because it's blindly trusted.
>> >
>> > I think the opposite is more likely, as the layers above raid56
>> > seem to check the data against sums before raid56 ever sees it.
>> > (If those layers seem inverted to you, I agree, but OTOH there are
>> > probably good reasons to do it that way).
>> >
>>
>> Yes, that's how I read code as well. btrfs layer that does checksumming
>> is unaware of parity blocks at all; for all practical purposes they do
>> not exist. What happens is approximately
>>
>> 1. logical extent is allocated and checksum computed
>> 2. it is mapped to physical area(s) on disks, skipping over what would
>> be parity blocks
>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>
>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>
>Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree. This allows the FS to determine where damage has occurred,
> rather than simply detecting that it has occurred (which would be the
> case if the parity doesn't match the data, or if the two copies of a
> RAID-1 array don't match).
>

Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.

...
>
>> > It looks like uncorrectable failures might occur because parity is
>> > correct, but the parity checksum is out of date, so the parity checksum
>> > doesn't match even though data blindly reconstructed from the parity
>> > *would* match the data.
>> >
>>
>> Yep, that is how I read it too. So if your data is checksummed, it
>> should at least avoid silent corruption.
>>


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> 24.06.2016 04:47, Zygo Blaxell wrote:
> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
> >> wrote:
> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> >>> checksum.
> >>
> >> Yeah I'm kinda confused on this point.
> >>
> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >>
> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> parity possibly being stale after a crash. I think the term comes not
> >> from merely parity being wrong but parity being wrong *and* then being
> >> used to wrongly reconstruct data because it's blindly trusted.
> > 
> > I think the opposite is more likely, as the layers above raid56
> > seem to check the data against sums before raid56 ever sees it.
> > (If those layers seem inverted to you, I agree, but OTOH there are
> > probably good reasons to do it that way).
> > 
> 
> Yes, that's how I read code as well. btrfs layer that does checksumming
> is unaware of parity blocks at all; for all practical purposes they do
> not exist. What happens is approximately
> 
> 1. logical extent is allocated and checksum computed
> 2. it is mapped to physical area(s) on disks, skipping over what would
> be parity blocks
> 3. when these areas are written out, RAID56 parity is computed and filled in
> 
> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).

   (Note that csums for metadata are stored in the metadata block
itself, not in the csum tree).

   Hugo.

> > It looks like uncorrectable failures might occur because parity is
> > correct, but the parity checksum is out of date, so the parity checksum
> > doesn't match even though data blindly reconstructed from the parity
> > *would* match the data.
> > 
> 
> Yep, that is how I read it too. So if your data is checksummed, it
> should at least avoid silent corruption.
> 
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> > 
> 
> Well, the problem is that the parity block cannot be redirected on write
> as data blocks are, which makes it impossible to version control it. The
> only solution I see is to always use full stripe writes, by either wasting
> time in a fixed width stripe or using variable width, so that every stripe
> always gets a new version of parity. This makes it possible to keep parity
> checksums like data checksums.
> 



-- 
Hugo Mills | Darkling's First Law of Filesystems:
hugo@... carfax.org.uk | The user hates their data
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Chris Murphy
On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell
 wrote:

>> With simple files changing one character with vi and gedit,
>> I get completely different logical and physical numbers with each
>> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
>> raid5).
>
> You are COWing the entire file because vi and gedit do truncate followed
> by full-file write.

I'm seeing the file inode changes with either a vi or gedit
modification, even when file size is exactly the same, just character
substitute. So as far as VFS and Btrfs are concerned, it's an entirely
different file, so it's like faux-CoW that would have happened on any
file system, not an overwrite.


> Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
> edit the file with a sector-level hex editor.

The inode is now the same; one of the 4096 byte blocks is dereferenced,
a new 4096 byte block is referenced and written, the other 3 blocks
remain untouched, and the other files in the stripe remain untouched.
So it's pretty clearly cow'd in this case.

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931712..   2931715:  4: last,eof
/mnt/5/a.txt: 1 extent found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931716..   2931719:  4: last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931720..   2931723:  4: last,eof
/mnt/5/c.txt: 1 extent found
File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931724..   2931727:  4: last,eof
/mnt/5/d.txt: 1 extent found
File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931728..   2931731:  4: last,eof
/mnt/5/e.txt: 1 extent found

[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt

[root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a
mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a
mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931716] /dev/VG/a
mirror 1 logical 12008308736 physical 34619392 device /dev/mapper/VG-a
mirror 2 logical 12008308736 physical 1108361216 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931720] /dev/VG/a
mirror 1 logical 12008325120 physical 34635776 device /dev/mapper/VG-a
mirror 2 logical 12008325120 physical 1108377600 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931724] /dev/VG/a
mirror 1 logical 12008341504 physical 34652160 device /dev/mapper/VG-a
mirror 2 logical 12008341504 physical 1108393984 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931728] /dev/VG/a
mirror 1 logical 12008357888 physical 1048576 device /dev/mapper/VG-b
mirror 2 logical 12008357888 physical 1108344832 device /dev/mapper/VG-c


[root@f24s ~]# echo -n "g" | dd of=/mnt/5/a.txt conv=notrunc
0+1 records in
0+1 records out
1 byte copied, 0.000314582 s, 3.2 kB/s
[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 23:06 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0:2931732..   2931732:  1:
   1:1..   3:2931713..   2931715:  3:2931733: last,eof
/mnt/5/a.txt: 2 extents found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931716..   2931719:  4: last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931720..   2931723:  4: last,eof
/mnt/5/c.txt: 1 extent found
File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931724..   2931727:  4: last,eof
/mnt/5/d.txt: 1 extent found
File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931728..   2931731:  4: last,eof
/mnt/5/e.txt: 1 extent found

Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Andrei Borzenkov
24.06.2016 04:47, Zygo Blaxell wrote:
> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
>> wrote:
>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>>> checksum.
>>
>> Yeah I'm kinda confused on this point.
>>
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> It says there is a write hole for Btrfs. But defines it in terms of
>> parity possibly being stale after a crash. I think the term comes not
>> from merely parity being wrong but parity being wrong *and* then being
>> used to wrongly reconstruct data because it's blindly trusted.
> 
> I think the opposite is more likely, as the layers above raid56
> seem to check the data against sums before raid56 ever sees it.
> (If those layers seem inverted to you, I agree, but OTOH there are
> probably good reasons to do it that way).
> 

Yes, that's how I read code as well. btrfs layer that does checksumming
is unaware of parity blocks at all; for all practical purposes they do
not exist. What happens is approximately

1. logical extent is allocated and checksum computed
2. it is mapped to physical area(s) on disks, skipping over what would
be parity blocks
3. when these areas are written out, RAID56 parity is computed and filled in

IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

> It looks like uncorrectable failures might occur because parity is
> correct, but the parity checksum is out of date, so the parity checksum
> doesn't match even though data blindly reconstructed from the parity
> *would* match the data.
> 

Yep, that is how I read it too. So if your data is checksummed, it
should at least avoid silent corruption.

>> I don't read code well enough, but I'd be surprised if Btrfs
>> reconstructs from parity and doesn't then check the resulting
>> reconstructed data to its EXTENT_CSUM.
> 
> I wouldn't be surprised if both things happen in different code paths,
> given the number of different paths leading into the raid56 code and
> the number of distinct failure modes it seems to have.
> 

Well, the problem is that the parity block cannot be redirected on write
as data blocks are, which makes it impossible to version control it. The
only solution I see is to always use full stripe writes, by either wasting
time in a fixed width stripe or using variable width, so that every stripe
always gets a new version of parity. This makes it possible to keep parity
checksums like data checksums.
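
A sketch of the allocator arithmetic behind those two options (pure
arithmetic, nothing btrfs-specific; the device count and block counts
are invented):

/* varstripe.c: fixed-width padding vs. variable-width full stripes.
 * Either way every stripe is written whole, so parity is always
 * freshly written (and could be CoW'd and checksummed); the
 * difference is what each option costs. */
#include <stdio.h>

#define NDEV 4                          /* 4-disk raid5 */
#define NR_DATA (NDEV - 1)

int main(void)
{
    for (int pending = 1; pending <= 5; pending++) {
        /* option 1: pad the tail out to a fixed-width full stripe */
        int pad = (NR_DATA - pending % NR_DATA) % NR_DATA;
        /* option 2: emit a narrower final stripe instead of padding */
        int tail = pending % NR_DATA ? pending % NR_DATA : NR_DATA;
        printf("%d dirty block(s): fixed width wastes %d pad block(s); "
               "variable width ends with a %d+1-wide stripe\n",
               pending, pad, tail);
    }
    return 0;
}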





Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > I expect that parity is in this data block group, and therefore is
> > checksummed the same as any other data in that block group.
> 
> This appears to be wrong. Comparing the same file, one file only, on
> two new Btrfs volumes, one volume single, one volume raid5, I get a
> single csum tree entry:
> 
> raid5
> item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 
> 128
> extent csum item
> 
> single
> 
> item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
> extent csum item
> 
> They're both the same size. They both contain the same data. So it
> looks like parity is not separately checksummed.

I'm inclined to agree because I didn't find any code that *writes* parity
csums...but if there are no parity csums, what does this code do?

scrub.c:
static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
[...]
        ret = btrfs_lookup_csums_range(csum_root,
                                       extent_logical,
                                       extent_logical + extent_len - 1,
                                       &sctx->csum_list, 1);
        if (ret)
                goto out;

        ret = scrub_extent_for_parity(sparity, extent_logical,
                                      extent_len,
                                      extent_physical,
                                      extent_dev, flags,
                                      generation,
                                      extent_mirror_num);





Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > So in your example of degraded writes, no matter what the on disk
> > format makes it discoverable there is a problem:
> >
> > A. The "updating" is still always COW so there is no overwriting.
> 
> There is RMW code in btrfs/raid56.c but I don't know when that gets
> triggered. 

RMW seems to be for cases where part of a stripe is modified but the
entire stripe has not yet been read into memory.  It reads the remaining
blocks (reconstructing missing blocks if necessary) then calculates
new parity blocks.
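
The textbook raid5 shortcut for that partial-stripe case is
P' = P ^ D_old ^ D_new, which needs only the old data strip and the old
parity rather than the rest of the stripe. A toy check that the
shortcut agrees with a full recompute (generic raid5 arithmetic, not a
quote of raid56.c):

/* rmw_identity.c: verify P' = P ^ Dold ^ Dnew on a stripe of
 * 3 data strips + parity. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t d[3] = { 0x10, 0x20, 0x30 };       /* data strips */
    uint8_t p = d[0] ^ d[1] ^ d[2];            /* parity strip */

    uint8_t new_d1 = 0x99;                     /* overwrite strip 1 */
    uint8_t p_shortcut = p ^ d[1] ^ new_d1;    /* rmw identity */
    d[1] = new_d1;
    uint8_t p_full = d[0] ^ d[1] ^ d[2];       /* full recompute */

    printf("shortcut %02X, full %02X -> %s\n", p_shortcut, p_full,
           p_shortcut == p_full ? "match" : "mismatch");
    return 0;
}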

> With simple files changing one character with vi and gedit,
> I get completely different logical and physical numbers with each
> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
> raid5).

You are COWing the entire file because vi and gedit do truncate followed
by full-file write.

Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
edit the file with a sector-level hex editor.

> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:2931744..   2931775: 32: last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
> mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
> mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
> [root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:2931776..   2931807: 32: last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
> No extent found at range [120085610496,120085626880)
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
> mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
> mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
> [root@f24s ~]#
> 
> There is a neat bug/rfe I found for btrfs-map-logical: it doesn't
> report back the physical locations for all num_stripes on the volume.
> It only spits back two, and sometimes it's the two data strips,
> sometimes it's one data and one parity strip.
> 
> 
> [1]
> https://bugzilla.kernel.org/show_bug.cgi?id=120941
> 
> 
> -- 
> Chris Murphy
> 




Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
> wrote:
> > The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> > checksum.
> 
> Yeah I'm kinda confused on this point.
> 
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> It says there is a write hole for Btrfs. But defines it in terms of
> parity possibly being stale after a crash. I think the term comes not
> from merely parity being wrong but parity being wrong *and* then being
> used to wrongly reconstruct data because it's blindly trusted.

I think the opposite is more likely, as the layers above raid56
seem to check the data against sums before raid56 ever sees it.
(If those layers seem inverted to you, I agree, but OTOH there are
probably good reasons to do it that way).

It looks like uncorrectable failures might occur because parity is
correct, but the parity checksum is out of date, so the parity checksum
doesn't match even though data blindly reconstructed from the parity
*would* match the data.

> I don't read code well enough, but I'd be surprised if Btrfs
> reconstructs from parity and doesn't then check the resulting
> reconstructed data to its EXTENT_CSUM.

I wouldn't be surprised if both things happen in different code paths,
given the number of different paths leading into the raid56 code and
the number of distinct failure modes it seems to have.





Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 09:32:50PM +0200, Goffredo Baroncelli wrote:
> The raid write hole happens when a stripe is not completely written
> on the platters: the parity and the related data mismatch. In this
> case a "simple" raid5 may return wrong data if the parity is used to
> compute the data. But this happens because a "simple" raid5 is unable
> to detect whether the returned data is right or not.
> 
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> checksum.

Checksums do not help with the raid5 write hole.  The way btrfs does
checksums might even make it worse.

ZFS reduces the number of disks in a stripe when a disk failure
is detected so that writes are always in non-degraded mode, and they
presumably avoid sub-stripe-width data allocations or use journalling
to avoid the write hole.  btrfs seems to use neither tactic.  At best,
btrfs will avoid creating new block groups on disks that are missing at
mount time, and it doesn't deal with sub-stripe-width allocations at all.

I'm working from two assumptions as I haven't found all the relevant
code yet:

1.  btrfs writes parity stripes at fixed locations relative to
the data in the same stripe.  If this is true, then the parity
blocks are _not_ CoW while the data blocks and their checksums
_are_ CoW.  I don't know if the parity block checksums are also
CoW.

2.  btrfs sometimes puts data from two different transactions in
the same stripe at the same time--a fundamental violation of the
CoW concept.  I inferred this from the logical block addresses.

Unless I'm missing something in the code somewhere, parity blocks can
have out-of-date checksums for short periods of time between flushes and
commits.  This would lose data by falsely reporting valid parity blocks
as checksum failures.  If any *single* failure occurs at the same time
(such as a missing write or disk failure) a small amount of data will
be lost.

> BTRFS is able to discard the wrong data: i.e. in the case of a 3-disk
> raid5, the right data may be extracted from the data1+data2 or if the
> checksum doesn't match from data1+parity or if the checksum doesn't
> match from data2+parity.

Suppose we have a sequence like this (3-disk RAID5 array, one stripe
containing 2 data and 1 parity block) starting with the stripe empty:

1.  write data block 1 to disk 1 of stripe (parity is now invalid, no 
checksum yet)

2.  write parity block to disk 3 in stripe (parity becomes valid again, 
no checksum yet)

3.  commit metadata pointing to block 1 (parity and checksums now valid)

4.  write data block 2 to disk 2 of stripe (parity and parity checksum 
now invalid)

5.  write parity block to disk 3 in stripe (parity valid now, parity 
checksum still invalid)

6.  commit metadata pointing to block 2 (parity and checksums now valid)

We can be interrupted at any point between step 1 and 4 with no data loss.
Before step 3 the data and parity blocks are not part of the extent
tree so their contents are irrelevant.  After step 3 (assuming each
step is completed in order) data block 1 is part of the extent tree and
can be reconstructed if any one disk fails.  This is the part of btrfs
raid5 that works.

If we are interrupted between steps 4 and 6 (e.g. power fails), a single
disk failure or corruption will cause data loss in block 1.  Note that
block 1 is *not* written between steps 4 and 6, so we are retroactively
damaging some previously written data that is not part of the current
transaction.

If we are interrupted between steps 4 and 5, we can no longer reconstruct
block 1 (block2 ^ parity) or block 2 (block1 ^ parity) because the parity
block doesn't match the data blocks in the same stripe
(i.e. block1 ^ block2 != parity).

If we are interrupted between step 5 and 6, the parity block checksum
committed at step 3 will fail.  Data block 2 will not be accessible
since the metadata was not written to point to it, but data block 1
will be intact, readable, and have a correct checksum as long as none
of the disks fail.  This can be repaired by a scrub (scrub will simply
throw the parity block away and reconstruct it from block1 and block2).
If disk 1 fails before the next scrub, data block 1 will be lost because
btrfs will believe the parity block is incorrect even though it is not.

This risk happens on *every* write to a stripe that is not a full stripe
write and contains existing committed data blocks.  It will occur more
often on full and heavily fragmented filesystems (filesystems which 
have these properties are more likely to write new data on stripes that 
already contain old data).

In cases where an entire stripe is written at once, or a stripe is
partially filled but no further writes ever modify the stripe, everything
works as intended in btrfs.
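
To pin the window down, here are those steps as a toy simulation, with
the crash landing between steps 4 and 5 (single-byte strips, not btrfs
code):

/* writehole.c: the 6-step timeline above, interrupted between steps
 * 4 and 5, followed by the loss of disk 1. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint8_t d1, d2, parity;

    d1 = 0x11;                 /* step 1: write data block 1 */
    parity = d1 ^ 0x00;        /* step 2: parity over (d1, empty slot) */
                               /* step 3: commit metadata + csum for d1 */
    d2 = 0x22;                 /* step 4: write data block 2 */
                               /* CRASH: step 5 (parity rewrite) lost */

    /* later, disk 1 dies; rebuild block 1 from d2 and stale parity */
    uint8_t rebuilt = d2 ^ parity;
    printf("block 1 was %02X, rebuilds as %02X -> committed data lost\n",
           d1, rebuilt);
    return 0;
}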

> NOTE2: this works if only one write is corrupted. If more writes (==
> more disks) are involved, you got checksum 

Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Chris Murphy
On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  wrote:

>
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.

Yeah I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs. But defines it in terms of
parity possibly being stale after a crash. I think the term comes not
from merely parity being wrong but parity being wrong *and* then being
used to wrongly reconstruct data because it's blindly trusted.

I don't read code well enough, but I'd be surprised if Btrfs
reconstructs from parity and doesn't then check the resulting
reconstructed data to its EXTENT_CSUM.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Chris Murphy
On Wed, Jun 22, 2016 at 11:14 AM, Chris Murphy  wrote:

>
> However, from btrfs-debug-tree from a 3 device raid5 volume:
>
> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144
> chunk length 2147483648 owner 2 stripe_len 65536
> type DATA|RAID5 num_stripes 3
> stripe 0 devid 2 offset 9437184
> dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55
> stripe 1 devid 1 offset 1094713344
> dev uuid: 13104709-6f30-4982-979e-4f055c326fad
> stripe 2 devid 3 offset 1083179008
> dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80
>
> I expect that parity is in this data block group, and therefore is
> checksummed the same as any other data in that block group.

This appears to be wrong. Comparing the same file, one file only, on
two new Btrfs volumes, one volume single, one volume raid5, I get a
single csum tree entry:

raid5
item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
extent csum item

single

item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
extent csum item

They're both the same size. They both contain the same data. So it
looks like parity is not separately checksummed.

If there's a missing 64KiB data strip (bad sector, or dead drive), the
reconstruction of that strip from parity should match available csums
for those blocks. So in this way it's possible to infer if the parity
strip is bad. But it also means assuming that everything else about this
full stripe (the remaining data strips and their csums) is correct.




> So in your example of degraded writes, no matter what the on disk
> format makes it discoverable there is a problem:
>
> A. The "updating" is still always COW so there is no overwriting.

There is RMW code in btrfs/raid56.c, but I don't know when that gets
triggered. With simple files, changing one character with vi and gedit,
I get completely different logical and physical numbers with each
change, so it's clearly CoWing the entire stripe (192KiB in my 3-dev
raid5).
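
For reference, the usual RMW shortcut recomputes parity from the old contents
of just the blocks being touched, rather than re-reading the whole stripe; a
byte-sized sketch (values arbitrary, not btrfs code):

    d_old=$((0x11)); d_new=$((0x22)); p_old=$((0x33))
    p_new=$((p_old ^ d_old ^ d_new))   # XOR out the old data, XOR in the new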


[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    2931744..   2931775:     32:             last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
[root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    2931776..   2931807:     32:             last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
No extent found at range [120085610496,120085626880)
[root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
[root@f24s ~]#

There is a neat bug/rfe I found for btrfs-map-logical [1]: it doesn't
report back the physical locations for all num_stripes on the volume.
It only spits back two, and sometimes it's the two data strips,
sometimes it's one data and one parity strip.


[1]
https://bugzilla.kernel.org/show_bug.cgi?id=120941


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Goffredo Baroncelli
On 2016-06-22 22:35, Zygo Blaxell wrote:
>> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
>> > dev or someone who knows can explain it.
> If you have 3 raid5 devices, they might be laid out on disk like this
> (e.g. with a 16K stripe width):
> 
>   Address:    0..16K      16..32K     32..64K
>   Disk 1:     [0..16K]    [32..64K]   [PARITY]
>   Disk 2:     [16..32K]   [PARITY]    [80..96K]
>   Disk 3:     [PARITY]    [64..80K]   [96..112K]
> 
> btrfs logical address ranges are inside [].  Disk physical address ranges
> are shown at the top of each column.  (I've simplified the mapping here;
> pretend all the addresses are relative to the start of a block group).
> 
> If we want to write a 32K extent at logical address 0, we'd write all
> three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3
> gets parity for the other two disks).  The parity will be temporarily
> invalid for the time between the first disk write and the last disk write.
> In non-degraded mode the parity isn't necessary, but in degraded mode
> the entire column cannot be reconstructed because of invalid parity.
> 
> To see why this could be a problem, suppose btrfs writes a 4K extent at
> logical address 32K.  This requires updating (at least) disk 1 (where the
> logical address 32K resides) and disk 2 (the parity for this column).
> This means any data that existed at logical addresses 36K..80K (or at
> least 32..36K and 64..68K) has its parity temporarily invalidated between
> the write to the first and last disks.  If there were metadata pointing
> to other blocks in this column, the metadata temporarily points to
> damaged data during the write.  If there is no data in other blocks in
> this column then it doesn't matter that the parity doesn't match--the
> content of the reconstructed unallocated blocks would be undefined
> even in the success cases.
[...]

Sorry, but I can't follow you.

RAID5 protects you in case of a failure (or a missing write) of a *single* disk.

The raid write hole happens when a stripe is not completely written to the
platters: the parity and the related data mismatch. In this case a "simple"
raid5 may return wrong data if the parity is used to compute the data. But this
happens because a "simple" raid5 is unable to detect whether the returned data
is right or wrong.

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. 

BTRFS is able to discard the wrong data: i.e. in case of a 3-disk raid5, the
right data may be extracted from data1+data2; if the checksum doesn't match,
from data1+parity; and if that checksum doesn't match either, from data2+parity.
NOTE1: the real difference between the BTRFS (and ZFS) raid and the "simple"
raid5 is that the latter doesn't try another pair of disks.
NOTE2: this works only if a single write is corrupted. If more writes (== more
disks) are involved, you get a checksum mismatch, and raid5 is unable to
protect you.
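
A sketch of that selection loop; reconstruct and csum_matches are hypothetical
helpers standing in for the XOR rebuild and the btrfs csum lookup:

    # try each pair of surviving strips until one yields data whose
    # checksum matches; a "simple" raid5 never tries more than one pair
    for pair in "data1+data2" "data1+parity" "data2+parity"; do
        reconstruct "$pair" > block      # hypothetical: XOR the named strips
        csum_matches block && break      # hypothetical: compare to stored csum
    done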

In case of "degraded mode", you don't have any redundancy. So if a stripe of a
degraded filesystem is not fully written to the disk, it is like a block not
fully written to the disk, and you get checksum mismatches. But this is not
what is called the raid write hole.


On 2016-06-22 22:35, Zygo Blaxell wrote:
> If in the future btrfs allocates physical block 2412725692 to
> a different file, up to 3 other blocks in this file (most likely
> 2412725689..2412725691) could be lost if a crash or disk I/O error also
> occurs during the same transaction.  btrfs does do this--in fact, the
> _very next block_ allocated by the filesystem is 2412725692:
> 
>   # head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
>   Filesystem type is: 9123683e
>   File size of f is 45056 (11 blocks of 4096 bytes)
>    ext:     logical_offset:        physical_offset: length:   expected: flags:
>      0:        0..       0: 2412725689..2412725689:      1:
>      1:        1..       1: 2412725690..2412725690:      1:
>      2:        2..       2: 2412725691..2412725691:      1:
>      3:        3..       3: 2412725701..2412725701:      1: 2412725692:
>      4:        4..       4: 2412725693..2412725693:      1: 2412725702:
>      5:        5..       5: 2412725694..2412725694:      1:
>      6:        6..       6: 2412725695..2412725695:      1:
>      7:        7..       7: 2412725698..2412725698:      1: 2412725696:
>      8:        8..       8: 2412725699..2412725699:      1:
>      9:        9..       9: 2412725700..2412725700:      1:
>     10:       10..      10: 2412725692..2412725692:      1: 2412725701: last,eof
>   f: 5 extents found

You are assuming that if you touch a block, all the blocks of the same stripe
spread over the disks are involved. I disagree. The only parts which are
involved are the part 

Re: Adventures in btrfs raid5 disk recovery

2016-06-22 Thread Zygo Blaxell
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote:
> > Before deploying raid5, I tested these by intentionally corrupting
> > one disk in an otherwise healthy raid5 array and watching the result.
> 
> It's difficult to reproduce if no one understands how you
> intentionally corrupted that disk. Literal reading, you corrupted the
> entire disk, but that's impractical. The fs is expected to behave
> differently depending on what's been corrupted and how much.

The first round of testing I did (a year ago, when deciding whether
btrfs raid5 was mature enough to start using) was:

Create a 5-disk RAID5

Put some known data on it until it's full (i.e. random test
patterns).  At the time I didn't do any tests involving
compressible data, which I now realize was a serious gap in
my test coverage.

Pick 1000 random blocks (excluding superblocks) on one of the
disks and write random data to them (a sketch of this step
follows after the procedure)

Read and verify the data through the filesystem, do scrub, etc.
Exercise all the btrfs features related to error reporting
and recovery.
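
A sketch of the corruption step under stated assumptions (/dev/sdb is one
array member; the superblock-avoidance test is elided for brevity):

    # write 4KiB of garbage at 1000 random block offsets on one member disk,
    # skipping the btrfs superblock copies at 64KiB, 64MiB and 256GiB
    blocks=$(( $(blockdev --getsize64 /dev/sdb) / 4096 ))
    for i in $(seq 1000); do
        off=$(( (RANDOM * 32768 + RANDOM) % blocks ))
        dd if=/dev/urandom of=/dev/sdb bs=4096 seek=$off count=1 conv=notrunc
    done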

I expected scrub and dev stat to report accurate corruption counts (except
for the 1 in 4 billion case where a bad csum matches by random chance),
and I expected all the data to be reconstructed since only one drive was
corrupted (assuming there are no unplanned disk failures during the
test, obviously) and the corruption occurred while the filesystem was
offline so there was no possibility of RAID write hole.

My results from that testing were that everything worked except for the
mostly-harmless quirk where scrub counts errors on random disks instead
of the disk where the errors occur.

> I don't often use the -Bd options, so I haven't tested it thoroughly,
> but what you're describing sounds like a bug in user space tools. I've
> found it reflects the same information as btrfs dev stats, and dev
> stats have been reliable in my testing.

Don't the user space tools just read what the kernel tells them?
I don't know how *not* to produce this behavior on btrfs raid5 or raid6.
It should show up on any btrfs raid56 system.

> > A different thing happens if there is a crash.  In that case, scrub cannot
> > repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> > behaves this way when disks turn bad.  I had assumed it was a software bug
> > in the comparatively new raid5 support that would get fixed eventually.
> 
> This is really annoyingly vague. You don't give a complete recipe for
> reproducing this sequence. Here's what I'm understanding and what I'm
> missing:
> 
> 1. The intentional corruption, extent of which is undefined, is still present.

No intentional corruption here (quote:  "A different thing happens if
there is a crash...").  Now we are talking about the baseline behavior
when there is a crash on a btrfs raid5 array, especially crashes
triggered by a disk-level failure (e.g. watchdog timeout because a disk
or controller has hung) but also ordinary power failures or other crashes
triggered by external causes.

> 2. A drive is bad, but that doesn't tell us if it's totally dead, or
> only intermittently spitting out spurious information.

The most common drive-initiated reboot case is that one drive temporarily
locks up and triggers the host to perform a watchdog reset.  The reset
is successful and the filesystem can be mounted again with all drives
present; however, a small amount of raid5 data appears to be corrupted
each time.  The raid1 metadata passes all the integrity checks I can
throw at it:  btrfs check, scrub, balance, walk the filesystem with find
-type f -exec cat ..., compare with the last backup, etc.
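
A sketch of that kind of check pass, assuming the filesystem is mounted at
/data (device and mount names are illustrative):

    btrfs scrub start -Bd /data                     # checksum every block
    find /data -type f -exec cat {} + > /dev/null   # force-read all file data
    dmesg | grep 'csum failed'                      # any complaints?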

Usually when I detect this case, I delete any corrupted data, delete
the disk that triggers the lockups and have no further problems with
that array.

> 3. Is the volume remounted degraded or is the bad drive still being
> used by Btrfs? Because Btrfs has no concept (patches pending) of drive
> faulty state like md, let alone an automatic change to that faulty
> state. It just keeps on trying to read or write to bad drives, even if
> they're physically removed.

In the baseline case the filesystem has all drives present after remount.
It could be as simple as power-cycling the host while writes are active.

> 4. You've initiated a scrub, and the corruption in 1 is not fixed.

In this pattern, btrfs may find both correctable and uncorrectable
corrupted data, usually on one of the drives.  scrub fixes the correctable
corruption, but fails on the uncorrectable.

> OK so what am I missing?

Nothing yet.  The above is the "normal" btrfs raid5 crash experience with
a non-degraded raid5 array.  A few megabytes of corrupt extents can easily
be restored from backups or deleted and everything's fine after that.

In my *current* failure case, I'm experiencing *additional* issues.
In total:

all of the above, plus:

BUG_ON()s as described in the first mail on 

Re: Adventures in btrfs raid5 disk recovery

2016-06-22 Thread Chris Murphy
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
>> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>>  wrote:
>> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>>
>> >> For me the critical question is what does "some corrupted sectors" mean?
>> >
>> > On other raid5 arrays, I would observe a small amount of corruption every
>> > time there was a system crash (some of which were triggered by disk
>> > failures, some not).
>>
>> What test are you using to determine there is corruption, and how much
>> data is corrupted? Is this on every disk? Non-deterministically fewer
>> than all disks? Have you identified this as a torn write or
>> misdirected write or is it just garbage at some sectors? And what's
>> the size? Partial sector? Partial md chunk (or fs block?)
>
> In earlier cases, scrub, read(), and btrfs dev stat all reported the
> incidents differently.  Scrub would attribute errors randomly to disks
> (error counts spread randomly across all the disks in the 'btrfs scrub
> status -d' output).  'dev stat' would correctly increment counts on only
> those disks which had individually had an event (e.g. media error or
> SATA bus reset).
>
> Before deploying raid5, I tested these by intentionally corrupting
> one disk in an otherwise healthy raid5 array and watching the result.

It's difficult to reproduce if no one understands how you
intentionally corrupted that disk. Literal reading, you corrupted the
entire disk, but that's impractical. The fs is expected to behave
differently depending on what's been corrupted and how much.

> When scrub identified an inode and offset in the kernel log, the csum
> failure log message matched the offsets producing EIO on read(), but
> the statistics reported by scrub about which disk had been corrupted
> were mostly wrong.  In such cases a scrub could repair the data.

I don't often use the -Bd options, so I haven't tested it thoroughly,
but what you're describing sounds like a bug in user space tools. I've
found it reflects the same information as btrfs dev stats, and dev
stats have been reliable in my testing.


> A different thing happens if there is a crash.  In that case, scrub cannot
> repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> behaves this way when disks turn bad.  I had assumed it was a software bug
> in the comparatively new raid5 support that would get fixed eventually.

This is really annoyingly vague. You don't give a complete recipe for
reproducing this sequence. Here's what I'm understanding and what I'm
missing:

1. The intentional corruption, extent of which is undefined, is still present.
2. A drive is bad, but that doesn't tell us if it's totally dead, or
only intermittently spitting out spurious information.
3. Is the volume remounted degraded or is the bad drive still being
used by Btrfs? Because Btrfs has no concept (patches pending) of drive
faulty state like md, let alone an automatic change to that faulty
state. It just keeps on trying to read or write to bad drives, even if
they're physically removed.
4. You've initiated a scrub, and the corruption in 1 is not fixed.

OK so what am I missing?

Because it sounds to me like you have two copies of data that are
gone. For raid5 that's data loss; scrub can't fix things. Corruption
is missing data. The bad drive is missing data.

What values do you get for

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout
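
For reference, querying and (where the drive supports it) setting these;
the smartctl times are given in tenths of a second:

    smartctl -l scterc /dev/sdX                # query current ERC limits
    smartctl -l scterc,70,70 /dev/sdX          # set 7.0s read/write recovery
    cat /sys/block/sdX/device/timeout          # kernel command timer, seconds
    echo 180 > /sys/block/sdX/device/timeout   # raise it for drives without ERC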



>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted. That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem: the fs cannot be consistent if the super
block points to any tree whose metadata or data isn't on stable media.

But if you think it's happening, you might benefit from integrity
checking. Maybe try just the metadata one for starters, which is the
check_int mount option (support must be compiled in first for that mount
option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2
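
A sketch of enabling it, assuming a kernel built with
CONFIG_BTRFS_FS_CHECK_INTEGRITY=y (device and mount point illustrative):

    mount -o check_int /dev/sdX /mnt        # verify metadata at write time
    mount -o check_int_data /dev/sdX /mnt   # metadata plus data, much slower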



> It could be possible to corrupt existing data if:
>
> - btrfs puts multiple extents in the same RAID5 stripe
>
> - writes two of the extents in different transactions
>
> - the array is degraded so the RAID5 write hole applies
>
>

Re: Adventures in btrfs raid5 disk recovery - update

2016-06-21 Thread Zygo Blaxell
TL;DR:

Kernel 4.6.2 causes a world of pain.  Use 4.5.7 instead.

'btrfs dev stat' doesn't seem to count "csum failed"
(i.e. corruption) errors in compressed extents.



On Sun, Jun 19, 2016 at 11:44:27PM -0400, Zygo Blaxell wrote:
> Not so long ago, I had a disk fail in a btrfs filesystem with raid1
> metadata and raid5 data.  I mounted the filesystem readonly, replaced
> the failing disk, and attempted to recover by adding the new disk and
> deleting the missing disk.

> I'm currently using kernel 4.6.2

That turned out to be a mistake.  4.6.2 has some severe problems.

Over the past few days I've been upgrading other machines from 4.5.7
to 4.6.2.  This morning I saw the aggregate data coming back from
those machines, and it's all bad:  stalls in snapshot delete, balance,
and sync; some machines just lock up with no console messages; a lot of
watchdog timeouts.  None of the machines could get to an uptime over 26
hours and still be in a usable state.

I switched to 4.5.7 and the crashes, balance/delete hangs, and some of
the data corruption modes stopped.

> I'm
> getting EIO randomly all over the filesystem, including in files that were
> written entirely _after_ the disk failure.

There were actually four distinct corruption modes happening:

1.  There are some number (16500 so far) "normal" corrupt blocks:  read
repeatably returns EIO, they show up in scrub with sane log messages,
and replacing the files that contain these blocks makes them go away.
These blocks appear to be contained in extents that coincide with the
date of the disk failure.  Interestingly, no matter how many times I
read these blocks, I get no increase in the 'btrfs dev stat' numbers
even though I get kernel csum failure messages.  That looks like a bug.

2.  When attempting to replace corrupted files with rsync, I had used
'rsync --inplace'.  This caused bad blocks to be overwritten within
extents, but does not necessarily replace the _entire_ extent containing a
bad block.  This creates corrupt blocks that show up in scrub, balance,
and device delete, but not when reading files.  It also updates the
timestamps so a file with old corruption looks "new" to an insufficiently
sophisticated analysis tool.
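
A sketch of the safer variant (paths illustrative): without --inplace, rsync
writes a temporary copy and renames it over the target, so the whole file is
rewritten as new extents rather than patched block by block:

    rsync -av backup:/srv/data/file /data/file   # no --inplace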

3.  Files were corrupted while they were written and accessed via NFS.
This created files with correct btrfs checksums, but garbage contents.
This would show up as failures during 'git gc' or rsync checksum
mismatches.  During one of the many VM crashes, any writes in progress at
the time of the crash were lost.  This effectively rewound the filesystem
several minutes each time as btrfs reverts to the previous committed
tree on the next mount.  4.6.2's hanging issues made this worse by
delaying btrfs commits indefinitely.  The NFS clients were completely
unaware of this, so when the VM rebooted, files ended up with holes,
or would just disappear while in use.

4.  After a VM crash and the filesystem reverted to the previous
committed tree, files with bad blocks that had been repaired through
the NFS server or with rsync would be "unrepaired" (i.e. the filesystem
would revert back to the original corrupted blocks after the mount).

Combinations of these could occur as well for extra confusion, and some
corrupted blocks are contained in many files thanks to dedup.

With kernel 4.5.7 there have been no lockups during commit and no VM
crashes, so I haven't seen any of corruption modes 3 and 4 since 4.5.7.

Balance is now running normally to move the remaining data off the
missing disk.  ETA is 558 hours.  See you in mid-July!  ;)





Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 09:55:59PM -0400, Zygo Blaxell wrote:
> In this current case, I'm getting things like this:
> 
> [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[...]
> The other other weird thing here is that I can't find an example in the
> logs of an extent with an EIO that isn't compressed.  I've been looking
> up a random sample of the extent numbers, matching them up to filefrag
> output, and finding e.g. the one compressed extent in the middle of an
> otherwise uncompressed git pack file.  That's...odd.  Maybe there's a
> problem with compressed extents in particular?  I'll see if I can
> script something to check all the logs at once...

No need for a script:  this message wording appears only in
fs/btrfs/compression.c so it can only ever be emitted by reading a
compressed extent.
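
One way to check that against a source tree (illustrative; the exact format
string varies between kernel versions):

    grep -rn 'csum failed ino' fs/btrfs/   # list every place this wording is emitted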

Maybe there's a problem specific to raid5, degraded mode, and compressed
extents?





Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> 
> >> For me the critical question is what does "some corrupted sectors" mean?
> >
> > On other raid5 arrays, I would observe a small amount of corruption every
> > time there was a system crash (some of which were triggered by disk
> > failures, some not).
> 
> What test are you using to determine there is corruption, and how much
> data is corrupted? Is this on every disk? Non-deterministically fewer
> than all disks? Have you identified this as a torn write or
> misdirected write or is it just garbage at some sectors? And what's
> the size? Partial sector? Partial md chunk (or fs block?)

In earlier cases, scrub, read(), and btrfs dev stat all reported the
incidents differently.  Scrub would attribute errors randomly to disks
(error counts spread randomly across all the disks in the 'btrfs scrub
status -d' output).  'dev stat' would correctly increment counts on only
those disks which had individually had an event (e.g. media error or
SATA bus reset).

Before deploying raid5, I tested these by intentionally corrupting
one disk in an otherwise healthy raid5 array and watching the result.
When scrub identified an inode and offset in the kernel log, the csum
failure log message matched the offsets producing EIO on read(), but
the statistics reported by scrub about which disk had been corrupted
were mostly wrong.  In such cases a scrub could repair the data.

A different thing happens if there is a crash.  In that case, scrub cannot
repair the errors.  Every btrfs raid5 filesystem I've deployed so far
behaves this way when disks turn bad.  I had assumed it was a software bug
in the comparatively new raid5 support that would get fixed eventually.

In this current case, I'm getting things like this:

[12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[12008.243876] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1689373462 wanted 787343232 mirror 0
[12008.243885] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3621611229 wanted 787343232 mirror 0
[12008.243893] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 113993114 wanted 787343232 mirror 0
[12008.243902] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1464956834 wanted 787343232 mirror 0
[12008.243911] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2545274038 wanted 787343232 mirror 0
[12008.243942] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4090153227 wanted 787343232 mirror 0
[12008.243952] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243961] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243976] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 172651968 wanted 787343232 mirror 0
[12008.246158] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4129844199 wanted 787343232 mirror 1
[12008.247557] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1374425809 wanted 787343232 mirror 1
[12008.403493] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1567917468 wanted 787343232 mirror 1
[12008.409809] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2881359629 wanted 787343232 mirror 0
[12008.411165] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3021442070 wanted 787343232 mirror 0
[12008.411180] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3984314874 wanted 787343232 mirror 0
[12008.411189] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 599192427 wanted 787343232 mirror 0
[12008.411199] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 2887010053 wanted 787343232 mirror 0
[12008.411208] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 1314141634 wanted 787343232 mirror 0
[12008.411217] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 3156167613 wanted 787343232 mirror 0
[12008.411227] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 565550942 wanted 787343232 mirror 0
[12008.411236] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 4068631390 wanted 787343232 mirror 0
[12008.411245] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 csum 531263990 wanted 787343232 mirror 0
[12008.411255] BTRFS info (device vdc): csum failed ino 4420604 extent 26805825306624 

Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:

>> For me the critical question is what does "some corrupted sectors" mean?
>
> On other raid5 arrays, I would observe a small amount of corruption every
> time there was a system crash (some of which were triggered by disk
> failures, some not).

What test are you using to determine there is corruption, and how much
data is corrupted? Is this on every disk? Non-deterministically fewer
than all disks? Have you identified this as a torn write or
misdirected write or is it just garbage at some sectors? And what's
the size? Partial sector? Partial md chunk (or fs block?)

>  It looked like any writes in progress at the time
> of the failure would be damaged.  In the past I would just mop up the
> corrupt files (they were always the last extents written, easy to find
> with find-new or scrub) and have no further problems.

This is on Btrfs? This isn't supposed to be possible. Even a literal
overwrite of a file is not an overwrite on Btrfs unless the file is
nodatacow. Data extents get written, then the metadata is updated to
point to those new blocks. There should be flush or fua requests to
make sure the order is such that the fs points to either the old or
new file, in either case uncorrupted. That's why I'm curious about the
nature of this corruption. It sounds like your hardware is not exactly
honoring flush requests.

With md raid and any other file system, it's pure luck that such
corrupted writes would only affect data extents and not the fs
metadata. Corrupted fs metadata is not well tolerated by any file
system, not least because most of them have no idea the metadata
is corrupt. At least Btrfs can determine this and, if there's another
copy, use that, or just stop and face plant before more damage happens.
Maybe an exception now is XFS v5 metadata, which employs checksumming.
But it still doesn't know if data extents are wrong (i.e. a torn or
misdirected write).

I've had perhaps a hundred power-offs during write with Btrfs and SSD
and I don't ever see corrupt files. It's definitely not normal to see
this with Btrfs.


> In the earlier
> cases there were no new instances of corruption after the initial failure
> event and manual cleanup.
>
> Now that I dig a little deeper into this, I do see one fairly significant
> piece of data:
>
> root@host:~# btrfs dev stat /data | grep -v ' 0$'
> [/dev/vdc].corruption_errs 16774
> [/dev/vde].write_io_errs   121
> [/dev/vde].read_io_errs4
> [devid:8].read_io_errs16
>
> Prior to the failure of devid:8, vde had 121 write errors and 4 read
> errors (these counter values are months old and the errors were long
> since repaired by scrub).  The 16774 corruption errors on vdc are all
> new since the devid:8 failure, though.

On md RAID 5 and 6, if the array gets parity mismatch counts above 0
doing a scrub (echo check > md/sync_action), there's a hardware problem.
It's entirely possible you've found a bug, but it must be extremely
obscure to basically not have hit everyone trying Btrfs raid56. I
think you need to track down the source of this corruption and stop it
however possible, whether that's changing hardware or making sure
the system isn't crashing.
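
The md scrub referred to here, as a sketch (assuming the array is md0):

    echo check > /sys/block/md0/md/sync_action   # start a parity check pass
    cat /sys/block/md0/md/mismatch_cnt           # 0 on healthy hardware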

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
> Seems difficult at best due to this:
> >> The normal 'device delete' operation got about 25% of the way in,
> >> then got stuck on some corrupted sectors, aborting with EIO.
> 
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
> 
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward is not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk, or
> Btrfs being confused as it's reading the data from disk?

> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption every
time there was a system crash (some of which were triggered by disk
failures, some not).  It looked like any writes in progress at the time
of the failure would be damaged.  In the past I would just mop up the
corrupt files (they were always the last extents written, easy to find
with find-new or scrub) and have no further problems.  In the earlier
cases there were no new instances of corruption after the initial failure
event and manual cleanup.

Now that I dig a little deeper into this, I do see one fairly significant
piece of data:

root@host:~# btrfs dev stat /data | grep -v ' 0$'
[/dev/vdc].corruption_errs 16774
[/dev/vde].write_io_errs   121
[/dev/vde].read_io_errs4
[devid:8].read_io_errs16

Prior to the failure of devid:8, vde had 121 write errors and 4 read
errors (these counter values are months old and the errors were long
since repaired by scrub).  The 16774 corruption errors on vdc are all
new since the devid:8 failure, though.

> 
> 
> -- 
> Chris Murphy
> 




Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
>> On Sun, 19 Jun 2016 23:44:27 -0400
>> Zygo Blaxell  wrote:
>> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
>> better off shutting down the system, booting a rescue OS, copying the content
>> of the failing disk to the replacement one using 'ddrescue', then removing the
>> bad disk, and after boot up your main system wouldn't notice anything has ever
>> happened, aside from a few recoverable CRC errors in the "holes" on the areas
>> which ddrescue failed to copy.
>
> I'm aware of ddrescue and myrescue, but in this case the disk has failed,
> past tense.  At this point the remaining choices are to make btrfs native
> raid5 recovery work, or to restore from backups.

Seems difficult at best due to this:
>> The normal 'device delete' operation got about 25% of the way in,
>> then got stuck on some corrupted sectors, aborting with EIO.

In effect it's like a 2 disk failure for a raid5 (or it's
intermittently a 2 disk failure but always at least a 1 disk failure).
That's not something md raid recovers from. Even manual recovery in
such a case is far from certain.

Perhaps Roman's advice is also a question about the cause of this
corruption? I'm wondering this myself. That's the real problem here as
I see it. Losing a drive is ordinary. Additional corruptions happening
afterward is not. And are those corrupt sectors hardware corruptions,
or Btrfs corruptions at the time the data was written to disk, or
Btrfs being confused as it's reading the data from disk?

For me the critical question is what does "some corrupted sectors" mean?


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> On Sun, 19 Jun 2016 23:44:27 -0400
> Zygo Blaxell  wrote:
> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
> better off shutting down the system, booting a rescue OS, copying the content
> of the failing disk to the replacement one using 'ddrescue', then removing the
> bad disk, and after boot up your main system wouldn't notice anything has ever
> happened, aside from a few recoverable CRC errors in the "holes" on the areas
> which ddrescue failed to copy.

I'm aware of ddrescue and myrescue, but in this case the disk has failed,
past tense.  At this point the remaining choices are to make btrfs native
raid5 recovery work, or to restore from backups.

> But in general it's commendable that you're experimenting with doing things
> "the native way", as this provides feedback to the developers and could help
> make the RAID implementation better. I guess that's the whole point of the
> exercise and the report, and hope this ends up being useful for everyone.

The intent was both to provide a cautionary tale for anyone considering
deploying a btrfs raid5 system today, and to possibly engage some
developers to help solve the problems.

The underlying causes seem to be somewhat removed from where the symptoms
are appearing, and at the moment I don't understand this code well enough
to know where to look for them.  Any assistance would be greatly appreciated.


> -- 
> With respect,
> Roman






Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Roman Mamedov
On Sun, 19 Jun 2016 23:44:27 -0400
Zygo Blaxell  wrote:

> It's not going well so far.  Pay attention, there are at least four
> separate problems in here and we're not even half done yet.
> 
> I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
> 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
> also pulled in some patches from the 4.7-rc series.
> 
> This fixed a few problems I encountered early on, and I'm still making
> forward progress, but I've only replaced 50% of the failed disk so far,
> and this is week four of this particular project.

From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
better off shutting down the system, booting a rescue OS, copying the content
of the failing disk to the replacement one using 'ddrescue', then removing the
bad disk, and after boot up your main system wouldn't notice anything has ever
happened, aside from a few recoverable CRC errors in the "holes" on the areas
which ddrescue failed to copy.

But in general it's commendable that you're experimenting with doing things
"the native way", as this provides feedback to the developers and could help
make the RAID implementation better. I guess that's the whole point of the
exercise and the report, and hope this ends up being useful for everyone.

-- 
With respect,
Roman




Adventures in btrfs raid5 disk recovery

2016-06-19 Thread Zygo Blaxell
Not so long ago, I had a disk fail in a btrfs filesystem with raid1
metadata and raid5 data.  I mounted the filesystem readonly, replaced
the failing disk, and attempted to recover by adding the new disk and
deleting the missing disk.

It's not going well so far.  Pay attention, there are at least four
separate problems in here and we're not even half done yet.

I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
also pulled in some patches from the 4.7-rc series.

This fixed a few problems I encountered early on, and I'm still making
forward progress, but I've only replaced 50% of the failed disk so far,
and this is week four of this particular project.

What worked:

'mount -odegraded,...' successfully mounts the filesystem RW.  
'btrfs device add' adds the new disk.  Success!

The first thing I did was balance the metadata onto non-missing disks.
That went well.  Now there are only data chunks to recover from the
missing disk.  Success!

The normal 'device delete' operation got about 25% of the way in,
then got stuck on some corrupted sectors, aborting with EIO.
That ends the success, but I've had similar problems with raid5 arrays
before and been able to solve them.

I've managed to remove about half of the data from the missing disk
so far.  'balance start -ddevid=,drange=0..1000'
(with increasing values for drange) is able to move data off the failed
disk while avoiding the damaged regions.  It looks like this process could
reduce the amount of data on "missing" devices to a manageable number,
then I could identify the offending corrupted extents with 'btrfs scrub',
remove the files containing them, and finish the device delete operation.
Hope!
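
A sketch of that incremental balance loop under stated assumptions: devid 8
is the failed disk, /data is the mount point, drange is given in bytes, and
the 10GiB window size is arbitrary:

    gib=$((1024 * 1024 * 1024))
    for start in $(seq 0 10 990); do
        btrfs balance start -ddevid=8,drange=$((start * gib))..$(( (start + 10) * gib )) /data \
            || echo "window at ${start}GiB hit a damaged region, skipping"
    done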

What doesn't work:

The first problem is that the kernel keeps crashing.  I put the filesystem
and all its disks in a KVM so the crashes are less disruptive, and I can
debug them (or at least collect panic logs).  OK now crashes are merely a
performance problem.

Why did I mention 'btrfs scrub' above?  Because 'btrfs scrub' tells me
where corrupted blocks are.  'device delete' fills my kernel logs with
lines like this:

[26054.744158] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[26809.746993] BTRFS warning (device vdc): csum failed ino 404 off 6021976064 csum 778377694 expected csum 2827380172
[26809.747029] BTRFS warning (device vdc): csum failed ino 404 off 6021980160 csum 3776938678 expected csum 514150079
[26809.747077] BTRFS warning (device vdc): csum failed ino 404 off 6021984256 csum 470593400 expected csum 642831408
[26809.747093] BTRFS warning (device vdc): csum failed ino 404 off 6021988352 csum 796755777 expected csum 690854341
[26809.747108] BTRFS warning (device vdc): csum failed ino 404 off 6021992448 csum 4115095129 expected csum 249712906
[26809.747122] BTRFS warning (device vdc): csum failed ino 404 off 6021996544 csum 2337431338 expected csum 1869250975
[26809.747138] BTRFS warning (device vdc): csum failed ino 404 off 6022000640 csum 3543852608 expected csum 1929026437
[26809.747154] BTRFS warning (device vdc): csum failed ino 404 off 6022004736 csum 3417780495 expected csum 3698318115
[26809.747169] BTRFS warning (device vdc): csum failed ino 404 off 6022008832 csum 3423877520 expected csum 2981727596
[26809.747183] BTRFS warning (device vdc): csum failed ino 404 off 6022012928 csum 550838742 expected csum 1005563554
[26896.379773] BTRFS info (device vdc): relocating block group 27753592127488 flags 129
[27791.128098] __readpage_endio_check: 7 callbacks suppressed
[27791.236794] BTRFS warning (device vdc): csum failed ino 405 off 6021980160 csum 3776938678 expected csum 514150079
[27791.236799] BTRFS warning (device vdc): csum failed ino 405 off 6021971968 csum 3304844252 expected csum 4171523312
[27791.236821] BTRFS warning (device vdc): csum failed ino 405 off 6021984256 csum 470593400 expected csum 642831408
[27791.236825] BTRFS warning (device vdc): csum failed ino 405 off 6021988352 csum 796755777 expected csum 690854341
[27791.236842] BTRFS warning (device vdc): csum failed ino 405 off 6021992448 csum 4115095129 expected csum 249712906
[27791.236847] BTRFS warning (device vdc): csum failed ino 405 off 6021996544 csum 2337431338 expected csum 1869250975
[27791.236857] BTRFS warning (device vdc): csum failed ino 405 off 6022004736 csum 3417780495 expected csum 3698318115
[27791.236864] BTRFS warning (device vdc): csum failed ino 405 off 6022000640 csum 3543852608 expected csum 1929026437
[27791.236874] BTRFS warning (device vdc): csum failed ino 405 off 6022008832 csum 3423877520 expected csum 2981727596
[27791.236978] BTRFS warning (device vdc): csum failed ino 405 off 6021976064 csum 778377694 expected