BTRFS RAID5 disk failed while balancing

2018-11-01 Thread Oliver R.

If you clicked on the link to this topic: Thank you!

I have the following setup:

6x 500GB HDD-Drives
1x 32GB NVME-SSD (Intel Optane)

I used bcache to set up the SSD as the caching device, with the other six 
drives as backing devices. After all that was in place, I formatted the 
six HDDs with btrfs in RAID5. Everything has worked as expected for the 
last 7 months.
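
Roughly, the stack was built along these lines (device names, the cache-set 
UUID and the metadata profile are placeholders here, reconstructed from memory, 
not my exact commands):

sudo make-bcache -C /dev/nvme0n1          # format the Optane SSD as the cache device
sudo bcache-super-show /dev/nvme0n1       # note the cset.uuid it reports

for d in /dev/sd{a..f}; do                # format each 500GB HDD as a backing device
    sudo make-bcache -B "$d"
done

for n in 0 1 2 3 4 5; do                  # attach every backing device to the cache set
    sudo sh -c "echo <cset-uuid> > /sys/block/bcache$n/bcache/attach"
done

# six-device RAID5 filesystem on the bcache devices (profiles assumed here)
sudo mkfs.btrfs -d raid5 -m raid5 /dev/bcache{0..5}
sudo mount /dev/bcache0 /media/raid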


I now have six spare 2TB HDDs and want to replace the old 500GB disks 
one by one. I started with the first one by deleting it from the btrfs 
filesystem, which worked fine with no issues. After that I cleanly 
detached the now-empty disk from bcache (still everything fine) and 
removed it physically. Here are the command lines for this:


sudo btrfs device delete /dev/bcacheX /media/raid
cat /sys/block/bcacheX/bcache/state
cat /sys/block/bcacheX/bcache/dirty_data
sudo sh -c "echo 1 > /sys/block/bcacheX/bcache/detach"
cat /sys/block/bcacheX/bcache/state

After that I installed one of the 2TB drives, attached it to bcache and 
added it to the RAID. The next step was to balance the data onto the 
new drive. Please see the command lines:


sudo make-bcache -B /dev/sdY
sudo sh -c "echo '60a63f7c-2e68-4503-9f25-71b6b00e47b2' > 
/sys/block/bcacheY/bcache/attach"

sudo sh -c "echo writeback > /sys/block/bcacheY/bcache/cache_mode"
sudo btrfs device add /dev/bcacheY /media/raid
sudo btrfs fi ba start /media/raid/

The balance worked fine until ~164 GB had been written to the new drive, 
which is about 50% of the data to be balanced. Then write errors suddenly 
appeared on the disk. The RAID slowly became unusable (I was running 3 VMs 
off the RAID while balancing). I think it kept working for some time because 
the SSD was absorbing the writes. At some point the balancing stopped and I 
was only able to kill the VMs. I checked the I/O on the disks and the SSD 
was serving a constant 1.2 GB/s of reads. I think bcache somehow delivered 
data to btrfs that got rejected there and requested again, but this is 
just a guess. Anyway, I ended up resetting the host, physically 
disconnected the broken disk and put a new one in its place. I also created 
a bcache backing device on it and issued the following command to 
replace the faulty disk:


sudo btrfs replace start -r 7 /dev/bcache5 /media/raid

The filesystem needs to be mounted read/write for this command to work. 
It is now doing its work, but very slowly, at about 3.5 MB/s. Unfortunately 
the syslog reports a lot of these messages:


...
scrub_missing_raid56_worker: 62 callbacks suppressed
BTRFS error (device bcache0): failed to rebuild valid logical 
4929143865344 for dev (null)

...
BTRFS error (device bcache0): failed to rebuild valid logical 
4932249866240 for dev (null)

scrub_missing_raid56_worker: 1 callbacks suppressed
BTRFS error (device bcache0): failed to rebuild valid logical 
4933254250496 for dev (null)



If I try to read a file from the filesystem, the read fails with a simple 
I/O error and the syslog shows entries similar to this:


BTRFS warning (device bcache0): csum failed root 5 ino 1143 off 
7274496 csum 0xf554 expected csum 0x6340b527 mirror 2


So far, so good (or bad). It has taken about 6 hours for 4.3% of the 
replacement so far. No read or write errors have been reported for the 
replacement procedure ("btrfs replace status"). I will let it do its 
thing until it finishes. Before the first 2TB disk failed, 164 GB of data 
had been written according to "btrfs filesystem show". If I check the 
amount of data written to the new drive, the 4.3% represents about 82 GB 
(according to /proc/diskstats). I don't know how to interpret this, but 
anyway.
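
For reference, I gathered those numbers roughly like this (sdX stands for 
whichever disk backs the new bcache device):

# overall progress and error counters of the running replace
sudo btrfs replace status /media/raid

# sectors written to the new backing disk so far (field 10 of
# /proc/diskstats is "sectors written", in 512-byte units)
awk '$3 == "sdX" { printf "%.1f GiB written\n", $10 * 512 / 2^30 }' /proc/diskstats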


And now finally my questions: If the replace command finishes 
successfully, what should I do next? A scrub? A balance? Another backup? ;-)
Do you see anything that I have done wrong in this procedure? Do the 
warnings and errors reported by btrfs mean that the data is lost? :-(


Here is some additional info (**edited**):

$ sudo btrfs fi sh
Label: none  uuid: 9f765025-5354-47e4-afcc-a601b2a52703
Total devices 7 FS bytes used 1.56TiB
devid    0 size 1.82TiB used 164.03GiB path /dev/bcache5
devid    1 size 465.76GiB used 360.03GiB path /dev/bcache4
devid    3 size 465.76GiB used 360.00GiB path /dev/bcache3
devid    4 size 465.76GiB used 359.03GiB path /dev/bcache1
devid    5 size 465.76GiB used 360.00GiB path /dev/bcache0
devid    6 size 465.76GiB used 360.03GiB path /dev/bcache2
*** Some devices missing

$ sudo btrfs dev stats /media/raid/
[/dev/bcache5].write_io_errs    0
[/dev/bcache5].read_io_errs     0
[/dev/bcache5].flush_io_errs    0
[/dev/bcache5].corruption_errs  0
[/dev/bcache5].generation_errs  0
[/dev/bcache4].write_io_errs    0
[/dev/bcache4].read_io_errs     0
[/dev/bcache4].flush_io_errs    0
[/dev/bcache4].corruption_errs  0

[PATCH V8] Add support for BTRFS raid5/6 to GRUB

2018-09-27 Thread Goffredo Baroncelli


Hi All,

the aim of this patch set is to provide support for a BTRFS raid5/6
filesystem in GRUB.

The first patch implements the basic support for raid5/6, i.e. it works when
all the disks are present.

The next 5 patches are preparatory ones.

The 7th patch implements the raid5 recovery for btrfs (i.e. handling the
disappearance of one disk).
The 8th patch makes the code for handling the raid6 recovery more generic.
The last one implements the raid6 recovery for btrfs (i.e. handling the
disappearance of up to two disks).

I tested the code in grub-emu, and it works both with all the disks present
and with some disks missing. I checked that the crc32 calculated by grub and
by linux matched. Finally I checked that the support for md raid6
still works properly, and it does (with all drives and with up to 2 drives
missing).

Comments are welcome.

Changelog
v1: initial support for btrfs raid5/6. No recovery allowed
v2: full support for btrfs raid5/6. Recovery allowed
v3: some minor cleanup suggested by Daniel Kiper; reusing the
original raid6 recovery code of grub
v4: several spelling fixes; better description of the RAID layout
in btrfs and of the variables which describe the stripe
positioning; split patch #5 in two (#5 and #6)
v5: several spelling fixes; improved code comments in patch #1, small
clean up in the code
v6: small cleanup; improved the wording in the RAID6 layout
description; in the function raid6_recover_read_buffer() avoid
an unnecessary memcpy in case of invalid data
v7: - patches 2, 3, 5, 6 and 8 received a Reviewed-by from Daniel and were
unchanged from the last time (only minor cleanups in the commit
descriptions requested by Daniel)
- patch 7 received some small updates rearranging a for() and adding
brackets around an if()
- patch 4 received an updated message which better explains why NULL
is stored in data->devices_attached[]
- patch 9 received a blank line to better separate a code line from
a previous comment. A description of 'parities_pos' was added
- patch 1 received a major update to the description of the variables'
meaning in the comments. However I suspect that we need some further
review to reach full agreement about this text. NB: the updates relate
only to comments
v8: - patches 2, 5, 6 and 8 received a Reviewed-by from Daniel and were
unchanged from the last time (only minor cleanups in the commit
descriptions requested by Daniel)
- patch 1 received some adjustments to the variable descriptions due to
  the different terminology between BTRFS and other RAID implementations.
  Added a description for the "nparities" variable.
- patch 3 removed some unnecessary curly brackets (change requested by Daniel)
- patch 4 received an improved commit description about why and how
  the function find_device() is changed
- patch 7 received an update which transforms an "i = 0; while (i...) i++;"
  construct into an equivalent for (i = 0; ...; i++) loop
- patch 9 received an update to the comment

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: [RFC] Add support for BTRFS raid5/6 to GRUB

2018-04-23 Thread Goffredo Baroncelli
On 04/23/2018 01:50 PM, Daniel Kiper wrote:
> On Tue, Apr 17, 2018 at 09:57:40PM +0200, Goffredo Baroncelli wrote:
>> Hi All,
>>
>> Below you can find a patch to add support for accessing files from
>> grub in a RAID5/6 btrfs filesystem. This is an RFC because it is
>> missing the support for recovery (i.e. if some devices are missing). In
>> the coming days (weeks?) I will extend this patch to also support that
>> case.
>>
>> Comments are welcome.
> 
> More or less LGTM. Just a nitpick below... I am happy to take full blown
> patch into GRUB if it is ready.

Thanks for the comments; in the meantime I have also implemented the recovery. 
It is under testing. Give me a few days and I will resubmit the patches.

> 
>> BR
>> G.Baroncelli
>>
>>
>> ---
>>
>> commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
>> Author: Goffredo Baroncelli <kreij...@inwind.it>
>> Date:   Tue Apr 17 21:40:31 2018 +0200
>>
>> Add initial support for btrfs raid5/6 chunk
>>
>> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
>> index be195448d..4c5632acb 100644
>> --- a/grub-core/fs/btrfs.c
>> +++ b/grub-core/fs/btrfs.c
>> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1         0x10
>>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED    0x20
>>  #define GRUB_BTRFS_CHUNK_TYPE_RAID10        0x40
>> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5         0x80
>> +#define GRUB_BTRFS_CHUNK_TYPE_RAID6         0x100
>>grub_uint8_t dummy2[0xc];
>>grub_uint16_t nstripes;
>>grub_uint16_t nsubstripes;
>> @@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
>> grub_disk_addr_t addr,
>>stripe_offset = low + chunk_stripe_length
>>  * high;
>>csize = chunk_stripe_length - low;
>> +  break;
>> +}
>> +  case GRUB_BTRFS_CHUNK_TYPE_RAID5:
>> +  case GRUB_BTRFS_CHUNK_TYPE_RAID6:
>> +{
>> +  grub_uint64_t nparities;
>> +  grub_uint64_t parity_pos;
>> +  grub_uint64_t stripe_nr, high;
>> +  grub_uint64_t low;
>> +
>> +  redundancy = 1;   /* no redundancy for now */
>> +
>> +  if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
>> +{
>> +  grub_dprintf ("btrfs", "RAID5\n");
>> +  nparities = 1;
>> +}
>> +  else
>> +{
>> +  grub_dprintf ("btrfs", "RAID6\n");
>> +  nparities = 2;
>> +}
>> +
>> +  stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
>> +
>> +  high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
>> +  grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
>> +  grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);
> 
> Missing spaces around "+" and "-".
> 
> Daniel


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [RFC] Add support for BTRFS raid5/6 to GRUB

2018-04-23 Thread Daniel Kiper
On Tue, Apr 17, 2018 at 09:57:40PM +0200, Goffredo Baroncelli wrote:
> Hi All,
>
> Below you can find a patch to add support for accessing files from
> grub in a RAID5/6 btrfs filesystem. This is an RFC because it is
> missing the support for recovery (i.e. if some devices are missing). In
> the coming days (weeks?) I will extend this patch to also support that
> case.
>
> Comments are welcome.

More or less LGTM. Just a nitpick below... I am happy to take full blown
patch into GRUB if it is ready.

> BR
> G.Baroncelli
>
>
> ---
>
> commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
> Author: Goffredo Baroncelli <kreij...@inwind.it>
> Date:   Tue Apr 17 21:40:31 2018 +0200
>
> Add initial support for btrfs raid5/6 chunk
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index be195448d..4c5632acb 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1         0x10
>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED    0x20
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID10        0x40
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5         0x80
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID6         0x100
>grub_uint8_t dummy2[0xc];
>grub_uint16_t nstripes;
>grub_uint16_t nsubstripes;
> @@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
> grub_disk_addr_t addr,
> stripe_offset = low + chunk_stripe_length
>   * high;
> csize = chunk_stripe_length - low;
> +   break;
> + }
> +   case GRUB_BTRFS_CHUNK_TYPE_RAID5:
> +   case GRUB_BTRFS_CHUNK_TYPE_RAID6:
> + {
> +   grub_uint64_t nparities;
> +   grub_uint64_t parity_pos;
> +   grub_uint64_t stripe_nr, high;
> +   grub_uint64_t low;
> +
> +   redundancy = 1;   /* no redundancy for now */
> +
> +   if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
> + {
> +   grub_dprintf ("btrfs", "RAID5\n");
> +   nparities = 1;
> + }
> +   else
> + {
> +   grub_dprintf ("btrfs", "RAID6\n");
> +   nparities = 2;
> + }
> +
> +   stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
> +
> +   high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
> +   grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
> +   grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);

Missing spaces around "+" and "-".

Daniel


[RFC] Add support for BTRFS raid5/6 to GRUB

2018-04-17 Thread Goffredo Baroncelli
Hi All,

Below you can find a patch to add support for accessing files from grub in a 
RAID5/6 btrfs filesystem. This is an RFC because it is missing the support for 
recovery (i.e. if some devices are missing). In the coming days (weeks?) I will 
extend this patch to also support that case.

Comments are welcome.

BR
G.Baroncelli


---

commit 8c80a1b7c913faf50f95c5c76b4666ed17685666
Author: Goffredo Baroncelli <kreij...@inwind.it>
Date:   Tue Apr 17 21:40:31 2018 +0200

Add initial support for btrfs raid5/6 chunk

diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
index be195448d..4c5632acb 100644
--- a/grub-core/fs/btrfs.c
+++ b/grub-core/fs/btrfs.c
@@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
 #define GRUB_BTRFS_CHUNK_TYPE_RAID1         0x10
 #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED    0x20
 #define GRUB_BTRFS_CHUNK_TYPE_RAID10        0x40
+#define GRUB_BTRFS_CHUNK_TYPE_RAID5         0x80
+#define GRUB_BTRFS_CHUNK_TYPE_RAID6         0x100
   grub_uint8_t dummy2[0xc];
   grub_uint16_t nstripes;
   grub_uint16_t nsubstripes;
@@ -764,6 +766,39 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, 
grub_disk_addr_t addr,
  stripe_offset = low + chunk_stripe_length
* high;
  csize = chunk_stripe_length - low;
+ break;
+   }
+ case GRUB_BTRFS_CHUNK_TYPE_RAID5:
+ case GRUB_BTRFS_CHUNK_TYPE_RAID6:
+   {
+ grub_uint64_t nparities;
+ grub_uint64_t parity_pos;
+ grub_uint64_t stripe_nr, high;
+ grub_uint64_t low;
+
+ redundancy = 1;   /* no redundancy for now */
+
+ if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
+   {
+ grub_dprintf ("btrfs", "RAID5\n");
+ nparities = 1;
+   }
+ else
+   {
+ grub_dprintf ("btrfs", "RAID6\n");
+ nparities = 2;
+   }
+
+ stripe_nr = grub_divmod64 (off, chunk_stripe_length, &low);
+
+ high = grub_divmod64 (stripe_nr, nstripes - nparities, &stripen);
+ grub_divmod64 (high+nstripes-nparities, nstripes, &parity_pos);
+ grub_divmod64 (parity_pos+nparities+stripen, nstripes, &stripen);
+
+ stripe_offset = low + chunk_stripe_length * high;
+ csize = chunk_stripe_length - low;
+
  break;
}
  default:


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Btrfs Raid5 issue.

2017-08-22 Thread Qu Wenruo



On 2017年08月23日 00:37, Robert LeBlanc wrote:

Thanks for the explanations. Chris, I don't think 'degraded' did
anything to help the mounting, I just passed it in to see if it would
help (I'm not sure if btrfs is "smart" enough to ignore a drive if it
would increase the chance of mounting the volume even if it is
degraded, but one could hope). I believe the key was 'nologreplay'.
Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
 Total devices 3 FS bytes used 3.30TiB
 devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
 devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
 devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
 Device size:   8.18TiB
 Device allocated:      0.00B
 Device unallocated:    8.18TiB
 Device missing:        0.00B
 Used:                  0.00B
 Free (estimated):      0.00B  (min: 8.00EiB)
 Data ratio:   0.00
 Metadata ratio:   0.00
 Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
/dev/bcache0    2.08TiB
/dev/bcache16   2.08TiB
/dev/bcache32   2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
/dev/bcache0   11.00GiB
/dev/bcache16  11.00GiB
/dev/bcache32  11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
/dev/bcache0   32.00MiB
/dev/bcache16  32.00MiB
/dev/bcache32  32.00MiB

Unallocated:
/dev/bcache0  655.00GiB
/dev/bcache16 655.00GiB
/dev/bcache32 656.49GiB

So it looks like I set the metadata and system data to RAID5 and not
RAID1. I guess that it could have been affected by the write hole
causing the problem I was seeing.

Since I get the same space usage with RAID1 and RAID5,


Well, RAID1 has a larger space overhead than 3-disk RAID5.
Space efficiency will be 50% for RAID1 versus 66% for 3-disk RAID5.

So you may lose some available space.


I think I'm
just going to use RAID1. I don't need stripe performance or anything
like that.


And RAID5/6 won't always improve performance, especially when the IO 
blocksize is smaller than the full stripe size (in your case 128K).


When doing sequential IO with a blocksize smaller than 128K, there will be 
an obvious performance drop due to the RMW (read-modify-write) cycle.

This is not limited to Btrfs RAID56; it applies to all RAID56.


It would be nice if btrfs supported hotplug and re-plug a
little better so that it is more "production" quality, but I just have
to be patient. I'm familiar with Gluster and contributed code to Ceph,
so I'm familiar with those types of distributed systems. I really like
them, but the complexity is quite overkill for my needs at home.

As far as bcache performance:
I have two Crucial MX200 250GB drives that were md raid1 containing
/boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate
Barracuda Desktop drive all 3TB. With bcache in writeback, apt-get
would be painfully slow. Running iostat, the SSDs would be doing a few
hundred IOPs and the backing disks would be very busy and would be the
limiting factor overall. Even though apt-get just downloaded the file
(should be on the SSDs because of writeback), it still involved the
backend disks way too much. The amount of dirty data was always less
than 10% so there should have been plenty of space to free up cache
without having to flush. I experimented with changing the size of
contiguous IO to force more to cache, increasing the dirty ratio, etc.;
nothing seemed to provide the performance I was hoping for. To be fair,
having a pair of SSDs (md raid1) caching three spindles (btrfs raid5)
may not be an ideal configuration. If I had three SSDs, one for each
drive, then it may have performed better?? I have also ~980 snapshots
spread over a years time, so I don't know how much that impacts
things. I did use a btrfs utility to help find duplicate files/chunks
and dedupe them so that updated system binaries between upgraded LXC
containers would use the same space on disk and be more efficient in
bcache cache usage.


Well, RAID1 SSDs, offline dedupe, bcache, many snapshots: way more 
complex than I thought.

So I'm uncertain where the bottleneck is.



After restoring the root and LXC roots snapshots on the SSD (broke the
md raid1 so I could restore to one of them), I ran apt-get and got
upwards of 2,400 IOPS, sustained around 1,200 IOPS (btrfs
single on md raid1 degraded). I know that btrfs has some performance
challenges, but I don't think I was hitting those. It was most likely a
very unusual set-up of bcache and btrfs raid that caused the problem.
I have bcache on 10 

Re: Btrfs Raid5 issue.

2017-08-22 Thread Robert LeBlanc
Thanks for the explanations. Chris, I don't think 'degraded' did
anything to help the mounting, I just passed it in to see if it would
help (I'm not sure if btrfs is "smart" enough to ignore a drive if it
would increase the chance of mounting the volume even if it is
degraded, but one could hope). I believe the key was 'nologreplay'.
Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
Total devices 3 FS bytes used 3.30TiB
devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
Device size:   8.18TiB
Device allocated:      0.00B
Device unallocated:    8.18TiB
Device missing:        0.00B
Used:                  0.00B
Free (estimated):      0.00B  (min: 8.00EiB)
Data ratio:   0.00
Metadata ratio:   0.00
Global reserve:  512.00MiB  (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
   /dev/bcache0    2.08TiB
   /dev/bcache16   2.08TiB
   /dev/bcache32   2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
   /dev/bcache0   11.00GiB
   /dev/bcache16  11.00GiB
   /dev/bcache32  11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
   /dev/bcache0   32.00MiB
   /dev/bcache16  32.00MiB
   /dev/bcache32  32.00MiB

Unallocated:
   /dev/bcache0  655.00GiB
   /dev/bcache16 655.00GiB
   /dev/bcache32 656.49GiB

So it looks like I set the metadata and system data to RAID5 and not
RAID1. I guess that it could have been affected by the write hole
causing the problem I was seeing.

Since I get the same space usage with RAID1 and RAID5, I think I'm
just going to use RAID1. I don't need stripe performance or anything
like that. It would be nice if btrfs supported hotplug and re-plug a
little better so that it is more "production" quality, but I just have
to be patient. I'm familiar with Gluster and contributed code to Ceph,
so I'm familiar with those types of distributed systems. I really like
them, but the complexity is quite overkill for my needs at home.

As far as bcache performance:
I have two Crucial MX200 250GB drives that were md raid1 containing
/boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate
Barracuda Desktop drive all 3TB. With bcache in writeback, apt-get
would be painfully slow. Running iostat, the SSDs would be doing a few
hundred IOPs and the backing disks would be very busy and would be the
limiting factor overall. Even though apt-get just downloaded the file
(should be on the SSDs because of writeback), it still involved the
backend disks way too much. The amount of dirty data was always less
than 10% so there should have been plenty of space to free up cache
without having to flush. I experimented with changing the size of
contiguous IO to force more to cache, increasing the dirty ratio, etc.
(see the sysfs knobs listed after this paragraph); nothing seemed to
provide the performance I was hoping for. To be fair,
having a pair of SSDs (md raid1) caching three spindles (btrfs raid5)
may not be an ideal configuration. If I had three SSDs, one for each
drive, then it may have performed better?? I also have ~980 snapshots
spread over a year's time, so I don't know how much that impacts
things. I did use a btrfs utility to help find duplicate files/chunks
and dedupe them so that updated system binaries between upgraded LXC
containers would use the same space on disk and be more efficient in
bcache cache usage.
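
For reference, the knobs I was fiddling with are presumably these bcache 
sysfs tunables (the values below are just examples, not the exact ones I 
used):

# cache all IO regardless of request size (the default sequential cutoff
# is 4 MiB; 0 disables the sequential bypass)
echo 0 > /sys/block/bcache0/bcache/sequential_cutoff

# let more dirty data accumulate on the SSD before writeback kicks in
# (default is 10, i.e. 10% of the cache)
echo 40 > /sys/block/bcache0/bcache/writeback_percent

# how much dirty data is currently held on the cache device
cat /sys/block/bcache0/bcache/dirty_data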

After restoring the root and LXC roots snapshots on the SSD (I broke the
md raid1 so I could restore to one of them), I ran apt-get and got
upwards of 2,400 IOPS, sustained around 1,200 IOPS (btrfs
single on md raid1 degraded). I know that btrfs has some performance
challenges, but I don't think I was hitting those. It was most likely a
very unusual set-up of bcache and btrfs raid that caused the problem.
I have bcache on a 10-year-old desktop box with a single NVMe drive that
performs a little better, but it is hard to be certain because of its
age. It has bcache in write-around (since there is only a single NVMe)
and btrfs in raid1. I haven't watched that box as closely because it
is responsive enough. It also only has four GB of RAM so it constantly
has to swap (web pages are hogs these days), which is one of the reasons
I retrofitted that box with NVMe rather than an MX200.

If you have any other questions, feel free to ask.

Thanks


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Re: Btrfs Raid5 issue.

2017-08-22 Thread Qu Wenruo



On 2017年08月22日 13:19, Robert LeBlanc wrote:

Chris and Qu thanks for your help. I was able to restore the data off
the volume. I only could not read one file that I tried to rsync (a
MySQl bin log), but it wasn't critical as I had an off-site snapshot
from that morning and ownclould could resync the files that were
changed anyway. This turned out much better than the md RAID failure
that I had a year ago. Much faster recovery thanks to snapshots.

Is there anything you would like from this damaged filesystem to help
determine what went wrong and to help make btrfs better? If I don't
hear back from you in a day, I'll destroy it so that I can add the
disks into the new btrfs volumes to restore redundancy.

Feel free to destroy the old images.

If nologreplay works, that's good enough.
The problem seems to be in the extent tree, but it's too hard to locate 
the real cause.




Bcache wasn't providing the performance I was hoping for, so I'm
putting the root and roots for my LXC containers on the SSDs (btrfs
RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).


Well, I'm more interested in the bcache performance.

I was considering using my Intel 600P NVMe to cache one 2.5" HGST 1TB 
HDD (7200 rpm) in my btrfs KVM host (also my daily machine).


Would you please share more details about the performance problem?
(Maybe it's some btrfs performance problem, not bcache. Btrfs is 
not good at workloads like DBs or metadata-heavy operations.)



For some reason, it seemed that the btrfs RAID5 setup required one of
the drives, but I thought I had data with RAID5 and metadata with 2
copies. Was I missing something else that prevented mounting with that
specific drive? I don't want to get into a situation where one drive
dies and I can't get to any data.


The direct cause is that btrfs fails to replay its log, and it's the 
corrupted extent tree that makes the log replay fail.
Normally such a failure indicates a real problem, so btrfs just 
stops the mount procedure.


In your case, if "nologreplay" is specified, btrfs skips the problem, 
and since you must specify RO for nologrelay, btrfs has nothing to do 
with extent tree at all.

So btrfs can be mounted.
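
Something along these lines should work (the mount point is just an example):

# read-only mount that skips log-tree replay (nologreplay requires ro)
mount -o ro,nologreplay /dev/bcache0 /mnt

# if a device is missing or unusable, degraded can be combined with it
mount -o ro,nologreplay,degraded /dev/bcache0 /mnt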

Why the extent tree got corrupted is still unknown. If your metadata is 
also RAID5, then the write hole may be the cause.

If your metadata profile is RAID1, then I don't know why this could happen.

So from this point of view, even though we fixed the btrfs scrub/race 
problems, it's still not good enough to survive a disk removal in the 
real world.


With a RAID1 setup, at least we don't need to care about the write hole, 
and csums will help us determine which copy is correct, so I think it 
will be much better than RAID56.


If you have spare time, you could try hot-plugging RAID1 devices to 
verify how it works.
But please note that re-attaching an unplugged device may require 
unmounting the fs and re-scanning the btrfs devices.


And even if you're using 3 devices with RAID1, there are still only 2 
copies, so you can lose at most 1 device.

Thanks,
Qu



Thank you again.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1




Re: Btrfs Raid5 issue.

2017-08-21 Thread Chris Murphy
On Mon, Aug 21, 2017 at 11:19 PM, Robert LeBlanc <rob...@leblancnet.us> wrote:
> Chris and Qu thanks for your help. I was able to restore the data off
> the volume. I only could not read one file that I tried to rsync (a
> MySQl bin log), but it wasn't critical as I had an off-site snapshot
> from that morning and ownclould could resync the files that were
> changed anyway. This turned out much better than the md RAID failure
> that I had a year ago. Much faster recovery thanks to snapshots.
>
> Is there anything you would like from this damaged filesystem to help
> determine what went wrong and to help make btrfs better? If I don't
> hear back from you in a day, I'll destroy it so that I can add the
> disks into the new btrfs volumes to restore redundancy.
>
> Bcache wasn't providing the performance I was hoping for, so I'm
> putting the root and roots for my LXC containers on the SSDs (btrfs
> RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).
> For some reason, it seemed that the btrfs RAID5 setup required one of
> the drives, but I thought I had data with RAID5 and metadata with 2
> copies. Was I missing something else that prevented mounting with that
> specific drive? I don't want to get into a situation where one drive
> dies and I can't get to any data.

With all three connected, what do you get for 'btrfs fi show' ?

The first email says the supers on all three drives are OK, but it's
still confusing that degraded is working. It suggests it's not finding
something on one of the drives that it needs in order to mount - usually
that's the first superblock, or the system block group being partly
corrupt, or a read error or something; and mounting degraded makes it
possible to mount anyway.

Anyway at least all of the data is safe now. Pretty much all you can
do to guard against data loss is backups. Any degraded state is
precarious because it requires just one more thing to go wrong and
it's all bad news from there.

Gluster is pretty easy to set up; use either the gluster native mount
on linux or smb with everything else. Stick a big drive in a raspberry
pi (or two) and even though it's only fast ethernet (haha, now slow
100 Mbps ethernet) it will still replicate automatically as well as
fail over. Plus one of those could be XFS if you wanted to hedge your
bets. Or one of the less expensive Intel NUCs will also work if you
want to stick with x86.



-- 
Chris Murphy


Re: Btrfs Raid5 issue.

2017-08-21 Thread Robert LeBlanc
Chris and Qu thanks for your help. I was able to restore the data off
the volume. The only file I could not read was one that I tried to rsync (a
MySQL binlog), but it wasn't critical as I had an off-site snapshot
from that morning and ownCloud could resync the files that were
changed anyway. This turned out much better than the md RAID failure
that I had a year ago. Much faster recovery thanks to snapshots.

Is there anything you would like from this damaged filesystem to help
determine what went wrong and to help make btrfs better? If I don't
hear back from you in a day, I'll destroy it so that I can add the
disks into the new btrfs volumes to restore redundancy.

Bcache wasn't providing the performance I was hoping for, so I'm
putting the root and roots for my LXC containers on the SSDs (btrfs
RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).
For some reason, it seemed that the btrfs RAID5 setup required one of
the drives, but I thought I had data with RAID5 and metadata with 2
copies. Was I missing something else that prevented mounting with that
specific drive? I don't want to get into a situation where one drive
dies and I can't get to any data.

Thank you again.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: Btrfs Raid5 issue.

2017-08-21 Thread Chris Murphy
On Mon, Aug 21, 2017 at 10:31 AM, Robert LeBlanc  wrote:
> Qu,
>
> Sorry, I'm not on the list (I was for a few years about three years ago).
>
> I looked at the backup roots like you mentioned.
>
> # ./btrfs inspect dump-super -f /dev/bcache0
> superblock: bytenr=65536, device=/dev/bcache0
> -
> csum_type   0 (crc32c)
> csum_size   4
> csum0x45302c8f [match]
> bytenr  65536
> flags   0x1
> ( WRITTEN )
> magic   _BHRfS_M [match]
> fsid                fef29f0a-dc4c-4cc4-b524-914e6630803c
> label   kvm-btrfs
> generation  1620386
> root5310022877184
> sys_array_size  161
> chunk_root_generation   1620164
> root_level  1
> chunk_root  4725030256640
> chunk_root_level1
> log_root            2876047507456
> log_root_transid    0
> log_root_level      0
> total_bytes         8998588280832
> bytes_used          3625869234176
> sectorsize          4096
> nodesize            16384
> leafsize (deprecated)   16384
> stripesize  4096
> root_dir6
> num_devices 3
> compat_flags0x0
> compat_ro_flags 0x0
> incompat_flags  0x1e1
> ( MIXED_BACKREF |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   RAID56 |
>   SKINNY_METADATA )
> cache_generation        1620386
> uuid_tree_generation    42
> dev_item.uuid           cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
> dev_item.fsid           fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
> dev_item.type           0
> dev_item.total_bytes    2998998654976
> dev_item.bytes_used     2295693574144
> dev_item.io_align       4096
> dev_item.io_width       4096
> dev_item.sector_size    4096
> dev_item.devid  2
> dev_item.dev_group  0
> dev_item.seek_speed 0
> dev_item.bandwidth  0
> dev_item.generation 0
> sys_chunk_array[2048]:
> item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
> length 67108864 owner 2 stripe_len 65536 type
> SYSTEM|RAID5
> io_align 65536 io_width 65536 sector_size 4096
> num_stripes 3 sub_stripes 1
> stripe 0 devid 1 offset 2185232384
> dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
> stripe 1 devid 3 offset 1195075698688
> dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
> stripe 2 devid 2 offset 41340108800
> dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
> backup_roots[4]:
> backup 0:
> backup_tree_root:   5309879451648   gen: 1620384
> level: 1
> backup_chunk_root:  4725030256640   gen: 1620164
> level: 1
> backup_extent_root: 5309910958080   gen: 1620385
> level: 2
> backup_fs_root: 3658468147200   gen: 1618016
> level: 1
> backup_dev_root:    5309904224256   gen: 1620384
> level: 1
> backup_csum_root:   5309910532096   gen: 1620385
> level: 3
> backup_total_bytes: 8998588280832
> backup_bytes_used:  3625871646720
> backup_num_devices: 3
>
> backup 1:
> backup_tree_root:   5309780492288   gen: 1620385
> level: 1
> backup_chunk_root:  4725030256640   gen: 1620164
> level: 1
> backup_extent_root: 5309659037696   gen: 1620385
> level: 2
> backup_fs_root: 0   gen: 0  level: 0
> backup_dev_root:    5309872275456   gen: 1620385
> level: 1
> backup_csum_root:   5309674536960   gen: 1620385
> level: 3
> backup_total_bytes: 8998588280832
> backup_bytes_used:  3625869234176
> backup_num_devices: 3


Well that's strange. A backup entry with a null fs root.



> I noticed on that page that there is a 'nologreplay' mount option so I
> tried it with degraded and it requires ro, but the volume mounted and
> I can "see" things on the volume.

Degraded suggests it's not finding one of the three devices.


> So with this nologreplay option, if I do a btrfs send of the subvolume
> that I'm interested in (I don't think it was being written to at the
> time of failure), would it copy (send) over the corruption as well.

Anything that results in EIO will get included in the send, and by
default receive fails. You can use verbose messaging on the receive
side, and use the -E option to permit the errors. But file system
specific problems aren't going to 
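
For reference, the send/receive invocation being described might look like 
this (the snapshot and target paths are placeholders; -E is --max-errors, 
with 0 meaning don't stop at the first error):

# the source must be a read-only snapshot; receive verbosely and keep
# going past errors instead of aborting on the first one
btrfs send /tmp/root/snapshot | btrfs receive -v -E 0 /mnt/backup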

Re: Btrfs Raid5 issue.

2017-08-21 Thread Robert LeBlanc
Qu,

Sorry, I'm not on the list (I was for a few years about three years ago).

I looked at the backup roots like you mentioned.

# ./btrfs inspect dump-super -f /dev/bcache0
superblock: bytenr=65536, device=/dev/bcache0
-
csum_type   0 (crc32c)
csum_size   4
csum0x45302c8f [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid                fef29f0a-dc4c-4cc4-b524-914e6630803c
label   kvm-btrfs
generation  1620386
root5310022877184
sys_array_size  161
chunk_root_generation   1620164
root_level  1
chunk_root  4725030256640
chunk_root_level1
log_root            2876047507456
log_root_transid    0
log_root_level      0
total_bytes         8998588280832
bytes_used          3625869234176
sectorsize          4096
nodesize            16384
leafsize (deprecated)   16384
stripesize  4096
root_dir6
num_devices 3
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x1e1
( MIXED_BACKREF |
  BIG_METADATA |
  EXTENDED_IREF |
  RAID56 |
  SKINNY_METADATA )
cache_generation        1620386
uuid_tree_generation    42
dev_item.uuid           cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
dev_item.fsid           fef29f0a-dc4c-4cc4-b524-914e6630803c [match]
dev_item.type           0
dev_item.total_bytes    2998998654976
dev_item.bytes_used     2295693574144
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid  2
dev_item.dev_group  0
dev_item.seek_speed 0
dev_item.bandwidth  0
dev_item.generation 0
sys_chunk_array[2048]:
item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 4725030256640)
length 67108864 owner 2 stripe_len 65536 type
SYSTEM|RAID5
io_align 65536 io_width 65536 sector_size 4096
num_stripes 3 sub_stripes 1
stripe 0 devid 1 offset 2185232384
dev_uuid e273c794-b231-4d86-9a38-53a6d2fa8643
stripe 1 devid 3 offset 1195075698688
dev_uuid 120d6a05-b0bc-46c8-a87e-ca4fe5008d09
stripe 2 devid 2 offset 41340108800
dev_uuid cb56a9b7-8d67-4ae8-8cb0-076b0b93f9c4
backup_roots[4]:
backup 0:
backup_tree_root:   5309879451648   gen: 1620384   level: 1
backup_chunk_root:  4725030256640   gen: 1620164   level: 1
backup_extent_root: 5309910958080   gen: 1620385   level: 2
backup_fs_root:     3658468147200   gen: 1618016   level: 1
backup_dev_root:    5309904224256   gen: 1620384   level: 1
backup_csum_root:   5309910532096   gen: 1620385   level: 3
backup_total_bytes: 8998588280832
backup_bytes_used:  3625871646720
backup_num_devices: 3

backup 1:
backup_tree_root:   5309780492288   gen: 1620385   level: 1
backup_chunk_root:  4725030256640   gen: 1620164   level: 1
backup_extent_root: 5309659037696   gen: 1620385   level: 2
backup_fs_root:     0               gen: 0         level: 0
backup_dev_root:    5309872275456   gen: 1620385   level: 1
backup_csum_root:   5309674536960   gen: 1620385   level: 3
backup_total_bytes: 8998588280832
backup_bytes_used:  3625869234176
backup_num_devices: 3

backup 2:
backup_tree_root:   5310022877184   gen: 1620386   level: 1
backup_chunk_root:  4725030256640   gen: 1620164   level: 1
backup_extent_root: 2876048949248   gen: 1620387   level: 2
backup_fs_root:     3658468147200   gen: 1618016   level: 1
backup_dev_root:    5309872275456   gen: 1620385   level: 1
backup_csum_root:   5310042259456   gen: 1620386   level: 3
backup_total_bytes: 8998588280832
backup_bytes_used:  3625869250560
backup_num_devices: 3

backup 3:
backup_tree_root:   5309771448320   gen: 1620383   level: 1
backup_chunk_root:  4725030256640   gen: 1620164   level: 1
backup_extent_root: 5309779804160   gen: 1620384   level: 2
backup_fs_root:     3658468147200   gen: 1618016   level: 1
backup_dev_root:    5309848158208   

Re: Btrfs Raid5 issue.

2017-08-21 Thread Janos Toth F.
I lost enough Btrfs m=d=s=RAID5 filesystems in past experiments (I
didn't try using RAID5 for metadata and system chunks in the last few
years) to faulty SATA cables + hotplug-enabled SATA controllers (where
a disk could disappear and reappear "as the wind blew"). Since then, I
made a habit of always disabling hotplug for all SATA disks involved
with Btrfs, even those with m=d=s=single profile (and I never desired
to build multi-device filesystems from USB-attached disks anyway, but
this is a good reason for me to explicitly avoid that).

I am not sure if other RAID profiles are affected in a similar way or
it's just RAID56. (Well, I mean RAID0 is obviously toast and RAID1/10
will obviously get degraded but I am not sure if it's possible to
re-sync RAID1/10 with a simple balance [possibly even without
remounting and doing manual device delete/add?] or the filesystem has
to be recreated from scratch [like RAID5].)

I think this hotplug problem is an entirely different issue from the
RAID56-scrub race-conditions (which are now considered fixed in linux
4.12) and nobody is currently working on this (if it's RAID56-only
then I don't expect it anytime soon [think years]).


Re: Btrfs Raid5 issue.

2017-08-21 Thread Qu Wenruo



On 2017年08月21日 12:33, Robert LeBlanc wrote:

I've been running btrfs in a raid5 for about a year now with bcache in
front of it. Yesterday, one of my drives was acting really slow, so I
was going to move it to a different port. I guess I get too
comfortable hot plugging drives in at work and didn't think twice
about what could go wrong, hey I set it up in RAID5 so it will be
fine. Well, it wasn't...


Well, Btrfs RAID5 is not that safe.
I would recommend using RAID1 for metadata at least.
(And in your case your metadata is damaged, so I really recommend 
using a better profile for your metadata.)
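
On a filesystem that still mounts read-write, the conversion would be 
something like this (the mount point is just an example):

# convert only the metadata profile to RAID1, keeping data as RAID5
btrfs balance start -mconvert=raid1 /mnt

# system chunks can be converted too; -sconvert requires -f
btrfs balance start -mconvert=raid1 -sconvert=raid1 -f /mnt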




I was aware of the write hole issue, and thought it was committed to
the 4.12 branch, so I was running 4.12.5 at the time. I have two SSDs
that are in an md RAID1 that is the cache for the three backing
devices in bcache (bcache{0..2} or bcache{0,16,32} depending on the
kernel booted). I have all my critical data saved off on btrfs 
snapshots on a different host, but I don't transfer my MythTV subs
that often, so I'd like to try to recover some of that if possible.

What is really interesting is that I could not boot the first time
(root on the btrfs volume), but I rebooted again and the fs was in
read-only mode, but only one of the three disks was in read-only. I
tried to reboot again and it never mounted again after that. I see
some messages in dmesg like this:

[  151.201637] BTRFS info (device bcache0): disk space caching is enabled
[  151.201640] BTRFS info (device bcache0): has skinny extents
[  151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs:
wr 309, rd 319, flush 39, corrupt 0, gen 0
[  151.931764] BTRFS info (device bcache0): detected SSD devices,
enabling SSD mode
[  152.058915] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[  152.059944] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473


Normally a transid error indicates a bigger problem, and it is usually hard to trace.


[  152.060018] BTRFS: error (device bcache0) in
__btrfs_free_extent:6989: errno=-5 IO failure
[  152.060060] BTRFS: error (device bcache0) in
btrfs_run_delayed_refs:3009: errno=-5 IO failure
[  152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
[  152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475:
errno=-5 IO failure (Failed to recover log tree)
[  152.074244] BTRFS error (device bcache0): cleaner transaction
attach returned -30
[  152.148993] BTRFS error (device bcache0): open_ctree failed

So, I thought that the log was corrupted, I could live without the
last 30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0`
and I get a backtrace.


Yes, your idea about the log is correct: it's the log replay causing the 
problem. But the root cause seems to be a corrupted extent tree, which 
is not easy to fix.



I ran `btrfs rescue chunk-recover /dev/bcache0`
and it spent hours scanning the three disks and at the end tried to
fix the logs (or tree, I can't remember exactly) and then I got
another backtrace.

Today, I compiled 4.13-rc6 to see if some of the latest fixes would
help, no dice (the dmesg above is from 4.13-rc6). I compiled the
latest master of btrfs-progs, no progress.

Things I've tried:
mount
mount -o degraded
mount -o degraded,ro
mount -o degraded (with each drive disconnected in turn to see if it
would start without one of the drives)
btrfs rescue chunk-recover
btrfs rescue super-recover (all drives report the superblocks are fine)
btrfs rescue zero-log (always has a backtrace)


I think it's some other problem causing the backtrace, normally extent 
tree corruption or a transid error.


btrfs check

I know that bcache complicates things, but I'm hoping for two things.
1. Try to get what I can off the volume. 2. Provide some information
that can help make btrfs/bcache better for the future.

Here is what `btrfs rescue zero-log` outputs:

# ./btrfs rescue zero-log /dev/bcache0
Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896

Btrfs Raid5 issue.

2017-08-20 Thread Robert LeBlanc
I've been running btrfs in a raid5 for about a year now with bcache in
front of it. Yesterday, one of my drives was acting really slow, so I
was going to move it to a different port. I guess I get too
comfortable hot plugging drives in at work and didn't think twice
about what could go wrong, hey I set it up in RAID5 so it will be
fine. Well, it wasn't...

I was aware of the write hole issue, and thought it was committed to
the 4.12 branch, so I was running 4.12.5 at the time. I have two SSDs
that are in an md RAID1 that is the cache for the three backing
devices in bcache (bcache{0..2} or bcache{0,16,32} depending on the
kernel booted. I have all my critical data saved off on btrfs
snapshots on a different host, but I don't transfer my MythTV subs
that often, so I'd like to try to recover some of that if possible.

What is really interesting is that I could not boot the first time
(root on the btrfs volume), but I rebooted again and the fs was in
read-only mode, but only one of the three disks was in read-only. I
tried to reboot again and it never mounted again after that. I see
some messages in dmesg like this:

[  151.201637] BTRFS info (device bcache0): disk space caching is enabled
[  151.201640] BTRFS info (device bcache0): has skinny extents
[  151.215697] BTRFS info (device bcache0): bdev /dev/bcache16 errs:
wr 309, rd 319, flush 39, corrupt 0, gen 0
[  151.931764] BTRFS info (device bcache0): detected SSD devices,
enabling SSD mode
[  152.058915] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[  152.059944] BTRFS error (device bcache0): parent transid verify
failed on 5309837426688 wanted 1620383 found 1619473
[  152.060018] BTRFS: error (device bcache0) in
__btrfs_free_extent:6989: errno=-5 IO failure
[  152.060060] BTRFS: error (device bcache0) in
btrfs_run_delayed_refs:3009: errno=-5 IO failure
[  152.071613] BTRFS info (device bcache0): delayed_refs has NO entry
[  152.074126] BTRFS: error (device bcache0) in btrfs_replay_log:2475:
errno=-5 IO failure (Failed to recover log tree)
[  152.074244] BTRFS error (device bcache0): cleaner transaction
attach returned -30
[  152.148993] BTRFS error (device bcache0): open_ctree failed

So, I thought that the log was corrupted, I could live without the
last 30 seconds or so, I tried `btrfs rescue zero-log /dev/bcache0`
and I get a backtrace. I ran `btrfs rescue chunk-recover /dev/bcache0`
and it spent hours scanning the three disks and at the end tried to
fix the logs (or tree, I can't remember exactly) and then I got
another backtrace.

Today, I compiled 4.13-rc6 to see if some of the latest fixes would
help, no dice (the dmesg above is from 4.13-rc6). I compiled the
latest master of btrfs-progs, no progress.

Things I've tried:
mount
mount -o degraded
mount -o degraded,ro
mount -o degraded (with each drive disconnected in turn to see if it
would start without one of the drives)
btrfs rescue chunk-recover
btrfs rescue super-recover (all drives report the superblocks are fine)
btrfs rescue zero-log (always has a backtrace)
btrfs check

I know that bcache complicates things, but I'm hoping for two things.
1. Try to get what I can off the volume. 2. Provide some information
that can help make btrfs/bcache better for the future.

Here is what `btrfs rescue zero-log` outputs:

# ./btrfs rescue zero-log /dev/bcache0
Clearing log on /dev/bcache0, previous log_root 2876047507456, level 0
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
checksum verify failed on 5309233872896 found 6A103358 wanted 8EF38EEE
bytenr mismatch, want=5309233872896, have=65536
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
parent transid verify failed on 5309233872896 wanted 1620381 found 1619462
checksum verify failed on 

Re: Btrfs/RAID5 became unmountable after SATA cable fault

2016-07-23 Thread Janos Toth F.
It seems like I accidentally managed to break my Btrfs/RAID5
filesystem, yet again, in a similar fashion.
This time around, I ran into some random libata driver issue (?)
instead of a faulty hardware part, but the end result is quite similar.

I issued the command (replacing X with valid letters for every
hard drive in the system):
# echo 1 > /sys/block/sdX/device/queue_depth
and I ended up with read-only filesystems.
I checked dmesg and saw write errors on every disk (not just those in RAID-5).

I tried to reboot immediately, without success. My root filesystem with
a single-disk Btrfs (which is an SSD, so it has the "single" profile for
both data and metadata) was unmountable, thus the kernel was stuck in
a panic-reboot cycle.
I managed to fix this one by booting from a USB stick and trying
various recovery methods (like mounting it with "-o
clear_cache,nospace_cache,recovery" and running "btrfs rescue
chunk-recover") until everything seemed to be fine (it can now be
mounted read-write without error messages in the kernel log, can be
fully scrubbed without errors reported, it passes "btrfs check", and
files can actually be written and read, etc.).

Once my system was up and running (well, sort of), I realized my /data
is also unmountable. I tried the same recovery methods on this RAID-5
filesystem but nothing seemed to help (with one exception among the
recovery attempts: the system drive was a small and fast SSD, so
"chunk-recover" was a viable option to try, but this filesystem consists
of huge slow HDDs - so I tried to run it as a last resort overnight, and
found an unresponsive machine in the morning with the process stuck
relatively early on).

I can always mount it read-only and access files on it, seemingly
without errors (I compared some of the contents with backups and it
looks good), but as soon as I mount it read-write, all hell breaks
loose and it falls into read-only state in no time (with some files
seemingly disappearing from the filesystem) and the kernel log starts
getting spammed with various kinds of error messages (including
missing csums, etc.).


After mounting it like this:
# mount /dev/sdb /data -o rw,noatime,nospace_cache
and doing:
# btrfs scrub start /data
the result is:

scrub status for 7d4769d6-2473-4c94-b476-4facce24b425
scrub started at Sat Jul 23 13:50:55 2016 and was aborted after 00:05:30
total bytes scrubbed: 18.99GiB with 16 errors
error details: read=16
corrected errors: 0, uncorrectable errors: 16, unverified errors: 0

The relevant dmesg output is:

 [ 1047.709830] BTRFS info (device sdc): disabling disk space caching
[ 1047.709846] BTRFS: has skinny extents
[ 1047.895818] BTRFS info (device sdc): bdev /dev/sdc errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 1047.895835] BTRFS info (device sdc): bdev /dev/sdb errs: wr 4, rd
0, flush 0, corrupt 0, gen 0
[ 1065.764352] BTRFS: checking UUID tree
[ 1386.423973] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[ 1386.430922] BTRFS error (device sdc): parent transid verify failed
on 24431936729088 wanted 585936 found 586145
[ 1411.738955] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1411.948040] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.040964] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.040980] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.041134] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.042628] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1412.042748] BTRFS error (device sdc): parent transid verify failed
on 24432322764800 wanted 585779 found 586145
[ 1499.45] BTRFS error (device sdc): parent transid verify failed
on 24432312270848 wanted 585779 found 586143
[ 1499.230264] BTRFS error (device sdc): parent transid verify failed
on 24432312270848 wanted 585779 found 586143
[ 1525.865143] BTRFS error (device sdc): parent transid verify failed
on 24432367730688 wanted 585779 found 586144
[ 1525.880537] BTRFS error (device sdc): parent transid verify failed
on 24432367730688 wanted 585779 found 586144
[ 1552.434209] BTRFS error (device sdc): parent transid verify failed
on 24432415821824 wanted 585781 found 586144
[ 1552.437325] BTRFS error (device sdc): parent transid verify failed
on 24432415821824 wanted 585781 found 586144


btrfs check /dev/sdc results in:

Checking filesystem on /dev/sdc
UUID: 7d4769d6-2473-4c94-b476-4facce24b425
checking extents
parent transid verify failed on 24431859855360 wanted 585941 found 586144
parent transid verify failed on 24431859855360 wanted 585941 found 586144
checksum verify fa

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 14:45, Chris Murphy wrote:

>> I think it's statistically 0 people changing this from default. It's
>> people with drives that have no SCT ERC support, used in raid1+, who
>> happen to stumble upon this very obscure work around to avoid link
>> resets in the face of media defects. Rare.
>
> Not as much as you think, once someone has this issue, they usually put
> preventative measures in place on any system where it applies.  I'd be
> willing to bet that most sysadmins at big companies like RedHat or Oracle
> are setting this.

SCT ERC yes. Changing the kernel's command timer? I think almost zero.



>> Well they have link resets and their file system presumably face
>> plants as a result of a pile of commands in the queue returning as
>> unsuccessful. So they have premature death of their system, rather
>> than it getting sluggish. This is a long standing indicator on Windows
>> to just reinstall the OS and restore data from backups -> the user has
>> an opportunity to freshen up user data backup, and the reinstallation
>> and restore from backup results in freshly written sectors which is
>> how bad sectors get fixed. The marginally bad sectors get new writes
>> and now read fast (or fast enough), and the persistently bad sectors
>> result in the drive firmware remapping to reserve sectors.
>>
>> The main thing in my opinion is less extension of drive life, as it is
>> the user gets to use the system, albeit sluggish, to make a backup of
>> their data rather than possibly losing it.
>
> The extension of the drive's lifetime is a nice benefit, but not what my
> point was here.  For people in this particular case, it will almost
> certainly only make things better (although at first it may make performance
> worse).

I'm not sure why it makes performance worse. The options are, slower
reads vs a file system that almost certainly face plants upon a link
reset.




>> Basically it's:
>>
>> For SATA and USB drives:
>>
>> if data redundant, then enable short SCT ERC time if supported, if not
>> supported then extend SCSI command timer to 200;
>>
>> if data not redundant, then disable SCT ERC if supported, and extend
>> SCSI command timer to 200.
>>
>> For SCSI (SAS most likely these days), keep things the same as now.
>> But that's only because this is a rare enough configuration now I
>> don't know if we really know the problems there. It may be that their
>> error recovery in 7 seconds is massively better and more reliable than
>> consumer drives over 180 seconds.
>
> I don't see why you would think this is not common.

I was not clear. Single device SAS is probably not common. They're
typically being used in arrays where data is redundant. Using such a
drive with short error recovery as a single boot drive? Probably not
that common.



> Separately, USB gets _really_ complicated if you want to cover everything,
> USB drives may or may not present as non-rotational, may or may not show up
> as SATA or SCSI bridges (there are some of the more expensive flash drives
> that actually use SSD controllers plus USB-SAT chips internally), if they do
> show up as such, may or may not support the required commands (most don't,
> but it's seemingly hit or miss which do).

Yup. Well, do what we can instead of just ignoring the problem? They
can still be polled for features including SCT ERC and if it's not
supported or configurable then fallback to increasing the command
timer. I'm not sure what else can be done anyway.
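
A sketch of that poll-and-fallback idea for one device (the device name,
the 180-second value, and the output parsing are assumptions; the exact
smartctl wording differs between versions):

smartctl -l scterc /dev/sdX                   # query current SCT ERC setting
smartctl -l scterc,70,70 /dev/sdX             # try to set a 7.0 s read/write recovery limit
if smartctl -l scterc /dev/sdX | grep -qi 'not support'; then
    echo 180 > /sys/block/sdX/device/timeout  # fallback: raise the kernel command timer (seconds)
fi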

The main obstacle is squaring the device capability (low level) with
storage stack redundancy 0 or 1 (high level). Something has to be
aware of both to get all devices ideally configured.



>> Yep it's imperfect unless there's the proper cross communication
>> between layers. There are some such things like hardware raid geometry
>> that optionally poke through (when supported by hardware raid drivers)
>> so that things like mkfs.xfs can automatically provide the right sunit
>> swidth for optimized layout; which the device mapper already does
>> automatically. So it could be done it's just a matter of how big of a
>> problem is this to build it, vs just going with a new one size fits
>> all default command timer?
>
> The other problem though is that the existing things pass through
> _read-only_ data, while this requires writable data to be passed through,
> which leads to all kinds of complicated issues potentially.

I'm aware. There are also plenty of bugs even if write were to pass
through. I've encountered more drives than not which accept only one
SCT ERC change per poweron. A 2nd change causes the drive to offline
and vanish off the bus. So no doubt this whole area is fragile enough
not even the drive, controller, enclosure vendors are aware of where
all the bodies are buried.

What I think is fairly well established is that at least on Windows
their lower level stuff including kernel 

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 14:45, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
 wrote:

On 2016-07-06 12:43, Chris Murphy wrote:



So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.


Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.


I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure work around to avoid link
resets in the face of media defects. Rare.
Not as much as you think, once someone has this issue, they usually put 
preventative measures in place on any system where it applies.  I'd be 
willing to bet that most sysadmins at big companies like RedHat or 
Oracle are setting this.




2. People using single disks which have lots of errors may or may not see an
apparent degradation of performance, but will likely have the life
expectancy of their device extended.


Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less extension of drive life, as it is
the user gets to use the system, albeit sluggish, to make a backup of
their data rather than possibly losing it.
The extension of the drive's lifetime is a nice benefit, but not what my 
point was here.  For people in this particular case, it will almost 
certainly only make things better (although at first it may make 
performance worse).




3. Individuals who are not setting this but should be will on average be no
worse off than before other than seeing a bigger performance hit on a disk
error.
4. People with single disks which are new will see no functional change
until the disk has an error.


I follow.




In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give roughly
10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the
expected disk timeout to account for things like latency outside of the
disk).


Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enable SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.
I agree, but if it's only the kernel doing this, then we can't make 
judgements based on userspace usage.  Also, the first situation while 
not optimal is still better than what happens now, at least there you 
will get an I/O error in a reasonable amount of time (as opposed to 
after a really long time if ever).


Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.
I don't see why you would think this is not common.  If you count just 
by systems, then it's absolutely outnumbered at least 100 to 1 by 
regular ATA disks.  If you look at individual disks though, the reverse 
is true, because people who use SCSI drives tend to use _lots_ of disks 
(think big data centers, NAS and SAN systems and such).  OTOH, both are 
probably vastly outnumbered by stuff that doesn't use either standard 
for storage...


Separately, USB gets _really_ complicated if you want to cover 
everything, USB drives may or may not present as non-rotational, may or 
may not show 

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-06 12:43, Chris Murphy wrote:

>> So does it make sense to just set the default to 180? Or is there a
>> smarter way to do this? I don't know.
>
> Just thinking about this:
> 1. People who are setting this somewhere will be functionally unaffected.

I think it's statistically 0 people changing this from default. It's
people with drives that have no SCT ERC support, used in raid1+, who
happen to stumble upon this very obscure work around to avoid link
resets in the face of media defects. Rare.


> 2. People using single disks which have lots of errors may or may not see an
> apparent degradation of performance, but will likely have the life
> expectancy of their device extended.

Well they have link resets and their file system presumably face
plants as a result of a pile of commands in the queue returning as
unsuccessful. So they have premature death of their system, rather
than it getting sluggish. This is a long standing indicator on Windows
to just reinstall the OS and restore data from backups -> the user has
an opportunity to freshen up user data backup, and the reinstallation
and restore from backup results in freshly written sectors which is
how bad sectors get fixed. The marginally bad sectors get new writes
and now read fast (or fast enough), and the persistently bad sectors
result in the drive firmware remapping to reserve sectors.

The main thing in my opinion is less extension of drive life, as it is
the user gets to use the system, albeit sluggish, to make a backup of
their data rather than possibly losing it.


> 3. Individuals who are not setting this but should be will on average be no
> worse off than before other than seeing a bigger performance hit on a disk
> error.
> 4. People with single disks which are new will see no functional change
> until the disk has an error.

I follow.


>
> In an ideal situation, what I'd want to see is:
> 1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
> percentage over that (probably something like 25%, which would give roughly
> 10 seconds for the normal 7 second ERC timer).
> 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC,
> this is reasonable for SCSI disks).
> 3. Otherwise, set the timer to 200 (we need a slight buffer over the
> expected disk timeout to account for things like latency outside of the
> disk).

Well if it's a non-redundant configuration, you'd want those long
recoveries permitted, rather than enable SCT ERC. The drive has the
ability to relocate sector data on a marginal (slow) read that's still
successful. But clearly many manufacturers tolerate slow reads that
don't result in immediate reallocation or overwrite or we wouldn't be
in this situation in the first place. I think this auto reallocation
is thwarted by enabling SCT ERC. It just flat out gives up and reports
a read error. So it is still data loss in the non-redundant
configuration and thus not an improvement.

Basically it's:

For SATA and USB drives:

if data redundant, then enable short SCT ERC time if supported, if not
supported then extend SCSI command timer to 200;

if data not redundant, then disable SCT ERC if supported, and extend
SCSI command timer to 200.

For SCSI (SAS most likely these days), keep things the same as now.
But that's only because this is a rare enough configuration now I
don't know if we really know the problems there. It may be that their
error recovery in 7 seconds is massively better and more reliable than
consumer drives over 180 seconds.
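
A rough shell sketch of that decision tree for a single SATA/USB disk;
DEV, the REDUNDANT flag and the 200-second value are placeholders, and
the smartctl output parsing is approximate:

DEV=/dev/sdX
REDUNDANT=yes    # "no" for single-device or raid0/linear data

if ! smartctl -l scterc "$DEV" | grep -qi 'not support'; then
    if [ "$REDUNDANT" = yes ]; then
        smartctl -l scterc,70,70 "$DEV"   # short (7 s) recovery; let the upper layer repair
    else
        smartctl -l scterc,0,0 "$DEV"     # disable ERC; let the drive retry as long as it can
        echo 200 > /sys/block/${DEV#/dev/}/device/timeout
    fi
else
    echo 200 > /sys/block/${DEV#/dev/}/device/timeout   # no ERC support: extend the command timer
fi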




>
>>
>>
 I suspect, but haven't tested, that ZFS On Linux would be equally
 affected, unless they're completely reimplementing their own block
 layer (?) So there are quite a few parties now negatively impacted by
 the current default behavior.
>>>
>>>
>>> OTOH, I would not be surprised if the stance there is 'you get no support
>>> if
>>> you're not using enterprise drives', not because of the project itself, but
>>> because it's ZFS.  Part of their minimum recommended hardware
>>> requirements
>>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
>>> there too.
>>
>>
>> http://open-zfs.org/wiki/Hardware
>> "Consistent performance requires hard drives that support error
>> recovery control. "
>>
>> "Drives that lack such functionality can be expected to have
>> arbitrarily high limits. Several minutes is not impossible. Drives
>> with this functionality typically default to 7 seconds. ZFS does not
>> currently adjust this setting on drives. However, it is advisable to
>> write a script to set the error recovery time to a low value, such as
>> 0.1 seconds until ZFS is modified to control it. This must be done on
>> every boot. "
>>
>> They do not explicitly require enterprise drives, but they clearly
>> expect SCT ERC enabled to some sane value.
>>
>> At least for Btrfs and ZFS, the mkfs is in a position to know all
>> 

Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-06 12:43, Chris Murphy wrote:

On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
 wrote:

On 2016-07-05 19:05, Chris Murphy wrote:


Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.


The thing I don't get about this is that while the per-device settings on a
given system are policy, the default value is not, and should be expected to
work correctly (but not necessarily optimally) on as many systems as
possible, so any claim that this should be fixed in udev are bogus by the
regular kernel rules.


Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, therefore the top end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.

Just thinking about this:
1. People who are setting this somewhere will be functionally unaffected.
2. People using single disks which have lots of errors may or may not 
see an apparent degradation of performance, but will likely have the 
life expectancy of their device extended.
3. Individuals who are not setting this but should be will on average be 
no worse off than before other than seeing a bigger performance hit on a 
disk error.
4. People with single disks which are new will see no functional change 
until the disk has an error.


In an ideal situation, what I'd want to see is:
1. If the device supports SCT ERC, set scsi_command_timer to a reasonable
percentage over that (probably something like 25%, which would give
roughly 10 seconds for the normal 7 second ERC timer).
2. If the device is actually a SCSI device, keep the 30 second timer
(IIRC, this is reasonable for SCSI disks).
3. Otherwise, set the timer to 200 (we need a slight buffer over the 
expected disk timeout to account for things like latency outside of the 
disk).




I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.


OTOH, I would not be surprised if the stance there is 'you get no support if
you're not using enterprise drives', not because of the project itself, but
because it's ZFS.  Part of their minimum recommended hardware requirements
is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
there too.


http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries; where raid1, 5, 6 need to set
things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
I"m pretty certain they don't create rules, or even try to check the 
drive for SCT ERC support.  The problem with doing this is that you 
can't be certain that your underlying device is actually a physical 
storage device or not, and thus you have to check more than just the SCT 
ERC commands, and many people (myself included) don't like tools doing 
things that modify the persistent functioning of their system that the 
tool itself is not intended to do (and messing with block layer settings 
falls into that category for a mkfs tool).



Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Chris Murphy
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-07-05 19:05, Chris Murphy wrote:
>>
>> Related:
>> http://www.spinics.net/lists/raid/msg52880.html
>>
>> Looks like there is some traction to figuring out what to do about
>> this, whether it's a udev rule or something that happens in the kernel
>> itself. Pretty much the only hardware setup unaffected by this are
>> those with enterprise or NAS drives. Every configuration of a consumer
>> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
>> RAID Levels are adversely affected by this.
>
> The thing I don't get about this is that while the per-device settings on a
> given system are policy, the default value is not, and should be expected to
> work correctly (but not necessarily optimally) on as many systems as
> possible, so any claim that this should be fixed in udev are bogus by the
> regular kernel rules.

Sure. But changing it in the kernel leads to what other consequences?
It fixes the problem under discussion but what problem will it
introduce? I think it's valid to explore this, at the least so
affected parties can be informed.

Also, the problem isn't instigated by Linux, rather by drive
manufacturers introducing a whole new kind of error recovery, with an
order of magnitude longer recovery time. Now probably most hardware in
the field are such drives. Even SSDs like my Samsung 840 EVO that
support SCT ERC have it disabled, therefore the top end recovery time
is undiscoverable in the device itself. Maybe it's buried in a spec.

So does it make sense to just set the default to 180? Or is there a
smarter way to do this? I don't know.


>> I suspect, but haven't tested, that ZFS On Linux would be equally
>> affected, unless they're completely reimplementing their own block
>> layer (?) So there are quite a few parties now negatively impacted by
>> the current default behavior.
>
> OTOH, I would not be surprised if the stance there is 'you get no support if
> you're not using enterprise drives', not because of the project itself, but
> because it's ZFS.  Part of their minimum recommended hardware requirements
> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are
> there too.

http://open-zfs.org/wiki/Hardware
"Consistent performance requires hard drives that support error
recovery control. "

"Drives that lack such functionality can be expected to have
arbitrarily high limits. Several minutes is not impossible. Drives
with this functionality typically default to 7 seconds. ZFS does not
currently adjust this setting on drives. However, it is advisable to
write a script to set the error recovery time to a low value, such as
0.1 seconds until ZFS is modified to control it. This must be done on
every boot. "

They do not explicitly require enterprise drives, but they clearly
expect SCT ERC enabled to some sane value.

At least for Btrfs and ZFS, the mkfs is in a position to know all
parameters for properly setting SCT ERC and the SCSI command timer for
every device. Maybe it could create the udev rule? Single and raid0
profiles need to permit long recoveries; where raid1, 5, 6 need to set
things for very short recoveries.

Possibly mdadm and lvm tools do the same thing.
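
For what such a generated rule might look like, a hand-written sketch
that just raises the command timer for every SATA disk (the file name,
the 180-second value and the blanket sd[a-z] match are all assumptions,
not something any current tool emits):

# /etc/udev/rules.d/60-disk-timeout.rules
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"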


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-07-06 Thread Austin S. Hemmelgarn

On 2016-07-05 19:05, Chris Murphy wrote:

Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.
The thing I don't get about this is that while the per-device settings 
on a given system are policy, the default value is not, and should be 
expected to work correctly (but not necessarily optimally) on as many 
systems as possible, so any claim that this should be fixed in udev are 
bogus by the regular kernel rules.


I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.
OTOH, I would not be surprised if the stance there is 'you get no 
support if you're not using enterprise drives', not because of the project
itself, but because it's ZFS.  Part of their minimum recommended 
hardware requirements is ECC RAM, so it wouldn't surprise me if 
enterprise storage devices are there too.




Re: Adventures in btrfs raid5 disk recovery

2016-07-05 Thread Chris Murphy
Related:
http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about
this, whether it's a udev rule or something that happens in the kernel
itself. Pretty much the only hardware setup unaffected by this are
those with enterprise or NAS drives. Every configuration of a consumer
drive, single, linear/concat, and all software (mdadm, lvm, Btrfs)
RAID Levels are adversely affected by this.

I suspect, but haven't tested, that ZFS On Linux would be equally
affected, unless they're completely reimplementing their own block
layer (?) So there are quite a few parties now negatively impacted by
the current default behavior.


Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Steven Haigh
On 29/06/16 04:01, Chris Murphy wrote:
> Just wiping the slate clean to summarize:
> 
> 
> 1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption
> of *data extent* parity during a scrub with raid5. Goffredo and I have
> both reproduced it. It's a big bug. It might still be useful if
> someone else can reproduce it too.
> 
> Goffredo, can you file a bug at bugzilla.kernel.org and reference your
> bug thread?  I don't know if the key developers know about this, it
> might be worth pinging them on IRC once the bug is filed.
> 
> Unknown if it affects balance, or raid 6. And if it affects raid 6, is
> p or q corrupted, or both? Unknown how this manifests on metadata
> raid5 profile (only tested was data raid5). Presumably if there is
> metadata corruption that's fixed during a scrub, and its parity is
> overwritten with corrupt parity, the next time there's a degraded
> state, the file system would face plant somehow. And we've seen quite
> a few degraded raid5's (and even 6's) face plant in inexplicable ways
> and we just kinda go, shit. Which is what the fs is doing when it
> encounters a pile of csum errors. It treats the csum errors as a
> signal to disregard the fs rather than maybe only being suspicious of
> the fs. Could it turn out that these file systems were recoverable,
> just that Btrfs wasn't tolerating any csum error and wouldn't proceed
> further?

I believe this is the same case for RAID6 based on my experiences. I
actually wondered if the system halts were the result of a TON of csum
errors - not the actual result of those errors. Just about every system
hang went to 100% CPU usage on all cores and the system just stopped
after a flood of csum errors. If it was only one or two (or I copied
data off via a network connection where the read rate was slower), I
found I had a MUCH lower chance of the system locking up.

In fact, now that I think about it, when I was copying data to an
external USB drive (maxed out at ~30MB/sec), I still got csum errors -
but the system never hung.

Every crash ended with the last line along the lines of "Stopped
recurring error. Your system needs rebooting". I wonder whether, if this
error reporting were altered, the system wouldn't go down.

Of course I have no way of testing this.


-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Chris Murphy
Just wiping the slate clean to summarize:


1. We have a consistent ~1 in 3 maybe 1 in 2, reproducible corruption
of *data extent* parity during a scrub with raid5. Goffredo and I have
both reproduced it. It's a big bug. It might still be useful if
someone else can reproduce it too.

Goffredo, can you file a bug at bugzilla.kernel.org and reference your
bug thread?  I don't know if the key developers know about this, it
might be worth pinging them on IRC once the bug is filed.

Unknown if it affects balance, or raid 6. And if it affects raid 6, is
p or q corrupted, or both? Unknown how this manifests on metadata
raid5 profile (only tested was data raid5). Presumably if there is
metadata corruption that's fixed during a scrub, and its parity is
overwritten with corrupt parity, the next time there's a degraded
state, the file system would face plant somehow. And we've seen quite
a few degraded raid5's (and even 6's) face plant in inexplicable ways
and we just kinda go, shit. Which is what the fs is doing when it
encounters a pile of csum errors. It treats the csum errors as a
signal to disregard the fs rather than maybe only being suspicious of
the fs. Could it turn out that these file systems were recoverable,
just that Btrfs wasn't tolerating any csum error and wouldn't proceed
further?

2. The existing scrub code computes parity on-the-fly, compares it
with what's on-disk, and overwrites if there's a mismatch. If there's
a mismatch, there's no message anywhere. It's a feature request to get
a message on parity mismatches. An additional feature request would be
to get a parity_error counter along the lines of the other error
counters we have for scrub stats and dev stats.

3. I think it's a more significant change to get parity checksums
stored some where. Right now the csum tree holds item type EXTENT_CSUM
but parity is not an extent, it's also not data, it's a variant of
data. So it seems to me we'd need a new item type PARITY_CSUM to get
it into the existing csum tree. And I'm not sure what incompatibility
that brings; presumably older kernels could mount such a volume ro
safely, but shouldn't write to it, including btrfs check --repair
should probably fail.


Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Steven Haigh
On 28/06/16 22:25, Austin S. Hemmelgarn wrote:
> On 2016-06-28 08:14, Steven Haigh wrote:
>> On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
>>> On 2016-06-27 17:57, Zygo Blaxell wrote:
 On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>>  wrote:
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions
>> involving
>> highly situational co-processors, computing a checksum of a parity
>> block is
>> always going to be faster than computing parity for the stripe.  By
>> using
>> that to check parity, we can safely speed up the common case of near
>> zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

 A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
 one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
 array of SSDs vs. a slow CPU.
>>> OK, great for people who are using modern desktop or server CPU's.  Not
>>> everyone has that luxury, and even on many such CPU's, it's _still_
>>> faster to compute CRC32c checksums.  On top of that, we don't appear to
>>> be using the in-kernel parity-raid libraries (or if we are, I haven't
>>> been able to find where we are calling the functions for it), so we
>>> don't necessarily get assembly optimized or co-processor accelerated
>>> computation of the parity itself.  The other thing that I didn't mention
>>> above though, is that computing parity checksums will always take less
>>> time than computing parity, because you have to process significantly
>>> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
>>> much data to do the parity checksums instead of parity itself, which
>>> means that the parity computation would need to be 200% faster than the
>>> CRC32c computation to break even, and this margin gets bigger and bigger
>>> as you add more disks.
>>>
>>> On small arrays, this obviously won't have much impact.  Once you start
>>> to scale past a few TB though, even a few hundred MB/s faster processing
>>> means a significant decrease in processing time.  Say you have a CPU
>>> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
>>> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
>>> instruction and assembly optimized RAID5 parity computations on a modern
>>> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
>>> the Xeon E3 in my home server is closer to 5%)).  Assuming those
>>> numbers, and that we're already checking checksums on non-parity blocks,
>>> processing 120TB of data in a 4 disk array (which gives 40TB of parity
>>> data, so 160TB total) gives:
>>> For computing the parity to scrub:
>>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>>> regular data
>>> 120TB / 12GB    = 10000 seconds for processing parity of all stripes
>>> = 19795.9 seconds total
>>> ~ 5.4 hours total
>>>
>>> For computing csums of the parity:
>>> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
>>> regular data
>>> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
>>> parity data
>>> = 13061.2 seconds total
>>> ~ 3.6 hours total
>>>
>>> The checksum based computation is approximately 34% faster than the
>>> parity computation.  Much of this of course is that you have to process
>>> the regular data twice for the parity computation method (once for
>>> csums, once for parity).  You could probably do one pass computing both
>>> values, but that would need to be done carefully; and, without
>>> significant optimization, would likely not get you much benefit other
>>> than cutting the number of loads in half.
>>
>> And it all means jack shit because you don't get the data to disk that
>> quick. Who cares if its 500% faster - if it still saturates the
>> throughput of the actual drives, what difference does it make?
> It has less impact on everything else running on the system at the time
> because it uses less CPU time and potentially less memory.  This is the
> exact same reason that you want your RAID parity computation performance
> as good as possible, the less time the CPU spends 

Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Austin S. Hemmelgarn

On 2016-06-28 08:14, Steven Haigh wrote:

On 28/06/16 22:05, Austin S. Hemmelgarn wrote:

On 2016-06-27 17:57, Zygo Blaxell wrote:

On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.


Except that's horribly inefficient.  With limited exceptions involving
highly situational co-processors, computing a checksum of a parity
block is
always going to be faster than computing parity for the stripe.  By
using
that to check parity, we can safely speed up the common case of near
zero
errors during a scrub by a pretty significant factor.


OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.


A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.

OK, great for people who are using modern desktop or server CPU's.  Not
everyone has that luxury, and even on many such CPU's, it's _still_
faster to compute CRC32c checksums.  On top of that, we don't appear to
be using the in-kernel parity-raid libraries (or if we are, I haven't
been able to find where we are calling the functions for it), so we
don't necessarily get assembly optimized or co-processor accelerated
computation of the parity itself.  The other thing that I didn't mention
above though, is that computing parity checksums will always take less
time than computing parity, because you have to process significantly
less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
much data to do the parity checksums instead of parity itself, which
means that the parity computation would need to be 200% faster than the
CRC32c computation to break even, and this margin gets bigger and bigger
as you add more disks.

On small arrays, this obviously won't have much impact.  Once you start
to scale past a few TB though, even a few hundred MB/s faster processing
means a significant decrease in processing time.  Say you have a CPU
which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
CRC32c (~2% is a conservative ratio assuming you use the CRC32c
instruction and assembly optimized RAID5 parity computations on a modern
x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
the Xeon E3 in my home server is closer to 5%)).  Assuming those
numbers, and that we're already checking checksums on non-parity blocks,
processing 120TB of data in a 4 disk array (which gives 40TB of parity
data, so 160TB total) gives:
For computing the parity to scrub:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
regular data
120TB / 12GB    = 10000 seconds for processing parity of all stripes
= 19795.9 seconds total
~ 5.4 hours total

For computing csums of the parity:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
regular data
40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
parity data
= 13061.2 seconds total
~ 3.6 hours total

The checksum based computation is approximately 34% faster than the
parity computation.  Much of this of course is that you have to process
the regular data twice for the parity computation method (once for
csums, once for parity).  You could probably do one pass computing both
values, but that would need to be done carefully; and, without
significant optimization, would likely not get you much benefit other
than cutting the number of loads in half.


And it all means jack shit because you don't get the data to disk that
quick. Who cares if its 500% faster - if it still saturates the
throughput of the actual drives, what difference does it make?
It has less impact on everything else running on the system at the time 
because it uses less CPU time and potentially less memory.  This is the 
exact same reason that you want your RAID parity computation performance 
as good as possible, the less time the CPU spends on that, the more it 
can spend on other things.  On top of that, there are high-end systems 
that do have SSD's that can get multiple GB/s of data transfer per 
second, and NVDIMM's are starting to become popular in the server 
market, and those give you data transfer speeds equivalent to regular 
memory bandwidth (which can be well over 20GB/s on decent hardware (I've 
got a relatively inexpensive system using DDR3-1866 RAM that has 

Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Steven Haigh
On 28/06/16 22:05, Austin S. Hemmelgarn wrote:
> On 2016-06-27 17:57, Zygo Blaxell wrote:
>> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>>> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>>>  wrote:
 On 2016-06-25 12:44, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>  wrote:
>
> OK but hold on. During scrub, it should read data, compute checksums
> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
> the checksum tree, and the parity strip in the chunk tree. And if
> parity is wrong, then it should be replaced.

 Except that's horribly inefficient.  With limited exceptions involving
 highly situational co-processors, computing a checksum of a parity
 block is
 always going to be faster than computing parity for the stripe.  By
 using
 that to check parity, we can safely speed up the common case of near
 zero
 errors during a scrub by a pretty significant factor.
>>>
>>> OK I'm in favor of that. Although somehow md gets away with this by
>>> computing and checking parity for its scrubs, and still manages to
>>> keep drives saturated in the process - at least HDDs, I'm not sure how
>>> it fares on SSDs.
>>
>> A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
>> one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
>> array of SSDs vs. a slow CPU.
> OK, great for people who are using modern desktop or server CPU's.  Not
> everyone has that luxury, and even on many such CPU's, it's _still_
> faster to compute CRC32c checksums.  On top of that, we don't appear to
> be using the in-kernel parity-raid libraries (or if we are, I haven't
> been able to find where we are calling the functions for it), so we
> don't necessarily get assembly optimized or co-processor accelerated
> computation of the parity itself.  The other thing that I didn't mention
> above though, is that computing parity checksums will always take less
> time than computing parity, because you have to process significantly
> less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as
> much data to do the parity checksums instead of parity itself, which
> means that the parity computation would need to be 200% faster than the
> CRC32c computation to break even, and this margin gets bigger and bigger
> as you add more disks.
> 
> On small arrays, this obviously won't have much impact.  Once you start
> to scale past a few TB though, even a few hundred MB/s faster processing
> means a significant decrease in processing time.  Say you have a CPU
> which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
> CRC32c (~2% is a conservative ratio assuming you use the CRC32c
> instruction and assembly optimized RAID5 parity computations on a modern
> x86_64 processor (the ratio on both the mobile Core i5 in my laptop and
> the Xeon E3 in my home server is closer to 5%)).  Assuming those
> numbers, and that we're already checking checksums on non-parity blocks,
> processing 120TB of data in a 4 disk array (which gives 40TB of parity
> data, so 160TB total) gives:
> For computing the parity to scrub:
> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
> regular data
> 120TB / 12GB    = 10000 seconds for processing parity of all stripes
> = 19795.9 seconds total
> ~ 5.4 hours total
> 
> For computing csums of the parity:
> 120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the
> regular data
> 40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the
> parity data
> = 13061.2 seconds total
> ~ 3.6 hours total
> 
> The checksum based computation is approximately 34% faster than the
> parity computation.  Much of this of course is that you have to process
> the regular data twice for the parity computation method (once for
> csums, once for parity).  You could probably do one pass computing both
> values, but that would need to be done carefully; and, without
> significant optimization, would likely not get you much benefit other
> than cutting the number of loads in half.

And it all means jack shit because you don't get the data to disk that
quick. Who cares if its 500% faster - if it still saturates the
throughput of the actual drives, what difference does it make?

I'm all for actual solutions, but the nirvana fallacy seems to apply here...

-- 
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897





Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Austin S. Hemmelgarn

On 2016-06-27 17:57, Zygo Blaxell wrote:

On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.


Except that's horribly inefficient.  With limited exceptions involving
highly situational co-processors, computing a checksum of a parity block is
always going to be faster than computing parity for the stripe.  By using
that to check parity, we can safely speed up the common case of near zero
errors during a scrub by a pretty significant factor.


OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.


A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.
OK, great for people who are using modern desktop or server CPU's.  Not 
everyone has that luxury, and even on many such CPU's, it's _still_ 
faster to compute CRC32c checksums.  On top of that, we don't appear to
be using the in-kernel parity-raid libraries (or if we are, I haven't 
been able to find where we are calling the functions for it), so we 
don't necessarily get assembly optimized or co-processor accelerated 
computation of the parity itself.  The other thing that I didn't mention 
above though, is that computing parity checksums will always take less 
time than computing parity, because you have to process significantly 
less data.  On a 4 disk RAID5 array, you're processing roughly 2/3 as 
much data to do the parity checksums instead of parity itself, which 
means that the parity computation would need to be 200% faster than the 
CRC32c computation to break even, and this margin gets bigger and bigger 
as you add more disks.


On small arrays, this obviously won't have much impact.  Once you start 
to scale past a few TB though, even a few hundred MB/s faster processing 
means a significant decrease in processing time.  Say you have a CPU 
which gets about 12.0GB/s for RAID5 parity, and about 12.25GB/s for
CRC32c (~2% is a conservative ratio assuming you use the CRC32c 
instruction and assembly optimized RAID5 parity computations on a modern 
x86_64 processor (the ratio on both the mobile Core i5 in my laptop and 
the Xeon E3 in my home server is closer to 5%)).  Assuming those 
numbers, and that we're already checking checksums on non-parity blocks, 
processing 120TB of data in a 4 disk array (which gives 40TB of parity 
data, so 160TB total) gives:

For computing the parity to scrub:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the 
regular data

120TB / 12GB    = 10000 seconds for processing parity of all stripes
= 19795.9 seconds total
~ 5.4 hours total

For computing csums of the parity:
120TB / 12.25GB =  9795.9 seconds for processing CRC32c csums of all the 
regular data
40TB / 12.25GB  =  3265.3 seconds for processing CRC32c csums of all the 
parity data

= 13061.2 seconds total
~ 3.6 hours total

The checksum based computation is approximately 34% faster than the 
parity computation.  Much of this of course is that you have to process 
the regular data twice for the parity computation method (once for 
csums, once for parity).  You could probably do one pass computing both 
values, but that would need to be done carefully; and, without 
significant optimization, would likely not get you much benefit other 
than cutting the number of loads in half.
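
A quick sanity check of the arithmetic above (decimal TB and GB assumed;
the throughput figures are the ones quoted, not measurements):

awk 'BEGIN {
    data = 120e12; parity = 40e12;   # bytes of regular data and of parity
    crc = 12.25e9; par = 12e9;       # assumed CRC32c and parity throughput in bytes/s
    printf "scrub by recomputing parity:   %.1f s\n", data/crc + data/par;
    printf "scrub by checksumming parity:  %.1f s\n", data/crc + parity/crc;
}'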



Re: Adventures in btrfs raid5 disk recovery

2016-06-28 Thread Austin S. Hemmelgarn

On 2016-06-27 23:17, Zygo Blaxell wrote:

On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:

On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
 wrote:

On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:

Btrfs does have something of a work around for when things get slow,
and that's balance, read and rewrite everything. The write forces
sector remapping by the drive firmware for bad sectors.


It's a crude form of "resilvering" as ZFS calls it.


In what manner is it crude?


Balance relocates extents, looks up backrefs, and rewrites metadata, all
of which are extra work above what is required by resilvering (and extra
work that is proportional to the number of backrefs and the (currently
extremely poor) performance of the backref walking code, so snapshots
and large files multiply the workload).

Resilvering should just read data, reconstruct it from a mirror if
necessary, and write it back to the original location (or read one
mirror and rewrite the other).  That's more like what scrub does, except
scrub rewrites only the blocks it couldn't read (or that failed csum).
It's worth pointing out that balance was not designed for resilvering, 
it was designed for reshaping arrays, converting replication profiles, 
and compaction at the chunk level.  Balance is not a resilvering tool, 
that just happens to be a useful side effect of running a balance 
(actually, so is the chunk level compaction).





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 08:39:21PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> >> Btrfs does have something of a work around for when things get slow,
> >> and that's balance, read and rewrite everything. The write forces
> >> sector remapping by the drive firmware for bad sectors.
> >
> > It's a crude form of "resilvering" as ZFS calls it.
> 
> In what manner is it crude?

Balance relocates extents, looks up backrefs, and rewrites metadata, all
of which are extra work above what is required by resilvering (and extra
work that is proportional to the number of backrefs and the (currently
extremely poor) performance of the backref walking code, so snapshots
and large files multiply the workload).

Resilvering should just read data, reconstruct it from a mirror if
necessary, and write it back to the original location (or read one
mirror and rewrite the other).  That's more like what scrub does, except
scrub rewrites only the blocks it couldn't read (or that failed csum).

> > Last time I checked all the RAID implementations on Linux (ok, so that's
> > pretty much just md-raid) had some sort of repair capability.
> 
> You can read man 4 md, and you can also look on linux-raid@, it's very
> clearly necessary for the drive to report a read or write error
> explicitly with LBA for md to do repairs. If there are link resets,
> bad sectors accumulate and the obvious inevitably happens.

I am looking at the md code.  It looks at ->bi_error, and nothing else as
far as I can tell.  It doesn't even care if the error is EIO--any non-zero
return value from the lower bio layer seems to trigger automatic recovery.





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 7:52 PM, Zygo Blaxell
<ce3g8...@umail.furryterror.org> wrote:
> On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
>> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
>> <ce3g8...@umail.furryterror.org> wrote:
>> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
>> > If anything, I want the timeout to be shorter so that upper layers with
>> > redundancy can get an EIO and initiate repair promptly, and admins can
>> > get notified to evict chronic offenders from their drive slots, without
>> > having to pay extra for hard disk firmware with that feature.
>>
>> The drive totally thwarts this. It doesn't report back to the kernel
>> what command is hung, as far as I'm aware. It just hangs and goes into
>> a so called "deep recovery" there is no way to know what sector is
>> causing the problem
>
> I'm proposing just treat the link reset _as_ an EIO, unless transparent
> link resets are required for link speed negotiation or something.

That's not one EIO, that's possibly 31 items in the command queue that
get knocked over when the link is reset. I don't have the expertise to
know whether it's sane to interpret many EIO all at once as an
implicit indication of bad sectors. Off hand I think that's probably
specious.

> The drive wouldn't be thwarting anything, the host would just ignore it
> (unless the drive doesn't respond to a link reset until after its internal
> timeout, in which case nothing is saved by shortening the timeout).
>
>> until the drive reports a read error, which will
>> include the affected sector LBA.
>
> It doesn't matter which sector.  Chances are good that it was more than
> one of the outstanding requested sectors anyway.  Rewrite them all.

*shrug* even if valid, it only helps the raid 1+ cases. It does
nothing to help raid0, linear/concat, or single device deployments.
Those users also deserve to have access to their data, if the drive
can recover it by giving it enough time to do so.


> We know which sectors they are because somebody has an IO operation
> waiting for a status on each of them (unless they're using AIO or some
> other API where a request can be fired at a hard drive and the reply
> discarded).  Notify all of them that their IO failed and move on.

Dunno, maybe.


>
>> Btrfs does have something of a work around for when things get slow,
>> and that's balance, read and rewrite everything. The write forces
>> sector remapping by the drive firmware for bad sectors.
>
> It's a crude form of "resilvering" as ZFS calls it.

In what manner is it crude?




> If btrfs sees EIO from a lower block layer it will try to reconstruct the
> missing data (but not repair it).  If that happens during a scrub,
> it will also attempt to rewrite the missing data over the original
> offending sectors.  This happens every few months in my server pool,
> and seems to be working even on btrfs raid5.
>
> Last time I checked all the RAID implementations on Linux (ok, so that's
> pretty much just md-raid) had some sort of repair capability.

You can read man 4 md, and you can also look on linux-raid@, it's very
clearly necessary for the drive to report a read or write error
explicitly with LBA for md to do repairs. If there are link resets,
bad sectors accumulate and the obvious inevitably happens.



>
>> For single drives and RAID 0, the only possible solution is to not do
>> link resets for up to 3 minutes and hope the drive returns the single
>> copy of data.
>
> So perhaps the timeout should be influenced by higher layers, e.g. if a
> disk becomes part of a raid1, its timeout should be shortened by default,
> while a timeout for a disk that is not used by a redundant layer should
> be longer.

And there are a pile of reasons why link resets are necessary that
have nothing to do with bad sectors. So if you end up with a drive or
controller misbehaving, and the new behavior is to force a bunch of new
(corrective) writes to the drive right after a reset, it could actually
make its problems worse for all we know.

I think it's highly speculative to assume a hung block device means a bad
sector and should be treated as a bad sector, and that doing so will
cause no other side effects. Whether this is at all sane to do is a
question for block device/SCSI experts to opine on. I'm sure
they're reasonably aware of this problem, and if it were that simple
they'd have done it already, but conversely 5 years of telling users
to change the command timer or stop using the wrong kind of drives for
RAID really isn't sufficiently good advice either.

The reality is that drive manufacturers have handed us drives that,
far and wide, either don't support SCT ERC or have it disabled by default, so
yeah maybe 

Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
> <ce3g8...@umail.furryterror.org> wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
> 
> The drive totally thwarts this. It doesn't report back to the kernel
> what command is hung, as far as I'm aware. It just hangs and goes into
> a so-called "deep recovery"; there is no way to know what sector is
> causing the problem

I'm proposing just treat the link reset _as_ an EIO, unless transparent
link resets are required for link speed negotiation or something.
The drive wouldn't be thwarting anything, the host would just ignore it
(unless the drive doesn't respond to a link reset until after its internal
timeout, in which case nothing is saved by shortening the timeout).

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector.  Chances are good that it was more than
one of the outstanding requested sectors anyway.  Rewrite them all.

We know which sectors they are because somebody has an IO operation
waiting for a status on each of them (unless they're using AIO or some
other API where a request can be fired at a hard drive and the reply
discarded).  Notify all of them that their IO failed and move on.

> Btrfs does have something of a workaround for when things get slow,
> and that's balance: read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read).  That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
> 
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees EIO from a lower block layer it will try to reconstruct the
missing data (but not repair it).  If that happens during a scrub,
it will also attempt to rewrite the missing data over the original
offending sectors.  This happens every few months in my server pool,
and seems to be working even on btrfs raid5.

Last time I checked all the RAID implementations on Linux (ok, so that's
pretty much just md-raid) had some sort of repair capability.  lvm uses
(or can use) the md-raid implementation.  ext4 and xfs on naked disk
partitions will have problems, but that's because they were designed in
the 1990's when we were young and naive and still believed hard disks
would one day become reliable devices without buggy firmware.

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a
disk becomes part of a raid1, its timeout should be shortened by default,
while a timeout for a disk that is not used by a redundant layer should
be longer.

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk.
Probably not practical.





Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:

>
>> It just came up again in a thread over the weekend on linux-raid@. I'm
>> going to ask while people are paying attention if a patch to change
>> the 30 second time out to something a lot higher has ever been
>> floated, what the negatives might be, and where to get this fixed if
>> it wouldn't be accepted in the kernel code directly.
>
> Defaults are defaults, they're not for everyone.  30 seconds is about
> two minutes too short for an SMR drive's worst-case write latency, or
> 28 seconds too long for an OLTP system, or just right for an end-user's
> personal machine with a low-energy desktop drive and a long spin-up time.

The question is where the correct place is to change the default so it
broadly captures most use cases, because the current default is definitely
incompatible with consumer SATA drives, whether in an enclosure or not.

Maybe it's with the kernel teams at each distribution? Or maybe an
upstream udev rule?

In any case something needs to give here because it's been years of
bugging users about this misconfiguration and people constantly run
into it, which means user education is not working.


>
> Once a drive starts taking 30+ seconds to do I/O, I consider the drive
> failed in the sense that it's too slow to meet latency requirements.

Well that is then a mismatch between use case and the drive purchasing
decision. Consumer drives do this. It's how they're designed to work.


> When the problem is that it's already taking too long, the solution is
> not waiting even longer.  To put things in perspective, consider that
> server hardware watchdog timeouts are typically 60 seconds by default
> (if not maximum).

If you want the data retrieved from that particular device, the only
solution is waiting longer. The alternative is what you get, an IO
error (well actually you get a link reset, which also means the entire
command queue is purged on SATA drives).


> If anything, I want the timeout to be shorter so that upper layers with
> redundancy can get an EIO and initiate repair promptly, and admins can
> get notified to evict chronic offenders from their drive slots, without
> having to pay extra for hard disk firmware with that feature.

The drive totally thwarts this. It doesn't report back to the kernel
what command is hung, as far as I'm aware. It just hangs and goes into
a so-called "deep recovery"; there is no way to know what sector is
causing the problem until the drive reports a read error, which will
include the affected sector LBA.

Btrfs does have something of a workaround for when things get slow,
and that's balance: read and rewrite everything. The write forces
sector remapping by the drive firmware for bad sectors.
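
(In command terms that's roughly a plain, unfiltered balance -- the mount
point is a placeholder:

sudo btrfs balance start /mnt/array

which rewrites every allocated chunk, so every occupied sector gets written
again and the drive gets a chance to remap anything marginal.)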


>> *Ideally* I think we'd want two timeouts. I'd like to see commands
>> have a timer that results in merely a warning that could be used by
>> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
>> write over those sectors". That's how bad sectors start out, they read
>> slower and eventually go beyond 30 seconds and now it's all link
>> resets. If the problem could be fixed before then... that's the best
>> scenario.
>
> What's the downside of a link reset?  Can the driver not just return
> EIO for all the outstanding IOs in progress at reset, and let the upper
> layers deal with it?  Or is the problem that the upper layers are all
> horribly broken by EIOs, or drive firmware horribly broken by link resets?

Link reset clears the entire command queue on SATA drives, and it
wipes away any possibility of finding out what LBA or even a range of
LBAs, is the source of the stall. So it pretty much gets you nothing.


> The upper layers could time the IOs, and make their own decisions based
> on the timing (e.g. btrfs or mdadm could proactively repair anything that
> took more than 10 seconds to read).  That might be a better approach,
> since shortening the time to an EIO is only useful when you have a
> redundancy layer in place to do something about them.

For RAID with redundancy, that's doable, although I have no idea what
work is needed, or even if it's possible, to track commands in this
manner, and fall back to some kind of repair mode as if it were a read
error.

For single drives and RAID 0, the only possible solution is to not do
link resets for up to 3 minutes and hope the drive returns the single
copy of data.

Even in the case of Btrfs DUP, it's thwarted without a read error
reported from the drive (or it returning bad data).



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Zygo Blaxell
On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2016-06-25 12:44, Chris Murphy wrote:
> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> >>  wrote:
> >>
> >> OK but hold on. During scrub, it should read data, compute checksums
> >> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
> >> the checksum tree, and the parity strip in the chunk tree. And if
> >> parity is wrong, then it should be replaced.
> >
> > Except that's horribly inefficient.  With limited exceptions involving
> > highly situational co-processors, computing a checksum of a parity block is
> > always going to be faster than computing parity for the stripe.  By using
> > that to check parity, we can safely speed up the common case of near zero
> > errors during a scrub by a pretty significant factor.
> 
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe a bottleneck is within reach of an
array of SSDs vs. a slow CPU.

> It just came up again in a thread over the weekend on linux-raid@. I'm
> going to ask while people are paying attention if a patch to change
> the 30 second time out to something a lot higher has ever been
> floated, what the negatives might be, and where to get this fixed if
> it wouldn't be accepted in the kernel code directly.

Defaults are defaults, they're not for everyone.  30 seconds is about
two minutes too short for an SMR drive's worst-case write latency, or
28 seconds too long for an OLTP system, or just right for an end-user's
personal machine with a low-energy desktop drive and a long spin-up time.

Once a drive starts taking 30+ seconds to do I/O, I consider the drive
failed in the sense that it's too slow to meet latency requirements.
When the problem is that it's already taking too long, the solution is
not waiting even longer.  To put things in perspective, consider that
server hardware watchdog timeouts are typically 60 seconds by default
(if not maximum).

If anything, I want the timeout to be shorter so that upper layers with
redundancy can get an EIO and initiate repair promptly, and admins can
get notified to evict chronic offenders from their drive slots, without
having to pay extra for hard disk firmware with that feature.

> *Ideally* I think we'd want two timeouts. I'd like to see commands
> have a timer that results in merely a warning that could be used by
> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
> write over those sectors". That's how bad sectors start out, they read
> slower and eventually go beyond 30 seconds and now it's all link
> resets. If the problem could be fixed before then... that's the best
> scenario.

What's the downside of a link reset?  Can the driver not just return
EIO for all the outstanding IOs in progress at reset, and let the upper
layers deal with it?  Or is the problem that the upper layers are all
horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based
on the timing (e.g. btrfs or mdadm could proactively repair anything that
took more than 10 seconds to read).  That might be a better approach,
since shortening the time to an EIO is only useful when you have a
redundancy layer in place to do something about them.

> The 2nd timer would be, OK the controller or drive just face planted, reset.
> 
> -- 
> Chris Murphy
> 




Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Henk Slager
On Mon, Jun 27, 2016 at 6:17 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
> <ahferro...@gmail.com> wrote:
>> On 2016-06-25 12:44, Chris Murphy wrote:
>>>
>>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>> <ahferro...@gmail.com> wrote:
>>>
>>>> Well, the obvious major advantage that comes to mind for me to
>>>> checksumming
>>>> parity is that it would let us scrub the parity data itself and verify
>>>> it.
>>>
>>>
>>> OK but hold on. During scrub, it should read data, compute checksums
>>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>>> the checksum tree, and the parity strip in the chunk tree. And if
>>> parity is wrong, then it should be replaced.
>>
>> Except that's horribly inefficient.  With limited exceptions involving
>> highly situational co-processors, computing a checksum of a parity block is
>> always going to be faster than computing parity for the stripe.  By using
>> that to check parity, we can safely speed up the common case of near zero
>> errors during a scrub by a pretty significant factor.
>
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

What I read in this thread clarifies the different flavors of errors I
saw when trying btrfs raid5 while corrupting 1 device or just
unexpectedly removing a device and replacing it with a fresh one.
In particular, I was not aware of the lack of parity csums, and I think
this is really wrong.

Consider a 4 disk btrfs raid10 and a 3 disk btrfs raid5. Both protect
against the loss of 1 device or badblocks on 1 device. In the current
design (unoptimized for performance), raid10 reads from 2 disks, and
raid5 does as well (as far as I remember), per task/process.
Which pair of strips raid10 picks is pseudo-random AFAIK, so one could
get low throughput if some device in the array is older/slower and
that one is picked. Going from the device to the fs logical layer is just
a simple function, namely a copy, so there is the option to keep data
in place (zero-copy). The data is read by the csum check anyway, and in
case of failure the btrfs code picks the alternative strip and corrects it, etc.

For raid5, assuming it avoids the parity in principle, a read is also
a strip pair plus a csum check. In case of csum failure, one needs the
parity strip and a parity calculation. To me, it looks like the 'fear'
of this calculation has made raid56 a sort of add-on, instead of a
more integral part.

Looking at the raid6 perf test at boot in dmesg, it reports 30 GByte/s, even
higher than memory bandwidth. So although a calculation is needed if
data0strip+paritystrip were used instead of
data0strip+data1strip, I think that, looking at the total cost, it can be
cheaper than spending time on seeks, at least on HDDs. If the parity
calculation is treated in a transparent way, the same as a copy, then there
is more flexibility in selecting disks (and strips), which enables easier
design and performance optimizations, I think.
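
(Those figures come from the benchmark the raid6 code runs when it loads;
assuming the messages are still in the ring buffer, something like

dmesg | grep -i raid6

shows the per-algorithm gen() throughput and which algorithm the kernel picked.)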

>> The ideal situation that I'd like to see for scrub WRT parity is:
>> 1. Store checksums for the parity itself.
>> 2. During scrub, if the checksum is good, the parity is good, and we just
>> saved the time of computing the whole parity block.
>> 3. If the checksum is not good, then compute the parity.  If the parity just
>> computed matches what is there already, the checksum is bad and should be
>> rewritten (and we should probably recompute the whole block of checksums
>> it's in), otherwise, the parity was bad, write out the new parity and update
>> the checksum.

On this 3rd point: if the parity matches but the csum is not good, then there is
a btrfs design error or some hardware/CPU/memory problem. Compare with
btrfs raid10: if the copies match but the csum is wrong, then something
is fatally wrong. You only need the first step, the csum check; if it is wrong,
you regenerate the assumed-corrupt strip from the 3
others. And for a 3 disk raid5, from the 2 others, whether that means copying
or a parity calculation.

>> 4. Have an option to skip the csum check on the parity and always compute
>> it.


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
For what it's worth I found btrfs-map-logical can produce mapping for
raid5 (didn't test raid6) by specifying the extent block length. If
that's omitted it only shows the device+mapping for the first strip.

This example is a 3 disk raid5, with a 128KiB file all in a single extent.

[root@f24s ~]# btrfs-map-logical -l 14157742080 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c

[root@f24s ~]# btrfs-map-logical -l 14157742080 -b 131072 /dev/VG/a
mirror 1 logical 14157742080 physical 1109327872 device /dev/mapper/VG-a
mirror 1 logical 14157807616 physical 1075773440 device /dev/mapper/VG-b
mirror 2 logical 14157742080 physical 2183069696 device /dev/mapper/VG-c
mirror 2 logical 14157807616 physical 2183069696 device /dev/mapper/VG-c

It's also possible to use -c and -o to copy the extent to a file and
more easily diff it with a control file, rather than using dd.
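
(For reference, the dd route it replaces looks roughly like this, using the
physical offsets printed above; GNU dd's skip_bytes and a known-good control
copy of the file are assumed:

dd if=/dev/mapper/VG-a of=/tmp/strip0.bin bs=64K count=1 iflag=skip_bytes skip=1109327872
cmp -n 65536 /tmp/strip0.bin /path/to/control-file    # compare the first 64KiB strip
)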

Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Chris Murphy
On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-06-25 12:44, Chris Murphy wrote:
>>
>> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
>>  wrote:
>>
>>> Well, the obvious major advantage that comes to mind for me to
>>> checksumming
>>> parity is that it would let us scrub the parity data itself and verify
>>> it.
>>
>>
>> OK but hold on. During scrub, it should read data, compute checksums
>> *and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
>> the checksum tree, and the parity strip in the chunk tree. And if
>> parity is wrong, then it should be replaced.
>
> Except that's horribly inefficient.  With limited exceptions involving
> highly situational co-processors, computing a checksum of a parity block is
> always going to be faster than computing parity for the stripe.  By using
> that to check parity, we can safely speed up the common case of near zero
> errors during a scrub by a pretty significant factor.

OK I'm in favor of that. Although somehow md gets away with this by
computing and checking parity for its scrubs, and still manages to
keep drives saturated in the process - at least HDDs, I'm not sure how
it fares on SSDs.




> The ideal situation that I'd like to see for scrub WRT parity is:
> 1. Store checksums for the parity itself.
> 2. During scrub, if the checksum is good, the parity is good, and we just
> saved the time of computing the whole parity block.
> 3. If the checksum is not good, then compute the parity.  If the parity just
> computed matches what is there already, the checksum is bad and should be
> rewritten (and we should probably recompute the whole block of checksums
> it's in), otherwise, the parity was bad, write out the new parity and update
> the checksum.
> 4. Have an option to skip the csum check on the parity and always compute
> it.
>>
>>
>> Even check > md/sync_action does this. So no pun intended but Btrfs
>> isn't even at parity with mdadm on data integrity if it doesn't check
>> if the parity matches data.
>
> Except that MD and LVM don't have checksums to verify anything outside of
> the very high-level metadata.  They have to compute the parity during a
> scrub because that's the _only_ way they have to check data integrity.  Just
> because that's the only way for them to check it does not mean we have to
> follow their design, especially considering that we have other, faster ways
> to check it.

I'm not opposed to this optimization. But to retroactively better
qualify my previous "major advantage": what I meant was in terms of
solving a functional deficiency.



>> The much bigger problem we have right now that affects Btrfs,
>> LVM/mdadm md raid, is this silly bad default with non-enterprise
>> drives having no configurable SCT ERC, with ensuing long recovery
>> times, and the kernel SCSI command timer at 30 seconds - which
>> actually also fucks over regular single disk users, because it
>> means they don't get the "benefit" of long recovery times, which is
>> the whole g'd point of that feature. This itself causes so many
>> problems where bad sectors just get worse and don't get fixed up
>> because of all the link resets. So I still think it's a bullshit
>> default kernel side because it pretty much affects the majority use
>> case, it is only a non-problem with proprietary hardware raid, and
>> software raid using enterprise (or NAS specific) drives that already
>> have short recovery times by default.
>
> On this, we can agree.

It just came up again in a thread over the weekend on linux-raid@. I'm
going to ask while people are paying attention if a patch to change
the 30 second time out to something a lot higher has ever been
floated, what the negatives might be, and where to get this fixed if
it wouldn't be accepted in the kernel code directly.

*Ideally* I think we'd want two timeouts. I'd like to see commands
have a timer that results in merely a warning that could be used by
e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
write over those sectors". That's how bad sectors start out, they read
slower and eventually go beyond 30 seconds and now it's all link
resets. If the problem could be fixed before then... that's the best
scenario.

The 2nd timer would be, OK the controller or drive just face planted, reset.

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-27 Thread Austin S. Hemmelgarn

On 2016-06-25 12:44, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:


Well, the obvious major advantage that comes to mind for me to checksumming
parity is that it would let us scrub the parity data itself and verify it.


OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.
Except that's horribly inefficient.  With limited exceptions involving 
highly situational co-processors, computing a checksum of a parity block 
is always going to be faster than computing parity for the stripe.  By 
using that to check parity, we can safely speed up the common case of 
near zero errors during a scrub by a pretty significant factor.


The ideal situation that I'd like to see for scrub WRT parity is:
1. Store checksums for the parity itself.
2. During scrub, if the checksum is good, the parity is good, and we 
just saved the time of computing the whole parity block.
3. If the checksum is not good, then compute the parity.  If the parity 
just computed matches what is there already, the checksum is bad and 
should be rewritten (and we should probably recompute the whole block of 
checksums it's in), otherwise, the parity was bad, write out the new 
parity and update the checksum.
4. Have an option to skip the csum check on the parity and always 
compute it.


Even check > md/sync_action does this. So no pun intended but Btrfs
isn't even at parity with mdadm on data integrity if it doesn't check
if the parity matches data.
Except that MD and LVM don't have checksums to verify anything outside 
of the very high-level metadata.  They have to compute the parity during 
a scrub because that's the _only_ way they have to check data integrity. 
 Just because that's the only way for them to check it does not mean we 
have to follow their design, especially considering that we have other, 
faster ways to check it.




I'd personally much rather know my parity is bad before I need to use it
than after using it to reconstruct data and getting an error there, and I'd
be willing to bet that most seasoned sysadmins working for companies using
big storage arrays likely feel the same about it.


That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.
It does not require it, but it can make it significantly more efficient, 
and even a 1% increase in efficiency is a huge difference on a big array.


So that's also something to check in the code or poke a system with a
stick and see what happens.


I could see it being
practical to have an option to turn this off for performance reasons or
similar, but again, I have a feeling that most people would rather be able
to check if a rebuild will eat data before trying to rebuild (depending on
the situation in such a case, it will sometimes just make more sense to nuke
the array and restore from a backup instead of spending time waiting for it
to rebuild).


The much bigger problem we have right now that affects Btrfs,
LVM/mdadm md raid, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which
actually also fucks over regular single disk users, because it
means they don't get the "benefit" of long recovery times, which is
the whole g'd point of that feature. This itself causes so many
problems where bad sectors just get worse and don't get fixed up
because of all the link resets. So I still think it's a bullshit
default kernel side because it pretty much affects the majority use
case, it is only a non-problem with proprietary hardware raid, and
software raid using enterprise (or NAS specific) drives that already
have short recovery times by default.

On this, we can agree.


Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Chris Murphy
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov  wrote:
> 26.06.2016 00:52, Chris Murphy wrote:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e.
>> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
>> is raid4.
>
> That's not what the code suggests or what I see in practice - parity seems
> to be distributed across all disks; each new 128KiB file (extent) has
> parity on a new disk. At least as long as we can trust btrfs-map-logical
> to always show parity as "mirror 2".


tl;dr Andrei is correct there's no raid4 behavior here.

Looks like mirror 2 is always parity, more on that below.


>
> Do you see consecutive full stripes in your tests? Or how do you
> determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen.
But not checking the consecutivity is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456128..   3456159: 32: last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456224..   3456255: 32: last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456320..   3456351: 32: last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456352..   3456383: 32: last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456384..   3456415: 32: last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456416..   3456447: 32: last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456448..   3456479: 32: last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456480..   3456511: 32: last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456096..   3456127: 32: last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file then from the top so they're in 4096
byte block order; and the 2nd column is the difference in value:

3456096
3456128 32
3456224 96
3456320 96
3456352 32
3456384 32
3456416 32
3456448 32
3456480 32

So the first two files are consecutive full stripe writes. The next
two aren't. The next five are. They were all copied at the same time.
I don't know why they aren't always consecutive writes.


[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c

Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Duncan
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted:

> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

... and posts (including patches, which I guess are code as well, just 
not applied yet).  I've been noticing that in the "stripe length" 
patches, the comments associated with the patches suggest it's actually "strip 
length" they're talking about, using the "N strips, one per 
device, make a stripe" definition.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Andrei Borzenkov
26.06.2016 00:52, Chris Murphy wrote:
> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> is raid4.

That's not what the code suggests or what I see in practice - parity seems
to be distributed across all disks; each new 128KiB file (extent) has
parity on a new disk. At least as long as we can trust btrfs-map-logical
to always show parity as "mirror 2".

Do you see consecutive full stripes in your tests? Or how do you
determine which devid has parity for a given full stripe? This
information is not actually stored anywhere, it is computed based on
block group geometry and logical stripe offset.

P.S. usage of "stripe" to mean "stripe element" actually adds to
confusion when reading code :)


Re: Adventures in btrfs raid5 disk recovery

2016-06-25 Thread Chris Murphy
Interestingly enough, so far I'm finding with full stripe writes, i.e.
3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
is raid4. So...I wonder if some of these slow cases end up with a
bunch of stripes that are effectively raid4-like, and have a lot of
parity overwrites, which is where raid4 suffers due to disk
contention.

Totally speculative as the sample size is too small and distinctly non-random.


Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-25 Thread Chris Murphy
On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
 wrote:

> Well, the obvious major advantage that comes to mind for me to checksumming
> parity is that it would let us scrub the parity data itself and verify it.

OK but hold on. During scrub, it should read data, compute checksums
*and* parity, and compare those to what's on-disk - > EXTENT_CSUM in
the checksum tree, and the parity strip in the chunk tree. And if
parity is wrong, then it should be replaced.

Even check > md/sync_action does this. So no pun intended but Btrfs
isn't even at parity with mdadm on data integrity if it doesn't check
if the parity matches data.
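
(md's scrub is driven through sysfs, roughly as follows -- md0 is a
placeholder device:

echo check > /sys/block/md0/md/sync_action     # read everything, compare parity/mirrors
cat /sys/block/md0/md/mismatch_cnt             # sectors found not matching
echo repair > /sys/block/md0/md/sync_action    # same, but rewrite parity/mirrors that disagree
)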


> I'd personally much rather know my parity is bad before I need to use it
> than after using it to reconstruct data and getting an error there, and I'd
> be willing to bet that most seasoned sysadmins working for companies using
> big storage arrays likely feel the same about it.

That doesn't require parity csums though. It just requires computing
parity during a scrub and comparing it to the parity on disk to make
sure they're the same. If they aren't, assuming no other error for
that full stripe read, then the parity block is replaced.

So that's also something to check in the code or poke a system with a
stick and see what happens.

> I could see it being
> practical to have an option to turn this off for performance reasons or
> similar, but again, I have a feeling that most people would rather be able
> to check if a rebuild will eat data before trying to rebuild (depending on
> the situation in such a case, it will sometimes just make more sense to nuke
> the array and restore from a backup instead of spending time waiting for it
> to rebuild).

The much bigger problem we have right now that affects Btrfs,
LVM/mdadm md raid, is this silly bad default with non-enterprise
drives having no configurable SCT ERC, with ensuing long recovery
times, and the kernel SCSI command timer at 30 seconds - which
actually also fucks over regular single disk users, because it
means they don't get the "benefit" of long recovery times, which is
the whole g'd point of that feature. This itself causes so many
problems where bad sectors just get worse and don't get fixed up
because of all the link resets. So I still think it's a bullshit
default kernel side because it pretty much affects the majority use
case, it is only a non-problem with proprietary hardware raid, and
software raid using enterprise (or NAS specific) drives that already
have short recovery times by default.

This has been true for a very long time, maybe a decade. And it's such
complete utter crap that this hasn't been dealt with properly by any
party. No distribution has fixed this for their users. Upstream udev
hasn't dealt with it. And kernel folks haven't dealt with it. It's a
perverse joke on the user to do this out of the box.
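
(Concretely, the mismatch is between the drive's internal error recovery time
and the kernel's 30 second command timer. The usual checks and workarounds look
roughly like this -- sdX is a placeholder, and 7 seconds / 180 seconds are just
commonly suggested values:

smartctl -l scterc /dev/sdX                  # does the drive support SCT ERC, and is it enabled?
smartctl -l scterc,70,70 /dev/sdX            # if supported: cap read/write recovery at 7.0 seconds
echo 180 > /sys/block/sdX/device/timeout     # if not: raise the kernel timer above the drive's worst case
)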



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Austin S. Hemmelgarn

On 2016-06-24 13:52, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov  wrote:

24.06.2016 20:06, Chris Murphy wrote:

On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  wrote:

On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
eta)data and RAID56 parity is not data.


   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).



Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.


I don't understand this. Whether the failed disk means a stripe is
missing a data strip or parity strip, if any other strip is damaged of
course the reconstruction isn't going to match checksum. This does not
make raid5 pointless.



Yes, you are right. We have a double failure here. Still, in the current
situation we apparently may end up with btrfs reconstructing the missing block
using wrong information. As was mentioned elsewhere, btrfs does not
verify the checksum of the reconstructed block, meaning data corruption.


Well that'd be bad, but also good in that it would explain a lot of
problems people have when metadata is also raid5. In this whole thread
the premise is the metadata is raid1, so the fs doesn't totally face
plant; we just get a bunch of weird data corruptions. The metadata
raid5 cases were sorta "WTF happened?" and not much was really said
about it other than telling the user to scrape off what they can and
start over.

Anyway, while not good, I still think it would not be super problematic to
at least *do* check EXTENT_CSUM after reconstruction from parity
rather than assuming that reconstruction happened correctly. The data
needed to pass/fail the rebuild is already on the disk. It just needs
to be checked.

Better would be to get parity csummed and put into the csum tree. But
I don't know how much that helps. Think about always computing and
writing csums for parity, which almost never get used vs keeping
things the way they are now and just *checking our work* after
reconstruction from parity. If there's some obvious major advantage to
checksumming the parity I'm all ears but I'm not thinking of it at the
moment.

Well, the obvious major advantage that comes to mind for me to 
checksumming parity is that it would let us scrub the parity data itself 
and verify it.  I'd personally much rather know my parity is bad before 
I need to use it than after using it to reconstruct data and getting an 
error there, and I'd be willing to be that most seasoned sysadmins 
working for companies using big storage arrays likely feel the same 
about it.  I could see it being practical to have an option to turn this 
off for performance reasons or similar, but again, I have a feeling that 
most people would rather be able to check if a rebuild will eat data 
before trying to rebuild (depending on the situation in such a case, it 
will sometimes just make more sense to nuke the array and restore from a 
backup instead of spending time waiting for it to rebuild).




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Zygo Blaxell
On Fri, Jun 24, 2016 at 11:40:56AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> 
> The last sentence is hugely problematic. Parity doesn't appear to be
> either CoW'd or checksummed. If it is used for reconstruction and the
> reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
> that entry is rather recomputed and written, that's just like blindly
> trusting the parity is correct and then authenticating it with a csum.

I think what happens is the data is recomputed, but the csum on the
data is _not_ updated (the csum does not reside in the raid56 code).
A read of the reconstructed data would get a csum failure (of course,
every 4 billionth time this happens the csum is correct by random chance,
so you wouldn't want to be reading parity blocks from a drive full of
garbage, but that's a different matter).

> It's  not difficult to test. Corrupt one byte of parity. Yank a drive.
> Add a new one. Start a reconstruction with scrub or balance (or both
> to see if they differ) and find out what happens. What should happen
> is the reconstruct should work for everything except that one file. If
> it's reconstructed silently, it should contain visible corruption and
> we all collectively raise our eyebrows.

I've done something like that test:  write random data to 1000 random
blocks on one disk, then run scrub.  It reconstructs the data without
problems (except for the minor wart that 'scrub status -d' counts the
errors randomly against every device, while 'dev stats' counts all the errors
on the disk that was corrupted).
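
(Roughly, and only on a throwaway array, that test looks like this -- device,
mount point and the offset range are placeholders, chosen to land inside the
data area rather than on the superblocks:

for i in $(seq 1000); do
    blk=$(( (RANDOM * 32768 + RANDOM) % 2000000 + 100000 ))   # pseudo-random 4KiB block number
    dd if=/dev/urandom of=/dev/sdX bs=4K seek=$blk count=1 conv=notrunc 2>/dev/null
done
btrfs scrub start -Bd /mnt/array     # -B: foreground, -d: per-device statistics
btrfs device stats /mnt/array
)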

Disk-side data corruption is a thing I have to deal with a few times each
year, so I tested the btrfs raid5 implementation for that case before I
started using it.  As far as I can tell so far, everything in btrfs raid5
works properly if a disk fails _while the filesystem is not mounted_.

The problem I see in the field is not *silent* corruption.  It's a whole
lot of very *noisy* corruption detected under circumstances where I'd
expect to see no corruption at all (silent or otherwise).





Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 11:21 AM, Andrei Borzenkov  wrote:
> 24.06.2016 20:06, Chris Murphy wrote:
>> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  
>> wrote:
>>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>>> eta)data and RAID56 parity is not data.

Checksums are not parity, correct. However, every data block
 (including, I think, the parity) is checksummed and put into the csum
 tree. This allows the FS to determine where damage has occurred,
 rather than simply detecting that it has occurred (which would be the
 case if the parity doesn't match the data, or if the two copies of a
 RAID-1 array don't match).

>>>
>>> Yes, that is what I wrote below. But that means that RAID5 with one
>>> degraded disk won't be able to reconstruct data on this degraded disk
>>> because reconstructed extent content won't match checksum. Which kinda
>>> makes RAID5 pointless.
>>
>> I don't understand this. Whether the failed disk means a stripe is
>> missing a data strip or parity strip, if any other strip is damaged of
>> course the reconstruction isn't going to match checksum. This does not
>> make raid5 pointless.
>>
>
> Yes, you are right. We have a double failure here. Still, in the current
> situation we apparently may end up with btrfs reconstructing the missing block
> using wrong information. As was mentioned elsewhere, btrfs does not
> verify the checksum of the reconstructed block, meaning data corruption.

Well that'd be bad, but also good in that it would explain a lot of
problems people have when metadata is also raid5. In this whole thread
the premise is the metadata is raid1, so the fs doesn't totally face
plant; we just get a bunch of weird data corruptions. The metadata
raid5 cases were sorta "WTF happened?" and not much was really said
about it other than telling the user to scrape off what they can and
start over.

Anyway, while not good, I still think it would not be super problematic to
at least *do* check EXTENT_CSUM after reconstruction from parity
rather than assuming that reconstruction happened correctly. The data
needed to pass/fail the rebuild is already on the disk. It just needs
to be checked.

Better would be to get parity csummed and put into the csum tree. But
I don't know how much that helps. Think about always computing and
writing csums for parity, which almost never get used vs keeping
things the way they are now and just *checking our work* after
reconstruction from parity. If there's some obvious major advantage to
checksumming the parity I'm all ears but I'm not thinking of it at the
moment.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 4:16 AM, Hugo Mills  wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:

>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
>Eh? How do you come to that conclusion?
>
>For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest
> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.

The last sentence is hugely problematic. Parity doesn't appear to be
either CoW'd or checksummed. If it is used for reconstruction and the
reconstructed data isn't compared to the data's EXTENT_CSUM entry, but
that entry is rather recomputed and written, that's just like blindly
trusting the parity is correct and then authenticating it with a csum.

It's  not difficult to test. Corrupt one byte of parity. Yank a drive.
Add a new one. Start a reconstruction with scrub or balance (or both
to see if they differ) and find out what happens. What should happen
is the reconstruct should work for everything except that one file. If
it's reconstructed silently, it should contain visible corruption and
we all collectively raise our eyebrows.

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 4:16 AM, Andrei Borzenkov  wrote:
> On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy  wrote:
>
>> [root@f24s ~]# filefrag -v /mnt/5/*
>> Filesystem type is: 9123683e
>> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>>  ext: logical_offset:physical_offset: length:   expected: flags:
>>0:0..   3:2931712..   2931715:  4: 
>> last,eof
>
> Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)
>
> nohostname:~ # filefrag -v /mnt/1
> Filesystem type is: 9123683e
> File size of /mnt/1 is 3072 (1 block of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   0: 269376..269376:  1: last,eof
> /mnt/1: 1 extent found
>
> But!
>
> nohostname:~ # filefrag -v /etc/passwd
> Filesystem type is: 9123683e
> File size of /etc/passwd is 1527 (1 block of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..4095:  0..  4095:   4096:
> last,not_aligned,inline,eof
> /etc/passwd: 1 extent found
> nohostname:~ #
>
> Why it works for one filesystem but does not work for an other one?
> ...
>>
>> So at the old address, it shows the "a..." is still there. And at
>> the added single block for this file at new logical and physical
>> addresses, is the modification substituting the first "a" for "g".
>>
>> In this case, no rmw, no partial stripe modification, and no data
>> already on-disk is at risk.
>
> You misunderstand the nature of problem. What is put at risk is data
> that is already on disk and "shares" parity with new data.
>
> As example, here are the first 64K in several extents on 4 disk RAID5
> with so far single data chunk
>
> item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff
> 15491 itemsize 176
> chunk length 3221225472 owner 2 stripe_len 65536
> type DATA|RAID5 num_stripes 4
> stripe 0 devid 4 offset 9437184
> dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
> stripe 1 devid 3 offset 9437184
> dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
> stripe 2 devid 2 offset 1074790400
> dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
> stripe 3 devid 1 offset 1094713344
> dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
>
>
> item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
> item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
> item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
> item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
> dev extent chunk_tree 3
> chunk objectid 256 chunk offset 1103101952 length 1073741824
>
> where devid 1 = sdb1, 2 = sdc1 etc.
>
> Now let's write some data (I created several files) up to 64K in size:
>
> mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
> mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
> mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
> mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
> mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
> mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
> mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
> mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
> mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
> mirror 2 logical 1103380480 physical 9519104 device /dev/sde1
>
> Note that btrfs allocates 64K on the same device before switching to
> the next one. What is a bit misleading here, sdc1 is data and sde1 is
> parity (you can see it in checksum tree, where only items for sdc1
> exist).
>
> Now let's write next 64k and see what happens
>
> nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
> mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
> mirror 2 logical 1103429632 physical 9502720 device /dev/sde1
>
> See? btrfs now allocates new stripe on sdb1; this stripe is at the
> same offset as previous one on sdc1 (64K) and so shares the same
> parity stripe on sde1.

Yep, I've seen this also. What's not clear is if there's any
optimization where it's doing partial strip writes, i.e. only a

Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
24.06.2016 20:06, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>> eta)data and RAID56 parity is not data.
>>>
>>>Checksums are not parity, correct. However, every data block
>>> (including, I think, the parity) is checksummed and put into the csum
>>> tree. This allows the FS to determine where damage has occurred,
>>> rather than simply detecting that it has occurred (which would be the
>>> case if the parity doesn't match the data, or if the two copies of a
>>> RAID-1 array don't match).
>>>
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
> 
> I don't understand this. Whether the failed disk means a stripe is
> missing a data strip or parity strip, if any other strip is damaged of
> course the reconstruction isn't going to match checksum. This does not
> make raid5 pointless.
> 

Yes, you are right. We have a double failure here. Still, in the current
situation we apparently may end up with btrfs reconstructing the missing block
using wrong information. As was mentioned elsewhere, btrfs does not
verify the checksum of the reconstructed block, meaning data corruption.


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 3:52 AM, Andrei Borzenkov  wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>eta)data and RAID56 parity is not data.
>>
>>Checksums are not parity, correct. However, every data block
>> (including, I think, the parity) is checksummed and put into the csum
>> tree. This allows the FS to determine where damage has occurred,
>> rather than simply detecting that it has occurred (which would be the
>> case if the parity doesn't match the data, or if the two copies of a
>> RAID-1 array don't match).
>>
>
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

I don't understand this. Whether the failed disk means a stripe is
missing a data strip or parity strip, if any other strip is damaged of
course the reconstruction isn't going to match checksum. This does not
make raid5 pointless.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Chris Murphy
On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills  wrote:

>Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree.

I don't see how parity is checksummed. It definitely is not in the
csum tree. Two file systems, one raid5, one single, each with a single
identical file:

raid5
item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
extent csum item

single

item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
extent csum item

That's the only entry in the csum tree. The raid5 one is not 33.33%
bigger to account for the extra parity being checksummed.

Now, if parity is used to reconstruction of data, that data *is*
checksummed so if it fails checksum after reconstruction the
information is available to determine it was incorrectly
reconstructed. The notes in btrfs/raid56.c recognize the possibility
of parity corruption and how to handle it. But I think that corruption
is inferred. Maybe the parity csums are in some other metadata item,
but I don't see how it's in the csum tree.
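For anyone who wants to repeat that comparison, a minimal sketch (the volume
names, mount points and test file are illustrative, not the ones used above):

  # one small identical file on each filesystem, then unmount so the trees
  # on disk are current
  cp testfile /mnt/single/ && cp testfile /mnt/raid5/
  umount /mnt/single /mnt/raid5

  # dump each filesystem and keep only the csum items; at 4 bytes per csum,
  # the itemsize should only ever cover the data blocks, never the parity
  btrfs-debug-tree /dev/VG/single | grep -A1 EXTENT_CSUM
  btrfs-debug-tree /dev/VG/raid5a | grep -A1 EXTENT_CSUM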


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 10:52:53AM -0600, Chris Murphy wrote:
> On Fri, Jun 24, 2016 at 2:50 AM, Hugo Mills  wrote:
> 
> >Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree.
> 
> I don't see how parity is checksummed. It definitely is not in the
> csum tree. Two file systems, one raid5, one single, each with a single
> identical file:

   It isn't -- I was wrong up there, and corrected myself in a later
message after investigation. (Although in this case, I regard reality
as being at fault ;) ).

   Hugo.

> raid5
> item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 
> 128
> extent csum item
> 
> single
> 
> item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
> extent csum item
> 
> That's the only entry in the csum tree. The raid5 one is not 33.33%
> bigger to account for the extra parity being checksummed.
> 
> Now, if parity is used to reconstruction of data, that data *is*
> checksummed so if it fails checksum after reconstruction the
> information is available to determine it was incorrectly
> reconstructed. The notes in btrfs/raid56.c recognize the possibility
> of parity corruption and how to handle it. But I think that corruption
> is inferred. Maybe the parity csums are in some other metadata item,
> but I don't see how it's in the csum tree.
> 
> 

-- 
Hugo Mills | Great oxymorons of the world, no. 2:
hugo@... carfax.org.uk | Common Sense
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Zygo Blaxell
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> 
> Well, the problem is that parity block cannot be redirected on write as
> data blocks; which makes it impossible to version control it. The only
> solution I see is to always use full stripe writes by either wasting
> time in fixed width stripe or using variable width, so that every stripe
> always gets new version of parity. This makes it possible to keep parity
> checksums like data checksums.

The allocator could try harder to avoid partial stripe writes.  We can
write multiple small extents to the same stripe as long as we always do
it all within one transaction, and then later treat the entire stripe
as read-only until every extent is removed.  It would be possible to do
that by fudging extent lengths (effectively adding a bunch of prealloc-ish
space if we have a partial write after all the delalloc stuff is done),
but it could also waste some blocks on every single transaction, or
create a bunch of "free but unavailable" space that makes df/statvfs
output even more wrong than it usually is.

The raid5 rmw code could try to relocate the other extents sharing a
stripe, but I fear that with the current state of backref walking code
that would make raid5 spectacularly slow if a filesystem is anywhere
near full.

We could also write rmw parity block updates to a journal (like another
log tree).  That would enable us to at least fix up the parity blocks
after a crash, and close the write hole.  That's an on-disk format
change though.





Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 11:20:40PM -0600, Chris Murphy wrote:
> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   0:2931732..   2931732:  1:
>1:1..   3:2931713..   2931715:  3:2931733: last,eof
> /mnt/5/a.txt: 2 extents found
> File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931716..   2931719:  4: last,eof
> /mnt/5/b.txt: 1 extent found
> File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931720..   2931723:  4: last,eof
> /mnt/5/c.txt: 1 extent found
> File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931724..   2931727:  4: last,eof
> /mnt/5/d.txt: 1 extent found
> File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931728..   2931731:  4: last,eof
> /mnt/5/e.txt: 1 extent found

> So at the old address, it shows the "a..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
> 
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk. Even the metadata leaf/node is cow'd, it
> has a new logical and physical address as well, which contains
> information for all five files.

Well, of course not.  You're not setting up the conditions for failure.
The extent at 2931712..2931715 is 4 blocks long, so when you overwrite
part of the extent all 4 blocks remain occupied.

You need extents that are shorter than the stripe width, and you need to
write to the same stripe in two different btrfs transactions (i.e. you
need to delete an extent and then have a new extent mapped in the old
location).
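A minimal sketch of one way to set that condition up (paths and sizes are
illustrative; whether the second extent really lands in the same stripe still
has to be checked afterwards with filefrag and btrfs-map-logical):

  cd /mnt/5
  # transaction 1: a 16K extent, well below the 64K stripe_len
  dd if=/dev/urandom of=small-1 bs=4k count=4
  sync

  # a later transaction frees part of that stripe again
  rm small-1
  sync

  # another transaction: a second 16K extent, hopefully mapped into the
  # space just freed inside the old stripe
  dd if=/dev/urandom of=small-2 bs=4k count=4
  sync

  filefrag -v small-2   # feed the physical offset to btrfs-map-logical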




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Austin S. Hemmelgarn

On 2016-06-24 06:59, Hugo Mills wrote:

On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:

On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills  wrote:

On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:

On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:

On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:

24.06.2016 04:47, Zygo Blaxell пишет:

On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:

On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  wrote:

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.


Yeah I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs. But defines it in terms of
parity possibly being stale after a crash. I think the term comes not
from merely parity being wrong but parity being wrong *and* then being
used to wrongly reconstruct data because it's blindly trusted.


I think the opposite is more likely, as the layers above raid56
seem to check the data against sums before raid56 ever sees it.
(If those layers seem inverted to you, I agree, but OTOH there are
probably good reason to do it that way).



Yes, that's how I read code as well. btrfs layer that does checksumming
is unaware of parity blocks at all; for all practical purposes they do
not exist. What happens is approximately

1. logical extent is allocated and checksum computed
2. it is mapped to physical area(s) on disks, skipping over what would
be parity blocks
3. when these areas are written out, RAID56 parity is computed and filled in

IOW btrfs checksums are for (meta)data and RAID56 parity is not data.


   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).



Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.


   Eh? How do you come to that conclusion?

   For data, say you have n-1 good devices, with n-1 blocks on them.
Each block has a checksum in the metadata, so you can read that
checksum, read the blocks, and verify that they're not damaged. From
those n-1 known-good blocks (all data, or one parity and the rest


We do not know whether parity is good or not because as far as I can
tell parity is not checksummed.


   I was about to write a devastating rebuttal of this... then I
actually tested it, and holy crap you're right.

   I've just closed the terminal in question by accident, so I can't
copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

 item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
  extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum,
that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
nothing for the parity.

   This is fundamentally broken, and I think we need to change the
wiki to indicate that the parity RAID implementation is not
recommended, because it doesn't actually do the job it's meant to in a
reliable way. :(


So that's item 4 now then, together with:
1. Rebuilds seemingly randomly decide, based on the filesystem, whether or
not to take an insanely long time (it always happens on some arrays, never
happens on others; I have yet to see a report where it happens
intermittently).

2. Failed disks seem to occasionally cause irreversible data corruption.
3. The classic erasure-code write hole, just slightly different because of COW.

TBH, as much as I hate to say this, it looks like the raid5/6 code needs
to be redone from scratch.  At an absolute minimum, we need to put a warning
in mkfs for people using raid5/6 to tell them they shouldn't be using it
outside of testing.



Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Austin S. Hemmelgarn

On 2016-06-24 01:20, Chris Murphy wrote:

On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell
 wrote:


With simple files changing one character with vi and gedit,
I get completely different logical and physical numbers with each
change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
raid5).


You are COWing the entire file because vi and gedit do truncate followed
by full-file write.


I'm seeing the file inode changes with either a vi or gedit
modification, even when file size is exactly the same, just character
substitute. So as far as VFS and Btrfs are concerned, it's an entirely
different file, so it's like faux-CoW that would have happened on any
file system, not an overwrite.
Yes, at least Vim (which is what most Linux systems use for vi) writes 
to a temporary file then does a replace by rename.  The idea is that 
POSIX implies that this should be atomic (except it's not actually 
required by POSIX, and even on some journaled and COW filesystems, it 
isn't actually atomic).



Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 01:19:30PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills  wrote:
> > On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> >> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> >> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> >> 24.06.2016 04:47, Zygo Blaxell пишет:
> >> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
> >> >> >>  wrote:
> >> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> >> >> >>> checksum.
> >> >> >>
> >> >> >> Yeah I'm kinda confused on this point.
> >> >> >>
> >> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >> >>
> >> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> >> from merely parity being wrong but parity being wrong *and* then 
> >> >> >> being
> >> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >> >
> >> >> > I think the opposite is more likely, as the layers above raid56
> >> >> > seem to check the data against sums before raid56 ever sees it.
> >> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> >> > probably good reason to do it that way).
> >> >> >
> >> >>
> >> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> >> is unaware of parity blocks at all; for all practical purposes they do
> >> >> not exist. What happens is approximately
> >> >>
> >> >> 1. logical extent is allocated and checksum computed
> >> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> >> be parity blocks
> >> >> 3. when these areas are written out, RAID56 parity is computed and 
> >> >> filled in
> >> >>
> >> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >> >
> >> >Checksums are not parity, correct. However, every data block
> >> > (including, I think, the parity) is checksummed and put into the csum
> >> > tree. This allows the FS to determine where damage has occurred,
> >> > rather than simply detecting that it has occurred (which would be the
> >> > case if the parity doesn't match the data, or if the two copies of a
> >> > RAID-1 array don't match).
> >> >
> >>
> >> Yes, that is what I wrote below. But that means that RAID5 with one
> >> degraded disk won't be able to reconstruct data on this degraded disk
> >> because reconstructed extent content won't match checksum. Which kinda
> >> makes RAID5 pointless.
> >
> >Eh? How do you come to that conclusion?
> >
> >For data, say you have n-1 good devices, with n-1 blocks on them.
> > Each block has a checksum in the metadata, so you can read that
> > checksum, read the blocks, and verify that they're not damaged. From
> > those n-1 known-good blocks (all data, or one parity and the rest
> 
> We do not know whether parity is good or not because as far as I can
> tell parity is not checksummed.

   I was about to write a devastating rebuttal of this... then I
actually tested it, and holy crap you're right.

   I've just closed the terminal in question by accident, so I can't
copy-and-paste, but the way I checked was:

# mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
# mount /dev/loop0 foo
# dd if=/dev/urandom of=foo/file bs=4k count=32
# umount /dev/loop0
# btrfs-debug-tree /dev/loop0

then look at the csum tree:

 item 0 key (EXTENT_CSUM EXTENT_CSUM 351469568) itemoff 16155 itemsize 128
  extent csum item

There is a single csum item in it, of length 128. At 4 bytes per csum,
that's 32 checksums, which covers the 32 4KiB blocks I wrote, leaving
nothing for the parity.

   This is fundamentally broken, and I think we need to change the
wiki to indicate that the parity RAID implementation is not
recommended, because it doesn't actually do the job it's meant to in a
reliable way. :(

   Hugo.
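A self-contained version of the same check, assuming loop-backed files are
acceptable (the sizes and paths below are illustrative):

  mkdir -p /tmp/csum-test && cd /tmp/csum-test
  for i in 0 1 2; do truncate -s 512M disk$i.img; done
  losetup /dev/loop0 disk0.img
  losetup /dev/loop1 disk1.img
  losetup /dev/loop2 disk2.img

  mkfs.btrfs -mraid1 -draid5 /dev/loop{0,1,2}
  mkdir -p foo && mount /dev/loop0 foo
  dd if=/dev/urandom of=foo/file bs=4k count=32
  umount foo

  # 32 x 4KiB data blocks -> a single 128-byte csum item (4 bytes per csum),
  # i.e. nothing left over that could cover the parity blocks
  btrfs-debug-tree /dev/loop0 | grep -A1 EXTENT_CSUM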

> > data) you can reconstruct the remaining block. That reconstructed
> > block won't be checked against the csum for the missing block -- it'll
> > just be written and a new csum for it written with it.
> >
> 
> So we have silent corruption. I fail to understand how it is an improvement :)
> 
> >Hugo.
> >
> >> ...
> >> >
> >> >> > It looks like uncorrectable failures might occur because parity is
> >> >> > correct, but the parity checksum is out of date, so the parity 
> >> >> > checksum
> >> >> > doesn't match even though data blindly reconstructed from the parity
> >> >> > *would* match the data.
> >> >> >
> >> >>
> >> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> >> should at least avoid silent corruption.
> >> >>
> >

-- 
Hugo Mills | Debugging is like hitting yourself in the head with
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and

Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
On Fri, Jun 24, 2016 at 1:16 PM, Hugo Mills  wrote:
> On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
>> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
>> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> >> 24.06.2016 04:47, Zygo Blaxell пишет:
>> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
>> >> >>  wrote:
>> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>> >> >>> checksum.
>> >> >>
>> >> >> Yeah I'm kinda confused on this point.
>> >> >>
>> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >> >>
>> >> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> >> parity possibly being stale after a crash. I think the term comes not
>> >> >> from merely parity being wrong but parity being wrong *and* then being
>> >> >> used to wrongly reconstruct data because it's blindly trusted.
>> >> >
>> >> > I think the opposite is more likely, as the layers above raid56
>> >> > seem to check the data against sums before raid56 ever sees it.
>> >> > (If those layers seem inverted to you, I agree, but OTOH there are
>> >> > probably good reason to do it that way).
>> >> >
>> >>
>> >> Yes, that's how I read code as well. btrfs layer that does checksumming
>> >> is unaware of parity blocks at all; for all practical purposes they do
>> >> not exist. What happens is approximately
>> >>
>> >> 1. logical extent is allocated and checksum computed
>> >> 2. it is mapped to physical area(s) on disks, skipping over what would
>> >> be parity blocks
>> >> 3. when these areas are written out, RAID56 parity is computed and filled 
>> >> in
>> >>
>> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>> >
>> >Checksums are not parity, correct. However, every data block
>> > (including, I think, the parity) is checksummed and put into the csum
>> > tree. This allows the FS to determine where damage has occurred,
>> > rather than simply detecting that it has occurred (which would be the
>> > case if the parity doesn't match the data, or if the two copies of a
>> > RAID-1 array don't match).
>> >
>>
>> Yes, that is what I wrote below. But that means that RAID5 with one
>> degraded disk won't be able to reconstruct data on this degraded disk
>> because reconstructed extent content won't match checksum. Which kinda
>> makes RAID5 pointless.
>
>Eh? How do you come to that conclusion?
>
>For data, say you have n-1 good devices, with n-1 blocks on them.
> Each block has a checksum in the metadata, so you can read that
> checksum, read the blocks, and verify that they're not damaged. From
> those n-1 known-good blocks (all data, or one parity and the rest

We do not know whether parity is good or not because as far as I can
tell parity is not checksummed.

> data) you can reconstruct the remaining block. That reconstructed
> block won't be checked against the csum for the missing block -- it'll
> just be written and a new csum for it written with it.
>

So we have silent corruption. I fail to understand how it is an improvement :)

>Hugo.
>
>> ...
>> >
>> >> > It looks like uncorrectable failures might occur because parity is
>> >> > correct, but the parity checksum is out of date, so the parity checksum
>> >> > doesn't match even though data blindly reconstructed from the parity
>> >> > *would* match the data.
>> >> >
>> >>
>> >> Yep, that is how I read it too. So if your data is checksummed, it
>> >> should at least avoid silent corruption.
>> >>
>
> --
> Hugo Mills | Debugging is like hitting yourself in the head with
> hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
> http://carfax.org.uk/  | you're allowed to stop debugging.
> PGP: E2AB1DE4  |PotatoEngineer


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 12:52:21PM +0300, Andrei Borzenkov wrote:
> On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> > On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> >> 24.06.2016 04:47, Zygo Blaxell пишет:
> >> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli 
> >> >>  wrote:
> >> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> >> >>> checksum.
> >> >>
> >> >> Yeah I'm kinda confused on this point.
> >> >>
> >> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >> >>
> >> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> >> parity possibly being stale after a crash. I think the term comes not
> >> >> from merely parity being wrong but parity being wrong *and* then being
> >> >> used to wrongly reconstruct data because it's blindly trusted.
> >> >
> >> > I think the opposite is more likely, as the layers above raid56
> >> > seem to check the data against sums before raid56 ever sees it.
> >> > (If those layers seem inverted to you, I agree, but OTOH there are
> >> > probably good reason to do it that way).
> >> >
> >>
> >> Yes, that's how I read code as well. btrfs layer that does checksumming
> >> is unaware of parity blocks at all; for all practical purposes they do
> >> not exist. What happens is approximately
> >>
> >> 1. logical extent is allocated and checksum computed
> >> 2. it is mapped to physical area(s) on disks, skipping over what would
> >> be parity blocks
> >> 3. when these areas are written out, RAID56 parity is computed and filled 
> >> in
> >>
> >> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
> >
> >Checksums are not parity, correct. However, every data block
> > (including, I think, the parity) is checksummed and put into the csum
> > tree. This allows the FS to determine where damage has occurred,
> > rather than simply detecting that it has occurred (which would be the
> > case if the parity doesn't match the data, or if the two copies of a
> > RAID-1 array don't match).
> >
> 
> Yes, that is what I wrote below. But that means that RAID5 with one
> degraded disk won't be able to reconstruct data on this degraded disk
> because reconstructed extent content won't match checksum. Which kinda
> makes RAID5 pointless.

   Eh? How do you come to that conclusion?

   For data, say you have n-1 good devices, with n-1 blocks on them.
Each block has a checksum in the metadata, so you can read that
checksum, read the blocks, and verify that they're not damaged. From
those n-1 known-good blocks (all data, or one parity and the rest
data) you can reconstruct the remaining block. That reconstructed
block won't be checked against the csum for the missing block -- it'll
just be written and a new csum for it written with it.

   Hugo.
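A minimal sketch of that arithmetic, with a single byte standing in for each
block (values are purely illustrative):

  d1=$((0xA5)); d2=$((0x3C))   # two data blocks in one stripe
  p=$((d1 ^ d2))               # parity block written alongside them

  # the device holding d1 disappears; rebuild it from the n-1 remaining blocks
  rebuilt=$((d2 ^ p))
  printf 'original d1=%02x rebuilt=%02x\n' "$d1" "$rebuilt"
  # identical here - and, as noted above, the rebuilt block is simply written
  # out with a fresh csum rather than being checked against the old one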

> ...
> >
> >> > It looks like uncorrectable failures might occur because parity is
> >> > correct, but the parity checksum is out of date, so the parity checksum
> >> > doesn't match even though data blindly reconstructed from the parity
> >> > *would* match the data.
> >> >
> >>
> >> Yep, that is how I read it too. So if your data is checksummed, it
> >> should at least avoid silent corruption.
> >>

-- 
Hugo Mills | Debugging is like hitting yourself in the head with
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: E2AB1DE4  |PotatoEngineer




Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
On Fri, Jun 24, 2016 at 8:20 AM, Chris Murphy  wrote:

> [root@f24s ~]# filefrag -v /mnt/5/*
> Filesystem type is: 9123683e
> File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   3:2931712..   2931715:  4: last,eof

Hmm ... I wonder what is wrong here (openSUSE Tumbleweed)

nohostname:~ # filefrag -v /mnt/1
Filesystem type is: 9123683e
File size of /mnt/1 is 3072 (1 block of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0: 269376..269376:  1: last,eof
/mnt/1: 1 extent found

But!

nohostname:~ # filefrag -v /etc/passwd
Filesystem type is: 9123683e
File size of /etc/passwd is 1527 (1 block of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..4095:  0..  4095:   4096:
last,not_aligned,inline,eof
/etc/passwd: 1 extent found
nohostname:~ #

Why does it work for one filesystem but not for the other one?
...
>
> So at the old address, it shows the "a..." is still there. And at
> the added single block for this file at new logical and physical
> addresses, is the modification substituting the first "a" for "g".
>
> In this case, no rmw, no partial stripe modification, and no data
> already on-disk is at risk.

You misunderstand the nature of the problem. What is put at risk is data
that is already on disk and "shares" parity with new data.

As example, here are the first 64K in several extents on 4 disk RAID5
with so far single data chunk

item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff
15491 itemsize 176
chunk length 3221225472 owner 2 stripe_len 65536
type DATA|RAID5 num_stripes 4
stripe 0 devid 4 offset 9437184
dev uuid: ed13e42e-1633-4230-891c-897e86d1c0be
stripe 1 devid 3 offset 9437184
dev uuid: 10032b95-3f48-4ea0-a9ee-90064c53da1f
stripe 2 devid 2 offset 1074790400
dev uuid: cd749bd9-3d72-43b4-89a8-45e4a92658cf
stripe 3 devid 1 offset 1094713344
dev uuid: 41538b9f-3869-4c32-b3e2-30aa2ea1534e
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824


item 5 key (1 DEV_EXTENT 1094713344) itemoff 16027 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 7 key (2 DEV_EXTENT 1074790400) itemoff 15931 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 9 key (3 DEV_EXTENT 9437184) itemoff 15835 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824
item 11 key (4 DEV_EXTENT 9437184) itemoff 15739 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 1103101952 length 1073741824

where devid 1 = sdb1, 2 = sdc1 etc.

Now let's write some data (I created several files) up to 64K in size:

mirror 1 logical 1103364096 physical 1074855936 device /dev/sdc1
mirror 2 logical 1103364096 physical 9502720 device /dev/sde1
mirror 1 logical 1103368192 physical 1074860032 device /dev/sdc1
mirror 2 logical 1103368192 physical 9506816 device /dev/sde1
mirror 1 logical 1103372288 physical 1074864128 device /dev/sdc1
mirror 2 logical 1103372288 physical 9510912 device /dev/sde1
mirror 1 logical 1103376384 physical 1074868224 device /dev/sdc1
mirror 2 logical 1103376384 physical 9515008 device /dev/sde1
mirror 1 logical 1103380480 physical 1074872320 device /dev/sdc1
mirror 2 logical 1103380480 physical 9519104 device /dev/sde1

Note that btrfs allocates 64K on the same device before switching to
the next one. What is a bit misleading here is that sdc1 is data and sde1 is
parity (you can see it in the checksum tree, where only items for sdc1
exist).

Now let's write next 64k and see what happens

nohostname:~ # btrfs-map-logical -l 1103429632 -b 65536 /dev/sdb1
mirror 1 logical 1103429632 physical 1094778880 device /dev/sdb1
mirror 2 logical 1103429632 physical 9502720 device /dev/sde1

See? btrfs now allocates a new stripe on sdb1; this stripe is at the
same offset as the previous one on sdc1 (64K) and so shares the same
parity stripe on sde1. If you compare the 64K on sde1 at offset 9502720
before and after, you will see that it has changed. IN PLACE. Without
CoW. This is exactly what puts the existing data on sdc1 at risk - if sdb1
is updated but sde1 is not, an attempt to reconstruct the data on sdc1 will
either fail (if we have checksums) or result in silent corruption.
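A minimal sketch of how one might capture that before/after comparison (the
offset and device come from the output above; everything else is illustrative):

  # snapshot the 64K parity strip on sde1 before writing the next 64K of data
  dd if=/dev/sde1 bs=4096 skip=$((9502720/4096)) count=16 of=/tmp/parity.before

  # ... write the second 64K of file data, sync ...

  dd if=/dev/sde1 bs=4096 skip=$((9502720/4096)) count=16 of=/tmp/parity.after
  cmp /tmp/parity.before /tmp/parity.after   # differs: rewritten in place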

Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Andrei Borzenkov
On Fri, Jun 24, 2016 at 11:50 AM, Hugo Mills  wrote:
> On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
>> 24.06.2016 04:47, Zygo Blaxell пишет:
>> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
>> >> wrote:
>> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>> >>> checksum.
>> >>
>> >> Yeah I'm kinda confused on this point.
>> >>
>> >> https://btrfs.wiki.kernel.org/index.php/RAID56
>> >>
>> >> It says there is a write hole for Btrfs. But defines it in terms of
>> >> parity possibly being stale after a crash. I think the term comes not
>> >> from merely parity being wrong but parity being wrong *and* then being
>> >> used to wrongly reconstruct data because it's blindly trusted.
>> >
>> > I think the opposite is more likely, as the layers above raid56
>> > seem to check the data against sums before raid56 ever sees it.
>> > (If those layers seem inverted to you, I agree, but OTOH there are
>> > probably good reason to do it that way).
>> >
>>
>> Yes, that's how I read code as well. btrfs layer that does checksumming
>> is unaware of parity blocks at all; for all practical purposes they do
>> not exist. What happens is approximately
>>
>> 1. logical extent is allocated and checksum computed
>> 2. it is mapped to physical area(s) on disks, skipping over what would
>> be parity blocks
>> 3. when these areas are written out, RAID56 parity is computed and filled in
>>
>> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.
>
>Checksums are not parity, correct. However, every data block
> (including, I think, the parity) is checksummed and put into the csum
> tree. This allows the FS to determine where damage has occurred,
> rather than simply detecting that it has occurred (which would be the
> case if the parity doesn't match the data, or if the two copies of a
> RAID-1 array don't match).
>

Yes, that is what I wrote below. But that means that RAID5 with one
degraded disk won't be able to reconstruct data on this degraded disk
because reconstructed extent content won't match checksum. Which kinda
makes RAID5 pointless.

...
>
>> > It looks like uncorrectable failures might occur because parity is
>> > correct, but the parity checksum is out of date, so the parity checksum
>> > doesn't match even though data blindly reconstructed from the parity
>> > *would* match the data.
>> >
>>
>> Yep, that is how I read it too. So if your data is checksummed, it
>> should at least avoid silent corruption.
>>


Re: Adventures in btrfs raid5 disk recovery

2016-06-24 Thread Hugo Mills
On Fri, Jun 24, 2016 at 07:02:34AM +0300, Andrei Borzenkov wrote:
> 24.06.2016 04:47, Zygo Blaxell пишет:
> > On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> >> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
> >> wrote:
> >>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> >>> checksum.
> >>
> >> Yeah I'm kinda confused on this point.
> >>
> >> https://btrfs.wiki.kernel.org/index.php/RAID56
> >>
> >> It says there is a write hole for Btrfs. But defines it in terms of
> >> parity possibly being stale after a crash. I think the term comes not
> >> from merely parity being wrong but parity being wrong *and* then being
> >> used to wrongly reconstruct data because it's blindly trusted.
> > 
> > I think the opposite is more likely, as the layers above raid56
> > seem to check the data against sums before raid56 ever sees it.
> > (If those layers seem inverted to you, I agree, but OTOH there are
> > probably good reason to do it that way).
> > 
> 
> Yes, that's how I read code as well. btrfs layer that does checksumming
> is unaware of parity blocks at all; for all practical purposes they do
> not exist. What happens is approximately
> 
> 1. logical extent is allocated and checksum computed
> 2. it is mapped to physical area(s) on disks, skipping over what would
> be parity blocks
> 3. when these areas are written out, RAID56 parity is computed and filled in
> 
> IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

   Checksums are not parity, correct. However, every data block
(including, I think, the parity) is checksummed and put into the csum
tree. This allows the FS to determine where damage has occurred,
rather than simply detecting that it has occurred (which would be the
case if the parity doesn't match the data, or if the two copies of a
RAID-1 array don't match).

   (Note that csums for metadata are stored in the metadata block
itself, not in the csum tree).

   Hugo.

> > It looks like uncorrectable failures might occur because parity is
> > correct, but the parity checksum is out of date, so the parity checksum
> > doesn't match even though data blindly reconstructed from the parity
> > *would* match the data.
> > 
> 
> Yep, that is how I read it too. So if your data is checksummed, it
> should at least avoid silent corruption.
> 
> >> I don't read code well enough, but I'd be surprised if Btrfs
> >> reconstructs from parity and doesn't then check the resulting
> >> reconstructed data to its EXTENT_CSUM.
> > 
> > I wouldn't be surprised if both things happen in different code paths,
> > given the number of different paths leading into the raid56 code and
> > the number of distinct failure modes it seems to have.
> > 
> 
> Well, the problem is that parity block cannot be redirected on write as
> data blocks; which makes it impossible to version control it. The only
> solution I see is to always use full stripe writes by either wasting
> time in fixed width stripe or using variable width, so that every stripe
> always gets new version of parity. This makes it possible to keep parity
> checksums like data checksums.
> 



-- 
Hugo Mills | Darkling's First Law of Filesystems:
hugo@... carfax.org.uk | The user hates their data
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Chris Murphy
On Thu, Jun 23, 2016 at 8:07 PM, Zygo Blaxell
 wrote:

>> With simple files changing one character with vi and gedit,
>> I get completely different logical and physical numbers with each
>> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
>> raid5).
>
> You are COWing the entire file because vi and gedit do truncate followed
> by full-file write.

I'm seeing the file inode change with either a vi or gedit
modification, even when the file size is exactly the same, just a character
substitution. So as far as VFS and Btrfs are concerned, it's an entirely
different file; it's like the faux-CoW that would have happened on any
file system, not an overwrite.


> Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
> edit the file with a sector-level hex editor.

The inode is now the same; one of the 4096 byte blocks is
dereferenced, a new 4096 byte block is referenced and written, the
other 3 blocks remain untouched, and the other files in the stripe remain
untouched. So it's pretty clearly cow'd in this case.

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931712..   2931715:  4: last,eof
/mnt/5/a.txt: 1 extent found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931716..   2931719:  4: last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931720..   2931723:  4: last,eof
/mnt/5/c.txt: 1 extent found
File size of /mnt/5/d.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931724..   2931727:  4: last,eof
/mnt/5/d.txt: 1 extent found
File size of /mnt/5/e.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931728..   2931731:  4: last,eof
/mnt/5/e.txt: 1 extent found

[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt

[root@f24s ~]# btrfs-map-logical -l $[4096*2931712] /dev/VG/a
mirror 1 logical 12008292352 physical 34603008 device /dev/mapper/VG-a
mirror 2 logical 12008292352 physical 1108344832 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931716] /dev/VG/a
mirror 1 logical 12008308736 physical 34619392 device /dev/mapper/VG-a
mirror 2 logical 12008308736 physical 1108361216 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931720] /dev/VG/a
mirror 1 logical 12008325120 physical 34635776 device /dev/mapper/VG-a
mirror 2 logical 12008325120 physical 1108377600 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931724] /dev/VG/a
mirror 1 logical 12008341504 physical 34652160 device /dev/mapper/VG-a
mirror 2 logical 12008341504 physical 1108393984 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*2931728] /dev/VG/a
mirror 1 logical 12008357888 physical 1048576 device /dev/mapper/VG-b
mirror 2 logical 12008357888 physical 1108344832 device /dev/mapper/VG-c


[root@f24s ~]# echo -n "g" | dd of=/mnt/5/a.txt conv=notrunc
0+1 records in
0+1 records out
1 byte copied, 0.000314582 s, 3.2 kB/s
[root@f24s ~]# ls -li /mnt/5/*
285 -rw-r--r--. 1 root root 16383 Jun 23 23:06 /mnt/5/a.txt
286 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/b.txt
287 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/c.txt
288 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/d.txt
289 -rw-r--r--. 1 root root 16383 Jun 23 22:57 /mnt/5/e.txt

[root@f24s ~]# filefrag -v /mnt/5/*
Filesystem type is: 9123683e
File size of /mnt/5/a.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   0:2931732..   2931732:  1:
   1:1..   3:2931713..   2931715:  3:2931733: last,eof
/mnt/5/a.txt: 2 extents found
File size of /mnt/5/b.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931716..   2931719:  4: last,eof
/mnt/5/b.txt: 1 extent found
File size of /mnt/5/c.txt is 16383 (4 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   3:2931720..   2931723:  4:  

Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Andrei Borzenkov
24.06.2016 04:47, Zygo Blaxell пишет:
> On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
>> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
>> wrote:
>>> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
>>> checksum.
>>
>> Yeah I'm kinda confused on this point.
>>
>> https://btrfs.wiki.kernel.org/index.php/RAID56
>>
>> It says there is a write hole for Btrfs. But defines it in terms of
>> parity possibly being stale after a crash. I think the term comes not
>> from merely parity being wrong but parity being wrong *and* then being
>> used to wrongly reconstruct data because it's blindly trusted.
> 
> I think the opposite is more likely, as the layers above raid56
> seem to check the data against sums before raid56 ever sees it.
> (If those layers seem inverted to you, I agree, but OTOH there are
> probably good reason to do it that way).
> 

Yes, that's how I read code as well. btrfs layer that does checksumming
is unaware of parity blocks at all; for all practical purposes they do
not exist. What happens is approximately

1. logical extent is allocated and checksum computed
2. it is mapped to physical area(s) on disks, skipping over what would
be parity blocks
3. when these areas are written out, RAID56 parity is computed and filled in

IOW btrfs checksums are for (meta)data and RAID56 parity is not data.

> It looks like uncorrectable failures might occur because parity is
> correct, but the parity checksum is out of date, so the parity checksum
> doesn't match even though data blindly reconstructed from the parity
> *would* match the data.
> 

Yep, that is how I read it too. So if your data is checksummed, it
should at least avoid silent corruption.

>> I don't read code well enough, but I'd be surprised if Btrfs
>> reconstructs from parity and doesn't then check the resulting
>> reconstructed data to its EXTENT_CSUM.
> 
> I wouldn't be surprised if both things happen in different code paths,
> given the number of different paths leading into the raid56 code and
> the number of distinct failure modes it seems to have.
> 

Well, the problem is that parity block cannot be redirected on write as
data blocks; which makes it impossible to version control it. The only
solution I see is to always use full stripe writes by either wasting
time in fixed width stripe or using variable width, so that every stripe
always gets new version of parity. This makes it possible to keep parity
checksums like data checksums.





Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > I expect that parity is in this data block group, and therefore is
> > checksummed the same as any other data in that block group.
> 
> This appears to be wrong. Comparing the same file, one file only, one
> two new Btrfs volumes, one volume single, one volume raid5, I get a
> single csum tree entry:
> 
> raid5
> item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 
> 128
> extent csum item
> 
> single
> 
> item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
> extent csum item
> 
> They're both the same size. They both contain the same data. So it
> looks like parity is not separately checksummed.

I'm inclined to agree because I didn't find any code that *writes* parity
csums...but if there are no parity csums, what does this code do?

scrub.c:

    static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
    [...]
                    ret = btrfs_lookup_csums_range(csum_root,
                                                   extent_logical,
                                                   extent_logical + extent_len - 1,
                                                   &sctx->csum_list, 1);
                    if (ret)
                            goto out;

                    ret = scrub_extent_for_parity(sparity, extent_logical,
                                                  extent_len,
                                                  extent_physical,
                                                  extent_dev, flags,
                                                  generation,
                                                  extent_mirror_num);





Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 05:37:09PM -0600, Chris Murphy wrote:
> > So in your example of degraded writes, no matter what the on disk
> > format makes it discoverable there is a problem:
> >
> > A. The "updating" is still always COW so there is no overwriting.
> 
> There is RMW code in btrfs/raid56.c but I don't know when that gets
> triggered. 

RMW seems to be for cases where part of a stripe is modified but the
entire stripe has not yet been read into memory.  It reads the remaining
blocks (reconstructing missing blocks if necessary) then calculates
new parity blocks.
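A rough sketch of that sequence, one byte per block (not the kernel code, just
the arithmetic it implies):

  # stripe currently on disk: three data blocks and parity p = d0 ^ d1 ^ d2
  d0=$((0x11)); d1=$((0x22)); d2=$((0x33))
  p=$((d0 ^ d1 ^ d2))

  # RMW of d1: read the untouched blocks back in, substitute the new data,
  # recompute parity for the whole stripe, then write d1 and p out again
  new_d1=$((0x44))
  new_p=$((d0 ^ new_d1 ^ d2))
  printf 'parity %02x -> %02x\n' "$p" "$new_p"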

> With simple files changing one character with vi and gedit,
> I get completely different logical and physical numbers with each
> change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
> raid5).

You are COWing the entire file because vi and gedit do truncate followed
by full-file write.

Try again with 'dd conv=notrunc bs=4k count=1 seek=N of=...' or
edit the file with a sector-level hex editor.

> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:2931744..   2931775: 32: last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
> mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
> mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
> [root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
> [root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
> Filesystem type is: 9123683e
> File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:2931776..   2931807: 32: last,eof
> /mnt/5/64k-a-then64k-b.txt: 1 extent found
> [root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
> No extent found at range [120085610496,120085626880)
> [root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
> mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
> mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
> [root@f24s ~]#
> 
> There is a neat bug/rfe I found for btrfs-map-logical, it doesn't
> report back the physical locations for all num_stripes on the volume.
> It only spits back two, and sometimes it's the two data strips,
> sometimes it's one data and one parity strip.
> 
> 
> [1]
> https://bugzilla.kernel.org/show_bug.cgi?id=120941
> 
> 
> -- 
> Chris Murphy
> 




Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 06:26:22PM -0600, Chris Murphy wrote:
> On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  
> wrote:
> > The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the 
> > checksum.
> 
> Yeah I'm kinda confused on this point.
> 
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> It says there is a write hole for Btrfs. But defines it in terms of
> parity possibly being stale after a crash. I think the term comes not
> from merely parity being wrong but parity being wrong *and* then being
> used to wrongly reconstruct data because it's blindly trusted.

I think the opposite is more likely, as the layers above raid56
seem to check the data against sums before raid56 ever sees it.
(If those layers seem inverted to you, I agree, but OTOH there are
probably good reason to do it that way).

It looks like uncorrectable failures might occur because parity is
correct, but the parity checksum is out of date, so the parity checksum
doesn't match even though data blindly reconstructed from the parity
*would* match the data.

> I don't read code well enough, but I'd be surprised if Btrfs
> reconstructs from parity and doesn't then check the resulting
> reconstructed data to its EXTENT_CSUM.

I wouldn't be surprised if both things happen in different code paths,
given the number of different paths leading into the raid56 code and
the number of distinct failure modes it seems to have.





Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Zygo Blaxell
On Thu, Jun 23, 2016 at 09:32:50PM +0200, Goffredo Baroncelli wrote:
> The raid write hole happens when a stripe is not completely written
> on the platters: the parity and the related data mismatch. In this
> case a "simple" raid5 may return wrong data if the parity is used to
> compute the data. But this happens because a "simple" raid5 is unable
> to detected if the returned data is right or none.
> 
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the
> checksum.

Checksums do not help with the raid5 write hole.  The way btrfs does
checksums might even make it worse.

ZFS reduces the number of disks in a stripe when a disk failure
is detected so that writes are always in non-degraded mode, and they
presumably avoid sub-stripe-width data allocations or use journalling
to avoid the write hole.  btrfs seems to use neither tactic.  At best,
btrfs will avoid creating new block groups on disks that are missing at
mount time, and it doesn't deal with sub-stripe-width allocations at all.

I'm working from two assumptions as I haven't found all the relevant
code yet:

1.  btrfs writes parity stripes at fixed locations relative to
the data in the same stripe.  If this is true, then the parity
blocks are _not_ CoW while the data blocks and their checksums
_are_ CoW.  I don't know if the parity block checksums are also
CoW.

2.  btrfs sometimes puts data from two different transactions in
the same stripe at the same time--a fundamental violation of the
CoW concept.  I inferred this from the logical block addresses.

Unless I'm missing something in the code somewhere, parity blocks can
have out-of-date checksums for short periods of time between flushes and
commits.  This would lose data by falsely reporting valid parity blocks
as checksum failures.  If any *single* failure occurs at the same time
(such as a missing write or disk failure) a small amount of data will
be lost.

> BTRFS is able to discard the wrong data: i.e. in case of a 3 disks
> raid5, the right data may be extracted from the data1+data2 or if the
> checksum doesn't match from data1+parity or if the checksum doesn't
> match from data2+parity.

Suppose we have a sequence like this (3-disk RAID5 array, one stripe
containing 2 data and 1 parity block) starting with the stripe empty:

1.  write data block 1 to disk 1 of stripe (parity is now invalid, no 
checksum yet)

2.  write parity block to disk 3 in stripe (parity becomes valid again, 
no checksum yet)

3.  commit metadata pointing to block 1 (parity and checksums now valid)

4.  write data block 2 to disk 2 of stripe (parity and parity checksum 
now invalid)

5.  write parity block to disk 3 in stripe (parity valid now, parity 
checksum still invalid)

6.  commit metadata pointing to block 2 (parity and checksums now valid)

We can be interrupted at any point between step 1 and 4 with no data loss.
Before step 3 the data and parity blocks are not part of the extent
tree so their contents are irrelevant.  After step 3 (assuming each
step is completed in order) data block 1 is part of the extent tree and
can be reconstructed if any one disk fails.  This is the part of btrfs
raid5 that works.

If we are interrupted between steps 4 and 6 (e.g. power fails), a single
disk failure or corruption will cause data loss in block 1.  Note that
block 1 is *not* written between steps 4 and 6, so we are retroactively
damaging some previously written data that is not part of the current
transaction.

If we are interrupted between steps 4 and 5, we can no longer reconstruct
block 1 (block2 ^ parity) or block 2 (block1 ^ parity) because the parity
block doesn't match the data blocks in the same stripe
(i.e. block1 ^ block2 != parity).

If we are interrupted between step 5 and 6, the parity block checksum
committed at step 3 will fail.  Data block 2 will not be accessible
since the metadata was not written to point to it, but data block 1
will be intact, readable, and have a correct checksum as long as none
of the disks fail.  This can be repaired by a scrub (scrub will simply
throw the parity block away and reconstruct it from block1 and block2).
If disk 1 fails before the next scrub, data block 1 will be lost because
btrfs will believe the parity block is incorrect even though it is not.

This risk happens on *every* write to a stripe that is not a full stripe
write and contains existing committed data blocks.  It will occur more
often on full and heavily fragmented filesystems (filesystems which 
have these properties are more likely to write new data on stripes that 
already contain old data).

In cases where an entire stripe is written at once, or a stripe is
partially filled but no further writes ever modify the stripe, everything
works as intended in btrfs.
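To make the window concrete, here is the same sequence with one byte standing
in for each block (values are illustrative); the interruption shown is the one
between steps 4 and 5:

  # steps 1-3: block 1 written and committed, the rest of the stripe empty
  b1=$((0xAA)); empty=$((0x00))
  parity=$((b1 ^ empty))          # parity written for the half-filled stripe

  # step 4: block 2 reaches disk 2 ...
  b2=$((0x55))
  # ... power fails before step 5, so the parity block is never updated

  # later, disk 1 fails; a degraded read of block 1 uses block 2 plus the
  # stale parity
  rebuilt_b1=$((b2 ^ parity))
  printf 'committed b1=%02x, rebuilt b1=%02x\n' "$b1" "$rebuilt_b1"
  # mismatch: data committed back in step 3 has been retroactively damaged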

> NOTE2: this works if only one write is corrupted. If more write

Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Chris Murphy
On Thu, Jun 23, 2016 at 1:32 PM, Goffredo Baroncelli  wrote:

>
> The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum.

Yeah I'm kinda confused on this point.

https://btrfs.wiki.kernel.org/index.php/RAID56

It says there is a write hole for Btrfs. But defines it in terms of
parity possibly being stale after a crash. I think the term comes not
from merely parity being wrong but parity being wrong *and* then being
used to wrongly reconstruct data because it's blindly trusted.

I don't read code well enough, but I'd be surprised if Btrfs
reconstructs from parity and doesn't then check the resulting
reconstructed data to its EXTENT_CSUM.



-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Chris Murphy
On Wed, Jun 22, 2016 at 11:14 AM, Chris Murphy  wrote:

>
> However, from btrfs-debug-tree from a 3 device raid5 volume:
>
> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 1103101952) itemoff 15621 itemsize 144
> chunk length 2147483648 owner 2 stripe_len 65536
> type DATA|RAID5 num_stripes 3
> stripe 0 devid 2 offset 9437184
> dev uuid: 3c6f37eb-5cae-455a-82bc-a1b0877dea55
> stripe 1 devid 1 offset 1094713344
> dev uuid: 13104709-6f30-4982-979e-4f055c326fad
> stripe 2 devid 3 offset 1083179008
> dev uuid: d45fc482-a0c1-46b1-98c1-41cea5a11c80
>
> I expect that parity is in this data block group, and therefore is
> checksummed the same as any other data in that block group.

This appears to be wrong. Comparing the same file (one file only) on
two new Btrfs volumes, one volume single and one volume raid5, I get a
single csum tree entry on each:

raid5
item 0 key (EXTENT_CSUM EXTENT_CSUM 12009865216) itemoff 16155 itemsize 128
extent csum item

single

item 0 key (EXTENT_CSUM EXTENT_CSUM 2168717312) itemoff 16155 itemsize 128
extent csum item

They're both the same size. They both contain the same data. So it
looks like parity is not separately checksummed.

If there's a missing 64KiB data strip (bad sector, or dead drive), the
reconstruction of that strip from parity should match available csums
for those blocks. So in this way it's possible to infer if the parity
strip is bad.  But it also means assuming that everything else in this
full stripe, the remaining data strips and their csums, is correct.




> So in your example of degraded writes, no matter what the on disk
> format makes it discoverable there is a problem:
>
> A. The "updating" is still always COW so there is no overwriting.

There is RMW code in btrfs/raid56.c but I don't know when that gets
triggered. With simple files changing one character with vi and gedit,
I get completely different logical and physical numbers with each
change, so it's clearly cowing the entire stripe (192KiB in my 3 dev
raid5).


[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:2931744..   2931775: 32: last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*2931744] /dev/VG/a
mirror 1 logical 12008423424 physical 1114112 device /dev/mapper/VG-b
mirror 2 logical 12008423424 physical 34668544 device /dev/mapper/VG-a
[root@f24s ~]# vi /mnt/5/64k-a-then64k-b.txt
[root@f24s ~]# filefrag -v /mnt/5/64k-a-then64k-b.txt
Filesystem type is: 9123683e
File size of /mnt/5/64k-a-then64k-b.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:2931776..   2931807: 32: last,eof
/mnt/5/64k-a-then64k-b.txt: 1 extent found
[root@f24s ~]# btrfs-map-logical -l $[4096*29317776] /dev/VG/a
No extent found at range [120085610496,120085626880)
[root@f24s ~]# btrfs-map-logical -l $[4096*2931776] /dev/VG/a
mirror 1 logical 12008554496 physical 1108475904 device /dev/mapper/VG-c
mirror 2 logical 12008554496 physical 1179648 device /dev/mapper/VG-b
[root@f24s ~]#

There is a neat bug/RFE I found for btrfs-map-logical [1]: it doesn't
report back the physical locations for all num_stripes on the volume.
It only spits back two, and sometimes it's the two data strips,
sometimes one data and one parity strip.


[1]
https://bugzilla.kernel.org/show_bug.cgi?id=120941


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-06-23 Thread Goffredo Baroncelli
On 2016-06-22 22:35, Zygo Blaxell wrote:
>> I do not know the exact nature of the Btrfs raid56 write hole. Maybe a
>> > dev or someone who knows can explain it.
> If you have 3 raid5 devices, they might be laid out on disk like this
> (e.g. with a 16K stripe width):
> 
>   Address:  0..16K      16..32K     32..64K
> Disk 1: [0..16K]    [32..64K]   [PARITY]
> Disk 2: [16..32K]   [PARITY]    [80..96K]
> Disk 3: [PARITY]    [64..80K]   [96..112K]
> 
> btrfs logical address ranges are inside [].  Disk physical address ranges
> are shown at the top of each column.  (I've simplified the mapping here;
> pretend all the addresses are relative to the start of a block group).
> 
> If we want to write a 32K extent at logical address 0, we'd write all
> three disks in one column (disk1 gets 0..16K, disk2 gets 16..32K, disk3
> gets parity for the other two disks).  The parity will be temporarily
> invalid for the time between the first disk write and the last disk write.
> In non-degraded mode the parity isn't necessary, but in degraded mode
> the entire column cannot be reconstructed because of invalid parity.
> 
> To see why this could be a problem, suppose btrfs writes a 4K extent at
> logical address 32K.  This requires updating (at least) disk 1 (where the
> logical address 32K resides) and disk 2 (the parity for this column).
> This means any data that existed at logical addresses 36K..80K (or at
> least 32..36K and 64..68K) has its parity temporarily invalidated between
> the write to the first and last disks.  If there were metadata pointing
> to other blocks in this column, the metadata temporarily points to
> damaged data during the write.  If there is no data in other blocks in
> this column then it doesn't matter that the parity doesn't match--the
> content of the reconstructed unallocated blocks would be undefined
> even in the success cases.
[...]

Sorry, but I can't follow you.

RAID5 protects you against a failure (or a missing write) of a *single* disk.

The raid5 write hole happens when a stripe is not completely written to the 
platters: the parity and the related data mismatch. In this case a "simple" 
raid5 may return wrong data if the parity is used to compute the data. But this 
happens because a "simple" raid5 is unable to detect whether the returned data 
is right or wrong.

The raid5 write hole is avoided in BTRFS (and in ZFS) thanks to the checksum. 

BTRFS is able to discard the wrong data: i.e. in case of a 3-disk raid5, the 
right data may be extracted from data1+data2, or, if the checksum doesn't 
match, from data1+parity, or, if that checksum doesn't match either, from 
data2+parity. 
NOTE1: the real difference between the BTRFS (and ZFS) raid5 and a "simple" 
raid5 is that the latter doesn't try another pair of disks.
NOTE2: this works only if a single write is corrupted. If more writes (== more 
disks) are involved, you get checksum mismatches everywhere; if more than one 
write is corrupted, raid5 is unable to protect you. 

In case of "degraded mode", you don't have any redundancy. So if a stripe of a 
degraded filesystem is not fully written to the disk, is like a block not fully 
written to the disk. And you have checksums mismatch. But this is not what is 
called raid write hole.
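
The "try another pair of disks until a checksum matches" idea above can be
sketched in a few lines (a toy model, not btrfs code: XOR parity, crc32
standing in for the btrfs csum, and the helper names are all illustrative
assumptions):

    # Toy model: pick whichever reconstruction of a 3-disk raid5 stripe
    # matches the stored per-block checksums.  Illustrative only.
    import zlib

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def reconstruct(d1, d2, parity, csum1, csum2):
        candidates = [
            (d1, d2),               # trust both data strips as read
            (d1, xor(d1, parity)),  # rebuild data2 from data1 + parity
            (xor(d2, parity), d2),  # rebuild data1 from data2 + parity
        ]
        for c1, c2 in candidates:
            if zlib.crc32(c1) == csum1 and zlib.crc32(c2) == csum2:
                return c1, c2
        # more than one strip is bad: every combination fails its csum
        raise IOError("uncorrectable: no reconstruction matches the csums")

    d1, d2 = b"data-one", b"data-two"
    parity = xor(d1, d2)
    # one corrupted write (d2 trashed on disk) is still recoverable:
    print(reconstruct(d1, b"garbage!", parity, zlib.crc32(d1), zlib.crc32(d2)))

A plain raid5 would stop at the first candidate; the checksum is what lets
btrfs and ZFS reject it and try the other combinations, and it is also why two
corrupted writes in the same stripe leave nothing valid to return.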


On 2016-06-22 22:35, Zygo Blaxell wrote:
> If in the future btrfs allocates physical block 2412725692 to
> a different file, up to 3 other blocks in this file (most likely
> 2412725689..2412725691) could be lost if a crash or disk I/O error also
> occurs during the same transaction.  btrfs does do this--in fact, the
> _very next block_ allocated by the filesystem is 2412725692:
> 
>   # head -c 4096 < /dev/urandom >> f; sync; filefrag -v f
>   Filesystem type is: 9123683e
>   File size of f is 45056 (11 blocks of 4096 bytes)
>ext: logical_offset:physical_offset: length:   expected: 
> flags:
>  0:0..   0: 2412725689..2412725689:  1:
>  1:1..   1: 2412725690..2412725690:  1:
>  2:2..   2: 2412725691..2412725691:  1:
>  3:3..   3: 2412725701..2412725701:  1: 2412725692:
>  4:4..   4: 2412725693..2412725693:  1: 2412725702:
>  5:5..   5: 2412725694..2412725694:  1:
>  6:6..   6: 2412725695..2412725695:  1:
>  7:7..   7: 2412725698..2412725698:  1: 2412725696:
>  8:8..   8: 2412725699..2412725699:  1:
>  9:9..   9: 2412725700..2412725700:  1:
> 10:   10..  10: 2412725692..2412725692:  1: 2412725701: 
> last,eof
>   f: 5 extents found

You are assuming that if you touch a block, all the blocks of the same stripe 
spread over the disks are involved. I disagree. The only parts which are 
involved are the part 

Re: Adventures in btrfs raid5 disk recovery

2016-06-22 Thread Zygo Blaxell
On Wed, Jun 22, 2016 at 11:14:30AM -0600, Chris Murphy wrote:
> > Before deploying raid5, I tested these by intentionally corrupting
> > one disk in an otherwise healthy raid5 array and watching the result.
> 
> It's difficult to reproduce if no one understands how you
> intentionally corrupted that disk. Literal reading, you corrupted the
> entire disk, but that's impractical. The fs is expected to behave
> differently depending on what's been corrupted and how much.

The first round of testing I did (a year ago, when deciding whether
btrfs raid5 was mature enough to start using) was:

Create a 5-disk RAID5

Put some known data on it until it's full (i.e. random test
patterns).  At the time I didn't do any tests involving
compressible data, which I now realize was a serious gap in
my test coverage.

Pick 1000 random blocks (excluding superblocks) on one of the
disks and write random data to them

Read and verify the data through the filesystem, do scrub, etc.
Exercise all the btrfs features related to error reporting
and recovery.
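
For reference, the corruption step of a test like this can be approximated
with a short script (a destructive sketch only: the device name, the 4K block
size, and the margin kept around the superblock copies are assumptions, so
point it at a throwaway test device):

    # Overwrite 1000 random 4K blocks on one member device with random
    # data, keeping clear of the btrfs superblock copies.  DESTRUCTIVE.
    import os, random

    DEV = "/dev/sdX"      # assumed: one member of a disposable test array
    BLOCK = 4096
    SUPERBLOCKS = (64 * 1024, 64 * 1024**2, 256 * 1024**3)

    with open(DEV, "r+b", buffering=0) as dev:
        size = dev.seek(0, os.SEEK_END)
        for _ in range(1000):
            while True:
                off = random.randrange(0, size - BLOCK, BLOCK)
                # keep a 1 MiB margin around each superblock mirror
                if all(abs(off - sb) > 1024 * 1024 for sb in SUPERBLOCKS):
                    break
            dev.seek(off)
            dev.write(os.urandom(BLOCK))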

I expected scrub and dev stat to report accurate corruption counts (except
for the 1 in 4 billion case where a bad csum matches by random chance),
and I expected all the data to be reconstructed, since only one drive was
corrupted (assuming there are no unplanned disk failures during the
test, obviously) and the corruption occurred while the filesystem was
offline, so there was no possibility of a RAID write hole.

My results from that testing were that everything worked except for the
mostly-harmless quirk where scrub counts errors on random disks instead
of the disk where the errors occur.

> I don't often use the -Bd options, so I haven't tested it thoroughly,
> but what you're describing sounds like a bug in user space tools. I've
> found it reflects the same information as btrfs dev stats, and dev
> stats have been reliable in my testing.

Don't the user space tools just read what the kernel tells them?
I don't know how *not* to produce this behavior on btrfs raid5 or raid6.
It should show up on any btrfs raid56 system.

> > A different thing happens if there is a crash.  In that case, scrub cannot
> > repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> > behaves this way when disks turn bad.  I had assumed it was a software bug
> > in the comparatively new raid5 support that would get fixed eventually.
> 
> This is really annoyingly vague. You don't give a complete recipe for
> reproducing this sequence. Here's what I'm understanding and what I'm
> missing:
> 
> 1. The intentional corruption, extent of which is undefined, is still present.

No intentional corruption here (quote:  "A different thing happens if
there is a crash...").  Now we are talking about the baseline behavior
when there is a crash on a btrfs raid5 array, especially crashes
triggered by a disk-level failure (e.g. watchdog timeout because a disk
or controller has hung) but also ordinary power failures or other crashes
triggered by external causes.

> 2. A drive is bad, but that doesn't tell us if it's totally dead, or
> only intermittently spitting out spurious information.

The most common drive-initiated reboot case is that one drive temporarily
locks up and triggers the host to perform a watchdog reset.  The reset
is successful and the filesystem can be mounted again with all drives
present; however, a small amount of raid5 data appears to be corrupted
each time.  The raid1 metadata passes all the integrity checks I can
throw at it:  btrfs check, scrub, balance, walk the filesystem with find
-type f -exec cat ..., compare with the last backup, etc.
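
The read-everything pass can be scripted so the failing paths collect in one
place (a sketch; the mount point is an assumption and errors other than EIO
are ignored for brevity):

    # Walk a mounted filesystem, read every file, and record paths whose
    # reads fail -- on btrfs a csum failure surfaces as EIO on read().
    import errno, os

    MOUNT = "/data"      # assumed mount point

    bad = []
    for root, dirs, files in os.walk(MOUNT):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    while f.read(1 << 20):   # 1 MiB chunks
                        pass
            except OSError as e:
                if e.errno == errno.EIO:
                    bad.append(path)

    print("\n".join(bad))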

Usually when I detect this case, I delete any corrupted data, delete
the disk that triggers the lockups and have no further problems with
that array.

> 3. Is the volume remounted degraded or is the bad drive still being
> used by Btrfs? Because Btrfs has no concept (patches pending) of drive
> faulty state like md, let alone an automatic change to that faulty
> state. It just keeps on trying to read or write to bad drives, even if
> they're physically removed.

In the baseline case the filesystem has all drives present after remount.
It could be as simple as power-cycling the host while writes are active.

> 4. You've initiated a scrub, and the corruption in 1 is not fixed.

In this pattern, btrfs may find both correctable and uncorrectable
corrupted data, usually on one of the drives.  scrub fixes the correctable
corruption, but fails on the uncorrectable.

> OK so what am I missing?

Nothing yet.  The above is the "normal" btrfs raid5 crash experience with
a non-degraded raid5 array.  A few megabytes of corrupt extents can easily
be restored from backups or deleted and everything's fine after that.

In my *current* failure case, I'

Re: Adventures in btrfs raid5 disk recovery

2016-06-22 Thread Chris Murphy
On Mon, Jun 20, 2016 at 7:55 PM, Zygo Blaxell
<ce3g8...@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
>> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>> <ce3g8...@umail.furryterror.org> wrote:
>> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
>>
>> >> For me the critical question is what does "some corrupted sectors" mean?
>> >
>> > On other raid5 arrays, I would observe a small amount of corruption every
>> > time there was a system crash (some of which were triggered by disk
>> > failures, some not).
>>
>> What test are you using to determine there is corruption, and how much
>> data is corrupted? Is this on every disk? Non-deterministically fewer
>> than all disks? Have you identified this as a torn write or
>> misdirected write or is it just garbage at some sectors? And what's
>> the size? Partial sector? Partial md chunk (or fs block?)
>
> In earlier cases, scrub, read(), and btrfs dev stat all reported the
> incidents differently.  Scrub would attribute errors randomly to disks
> (error counts spread randomly across all the disks in the 'btrfs scrub
> status -d' output).  'dev stat' would correctly increment counts on only
> those disks which had individually had an event (e.g. media error or
> SATA bus reset).
>
> Before deploying raid5, I tested these by intentionally corrupting
> one disk in an otherwise healthy raid5 array and watching the result.

It's difficult to reproduce if no one understands how you
intentionally corrupted that disk. Literal reading, you corrupted the
entire disk, but that's impractical. The fs is expected to behave
differently depending on what's been corrupted and how much.

> When scrub identified an inode and offset in the kernel log, the csum
> failure log message matched the offsets producing EIO on read(), but
> the statistics reported by scrub about which disk had been corrupted
> were mostly wrong.  In such cases a scrub could repair the data.

I don't often use the -Bd options, so I haven't tested it thoroughly,
but what you're describing sounds like a bug in user space tools. I've
found it reflects the same information as btrfs dev stats, and dev
stats have been reliable in my testing.


> A different thing happens if there is a crash.  In that case, scrub cannot
> repair the errors.  Every btrfs raid5 filesystem I've deployed so far
> behaves this way when disks turn bad.  I had assumed it was a software bug
> in the comparatively new raid5 support that would get fixed eventually.

This is really annoyingly vague. You don't give a complete recipe for
reproducing this sequence. Here's what I'm understanding and what I'm
missing:

1. The intentional corruption, extent of which is undefined, is still present.
2. A drive is bad, but that doesn't tell us if it's totally dead, or
only intermittently spitting out spurious information.
3. Is the volume remounted degraded or is the bad drive still being
used by Btrfs? Because Btrfs has no concept (patches pending) of drive
faulty state like md, let alone an automatic change to that faulty
state. It just keeps on trying to read or write to bad drives, even if
they're physically removed.
4. You've initiated a scrub, and the corruption in 1 is not fixed.

OK so what am I missing?

Because it sounds to me like you have two copies of data that are
gone. For raid 5 that's data loss, scrub can't fix things. Corruption
is missing data. The bad drive is missing data.

What values do you get for

smartctl -l scterc /dev/sdX
cat /sys/block/sdX/device/timeout



>> This is on Btrfs? This isn't supposed to be possible. Even a literal
>> overwrite of a file is not an overwrite on Btrfs unless the file is
>> nodatacow. Data extents get written, then the metadata is updated to
>> point to those new blocks. There should be flush or fua requests to
>> make sure the order is such that the fs points to either the old or
>> new file, in either case uncorrupted. That's why I'm curious about the
>> nature of this corruption. It sounds like your hardware is not exactly
>> honoring flush requests.
>
> That's true when all the writes are ordered within a single device, but
> possibly not when writes must be synchronized across multiple devices.

I think that's a big problem: the fs cannot be consistent if the super
block points to any tree whose metadata or data isn't on stable media.

But if you think it's happening you might benefit from integrity
checking; maybe try just the metadata one for starters, which is the
check_int mount option (it must be compiled in first for that mount
option to work).

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/check-integrity.c?id=refs/tags/v4.6.2



> 

Re: Adventures in btrfs raid5 disk recovery - update

2016-06-21 Thread Zygo Blaxell
TL;DR:

Kernel 4.6.2 causes a world of pain.  Use 4.5.7 instead.

'btrfs dev stat' doesn't seem to count "csum failed"
(i.e. corruption) errors in compressed extents.



On Sun, Jun 19, 2016 at 11:44:27PM -0400, Zygo Blaxell wrote:
> Not so long ago, I had a disk fail in a btrfs filesystem with raid1
> metadata and raid5 data.  I mounted the filesystem readonly, replaced
> the failing disk, and attempted to recover by adding the new disk and
> deleting the missing disk.

> I'm currently using kernel 4.6.2

That turned out to be a mistake.  4.6.2 has some severe problems.

Over the past few days I've been upgrading other machines from 4.5.7
to 4.6.2.  This morning I saw the aggregate data coming back from
those machines, and it's all bad:  stalls in snapshot delete, balance,
and sync; some machines just lock up with no console messages; a lot of
watchdog timeouts.  None of the machines could get to an uptime over 26
hours and still be in a usable state.

I switched to 4.5.7 and the crashes, balance/delete hangs, and some of
the data corruption modes stopped.

> I'm
> getting EIO randomly all over the filesystem, including in files that were
> written entirely _after_ the disk failure.

There were actually four distinct corruption modes happening:

1.  There are some number (16500 so far) "normal" corrupt blocks:  read
repeatably returns EIO, they show up in scrub with sane log messages,
and replacing the files that contain these blocks makes them go away.
These blocks appear to be contained in extents that coincide with the
date of the disk failure.  Interestingly, no matter how many times I
read these blocks, I get no increase in the 'btrfs dev stat' numbers
even though I get kernel csum failure messages.  That looks like a bug.

2.  When attempting to replace corrupted files with rsync, I had used
'rsync --inplace'.  This caused bad blocks to be overwritten within
extents, but did not necessarily replace the _entire_ extent containing a
bad block.  This creates corrupt blocks that show up in scrub, balance,
and device delete, but not when reading files.  It also updates the
timestamps so a file with old corruption looks "new" to an insufficiently
sophisticated analysis tool.

3.  Files were corrupted while they were written and accessed via NFS.
This created files with correct btrfs checksums, but garbage contents.
This would show up as failures during 'git gc' or rsync checksum
mismatches.  During one of the many VM crashes, any writes in progress at
the time of the crash were lost.  This effectively rewound the filesystem
several minutes each time as btrfs reverts to the previous committed
tree on the next mount.  4.6.2's hanging issues made this worse by
delaying btrfs commits indefinitely.  The NFS clients were completely
unaware of this, so when the VM rebooted, files ended up with holes,
or would just disappear while in use.

4.  After a VM crash and the filesystem reverted to the previous
committed tree, files with bad blocks that had been repaired through
the NFS server or with rsync would be "unrepaired" (i.e. the filesystem
would revert back to the original corrupted blocks after the mount).

Combinations of these could occur as well for extra confusion, and some
corrupted blocks are contained in many files thanks to dedup.

With kernel 4.5.7 there have been no lockups during commit and no VM
crashes, so I haven't seen any of corruption modes 3 and 4 since 4.5.7.

Balance is now running normally to move the remaining data off the
missing disk.  ETA is 558 hours.  See you in mid-July!  ;)



signature.asc
Description: Digital signature


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 09:55:59PM -0400, Zygo Blaxell wrote:
> In this current case, I'm getting things like this:
> 
> [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 
> 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[...]
> The other other weird thing here is that I can't find an example in the
> logs of an extent with an EIO that isn't compressed.  I've been looking
> up a random sample of the extent numbers, matching them up to filefrag
> output, and finding e.g. the one compressed extent in the middle of an
> otherwise uncompressed git pack file.  That's...odd.  Maybe there's a
> problem with compressed extents in particular?  I'll see if I can
> script something to check all the logs at once...

No need for a script:  this message wording appears only in
fs/btrfs/compression.c so it can only ever be emitted by reading a
compressed extent.

Maybe there's a problem specific to raid5, degraded mode, and compressed
extents?



signature.asc
Description: Digital signature


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
> <ce3g8...@umail.furryterror.org> wrote:
> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> 
> >> For me the critical question is what does "some corrupted sectors" mean?
> >
> > On other raid5 arrays, I would observe a small amount of corruption every
> > time there was a system crash (some of which were triggered by disk
> > failures, some not).
> 
> What test are you using to determine there is corruption, and how much
> data is corrupted? Is this on every disk? Non-deterministically fewer
> than all disks? Have you identified this as a torn write or
> misdirected write or is it just garbage at some sectors? And what's
> the size? Partial sector? Partial md chunk (or fs block?)

In earlier cases, scrub, read(), and btrfs dev stat all reported the
incidents differently.  Scrub would attribute errors randomly to disks
(error counts spread randomly across all the disks in the 'btrfs scrub
status -d' output).  'dev stat' would correctly increment counts on only
those disks which had individually had an event (e.g. media error or
SATA bus reset).

Before deploying raid5, I tested these by intentionally corrupting
one disk in an otherwise healthy raid5 array and watching the result.
When scrub identified an inode and offset in the kernel log, the csum
failure log message matched the offsets producing EIO on read(), but
the statistics reported by scrub about which disk had been corrupted
were mostly wrong.  In such cases a scrub could repair the data.

A different thing happens if there is a crash.  In that case, scrub cannot
repair the errors.  Every btrfs raid5 filesystem I've deployed so far
behaves this way when disks turn bad.  I had assumed it was a software bug
in the comparatively new raid5 support that would get fixed eventually.

In this current case, I'm getting things like this:

[12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4105596028 wanted 787343232 mirror 0
[12008.243876] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1689373462 wanted 787343232 mirror 0
[12008.243885] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3621611229 wanted 787343232 mirror 0
[12008.243893] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 113993114 wanted 787343232 mirror 0
[12008.243902] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1464956834 wanted 787343232 mirror 0
[12008.243911] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 2545274038 wanted 787343232 mirror 0
[12008.243942] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4090153227 wanted 787343232 mirror 0
[12008.243952] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243961] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243976] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 172651968 wanted 787343232 mirror 0
[12008.246158] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4129844199 wanted 787343232 mirror 1
[12008.247557] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1374425809 wanted 787343232 mirror 1
[12008.403493] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1567917468 wanted 787343232 mirror 1
[12008.409809] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 2881359629 wanted 787343232 mirror 0
[12008.411165] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3021442070 wanted 787343232 mirror 0
[12008.411180] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3984314874 wanted 787343232 mirror 0
[12008.411189] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 599192427 wanted 787343232 mirror 0
[12008.411199] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 2887010053 wanted 787343232 mirror 0
[12008.411208] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1314141634 wanted 787343232 mirror 0
[12008.411217] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3156167613 wanted 787343232 mirror 0
[12008.411227] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 565550942 wanted 787343232 mirror 0
[12008.411236] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4068631390 wanted 787343232 mirror 0
[12008.411245] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 531263990 wanted 787343232 mirror 0
[120

Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:

>> For me the critical question is what does "some corrupted sectors" mean?
>
> On other raid5 arrays, I would observe a small amount of corruption every
> time there was a system crash (some of which were triggered by disk
> failures, some not).

What test are you using to determine there is corruption, and how much
data is corrupted? Is this on every disk? Non-deterministically fewer
than all disks? Have you identified this as a torn write or
misdirected write or is it just garbage at some sectors? And what's
the size? Partial sector? Partial md chunk (or fs block?)

>  It looked like any writes in progress at the time
> of the failure would be damaged.  In the past I would just mop up the
> corrupt files (they were always the last extents written, easy to find
> with find-new or scrub) and have no further problems.

This is on Btrfs? This isn't supposed to be possible. Even a literal
overwrite of a file is not an overwrite on Btrfs unless the file is
nodatacow. Data extents get written, then the metadata is updated to
point to those new blocks. There should be flush or fua requests to
make sure the order is such that the fs points to either the old or
new file, in either case uncorrupted. That's why I'm curious about the
nature of this corruption. It sounds like your hardware is not exactly
honoring flush requests.

With md raid and any other file system, it's pure luck that such
corrupted writes would only affect data extents and not the fs
metadata. Corrupted fs metadata is not well tolerated by any file
system, not least of which is most of them have no idea the metadata
is corrupt. At least Btrfs can determine this and if there's another
copy use that or just stop and face plant before more damage happens.
Maybe an exception now is XFS v5 metadata which employs checksumming.
But it still doesn't know if data extents are wrong (i.e. a torn or
misdirected write).

I've had perhaps a hundred power off during write with Btrfs and SSD
and I don't ever see corrupt files. It's definitely not normal to see
this with Btrfs.


> In the earlier
> cases there were no new instances of corruption after the initial failure
> event and manual cleanup.
>
> Now that I did a little deeper into this, I do see one fairly significant
> piece of data:
>
> root@host:~# btrfs dev stat /data | grep -v ' 0$'
> [/dev/vdc].corruption_errs 16774
> [/dev/vde].write_io_errs   121
> [/dev/vde].read_io_errs4
> [devid:8].read_io_errs16
>
> Prior to the failure of devid:8, vde had 121 write errors and 4 read
> errors (these counter values are months old and the errors were long
> since repaired by scrub).  The 16774 corruption errors on vdc are all
> new since the devid:8 failure, though.

On md RAID 5 and 6, if the array gets parity mismatch counts above 0
doing a scrub (check > md/sync_action), there's a hardware problem.
It's entirely possible you've found a bug, but it must be extremely
obscure, or it would basically have hit everyone trying Btrfs raid56. I
think you need to track down the source of this corruption and stop it
however possible; whether that's changing hardware, or making sure
the system isn't crashing.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
> Seems difficult at best due to this:
> >>The normal 'device delete' operation got about 25% of the way in,
> then got stuck on some corrupted sectors and aborting with EIO.
> 
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
> 
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward is not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk, or
> Btrfs being confused as it's reading the data from disk?

> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption every
time there was a system crash (some of which were triggered by disk
failures, some not).  It looked like any writes in progress at the time
of the failure would be damaged.  In the past I would just mop up the
corrupt files (they were always the last extents written, easy to find
with find-new or scrub) and have no further problems.  In the earlier
cases there were no new instances of corruption after the initial failure
event and manual cleanup.

Now that I did a little deeper into this, I do see one fairly significant
piece of data:

root@host:~# btrfs dev stat /data | grep -v ' 0$'
[/dev/vdc].corruption_errs 16774
[/dev/vde].write_io_errs   121
[/dev/vde].read_io_errs4
[devid:8].read_io_errs16

Prior to the failure of devid:8, vde had 121 write errors and 4 read
errors (these counter values are months old and the errors were long
since repaired by scrub).  The 16774 corruption errors on vdc are all
new since the devid:8 failure, though.

> 
> 
> -- 
> Chris Murphy
> 


signature.asc
Description: Digital signature


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
<ce3g8...@umail.furryterror.org> wrote:
> On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
>> On Sun, 19 Jun 2016 23:44:27 -0400
>> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
>> better off shutting down the system, booting a rescue OS, copying the content
>> of the failing disk to the replacement one using 'ddrescue', then removing the
>> bad disk, and after boot up your main system wouldn't notice anything has ever
>> happened, aside from a few recoverable CRC errors in the "holes" on the areas
>> which ddrescue failed to copy.
>
> I'm aware of ddrescue and myrescue, but in this case the disk has failed,
> past tense.  At this point the remaining choices are to make btrfs native
> raid5 recovery work, or to restore from backups.

Seems difficult at best due to this:
>>The normal 'device delete' operation got about 25% of the way in,
then got stuck on some corrupted sectors and aborting with EIO.

In effect it's like a 2 disk failure for a raid5 (or it's
intermittently a 2 disk failure but always at least a 1 disk failure).
That's not something md raid recovers from. Even manual recovery in
such a case is far from certain.

Perhaps Roman's advice is also a question about the cause of this
corruption? I'm wondering this myself. That's the real problem here as
I see it. Losing a drive is ordinary. Additional corruptions happening
afterward is not. And are those corrupt sectors hardware corruptions,
or Btrfs corruptions at the time the data was written to disk, or
Btrfs being confused as it's reading the data from disk?

For me the critical question is what does "some corrupted sectors" mean?


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> On Sun, 19 Jun 2016 23:44:27 -0400
> Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
> better off shutting down the system, booting a rescue OS, copying the content
> of the failing disk to the replacement one using 'ddrescue', then removing the
> bad disk, and after boot up your main system wouldn't notice anything has ever
> happened, aside from a few recoverable CRC errors in the "holes" on the areas
> which ddrescue failed to copy.

I'm aware of ddrescue and myrescue, but in this case the disk has failed,
past tense.  At this point the remaining choices are to make btrfs native
raid5 recovery work, or to restore from backups.

> But in general it's commendable that you're experimenting with doing things
> "the native way", as this is provides feedback to the developers and could 
> help
> make the RAID implementation better. I guess that's the whole point of the
> exercise and the report, and hope this ends up being useful for everyone.

The intent was both to provide a cautionary tale for anyone considering
deploying a btrfs raid5 system today, and to possibly engage some
developers to help solve the problems.

The underlying causes seem to be somewhat removed from where the symptoms
are appearing, and at the moment I don't understand this code well enough
to know where to look for them.  Any assistance would be greatly appreciated.


> -- 
> With respect,
> Roman




signature.asc
Description: Digital signature


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Roman Mamedov
On Sun, 19 Jun 2016 23:44:27 -0400
Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:

> It's not going well so far.  Pay attention, there are at least four
> separate problems in here and we're not even half done yet.
> 
> I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
> 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
> also pulled in some patches from the 4.7-rc series.
> 
> This fixed a few problems I encountered early on, and I'm still making
> forward progress, but I've only replaced 50% of the failed disk so far,
> and this is week four of this particular project.

From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
better off shutting down the system, booting a rescue OS, copying the content
of the failing disk to the replacement one using 'ddrescue', then removing the
bad disk, and after boot up your main system wouldn't notice anything has ever
happened, aside from a few recoverable CRC errors in the "holes" on the areas
which ddrescue failed to copy.

But in general it's commendable that you're experimenting with doing things
"the native way", as this is provides feedback to the developers and could help
make the RAID implementation better. I guess that's the whole point of the
exercise and the report, and hope this ends up being useful for everyone.

-- 
With respect,
Roman


pgp8h7EycbEd6.pgp
Description: OpenPGP digital signature


Adventures in btrfs raid5 disk recovery

2016-06-19 Thread Zygo Blaxell
Not so long ago, I had a disk fail in a btrfs filesystem with raid1
metadata and raid5 data.  I mounted the filesystem readonly, replaced
the failing disk, and attempted to recover by adding the new disk and
deleting the missing disk.

It's not going well so far.  Pay attention, there are at least four
separate problems in here and we're not even half done yet.

I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
also pulled in some patches from the 4.7-rc series.

This fixed a few problems I encountered early on, and I'm still making
forward progress, but I've only replaced 50% of the failed disk so far,
and this is week four of this particular project.

What worked:

'mount -odegraded,...' successfully mounts the filesystem RW.  
'btrfs device add' adds the new disk.  Success!

The first thing I did was balance the metadata onto non-missing disks.
That went well.  Now there are only data chunks to recover from the
missing disk.  Success!

The normal 'device delete' operation got about 25% of the way in,
then got stuck on some corrupted sectors, aborting with EIO.  
That ends the success, but I've had similar problems with raid5 arrays
before and been able to solve them.

I've managed to remove about half of the data from the missing disk
so far.  'balance start -ddevid=,drange=0..1000'
(with increasing values for drange) is able to move data off the failed
disk while avoiding the damaged regions.  It looks like this process could
reduce the amount of data on "missing" devices to a manageable number,
then I could identify the offending corrupted extents with 'btrfs scrub',
remove the files containing them, and finish the device delete operation.
Hope!
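
A sweep like that is easy to script; here is a rough sketch (the devid, window
size, device size, and mount point below are assumptions, and a window that
fails with EIO is simply skipped rather than investigated):

    # Move data off a failing/missing device in fixed-size drange windows,
    # skipping windows where balance aborts.  Rough sketch, not a tested tool.
    import subprocess

    MOUNT = "/data"           # assumed mount point
    DEVID = "8"               # assumed devid of the missing device
    WINDOW = 10 * 1024**3     # 10 GiB of device address space per pass
    DEV_SIZE = 4 * 1024**4    # assumed device size: 4 TiB

    start = 0
    while start < DEV_SIZE:
        end = start + WINDOW
        filters = "-ddevid=%s,drange=%d..%d" % (DEVID, start, end)
        r = subprocess.run(["btrfs", "balance", "start", filters, MOUNT])
        if r.returncode != 0:
            print("window %d..%d failed; skipping damaged region" % (start, end))
        start = end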

What doesn't work:

The first problem is that the kernel keeps crashing.  I put the filesystem
and all its disks in a KVM so the crashes are less disruptive, and I can
debug them (or at least collect panic logs).  OK now crashes are merely a
performance problem.

Why did I mention 'btrfs scrub' above?  Because 'btrfs scrub' tells me
where corrupted blocks are.  'device delete' fills my kernel logs with
lines like this:

[26054.744158] BTRFS info (device vdc): relocating block group 
27753592127488 flags 129
[26809.746993] BTRFS warning (device vdc): csum failed ino 404 off 
6021976064 csum 778377694 expected csum 2827380172
[26809.747029] BTRFS warning (device vdc): csum failed ino 404 off 
6021980160 csum 3776938678 expected csum 514150079
[26809.747077] BTRFS warning (device vdc): csum failed ino 404 off 
6021984256 csum 470593400 expected csum 642831408
[26809.747093] BTRFS warning (device vdc): csum failed ino 404 off 
6021988352 csum 796755777 expected csum 690854341
[26809.747108] BTRFS warning (device vdc): csum failed ino 404 off 
6021992448 csum 4115095129 expected csum 249712906
[26809.747122] BTRFS warning (device vdc): csum failed ino 404 off 
6021996544 csum 2337431338 expected csum 1869250975
[26809.747138] BTRFS warning (device vdc): csum failed ino 404 off 
6022000640 csum 3543852608 expected csum 1929026437
[26809.747154] BTRFS warning (device vdc): csum failed ino 404 off 
6022004736 csum 3417780495 expected csum 3698318115
[26809.747169] BTRFS warning (device vdc): csum failed ino 404 off 
6022008832 csum 3423877520 expected csum 2981727596
[26809.747183] BTRFS warning (device vdc): csum failed ino 404 off 
6022012928 csum 550838742 expected csum 1005563554
[26896.379773] BTRFS info (device vdc): relocating block group 
27753592127488 flags 129
[27791.128098] __readpage_endio_check: 7 callbacks suppressed
[27791.236794] BTRFS warning (device vdc): csum failed ino 405 off 
6021980160 csum 3776938678 expected csum 514150079
[27791.236799] BTRFS warning (device vdc): csum failed ino 405 off 
6021971968 csum 3304844252 expected csum 4171523312
[27791.236821] BTRFS warning (device vdc): csum failed ino 405 off 
6021984256 csum 470593400 expected csum 642831408
[27791.236825] BTRFS warning (device vdc): csum failed ino 405 off 
6021988352 csum 796755777 expected csum 690854341
[27791.236842] BTRFS warning (device vdc): csum failed ino 405 off 
6021992448 csum 4115095129 expected csum 249712906
[27791.236847] BTRFS warning (device vdc): csum failed ino 405 off 
6021996544 csum 2337431338 expected csum 1869250975
[27791.236857] BTRFS warning (device vdc): csum failed ino 405 off 
6022004736 csum 3417780495 expected csum 3698318115
[27791.236864] BTRFS warning (device vdc): csum failed ino 405 off 
6022000640 csum 3543852608 expected csum 1929026437
[27791.236874] BTRFS warning (device vdc): csum failed ino 405 off 
6022008832 csum 3423877520 expected csum 2981727596
[27791.236978] BTRFS warning (device vdc): csum failed ino 405 off 
6021976064 csum 778377694 expected 

Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable

2016-02-14 Thread Henk Slager
>> > Do you think there is still a chance to recover those files?
>>
>> You can use  btrfs restore  to get files off a damaged fs.
>
> This however does work - thank you!
> Now since I'm a bit short on disc space, can I remove the disc that
> previously disappeared (and thus doesn't have all the
> data) from the RAID, format it and run btrfs rescue on the degraded array,
> saving the rescued data to the now free disc?

In theory   btrfs restore   should be able to read files from
(unmounted) /dev/sdb (devid 2) + /dev/sdc (devid 3).
The kernel code should still be able to mount devid 2 + devid 3 in
degraded mode, but btrfs restore needs an unmounted fs and I am not
sure whether the userspace tools can decode degraded raid5 well enough.
For a single device, i.e. non-raid profiles, it might be different.

If you unplug /dev/sda (devid 1) you can dry-run  btrfs restore -v -D
and see if it would work.

If not, maybe first save the files that have csum errors with restore
(all 3 discs connected) to other storage, then delete those files
from the normally mounted 3-disc raid5 array, and then do a normal
copy from the degraded,ro-mounted 2 discs to the newly formatted /dev/sda.
Hopefully there's enough space in total.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable

2016-02-14 Thread Benjamin Valentin
Henk Slager  gmail.com> writes:

> You could use the 1-time mount option clear_cache, then mount normally and
> the cache will be rebuilt automatically (it is also corrected if you don't
> clear it)

This didn't help, gave me

[  316.111596] BTRFS info (device sda): force clearing of disk cache
[  316.111605] BTRFS info (device sda): disk space caching is enabled
[  316.111608] BTRFS: has skinny extents
[  316.227354] BTRFS info (device sda): bdev /dev/sda errs: wr 180547340, 
rd 592949011, flush 4967, corrupt 582096433, gen 
26993

and still

[  498.552298] BTRFS warning (device sda): csum failed ino 171545 off 
2269560832 csum 2566472073 expected csum 874509527
[  498.552325] BTRFS warning (device sda): csum failed ino 171545 off 
2269564928 csum 2566472073 expected csum 2434927850

> > Do you think there is still a chance to recover those files?
> 
> You can use  btrfs restore  to get files off a damaged fs.

This however does work - thank you!
Now since I'm a bit short on disc space, can I remove the disc that 
previously disappeared (and thus doesn't have all the 
data) from the RAID, format it and run btrfs rescue on the degraded array, 
saving the rescued data to the now free disc?



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable

2016-02-09 Thread Henk Slager
On Sun, Feb 7, 2016 at 6:28 PM, Benjamin Valentin
<benpi...@googlemail.com> wrote:
> Hi,
>
> I created a btrfs volume with 3x8TB drives (ST8000AS0002-1NA) in raid5
> configuration.
> I copied some TB of data onto it without errors (from eSATA drives, so
> rather fast - I mention that because of [1]), then set it up as a
> fileserver where it had data read and written to it over a gigabit
> ethernet connection for several days.
> This however didn't go so well because after one day, one of the drives
> dropped off the SATA bus.
>
> I don't know if that was related to [1] (I was running Linux 4.4-rc6 to
> avoid that) and by now all evidence has been eaten by logrotate :\
>
> But I was not concerned for I had set up raid5 to provide redundancy
> against one disc failure - unfortunately it did not.
>
> When trying to read a file I'd get an I/O error after some hundred MB
> (this is random across multiple files, but consistent for the same
> file) on both files written before and after the disc failure.
>
> (There was still data being written to the volume at this point.)
>
> After a reboot a couple days later the drive showed up again and SMART
> reported no errors, but the I/O errors remained.
>
> I then ran btrfs scrub (this took about 10 days) and afterwards I was
> again able to completely read all files written *before* the disc
> failure.
>
> However, many files written *after* the event (while only 2 drives were
> online) are still only readable up to a point:
>
> $ dd if=Dr.Strangelove.mkv of=/dev/null
> dd: error reading ‘Dr.Strangelove.mkv’:
> Input/output error
> 5331736+0 records in
> 5331736+0 records out
> 2729848832 bytes (2,7 GB) copied, 11,1318 s, 245 MB/s
>
> $ ls -sh
> 4,4G Dr.Strangelove.mkv
>
> [  197.321552] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269564928 csum 2566472073 expected csum 2434927850
> [  197.321574] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269569024 csum 566472073 expected csum 212160686
> [  197.321592] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269573120 csum 2566472073 expected sum 2202342500
>
> I tried btrfs check --repair but to no avail, got some
>
> [ 4549.762299] BTRFS warning (device sda): failed to load free space cache 
> for block group 1614937063424, rebuilding it now
> [ 4549.790389] BTRFS error (device sda): csum mismatch on free space cache
>
> and this result
>
> checking extents
> Fixed 0 roots.
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/sda
> UUID: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
> cache and super generation don't match, space cache will be invalidated
> found 11674258875712 bytes used err is 0
> total csum bytes: 11387937220
> total tree bytes: 13011156992
> total fs tree bytes: 338083840
> total extent tree bytes: 99123200
> btree space waste bytes: 1079766991
> file data blocks allocated: 14669115838464
>  referenced 14668840665088
>
> when I mount the volume with -o nospace_cache I instead get
>
> [ 6985.165421] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269560832 csum 2566472073 expected csum 874509527
> [ 6985.165469] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269564928 csum 566472073 expected csum 2434927850
> [ 6985.165490] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269569024 csum 2566472073 expected csum 212160686
>
> when trying to read the file.

You could use the 1-time mount option clear_cache, then mount normally and
the cache will be rebuilt automatically (it is also corrected if you don't
clear it)

> Do you think there is still a chance to recover those files?

You can use  btrfs restore  to get files off a damaged fs.

> Also am I mistaken to believe that btrfs-raid5 would continue to
> function when one disc fails?

The problem you encountered is quite typical unfortunately; the answer
is yes if you stop writing to the fs. But that's not acceptable of
course. A key problem of btrfs raid (also in recent kernels like 4.4)
is that when a (redundant) device goes offline (like pulling the SATA
cable or an HDD firmware crash) btrfs/kernel does not notice it or does
not act correctly upon it under various circumstances. So same as in your
case, the writing to the disappeared device seems to continue. For just
the data, this might then still be recoverable, but for the rest of
the structures, it might corrupt the fs heavily.

What should happen is that the btrfs+kernel+fs state switches to
degraded mode and warns about the device failure so that the user can
take action. Or completely automatically start using a spare disk that is
standby but connected. But this spare disk method is currently just
patched in 

One disc of 3-disc btrfs-raid5 failed - files only partially readable

2016-02-07 Thread Benjamin Valentin
Hi,

I created a btrfs volume with 3x8TB drives (ST8000AS0002-1NA) in raid5
configuration.
I copied some TB of data onto it without errors (from eSATA drives, so
rather fast - I mention that because of [1]), then set it up as a
fileserver where it had data read and written to it over a gigabit
ethernet connection for several days.
This however didn't go so well because after one day, one of the drives
dropped off the SATA bus.

I don't know if that was related to [1] (I was running Linux 4.4-rc6 to
avoid that) and by now all evidence has been eaten by logrotate :\

But I was not concerned for I had set up raid5 to provide redundancy
against one disc failure - unfortunately it did not.

When trying to read a file I'd get an I/O error after some hundred MB
(this is random across multiple files, but consistent for the same
file) on both files written before and after the disc failure.

(There was still data being written to the volume at this point.)

After a reboot a couple days later the drive showed up again and SMART
reported no errors, but the I/O errors remained.

I then ran btrfs scrub (this took about 10 days) and afterwards I was
again able to completely read all files written *before* the disc
failure.

However, many files written *after* the event (while only 2 drives were
online) are still only readable up to a point:

$ dd if=Dr.Strangelove.mkv of=/dev/null
dd: error reading ‘Dr.Strangelove.mkv’:
Input/output error
5331736+0 records in
5331736+0 records out
2729848832 bytes (2,7 GB) copied, 11,1318 s, 245 MB/s

$ ls -sh
4,4G Dr.Strangelove.mkv

[  197.321552] BTRFS warning (device sda): csum failed ino 171545 off 
2269564928 csum 2566472073 expected csum 2434927850 
[  197.321574] BTRFS warning (device sda): csum failed ino 171545 off 
2269569024 csum 566472073 expected csum 212160686
[  197.321592] BTRFS warning (device sda): csum failed ino 171545 off 
2269573120 csum 2566472073 expected sum 2202342500

I tried btrfs check --repair but to no avail, got some

[ 4549.762299] BTRFS warning (device sda): failed to load free space cache for 
block group 1614937063424, rebuilding it now
[ 4549.790389] BTRFS error (device sda): csum mismatch on free space cache

and this result

checking extents
Fixed 0 roots.
checking free space cache
checking fs roots
checking csums
checking root refs
enabling repair mode
Checking filesystem on /dev/sda
UUID: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
cache and super generation don't match, space cache will be invalidated
found 11674258875712 bytes used err is 0
total csum bytes: 11387937220
total tree bytes: 13011156992
total fs tree bytes: 338083840
total extent tree bytes: 99123200
btree space waste bytes: 1079766991
file data blocks allocated: 14669115838464
 referenced 14668840665088

when I mount the volume with -o nospace_cache I instead get

[ 6985.165421] BTRFS warning (device sda): csum failed ino 171545 off 
2269560832 csum 2566472073 expected csum 874509527
[ 6985.165469] BTRFS warning (device sda): csum failed ino 171545 off 
2269564928 csum 566472073 expected csum 2434927850
[ 6985.165490] BTRFS warning (device sda): csum failed ino 171545 off 
2269569024 csum 2566472073 expected csum 212160686

when trying to read the file.

Do you think there is still a chance to recover those files?
Also am I mistaken to believe that btrfs-raid5 would continue to
function when one disc fails?

If you need any more info I'm happy to provide that - here is some
information about the system:

Linux nashorn 4.4.0-2-generic #16-Ubuntu SMP Thu Jan 28 15:44:21 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.4

Label: 'data'  uuid: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
Total devices 3 FS bytes used 10.62TiB
devid1 size 7.28TiB used 5.33TiB path /dev/sda
devid2 size 7.28TiB used 5.33TiB path /dev/sdb
devid3 size 7.28TiB used 5.33TiB path /dev/sdc

Data, RAID5: total=10.64TiB, used=10.61TiB
System, RAID1: total=40.00MiB, used=928.00KiB
Metadata, RAID1: total=13.00GiB, used=12.12GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Thank you!

[1] https://bugzilla.kernel.org/show_bug.cgi?id=93581
[2] full dmesg: http://paste.ubuntu.com/14965237/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-06 Thread Patrik Lundquist
On 6 November 2015 at 10:03, Janos Toth F.  wrote:
>
> I did update the firmware of the drives, though. (I found an IMPORTANT
> update when I went there to download SeaTools, although there was no
> change log to tell me why this was important). This might have changed the
> error-handling behavior of the drive...?

I've had Seagate drives not reporting errors until I updated the
firmware. They tended to timeout instead. Got a shitload of SMART
errors after I updated, but they still didn't handle errors very well
(became unresponsive).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-06 Thread Janos Toth F.
I created a fresh RAID-5 mode Btrfs on the same 3 disks (including the
faulty one which is still producing numerous random read errors) and
Btrfs now seems to work exactly as I would anticipate.

I copied some data and verified the checksum. The data is readable and
correct regardless of the constant warning messages in the kernel log
about the read errors on the single faulty HDD (the bad behavior is
confirmed by the SMART logs and I tested it in a different PC as
well...).

I also ran several scrubs and now they always finish with X corrected
and 0 uncorrected errors. (The errors are supposedly corrected but the
faulty HDD keeps randomly corrupting the data...)
Last time, I saw uncorrected errors during the scrub and not all of the
data was readable. Rather strange...
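
For what it's worth, the per-device split of those corrected/uncorrected
counts can be pulled out like this (mount point assumed):

# foreground scrub with per-device statistics
btrfs scrub start -Bd /mnt/data
btrfs scrub status -d /mnt/data
# cumulative per-device error counters
btrfs device stats /mnt/data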

I ran 24 hours of Gimps/Prime95 Blend stresstest without errors on the
problematic machine.
Although I did update the firmware of the drives. (I found an IMPORTANT
update when I went there to download SeaTools, although there was no
changelog to tell me why it was important.) This might have changed the
error handling behavior of the drives...?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-05 Thread Austin S Hemmelgarn

On 2015-11-04 23:06, Duncan wrote:

(Tho I should mention, while not on zfs, I've actually had my own
problems with ECC RAM too.  In my case, the RAM was certified to run at
speeds faster than it was actually reliable at, such that stored data
(what the ECC protects) was fine, but the data was actually getting
damaged in transit to/from the RAM.  On a lightly loaded system, such as
one running many memory tests or under normal desktop usage conditions,
the RAM was generally fine, no problems.  But on a heavily loaded system,
such as when doing parallel builds (I run gentoo, which builds from
sources in order to get the higher level of option flexibility that
comes only when you can toggle build-time options), I'd often have memory
faults and my builds would fail.

The most common failure, BTW, was on tarball decompression, bunzip2 or
the like, since the tarballs contained checksums that were verified on
data decompression, and often they'd fail to verify.

Once I updated the BIOS to one that would let me set the memory speed
instead of using the speed the modules themselves reported, and I
declocked the memory just one notch (this was DDR1, IIRC I declocked from
the PC3200 it was rated, to PC3000 speeds), not only was the memory then
100% reliable, but I could and did actually reduce the number of wait-
states for various operations, and it was STILL 100% reliable.  It simply
couldn't handle the raw speeds it was certified to run, is all, tho it
did handle it well enough, enough of the time, to make the problem far
more difficult to diagnose and confirm than it would have been had the
problem appeared at low load as well.

As it happens, I was running reiserfs at the time, and it handled both
that hardware issue, and a number of others I've had, far better than I'd
have expected of /any/ filesystem, when the memory feeding it is simply
not reliable.  Reiserfs metadata, in particular, seems incredibly
resilient in the face of hardware issues, and I lost far less data than I
might have expected, tho without checksums and with bad memory, I imagine
I had occasional undetected bitflip corruption in files here or there,
but generally nothing I detected.  I still use reiserfs on my spinning
rust today, but it's not well suited to SSD, which is where I run btrfs.

But the point for this discussion is that just because it's ECC RAM
doesn't mean you can't have memory related errors, just that if you do,
they're likely to be different errors, "transit errors", that will tend
to go undetected by many memory checkers, at least the ones that don't
run at full memory bandwidth because they're simply checking that
what was stored in a cell can be read back, unchanged.)
I've actually seen similar issues with both ECC and non-ECC memory 
myself.  Any time I'm getting RAM for a system that I can afford to 
over-spec, I get the next higher speed and under-clock it (which in turn 
means I can lower the timing parameters and usually get a faster system 
than if I was running it at the rated speed).  FWIW, I also make a point 
of doing multiple memtest86+ runs (at a minimum, one running single 
core, and one with forced SMP) when I get new RAM, and even have a 
run-level configured on my Gentoo based home server system where it 
boots Xen and fires up twice as many VM's running memtest86+ as I have 
CPU cores, which is usually enough to fully saturate memory bandwidth 
and check for the type of issues you mentioned having above (although
the BOINC client I run usually does a good job of triggering those kinds
of issues fast, since distributed computing apps tend to be memory bound
and use a lot of memory bandwidth).
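
A rough userspace way to get a similar effect without a Xen setup is to
run one memory tester per core in parallel (this assumes memtester is
installed; the size and iteration count are placeholders):

# saturate memory bandwidth with one memtester instance per CPU core
for i in $(seq 1 "$(nproc)"); do
    memtester 512M 1 > "/tmp/memtester.$i.log" 2>&1 &
done
wait
# any file listed here contains a reported failure
grep -l FAILURE /tmp/memtester.*.log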






Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-05 Thread Zoiled

Duncan wrote:

Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as
excerpted:


On 2015-11-04 13:01, Janos Toth F. wrote:

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?

If you can prove that there was a checksum mismatch and BTRFS returned
invalid data instead of a read error or going to the other disk, then
that is a very serious bug that needs to be fixed.  You need to keep in
mind also however that it's completely possible that the data was bad
before you wrote it to the filesystem, and if that's the case, there's
nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned
off checksumming with nodatasum or nocow, then it's almost certainly
returning the data it was given to write out in the first place.  Whether
that data it was given to write out was correct, however, is an
/entirely/ different matter.

If ISOs are failing their external checksums, then something is going
on.  Had you verified the external checksums when you first got the
files?  That is, are you sure the files were correct as downloaded and/or
ripped?

Where were the ISOs stored between original procurement/validation and
writing to btrfs?  Is it possible you still have some/all of them on that
media?  Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there are three other
likely possibilities for where the corruption could have come from before
writing to btrfs.  Either the files were bad as downloaded or otherwise
procured -- which is why I asked whether you verified them upon receipt
-- or you have memory that's going bad, or your temporary storage is
going bad, before the files ever got written to btrfs.

The memory going bad is a particularly worrying possibility,
considering...


Now I am really considering to move from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
motherboard with more native SATA connectors and an extra HDD). That
one seemed to actually do what it promises (abort any read operations
upon checksum errors [which always happens seamlessly on every read]
but look at the redundant data first and seamlessly "self-heal" if
possible). The only thing which made Btrfs look like a better
alternative was the RAID-5 support. But I recently experienced two
cases of 1 drive failing out of 3 and it always turned out to be a
smaller or bigger disaster (completely lost data or inconsistent data).

Have you considered looking into ZFS?  I hate to suggest it as an
alternative to BTRFS, but it's a much more mature and well tested
technology than ReFS, and has many of the same features as BTRFS (and
even has the option for triple parity instead of the double you get with
RAID6).  If you do consider ZFS, make a point to look at FreeBSD in
addition to the Linux version, the BSD one was a much better written
port of the original Solaris drivers, and has better performance in many
cases (and as much as I hate to admit it, BSD is way more reliable than
Linux in most use cases).

You should also seriously consider whether the convenience of having a
filesystem that fixes internal errors itself with no user intervention
is worth the risk of it corrupting your data.  Returning correct data
whenever possible is one thing, being 'self-healing' is completely
different.  When you start talking about things that automatically fix
internal errors without user intervention is when most seasoned system
administrators start to get really nervous.  Self correcting systems
have just as much chance to make things worse as they do to make things
better, and most of them depend on the underlying hardware working
correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one
related smaller one!

The people who have a lot of ZFS experience say it's generally quite
reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*!
The self-healing works well, *PROVIDED* memory isn't producing errors.
Absolutely reliable memory is in fact *so* critical, that running ZFS on
non-ECC memory is severely discouraged as a very real risk to your data.

Which is why the above hints that your memory may be bad are so
worrying.  Don't even *THINK* about ZFS, particularly its self-healing
features, if you're not absolutely sure your memory is 100% reliable,
because apparently, based on the comments I've seen, if it's not, you

Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Duncan
Austin S Hemmelgarn posted on Wed, 04 Nov 2015 13:45:37 -0500 as
excerpted:

> On 2015-11-04 13:01, Janos Toth F. wrote:
>> But the worst part is that there are some ISO files which were
>> seemingly copied without errors but their external checksums (the one
>> which I can calculate with md5sum and compare to the one supplied by
>> the publisher of the ISO file) don't match!
>> Well... this, I cannot understand.
>> How could these files become corrupt from a single disk failure? And
>> more importantly: how could these files be copied without errors? Why
>> didn't Btrfs gave a read error when the checksums didn't add up?
> If you can prove that there was a checksum mismatch and BTRFS returned
> invalid data instead of a read error or going to the other disk, then
> that is a very serious bug that needs to be fixed.  You need to keep in
> mind also however that it's completely possible that the data was bad
> before you wrote it to the filesystem, and if that's the case, there's
> nothing any filesystem can do to fix it for you.

As Austin suggests, if btrfs is returning data, and you haven't turned 
off checksumming with nodatasum or nocow, then it's almost certainly 
returning the data it was given to write out in the first place.  Whether 
that data it was given to write out was correct, however, is an 
/entirely/ different matter.

If ISOs are failing their external checksums, then something is going 
on.  Had you verified the external checksums when you first got the 
files?  That is, are you sure the files were correct as downloaded and/or 
ripped?

Where were the ISOs stored between original procurement/validation and 
writing to btrfs?  Is it possible you still have some/all of them on that 
media?  Do they still external-checksum-verify there?

Basically, assuming btrfs checksums are validating, there are three other 
likely possibilities for where the corruption could have come from before 
writing to btrfs.  Either the files were bad as downloaded or otherwise 
procured -- which is why I asked whether you verified them upon receipt 
-- or you have memory that's going bad, or your temporary storage is 
going bad, before the files ever got written to btrfs.

The memory going bad is a particularly worrying possibility, 
considering...

>> Now I am really considering to move from Linux to Windows and from
>> Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
>> that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a new
>> motherboard with more native SATA connectors and an extra HDD). That
>> one seemed to actually do what it promises (abort any read operations
>> upon checksum errors [which always happens seamlessly on every read]
>> but look at the redundant data first and seamlessly "self-heal" if
>> possible). The only thing which made Btrfs to look as a better
>> alternative was the RAID-5 support. But I recently experienced two
>> cases of 1 drive failing of 3 and it always tuned out as a smaller or
>> bigger disaster (completely lost data or inconsistent data).

> Have you considered looking into ZFS?  I hate to suggest it as an
> alternative to BTRFS, but it's a much more mature and well tested
> technology than ReFS, and has many of the same features as BTRFS (and
> even has the option for triple parity instead of the double you get with
> RAID6).  If you do consider ZFS, make a point to look at FreeBSD in
> addition to the Linux version, the BSD one was a much better written
> port of the original Solaris drivers, and has better performance in many
> cases (and as much as I hate to admit it, BSD is way more reliable than
> Linux in most use cases).
> 
> You should also seriously consider whether the convenience of having a
> filesystem that fixes internal errors itself with no user intervention
> is worth the risk of it corrupting your data.  Returning correct data
> whenever possible is one thing, being 'self-healing' is completely
> different.  When you start talking about things that automatically fix
> internal errors without user intervention is when most seasoned system
> administrators start to get really nervous.  Self correcting systems
> have just as much chance to make things worse as they do to make things
> better, and most of them depend on the underlying hardware working
> correctly to actually provide any guarantee of reliability.

I too would point you at ZFS, but there's one VERY BIG caveat, and one 
related smaller one!

The people who have a lot of ZFS experience say it's generally quite 
reliable, but gobs of **RELIABLE** memory are *absolutely* *critical*!  
The self-healing works well, *PROVIDED* memory isn't producing errors.  
Absolutely reliable memory is in fact *so* critical, that running ZFS on 
non-ECC memory is severely discouraged as a very real risk to your data.

Which is why the above hints that your memory may be bad are so 
worrying.  Don't even *THINK* about ZFS, particularly its self-healing 
features, if you're not 

Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Janos Toth F.
Well. Now I am really confused about Btrfs RAID-5!

So, I replaced all SATA cables (which are explicitly marked as being
aimed at SATA3 speeds) and all the 3x2TB WD Red 2.0 drives with 3x4TB
Seagate Constellation ES.3 drives and started from scratch. I
secure-erased every drive, created an empty filesystem and ran a
"long" SMART self-test on all drives before I started using the
storage space (the tests finished without errors, all drives looked
fine, 0 bad sectors, 0 read or SATA CRC errors... all looked
perfectly fine at the time...).

It didn't take long before I realized that one of the new drives
started failing.
I started a scrub and it reported both corrected and uncorrectable errors.
I looked at the SMART data. 2 drives look perfectly fine and 1 drive
seems to be really sick. The latter one has some "reallocated" and
several hundred "pending" sectors among other error indications in
the log. I guess it's not the drive surface but the HDD controller (or
maybe a head) which is really dying.

I figured the uncorrectable errors are write errors which is not
surprising given the perceived "health" of the drive according to its
SMART attributes and error logs. That's understandable.
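
Those counters are easy to keep an eye on with smartctl (device name is
just an example):

# reallocated / pending sector counts and the rest of the attribute table
smartctl -A /dev/sdb
# the drive's own error log and self-test history
smartctl -l error /dev/sdb
smartctl -l selftest /dev/sdb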


Although, I tried to copy data from the filesystem and it failed in
various ways.
There was a file which couldn't be copied at all. Good question why. I
guess it's because the filesystem needs to be repaired to get the
checksums and parities sorted out first. That's also understandable
(though unexpected, I thought RAID-5 Btrfs is sort-of "self-healing"
in these situations, it should theoretically still be able to
reconstruct and present the correct data, based on checksums and
parities, seamlessly, and only place an error in the kernel log...).

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?


Isn't Btrfs supposed to constantly check the integrity of the file
data during any normal read operations and give an error instead of
spitting out corrupt data as if it was perfectly legit?
I thought that's how it is supposed to work.
What's the point of full data checksumming if only an explicitly
requested scrub operation might look for errors? I thought it's the
logical thing to do if checksum verification happens during every
single read operation and passing that check is mandatory in order to
get any data out of the filesystem (might be excluding the Direct-I/O
mode but I never use that on Btrfs - if that's even actually
supported, I don't know).


Now I am really considering to move from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
new motherboard with more native SATA connectors and an extra HDD).
That one seemed to actually do what it promises (abort any read
operations upon checksum errors [which always happens seamlessly on
every read] but look at the redundant data first and seamlessly
"self-heal" if possible). The only thing which made Btrfs to look as a
better alternative was the RAID-5 support. But I recently experienced
two cases of 1 drive failing of 3 and it always tuned out as a smaller
or bigger disaster (completely lost data or inconsistent data).


Does anybody have ideas about what might have gone wrong in this second scenario?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-11-04 Thread Austin S Hemmelgarn

On 2015-11-04 13:01, Janos Toth F. wrote:

But the worst part is that there are some ISO files which were
seemingly copied without errors but their external checksums (the one
which I can calculate with md5sum and compare to the one supplied by
the publisher of the ISO file) don't match!
Well... this, I cannot understand.
How could these files become corrupt from a single disk failure? And
more importantly: how could these files be copied without errors? Why
didn't Btrfs give a read error when the checksums didn't add up?
If you can prove that there was a checksum mismatch and BTRFS returned 
invalid data instead of a read error or going to the other disk, then 
that is a very serious bug that needs to be fixed.  You need to keep in 
mind also however that it's completely possible that the data was bad 
before you wrote it to the filesystem, and if that's the case, there's 
nothing any filesystem can do to fix it for you.


Isn't Btrfs supposed to constantly check the integrity of the file
data during any normal read operations and give an error instead of
spitting out corrupt data as if it was perfectly legit?
I thought that's how it is supposed to work.
Assuming that all of your hardware is working exactly like it's supposed 
to, yes it should work that way.  If however, you have something that 
corrupts the data in RAM before or while BTRFS is computing the checksum 
prior to writing the data, then it's fully possible for bad data to get 
written to disk and still have a perfectly correct checksum.  Bad RAM 
may also explain your issues mentioned above with not being able to copy 
stuff off of the filesystem.


Also, if you're using NOCOW files (or just the mount option), those very 
specifically do not store checksums for the blocks, because there is no 
way to do it without significant risk of data corruption.
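
A quick way to see whether a file is NOCOW (and therefore has no data
checksums) is the 'C' attribute; a small sketch, with the paths purely
illustrative:

# a 'C' in the attribute column means NOCOW, i.e. no data checksums
lsattr /mnt/data/vm-images/disk.img
# NOCOW is normally set on an empty directory so that new files inherit it
chattr +C /mnt/data/vm-images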

What's the point of full data checksumming if only an explicitly
requested scrub operation might look for errors? I thought it's the
logical thing to do if checksum verification happens during every
single read operation and passing that check is mandatory in order to
get any data out of the filesystem (might be excluding the Direct-I/O
mode but I never use that on Btrfs - if that's even actually
supported, I don't know).


Now I am really considering to move from Linux to Windows and from
Btrfs RAID-5 to Storage Spaces RAID-1 + ReFS (the only limitation is
that ReFS is only "self-healing" on RAID-1, not RAID-5, so I need a
new motherboard with more native SATA connectors and an extra HDD).
That one seemed to actually do what it promises (abort any read
operations upon checksum errors [which always happens seamlessly on
every read] but look at the redundant data first and seamlessly
"self-heal" if possible). The only thing which made Btrfs to look as a
better alternative was the RAID-5 support. But I recently experienced
two cases of 1 drive failing of 3 and it always tuned out as a smaller
or bigger disaster (completely lost data or inconsistent data).
Have you considered looking into ZFS?  I hate to suggest it as an 
alternative to BTRFS, but it's a much more mature and well tested 
technology than ReFS, and has many of the same features as BTRFS (and 
even has the option for triple parity instead of the double you get with 
RAID6).  If you do consider ZFS, make a point to look at FreeBSD in 
addition to the Linux version, the BSD one was a much better written 
port of the original Solaris drivers, and has better performance in many 
cases (and as much as I hate to admit it, BSD is way more reliable than 
Linux in most use cases).


You should also seriously consider whether the convenience of having a 
filesystem that fixes internal errors itself with no user intervention 
is worth the risk of it corrupting your data.  Returning correct data 
whenever possible is one thing, being 'self-healing' is completely 
different.  When you start talking about things that automatically fix 
internal errors without user intervention is when most seasoned system 
administrators start to get really nervous.  Self correcting systems 
have just as much chance to make things worse as they do to make things 
better, and most of them depend on the underlying hardware working 
correctly to actually provide any guarantee of reliability.  I cannot 
count the number of stories I've heard of 'self-healing' hardware RAID 
controllers destroying data.






Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Janos Toth F.
I went through all the recovery options I could find (starting from
read-only to "extraordinarily dangerous"). Nothing seemed to work.

A Windows-based proprietary recovery software (ReclaiMe) could scratch
the surface but only that (it showed me the whole original folder
structure after a few minutes of scanning and the "preview" of some
plaintext files was promising, but most of the bigger files seemed
to be broken).

I used this as bulk storage for backups and all the things I didn't
care to keep in more than one copy, but that includes my
"scratchpad", so I cared enough to use RAID5 mode and to try restoring
some things.

Any last ideas before I "ata secure erase" and sell/repurpose the disks?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread ronnie sahlberg
If it is mostly for archival storage, I would suggest you take a look
at snapraid.


On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F.  wrote:
> I went through all the recovery options I could find (starting from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows based proprietary recovery software (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> some plaintext files was promising but most of the bigger files seemed
> to be broken).
>
> I used this as a bulk storage for backups and all the things I didn't
> care to keep in more than one copies but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread ronnie sahlberg
Maybe hold off erasing the drives a little in case someone wants to
collect some extra data for diagnosing how/why the filesystem got into
this unrecoverable state.

A single device having issues should not cause the whole filesystem to
become unrecoverable.

On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F.  wrote:
> I went through all the recovery options I could find (starting from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows based proprietary recovery software (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> some plaintext files was promising but most of the bigger files seemed
> to be broken).
>
> I used this as a bulk storage for backups and all the things I didn't
> care to keep in more than one copies but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Janos Toth F.
I am afraid the filesystem right now is really damaged regardless of
its state upon the unexpected cable failure, because I tried some
dangerous options after the read-only restore/recovery methods all failed
(including zero-log, followed by init-csum-tree and even
chunk-recover -> all of them just spit out several kinds of errors,
which suggests they probably didn't even write anything to the disks
before they decided that they had already failed, but if they did write
something, they only caused more harm than good).
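
(For reference, and not as a recommendation, the operations mentioned above
correspond roughly to these commands run against the unmounted devices;
the device name is just an example:)

btrfs rescue zero-log /dev/sdb         # discards the log tree
btrfs check --init-csum-tree /dev/sdb  # rebuilds the checksum tree from scratch
btrfs rescue chunk-recover /dev/sdb    # scans the devices to rebuild the chunk tree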

Actually, I almost got rid of this data myself intentionally when my
new set of drives arrived. I was considering whether I should simply start
from scratch (maybe reviewing and saving my "scratchpad" portion of the
data, but nothing really irreplaceable and/or valuable), but I thought it
would be a good idea to test the "device replace" function in real life.

Even though the replace operation seemed to be successful, I am
beginning to wonder whether it really was.


On Wed, Oct 21, 2015 at 7:42 PM, ronnie sahlberg
 wrote:
> Maybe hold off erasing the drives a little in case someone wants to
> collect some extra data for diagnosing how/why the filesystem got into
> this unrecoverable state.
>
> A single device having issues should not cause the whole filesystem to
> become unrecoverable.
>
> On Wed, Oct 21, 2015 at 9:09 AM, Janos Toth F.  wrote:
>> I went through all the recovery options I could find (starting from
>> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>>
>> A Windows based proprietary recovery software (ReclaiMe) could scratch
>> the surface but only that (it showed me the whole original folder
>> structure after a few minutes of scanning and the "preview" of some
>> some plaintext files was promising but most of the bigger files seemed
>> to be broken).
>>
>> I used this as a bulk storage for backups and all the things I didn't
>> care to keep in more than one copies but that includes my
>> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
>> some things.
>>
>> Any last ideas before I "ata secure erase" and sell/repurpose the disks?


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Chris Murphy
https://btrfs.wiki.kernel.org/index.php/Restore

This should still be possible even with a degraded/unmounted raid5. It
is a bit tedious to figure out how to use it, but if you've got some
things you want off the volume, it's not so difficult that it should
stop you from trying.
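
A minimal sketch of how restore tends to be used (device name and target
directory are placeholders):

# dry run: list what restore thinks it can pull off the broken filesystem
btrfs restore -D -v /dev/sdb /mnt/recovery
# real run, ignoring errors and including snapshots and xattrs
btrfs restore -i -s -x -v /dev/sdb /mnt/recovery
# if the default tree root is damaged, list alternative roots (use with -t)
btrfs restore -l /dev/sdb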


Chris Murphy


Re: Btrfs/RAID5 became unmountable after SATA cable fault

2015-10-21 Thread Janos Toth F.
I tried several things, including the degraded mount option. One example:

# mount /dev/sdb /data -o ro,degraded,nodatasum,notreelog
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
   missing codepage or helper program, or other error

   In some cases useful info is found in syslog - try
   dmesg | tail or so.

# cat /proc/kmsg
<6>[  262.616929] BTRFS info (device sdd): allowing degraded mounts
<6>[  262.616943] BTRFS info (device sdd): setting nodatasum
<6>[  262.616949] BTRFS info (device sdd): disk space caching is enabled
<6>[  262.616953] BTRFS: has skinny extents
<6>[  262.652671] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280,
corrupt 0, gen 0
<3>[  262.697162] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  262.697633] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  262.697660] BTRFS: Failed to read block groups: -5
<3>[  262.709885] BTRFS: open_ctree failed
<6>[  267.197365] BTRFS info (device sdd): allowing degraded mounts
<6>[  267.197385] BTRFS info (device sdd): setting nodatasum
<6>[  267.197397] BTRFS info (device sdd): disabling tree log
<6>[  267.197406] BTRFS info (device sdd): disk space caching is enabled
<6>[  267.197412] BTRFS: has skinny extents
<6>[  267.232809] BTRFS: bdev (null) errs: wr 858, rd 8057, flush 280,
corrupt 0, gen 0
<3>[  267.246167] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  267.246706] BTRFS (device sdd): parent transid verify failed on
38719488 wanted 101765 found 101223
<3>[  267.246727] BTRFS: Failed to read block groups: -5
<3>[  267.261392] BTRFS: open_ctree failed
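
Given the "parent transid verify failed" errors, the usual next steps would
be something along these lines (entirely a sketch; device, mount point and
byte number are placeholders):

# try mounting from a backup tree root (this option was later renamed
# usebackuproot)
mount -o ro,degraded,recovery /dev/sdb /data
# if that also fails, look for older tree roots and feed one to restore
btrfs-find-root /dev/sdb
btrfs restore -t <bytenr> -i -v /dev/sdb /mnt/recovery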

On Wed, Oct 21, 2015 at 6:09 PM, Janos Toth F.  wrote:
> I went through all the recovery options I could find (starting from
> read-only to "extraordinarily dangerous"). Nothing seemed to work.
>
> A Windows based proprietary recovery software (ReclaiMe) could scratch
> the surface but only that (it showed me the whole original folder
> structure after a few minutes of scanning and the "preview" of some
> some plaintext files was promising but most of the bigger files seemed
> to be broken).
>
> I used this as a bulk storage for backups and all the things I didn't
> care to keep in more than one copies but that includes my
> "scratchpad", so I cared enough to use RAID5 mode and to try restoring
> some things.
>
> Any last ideas before I "ata secure erase" and sell/repurpose the disks?

