Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-26 Thread Vasco Almeida
On Sun, 26-06-2016 at 13:54 -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida wrote:
> > I have tried "btrfs check --repair /device" but that does not seem
> > to do any good.
> > http://paste.fedoraproject.org/384960/66945936/
> 
> It did fix things, in particular with the snapshot that was having
> problems being dropped. But it seems that wasn't enough to prevent it
> from going read-only.
> 
> There's more than one bug here. You might see if the repair was good
> enough that it's now possible to use btrfs-image.

File system image available at (choose one link)
https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
https://www.sendspace.com/file/i70cft

>  If not, use
> btrfs-debug-tree  > file.txt and post that file somewhere. This
> does expose file names. Maybe that'll shed some light on the problem.
> But also worth filing a bug at bugzilla.kernel.org with this debug
> tree referenced (probably too big to attach), maybe a dev will be
> able
> to look at it and improve things so they don't fail.

Should I file the bug report with the image dump linked above, the
btrfs-debug-tree output, or both?
I think I will use the subject of this thread as the summary for the
bug. Can you think of something more suitable, or is that fine?

> > What else can I do, or must I rebuild the file system?
> 
> Well, it's a long shot, but you could try using --repair --init-csum,
> which will create a new csum tree. But that applies to data; if the
> problem with it going read-only is due to metadata corruption, this
> won't help. And lastly you could try --init-extent-tree. The thing I
> can't answer is which order to do them in.
> 
> In any case there will be files that you shouldn't trust after the
> csum tree has been recreated: anything corrupt will now have a new
> csum, so you can get silent data corruption. It's better to just blow
> away this file system, make a new one, and reinstall the OS. But if
> you're feeling brave, you can try one or both of those additional
> options and see if they help.
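For reference, a rough sketch of that last-resort sequence, using the long
option names from btrfs check (--init-csum-tree, --init-extent-tree); the
device path is a placeholder, the ordering is uncertain as noted above, and
both options are destructive, so only try this on a filesystem you are
already prepared to lose:

  # Last-resort repair attempt, run against the unmounted device.
  DEV=/dev/sdXN                                   # placeholder for the affected device
  btrfs check --repair --init-csum-tree "$DEV"    # rebuild the checksum tree
  btrfs check --repair --init-extent-tree "$DEV"  # rebuild the extent tree
  # Afterwards, verify anything you keep against known-good copies, since
  # corrupt data now carries freshly computed (and therefore "valid") checksums.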

I think I will reinstall the OS since, even if I manage to recover the
file system from this issue, that OS will be something I cannot fully
trust.


Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces

2016-06-26 Thread Qu Wenruo

Hi Josef,

Would you please move this patch to the front of the patchset?

It makes bisecting quite hard, as a bisect will always stop at this
patch, making it hard to tell whether a failure is a regression or an
existing bug.
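For reference, one way to do that reordering with git (purely illustrative;
the base ref below is a placeholder for wherever the series starts):

  git rebase -i <base-of-patchset>
  # In the editor, move the "pick" line for
  # "Btrfs: warn_on for unaccounted spaces" to the top, then save and exit.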


Thanks,
Qu

At 03/26/2016 01:25 AM, Josef Bacik wrote:

These were hidden behind enospc_debug, which isn't helpful as they indicate
actual bugs, unlike the rest of the enospc_debug stuff which is really debug
information.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 157a0b6..90ac821 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9633,13 +9633,15 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
space_info = list_entry(info->space_info.next,
struct btrfs_space_info,
list);
-   if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
-   if (WARN_ON(space_info->bytes_pinned > 0 ||
+
+   /*
+* Do not hide this behind enospc_debug, this is actually
+* important and indicates a real bug if this happens.
+*/
+   if (WARN_ON(space_info->bytes_pinned > 0 ||
space_info->bytes_reserved > 0 ||
-   space_info->bytes_may_use > 0)) {
-   dump_space_info(space_info, 0, 0);
-   }
-   }
+   space_info->bytes_may_use > 0))
+   dump_space_info(space_info, 0, 0);
list_del(&space_info->list);
for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
struct kobject *kobj;






Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Andrei Borzenkov
27.06.2016 06:50, Christoph Anton Mitterer wrote:
> 
> What should IMHO be done as well is giving a big fat warning in the
> manpages/etc. that when nodatacow is used RAID recovery cannot produce
> valid data (at least as long as there isn't checksumming implemented
> for nodatacow).

The problem is that the current implementation of RAID56 puts exactly
the CoW'd data at risk. I.e. writing a new (copy of) data block may
suddenly make an old (copy of) data block inaccessible, even though it
had been safely committed to disk and now sits in a read-only snapshot.
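A toy illustration of that partial-stripe update problem, using shell
arithmetic; this is the generic RAID5 read-modify-write / write-hole
scenario the paragraph above describes, not actual btrfs code:

  # Toy 3-disk RAID5 stripe: two one-byte data blocks plus one parity byte.
  d1=0x5a           # old block, committed long ago (e.g. in a read-only snapshot)
  d2=0x3c           # neighbouring block about to be rewritten
  p=$(( d1 ^ d2 ))  # parity currently on disk

  d2=0x7e           # new, unrelated data lands in the shared stripe...
  # ...so parity must be rewritten too.  If that parity update is wrong or
  # torn, the stale parity can no longer reconstruct the old block d1:
  printf 'real d1=0x%02x, rebuilt d1=0x%02x\n' 0x5a $(( d2 ^ p ))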




Re: [PATCH 1/4] fstests: btrfs/048: extend _filter_btrfs_prop_error to handle additional errors

2016-06-26 Thread Eryu Guan
On Fri, Jun 24, 2016 at 11:08:31AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> btrfsprogs v4.5.3 changed the formatting of some error messages.  This
> patch extends the filter for btrfs prop to handle those.
> 
> Signed-off-by: Jeff Mahoney 
> ---
>  common/filter.btrfs | 10 +++---
>  tests/btrfs/048 |  6 --
>  tests/btrfs/048.out |  4 ++--
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/common/filter.btrfs b/common/filter.btrfs
> index 9970f4d..54361d4 100644
> --- a/common/filter.btrfs
> +++ b/common/filter.btrfs
> @@ -72,15 +72,19 @@ _filter_btrfs_compress_property()
>   sed -e "s/compression=\(lzo\|zlib\)/COMPRESSION=XXX/g"
>  }
>  
> -# filter name of the property from the output, optionally verify against $1
> +# filter error messages from btrfs prop, optionally verify against $1
>  # recognized message(s):
>  #  "object is not compatible with property: label"
> +#  "invalid value for property:{, value}"
> +#  "failed to {get,set} compression for $PATH[.:]: Invalid argument"
>  _filter_btrfs_prop_error()
>  {
>   if ! [ -z "$1" ]; then
> - sed -e "s/\(compatible with property\): $1/\1/"
> + sed -e "s#\(compatible with property\): $1#\1#" \
> + -e "s#^\(.*failed to [sg]et compression for $1\)[:.] 
> \(.*\)#\1: \2#"
>   else
> - sed -e "s/^\(.*compatible with property\).*/\1/"
> + sed -e "s#^\(.*compatible with property\).*#\1#" \
> + -e "s#^\(.*invalid value for property\):.*#\1#"

I think the last regex should be
-   -e "s#^\(.*invalid value for property\):.*#\1#"
+   -e "s#^\(.*invalid value for property\)[:.].*#\1#"

because...

>   fi
>  }
>  
> diff --git a/tests/btrfs/048 b/tests/btrfs/048
> index 4a36303..0b907b0 100755
> --- a/tests/btrfs/048
> +++ b/tests/btrfs/048
> @@ -79,7 +79,8 @@ echo -e "\nTesting subvolume ro property"
>  _run_btrfs_util_prog subvolume create $SCRATCH_MNT/sv1
>  $BTRFS_UTIL_PROG property get $SCRATCH_MNT/sv1 ro
>  echo "***"
> -$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo
> +$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo 2>&1 |
> + _filter_btrfs_prop_error
>  echo "***"
>  $BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro true
>  echo "***"
> @@ -99,7 +100,8 @@ $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
>  $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/subdir1 compression
>  echo "***"
>  $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression \
> - foo 2>&1 | _filter_scratch
> + foo 2>&1 | _filter_scratch |
> + _filter_btrfs_prop_error SCRATCH_MNT/testdir/file1
>  echo "***"
>  $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression lzo
>  $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
> diff --git a/tests/btrfs/048.out b/tests/btrfs/048.out
> index 0b20d0b..3e4e3d2 100644
> --- a/tests/btrfs/048.out
> +++ b/tests/btrfs/048.out
> @@ -15,7 +15,7 @@ ERROR: object is not compatible with property
>  Testing subvolume ro property
>  ro=false
>  ***
> -ERROR: invalid value for property.
> +ERROR: invalid value for property

this change breaks the test with older btrfs-progs; the filter doesn't
remove the trailing "." correctly.

With the above fix, btrfs/048 works for me with both v3.19 and v4.6
btrfs-progs.
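A quick way to sanity-check the suggested expression against both message
styles (the sample strings below are illustrative, not verbatim btrfs-progs
output):

  old='ERROR: invalid value for property.'      # older-progs style, ends with '.'
  new='ERROR: invalid value for property: foo'  # newer-progs style, ends with ': ...'
  for msg in "$old" "$new"; do
          echo "$msg" | sed -e "s#^\(.*invalid value for property\)[:.].*#\1#"
  done
  # Both iterations should print: ERROR: invalid value for property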

Thanks,
Eryu

>  ***
>  ***
>  ro=true
> @@ -27,7 +27,7 @@ ro=false
>  
>  Testing compression property
>  ***
> -ERROR: failed to set compression for SCRATCH_MNT/testdir/file1. Invalid argument
> +ERROR: failed to set compression for SCRATCH_MNT/testdir/file1: Invalid argument
>  ***
>  compression=lzo
>  compression=lzo
> -- 
> 1.8.5.6
> 


Re: Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-26 Thread Nick Austin
On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin  wrote:
> sudo btrfs fi show /mnt/newdata
> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
> Total devices 4 FS bytes used 8.07TiB
> devid1 size 5.46TiB used 2.70TiB path /dev/sdg
> devid2 size 5.46TiB used 2.70TiB path /dev/sdl
> devid3 size 5.46TiB used 2.70TiB path /dev/sdm
> devid4 size 5.46TiB used 2.70TiB path /dev/sdx

It looks like fi show has bad data:

When I start heavy IO on the filesystem (running rsync -c to verify the
data), I notice zero IO on the bad drive I told btrfs to replace, and
lots of IO to the expected replacement.

I guess some metadata is messed up somewhere?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  25.190.007.81   28.460.00   38.54

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdg 437.00 75168.00  1792.00  75168   1792
sdl 443.00 76064.00  1792.00  76064   1792
sdm 438.00 75232.00  1472.00  75232   1472
sdw 443.00 75680.00  1856.00  75680   1856
sdx   0.00 0.00 0.00  0  0
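If the btrfs-progs version in use has them (recent versions do), the
per-device breakdowns from these commands can help cross-check what
fi show reports:

  sudo btrfs device usage /mnt/newdata
  sudo btrfs filesystem usage /mnt/newdata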


Strange behavior when replacing device on BTRFS RAID 5 array.

2016-06-26 Thread Nick Austin
I have a 4 device BTRFS RAID 5 filesystem.

One of the device members of this file system (sdr) had badblocks, so I
decided to replace it.

(I don't have a copy of fi show from before the replace. :-/ )

I ran this command:
sudo btrfs replace start 4 /dev/sdw /mnt/newdata

I had to shrink /dev/sdr by ~250 megs since the replacement drive was a tiny bit
smaller.

Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): resizing devid 4
Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): new size for /dev/sdr is
6001175121920
Jun 25 17:27:45 frank kernel: BTRFS info (device sdr): dev_replace from /dev/sdr
 (devid 4) to /dev/sdw started

The replace started, all seemed well.

3 hours into the replace, sdr dropped off the SATA bus and was redetected
as sdx. Bummer, but shouldn't be fatal.

This event really seemed to throw BTRFS for a loop.


Jun 25 20:32:35 frank kernel: sd 10:0:19:0: device_block, handle(0x0019)
Jun 25 20:33:05 frank kernel: sd 10:0:19:0: device_unblock and setting
to running, handle(0x0019)
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] FAILED Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 25 20:33:05 frank kernel: blk_update_request: I/O error, dev sdr,
sector 1785876480
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing handle(0x0019),
sas_addr(0x500194000687e20e)
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing : enclosure
logical id(0x500194000687e23f), slot(14)
Jun 25 20:33:16 frank kernel: scsi 10:0:21:0: Direct-Access ATA
  WL6000GSA12872E  1C01 PQ: 0 ANSI: 5

Here you can see btrfs seems to figure out sdr has become sdx (based on the
"dev /dev/sdx" entry showing up on the BTRFS warning lines).

Unfortunately, all remaining IO for the device formerly known as sdr results in
btrfs errors like the ones listed below. iostat confirms no IO on sdx.

Jun 25 20:33:17 frank kernel: sd 10:0:21:0: [sdx] Attached SCSI disk
...
Jun 25 20:33:20 frank kernel: scrub_handle_errored_block: 31983
callbacks suppressed
Jun 25 20:33:20 frank kernel: BTRFS warning (device sdr): i/o error at
logical 2742536544256 on dev /dev/sdx, sector 1786897488, root 5,
inode 222965, offset 296329216, length 4096, links 1 (path:
/path/to/file)
Jun 25 20:33:20 frank kernel: btrfs_dev_stat_print_on_error: 32107
callbacks suppressed

These messages continue for many hours as the replace continues running.

sudo btrfs replace status /mnt/newdata
Started on 25.Jun 17:27:45, finished on 26.Jun 12:48:22, 0 write errs,
0 uncorr. read errs

...
Jun 26 12:48:22 frank kernel: BTRFS warning (device sdr): lost page
write due to IO error on /dev/sdx  (Many, many of these)
Jun 26 12:48:22 frank kernel: BTRFS info (device sdr): dev_replace
from /dev/sdx (devid 4) to /dev/sdw finished


Great! /dev/sdx replaced by /dev/sdw!

Now let's confirm:

sudo btrfs fi show /mnt/newdata
Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
Total devices 4 FS bytes used 8.07TiB
devid1 size 5.46TiB used 2.70TiB path /dev/sdg
devid2 size 5.46TiB used 2.70TiB path /dev/sdl
devid3 size 5.46TiB used 2.70TiB path /dev/sdm
devid4 size 5.46TiB used 2.70TiB path /dev/sdx

Bummer, this doesn't look right.

sdx is still in the array (failing drive).

Additionally, /dev/sdw isn't listed at all! Worse still, it looks like the
array has lost redundancy (it has 8TiB of data, and that's the amount shown as
used divided by number of devices). It looks like it tried to add the new
device, but did a balance across all of them instead?

% sudo btrfs fi df /mnt/newdata
Data, RAID5: total=8.07TiB, used=8.06TiB
System, RAID10: total=80.00MiB, used=576.00KiB
Metadata, RAID10: total=12.00GiB, used=10.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Any advice would be appreciated.

%  uname -a
Linux frank 4.5.5-201.fc23.x86_64 #1 SMP Sat May 21 15:29:49 UTC 2016 x86_64
x86_64 x86_64 GNU/Linux

% lsb_release
Description:Fedora release 24 (Twenty Three)

% btrfs --version
btrfs-progs v4.4.1


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Christoph Anton Mitterer
On Sun, 2016-06-26 at 15:33 -0700, ronnie sahlberg wrote:
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.
I doubt most end users can be assumed to read the wiki...


> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".

Well, if one uses --force or --yes-i-know-what-i-do and actually
doesn't know, then such a person is on their own.
People can always shoot themselves in the foot if they want to.

Actually I think that the compile-time way is inferior here.
Distros may just always enable raid56 there to allow people to continue
mounting their existing filesystems.



What should IMHO be done as well is giving a big fat warning in the
manpages/etc. that when nodatacow is used RAID recovery cannot produce
valid data (at least as long as there isn't checksumming implemented
for nodatacow).
Probably it should also be documented what btrfs does in such a
situation. E.g. does it just randomly pick a readable block from one of
the copies? Simply error out and consider the file broken? Fill the
blocks in question with zero?

Cheers,
Chris.



Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Steven Haigh

On 2016-06-27 08:38, Hugo Mills wrote:

On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:

On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Could this explain why people have been reporting so many raid56 mode
> cases of btrfs replacing a first drive appearing to succeed just fine,
> but then they go to btrfs replace a second drive, and the array crashes
> as if the first replace didn't work correctly after all, resulting in two
> bad devices once the second replace gets under way, of course bringing
> down the array?
>
> If so, then it looks like we have our answer as to what has been going
> wrong that has been so hard to properly trace and thus to bugfix.
>
> Combine that with the raid4 dedicated parity device behavior you're
> seeing if the writes are all exactly 128 MB, with that possibly
> explaining the super-slow replaces, and this thread may have just given
> us answers to both of those until-now-untraceable issues.
>
> Regardless, what's /very/ clear by now is that raid56 mode as it
> currently exists is more or less fatally flawed, and a full scrap and
> rewrite to an entirely different raid56 mode on-disk format may be
> necessary to fix it.
>
> And what's even clearer is that people /really/ shouldn't be using raid56
> mode for anything but testing with throw-away data, at this point.
> Anything else is simply irresponsible.
>
> Does that mean we need to put a "raid56 mode may eat your babies" level
> warning in the manpage and require a --force to either mkfs.btrfs or
> balance to raid56 mode?  Because that's about where I am on it.

Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there
are no misunderstandings
that they really should not use raid56 right now for new filesystems.


   I beefed up the warnings in several places in the wiki a couple of
days ago.


Not to sound rude, but I don't think these go anywhere near far enough.
It needs to be completely obvious that there's a good chance you'll lose
everything. IMHO that's the only way to stop BTRFS from getting a
'data eater' reputation. It can be revisited and reworded when the
implementation is better tested and more stable.


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Steven Haigh

On 2016-06-27 08:33, ronnie sahlberg wrote:

On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:

Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:

Wow. So it sees the data strip corruption, uses good parity on disk to
fix it, writes the fix to disk, recomputes parity for some reason but
does it wrongly, and then overwrites good parity with bad parity?
That's fucked. So in other words, if there are any errors fixed up
during a scrub, you should do a 2nd scrub. The first scrub should make
sure data is correct, and the 2nd scrub should make sure the bug is
papered over by computing correct parity and replacing the bad parity.

I wonder if the same problem happens with balance or if this is just a
bug in scrub code?


Could this explain why people have been reporting so many raid56 mode
cases of btrfs replacing a first drive appearing to succeed just fine,
but then they go to btrfs replace a second drive, and the array crashes
as if the first replace didn't work correctly after all, resulting in
two bad devices once the second replace gets under way, of course
bringing down the array?

If so, then it looks like we have our answer as to what has been going
wrong that has been so hard to properly trace and thus to bugfix.

Combine that with the raid4 dedicated parity device behavior you're
seeing if the writes are all exactly 128 MB, with that possibly
explaining the super-slow replaces, and this thread may have just given
us answers to both of those until-now-untraceable issues.

Regardless, what's /very/ clear by now is that raid56 mode as it
currently exists is more or less fatally flawed, and a full scrap and
rewrite to an entirely different raid56 mode on-disk format may be
necessary to fix it.

And what's even clearer is that people /really/ shouldn't be using
raid56 mode for anything but testing with throw-away data, at this
point. Anything else is simply irresponsible.

Does that mean we need to put a "raid56 mode may eat your babies" level
warning in the manpage and require a --force to either mkfs.btrfs or
balance to raid56 mode?  Because that's about where I am on it.


Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


+1


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there
are no misunderstandings
that they really should not use raid56 right now for new filesystems.


I voiced my concern on #btrfs about this - it really should make clear
that this may eat your data and is truly experimental. At the moment, it
looks as if the features are implemented and working as expected. In my
case, with nothing out of the ordinary, I've now got ~3.8Tb of free disk
space. Certainly not ready for *ANY* kind of public use.



2, Instead of a --force flag. (Users tend to ignore --force and
warnings in documentation.)
Instead ifdef out the options to create raid56 in mkfs.btrfs.
Developers who want to test can just remove the ifdef and recompile
the tools anyway.
But if end-users have to recompile userspace, that really forces the
point that "you
really should not use this right now".


I think this is a somewhat good idea - however it should be a warning 
along the lines of:
"BTRFS RAID56 is VERY experimental and is known to corrupt data in 
certain cases. Use at your own risk!


Continue? (y/N):"


3, reach out to the documentation and fora for the major distros and
make sure they update their
documentation accordingly.
I think a lot of end-users, if they try to research something, are
more likely to go to  fora and wiki
than search out an upstream fora.


Another good idea.

I'd also recommend updates to the ArchLinux wiki - as for some reason I 
always seem to end up there when searching for a certain topic...


--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897


Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-26 Thread Qu Wenruo



At 06/25/2016 01:45 PM, Chandan Rajendra wrote:

On Saturday, June 25, 2016 09:22:44 AM Qu Wenruo wrote:


On 06/24/2016 05:29 PM, Chandan Rajendra wrote:

On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:

Hi Chandan, David,

While trying to rebase the dedupe patchset on top of Chandan's sub-page
size patchset (using David's for-next-test-20160620), the rebase itself
was quite simple, but I'm afraid I found some bugs in the sub-page size
patchset, *without* the dedupe patchset applied.

These bugs seem to be unrelated to each other:
1) state leak at btrfs rmmod time


The leak was due to not freeing 'cached_state' in
read_extent_buffer_pages(). I have fixed this, and the fix will be part
of the patchset when I post the next version to the mailing list.

I have always compiled the btrfs code as part of the vmlinux image and
hence have never removed the btrfs module during my local testing. The
space leak messages might have appeared when I shut down my guest, so I
had never noticed them before. Thanks once again for informing me about it.


2) bytes_may_use leak at qgroup EDQUOTA error time


I have a slightly older version of btrfs-progs which does not yet have
the "btrfs dedupe" command. I will get the new version and check whether
the space leak can be reproduced on my machine.

However, I don't see the space leak warning messages when the reproducer
script is executed after commenting out "btrfs dedupe enable $mnt".


Strange.
That dedupe command is not useful at all, as I'm using the branch
without the dedupe patchset.
Even with the btrfs-progs dedupe patchset, dedupe enable only outputs an
ENOTTY error message.

I'll double check whether it's related to dedupe.

BTW, are you testing with 4K page size?


Yes, I executed the script with 4k page size. I had based my patchset on top
of 4.7-rc2 kernel. If you are interested, you can get the kernel sources at
'https://github.com/chandanr/linux subpagesize-blocksize'.

I will soon rebase my patchset on David's master branch. I will let you know
if I hit the space leak issue on the rebased kernel.



Thanks for your info.

Confirmed that the leaked space is unrelated to your sub-page size
patchset.

It's another patchset causing the bug.
I've bisected it and will inform the author.

Many thanks for your help.
Qu




Re: [PATCH 04/31] btrfs: tests, move initialization into tests/

2016-06-26 Thread Jeff Mahoney
On 6/26/16 10:17 PM, Qu Wenruo wrote:
> 
> 
> At 06/25/2016 06:14 AM, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> We have all these stubs that only exist because they're called from
>> btrfs_run_sanity_tests, which is a static inside super.c.  Let's just
>> move it all into tests/btrfs-tests.c and only have one stub.
>>
>> Signed-off-by: Jeff Mahoney 
>> ---
>>  fs/btrfs/super.c | 43
>> 
>>  fs/btrfs/tests/btrfs-tests.c | 47
>> ++--
>>  fs/btrfs/tests/btrfs-tests.h | 35 +++--
>>  3 files changed, 48 insertions(+), 77 deletions(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index a7b9a15..d8e48bb 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -2323,49 +2323,6 @@ static void btrfs_print_mod_info(void)
>>  btrfs_crc32c_impl());
>>  }
>>
>> -static int btrfs_run_sanity_tests(void)
>> -{
>> -int ret, i;
>> -u32 sectorsize, nodesize;
>> -u32 test_sectorsize[] = {
>> -PAGE_SIZE,
>> -};
>> -ret = btrfs_init_test_fs();
>> -if (ret)
>> -return ret;
>> -for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
>> -sectorsize = test_sectorsize[i];
>> -for (nodesize = sectorsize;
>> - nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
>> - nodesize <<= 1) {
>> -pr_info("BTRFS: selftest: sectorsize: %u  nodesize: %u\n",
>> -sectorsize, nodesize);
>> -ret = btrfs_test_free_space_cache(sectorsize, nodesize);
>> -if (ret)
>> -goto out;
>> -ret = btrfs_test_extent_buffer_operations(sectorsize,
>> -nodesize);
>> -if (ret)
>> -goto out;
>> -ret = btrfs_test_extent_io(sectorsize, nodesize);
>> -if (ret)
>> -goto out;
>> -ret = btrfs_test_inodes(sectorsize, nodesize);
>> -if (ret)
>> -goto out;
>> -ret = btrfs_test_qgroups(sectorsize, nodesize);
>> -if (ret)
>> -goto out;
>> -ret = btrfs_test_free_space_tree(sectorsize, nodesize);
>> -if (ret)
>> -goto out;
>> -}
>> -}
>> -out:
>> -btrfs_destroy_test_fs();
>> -return ret;
>> -}
>> -
>>  static int __init init_btrfs_fs(void)
>>  {
>>  int err;
>> diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
>> index 10eb249..d90c951 100644
>> --- a/fs/btrfs/tests/btrfs-tests.c
>> +++ b/fs/btrfs/tests/btrfs-tests.c
>> @@ -54,7 +54,7 @@ struct inode *btrfs_new_test_inode(void)
>>  return new_inode(test_mnt->mnt_sb);
>>  }
>>
>> -int btrfs_init_test_fs(void)
>> +static int btrfs_init_test_fs(void)
>>  {
>>  int ret;
>>
>> @@ -73,7 +73,7 @@ int btrfs_init_test_fs(void)
>>  return 0;
>>  }
>>
>> -void btrfs_destroy_test_fs(void)
>> +static void btrfs_destroy_test_fs(void)
>>  {
>>  kern_unmount(test_mnt);
>>  unregister_filesystem(&test_type);
>> @@ -220,3 +220,46 @@ void btrfs_init_dummy_trans(struct
>> btrfs_trans_handle *trans)
>>  INIT_LIST_HEAD(&trans->qgroup_ref_list);
>>  trans->type = __TRANS_DUMMY;
>>  }
>> +
>> +int btrfs_run_sanity_tests(void)
>> +{
>> +int ret, i;
>> +u32 sectorsize, nodesize;
>> +u32 test_sectorsize[] = {
>> +PAGE_SIZE,
>> +};
>> +ret = btrfs_init_test_fs();
>> +if (ret)
>> +return ret;
>> +for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
>> +sectorsize = test_sectorsize[i];
>> +for (nodesize = sectorsize;
>> + nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
>> + nodesize <<= 1) {
>> +pr_info("BTRFS: selftest: sectorsize: %u  nodesize: %u\n",
>> +sectorsize, nodesize);
>> +ret = btrfs_test_free_space_cache(sectorsize, nodesize);
>> +if (ret)
>> +goto out;
>> +ret = btrfs_test_extent_buffer_operations(sectorsize,
>> +nodesize);
>> +if (ret)
>> +goto out;
>> +ret = btrfs_test_extent_io(sectorsize, nodesize);
>> +if (ret)
>> +goto out;
>> +ret = btrfs_test_inodes(sectorsize, nodesize);
>> +if (ret)
>> +goto out;
>> +ret = btrfs_test_qgroups(sectorsize, nodesize);
>> +if (ret)
>> +goto out;
>> +ret = btrfs_test_free_space_tree(sectorsize, nodesize);
>> +if (ret)
>> +goto out;
>> +}
>> +}
>> +out:
>> +btrfs_destroy_test_fs();
>> +return ret;
>> +}
>> diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
>> index 66fb6b70..e7d364f 100644
>> --- a/fs/btrfs/tests/btrfs-tests.h
>> +++ b/fs/btrfs/tests/btrfs-tests.h
>> @@ -20,20 +20,19 @@
>>  #define __BTRFS_TESTS
>>
>>  #ifdef CONFIG_BTRFS_F

Re: [PATCH 03/31] btrfs: btrfs_test_opt and friends should take a btrfs_fs_info

2016-06-26 Thread Qu Wenruo



At 06/27/2016 10:21 AM, Jeff Mahoney wrote:

On 6/26/16 10:14 PM, Qu Wenruo wrote:



At 06/25/2016 06:14 AM, je...@suse.com wrote:

From: Jeff Mahoney 

btrfs_test_opt and friends only use the root pointer to access
the fs_info.  Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h|  22 
 fs/btrfs/delayed-inode.c|   2 +-
 fs/btrfs/dev-replace.c  |   4 +-
 fs/btrfs/disk-io.c  |  22 
 fs/btrfs/extent-tree.c  |  32 +--
 fs/btrfs/file.c |   2 +-
 fs/btrfs/free-space-cache.c |   6 +-
 fs/btrfs/inode-map.c|  12 ++--
 fs/btrfs/inode.c|  12 ++--
 fs/btrfs/ioctl.c|   2 +-
 fs/btrfs/super.c| 132
+++-
 fs/btrfs/transaction.c  |   6 +-
 fs/btrfs/tree-log.c |   4 +-
 fs/btrfs/volumes.c  |  11 ++--
 14 files changed, 137 insertions(+), 132 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 101c3cf..100d2ea 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1297,21 +1297,21 @@ struct btrfs_root {
 #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt)
 #define btrfs_raw_test_opt(o, opt)((o) & BTRFS_MOUNT_##opt)
-#define btrfs_test_opt(root, opt)((root)->fs_info->mount_opt & \
+#define btrfs_test_opt(fs_info, opt)((fs_info)->mount_opt & \
  BTRFS_MOUNT_##opt)

-#define btrfs_set_and_info(root, opt, fmt, args...)\
+#define btrfs_set_and_info(fs_info, opt, fmt, args...)\
 {\
-if (!btrfs_test_opt(root, opt))\
-btrfs_info(root->fs_info, fmt, ##args);\
-btrfs_set_opt(root->fs_info->mount_opt, opt);\
+if (!btrfs_test_opt(fs_info, opt))\
+btrfs_info(fs_info, fmt, ##args);\
+btrfs_set_opt(fs_info->mount_opt, opt);\
 }

-#define btrfs_clear_and_info(root, opt, fmt, args...)\
+#define btrfs_clear_and_info(fs_info, opt, fmt, args...)\
 {\
-if (btrfs_test_opt(root, opt))\
-btrfs_info(root->fs_info, fmt, ##args);\
-btrfs_clear_opt(root->fs_info->mount_opt, opt);\
+if (btrfs_test_opt(fs_info, opt))\
+btrfs_info(fs_info, fmt, ##args);\
+btrfs_clear_opt(fs_info->mount_opt, opt);\
 }

 #ifdef CONFIG_BTRFS_DEBUG
@@ -1319,9 +1319,9 @@ static inline int
 btrfs_should_fragment_free_space(struct btrfs_root *root,
  struct btrfs_block_group_cache *block_group)
 {
-return (btrfs_test_opt(root, FRAGMENT_METADATA) &&
+return (btrfs_test_opt(root->fs_info, FRAGMENT_METADATA) &&
 block_group->flags & BTRFS_BLOCK_GROUP_METADATA) ||
-   (btrfs_test_opt(root, FRAGMENT_DATA) &&
+   (btrfs_test_opt(root->fs_info, FRAGMENT_DATA) &&
 block_group->flags &  BTRFS_BLOCK_GROUP_DATA);
 }
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 61561c2..ed67717 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -653,7 +653,7 @@ static int btrfs_delayed_inode_reserve_metadata(
 if (!ret)
 goto out;

-if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
+if (btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) {
 btrfs_debug(root->fs_info,
 "block rsv migrate returned %d", ret);
 WARN_ON(1);
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 63ef9cd..e9bbff3 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -142,7 +142,7 @@ no_valid_dev_replace_entry_found:
  * missing
  */
 if (!dev_replace->srcdev &&
-!btrfs_test_opt(dev_root, DEGRADED)) {
+!btrfs_test_opt(dev_root->fs_info, DEGRADED)) {

Just use fs_info here, as the following btrfs_warn() already uses fs_info.


 ret = -EIO;
 btrfs_warn(fs_info,
"cannot mount because device replace operation is
ongoing and");
@@ -151,7 +151,7 @@ no_valid_dev_replace_entry_found:
src_devid);
 }
 if (!dev_replace->tgtdev &&
-!btrfs_test_opt(dev_root, DEGRADED)) {
+!btrfs_test_opt(dev_root->fs_info, DEGRADED)) {


Same here.


 ret = -EIO;
 btrfs_warn(fs_info,
"cannot mount because device replace operation is
ongoing and");
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 685c81a..8f27127 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3025,8 +3025,8 @@ retry_root_backup:
 if (IS_ERR(fs_info->transaction_kthread))
 goto fail_cleaner;

-if (!btrfs_test_opt(tree_root, SSD) &&
-!btrfs_test_opt(

Re: [PATCH 03/31] btrfs: btrfs_test_opt and friends should take a btrfs_fs_info

2016-06-26 Thread Jeff Mahoney
On 6/26/16 10:14 PM, Qu Wenruo wrote:
> 
> 
> At 06/25/2016 06:14 AM, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> btrfs_test_opt and friends only use the root pointer to access
>> the fs_info.  Let's pass the fs_info directly in preparation to
>> eliminate similar patterns all over btrfs.
>>
>> Signed-off-by: Jeff Mahoney 
>> ---
>>  fs/btrfs/ctree.h|  22 
>>  fs/btrfs/delayed-inode.c|   2 +-
>>  fs/btrfs/dev-replace.c  |   4 +-
>>  fs/btrfs/disk-io.c  |  22 
>>  fs/btrfs/extent-tree.c  |  32 +--
>>  fs/btrfs/file.c |   2 +-
>>  fs/btrfs/free-space-cache.c |   6 +-
>>  fs/btrfs/inode-map.c|  12 ++--
>>  fs/btrfs/inode.c|  12 ++--
>>  fs/btrfs/ioctl.c|   2 +-
>>  fs/btrfs/super.c| 132
>> +++-
>>  fs/btrfs/transaction.c  |   6 +-
>>  fs/btrfs/tree-log.c |   4 +-
>>  fs/btrfs/volumes.c  |  11 ++--
>>  14 files changed, 137 insertions(+), 132 deletions(-)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 101c3cf..100d2ea 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1297,21 +1297,21 @@ struct btrfs_root {
>>  #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt)
>>  #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt)
>>  #define btrfs_raw_test_opt(o, opt)((o) & BTRFS_MOUNT_##opt)
>> -#define btrfs_test_opt(root, opt)((root)->fs_info->mount_opt & \
>> +#define btrfs_test_opt(fs_info, opt)((fs_info)->mount_opt & \
>>   BTRFS_MOUNT_##opt)
>>
>> -#define btrfs_set_and_info(root, opt, fmt, args...)\
>> +#define btrfs_set_and_info(fs_info, opt, fmt, args...)\
>>  {\
>> -if (!btrfs_test_opt(root, opt))\
>> -btrfs_info(root->fs_info, fmt, ##args);\
>> -btrfs_set_opt(root->fs_info->mount_opt, opt);\
>> +if (!btrfs_test_opt(fs_info, opt))\
>> +btrfs_info(fs_info, fmt, ##args);\
>> +btrfs_set_opt(fs_info->mount_opt, opt);\
>>  }
>>
>> -#define btrfs_clear_and_info(root, opt, fmt, args...)\
>> +#define btrfs_clear_and_info(fs_info, opt, fmt, args...)\
>>  {\
>> -if (btrfs_test_opt(root, opt))\
>> -btrfs_info(root->fs_info, fmt, ##args);\
>> -btrfs_clear_opt(root->fs_info->mount_opt, opt);\
>> +if (btrfs_test_opt(fs_info, opt))\
>> +btrfs_info(fs_info, fmt, ##args);\
>> +btrfs_clear_opt(fs_info->mount_opt, opt);\
>>  }
>>
>>  #ifdef CONFIG_BTRFS_DEBUG
>> @@ -1319,9 +1319,9 @@ static inline int
>>  btrfs_should_fragment_free_space(struct btrfs_root *root,
>>   struct btrfs_block_group_cache *block_group)
>>  {
>> -return (btrfs_test_opt(root, FRAGMENT_METADATA) &&
>> +return (btrfs_test_opt(root->fs_info, FRAGMENT_METADATA) &&
>>  block_group->flags & BTRFS_BLOCK_GROUP_METADATA) ||
>> -   (btrfs_test_opt(root, FRAGMENT_DATA) &&
>> +   (btrfs_test_opt(root->fs_info, FRAGMENT_DATA) &&
>>  block_group->flags &  BTRFS_BLOCK_GROUP_DATA);
>>  }
>>  #endif
>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
>> index 61561c2..ed67717 100644
>> --- a/fs/btrfs/delayed-inode.c
>> +++ b/fs/btrfs/delayed-inode.c
>> @@ -653,7 +653,7 @@ static int btrfs_delayed_inode_reserve_metadata(
>>  if (!ret)
>>  goto out;
>>
>> -if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
>> +if (btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) {
>>  btrfs_debug(root->fs_info,
>>  "block rsv migrate returned %d", ret);
>>  WARN_ON(1);
>> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
>> index 63ef9cd..e9bbff3 100644
>> --- a/fs/btrfs/dev-replace.c
>> +++ b/fs/btrfs/dev-replace.c
>> @@ -142,7 +142,7 @@ no_valid_dev_replace_entry_found:
>>   * missing
>>   */
>>  if (!dev_replace->srcdev &&
>> -!btrfs_test_opt(dev_root, DEGRADED)) {
>> +!btrfs_test_opt(dev_root->fs_info, DEGRADED)) {
> Just use fs_info here, as the following btrfs_warn() already uses fs_info.
> 
>>  ret = -EIO;
>>  btrfs_warn(fs_info,
>> "cannot mount because device replace operation is
>> ongoing and");
>> @@ -151,7 +151,7 @@ no_valid_dev_replace_entry_found:
>> src_devid);
>>  }
>>  if (!dev_replace->tgtdev &&
>> -!btrfs_test_opt(dev_root, DEGRADED)) {
>> +!btrfs_test_opt(dev_root->fs_info, DEGRADED)) {
> 
> Same here.
> 
>>  ret = -EIO;
>>  btrfs_warn(fs_info,
>> "cannot mount because device replace operation is
>> ongoing and");
>> diff --git a/fs/btrfs/disk-io.c b/

Re: [PATCH 04/31] btrfs: tests, move initialization into tests/

2016-06-26 Thread Qu Wenruo



At 06/25/2016 06:14 AM, je...@suse.com wrote:

From: Jeff Mahoney 

We have all these stubs that only exist because they're called from
btrfs_run_sanity_tests, which is a static inside super.c.  Let's just
move it all into tests/btrfs-tests.c and only have one stub.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/super.c | 43 
 fs/btrfs/tests/btrfs-tests.c | 47 ++--
 fs/btrfs/tests/btrfs-tests.h | 35 +++--
 3 files changed, 48 insertions(+), 77 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a7b9a15..d8e48bb 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2323,49 +2323,6 @@ static void btrfs_print_mod_info(void)
btrfs_crc32c_impl());
 }

-static int btrfs_run_sanity_tests(void)
-{
-   int ret, i;
-   u32 sectorsize, nodesize;
-   u32 test_sectorsize[] = {
-   PAGE_SIZE,
-   };
-   ret = btrfs_init_test_fs();
-   if (ret)
-   return ret;
-   for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
-   sectorsize = test_sectorsize[i];
-   for (nodesize = sectorsize;
-nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
-nodesize <<= 1) {
-   pr_info("BTRFS: selftest: sectorsize: %u  nodesize: 
%u\n",
-   sectorsize, nodesize);
-   ret = btrfs_test_free_space_cache(sectorsize, nodesize);
-   if (ret)
-   goto out;
-   ret = btrfs_test_extent_buffer_operations(sectorsize,
-   nodesize);
-   if (ret)
-   goto out;
-   ret = btrfs_test_extent_io(sectorsize, nodesize);
-   if (ret)
-   goto out;
-   ret = btrfs_test_inodes(sectorsize, nodesize);
-   if (ret)
-   goto out;
-   ret = btrfs_test_qgroups(sectorsize, nodesize);
-   if (ret)
-   goto out;
-   ret = btrfs_test_free_space_tree(sectorsize, nodesize);
-   if (ret)
-   goto out;
-   }
-   }
-out:
-   btrfs_destroy_test_fs();
-   return ret;
-}
-
 static int __init init_btrfs_fs(void)
 {
int err;
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index 10eb249..d90c951 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -54,7 +54,7 @@ struct inode *btrfs_new_test_inode(void)
return new_inode(test_mnt->mnt_sb);
 }

-int btrfs_init_test_fs(void)
+static int btrfs_init_test_fs(void)
 {
int ret;

@@ -73,7 +73,7 @@ int btrfs_init_test_fs(void)
return 0;
 }

-void btrfs_destroy_test_fs(void)
+static void btrfs_destroy_test_fs(void)
 {
kern_unmount(test_mnt);
unregister_filesystem(&test_type);
@@ -220,3 +220,46 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle 
*trans)
INIT_LIST_HEAD(&trans->qgroup_ref_list);
trans->type = __TRANS_DUMMY;
 }
+
+int btrfs_run_sanity_tests(void)
+{
+   int ret, i;
+   u32 sectorsize, nodesize;
+   u32 test_sectorsize[] = {
+   PAGE_SIZE,
+   };
+   ret = btrfs_init_test_fs();
+   if (ret)
+   return ret;
+   for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
+   sectorsize = test_sectorsize[i];
+   for (nodesize = sectorsize;
+nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
+nodesize <<= 1) {
+   pr_info("BTRFS: selftest: sectorsize: %u  nodesize: 
%u\n",
+   sectorsize, nodesize);
+   ret = btrfs_test_free_space_cache(sectorsize, nodesize);
+   if (ret)
+   goto out;
+   ret = btrfs_test_extent_buffer_operations(sectorsize,
+   nodesize);
+   if (ret)
+   goto out;
+   ret = btrfs_test_extent_io(sectorsize, nodesize);
+   if (ret)
+   goto out;
+   ret = btrfs_test_inodes(sectorsize, nodesize);
+   if (ret)
+   goto out;
+   ret = btrfs_test_qgroups(sectorsize, nodesize);
+   if (ret)
+   goto out;
+   ret = btrfs_test_free_space_tree(sectorsize, nodesize);
+   if (ret)
+   goto out;
+   }
+   }
+out:
+   btrfs_destroy_test_fs();
+   return r

Re: [PATCH 03/31] btrfs: btrfs_test_opt and friends should take a btrfs_fs_info

2016-06-26 Thread Qu Wenruo



At 06/25/2016 06:14 AM, je...@suse.com wrote:

From: Jeff Mahoney 

btrfs_test_opt and friends only use the root pointer to access
the fs_info.  Let's pass the fs_info directly in preparation to
eliminate similar patterns all over btrfs.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/ctree.h|  22 
 fs/btrfs/delayed-inode.c|   2 +-
 fs/btrfs/dev-replace.c  |   4 +-
 fs/btrfs/disk-io.c  |  22 
 fs/btrfs/extent-tree.c  |  32 +--
 fs/btrfs/file.c |   2 +-
 fs/btrfs/free-space-cache.c |   6 +-
 fs/btrfs/inode-map.c|  12 ++--
 fs/btrfs/inode.c|  12 ++--
 fs/btrfs/ioctl.c|   2 +-
 fs/btrfs/super.c| 132 +++-
 fs/btrfs/transaction.c  |   6 +-
 fs/btrfs/tree-log.c |   4 +-
 fs/btrfs/volumes.c  |  11 ++--
 14 files changed, 137 insertions(+), 132 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 101c3cf..100d2ea 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1297,21 +1297,21 @@ struct btrfs_root {
 #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
 #define btrfs_raw_test_opt(o, opt) ((o) & BTRFS_MOUNT_##opt)
-#define btrfs_test_opt(root, opt)  ((root)->fs_info->mount_opt & \
+#define btrfs_test_opt(fs_info, opt)   ((fs_info)->mount_opt & \
 BTRFS_MOUNT_##opt)

-#define btrfs_set_and_info(root, opt, fmt, args...)\
+#define btrfs_set_and_info(fs_info, opt, fmt, args...) \
 {  \
-   if (!btrfs_test_opt(root, opt)) \
-   btrfs_info(root->fs_info, fmt, ##args);  \
-   btrfs_set_opt(root->fs_info->mount_opt, opt); \
+   if (!btrfs_test_opt(fs_info, opt))  \
+   btrfs_info(fs_info, fmt, ##args);   \
+   btrfs_set_opt(fs_info->mount_opt, opt);  \
 }

-#define btrfs_clear_and_info(root, opt, fmt, args...)  \
+#define btrfs_clear_and_info(fs_info, opt, fmt, args...)   \
 {  \
-   if (btrfs_test_opt(root, opt))  \
-   btrfs_info(root->fs_info, fmt, ##args);  \
-   btrfs_clear_opt(root->fs_info->mount_opt, opt);   \
+   if (btrfs_test_opt(fs_info, opt))   \
+   btrfs_info(fs_info, fmt, ##args);   \
+   btrfs_clear_opt(fs_info->mount_opt, opt);\
 }

 #ifdef CONFIG_BTRFS_DEBUG
@@ -1319,9 +1319,9 @@ static inline int
 btrfs_should_fragment_free_space(struct btrfs_root *root,
 struct btrfs_block_group_cache *block_group)
 {
-   return (btrfs_test_opt(root, FRAGMENT_METADATA) &&
+   return (btrfs_test_opt(root->fs_info, FRAGMENT_METADATA) &&
block_group->flags & BTRFS_BLOCK_GROUP_METADATA) ||
-  (btrfs_test_opt(root, FRAGMENT_DATA) &&
+  (btrfs_test_opt(root->fs_info, FRAGMENT_DATA) &&
block_group->flags &  BTRFS_BLOCK_GROUP_DATA);
 }
 #endif
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 61561c2..ed67717 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -653,7 +653,7 @@ static int btrfs_delayed_inode_reserve_metadata(
if (!ret)
goto out;

-   if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
+   if (btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) {
btrfs_debug(root->fs_info,
"block rsv migrate returned %d", ret);
WARN_ON(1);
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 63ef9cd..e9bbff3 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -142,7 +142,7 @@ no_valid_dev_replace_entry_found:
 * missing
 */
if (!dev_replace->srcdev &&
-   !btrfs_test_opt(dev_root, DEGRADED)) {
+   !btrfs_test_opt(dev_root->fs_info, DEGRADED)) {

Just use fs_info here, as the following btrfs_warn() already uses fs_info.


ret = -EIO;
btrfs_warn(fs_info,
   "cannot mount because device replace operation is ongoing 
and");
@@ -151,7 +151,7 @@ no_valid_dev_replace_entry_found:
   src_devid);
}
if (!dev_replace->tgtdev &&
-   !btrfs_test_opt(dev_root, DEGRADED)) {
+   !btrfs_test_opt(dev_root->fs_info, DEGRADED)) {


Same here.


 

Re: [PATCH 02/31] btrfs: prefix fsid to all trace events

2016-06-26 Thread Qu Wenruo



At 06/25/2016 06:14 AM, je...@suse.com wrote:

From: Jeff Mahoney 

When using trace events to debug a problem, it's impossible to determine
which file system generated a particular event.  This patch adds a
macro to prefix standard information to the head of a trace event.

The extent_state alloc/free events are all that's left without an
fs_info available.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/delayed-ref.c   |   9 +-
 fs/btrfs/extent-tree.c   |  10 +-
 fs/btrfs/qgroup.c|  19 +--
 fs/btrfs/qgroup.h|   9 +-
 fs/btrfs/super.c |   2 +-
 include/trace/events/btrfs.h | 282 ---
 6 files changed, 182 insertions(+), 149 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..e7b1ec0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -606,7 +606,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
qrecord->num_bytes = num_bytes;
qrecord->old_roots = NULL;

-   qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs,
+   qexisting = btrfs_qgroup_insert_dirty_extent(fs_info,
+delayed_refs,
 qrecord);
if (qexisting)
kfree(qrecord);
@@ -615,7 +616,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
spin_lock_init(&head_ref->lock);
mutex_init(&head_ref->mutex);

-   trace_add_delayed_ref_head(ref, head_ref, action);
+   trace_add_delayed_ref_head(fs_info, ref, head_ref, action);

existing = htree_insert(&delayed_refs->href_root,
&head_ref->href_node);
@@ -682,7 +683,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
ref->type = BTRFS_TREE_BLOCK_REF_KEY;
full_ref->level = level;

-   trace_add_delayed_tree_ref(ref, full_ref, action);
+   trace_add_delayed_tree_ref(fs_info, ref, full_ref, action);

ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);

@@ -739,7 +740,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
full_ref->objectid = owner;
full_ref->offset = offset;

-   trace_add_delayed_data_ref(ref, full_ref, action);
+   trace_add_delayed_data_ref(fs_info, ref, full_ref, action);

ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 29e5d00..39308a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2194,7 +2194,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle 
*trans,
ins.type = BTRFS_EXTENT_ITEM_KEY;

ref = btrfs_delayed_node_to_data_ref(node);
-   trace_run_delayed_data_ref(node, ref, node->action);
+   trace_run_delayed_data_ref(root->fs_info, node, ref, node->action);

if (node->type == BTRFS_SHARED_DATA_REF_KEY)
parent = ref->parent;
@@ -2349,7 +2349,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
 SKINNY_METADATA);

ref = btrfs_delayed_node_to_tree_ref(node);
-   trace_run_delayed_tree_ref(node, ref, node->action);
+   trace_run_delayed_tree_ref(root->fs_info, node, ref, node->action);

if (node->type == BTRFS_SHARED_BLOCK_REF_KEY)
parent = ref->parent;
@@ -2413,7 +2413,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
 */
BUG_ON(extent_op);
head = btrfs_delayed_node_to_head(node);
-   trace_run_delayed_ref_head(node, head, node->action);
+   trace_run_delayed_ref_head(root->fs_info, node, head,
+  node->action);

if (insert_reserved) {
btrfs_pin_extent(root, node->bytenr,
@@ -8317,7 +8318,8 @@ static int record_one_subtree_extent(struct 
btrfs_trans_handle *trans,

delayed_refs = &trans->transaction->delayed_refs;
spin_lock(&delayed_refs->lock);
-   if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+   if (btrfs_qgroup_insert_dirty_extent(trans->root->fs_info,
+delayed_refs, qrecord))
kfree(qrecord);
spin_unlock(&delayed_refs->lock);

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9d4c05b..13e28d8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1453,9 +1453,10 @@ int btrfs_qgroup_prepare_account_extents(struct 
btrfs_trans_handle *trans,
return ret;
 }

-struct btrfs_qgroup_extent_record
-*btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
- struct btrfs_qgroup_extent_record *record)
+struct btrfs_qgroup_extent_record *
+btrfs_qgroup_insert_dirty_ext

Re: [PATCH 00/31] btrfs: simplify use of struct btrfs_root pointers

2016-06-26 Thread Qu Wenruo



At 06/25/2016 06:14 AM, je...@suse.com wrote:

From: Jeff Mahoney 

One of the common complaints I've heard from new and experienced
developers alike about the btrfs code is the ubiquity of
struct btrfs_root.  There is one for every tree on disk and it's not
always obvious which root is needed in a particular call path.  It can
be frustrating to spend time figuring out which root is required only
to discover that it's not actually used for anything other than
getting the fs-global struct btrfs_fs_info.


Couldn't agree more.

btrfs_root should only be used for cases that involve searching a
specific fs tree.

For non-fs trees, like the extent/chunk/root and other trees, we should
get them from fs_info rather than passing a btrfs_root.


So for this part, I totally agree with the modification.



The patchset contains several sections.

1) The fsid trace event patchset I posted earlier; I can rebase without this
   but I'd prefer not to.


IMHO the fsid output in ftrace is too long, making the trace harder to
read.

What's more, some developers, like me, test with only one btrfs
filesystem.

In that case, the ftrace fsid output is completely redundant.

So it would be quite nice to limit the fsid to the first 8 hex characters.

(The first time I tested David's for-next branch, the ftrace output was
quite hard to read; I had to strip the fsids out to get at the data I
needed.)




2) Converting btrfs_test_opt and friends to use an fs_info.

3) Converting tests to use an fs_info pointer whenever a root is used.

4) Moving sectorsize and nodesize to fs_info and cleaning up the
   macros used to access them.


Yeah! nodesize inside btrfs_root was never sane.
I still remember the days when I needed to get the nodesize from either
fs_info->sb_copy or fs_info->root_tree.


I'll check each patch more carefully then, but the direction is quite good.
Good job!

Thanks,
Qu



5) General cleanups and small fixes to make the later patches easier to
   review.

6) Adding an "fs_info" convenience variable to every function that
   references root->fs_info more than once.  As new references appear
   in functions, more of these are added later.

7) Call functions that always overwrite their root parameter with
   an fs_info instead.

8) Call functions that are always called using the same root with
   an fs_info instead, retrieving the root internally.

9) Convert every function that accepts a root argument but only uses it
   to retrieve an fs_info to accept an fs_info instead.  This is a
   recursive process as the changes percolate up.

10) Separately convert btrfs_commit_transaction and btrfs_end_transaction
to use trans->root instead of a root parameter.  Both are always called
with the same root that was passed to btrfs_{start,join}_transaction.

This series of patches in email format is the "squashed" version of
the full development series.  That series is available at:

git://git.kernel.org/pub/scm/linux/kernel/git/jeffm/linux-btrfs.git

There are two branches of interest:
- btrfs-testing/root-fsinfo-cleanup-squashed contains this series
- btrfs-testing/root-fsinfo-cleanup contains the full series

The btrfs-testing/root-fsinfo-cleanup branch is easier to review if using
git as each change is discrete.  As a whole, the patchset is invasive but
should change execution minimally.  It passes and fails the same xfstests
that the unpatched kernel does across multiple runs.

Thanks,

-Jeff

---

Jeff Mahoney (31):
  btrfs: plumb fs_info into btrfs_work
  btrfs: prefix fsid to all trace events
  btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
  btrfs: tests, move initialization into tests/
  btrfs: tests, require fs_info for root
  btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
  btrfs: simpilify btrfs_subvol_inherit_props
  btrfs: copy_to_sk drop unused root parameter
  btrfs: cleanup, remove prototype for btrfs_find_root_ref
  btrfs: introduce BTRFS_MAX_ITEM_SIZE
  btrfs: convert nodesize macros to static inlines
  btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
  btrfs: add btrfs_trans_handle->fs_info pointer
  btrfs: btrfs_abort_transaction, drop root parameter
  btrfs: call functions that overwrite their root parameter with fs_info
  btrfs: call functions that always use the same root with fs_info
instead
  btrfs: btrfs_init_new_device should use fs_info->dev_root
  btrfs: alloc_reserved_file_extent trace point should use extent_root
  btrfs: struct btrfsic_state->root should be an fs_info
  btrfs: struct reada_control.root -> reada_control.fs_info
  btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
  btrfs: root->fs_info cleanup, io_ctl_init
  btrfs: pull node/sector/stripe sizes out of root and into fs_info
  btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
  btrfs: root->fs_info cleanup, lock/unlock_chunks
  btrfs: root->fs_info cleanup, update_block_group{,flags}
  btrfs: root->fs_info cleanup, add fs_info convenience variables
  btrfs: 

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Hugo Mills
On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > Could this explain why people have been reporting so many raid56 mode
> > cases of btrfs replacing a first drive appearing to succeed just fine,
> > but then they go to btrfs replace a second drive, and the array crashes
> > as if the first replace didn't work correctly after all, resulting in two
> > bad devices once the second replace gets under way, of course bringing
> > down the array?
> >
> > If so, then it looks like we have our answer as to what has been going
> > wrong that has been so hard to properly trace and thus to bugfix.
> >
> > Combine that with the raid4 dedicated parity device behavior you're
> > seeing if the writes are all exactly 128 MB, with that possibly
> > explaining the super-slow replaces, and this thread may have just given
> > us answers to both of those until-now-untraceable issues.
> >
> > Regardless, what's /very/ clear by now is that raid56 mode as it
> > currently exists is more or less fatally flawed, and a full scrap and
> > rewrite to an entirely different raid56 mode on-disk format may be
> > necessary to fix it.
> >
> > And what's even clearer is that people /really/ shouldn't be using raid56
> > mode for anything but testing with throw-away data, at this point.
> > Anything else is simply irresponsible.
> >
> > Does that mean we need to put a "raid56 mode may eat your babies" level
> > warning in the manpage and require a --force to either mkfs.btrfs or
> > balance to raid56 mode?  Because that's about where I am on it.
> 
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.
> 
> 
> I would suggest:
> 
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings
> that they really should not use raid56 right now for new filesystems.

   I beefed up the warnings in several places in the wiki a couple of
days ago.

   Hugo.

> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.)
> Instead ifdef out the options to create raid56 in mkfs.btrfs.
> Developers who want to test can just remove the ifdef and recompile
> the tools anyway.
> But if end-users have to recompile userspace, that really forces the
> point that "you
> really should not use this right now".
> 
> 3, reach out to the documentation and fora for the major distros and
> make sure they update their
> documentation accordingly.
> I think a lot of end-users, if they try to research something, are
> more likely to go to  fora and wiki
> than search out an upstream fora.

-- 
Hugo Mills | "No! My collection of rare, incurable diseases!
hugo@... carfax.org.uk | Violated!"
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Stimpson J. Cat, The Ren & Stimpy Show




Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread ronnie sahlberg
On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
>
>> Wow. So it sees the data strip corruption, uses good parity on disk to
>> fix it, writes the fix to disk, recomputes parity for some reason but
>> does it wrongly, and then overwrites good parity with bad parity?
>> That's fucked. So in other words, if there are any errors fixed up
>> during a scrub, you should do a 2nd scrub. The first scrub should make
>> sure data is correct, and the 2nd scrub should make sure the bug is
>> papered over by computing correct parity and replacing the bad parity.
>>
>> I wonder if the same problem happens with balance or if this is just a
>> bug in scrub code?
>
> Could this explain why people have been reporting so many raid56 mode
> cases of btrfs replacing a first drive appearing to succeed just fine,
> but then they go to btrfs replace a second drive, and the array crashes
> as if the first replace didn't work correctly after all, resulting in two
> bad devices once the second replace gets under way, of course bringing
> down the array?
>
> If so, then it looks like we have our answer as to what has been going
> wrong that has been so hard to properly trace and thus to bugfix.
>
> Combine that with the raid4 dedicated parity device behavior you're
> seeing if the writes are all exactly 128 MB, with that possibly
> explaining the super-slow replaces, and this thread may have just given
> us answers to both of those until-now-untraceable issues.
>
> Regardless, what's /very/ clear by now is that raid56 mode as it
> currently exists is more or less fatally flawed, and a full scrap and
> rewrite to an entirely different raid56 mode on-disk format may be
> necessary to fix it.
>
> And what's even clearer is that people /really/ shouldn't be using raid56
> mode for anything but testing with throw-away data, at this point.
> Anything else is simply irresponsible.
>
> Does that mean we need to put a "raid56 mode may eat your babies" level
> warning in the manpage and require a --force to either mkfs.btrfs or
> balance to raid56 mode?  Because that's about where I am on it.

Agree. At this point letting ordinary users create raid56 filesystems
is counterproductive.


I would suggest:

1, a much more strongly worded warning in the wiki. Make sure there is
no misunderstanding that they really should not use raid56 right now
for new filesystems.

2, Instead of a --force flag (users tend to ignore --force and
warnings in documentation), ifdef out the options to create raid56 in
mkfs.btrfs; see the sketch below point 3.
Developers who want to test can just remove the ifdef and recompile
the tools anyway.
But if end-users have to recompile userspace, that really forces the
point that "you
really should not use this right now".

3, reach out to the documentation and fora for the major distros and
make sure they update their
documentation accordingly.
I think a lot of end-users, if they try to research something, are
more likely to go to fora and wikis than search out an upstream forum.
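
For point 2, a rough sketch of what such a compile-time guard could
look like. This is not actual btrfs-progs code: the BTRFS_ALLOW_RAID56
macro, the check_profile() helper and the FAKE_* flag values are all
made up for illustration.

/* Illustrative only -- not actual btrfs-progs code.  Unless the builder
 * defines BTRFS_ALLOW_RAID56 (a made-up macro), asking for a raid5/6
 * profile is refused.  The profile flag values are stand-ins. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define FAKE_BLOCK_GROUP_RAID5 (1ULL << 7)   /* stand-in, not the real on-disk flag */
#define FAKE_BLOCK_GROUP_RAID6 (1ULL << 8)

static void check_profile(uint64_t profile)
{
#ifndef BTRFS_ALLOW_RAID56
	if (profile & (FAKE_BLOCK_GROUP_RAID5 | FAKE_BLOCK_GROUP_RAID6)) {
		fprintf(stderr, "raid5/6 profiles are disabled in this build; "
			"rebuild with -DBTRFS_ALLOW_RAID56 to test\n");
		exit(1);
	}
#endif
}

int main(void)
{
	check_profile(FAKE_BLOCK_GROUP_RAID5);	/* refused in a default build */
	return 0;
}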


Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-26 Thread Chris Murphy
On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida  wrote:
> On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida  wrote:
>> > Quoting Chris Murphy :
>> > > 3. btrfs-image so that devs can see what's causing the problem
>> > > that
>> > > the current code isn't handling well enough.
>> >
>> >
>> > btrfs-image does not create dump image:
>> >
>> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
>> > btrfs-lv_opensuse_root.image
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > Csum didn't match
>> > Error reading metadata block
>> > Error adding block -5
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
>> > Csum didn't match
>> > Error reading metadata block
>> > Error flushing pending -5
>> > create failed (Success)
>> > # echo $?
>> > 1
>>
>> Well it's pretty strange to have DUP metadata and for the checksum
>> verify to fail on both copies. I don't have much optimism that brfsck
>> repair can fix it either. But still it's worth a shot since there's
>> not much else to go on.
>
> I have tried "btrfs check --repair /device" but that seems do not do
> any good.
> http://paste.fedoraproject.org/384960/66945936/

It did fix things, in particular with the snapshot that was having
problems being dropped. But it's not enough, it seems, to prevent it
from going read only.

There's more than one bug here; you might see if the repair was good
enough that it's possible to use btrfs-image now. If not, use
btrfs-debug-tree  > file.txt and post that file somewhere. This
does expose file names. Maybe that'll shed some light on the problem.
But also worth filing a bug at bugzilla.kernel.org with this debug
tree referenced (probably too big to attach), maybe a dev will be able
to look at it and improve things so they don't fail.


> What else can I do or I must rebuild the file system?

Well, it's a long shot, but you could try using --repair --init-csum,
which will create a new csum tree. But that applies to data; if the
problem with it going read only is due to metadata corruption this
won't help. And then last you could try --init-extent-tree. The thing
I can't answer is which order to do them in.

In any case there will be files that you shouldn't trust after the
csum tree has been recreated: anything corrupt will now have a new
csum, so you can get silent data corruption. It's better to just blow
away this file system, make a new one, and reinstall the OS. But if
you're feeling brave, you can try one or both of those additional
options and see if they can help.


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Zygo Blaxell
On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov  wrote:
> > On 26.06.2016 00:52, Chris Murphy wrote:
> >> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> >> is raid4.
> >
> > That's not what code suggests and what I see in practice - parity seems
> > to be distributed across all disks; each new 128KiB file (extent) has
> > parity on new disk. At least as long as we can trust btrfs-map-logical
> > to always show parity as "mirror 2".
> 
> 
> tl;dr Andrei is correct there's no raid4 behavior here.
> 
> Looks like mirror 2 is always parity, more on that below.
> 
> 
> >
> > Do you see consecutive full stripes in your tests? Or how do you
> > determine which devid has parity for a given full stripe?
> 
> I do see consecutive full stripe writes, but it doesn't always happen.
> But not checking the consecutivity is where I became confused.
> 
> [root@f24s ~]# filefrag -v /mnt/5/ab*
> Filesystem type is: 9123683e
> File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456128..   3456159: 32: last,eof
> /mnt/5/ab128_2.txt: 1 extent found
> File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456224..   3456255: 32: last,eof
> /mnt/5/ab128_3.txt: 1 extent found
> File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456320..   3456351: 32: last,eof
> /mnt/5/ab128_4.txt: 1 extent found
> File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456352..   3456383: 32: last,eof
> /mnt/5/ab128_5.txt: 1 extent found
> File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456384..   3456415: 32: last,eof
> /mnt/5/ab128_6.txt: 1 extent found
> File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456416..   3456447: 32: last,eof
> /mnt/5/ab128_7.txt: 1 extent found
> File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456448..   3456479: 32: last,eof
> /mnt/5/ab128_8.txt: 1 extent found
> File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456480..   3456511: 32: last,eof
> /mnt/5/ab128_9.txt: 1 extent found
> File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..  31:3456096..   3456127: 32: last,eof
> /mnt/5/ab128.txt: 1 extent found
> 
> Starting with the bottom file then from the top so they're in 4096
> byte block order; and the 2nd column is the difference in value:
> 
> 3456096
> 3456128 32
> 3456224 96
> 3456320 96
> 3456352 32
> 3456384 32
> 3456416 32
> 3456448 32
> 3456480 32
> 
> So the first two files are consecutive full stripe writes. The next
> two aren't. The next five are. They were all copied at the same time.
> I don't know why they aren't always consecutive writes.

The logical addresses don't include parity stripes, so you won't find
them with FIEMAP.  Parity locations are calculated after the logical ->
(disk, chunk_offset) translation is done (it's the same chunk_offset on
every disk, but one of the disks is parity while the others are data).
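
To make that concrete, here is a small standalone sketch. It is not the
real fs/btrfs/volumes.c or raid56.c code; the helper name, the exact
rotation direction and the 64KiB stripe_len are my assumptions. The
point is only that parity placement is a pure function of the logical
offset and the chunk geometry, so nothing on disk has to record which
device holds parity for a given full stripe.

/* Illustrative sketch only -- not the real btrfs mapping code. */
#include <stdio.h>
#include <stdint.h>

struct chunk_geom {
	uint64_t logical_start;  /* logical address where the chunk begins */
	uint32_t stripe_len;     /* per-device strip size, e.g. 64KiB */
	int num_stripes;         /* devices in the chunk, e.g. 3 for 3x raid5 */
};

static int parity_index_sketch(const struct chunk_geom *c, uint64_t logical)
{
	int nr_data = c->num_stripes - 1;   /* raid5: one parity strip per full stripe */
	uint64_t full_stripe_nr = (logical - c->logical_start) /
				  ((uint64_t)c->stripe_len * nr_data);

	/* rotate parity by one device per full stripe (direction assumed) */
	return (int)(full_stripe_nr % c->num_stripes);
}

int main(void)
{
	struct chunk_geom c = { 0, 64 * 1024, 3 };
	uint64_t logical;

	/* three consecutive 128KiB full stripes -> parity lands on a different device each time */
	for (logical = 0; logical < 3 * 128 * 1024; logical += 128 * 1024)
		printf("full stripe at %8llu: parity on device index %d\n",
		       (unsigned long long)logical, parity_index_sketch(&c, logical));
	return 0;
}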

> [root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
> mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
> mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
> mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
> mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
> mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
> mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
> [root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
> mirror 1 logical 14157086720 physical 1075445760 device /dev/map

Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Chris Murphy
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov  wrote:
> On 26.06.2016 00:52, Chris Murphy wrote:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e.
>> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
>> is raid4.
>
> That's not what code suggests and what I see in practice - parity seems
> to be distributed across all disks; each new 128KiB file (extent) has
> parity on new disk. At least as long as we can trust btrfs-map-logical
> to always show parity as "mirror 2".


tl;dr Andrei is correct: there's no raid4 behavior here.

Looks like mirror 2 is always parity, more on that below.


>
> Do you see consecutive full stripes in your tests? Or how do you
> determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen.
But not checking the consecutivity is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456128..   3456159: 32: last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456224..   3456255: 32: last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456320..   3456351: 32: last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456352..   3456383: 32: last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456384..   3456415: 32: last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456416..   3456447: 32: last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456448..   3456479: 32: last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456480..   3456511: 32: last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31:3456096..   3456127: 32: last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file, then from the top, so they're in
4096-byte block order; the 2nd column is the difference in value:

3456096
3456128 32
3456224 96
3456320 96
3456352 32
3456384 32
3456416 32
3456448 32
3456480 32

So the first two files are consecutive full stripe writes. The next
two aren't. The next five are. They were all copied at the same time.
I don't know why they aren't always consecutive writes.


[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logi

Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Chris Murphy
On Sun, Jun 26, 2016 at 3:20 AM, Goffredo Baroncelli  wrote:
> On 2016-06-26 00:33, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
>>  wrote:
>>> On 2016-06-25 19:58, Chris Murphy wrote:
>>> [...]
> Wow. So it sees the data strip corruption, uses good parity on disk to
> fix it, writes the fix to disk, recomputes parity for some reason but
> does it wrongly, and then overwrites good parity with bad parity?

 The wrong parity, is it valid for the data strips that includes the
 (intentionally) corrupt data?

 Can parity computation happen before the csum check? Where sometimes you 
 get:

 read data strips > compute parity > check csum fails > read good
 parity from disk > fix up the bad data chunk > write wrong parity
 (based on wrong data)?

 https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3

 2371-2383 suggest that there's a parity check, it's not always being
 rewritten to disk if it's already correct. But it doesn't know it's
 not correct, it thinks it's wrong so writes out the wrongly computed
 parity?
>>>
>>> The parity is not valid for both the corrected data and the corrupted data. 
>>> It seems that the scrub process copy the contents of the disk2 to disk3. It 
>>> could happens only if the contents of disk1 is zero.
>>
>> I'm not sure what it takes to hit this exactly. I just tested 3x
>> raid5, where two files 128KiB "a" and 128KiB "b", so that's a full
>> stripe write for each. I corrupted devid 1 64KiB of "a" and devid2
>> 64KiB of "b" did a scrub, error is detected, and corrected, and parity
>> is still correct.
>
> How many time tried this corruption test ? I was unable to raise the bug 
> systematically; in average every three tests I got 1 bug

Once.

I just did it a 2nd time and both files' parity is wrong now. So I
did it several more times. Sometimes both files' parity is bad.
Sometimes just one file's parity is bad. Sometimes neither file's
parity is bad.

It's a very bad bug, because it is a form of silent data corruption
and it's induced by Btrfs. And it's apparently non-deterministically
hit. Is this some form of race condition?

Somewhat orthogonal to this is that while Btrfs is subject to the
write hole problem where parity can be wrong, that case is detected
and warned about. Bad data doesn't propagate up to user space.

This might explain how users are getting hit with corrupt files only
after they have a degraded volume. They did a scrub where some fixups
happened, but behind the scenes parity was possibly corrupted even
though their data was fixed. Later they have a failed device, and the
bad parity is needed, and now there are a bunch of scary checksum
errors with piles of files listed as unrecoverable. And in fact they
are unrecoverable, because the bad parity means bad reconstruction, so
even scraping the data off with btrfs restore won't work.


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Duncan
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted:

> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

... and posts (including patches, which I guess are code as well, just 
not applied yet).  I've been noticing that in the "stripe length" 
patches, judging by the comments associated with them, it's actually 
"strip length" they're talking about, using the "N strips, one per 
device, make a stripe" definition.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 00/31] btrfs: simplify use of struct btrfs_root pointers

2016-06-26 Thread Jeff Mahoney
On 6/24/16 6:14 PM, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> One of the common complaints I've heard from new and experienced
> developers alike about the btrfs code is the ubiquity of
> struct btrfs_root.  There is one for every tree on disk and it's not
> always obvious which root is needed in a particular call path.  It can
> be frustrating to spend time figuring out which root is required only
> to discover that it's not actually used for anything other than
> getting the fs-global struct btrfs_fs_info.
> 
> The patchset contains several sections.
> 
> 1) The fsid trace event patchset I posted earlier; I can rebase without this
>but I'd prefer not to.
> 
> 2) Converting btrfs_test_opt and friends to use an fs_info.
> 
> 3) Converting tests to use an fs_info pointer whenever a root is used.
> 
> 4) Moving sectorsize and nodesize to fs_info and cleaning up the
>macros used to access them.
> 
> 5) General cleanups and small fixes to make the later patches easier to
>review.
> 
> 6) Adding an "fs_info" convenience variable to every function that
>references root->fs_info more than once.  As new references appear
>in functions, more of these are added later.
> 
> 7) Call functions that always overwrite their root parameter with
>an fs_info instead.
> 
> 8) Call functions that are always called using the same root with
>an fs_info instead, retrieving the root internally.
> 
> 9) Convert every function that accepts a root argument but only uses it
>to retrieve an fs_info to accept an fs_info instead.  This is a
>recursive process as the changes percolate up.
> 
> 10) Separately convert btrfs_commit_transaction and btrfs_end_transaction
> to use trans->root instead of a root parameter.  Both are always called
> with the same root that was passed to btrfs_{start,join}_transaction.
> 
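To make items 7-9 above concrete, here is a minimal made-up example of
the kind of change involved. example_update_counters(), some_lock and
some_counter are invented names, not symbols from the actual series;
only struct btrfs_root and struct btrfs_fs_info are real, and
btrfs_root does carry an fs_info pointer.

-static void example_update_counters(struct btrfs_root *root, u64 bytes)
+static void example_update_counters(struct btrfs_fs_info *fs_info, u64 bytes)
 {
-	struct btrfs_fs_info *fs_info = root->fs_info;
-
 	spin_lock(&fs_info->some_lock);
 	fs_info->some_counter += bytes;
 	spin_unlock(&fs_info->some_lock);
 }

Callers that used to pass a root they only needed for that lookup now
pass root->fs_info (or an fs_info they already hold) instead.
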
> This series of patches in email format is the "squashed" version of
> the full development series.  That series is available at:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/jeffm/linux-btrfs.git
> 
> There are two branches of interest:
> - btrfs-testing/root-fsinfo-cleanup-squashed contains this series
> - btrfs-testing/root-fsinfo-cleanup contains the full series

Hrm.  It looks like vger didn't allow a few of the patches through,
presumably due to size.  Using the git repo is probably the only way to
go, then.

-Jeff

> The btrfs-testing/root-fsinfo-cleanup branch is easier to review if using
> git as each change is discrete.  As a whole, the patchset is invasive but
> should change execution minimally.  It passes and fails the same xfstests
> that the unpatched kernel does across multiple runs.
> 
> Thanks,
> 
> -Jeff
> 
> ---
> 
> Jeff Mahoney (31):
>   btrfs: plumb fs_info into btrfs_work
>   btrfs: prefix fsid to all trace events
>   btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
>   btrfs: tests, move initialization into tests/
>   btrfs: tests, require fs_info for root
>   btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
>   btrfs: simpilify btrfs_subvol_inherit_props
>   btrfs: copy_to_sk drop unused root parameter
>   btrfs: cleanup, remove prototype for btrfs_find_root_ref
>   btrfs: introduce BTRFS_MAX_ITEM_SIZE
>   btrfs: convert nodesize macros to static inlines
>   btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
>   btrfs: add btrfs_trans_handle->fs_info pointer
>   btrfs: btrfs_abort_transaction, drop root parameter
>   btrfs: call functions that overwrite their root parameter with fs_info
>   btrfs: call functions that always use the same root with fs_info
> instead
>   btrfs: btrfs_init_new_device should use fs_info->dev_root
>   btrfs: alloc_reserved_file_extent trace point should use extent_root
>   btrfs: struct btrfsic_state->root should be an fs_info
>   btrfs: struct reada_control.root -> reada_control.fs_info
>   btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
>   btrfs: root->fs_info cleanup, io_ctl_init
>   btrfs: pull node/sector/stripe sizes out of root and into fs_info
>   btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
>   btrfs: root->fs_info cleanup, lock/unlock_chunks
>   btrfs: root->fs_info cleanup, update_block_group{,flags}
>   btrfs: root->fs_info cleanup, add fs_info convenience variables
>   btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
>   btrfs: take an fs_info parameter directly when the root is not used
> otherwise
>   btrfs: root->fs_info cleanup, btrfs_commit_transaction already has
> root
>   btrfs: root->fs_info cleanup, btrfs_end_transaction{,_throttle} use
> trans->fs_info instead of parameter
> 
>  fs/btrfs/async-thread.c|   31 +-
>  fs/btrfs/async-thread.h|6 +-
>  fs/btrfs/backref.c |   12 +-
>  fs/btrfs/check-integrity.c |   41 +-
>  fs/btrfs/check-integrity.h |5 +-
>  fs/btrfs/compression.c

Re: Bad hard drive - checksum verify failure forces readonly mount

2016-06-26 Thread Vasco Almeida
On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida  wrote:
> > Quoting Chris Murphy :
> > > 3. btrfs-image so that devs can see what's causing the problem
> > > that
> > > the current code isn't handling well enough.
> > 
> > 
> > btrfs-image does not create dump image:
> > 
> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root
> > btrfs-lv_opensuse_root.image
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error adding block -5
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error flushing pending -5
> > create failed (Success)
> > # echo $?
> > 1
> 
> Well it's pretty strange to have DUP metadata and for the checksum
> verify to fail on both copies. I don't have much optimism that brfsck
> repair can fix it either. But still it's worth a shot since there's
> not much else to go on.

I have tried "btrfs check --repair /device" but that seems do not do
any good.
http://paste.fedoraproject.org/384960/66945936/

I then issued "mount /device /mnt" and, like before, it was mounted
readwrite and then forced readonly. Got some kernel oops and traces. 

I noticed that btrfs-balance was using ~100% CPU whilst btrfs device
was mounted readonly. I let it run for about 20 minutes.
Then had to reboot because the system was no responding well: was
unable to open or close applications, use internet. Did SysRq+reisu
(operations were enabled) and then pressed reset button on computer.

Unfortunately dmesg dumps were lost after resetting computer.

What else can I do or I must rebuild the file system?


Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5

2016-06-26 Thread Goffredo Baroncelli
On 2016-06-26 00:33, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli
>  wrote:
>> On 2016-06-25 19:58, Chris Murphy wrote:
>> [...]
 Wow. So it sees the data strip corruption, uses good parity on disk to
 fix it, writes the fix to disk, recomputes parity for some reason but
 does it wrongly, and then overwrites good parity with bad parity?
>>>
>>> The wrong parity, is it valid for the data strips that includes the
>>> (intentionally) corrupt data?
>>>
>>> Can parity computation happen before the csum check? Where sometimes you 
>>> get:
>>>
>>> read data strips > compute parity > check csum fails > read good
>>> parity from disk > fix up the bad data chunk > write wrong parity
>>> (based on wrong data)?
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>
>>> 2371-2383 suggest that there's a parity check, it's not always being
>>> rewritten to disk if it's already correct. But it doesn't know it's
>>> not correct, it thinks it's wrong so writes out the wrongly computed
>>> parity?
>>
>> The parity is not valid for either the corrected data or the corrupted data. 
>> It seems that the scrub process copies the contents of disk2 to disk3. It 
>> could happen only if the contents of disk1 are zero.
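
Side note on why that follows: with two data strips D1 and D2, parity
is P = D1 ^ D2, so a recompute that treats D1 as all zeroes produces
P = D2, i.e. a byte-for-byte copy of disk2 ending up on disk3. Below is
a tiny standalone demo of just that arithmetic; it is not modeled on
any btrfs code.

/* Minimal sketch assuming a 3-disk raid5 layout: two data strips D1, D2
 * and parity P = D1 ^ D2.  This only demonstrates the XOR identity
 * described above; it is not btrfs code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t d1 = 0xA5, d2 = 0x3C;      /* one byte stands in for a whole 64KiB strip */
	uint8_t good_parity = d1 ^ d2;     /* what disk3 should hold */
	uint8_t bad_parity  = 0x00 ^ d2;   /* recomputed with D1 wrongly treated as zero */

	printf("good parity = 0x%02x\n", good_parity);
	printf("bad parity  = 0x%02x (identical to D2 = 0x%02x)\n", bad_parity, d2);
	return 0;
}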
> 
> I'm not sure what it takes to hit this exactly. I just tested 3x
> raid5, where two files 128KiB "a" and 128KiB "b", so that's a full
> stripe write for each. I corrupted devid 1 64KiB of "a" and devid2
> 64KiB of "b" did a scrub, error is detected, and corrected, and parity
> is still correct.

How many times did you try this corruption test? I was unable to
reproduce the bug systematically; on average I hit it once every three
tests.


[]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Adventures in btrfs raid5 disk recovery

2016-06-26 Thread Andrei Borzenkov
On 26.06.2016 00:52, Chris Murphy wrote:
> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> is raid4.

That's not what the code suggests, nor what I see in practice - parity
seems to be distributed across all disks; each new 128KiB file (extent)
has parity on a new disk. At least as long as we can trust
btrfs-map-logical to always show parity as "mirror 2".

Do you see consecutive full stripes in your tests? Or how do you
determine which devid has parity for a given full stripe? This
information is not actually stored anywhere; it is computed based on
block group geometry and logical stripe offset.

P.S. usage of "stripe" to mean "stripe element" actually adds to
confusion when reading code :)