Re: Bad hard drive - checksum verify failure forces readonly mount
On Sun, 26-06-2016 at 13:54 -0600, Chris Murphy wrote:
> On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida wrote:
> > I have tried "btrfs check --repair /device" but that does not seem
> > to do any good.
> > http://paste.fedoraproject.org/384960/66945936/
>
> It did fix things, in particular with the snapshot that was having
> problems being dropped. But it seems that's not enough to prevent it
> from going read-only.
>
> There's more than one bug here. You might see if the repair was good
> enough that it's possible to use btrfs-image now.

File system image available at (choose one link):
https://mega.nz/#!AkAEgKyB!RUa7G5xHIygWm0ALx5ZxQjjXNdFYa7lDRHJ_sW0bWLs
https://www.sendspace.com/file/i70cft

> If not, use btrfs-debug-tree > file.txt and post that file somewhere.
> This does expose file names. Maybe that'll shed some light on the
> problem. But it's also worth filing a bug at bugzilla.kernel.org with
> this debug tree referenced (probably too big to attach); maybe a dev
> will be able to look at it and improve things so they don't fail.

Should I file a bug report with that image dump linked above, with the
btrfs-debug-tree output, or with both? I think I will use the subject of
this thread as the summary for the bug. Can you think of something more
suitable, or is that fine?

> > What else can I do, or must I rebuild the file system?
>
> Well, it's a long shot, but you could try using --repair --init-csum,
> which will create a new csum tree. But that applies to data; if the
> problem with it going read-only is due to metadata corruption, this
> won't help. And then last you could try --init-extent-tree. The thing
> I can't answer is which order to do it in.
>
> In any case, there will be files that you shouldn't trust after the
> csum tree has been recreated: anything corrupt will now have a new
> csum, so you can get silent data corruption. It's better to just blow
> away this file system, make a new one, and reinstall the OS. But if
> you're feeling brave, you can try one or both of those additional
> options and see if they can help.

I think I will reinstall the OS since, even if I manage to recover the
file system from this issue, that OS will be something I cannot fully
trust.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
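[A note on why rebuilding the csum tree causes silent corruption, as Chris describes above: once checksums are recomputed from whatever is currently on disk, corrupt blocks get fresh, valid-looking checksums. The toy shell script below illustrates the mechanism with sha256sum on an ordinary file — it is not btrfs code, and the file names are made up.]

```shell
tmp=$(mktemp -d)
printf 'good data' > "$tmp/file"
sha256sum "$tmp/file" > "$tmp/file.sum"            # record checksums (the "csum tree")
printf 'bad  data' > "$tmp/file"                   # simulate silent on-disk corruption
if sha256sum -c "$tmp/file.sum" >/dev/null 2>&1; then v1=ok; else v1=failed; fi
sha256sum "$tmp/file" > "$tmp/file.sum"            # rebuild checksums, like --init-csum
if sha256sum -c "$tmp/file.sum" >/dev/null 2>&1; then v2=ok; else v2=failed; fi
echo "verify before rebuild: $v1, after rebuild: $v2"
rm -rf "$tmp"
```

Before the rebuild, verification catches the corruption; after it, the corrupt contents verify as "ok" — exactly the silent-corruption window the advice above warns about.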
Re: [PATCH 05/14] Btrfs: warn_on for unaccounted spaces
Hi Josef,

Would you please move this patch to the front of the patchset? It's
making bisect quite hard, as bisect will always stop at this patch, and
it's then hard to tell whether a hit is a regression or an existing bug.

Thanks,
Qu

At 03/26/2016 01:25 AM, Josef Bacik wrote:
> These were hidden behind enospc_debug, which isn't helpful as they
> indicate actual bugs, unlike the rest of the enospc_debug stuff which
> is really debug information. Thanks,
>
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/extent-tree.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 157a0b6..90ac821 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9633,13 +9633,15 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
>  		space_info = list_entry(info->space_info.next,
>  					struct btrfs_space_info,
>  					list);
> -		if (btrfs_test_opt(info->tree_root, ENOSPC_DEBUG)) {
> -			if (WARN_ON(space_info->bytes_pinned > 0 ||
> +
> +		/*
> +		 * Do not hide this behind enospc_debug, this is actually
> +		 * important and indicates a real bug if this happens.
> +		 */
> +		if (WARN_ON(space_info->bytes_pinned > 0 ||
>  			    space_info->bytes_reserved > 0 ||
> -			    space_info->bytes_may_use > 0)) {
> -				dump_space_info(space_info, 0, 0);
> -			}
> -		}
> +			    space_info->bytes_may_use > 0))
> +			dump_space_info(space_info, 0, 0);
>  		list_del(&space_info->list);
>  		for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
>  			struct kobject *kobj;
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
27.06.2016 06:50, Christoph Anton Mitterer wrote:
> What should IMHO be done as well is giving a big fat warning in the
> manpages/etc. that when nodatacow is used RAID recovery cannot produce
> valid data (at least as long as there isn't checksumming implemented
> for nodatacow).

The problem is that the current implementation of RAID56 puts exactly
CoW data at risk. I.e. writing a new (copy of) data may suddenly make an
old (copy of) data inaccessible, even though it had been safely
committed to disk and is now in a read-only snapshot.
Re: [PATCH 1/4] fstests: btrfs/048: extend _filter_btrfs_prop_error to handle additional errors
On Fri, Jun 24, 2016 at 11:08:31AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney
>
> btrfsprogs v4.5.3 changed the formatting of some error messages. This
> patch extends the filter for btrfs prop to handle those.
>
> Signed-off-by: Jeff Mahoney
> ---
>  common/filter.btrfs | 10 +++---
>  tests/btrfs/048     |  6 --
>  tests/btrfs/048.out |  4 ++--
>  3 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/common/filter.btrfs b/common/filter.btrfs
> index 9970f4d..54361d4 100644
> --- a/common/filter.btrfs
> +++ b/common/filter.btrfs
> @@ -72,15 +72,19 @@ _filter_btrfs_compress_property()
>  	sed -e "s/compression=\(lzo\|zlib\)/COMPRESSION=XXX/g"
>  }
>
> -# filter name of the property from the output, optionally verify against $1
> +# filter error messages from btrfs prop, optionally verify against $1
>  # recognized message(s):
>  # "object is not compatible with property: label"
> +# "invalid value for property:{, value}"
> +# "failed to {get,set} compression for $PATH[.:]: Invalid argument"
>  _filter_btrfs_prop_error()
>  {
>  	if ! [ -z "$1" ]; then
> -		sed -e "s/\(compatible with property\): $1/\1/"
> +		sed -e "s#\(compatible with property\): $1#\1#" \
> +		    -e "s#^\(.*failed to [sg]et compression for $1\)[:.] \(.*\)#\1: \2#"
>  	else
> -		sed -e "s/^\(.*compatible with property\).*/\1/"
> +		sed -e "s#^\(.*compatible with property\).*#\1#" \
> +		    -e "s#^\(.*invalid value for property\):.*#\1#"

I think the last regex should be

-		    -e "s#^\(.*invalid value for property\):.*#\1#"
+		    -e "s#^\(.*invalid value for property\)[:.].*#\1#"

because...

>  	fi
>  }
>
> diff --git a/tests/btrfs/048 b/tests/btrfs/048
> index 4a36303..0b907b0 100755
> --- a/tests/btrfs/048
> +++ b/tests/btrfs/048
> @@ -79,7 +79,8 @@ echo -e "\nTesting subvolume ro property"
>  _run_btrfs_util_prog subvolume create $SCRATCH_MNT/sv1
>  $BTRFS_UTIL_PROG property get $SCRATCH_MNT/sv1 ro
>  echo "***"
> -$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo
> +$BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro foo 2>&1 |
> +	_filter_btrfs_prop_error
>  echo "***"
>  $BTRFS_UTIL_PROG property set $SCRATCH_MNT/sv1 ro true
>  echo "***"
> @@ -99,7 +100,8 @@ $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
>  $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/subdir1 compression
>  echo "***"
>  $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression \
> -	foo 2>&1 | _filter_scratch
> +	foo 2>&1 | _filter_scratch |
> +	_filter_btrfs_prop_error SCRATCH_MNT/testdir/file1
>  echo "***"
>  $BTRFS_UTIL_PROG property set $SCRATCH_MNT/testdir/file1 compression lzo
>  $BTRFS_UTIL_PROG property get $SCRATCH_MNT/testdir/file1 compression
> diff --git a/tests/btrfs/048.out b/tests/btrfs/048.out
> index 0b20d0b..3e4e3d2 100644
> --- a/tests/btrfs/048.out
> +++ b/tests/btrfs/048.out
> @@ -15,7 +15,7 @@ ERROR: object is not compatible with property
>  Testing subvolume ro property
>  ro=false
>  ***
> -ERROR: invalid value for property.
> +ERROR: invalid value for property

this change breaks the test with older btrfs-progs; the filter didn't
remove the ending "." correctly. With the above fix, btrfs/048 works for
me with both v3.19 and v4.6 btrfs-progs.

Thanks,
Eryu

>  ***
>  ***
>  ro=true
> @@ -27,7 +27,7 @@ ro=false
>
>  Testing compression property
>  ***
> -ERROR: failed to set compression for SCRATCH_MNT/testdir/file1. Invalid argument
> +ERROR: failed to set compression for SCRATCH_MNT/testdir/file1: Invalid argument
>  ***
>  compression=lzo
>  compression=lzo
> --
> 1.8.5.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe fstests" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
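[A quick way to see why the `[:.]` bracket expression matters: older btrfs-progs end the message with "." while v4.5.3+ uses ": <value>", so the filter must normalize both. The strings below are modeled on the test output; "foo" is just an illustrative value.]

```shell
old='ERROR: invalid value for property.'     # older btrfs-progs
new='ERROR: invalid value for property: foo' # v4.5.3+
fixed='s#^\(.*invalid value for property\)[:.].*#\1#'
orig='s#^\(.*invalid value for property\):.*#\1#'
a=$(printf '%s\n' "$old" | sed -e "$fixed")  # fixed regex handles both forms
b=$(printf '%s\n' "$new" | sed -e "$fixed")
c=$(printf '%s\n' "$old" | sed -e "$orig")   # original regex misses the "." form
echo "$a"; echo "$b"; echo "$c"
```

With the `[:.]` fix, both message variants collapse to "ERROR: invalid value for property"; the `:`‑only version passes the old "." message through untouched, which is exactly the golden-output mismatch Eryu reports.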
Re: Strange behavior when replacing device on BTRFS RAID 5 array.
On Sun, Jun 26, 2016 at 8:57 PM, Nick Austin wrote:
> sudo btrfs fi show /mnt/newdata
> Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
> 	Total devices 4 FS bytes used 8.07TiB
> 	devid    1 size 5.46TiB used 2.70TiB path /dev/sdg
> 	devid    2 size 5.46TiB used 2.70TiB path /dev/sdl
> 	devid    3 size 5.46TiB used 2.70TiB path /dev/sdm
> 	devid    4 size 5.46TiB used 2.70TiB path /dev/sdx

It looks like fi show has bad data:

When I start heavy IO on the filesystem (running rsync -c to verify the
data), I notice zero IO on the bad drive I told btrfs to replace, and
lots of IO to the expected replacement. I guess some metadata is messed
up somewhere?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          25.19    0.00    7.81   28.46    0.00   38.54

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdg             437.00     75168.00      1792.00      75168       1792
sdl             443.00     76064.00      1792.00      76064       1792
sdm             438.00     75232.00      1472.00      75232       1472
sdw             443.00     75680.00      1856.00      75680       1856
sdx               0.00       0.00          0.00           0          0
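[The observation above — the device btrfs still reports as a member is doing zero I/O — can be checked mechanically from an iostat snapshot. A small awk sketch, using the sample figures from this report embedded as a here-string rather than a live iostat run:]

```shell
# Columns: device, tps, kB_read/s, kB_wrtn/s, kB_read, kB_wrtn
snapshot='Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdg 437.00 75168.00 1792.00 75168 1792
sdl 443.00 76064.00 1792.00 76064 1792
sdm 438.00 75232.00 1472.00 75232 1472
sdw 443.00 75680.00 1856.00 75680 1856
sdx 0.00 0.00 0.00 0 0'
# Flag devices with zero tps, read, and write rates (skip the header row).
idle=$(printf '%s\n' "$snapshot" | awk 'NR > 1 && $2 + $3 + $4 == 0 { print $1 }')
echo "idle devices: $idle"
```

On this sample it flags only sdx — consistent with the replace target really being out of the data path despite what `fi show` says. Against a live system one would pipe `iostat -d 1 1` output instead of the here-string.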
Strange behavior when replacing device on BTRFS RAID 5 array.
I have a 4 device BTRFS RAID 5 filesystem. One of the device members of
this file system (sdr) had badblocks, so I decided to replace it. (I
don't have a copy of fi show from before the replace. :-/ )

I ran this command:

sudo btrfs replace start 4 /dev/sdw /mnt/newdata

I had to shrink /dev/sdr by ~250 megs since the replacement drive was a
tiny bit smaller.

Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): resizing devid 4
Jun 25 17:26:52 frank kernel: BTRFS info (device sdr): new size for /dev/sdr is 6001175121920
Jun 25 17:27:45 frank kernel: BTRFS info (device sdr): dev_replace from /dev/sdr (devid 4) to /dev/sdw started

The replace started, all seemed well. 3 hours into the replace, sdr
dropped off the SATA bus and was redetected as sdx. Bummer, but
shouldn't be fatal. This event really seemed to throw BTRFS for a loop.

Jun 25 20:32:35 frank kernel: sd 10:0:19:0: device_block, handle(0x0019)
Jun 25 20:33:05 frank kernel: sd 10:0:19:0: device_unblock and setting to running, handle(0x0019)
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: rejecting I/O to offline device
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] killing request
Jun 25 20:33:05 frank kernel: scsi 10:0:19:0: [sdr] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jun 25 20:33:05 frank kernel: blk_update_request: I/O error, dev sdr, sector 1785876480
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing handle(0x0019), sas_addr(0x500194000687e20e)
Jun 25 20:33:05 frank kernel: mpt2sas_cm0: removing : enclosure logical id(0x500194000687e23f), slot(14)
Jun 25 20:33:16 frank kernel: scsi 10:0:21:0: Direct-Access ATA WL6000GSA12872E 1C01 PQ: 0 ANSI: 5

Here you can see btrfs seems to figure out sdr
has become sdx (based on the "dev /dev/sdx" entry showing up on the
BTRFS warning lines). Unfortunately, all remaining IO for the device
formerly known as sdr results in btrfs errors like the ones listed
below. iostat confirms no IO on sdx.

Jun 25 20:33:17 frank kernel: sd 10:0:21:0: [sdx] Attached SCSI disk
...
Jun 25 20:33:20 frank kernel: scrub_handle_errored_block: 31983 callbacks suppressed
Jun 25 20:33:20 frank kernel: BTRFS warning (device sdr): i/o error at logical 2742536544256 on dev /dev/sdx, sector 1786897488, root 5, inode 222965, offset 296329216, length 4096, links 1 (path: /path/to/file)
Jun 25 20:33:20 frank kernel: btrfs_dev_stat_print_on_error: 32107 callbacks suppressed

These messages continue for many hours as the replace continues running.

sudo btrfs replace status /mnt/newdata
Started on 25.Jun 17:27:45, finished on 26.Jun 12:48:22, 0 write errs, 0 uncorr. read errs

...
Jun 26 12:48:22 frank kernel: BTRFS warning (device sdr): lost page write due to IO error on /dev/sdx
(Many, many of these)
Jun 26 12:48:22 frank kernel: BTRFS info (device sdr): dev_replace from /dev/sdx (devid 4) to /dev/sdw finished

Great! /dev/sdx replaced by /dev/sdw! Now let's confirm:

sudo btrfs fi show /mnt/newdata
Label: '/var/data'  uuid: e4a2eb77-956e-447a-875e-4f6595a5d3ec
	Total devices 4 FS bytes used 8.07TiB
	devid    1 size 5.46TiB used 2.70TiB path /dev/sdg
	devid    2 size 5.46TiB used 2.70TiB path /dev/sdl
	devid    3 size 5.46TiB used 2.70TiB path /dev/sdm
	devid    4 size 5.46TiB used 2.70TiB path /dev/sdx

Bummer, this doesn't look right. sdx is still in the array (the failing
drive). Additionally, /dev/sdw isn't listed at all! Worse still, it
looks like the array has lost redundancy (it has 8TiB of data, and
that's the amount shown as used divided by number of devices). It looks
like it tried to add the new device, but did a balance across all of
them instead?
% sudo btrfs fi df /mnt/newdata
Data, RAID5: total=8.07TiB, used=8.06TiB
System, RAID10: total=80.00MiB, used=576.00KiB
Metadata, RAID10: total=12.00GiB, used=10.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Any advice would be appreciated.

% uname -a
Linux frank 4.5.5-201.fc23.x86_64 #1 SMP Sat May 21 15:29:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

% lsb_release
Description:	Fedora release 24 (Twenty Three)

% btrfs --version
btrfs-progs v4.4.1
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sun, 2016-06-26 at 15:33 -0700, ronnie sahlberg wrote:
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings that they really should not use raid56 right
> now for new filesystems.

I doubt most end users can be assumed to read the wiki...

> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.) Instead ifdef out the options to create
> raid56 in mkfs.btrfs. Developers who want to test can just remove the
> ifdef and recompile the tools anyway. But if end-users have to
> recompile userspace, that really forces the point that "you really
> should not use this right now".

Well, if one uses --force or --yes-i-know-what-i-do and actually
doesn't, then such a person is on their own. People can always shoot
themselves if they want to.

Actually, I think the compile-time way is inferior here. Distros may
just always enable raid56 there to allow people to continue mounting
their existing filesystems.

What should IMHO be done as well is giving a big fat warning in the
manpages/etc. that when nodatacow is used RAID recovery cannot produce
valid data (at least as long as there isn't checksumming implemented
for nodatacow).

Probably it should also be documented what btrfs does in such a
situation. E.g. does it just randomly pick a readable block from one of
the copies? Simply error out and consider the file broken? Fill the
blocks in question with zero?

Cheers,
Chris.
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-27 08:38, Hugo Mills wrote:
> On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote:
> > On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > > Could this explain why people have been reporting so many raid56
> > > mode cases of btrfs replacing a first drive appearing to succeed
> > > just fine, but then they go to btrfs replace a second drive, and
> > > the array crashes as if the first replace didn't work correctly
> > > after all, resulting in two bad devices once the second replace
> > > gets under way, of course bringing down the array?
> > >
> > > If so, then it looks like we have our answer as to what has been
> > > going wrong that has been so hard to properly trace and thus to
> > > bugfix.
> > >
> > > Combine that with the raid4 dedicated parity device behavior
> > > you're seeing if the writes are all exactly 128 MB, with that
> > > possibly explaining the super-slow replaces, and this thread may
> > > have just given us answers to both of those until-now-untraceable
> > > issues.
> > >
> > > Regardless, what's /very/ clear by now is that raid56 mode as it
> > > currently exists is more or less fatally flawed, and a full scrap
> > > and rewrite to an entirely different raid56 mode on-disk format
> > > may be necessary to fix it.
> > >
> > > And what's even clearer is that people /really/ shouldn't be using
> > > raid56 mode for anything but testing with throw-away data, at this
> > > point. Anything else is simply irresponsible.
> > >
> > > Does that mean we need to put a "raid56 mode may eat your babies"
> > > level warning in the manpage and require a --force to either
> > > mkfs.btrfs or balance to raid56 mode? Because that's about where I
> > > am on it.
> >
> > Agree. At this point letting ordinary users create raid56
> > filesystems is counterproductive.
> >
> > I would suggest:
> > 1, a much more strongly worded warning in the wiki. Make sure there
> > are no misunderstandings that they really should not use raid56
> > right now for new filesystems.
>
> I beefed up the warnings in several places in the wiki a couple of
> days ago.
Not to sound rude, but I don't think these go anywhere near far enough.
It needs to be completely obvious that there's a good chance you'll lose
everything. IMHO that's the only way to stop BTRFS from getting the
'data eater' reputation. It can be revisited and reworded when the
implementation is more tested and stable.

--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-27 08:33, ronnie sahlberg wrote:
> On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted:
> > > Wow. So it sees the data strip corruption, uses good parity on
> > > disk to fix it, writes the fix to disk, recomputes parity for some
> > > reason but does it wrongly, and then overwrites good parity with
> > > bad parity? That's fucked.
> > >
> > > So in other words, if there are any errors fixed up during a
> > > scrub, you should do a 2nd scrub. The first scrub should make sure
> > > data is correct, and the 2nd scrub should make sure the bug is
> > > papered over by computing correct parity and replacing the bad
> > > parity.
> > >
> > > I wonder if the same problem happens with balance or if this is
> > > just a bug in scrub code?
> >
> > Could this explain why people have been reporting so many raid56
> > mode cases of btrfs replacing a first drive appearing to succeed
> > just fine, but then they go to btrfs replace a second drive, and the
> > array crashes as if the first replace didn't work correctly after
> > all, resulting in two bad devices once the second replace gets under
> > way, of course bringing down the array?
> >
> > If so, then it looks like we have our answer as to what has been
> > going wrong that has been so hard to properly trace and thus to
> > bugfix.
> >
> > Combine that with the raid4 dedicated parity device behavior you're
> > seeing if the writes are all exactly 128 MB, with that possibly
> > explaining the super-slow replaces, and this thread may have just
> > given us answers to both of those until-now-untraceable issues.
> >
> > Regardless, what's /very/ clear by now is that raid56 mode as it
> > currently exists is more or less fatally flawed, and a full scrap
> > and rewrite to an entirely different raid56 mode on-disk format may
> > be necessary to fix it.
> >
> > And what's even clearer is that people /really/ shouldn't be using
> > raid56 mode for anything but testing with throw-away data, at this
> > point. Anything else is simply irresponsible.
> >
> > Does that mean we need to put a "raid56 mode may eat your babies"
> > level warning in the manpage and require a --force to either
> > mkfs.btrfs or balance to raid56 mode? Because that's about where I
> > am on it.
>
> Agree. At this point letting ordinary users create raid56 filesystems
> is counterproductive.

+1

> I would suggest:
> 1, a much more strongly worded warning in the wiki. Make sure there
> are no misunderstandings that they really should not use raid56 right
> now for new filesystems.

I voiced my concern on #btrfs about this - it really should show that
this may eat your data and is properly experimental. At the moment, it
looks as if the features are implemented and working as expected. In my
case, with nothing out of the ordinary, I've now got ~3.8Tb free disk
space. Certainly not ready for *ANY* kind of public use.

> 2, Instead of a --force flag. (Users tend to ignore --force and
> warnings in documentation.) Instead ifdef out the options to create
> raid56 in mkfs.btrfs. Developers who want to test can just remove the
> ifdef and recompile the tools anyway. But if end-users have to
> recompile userspace, that really forces the point that "you really
> should not use this right now".

I think this is a somewhat good idea - however it should be a warning
along the lines of: "BTRFS RAID56 is VERY experimental and is known to
corrupt data in certain cases. Use at your own risk! Continue? (y/N):"

> 3, reach out to the documentation and fora for the major distros and
> make sure they update their documentation accordingly. I think a lot
> of end-users, if they try to research something, are more likely to
> go to fora and wiki than search out an upstream forum.

Another good idea. I'd also recommend updates to the ArchLinux wiki, as
for some reason I always seem to end up there when searching for a
certain topic...
--
Steven Haigh

Email: net...@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
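[The failure mode Chris Murphy describes at the top of this thread — corrupt data fixed from good parity, but parity then recomputed wrongly and written back — can be sketched with plain XOR arithmetic. This is a toy model with made-up one-byte strips, not btrfs code:]

```shell
# One RAID5 stripe: two data strips d1, d2 and parity p = d1 XOR d2.
d1=170 d2=85                  # 0xAA, 0x55
p=$(( d1 ^ d2 ))              # correct parity: 255
d1_corrupt=168                # a bit flip corrupts d1 on disk
rebuilt=$(( p ^ d2 ))         # recovery from good parity: 170, matches d1
bad_p=$(( d1_corrupt ^ d2 ))  # buggy scrub: parity recomputed from bad data
d2_rebuilt=$(( bad_p ^ d1 ))  # later rebuild of d2 from bad parity: 87 != 85
echo "d1 recovered=$rebuilt, d2 rebuilt from bad parity=$d2_rebuilt"
```

The first recovery works because parity was still good; once bad parity has been written, any later rebuild that depends on it (e.g. the second-drive replace scenario Duncan describes) silently produces wrong data.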
Re: [PATCH v11 00/13] Btrfs dedupe framework
At 06/25/2016 01:45 PM, Chandan Rajendra wrote:
> On Saturday, June 25, 2016 09:22:44 AM Qu Wenruo wrote:
> > On 06/24/2016 05:29 PM, Chandan Rajendra wrote:
> > > On Friday, June 24, 2016 10:50:41 AM Qu Wenruo wrote:
> > > > Hi Chandan, David,
> > > >
> > > > When I was trying to rebase the dedupe patchset on top of
> > > > Chandan's sub page size patchset (using David's
> > > > for-next-test-20160620), although the rebase itself was quite
> > > > simple, I'm afraid I found some bugs in the sub page size
> > > > patchset, *without* the dedupe patchset applied. These bugs seem
> > > > to be unrelated to each other.
> > > >
> > > > 1) state leak at btrfs rmmod time
> > >
> > > The leak was due to not freeing 'cached_state' in
> > > read_extent_buffer_pages(). I have fixed this and the fix will be
> > > part of the patchset when I post the next version to the mailing
> > > list. I have always compiled the btrfs code as part of the vmlinux
> > > image and hence have never rmmod'ed the btrfs module during my
> > > local testing. The space leak messages might have appeared when I
> > > shut down my guest. Hence I had never noticed them before. Thanks
> > > once again for informing me about it.
> > >
> > > > 2) bytes_may_use leak at qgroup EDQUOTA error time
> > >
> > > I have a slightly older version of btrfs-progs which does not yet
> > > have the "btrfs dedupe" command. I will get the new version and
> > > check if the space leak can be reproduced on my machine. However,
> > > I don't see the space leak warning messages when the reproducer
> > > script is executed after commenting out "btrfs dedupe enable
> > > $mnt".
> >
> > Strange.
> >
> > That dedupe command is not useful at all, as I'm using the branch
> > without the dedupe patchset. Even with the btrfs-progs dedupe
> > patchset, dedupe enable only outputs an ENOTTY error message.
> >
> > I'll double check if it's related to the dedupe.
> >
> > BTW, are you testing with 4K page size?
>
> Yes, I executed the script with 4k page size. I had based my patchset
> on top of the 4.7-rc2 kernel. If you are interested, you can get the
> kernel sources at 'https://github.com/chandanr/linux
> subpagesize-blocksize'. I will soon rebase my patchset on David's
> master branch.
> I will let you know if I hit the space leak issue on the rebased
> kernel.

Thanks for your info.

Confirmed that the leaked space is unrelated to your sub page size
patchset. It's another patchset causing the bug. Bisected, and I'll
inform the author.

Great thanks for your help.
Qu
Re: [PATCH 04/31] btrfs: tests, move initialization into tests/
On 6/26/16 10:17 PM, Qu Wenruo wrote:
>
>
> At 06/25/2016 06:14 AM, je...@suse.com wrote:
>> From: Jeff Mahoney
>>
>> We have all these stubs that only exist because they're called from
>> btrfs_run_sanity_tests, which is a static inside super.c. Let's just
>> move it all into tests/btrfs-tests.c and only have one stub.
>>
>> Signed-off-by: Jeff Mahoney
>> ---
>>  fs/btrfs/super.c             | 43
>>  fs/btrfs/tests/btrfs-tests.c | 47 ++--
>>  fs/btrfs/tests/btrfs-tests.h | 35 +++--
>>  3 files changed, 48 insertions(+), 77 deletions(-)
>>
>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>> index a7b9a15..d8e48bb 100644
>> --- a/fs/btrfs/super.c
>> +++ b/fs/btrfs/super.c
>> @@ -2323,49 +2323,6 @@ static void btrfs_print_mod_info(void)
>>  		btrfs_crc32c_impl());
>>  }
>>
>> -static int btrfs_run_sanity_tests(void)
>> -{
>> -	int ret, i;
>> -	u32 sectorsize, nodesize;
>> -	u32 test_sectorsize[] = {
>> -		PAGE_SIZE,
>> -	};
>> -	ret = btrfs_init_test_fs();
>> -	if (ret)
>> -		return ret;
>> -	for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
>> -		sectorsize = test_sectorsize[i];
>> -		for (nodesize = sectorsize;
>> -		     nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
>> -		     nodesize <<= 1) {
>> -			pr_info("BTRFS: selftest: sectorsize: %u nodesize: %u\n",
>> -				sectorsize, nodesize);
>> -			ret = btrfs_test_free_space_cache(sectorsize, nodesize);
>> -			if (ret)
>> -				goto out;
>> -			ret = btrfs_test_extent_buffer_operations(sectorsize,
>> -				nodesize);
>> -			if (ret)
>> -				goto out;
>> -			ret = btrfs_test_extent_io(sectorsize, nodesize);
>> -			if (ret)
>> -				goto out;
>> -			ret = btrfs_test_inodes(sectorsize, nodesize);
>> -			if (ret)
>> -				goto out;
>> -			ret = btrfs_test_qgroups(sectorsize, nodesize);
>> -			if (ret)
>> -				goto out;
>> -			ret = btrfs_test_free_space_tree(sectorsize, nodesize);
>> -			if (ret)
>> -				goto out;
>> -		}
>> -	}
>> -out:
>> -	btrfs_destroy_test_fs();
>> -	return ret;
>> -}
>> -
>>  static int __init init_btrfs_fs(void)
>>  {
>>  	int err;
>> diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
>> index 10eb249..d90c951 100644
>> --- a/fs/btrfs/tests/btrfs-tests.c
>> +++ b/fs/btrfs/tests/btrfs-tests.c
>> @@ -54,7 +54,7 @@ struct inode *btrfs_new_test_inode(void)
>>  	return new_inode(test_mnt->mnt_sb);
>>  }
>>
>> -int btrfs_init_test_fs(void)
>> +static int btrfs_init_test_fs(void)
>>  {
>>  	int ret;
>>
>> @@ -73,7 +73,7 @@ int btrfs_init_test_fs(void)
>>  	return 0;
>>  }
>>
>> -void btrfs_destroy_test_fs(void)
>> +static void btrfs_destroy_test_fs(void)
>>  {
>>  	kern_unmount(test_mnt);
>>  	unregister_filesystem(&test_type);
>> @@ -220,3 +220,46 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans)
>>  	INIT_LIST_HEAD(&trans->qgroup_ref_list);
>>  	trans->type = __TRANS_DUMMY;
>>  }
>> +
>> +int btrfs_run_sanity_tests(void)
>> +{
>> +	int ret, i;
>> +	u32 sectorsize, nodesize;
>> +	u32 test_sectorsize[] = {
>> +		PAGE_SIZE,
>> +	};
>> +	ret = btrfs_init_test_fs();
>> +	if (ret)
>> +		return ret;
>> +	for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) {
>> +		sectorsize = test_sectorsize[i];
>> +		for (nodesize = sectorsize;
>> +		     nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE;
>> +		     nodesize <<= 1) {
>> +			pr_info("BTRFS: selftest: sectorsize: %u nodesize: %u\n",
>> +				sectorsize, nodesize);
>> +			ret = btrfs_test_free_space_cache(sectorsize, nodesize);
>> +			if (ret)
>> +				goto out;
>> +			ret = btrfs_test_extent_buffer_operations(sectorsize,
>> +				nodesize);
>> +			if (ret)
>> +				goto out;
>> +			ret = btrfs_test_extent_io(sectorsize, nodesize);
>> +			if (ret)
>> +				goto out;
>> +			ret = btrfs_test_inodes(sectorsize, nodesize);
>> +			if (ret)
>> +				goto out;
>> +			ret = btrfs_test_qgroups(sectorsize, nodesize);
>> +			if (ret)
>> +				goto out;
>> +			ret = btrfs_test_free_space_tree(sectorsize, nodesize);
>> +			if (ret)
>> +				goto out;
>> +		}
>> +	}
>> +out:
>> +	btrfs_destroy_test_fs();
>> +	return ret;
>> +}
>> diff --git a/fs/btrfs/tests/btrfs-tests.h b/fs/btrfs/tests/btrfs-tests.h
>> index 66fb6b70..e7d364f 100644
>> --- a/fs/btrfs/tests/btrfs-tests.h
>> +++ b/fs/btrfs/tests/btrfs-tests.h
>> @@ -20,20 +20,19 @@
>>  #define __BTRFS_TESTS
>>
>>  #ifdef CONFIG_BTRFS_F
Re: [PATCH 03/31] btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
At 06/27/2016 10:21 AM, Jeff Mahoney wrote: On 6/26/16 10:14 PM, Qu Wenruo wrote: At 06/25/2016 06:14 AM, je...@suse.com wrote: From: Jeff Mahoney btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney --- fs/btrfs/ctree.h| 22 fs/btrfs/delayed-inode.c| 2 +- fs/btrfs/dev-replace.c | 4 +- fs/btrfs/disk-io.c | 22 fs/btrfs/extent-tree.c | 32 +-- fs/btrfs/file.c | 2 +- fs/btrfs/free-space-cache.c | 6 +- fs/btrfs/inode-map.c| 12 ++-- fs/btrfs/inode.c| 12 ++-- fs/btrfs/ioctl.c| 2 +- fs/btrfs/super.c| 132 +++- fs/btrfs/transaction.c | 6 +- fs/btrfs/tree-log.c | 4 +- fs/btrfs/volumes.c | 11 ++-- 14 files changed, 137 insertions(+), 132 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 101c3cf..100d2ea 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1297,21 +1297,21 @@ struct btrfs_root { #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt) #define btrfs_raw_test_opt(o, opt)((o) & BTRFS_MOUNT_##opt) -#define btrfs_test_opt(root, opt)((root)->fs_info->mount_opt & \ +#define btrfs_test_opt(fs_info, opt)((fs_info)->mount_opt & \ BTRFS_MOUNT_##opt) -#define btrfs_set_and_info(root, opt, fmt, args...)\ +#define btrfs_set_and_info(fs_info, opt, fmt, args...)\ {\ -if (!btrfs_test_opt(root, opt))\ -btrfs_info(root->fs_info, fmt, ##args);\ -btrfs_set_opt(root->fs_info->mount_opt, opt);\ +if (!btrfs_test_opt(fs_info, opt))\ +btrfs_info(fs_info, fmt, ##args);\ +btrfs_set_opt(fs_info->mount_opt, opt);\ } -#define btrfs_clear_and_info(root, opt, fmt, args...)\ +#define btrfs_clear_and_info(fs_info, opt, fmt, args...)\ {\ -if (btrfs_test_opt(root, opt))\ -btrfs_info(root->fs_info, fmt, ##args);\ -btrfs_clear_opt(root->fs_info->mount_opt, opt);\ +if (btrfs_test_opt(fs_info, opt))\ +btrfs_info(fs_info, fmt, ##args);\ +btrfs_clear_opt(fs_info->mount_opt, opt);\ } 
#ifdef CONFIG_BTRFS_DEBUG @@ -1319,9 +1319,9 @@ static inline int btrfs_should_fragment_free_space(struct btrfs_root *root, struct btrfs_block_group_cache *block_group) { -return (btrfs_test_opt(root, FRAGMENT_METADATA) && +return (btrfs_test_opt(root->fs_info, FRAGMENT_METADATA) && block_group->flags & BTRFS_BLOCK_GROUP_METADATA) || - (btrfs_test_opt(root, FRAGMENT_DATA) && + (btrfs_test_opt(root->fs_info, FRAGMENT_DATA) && block_group->flags & BTRFS_BLOCK_GROUP_DATA); } #endif diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 61561c2..ed67717 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -653,7 +653,7 @@ static int btrfs_delayed_inode_reserve_metadata( if (!ret) goto out; -if (btrfs_test_opt(root, ENOSPC_DEBUG)) { +if (btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) { btrfs_debug(root->fs_info, "block rsv migrate returned %d", ret); WARN_ON(1); diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 63ef9cd..e9bbff3 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -142,7 +142,7 @@ no_valid_dev_replace_entry_found: * missing */ if (!dev_replace->srcdev && -!btrfs_test_opt(dev_root, DEGRADED)) { +!btrfs_test_opt(dev_root->fs_info, DEGRADED)) { Just fs_info, as following btrfs_warn() is using fs_info. ret = -EIO; btrfs_warn(fs_info, "cannot mount because device replace operation is ongoing and"); @@ -151,7 +151,7 @@ no_valid_dev_replace_entry_found: src_devid); } if (!dev_replace->tgtdev && -!btrfs_test_opt(dev_root, DEGRADED)) { +!btrfs_test_opt(dev_root->fs_info, DEGRADED)) { Same here. ret = -EIO; btrfs_warn(fs_info, "cannot mount because device replace operation is ongoing and"); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 685c81a..8f27127 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3025,8 +3025,8 @@ retry_root_backup: if (IS_ERR(fs_info->transaction_kthread)) goto fail_cleaner; -if (!btrfs_test_opt(tree_root, SSD) && -!btrfs_test_opt(
Re: [PATCH 03/31] btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
On 6/26/16 10:14 PM, Qu Wenruo wrote: > > > At 06/25/2016 06:14 AM, je...@suse.com wrote: >> From: Jeff Mahoney >> >> btrfs_test_opt and friends only use the root pointer to access >> the fs_info. Let's pass the fs_info directly in preparation to >> eliminate similar patterns all over btrfs. >> >> Signed-off-by: Jeff Mahoney >> --- >> fs/btrfs/ctree.h| 22 >> fs/btrfs/delayed-inode.c| 2 +- >> fs/btrfs/dev-replace.c | 4 +- >> fs/btrfs/disk-io.c | 22 >> fs/btrfs/extent-tree.c | 32 +-- >> fs/btrfs/file.c | 2 +- >> fs/btrfs/free-space-cache.c | 6 +- >> fs/btrfs/inode-map.c| 12 ++-- >> fs/btrfs/inode.c| 12 ++-- >> fs/btrfs/ioctl.c| 2 +- >> fs/btrfs/super.c| 132 >> +++- >> fs/btrfs/transaction.c | 6 +- >> fs/btrfs/tree-log.c | 4 +- >> fs/btrfs/volumes.c | 11 ++-- >> 14 files changed, 137 insertions(+), 132 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 101c3cf..100d2ea 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1297,21 +1297,21 @@ struct btrfs_root { >> #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt) >> #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt) >> #define btrfs_raw_test_opt(o, opt)((o) & BTRFS_MOUNT_##opt) >> -#define btrfs_test_opt(root, opt)((root)->fs_info->mount_opt & \ >> +#define btrfs_test_opt(fs_info, opt)((fs_info)->mount_opt & \ >> BTRFS_MOUNT_##opt) >> >> -#define btrfs_set_and_info(root, opt, fmt, args...)\ >> +#define btrfs_set_and_info(fs_info, opt, fmt, args...)\ >> {\ >> -if (!btrfs_test_opt(root, opt))\ >> -btrfs_info(root->fs_info, fmt, ##args);\ >> -btrfs_set_opt(root->fs_info->mount_opt, opt);\ >> +if (!btrfs_test_opt(fs_info, opt))\ >> +btrfs_info(fs_info, fmt, ##args);\ >> +btrfs_set_opt(fs_info->mount_opt, opt);\ >> } >> >> -#define btrfs_clear_and_info(root, opt, fmt, args...)\ >> +#define btrfs_clear_and_info(fs_info, opt, fmt, args...)\ >> {\ >> -if (btrfs_test_opt(root, opt))\ >> -btrfs_info(root->fs_info, fmt, ##args);\ >> -btrfs_clear_opt(root->fs_info->mount_opt, 
opt);\ >> +if (btrfs_test_opt(fs_info, opt))\ >> +btrfs_info(fs_info, fmt, ##args);\ >> +btrfs_clear_opt(fs_info->mount_opt, opt);\ >> } >> >> #ifdef CONFIG_BTRFS_DEBUG >> @@ -1319,9 +1319,9 @@ static inline int >> btrfs_should_fragment_free_space(struct btrfs_root *root, >> struct btrfs_block_group_cache *block_group) >> { >> -return (btrfs_test_opt(root, FRAGMENT_METADATA) && >> +return (btrfs_test_opt(root->fs_info, FRAGMENT_METADATA) && >> block_group->flags & BTRFS_BLOCK_GROUP_METADATA) || >> - (btrfs_test_opt(root, FRAGMENT_DATA) && >> + (btrfs_test_opt(root->fs_info, FRAGMENT_DATA) && >> block_group->flags & BTRFS_BLOCK_GROUP_DATA); >> } >> #endif >> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c >> index 61561c2..ed67717 100644 >> --- a/fs/btrfs/delayed-inode.c >> +++ b/fs/btrfs/delayed-inode.c >> @@ -653,7 +653,7 @@ static int btrfs_delayed_inode_reserve_metadata( >> if (!ret) >> goto out; >> >> -if (btrfs_test_opt(root, ENOSPC_DEBUG)) { >> +if (btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) { >> btrfs_debug(root->fs_info, >> "block rsv migrate returned %d", ret); >> WARN_ON(1); >> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c >> index 63ef9cd..e9bbff3 100644 >> --- a/fs/btrfs/dev-replace.c >> +++ b/fs/btrfs/dev-replace.c >> @@ -142,7 +142,7 @@ no_valid_dev_replace_entry_found: >> * missing >> */ >> if (!dev_replace->srcdev && >> -!btrfs_test_opt(dev_root, DEGRADED)) { >> +!btrfs_test_opt(dev_root->fs_info, DEGRADED)) { > Just fs_info, as following btrfs_warn() is using fs_info. > >> ret = -EIO; >> btrfs_warn(fs_info, >> "cannot mount because device replace operation is >> ongoing and"); >> @@ -151,7 +151,7 @@ no_valid_dev_replace_entry_found: >> src_devid); >> } >> if (!dev_replace->tgtdev && >> -!btrfs_test_opt(dev_root, DEGRADED)) { >> +!btrfs_test_opt(dev_root->fs_info, DEGRADED)) { > > Same here. 
> >> ret = -EIO; >> btrfs_warn(fs_info, >> "cannot mount because device replace operation is >> ongoing and"); >> diff --git a/fs/btrfs/disk-io.c b/
Re: [PATCH 04/31] btrfs: tests, move initialization into tests/
At 06/25/2016 06:14 AM, je...@suse.com wrote: From: Jeff Mahoney We have all these stubs that only exist because they're called from btrfs_run_sanity_tests, which is a static inside super.c. Let's just move it all into tests/btrfs-tests.c and only have one stub. Signed-off-by: Jeff Mahoney --- fs/btrfs/super.c | 43 fs/btrfs/tests/btrfs-tests.c | 47 ++-- fs/btrfs/tests/btrfs-tests.h | 35 +++-- 3 files changed, 48 insertions(+), 77 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index a7b9a15..d8e48bb 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2323,49 +2323,6 @@ static void btrfs_print_mod_info(void) btrfs_crc32c_impl()); } -static int btrfs_run_sanity_tests(void) -{ - int ret, i; - u32 sectorsize, nodesize; - u32 test_sectorsize[] = { - PAGE_SIZE, - }; - ret = btrfs_init_test_fs(); - if (ret) - return ret; - for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) { - sectorsize = test_sectorsize[i]; - for (nodesize = sectorsize; -nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE; -nodesize <<= 1) { - pr_info("BTRFS: selftest: sectorsize: %u nodesize: %u\n", - sectorsize, nodesize); - ret = btrfs_test_free_space_cache(sectorsize, nodesize); - if (ret) - goto out; - ret = btrfs_test_extent_buffer_operations(sectorsize, - nodesize); - if (ret) - goto out; - ret = btrfs_test_extent_io(sectorsize, nodesize); - if (ret) - goto out; - ret = btrfs_test_inodes(sectorsize, nodesize); - if (ret) - goto out; - ret = btrfs_test_qgroups(sectorsize, nodesize); - if (ret) - goto out; - ret = btrfs_test_free_space_tree(sectorsize, nodesize); - if (ret) - goto out; - } - } -out: - btrfs_destroy_test_fs(); - return ret; -} - static int __init init_btrfs_fs(void) { int err; diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c index 10eb249..d90c951 100644 --- a/fs/btrfs/tests/btrfs-tests.c +++ b/fs/btrfs/tests/btrfs-tests.c @@ -54,7 +54,7 @@ struct inode *btrfs_new_test_inode(void) return new_inode(test_mnt->mnt_sb); } -int btrfs_init_test_fs(void) 
+static int btrfs_init_test_fs(void) { int ret; @@ -73,7 +73,7 @@ int btrfs_init_test_fs(void) return 0; } -void btrfs_destroy_test_fs(void) +static void btrfs_destroy_test_fs(void) { kern_unmount(test_mnt); unregister_filesystem(&test_type); @@ -220,3 +220,46 @@ void btrfs_init_dummy_trans(struct btrfs_trans_handle *trans) INIT_LIST_HEAD(&trans->qgroup_ref_list); trans->type = __TRANS_DUMMY; } + +int btrfs_run_sanity_tests(void) +{ + int ret, i; + u32 sectorsize, nodesize; + u32 test_sectorsize[] = { + PAGE_SIZE, + }; + ret = btrfs_init_test_fs(); + if (ret) + return ret; + for (i = 0; i < ARRAY_SIZE(test_sectorsize); i++) { + sectorsize = test_sectorsize[i]; + for (nodesize = sectorsize; +nodesize <= BTRFS_MAX_METADATA_BLOCKSIZE; +nodesize <<= 1) { + pr_info("BTRFS: selftest: sectorsize: %u nodesize: %u\n", + sectorsize, nodesize); + ret = btrfs_test_free_space_cache(sectorsize, nodesize); + if (ret) + goto out; + ret = btrfs_test_extent_buffer_operations(sectorsize, + nodesize); + if (ret) + goto out; + ret = btrfs_test_extent_io(sectorsize, nodesize); + if (ret) + goto out; + ret = btrfs_test_inodes(sectorsize, nodesize); + if (ret) + goto out; + ret = btrfs_test_qgroups(sectorsize, nodesize); + if (ret) + goto out; + ret = btrfs_test_free_space_tree(sectorsize, nodesize); + if (ret) + goto out; + } + } +out: + btrfs_destroy_test_fs(); + return r
Re: [PATCH 03/31] btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
At 06/25/2016 06:14 AM, je...@suse.com wrote: From: Jeff Mahoney btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney --- fs/btrfs/ctree.h| 22 fs/btrfs/delayed-inode.c| 2 +- fs/btrfs/dev-replace.c | 4 +- fs/btrfs/disk-io.c | 22 fs/btrfs/extent-tree.c | 32 +-- fs/btrfs/file.c | 2 +- fs/btrfs/free-space-cache.c | 6 +- fs/btrfs/inode-map.c| 12 ++-- fs/btrfs/inode.c| 12 ++-- fs/btrfs/ioctl.c| 2 +- fs/btrfs/super.c| 132 +++- fs/btrfs/transaction.c | 6 +- fs/btrfs/tree-log.c | 4 +- fs/btrfs/volumes.c | 11 ++-- 14 files changed, 137 insertions(+), 132 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 101c3cf..100d2ea 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1297,21 +1297,21 @@ struct btrfs_root { #define btrfs_clear_opt(o, opt)((o) &= ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) #define btrfs_raw_test_opt(o, opt) ((o) & BTRFS_MOUNT_##opt) -#define btrfs_test_opt(root, opt) ((root)->fs_info->mount_opt & \ +#define btrfs_test_opt(fs_info, opt) ((fs_info)->mount_opt & \ BTRFS_MOUNT_##opt) -#define btrfs_set_and_info(root, opt, fmt, args...)\ +#define btrfs_set_and_info(fs_info, opt, fmt, args...) \ { \ - if (!btrfs_test_opt(root, opt)) \ - btrfs_info(root->fs_info, fmt, ##args); \ - btrfs_set_opt(root->fs_info->mount_opt, opt); \ + if (!btrfs_test_opt(fs_info, opt)) \ + btrfs_info(fs_info, fmt, ##args); \ + btrfs_set_opt(fs_info->mount_opt, opt); \ } -#define btrfs_clear_and_info(root, opt, fmt, args...) \ +#define btrfs_clear_and_info(fs_info, opt, fmt, args...) 
\ { \ - if (btrfs_test_opt(root, opt)) \ - btrfs_info(root->fs_info, fmt, ##args); \ - btrfs_clear_opt(root->fs_info->mount_opt, opt); \ + if (btrfs_test_opt(fs_info, opt)) \ + btrfs_info(fs_info, fmt, ##args); \ + btrfs_clear_opt(fs_info->mount_opt, opt);\ } #ifdef CONFIG_BTRFS_DEBUG @@ -1319,9 +1319,9 @@ static inline int btrfs_should_fragment_free_space(struct btrfs_root *root, struct btrfs_block_group_cache *block_group) { - return (btrfs_test_opt(root, FRAGMENT_METADATA) && + return (btrfs_test_opt(root->fs_info, FRAGMENT_METADATA) && block_group->flags & BTRFS_BLOCK_GROUP_METADATA) || - (btrfs_test_opt(root, FRAGMENT_DATA) && + (btrfs_test_opt(root->fs_info, FRAGMENT_DATA) && block_group->flags & BTRFS_BLOCK_GROUP_DATA); } #endif diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 61561c2..ed67717 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -653,7 +653,7 @@ static int btrfs_delayed_inode_reserve_metadata( if (!ret) goto out; - if (btrfs_test_opt(root, ENOSPC_DEBUG)) { + if (btrfs_test_opt(root->fs_info, ENOSPC_DEBUG)) { btrfs_debug(root->fs_info, "block rsv migrate returned %d", ret); WARN_ON(1); diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 63ef9cd..e9bbff3 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -142,7 +142,7 @@ no_valid_dev_replace_entry_found: * missing */ if (!dev_replace->srcdev && - !btrfs_test_opt(dev_root, DEGRADED)) { + !btrfs_test_opt(dev_root->fs_info, DEGRADED)) { Just fs_info, as following btrfs_warn() is using fs_info. ret = -EIO; btrfs_warn(fs_info, "cannot mount because device replace operation is ongoing and"); @@ -151,7 +151,7 @@ no_valid_dev_replace_entry_found: src_devid); } if (!dev_replace->tgtdev && - !btrfs_test_opt(dev_root, DEGRADED)) { + !btrfs_test_opt(dev_root->fs_info, DEGRADED)) { Same here.
Re: [PATCH 02/31] btrfs: prefix fsid to all trace events
At 06/25/2016 06:14 AM, je...@suse.com wrote: From: Jeff Mahoney When using trace events to debug a problem, it's impossible to determine which file system generated a particular event. This patch adds a macro to prefix standard information to the head of a trace event. The extent_state alloc/free events are all that's left without an fs_info available. Signed-off-by: Jeff Mahoney --- fs/btrfs/delayed-ref.c | 9 +- fs/btrfs/extent-tree.c | 10 +- fs/btrfs/qgroup.c| 19 +-- fs/btrfs/qgroup.h| 9 +- fs/btrfs/super.c | 2 +- include/trace/events/btrfs.h | 282 --- 6 files changed, 182 insertions(+), 149 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 430b368..e7b1ec0 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -606,7 +606,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, qrecord->num_bytes = num_bytes; qrecord->old_roots = NULL; - qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs, + qexisting = btrfs_qgroup_insert_dirty_extent(fs_info, +delayed_refs, qrecord); if (qexisting) kfree(qrecord); @@ -615,7 +616,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info, spin_lock_init(&head_ref->lock); mutex_init(&head_ref->mutex); - trace_add_delayed_ref_head(ref, head_ref, action); + trace_add_delayed_ref_head(fs_info, ref, head_ref, action); existing = htree_insert(&delayed_refs->href_root, &head_ref->href_node); @@ -682,7 +683,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info, ref->type = BTRFS_TREE_BLOCK_REF_KEY; full_ref->level = level; - trace_add_delayed_tree_ref(ref, full_ref, action); + trace_add_delayed_tree_ref(fs_info, ref, full_ref, action); ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref); @@ -739,7 +740,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info, full_ref->objectid = owner; full_ref->offset = offset; - trace_add_delayed_data_ref(ref, full_ref, action); + trace_add_delayed_data_ref(fs_info, ref, full_ref, action); ret = add_delayed_ref_tail_merge(trans, 
delayed_refs, head_ref, ref); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 29e5d00..39308a8 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2194,7 +2194,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle *trans, ins.type = BTRFS_EXTENT_ITEM_KEY; ref = btrfs_delayed_node_to_data_ref(node); - trace_run_delayed_data_ref(node, ref, node->action); + trace_run_delayed_data_ref(root->fs_info, node, ref, node->action); if (node->type == BTRFS_SHARED_DATA_REF_KEY) parent = ref->parent; @@ -2349,7 +2349,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle *trans, SKINNY_METADATA); ref = btrfs_delayed_node_to_tree_ref(node); - trace_run_delayed_tree_ref(node, ref, node->action); + trace_run_delayed_tree_ref(root->fs_info, node, ref, node->action); if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) parent = ref->parent; @@ -2413,7 +2413,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, */ BUG_ON(extent_op); head = btrfs_delayed_node_to_head(node); - trace_run_delayed_ref_head(node, head, node->action); + trace_run_delayed_ref_head(root->fs_info, node, head, + node->action); if (insert_reserved) { btrfs_pin_extent(root, node->bytenr, @@ -8317,7 +8318,8 @@ static int record_one_subtree_extent(struct btrfs_trans_handle *trans, delayed_refs = &trans->transaction->delayed_refs; spin_lock(&delayed_refs->lock); - if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord)) + if (btrfs_qgroup_insert_dirty_extent(trans->root->fs_info, +delayed_refs, qrecord)) kfree(qrecord); spin_unlock(&delayed_refs->lock); diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 9d4c05b..13e28d8 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1453,9 +1453,10 @@ int btrfs_qgroup_prepare_account_extents(struct btrfs_trans_handle *trans, return ret; } -struct btrfs_qgroup_extent_record -*btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs, - struct btrfs_qgroup_extent_record *record) 
+struct btrfs_qgroup_extent_record * +btrfs_qgroup_insert_dirty_ext
Re: [PATCH 00/31] btrfs: simplify use of struct btrfs_root pointers
At 06/25/2016 06:14 AM, je...@suse.com wrote: From: Jeff Mahoney One of the common complaints I've heard from new and experienced developers alike about the btrfs code is the ubiquity of struct btrfs_root. There is one for every tree on disk and it's not always obvious which root is needed in a particular call path. It can be frustrating to spend time figuring out which root is required only to discover that it's not actually used for anything other than getting the fs-global struct btrfs_fs_info. Couldn't agree more. btrfs_root should only be used in cases involving searching a single fs tree. For non-fs trees, like the extent/chunk/root and other trees, we should get them from fs_info rather than passing a btrfs_root. So for this, I totally agree with the modification. The patchset contains several sections. 1) The fsid trace event patchset I posted earlier; I can rebase without this but I'd prefer not to. IMHO the fsid output in ftrace is too long, making the ftrace harder to read. What's more, some developers, like me, test with only one btrfs filesystem; in that case the ftrace fsid output is completely redundant. So it would be quite nice to limit the fsid to its first 8 hex bytes. (The first time I tested David's for-next branch, the ftrace output was quite hard to read; I had to strip the fsids out to get at the data I needed.) 2) Converting btrfs_test_opt and friends to use an fs_info. 3) Converting tests to use an fs_info pointer whenever a root is used. 4) Moving sectorsize and nodesize to fs_info and cleaning up the macros used to access them. Yeah! nodesize inside btrfs_root was never sane. I still remember the days when I needed to get nodesize either from fs_info->sb_copy or fs_info->root_tree. I'll check each patch more carefully, but the direction is quite good. Good job! Thanks, Qu 5) General cleanups and small fixes to make the later patches easier to review. 
6) Adding an "fs_info" convenience variable to every function that references root->fs_info more than once. As new references appear in functions, more of these are added later. 7) Call functions that always overwrite their root parameter with an fs_info instead. 8) Call functions that are always called using the same root with an fs_info instead, retrieving the root internally. 9) Convert every function that accepts a root argument but only uses it to retrieve an fs_info to accept an fs_info instead. This is a recursive process as the changes percolate up. 10) Separately convert btrfs_commit_transaction and btrfs_end_transaction to use trans->root instead of a root parameter. Both are always called with the same root that was passed to btrfs_{start,join}_transaction. This series of patches in email format is the "squashed" version of the full development series. That series is available at: git://git.kernel.org/pub/scm/linux/kernel/git/jeffm/linux-btrfs.git There are two branches of interest: - btrfs-testing/root-fsinfo-cleanup-squashed contains this series - btrfs-testing/root-fsinfo-cleanup contains the full series The btrfs-testing/root-fsinfo-cleanup branch is easier to review if using git as each change is discrete. As a whole, the patchset is invasive but should change execution minimally. It passes and fails the same xfstests that the unpatched kernel does across multiple runs. 
Thanks, -Jeff --- Jeff Mahoney (31): btrfs: plumb fs_info into btrfs_work btrfs: prefix fsid to all trace events btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs: tests, move initialization into tests/ btrfs: tests, require fs_info for root btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root btrfs: simpilify btrfs_subvol_inherit_props btrfs: copy_to_sk drop unused root parameter btrfs: cleanup, remove prototype for btrfs_find_root_ref btrfs: introduce BTRFS_MAX_ITEM_SIZE btrfs: convert nodesize macros to static inlines btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction btrfs: add btrfs_trans_handle->fs_info pointer btrfs: btrfs_abort_transaction, drop root parameter btrfs: call functions that overwrite their root parameter with fs_info btrfs: call functions that always use the same root with fs_info instead btrfs: btrfs_init_new_device should use fs_info->dev_root btrfs: alloc_reserved_file_extent trace point should use extent_root btrfs: struct btrfsic_state->root should be an fs_info btrfs: struct reada_control.root -> reada_control.fs_info btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere btrfs: root->fs_info cleanup, io_ctl_init btrfs: pull node/sector/stripe sizes out of root and into fs_info btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size btrfs: root->fs_info cleanup, lock/unlock_chunks btrfs: root->fs_info cleanup, update_block_group{,flags} btrfs: root->fs_info cleanup, add fs_info convenience variables btrfs:
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sun, Jun 26, 2016 at 03:33:08PM -0700, ronnie sahlberg wrote: > On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote: > > Could this explain why people have been reporting so many raid56 mode > > cases of btrfs replacing a first drive appearing to succeed just fine, > > but then they go to btrfs replace a second drive, and the array crashes > > as if the first replace didn't work correctly after all, resulting in two > > bad devices once the second replace gets under way, of course bringing > > down the array? > > > > If so, then it looks like we have our answer as to what has been going > > wrong that has been so hard to properly trace and thus to bugfix. > > > > Combine that with the raid4 dedicated parity device behavior you're > > seeing if the writes are all exactly 128 MB, with that possibly > > explaining the super-slow replaces, and this thread may have just given > > us answers to both of those until-now-untraceable issues. > > > > Regardless, what's /very/ clear by now is that raid56 mode as it > > currently exists is more or less fatally flawed, and a full scrap and > > rewrite to an entirely different raid56 mode on-disk format may be > > necessary to fix it. > > > > And what's even clearer is that people /really/ shouldn't be using raid56 > > mode for anything but testing with throw-away data, at this point. > > Anything else is simply irresponsible. > > > > Does that mean we need to put a "raid56 mode may eat your babies" level > > warning in the manpage and require a --force to either mkfs.btrfs or > > balance to raid56 mode? Because that's about where I am on it. > > Agree. At this point letting ordinary users create raid56 filesystems > is counterproductive. > > > I would suggest: > > 1, a much more strongly worded warning in the wiki. Make sure there > are no misunderstandings > that they really should not use raid56 right now for new filesystems. I beefed up the warnings in several places in the wiki a couple of days ago. Hugo. 
> 2, Instead of a --force flag. (Users tend to ignore --force and > warnings in documentation.) > Instead ifdef out the options to create raid56 in mkfs.btrfs. > Developers who want to test can just remove the ifdef and recompile > the tools anyway. > But if end-users have to recompile userspace, that really forces the > point that "you > really should not use this right now". > > 3, reach out to the documentation and fora for the major distros and > make sure they update their > documentation accordingly. > I think a lot of end-users, if they try to research something, are > more likely to go to fora and wiki > than to search out an upstream forum. -- Hugo Mills | "No! My collection of rare, incurable diseases! hugo@... carfax.org.uk | Violated!" http://carfax.org.uk/ | PGP: E2AB1DE4 |Stimpson J. Cat, The Ren & Stimpy Show 
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sat, Jun 25, 2016 at 7:53 PM, Duncan <1i5t5.dun...@cox.net> wrote: > Chris Murphy posted on Sat, 25 Jun 2016 11:25:05 -0600 as excerpted: > >> Wow. So it sees the data strip corruption, uses good parity on disk to >> fix it, writes the fix to disk, recomputes parity for some reason but >> does it wrongly, and then overwrites good parity with bad parity? >> That's fucked. So in other words, if there are any errors fixed up >> during a scrub, you should do a 2nd scrub. The first scrub should make >> sure data is correct, and the 2nd scrub should make sure the bug is >> papered over by computing correct parity and replacing the bad parity. >> >> I wonder if the same problem happens with balance or if this is just a >> bug in scrub code? > > Could this explain why people have been reporting so many raid56 mode > cases of btrfs replacing a first drive appearing to succeed just fine, > but then they go to btrfs replace a second drive, and the array crashes > as if the first replace didn't work correctly after all, resulting in two > bad devices once the second replace gets under way, of course bringing > down the array? > > If so, then it looks like we have our answer as to what has been going > wrong that has been so hard to properly trace and thus to bugfix. > > Combine that with the raid4 dedicated parity device behavior you're > seeing if the writes are all exactly 128 MB, with that possibly > explaining the super-slow replaces, and this thread may have just given > us answers to both of those until-now-untraceable issues. > > Regardless, what's /very/ clear by now is that raid56 mode as it > currently exists is more or less fatally flawed, and a full scrap and > rewrite to an entirely different raid56 mode on-disk format may be > necessary to fix it. > > And what's even clearer is that people /really/ shouldn't be using raid56 > mode for anything but testing with throw-away data, at this point. > Anything else is simply irresponsible. 
> > Does that mean we need to put a "raid56 mode may eat your babies" level > warning in the manpage and require a --force to either mkfs.btrfs or > balance to raid56 mode? Because that's about where I am on it. Agree. At this point letting ordinary users create raid56 filesystems is counterproductive. I would suggest: 1, a much more strongly worded warning in the wiki. Make sure there are no misunderstandings that they really should not use raid56 right now for new filesystems. 2, Instead of a --force flag. (Users tend to ignore --force and warnings in documentation.) Instead ifdef out the options to create raid56 in mkfs.btrfs. Developers who want to test can just remove the ifdef and recompile the tools anyway. But if end-users have to recompile userspace, that really forces the point that "you really should not use this right now". 3, reach out to the documentation and fora for the major distros and make sure they update their documentation accordingly. I think a lot of end-users, if they try to research something, are more likely to go to fora and wiki than to search out an upstream forum. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bad hard drive - checksum verify failure forces readonly mount
On Sun, Jun 26, 2016 at 7:05 AM, Vasco Almeida wrote: > On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote: >> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida > > wrote: >> > Citando Chris Murphy : >> > > 3. btrfs-image so that devs can see what's causing the problem >> > > that >> > > the current code isn't handling well enough. >> > >> > >> > btrfs-image does not create dump image: >> > >> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root >> > btrfs-lv_opensuse_root.image >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > Csum didn't match >> > Error reading metadata block >> > Error adding block -5 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8 >> > Csum didn't match >> > Error reading metadata block >> > Error flushing pending -5 >> > create failed (Success) >> > # echo $? >> > 1 >> >> Well it's pretty strange to have DUP metadata and for the checksum >> verify to fail on both copies. I don't have much optimism that btrfsck >> repair can fix it either. But still it's worth a shot since there's >> not much else to go on. > > I have tried "btrfs check --repair /device" but that does not seem to do > any good. > http://paste.fedoraproject.org/384960/66945936/ It did fix things, in particular with the snapshot that was having problems being dropped. But it's not enough it seems to prevent it from going read only. There's more than one bug here, you might see if the repair was good enough that it's possible to use btrfs-image now. 
If not, use btrfs-debug-tree > file.txt and post that file somewhere. This does expose file names. Maybe that'll shed some light on the problem. But also worth filing a bug at bugzilla.kernel.org with this debug tree referenced (probably too big to attach), maybe a dev will be able to look at it and improve things so they don't fail. > What else can I do, or must I rebuild the file system? Well, it's a long shot but you could try using --repair --init-csum which will create a new csum tree. But that applies to data; if the problem with it going read only is due to metadata corruption this won't help. And then last you could try --init-extent-tree. The thing I can't answer is which order to do them in. In any case there will be files that you shouldn't trust after csum has been recreated, anything corrupt will now have a new csum, so you can get silent data corruption. It's better to just blow away this file system and make a new one and reinstall the OS. But if you're feeling brave, you can try one or both of those additional options and see if they can help. -- Chris Murphy 
Re: Adventures in btrfs raid5 disk recovery
On Sun, Jun 26, 2016 at 01:30:03PM -0600, Chris Murphy wrote: > On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov wrote: > > 26.06.2016 00:52, Chris Murphy пишет: > >> Interestingly enough, so far I'm finding with full stripe writes, i.e. > >> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This > >> is raid4. > > > > That's not what code suggests and what I see in practice - parity seems > > to be distributed across all disks; each new 128KiB file (extent) has > > parity on new disk. At least as long as we can trust btrfs-map-logical > > to always show parity as "mirror 2". > > > tl;dr Andrei is correct there's no raid4 behavior here. > > Looks like mirror 2 is always parity, more on that below. > > > > > > Do you see consecutive full stripes in your tests? Or how do you > > determine which devid has parity for a given full stripe? > > I do see consecutive full stripe writes, but it doesn't always happen. > But not checking the consecutivity is where I became confused. > > [root@f24s ~]# filefrag -v /mnt/5/ab* > Filesystem type is: 9123683e > File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456128.. 3456159: 32: last,eof > /mnt/5/ab128_2.txt: 1 extent found > File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456224.. 3456255: 32: last,eof > /mnt/5/ab128_3.txt: 1 extent found > File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456320.. 3456351: 32: last,eof > /mnt/5/ab128_4.txt: 1 extent found > File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456352.. 
3456383: 32: last,eof > /mnt/5/ab128_5.txt: 1 extent found > File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456384.. 3456415: 32: last,eof > /mnt/5/ab128_6.txt: 1 extent found > File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456416.. 3456447: 32: last,eof > /mnt/5/ab128_7.txt: 1 extent found > File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456448.. 3456479: 32: last,eof > /mnt/5/ab128_8.txt: 1 extent found > File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456480.. 3456511: 32: last,eof > /mnt/5/ab128_9.txt: 1 extent found > File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 31:3456096.. 3456127: 32: last,eof > /mnt/5/ab128.txt: 1 extent found > > Starting with the bottom file then from the top so they're in 4096 > byte block order; and the 2nd column is the difference in value: > > 3456096 > 3456128 32 > 3456224 96 > 3456320 96 > 3456352 32 > 3456384 32 > 3456416 32 > 3456448 32 > 3456480 32 > > So the first two files are consecutive full stripe writes. The next > two aren't. The next five are. They were all copied at the same time. > I don't know why they aren't always consecutive writes. The logical addresses don't include parity stripes, so you won't find them with FIEMAP. Parity locations are calculated after the logical -> (disk, chunk_offset) translation is done (it's the same chunk_offset on every disk, but one of the disks is parity while the others are data). 
Re: Adventures in btrfs raid5 disk recovery
On Sun, Jun 26, 2016 at 1:54 AM, Andrei Borzenkov wrote:
> On 26.06.2016 00:52, Chris Murphy wrote:
>> Interestingly enough, so far I'm finding with full stripe writes, i.e.
>> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
>> is raid4.
>
> That's not what the code suggests and what I see in practice - parity
> seems to be distributed across all disks; each new 128KiB file (extent)
> has parity on a new disk. At least as long as we can trust
> btrfs-map-logical to always show parity as "mirror 2".

tl;dr Andrei is correct, there's no raid4 behavior here.
Looks like mirror 2 is always parity, more on that below.

> Do you see consecutive full stripes in your tests? Or how do you
> determine which devid has parity for a given full stripe?

I do see consecutive full stripe writes, but it doesn't always happen.
Not checking the consecutivity is where I became confused.

[root@f24s ~]# filefrag -v /mnt/5/ab*
Filesystem type is: 9123683e
File size of /mnt/5/ab128_2.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456128..   3456159:     32:             last,eof
/mnt/5/ab128_2.txt: 1 extent found
File size of /mnt/5/ab128_3.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456224..   3456255:     32:             last,eof
/mnt/5/ab128_3.txt: 1 extent found
File size of /mnt/5/ab128_4.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456320..   3456351:     32:             last,eof
/mnt/5/ab128_4.txt: 1 extent found
File size of /mnt/5/ab128_5.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456352..   3456383:     32:             last,eof
/mnt/5/ab128_5.txt: 1 extent found
File size of /mnt/5/ab128_6.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456384..   3456415:     32:             last,eof
/mnt/5/ab128_6.txt: 1 extent found
File size of /mnt/5/ab128_7.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456416..   3456447:     32:             last,eof
/mnt/5/ab128_7.txt: 1 extent found
File size of /mnt/5/ab128_8.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456448..   3456479:     32:             last,eof
/mnt/5/ab128_8.txt: 1 extent found
File size of /mnt/5/ab128_9.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456480..   3456511:     32:             last,eof
/mnt/5/ab128_9.txt: 1 extent found
File size of /mnt/5/ab128.txt is 131072 (32 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..      31:    3456096..   3456127:     32:             last,eof
/mnt/5/ab128.txt: 1 extent found

Starting with the bottom file, then from the top, so they're in
4096-byte block order; the 2nd column is the difference in value:

3456096
3456128    32
3456224    96
3456320    96
3456352    32
3456384    32
3456416    32
3456448    32
3456480    32

So the first two files are consecutive full stripe writes. The next
two aren't. The next five are. They were all copied at the same time.
I don't know why they aren't always consecutive writes.
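The difference column above can be checked mechanically. A quick sketch (block numbers copied from the filefrag output; on 3-disk raid5 a 128KiB file spans two 64KiB data strips, i.e. 32 x 4KiB blocks, so a delta of exactly 32 means the next file starts in the very next full stripe):

```python
# Physical start blocks (4096-byte units) from the filefrag output,
# sorted into block order.
starts = [3456096, 3456128, 3456224, 3456320, 3456352,
          3456384, 3456416, 3456448, 3456480]

deltas = [b - a for a, b in zip(starts, starts[1:])]
print(deltas)  # [32, 96, 96, 32, 32, 32, 32, 32]

# A delta of 32 blocks (128KiB) means the two files occupy consecutive
# full stripes; 96 means two full-stripe-sized gaps sit in between.
consecutive = [d == 32 for d in deltas]
```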
[root@f24s ~]# btrfs-map-logical -l $[4096*3456096] /dev/VG/a
mirror 1 logical 14156169216 physical 1108541440 device /dev/mapper/VG-a
mirror 2 logical 14156169216 physical 2182283264 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logical -l $[4096*3456128] /dev/VG/a
mirror 1 logical 14156300288 physical 1075052544 device /dev/mapper/VG-b
mirror 2 logical 14156300288 physical 1108606976 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456224] /dev/VG/a
mirror 1 logical 14156693504 physical 1075249152 device /dev/mapper/VG-b
mirror 2 logical 14156693504 physical 1108803584 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456320] /dev/VG/a
mirror 1 logical 14157086720 physical 1075445760 device /dev/mapper/VG-b
mirror 2 logical 14157086720 physical 1109000192 device /dev/mapper/VG-a
[root@f24s ~]# btrfs-map-logical -l $[4096*3456352] /dev/VG/a
mirror 1 logical 14157217792 physical 2182807552 device /dev/mapper/VG-c
mirror 2 logical 14157217792 physical 1075511296 device /dev/mapper/VG-b
[root@f24s ~]# btrfs-map-logical -l $[4096*3456384] /dev/VG/a
mirror 1 logical 14157348864 physical 1109131264 device /dev/mapper/VG-a
mirror 2 logical 14157348864 physical 2182873088 device /dev/mapper/VG-c
[root@f24s ~]# btrfs-map-logi
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On Sun, Jun 26, 2016 at 3:20 AM, Goffredo Baroncelli wrote:
> On 2016-06-26 00:33, Chris Murphy wrote:
>> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli wrote:
>>> On 2016-06-25 19:58, Chris Murphy wrote:
>>> [...]
>>>> Wow. So it sees the data strip corruption, uses good parity on disk
>>>> to fix it, writes the fix to disk, recomputes parity for some reason
>>>> but does it wrongly, and then overwrites good parity with bad parity?
>>>>
>>>> The wrong parity, is it valid for the data strips that include the
>>>> (intentionally) corrupt data?
>>>>
>>>> Can parity computation happen before the csum check? Where sometimes
>>>> you get:
>>>>
>>>> read data strips > compute parity > check csum fails > read good
>>>> parity from disk > fix up the bad data chunk > write wrong parity
>>>> (based on wrong data)?
>>>>
>>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>>
>>>> Lines 2371-2383 suggest that there's a parity check, and that parity
>>>> isn't always rewritten to disk if it's already correct. But it
>>>> doesn't know it's not correct, it thinks it's wrong, so it writes
>>>> out the wrongly computed parity?
>>>
>>> The parity is not valid for both the corrected data and the corrupted
>>> data. It seems that the scrub process copies the contents of disk2 to
>>> disk3. That could happen only if the contents of disk1 are zero.
>>
>> I'm not sure what it takes to hit this exactly. I just tested 3x
>> raid5, with two files, 128KiB "a" and 128KiB "b", so that's a full
>> stripe write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB
>> of "b" on devid 2, did a scrub; the error was detected and corrected,
>> and parity was still correct.
>
> How many times did you try this corruption test? I was unable to
> trigger the bug systematically; on average I got 1 bug every three
> tests.

Once. I just did it a 2nd time and both files' parity is wrong now. So
I did it several more times. Sometimes both files' parity is bad.
Sometimes just one file's parity is bad. Sometimes neither file's
parity is bad.
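The suspected ordering above - parity recomputed from the strips as read, before the csum failure triggers the data fixup - can be modeled with plain XOR. This is an illustrative toy, not the actual raid56.c code path:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (single-parity raid5 math)."""
    return bytes(x ^ y for x, y in zip(a, b))

strip1, strip2 = b"\x11" * 4, b"\x22" * 4
good_parity = xor(strip1, strip2)   # what scrub should leave on disk

corrupt1 = b"\x99" * 4              # strip1 as read: damaged on disk

# Hypothesized bug: parity gets recomputed from the as-read strips
# before the data fixup happens...
stale_parity = xor(corrupt1, strip2)

# ...then the data fixup itself succeeds (the good copy is restored
# from the old on-disk parity), but the stale parity is what gets
# written out, so data and parity no longer agree.
assert stale_parity != good_parity
```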
It's a very bad bug, because it's a form of silent data corruption and
it's induced by Btrfs itself. And it's apparently hit
non-deterministically. Is this some form of race condition?

Somewhat orthogonal to this: while Btrfs is subject to the write hole
problem, where parity can be wrong, that case is at least detected and
warned about, and bad data doesn't propagate up to user space.

This might explain how users get hit with corrupt files only after they
have a degraded volume. They did a scrub where some fixups happened, but
behind the scenes parity was possibly corrupted even though their data
was fixed. Later they have a failed device, the bad parity is needed,
and now there are a bunch of scary checksum errors with piles of files
listed as unrecoverable. And in fact they are unrecoverable, because the
bad parity means bad reconstruction, so even scraping the data off with
btrfs restore won't work.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
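The failure mode described in the message above can be shown with a toy XOR model (illustrative only, not btrfs code): once wrong parity is sitting on disk, a later degraded read that must reconstruct a lost strip yields garbage, which the data csum then rightly rejects - the "unrecoverable" symptom.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (single-parity raid5 math)."""
    return bytes(x ^ y for x, y in zip(a, b))

strip1, strip2 = b"\x11" * 4, b"\x22" * 4   # the real data
bad_parity = xor(b"\x99" * 4, strip2)       # parity computed from corrupt data

# The device holding strip1 dies; raid5 rebuilds it from strip2 + parity.
rebuilt = xor(strip2, bad_parity)

# The rebuild yields the corrupt bytes, not the original data, so the
# csum check fails and the file is reported unrecoverable.
assert rebuilt != strip1
```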
Re: Adventures in btrfs raid5 disk recovery
Andrei Borzenkov posted on Sun, 26 Jun 2016 10:54:16 +0300 as excerpted:

> P.S. usage of "stripe" to mean "stripe element" actually adds to
> confusion when reading code :)

... and posts (including patches, which I guess are code as well, just
not applied yet). I've been noticing that in the "stripe length"
patches, when the comment associated with the patch suggests it's
"strip length" they're actually talking about, using the "N strips, one
per device, make a stripe" definition.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH 00/31] btrfs: simplify use of struct btrfs_root pointers
On 6/24/16 6:14 PM, je...@suse.com wrote:
> From: Jeff Mahoney
>
> One of the common complaints I've heard from new and experienced
> developers alike about the btrfs code is the ubiquity of
> struct btrfs_root. There is one for every tree on disk and it's not
> always obvious which root is needed in a particular call path. It can
> be frustrating to spend time figuring out which root is required only
> to discover that it's not actually used for anything other than
> getting the fs-global struct btrfs_fs_info.
>
> The patchset contains several sections.
>
> 1) The fsid trace event patchset I posted earlier; I can rebase
>    without this but I'd prefer not to.
>
> 2) Converting btrfs_test_opt and friends to use an fs_info.
>
> 3) Converting tests to use an fs_info pointer whenever a root is used.
>
> 4) Moving sectorsize and nodesize to fs_info and cleaning up the
>    macros used to access them.
>
> 5) General cleanups and small fixes to make the later patches easier
>    to review.
>
> 6) Adding an "fs_info" convenience variable to every function that
>    references root->fs_info more than once. As new references appear
>    in functions, more of these are added later.
>
> 7) Call functions that always overwrite their root parameter with
>    an fs_info instead.
>
> 8) Call functions that are always called using the same root with
>    an fs_info instead, retrieving the root internally.
>
> 9) Convert every function that accepts a root argument but only uses
>    it to retrieve an fs_info to accept an fs_info instead. This is a
>    recursive process as the changes percolate up.
>
> 10) Separately convert btrfs_commit_transaction and
>     btrfs_end_transaction to use trans->root instead of a root
>     parameter. Both are always called with the same root that was
>     passed to btrfs_{start,join}_transaction.
>
> This series of patches in email format is the "squashed" version of
> the full development series.
> That series is available at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jeffm/linux-btrfs.git
>
> There are two branches of interest:
> - btrfs-testing/root-fsinfo-cleanup-squashed contains this series
> - btrfs-testing/root-fsinfo-cleanup contains the full series

Hrm. It looks like vger didn't allow a few of the patches through,
presumably due to size. Using the git repo is probably the only way to
go, then.

-Jeff

> The btrfs-testing/root-fsinfo-cleanup branch is easier to review if
> using git, as each change is discrete. As a whole, the patchset is
> invasive but should change execution minimally. It passes and fails
> the same xfstests that the unpatched kernel does across multiple runs.
>
> Thanks,
>
> -Jeff
>
> ---
>
> Jeff Mahoney (31):
>   btrfs: plumb fs_info into btrfs_work
>   btrfs: prefix fsid to all trace events
>   btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
>   btrfs: tests, move initialization into tests/
>   btrfs: tests, require fs_info for root
>   btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
>   btrfs: simpilify btrfs_subvol_inherit_props
>   btrfs: copy_to_sk drop unused root parameter
>   btrfs: cleanup, remove prototype for btrfs_find_root_ref
>   btrfs: introduce BTRFS_MAX_ITEM_SIZE
>   btrfs: convert nodesize macros to static inlines
>   btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
>   btrfs: add btrfs_trans_handle->fs_info pointer
>   btrfs: btrfs_abort_transaction, drop root parameter
>   btrfs: call functions that overwrite their root parameter with fs_info
>   btrfs: call functions that always use the same root with fs_info
>     instead
>   btrfs: btrfs_init_new_device should use fs_info->dev_root
>   btrfs: alloc_reserved_file_extent trace point should use extent_root
>   btrfs: struct btrfsic_state->root should be an fs_info
>   btrfs: struct reada_control.root -> reada_control.fs_info
>   btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
>   btrfs: root->fs_info cleanup, io_ctl_init
>   btrfs: pull node/sector/stripe sizes out of root and into fs_info
>   btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
>   btrfs: root->fs_info cleanup, lock/unlock_chunks
>   btrfs: root->fs_info cleanup, update_block_group{,flags}
>   btrfs: root->fs_info cleanup, add fs_info convenience variables
>   btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
>   btrfs: take an fs_info parameter directly when the root is not used
>     otherwise
>   btrfs: root->fs_info cleanup, btrfs_commit_transaction already has
>     root
>   btrfs: root->fs_info cleanup, btrfs_end_transaction{,_throttle} use
>     trans->fs_info instead of parameter
>
>  fs/btrfs/async-thread.c    | 31 +-
>  fs/btrfs/async-thread.h    |  6 +-
>  fs/btrfs/backref.c         | 12 +-
>  fs/btrfs/check-integrity.c | 41 +-
>  fs/btrfs/check-integrity.h |  5 +-
>  fs/btrfs/compression.c
Re: Bad hard drive - checksum verify failure forces readonly mount
On Sat, 25-06-2016 at 14:54 -0600, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 2:10 PM, Vasco Almeida wrote:
> > Quoting Chris Murphy:
> > > 3. btrfs-image so that devs can see what's causing the problem
> > > that the current code isn't handling well enough.
> >
> > btrfs-image does not create a dump image:
> >
> > # btrfs-image /dev/mapper/vg_pupu-lv_opensuse_root btrfs-lv_opensuse_root.image
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error adding block -5
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > checksum verify failed on 437944320 found 8BF8C752 wanted 39F456C8
> > Csum didn't match
> > Error reading metadata block
> > Error flushing pending -5
> > create failed (Success)
> > # echo $?
> > 1
>
> Well, it's pretty strange to have DUP metadata and for the checksum
> verify to fail on both copies. I don't have much optimism that btrfsck
> repair can fix it either. But still it's worth a shot since there's
> not much else to go on.

I have tried "btrfs check --repair /device" but that does not seem to
do any good. http://paste.fedoraproject.org/384960/66945936/

I then issued "mount /device /mnt" and, like before, it was mounted
read-write and then forced readonly. I got some kernel oopses and
traces. I noticed that btrfs-balance was using ~100% CPU while the
btrfs device was mounted readonly. I let it run for about 20 minutes.
Then I had to reboot because the system was not responding well: I was
unable to open or close applications or use the internet.
I did SysRq+reisu (those operations were enabled) and then pressed the
reset button on the computer. Unfortunately the dmesg dumps were lost
after resetting the computer.

What else can I do, or must I rebuild the file system?
Re: [BUG] Btrfs scrub sometime recalculate wrong parity in raid5
On 2016-06-26 00:33, Chris Murphy wrote:
> On Sat, Jun 25, 2016 at 12:42 PM, Goffredo Baroncelli wrote:
>> On 2016-06-25 19:58, Chris Murphy wrote:
>> [...]
>>> Wow. So it sees the data strip corruption, uses good parity on disk
>>> to fix it, writes the fix to disk, recomputes parity for some reason
>>> but does it wrongly, and then overwrites good parity with bad parity?
>>>
>>> The wrong parity, is it valid for the data strips that include the
>>> (intentionally) corrupt data?
>>>
>>> Can parity computation happen before the csum check? Where sometimes
>>> you get:
>>>
>>> read data strips > compute parity > check csum fails > read good
>>> parity from disk > fix up the bad data chunk > write wrong parity
>>> (based on wrong data)?
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/raid56.c?id=refs/tags/v4.6.3
>>>
>>> Lines 2371-2383 suggest that there's a parity check, and that parity
>>> isn't always rewritten to disk if it's already correct. But it
>>> doesn't know it's not correct, it thinks it's wrong, so it writes
>>> out the wrongly computed parity?
>>
>> The parity is not valid for both the corrected data and the corrupted
>> data. It seems that the scrub process copies the contents of disk2 to
>> disk3. That could happen only if the contents of disk1 are zero.
>
> I'm not sure what it takes to hit this exactly. I just tested 3x
> raid5, with two files, 128KiB "a" and 128KiB "b", so that's a full
> stripe write for each. I corrupted 64KiB of "a" on devid 1 and 64KiB
> of "b" on devid 2, did a scrub; the error was detected and corrected,
> and parity was still correct.

How many times did you try this corruption test? I was unable to
trigger the bug systematically; on average I got 1 bug every three
tests.

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
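Goffredo's inference above - that parity (disk3) ending up as a copy of disk2 is only possible if disk1's strip is all zeros - follows directly from XOR parity. A quick sanity check of that reasoning:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (single-parity raid5 math)."""
    return bytes(x ^ y for x, y in zip(a, b))

disk2 = b"\xab" * 4

# With an all-zero disk1 strip, parity degenerates to disk2's data:
# seeing "parity == disk2" on disk therefore implies disk1 was zeros.
assert xor(b"\x00" * 4, disk2) == disk2

# Any nonzero disk1 strip makes parity differ from disk2.
assert xor(b"\x01" * 4, disk2) != disk2
```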
Re: Adventures in btrfs raid5 disk recovery
On 26.06.2016 00:52, Chris Murphy wrote:
> Interestingly enough, so far I'm finding with full stripe writes, i.e.
> 3x raid5, exactly 128KiB data writes, devid 3 is always parity. This
> is raid4.

That's not what the code suggests and not what I see in practice -
parity seems to be distributed across all disks; each new 128KiB file
(extent) has parity on a new disk. At least as long as we can trust
btrfs-map-logical to always show parity as "mirror 2".

Do you see consecutive full stripes in your tests? Or how do you
determine which devid has parity for a given full stripe? This
information is not actually stored anywhere; it is computed based on
block group geometry and logical stripe offset.

P.S. The usage of "stripe" to mean "stripe element" actually adds to
confusion when reading the code :)