Re: [PATCH 01/18] btrfs-progs: btrfs-debug-tree: add option -f for "block only"
-------- Original Message --------
Subject: [PATCH 01/18] btrfs-progs: btrfs-debug-tree: add option -f for "block only"
From:
To:
Date: 2014-12-11 04:51

> From: Martin Wilck
>
> btrfs-debug-tree prints only the given block. It is sometimes useful to
> be able to print the subtree under this block. This patch enables this
> behavior with the option "-f".
>
> Signed-off-by: Martin Wilck
> ---
>  btrfs-debug-tree.c | 10 ++++++++--
>  1 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
> index e46500d..e61c71c 100644
> --- a/btrfs-debug-tree.c
> +++ b/btrfs-debug-tree.c
> @@ -41,6 +41,8 @@ static int print_usage(void)
>          fprintf(stderr, "\t-u : print info of uuid tree only\n");
>          fprintf(stderr, "\t-b block_num : print info of the specified block"
>                      " only\n");
> +        fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
> +                    " block\n");
>          fprintf(stderr, "\t-t tree_id : print only the tree with the given id\n");
>          fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
> @@ -137,6 +139,7 @@ int main(int ac, char **av)
>          int roots_only = 0;
>          int root_backups = 0;
>          u64 block_only = 0;
> +        int block_follow = 0;
>          struct btrfs_root *tree_root_scan;
>          u64 tree_id = 0;
> @@ -144,7 +147,7 @@ int main(int ac, char **av)
>          while(1) {
>                  int c;
> -                c = getopt(ac, av, "deb:rRut:");
> +                c = getopt(ac, av, "defb:rRut:");
>                  if (c < 0)
>                          break;
>                  switch(c) {
> @@ -167,6 +170,9 @@ int main(int ac, char **av)
>                          case 'b':
>                                  block_only = arg_strtou64(optarg);
>                                  break;
> +                        case 'f':
> +                                block_follow = 1;
> +                                break;
>                          case 't':
>                                  tree_id = arg_strtou64(optarg);
>                                  break;
> @@ -211,7 +217,7 @@ int main(int ac, char **av)
>                                  (unsigned long long)block_only);
>                          goto close_root;
>                  }
> -                btrfs_print_tree(root, leaf, 0);
> +                btrfs_print_tree(root, leaf, block_follow);

Although not a bug of your patch, would you please fix the extent buffer
leak by adding a free_extent_buffer(buf)?

Thanks,
Qu

>                  goto close_root;
>          }

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:

> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits like metadata automatically switching
> from dup to raid1 when more than one device is used.
>
> In experimenting I ended up with some questions...
>
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?
> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?

1 and 2 together since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard for metadata in the single-device case, where it was made the default (except for SSDs, which default to single mode metadata on a single-device filesystem, because the FTL voids any guarantees on location anyway, and because firmware such as sandforce compresses and dedups anyway, in which case the hardware/firmware is subverting btrfs' efforts to do dup anyway). In the single-device case, two copies of data was considered simply not worth the cost, due both to doubling the size (especially on SSD where size is money!) and to the speed penalties on spinning rust due to seeks between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two different devices, was considered enough superior to make that the default, since that provided device-loss resiliency for the all-important metadata, thus enabling recovery of at least /some/ files even with a device missing (single-mode data where the file's extents all happened to be on available devices, plus of course raid1, etc, data). Further, on multi-device, dup-mode metadata was considered a mistake it was better not to even have available as an option, since loss of a single device would likely kill the filesystem, which made dup mode little better than single mode, which at least avoids the doubled size cost. Further, on spinning rust there'd again be the seek penalty, to little benefit since dup mode provides no guarantees in case of device loss.

So multi-device defaults to raid1 metadata for safety, but single mode metadata remains an option (along with raid0) if you really /don't/ care about losing everything due to loss of a single device. Single-device simply makes dup-mode available (and the default) for metadata, as a poor-man's substitute for the safety of raid1, but single-device-metadata is the only case where that poor-man's-raid1-substitute is worth the (considered extreme) cost, with usage of that option not even available on multi-device as it'd be a near-certain mistake, certainly at the mkfs level. And dup mode isn't ordinarily available for data even on single-device, because it's considered not worth the cost.

As for dup-mode working after device-add, that's simply a necessary bit in order for device add to work from a default-dup-mode single-device at all. And it's only the existing metadata chunks on the original device that will be dup-mode. Once a second device is added, additional metadata chunks will be written in raid1 mode, forcing the two chunk copies to different devices since there's multiple devices available to allow that. The clear intent and recommendation is to do a rebalance ASAP after a device add, to spread usage to the new device as appropriate. And of course that rebalance will use the new raid1 metadata defaults unless told otherwise, and I don't believe dup mode is available to tell it otherwise there, either.

What all that original reasoning fails to account for, however, is the btrfs data/metadata checksumming and integrity features and the very high (which the original btrfs mode designers obviously considered extreme) value some users (including me) place on them. While a multi-device dup-mode-metadata choice at mkfs is arguably still a mistake (the cost of raid1 metadata without the benefit, near the risk of single metadata but at double the size), dup-mode data combined with btrfs checksumming and data integrity features on a single device has strong data integrity benefits that some would definitely consider worth it, even at the additional cost in speed on spinning rust due to seeking, and in size on expensive SSDs.

Meanwhile, mixed-bg-mode was an after-thought, added much later (after my own btrfs journey began) in order to make working with small filesystems reasonable. Before mixed-bg-mode, people attempting to use btrfs on sub-GiB devices often found they couldn't use all available space (often 25-50% wasted!) as the separate data/metadata chunk allocation was simply too large grained to properly deal with the small sizes involved. And small filesystems really _was_ mixed-mode's _entire_ purpose. That it could additionally be used to allow dup-data, using the ability to specify mixed-bg-mode even on > 1 GiB filesystems where it
Re: Fixing Btrfs Filesystem Full Problems typo?
On 10 December 2014 at 23:28, Robert White wrote:
> On 12/10/2014 10:56 AM, Patrik Lundquist wrote:
>> On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Assuming no snapshots still contain the file, of course, and that the
>>> ext* saved subvolume has already been deleted.
>>
>> Got no snapshots or subvolumes. Keeping it simple for now.
>
> Does that mean that you have already manually removed the subvolume that was
> automatically created by btrfs-convert?

Yes.
Re: Fixing Btrfs Filesystem Full Problems typo?
Patrik Lundquist posted on Wed, 10 Dec 2014 21:11:52 +0100 as excerpted:

>> Patrik, assuming no btrfs snapshots yet, can you do a du --all
>> --block-size=1M | sort -n (or similar), then take a look at all results
>> over 1024 (1 GiB since the du specified 1 MiB blocks), and see if it's
>> reasonable to move all those files out of the filesystem and back?
>
> Good idea, but it's quite a lot of files. I'd rather start over.
>
> But I've identified 46 files from Btrfs errors in syslog and will try to
> move them to another disk. They're ranging from 41KiB to 6.6GiB in size.

There's one as yet incomplete piece of the puzzle. I guess the devs could probably answer this, but being a simple sysadmin, I don't claim to read code well and don't know...

That log snippet you quoted earlier gave block-group addresses. That's the chunks, in this case normally 1 GiB data chunks, but here we're dealing with a conversion from ext4 and apparently the extents are larger, nearly 2 GiB in this case according to that snippet.

That had me thinking the problem files were all > 1 GiB and had these super-extents that btrfs can't work with.

But you say you tracked down the file as I suggested using btrfs-inspect-internal, and the file is much smaller than that.

Now I don't even know for sure what that log snippet was from, a normal dmesg during an attempted balance, or dmesg with btrfs debug turned on in the kernel, or the userspace debug you ask about, or... And not being a dev and not having done anything like this level myself, I'm sort of feeling my way along here too, trying to figure things out as you report them.

So the missing piece I'm talking about is this. OK, we have the address of a nearly 2 GiB block group reported, and I recalled seeing in an earlier post that trick with btrfs-inspect-internal, so I thought to try it here. But with the file being so much smaller than the 2 GiB block group reported, something's not matching.

Either the file is somehow using an extent much much larger than it is (possible with fallocate, then writing a shorter file, I believe), or the referred to block group actually contains more than one file -- certainly btrfs data chunks can do so, but given that we're dealing with a conversion here, I don't know if the same rules apply, or...

Anyway, it's possible that smaller file is simply the first one in the block group, thus being the one that was mapped when you plugged that address into inspect-internal, and that the problem file is actually a much larger file located after it in the same block group.

So if moving the small files doesn't do the trick, try feeding inspect-internal with an address after that. Given that btrfs blocks are 4 KiB in size, round the size of the small file up to the next 4 KiB multiple and add that to the address originally obtained from the log, and see if inspect-internal points at a different, presumably much larger (> 1 GiB or at least big enough so it'd extend beyond a GiB beyond the original address), file, with the new offset address. If so, try moving /that/ file, and see if you have any better luck.

I was /hoping/ it would be the simple case and all the problem block-group addresses would point to > 1 GiB files and moving them would be it. But with a significant number of those addresses pointing at far smaller files, either I was wrong about the use of inspect-internal here and they're entirely unrelated, or the situation is otherwise rather more complex than I was hoping to be the case.

OTOH, if for whatever reason all those smaller files were fallocated to some huge size and then written smaller, or something similar happened such that they're using huge > 1 GiB extents even while being smaller than 1 GiB in size, that COULD go some distance to explaining why defrag missed them. If defrag is looking at filesize and the files happen to be small but in huge extents, and it's those extents causing the problem, then we just found our bug, and all that's left is figuring out how to fix it, which is where I step out and the devs step in.

With a bit of luck, that's it, and we're now well on the way to fixing a bug that could have otherwise triggered unexplained problems for some people doing conversions, but not others, for quite some time to come. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
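The address arithmetic suggested in the message above can be sketched in bash. This is only an illustration: the function name and the sample numbers are made up, and the 4 KiB block size is the assumption stated in the message. The resulting address would then be handed to btrfs-inspect-internal (for example its logical-resolve subcommand) in the same way as the original one.

```shell
#!/bin/bash
# Hypothetical helper: round a file size up to the next 4 KiB boundary
# and add it to the logical address taken from the kernel log.
BLOCK_SIZE=4096   # assumed btrfs block size, as stated in the message

next_probe_address() {
    local addr=$1 size=$2
    # Integer round-up of size to a multiple of BLOCK_SIZE.
    local rounded=$(( (size + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE ))
    echo $(( addr + rounded ))
}

# Made-up example: a 41 KiB file found at logical address 1048576.
next_probe_address 1048576 $(( 41 * 1024 ))   # prints 1093632
```

If the new address resolves to the same small file, the extent is larger than the file and the step can be repeated with a bigger offset.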
Re: Fixing Btrfs Filesystem Full Problems typo?
Robert White posted on Wed, 10 Dec 2014 14:28:10 -0800 as excerpted:

> On 12/10/2014 10:56 AM, Patrik Lundquist wrote:
>> On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Assuming no snapshots still contain the file, of course, and that the
>>> ext* saved subvolume has already been deleted.
>>
>> Got no snapshots or subvolumes. Keeping it simple for now.
>
> Does that mean that you have already manually removed the subvolume that
> was automatically created by btrfs-convert?

Yes, he had. Patrik correct me if I have this wrong, but filling in the history as I believe I have it...

If I'm keeping my cases straight, he had actually posted a thread some weeks ago with the initial problem, saying he had followed the conversion instructions to the letter -- conversion, delete-saved, defrag, balance, and ran into this problem with balance. The conclusion at that time was that he'd try successively larger balance -dusage=N figures, hoping to work thru it that way.

That original thread could well have been shortly before you appeared on the list, however, and you may not have seen it. Either that, or you saw it but didn't connect that case with this one.

Anyway, yes, assuming I haven't gotten my casefiles mixed up, and evidence so far is that I haven't, he did everything he was supposed to and still ended up with this issue. Obviously there's still a bug somewhere.

And now he's back. The incrementally increasing usage= balances reached 99%, but that last 1% is the sticking point and he, and the rest of us, are trying to figure out what happened and how to get him past it.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Fixing Btrfs Filesystem Full Problems typo?
Patrik Lundquist posted on Wed, 10 Dec 2014 21:11:52 +0100 as excerpted:

> Is btrfs-debug-tree -e useful in finding problematic files?

Since you were replying directly to me, my answer...

ENOTENOUGHINFO

I don't know enough about it to honestly say, as I've never used it myself and haven't seen anyone posting practical usage that I could make note of in case I or someone else needed it later.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH] Btrfs: get more accurate output in df command.
Dongsheng Yang posted on Wed, 10 Dec 2014 23:02:15 +0800 as excerpted:

>> And in the example, the mkfs was supplied with two devices, so there's
>> no dup metadata remaining from a formerly single-device filesystem,
>> either. (Tho there will be the small single-mode stubs, empty,
>> remaining from the mkfs process, as no balance has been run to delete
>> them yet, but those are much smaller and empty.)
>
> Yes. One question not related here: how about delete them in the end of
> mkfs?

GB covered the old, manual balance method. Do a btrfs balance -dusage=0 -musage=0 (or whatever, someone posted his recipe doing the same thing except with the single profiles instead of zero usage), and those stubs should disappear, as they're empty so there's nothing to rewrite when the balance does its thing and it simply removes them.

FWIW I actually have a mkfs helper script here that takes care of a bunch of site-default options such as dual-device raid1 both data/metadata, skinny-metadata, etc, and it actually prompts for a mountpoint (assuming it's already setup in fstab) and will do an immediate mount and balance usage=0 to eliminate the stubs if that mountpoint is filled in, again assuming it appears in fstab as well. Since I keep fully separate filesystems to avoid putting all my data eggs in the same not-yet-fully-stable btrfs basket, and my backup system includes periodically blowing away the backup and (after booting to the new backup) the working copy with a fresh mkfs for a clean start, the mkfs helper script is useful, and since I was already doing that, it was reasonably simple to extend it to handle the mount and stub-killing balance immediately after the mkfs.

But at least in theory, that old manual method shouldn't be necessary with a current (IIRC 3.18 required) kernel, since btrfs should now automatically detect empty chunks and automatically rebalance to remove them as necessary. However, I've been busy and haven't actually tried 3.18 yet, and thus obviously haven't done a mkfs and mount of a fresh filesystem to see how long it actually takes to trigger and remove those stubs, so for all I know it takes awhile to kick in, and if people are bothered by the display of the stubs before it does, they can of course still do it the old way.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
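For anyone who wants the old manual method spelled out, here is a minimal bash sketch, assuming the filesystem is already mounted. The mountpoint default and the RUN switch are made up for this example; by default the function only prints the balance command so that nothing is touched by accident.

```shell
#!/bin/bash
# Sketch: drop the empty chunks left behind by mkfs with a usage=0 balance.
# MOUNTPOINT and RUN are hypothetical knobs for this example only.

stub_cleanup() {
    local mnt=$1
    # Only chunks with 0% usage match the filters, so nothing is rewritten.
    local cmd=(btrfs balance start -dusage=0 -musage=0 "$mnt")
    if [ "${RUN:-0}" = 1 ]; then
        "${cmd[@]}"          # really run the balance
    else
        echo "${cmd[*]}"     # dry run: just show what would be executed
    fi
}

stub_cleanup "${MOUNTPOINT:-/mnt/newfs}"
```

Running it with RUN=1 and MOUNTPOINT set to the real mountpoint performs the balance; on kernels that already clean up empty chunks on their own it should simply find nothing to do.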
Re: out of space warning?
-------- Original Message --------
Subject: Re: out of space warning?
From: Robert White
To: sys.syphus,
Date: 2014-12-11 09:29

> On 12/10/2014 02:54 PM, sys.syphus wrote:
>> I would like to avoid running out of space. is there a way to know
>> that I am getting close? i'd like to make a script that runs as part of
>> my bash prompt and lets me know when i am getting close.
>>
>> i know there are several ways you can run out of space and I'd like to
>> avoid all of them.
>
> Don't do that either. 8-)
>
> (1) you'll grow to hate it.
>
> (2) You know when you are doing things that take a lot of storage. Your
> instinct for system fullness is already part of your brain-meat.
>
> (3) The system isn't going to explode if it runs out of disk space. (old
> UNIX systems used to halt with system errors because running out of
> space prevented pipelines from being created, but that's ancient history).
>
> (4) The only _real_ way to run out of space is to be a data hoarder, and
> no script in the world is going to help you if that's the case. Ha Ha

(5) Possible known/unknown kernel bug may cause strange ENOSPC during balance/replace/scrub... :-)

> You don't check your car's gas tank every time you put your foot on the
> brake, you don't want to check your free space every time your system
> finishes every tiny command you type.
>
> Scripts like this are possible in bash, but consider that every "ls" or
> even just enter you type would be followed by a "df" and a "grep" or
> whatever in whatever window you are using at the time. etc.
>
> IF you think you are going to run out of space, and you are using _any_
> kind of window system, then start a system manager display for a while
> until you get the feel for how not out of space you really are.
>
> Nothing gets ignored faster than a text element that essentially never
> changes, and once you get in the habit of ignoring the text you won't
> notice when it actually has something to say.
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
On Thu, Dec 11, 2014 at 10:05:20AM +0800, Qu Wenruo wrote:
> -------- Original Message --------
> Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
> From: Zygo Blaxell
> To: Qu Wenruo
> Date: 2014-12-11 05:57
> >On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> >>The main memory usage in btrfsck is extent record, which
> >>we can't free them until we read them all in and checked, so even we
> >>mmap/unmap, it can only help with
> >>the extent_buffer(which is already freed if not used according to refs).
> >I'm thinking aloud here, but is it *really* necessary to read everything
> >into memory?
> Totally agreed to only read what we need.
> But some backref and counts on refs can only be determined after a
> full scan, especially for leaf/node corruption
> case.

It might be faster (and smaller) to pipe them out to sort (with gzip/lzma compression on temporary files) than to try to insert them in a tree. I have used that technique in some of my deduplicating programs. It can cut the working set size by several orders of magnitude (trading it for an O(n log n) sort, which will mostly read and write sequentially).

e.g. duplicate refs will all sort together, so when you are sequentially reading the sorted data and the current key value changes, you know you've seen everything that could be a duplicate, and can discard everything in RAM.
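The sort-then-scan idea can be illustrated with a small bash pipeline. This is a sketch of the general technique only, not code from btrfsck or from any deduplicating program mentioned above: the record format ("key" followed by arbitrary payload) and the function name are assumptions made for the example. The sort does the heavy lifting on disk, and the awk pass keeps just the current key and a counter in memory.

```shell
#!/bin/bash
# Sketch: report keys that occur more than once, using an external sort
# followed by a single sequential pass over the sorted records.

count_duplicate_refs() {
    # stdin: one record per line, "<key> <payload...>"
    LC_ALL=C sort -k1,1 |
    awk '
        # Key changed: everything for the previous key has been seen.
        NR > 1 && ($1 "") != cur { if (n > 1) print cur, n; n = 0 }
        { cur = $1 ""; n++ }
        END { if (n > 1) print cur, n }
    '
}

# Made-up sample data: byte numbers as keys, root names as payload.
printf '%s\n' "4096 rootA" "8192 rootB" "4096 rootC" \
              "12288 rootA" "8192 rootD" "4096 rootE" | count_duplicate_refs
```

With GNU sort, the temporary files used for large inputs can be compressed by adding an option such as --compress-program=gzip, which matches the compression suggestion in the message.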
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Zygo Blaxell
To: Qu Wenruo
Date: 2014-12-11 05:57

> On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
>> The main memory usage in btrfsck is extent record, which
>> we can't free them until we read them all in and checked, so even we
>> mmap/unmap, it can only help with
>> the extent_buffer(which is already freed if not used according to refs).
> I'm thinking aloud here, but is it *really* necessary to read everything
> into memory?

Totally agreed to only read what we need. But some backrefs and counts on refs can only be determined after a full scan, especially for the leaf/node corruption case.

> Maybe a multiple-pass algorithm might be possible, e.g. one to find
> free space by eliminating any areas that are occupied by extents, then
> other passes to rebuild the metadata in the free space. Or, one pass
> to verify the connectivity of references and collect dangling refs,
> then a second pass which fixes only the dangling refs.

I have a similar idea, though not a multi-pass method: instead, a per-sector scan plus tree search for other data.

E.g. in the extent tree check, each time only record all extents in one block group, and check them. After the check, remove the good extents/block groups and then move to the next block group.

For the fs tree, treat all keys with the same objectid (ino) as a group, read only one group at a time, and remove the records already known to be healthy. (Records whose info is not fully gathered, or bad records, will still stay in memory.)

But I don't think this method can really save much memory though...

> Usually sequential reads are significantly faster than swapping--even
> if swapping on solid-state media. It could be that reading 260GB of
> metadata sequentially two or three times is still faster than thrashing
> through random lookups in 20GB of swap on a 4GB machine.

Definitely, but if we want to reduce memory usage, it is almost unavoidable to do more disk IO, especially random disk IO, so it will become a tradeoff, which may make the already slow fsck even slower.

Thanks,
Qu
Re: out of space warning?
On 12/10/2014 02:54 PM, sys.syphus wrote:
> I would like to avoid running out of space. is there a way to know
> that I am getting close? i'd like to make a script that runs as part of
> my bash prompt and lets me know when i am getting close.
>
> i know there are several ways you can run out of space and I'd like to
> avoid all of them.

Don't do that either. 8-)

(1) you'll grow to hate it.

(2) You know when you are doing things that take a lot of storage. Your instinct for system fullness is already part of your brain-meat.

(3) The system isn't going to explode if it runs out of disk space. (old UNIX systems used to halt with system errors because running out of space prevented pipelines from being created, but that's ancient history).

(4) The only _real_ way to run out of space is to be a data hoarder, and no script in the world is going to help you if that's the case. Ha Ha

You don't check your car's gas tank every time you put your foot on the brake, you don't want to check your free space every time your system finishes every tiny command you type.

Scripts like this are possible in bash, but consider that every "ls" or even just enter you type would be followed by a "df" and a "grep" or whatever in whatever window you are using at the time. etc.

IF you think you are going to run out of space, and you are using _any_ kind of window system, then start a system manager display for a while until you get the feel for how not out of space you really are.

Nothing gets ignored faster than a text element that essentially never changes, and once you get in the habit of ignoring the text you won't notice when it actually has something to say.
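For reference, the kind of check under discussion would look roughly like the bash sketch below. The function names and the 90% threshold are made up for the example, and it only sees what plain df reports, which is an assumption about what "close to full" means. Hooked into the prompt, it would cost exactly the extra df and awk per command described above, so it is probably better run by hand or from a periodic job.

```shell
#!/bin/bash
# Sketch: warn when a mount point is above a usage threshold.
THRESHOLD=${THRESHOLD:-90}   # hypothetical default, in percent

usage_percent() {
    # Reads `df -P` output on stdin; prints the Capacity column of the
    # first data row without the trailing "%".
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

check_space() {
    local mnt=$1 pct
    pct=$(df -P "$mnt" 2>/dev/null | usage_percent)
    # Bail out if df gave nothing usable (empty or non-numeric value).
    case $pct in
        ''|*[!0-9]*) return 1 ;;
    esac
    if [ "$pct" -ge "$THRESHOLD" ]; then
        echo "warning: $mnt is ${pct}% full"
    fi
}

check_space / || echo "could not read usage for /"
```

The parsing is kept in its own function so it can be tried on canned df output without touching a real filesystem.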
Re: Balance & scrub & defrag
On 12/10/2014 02:15 PM, sys.syphus wrote:
> I am working on a script that i can run daily that will do maintenance
> on my btrfs mountpoints. is there any reason not to concurrently do all
> of the above? possibly including discards as well.
>
> also, is there anything existing currently that will do maintenance on
> btrfs so i don't have to reinvent the wheel?
>
> #!/bin/bash
> btrfs filesystem defragment -r -v /media/btrfs/ &
> btrfs scrub start /media/btrfs/ &
> btrfs balance start /media/btrfs/ &
> watch -d -n 30 "btrfs balance status /media/btrfs/; btrfs scrub status /media/btrfs/"

I'd recommend doing "none of the above" on a daily basis.

One of the goals of the filesystem design is to remove the need for any of these operations on any regular basis. You are just going to bog down your system and increase your heat and wear profiles for no good reason.

Those tools should be used if you notice something fishy like recent decreases in efficiency or errors in your log files.

A _monthly_ scrub is maybe worth scheduling if you have a lot of churn in your disk contents.

Defragging should be done after significant content additions/changes (like replacing a lot of files via package management) and limited to the directories most likely changed.

Balancing is almost never necessary and can be anti-helpful if a file experiences random updates in batches (because the nicely packed file may end up far, far away from the active data extent where its COW events are taking place).

Resist the urge to tinker with production systems. The exposure (rewriting stable data is just the chance to destabilize your data, balancing your drive can take two files that always change together and put them far away from one another, etc) is not worth the nearly non-existent chance of benefit.

Once the system is "good" just leave it that way until you notice something "not good" coming on the horizon.

If you feel you _must_ do these tasks then doing them all at once, where possible, will just make both tasks take longer. If you are transcribing a file over on one side of the disk to defrag it, and you are transcribing an extent on the other side of the disk to balance it, you are just bouncing your disk heads back-and-forth and wasting wall-clock time.

So yeah, it's not windows, it doesn't need the defrag hammer.

Trying to over-manage the system will prevent it from seeking its dynamic (and so predictable) equilibrium.
Re: Possibility to have a "transient" snapshot?
On 12/10/2014 11:52 AM, James West wrote: I was just looking into using overlayfs, and although it has some promise, I think it's biggest drawback is the upperdir will have to be some sort of storage backed filesystem. From my limited understanding of tmpfs, it's not supposed to be the greatest with many large files (and my system in particular would be downloading many large movies/videos, and doing any kind of os update to test it would involve many changes all over the volume, which could be problematic to commit to a golden state.) I could partition the main drive in 2 parts, and dynamically zero-out then create the volume in the second partition on each boot, but I'm still saving no drive writes, and not really extending the life of the hardware (which is one of my premises.) You are over-thinking the "transient" part way too much. If the underlying device is not an SSD then most of your wear is immaterial. And if it is an SSD, you wear is still pretty damn immaterial. The "full weight" snapshots are plenty "transient" if you delete them between uses and they don't do any recursive copying so they are almost wear-free. [If you want to wear out a hard disk, park it's heads a lot. (The My Book series of WD external enclosures had a _horrible_ default of parking the heads after every eight seconds of idle time. Ouch.)] A normal hard disk's runtime (provided its not a lemon) is shorter than its mean write wear time anyway. So the best thing to do in your case is to customize your initramfs to do what you need and then "hide" your stuff from normal use. Consider this (untested but) hiper-simple init script... (assumes busybox in the initramfs providing mount and a few of the other basic tools and the btrfs command). 
--- snip --- #!/bin/bash mkdir /dev mount -t devtmpfs none /dev mkdir /scratch mount -t btrfs /dev/sda1 /scratch if [ -d /scratch/active ]; then btrfs subvol del /scratch/active fi btrfs subvol snap /scratch/__Master /scratch/active mkdir /root mount -o subvol=/active /dev/sda1 /root umount /scratch rmdir /scratch umount /dev busybox switch_root /root /sbin/init "$@" --- snip --- Every time you boot it makes a fresh snapshot of the /__Master subvolume of /dev/sda1 into /active and mounts that as root then treats that as the root of the file system. Estimated human-scale time to run this script is in the one-second-or-less range. None of the files in /__Master are then visible to the running system, so they won't be subject to search via find or locate etc. Problem solved. When you want to do maintenance you can log into you box and do mount /dev/sda1 /mnt at which point /__Master is visible as /mnt/__Master. You can do your backup snapshots and your maintenance via the /mnt view without purturbing your running system. chroot /mnt/__Master /bin/bash That gives you the "native view" of your master system in that shell. From that shell all your package tools will work just fine etc. You can prep new or covariant system is snapshots parallel to /__Master and use rename to select the __Master for the next reboot. Even better, since snapshots of snapshots are not degenerate in any way at all, you can create multiple system roots as /Whatever and /OtherThing (and so on) and always do your maintenance there. Then before any reboot you can snap /mnt/Whatever into /mnt/__Master (using the same technique as for /active) and then reboot. On that reboot the new /__Master will be the master for the new /active. All of the snapshot activities are almost instant (except for the cleanup of the previous /active if it's full of a lot of changes, but that will happen in the background so you don't have to care much about that time). 
ASIDE: I keep pointing people at it, but I do a lot of experimental boot behaviors while testing hardware and such for my job, and my baseline initramfs builder at http://underdog.sourceforge.net is easy to customize and plenty stable. It already pulls in the btrfs command and friends, and you can take control of the boot-up to do experimental tweaks by adding "bash" to the kernel boot args for an individual boot.
Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and
Original Message
Subject: Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and
From: David Sterba
To: Qu Wenruo
Date: 2014-12-10 20:37

> On Tue, Dec 09, 2014 at 04:27:19PM +0800, Qu Wenruo wrote:
>> The patchset introduces two new repair functions and some helpers to achieve a huge goal: repair a btrfs whose fs tree's non-root leaf/node is corrupted when no duplicate is valid.
>>
>> The two new repair functions are:
>>
>> repair_inode_nlinks():
>> Repair any inode nlink related problem, from fixing the nlink number and related inode_ref/dir_index/dir_item to recovering the file name and file type and salvaging them into the lost+found dir. This not only fixes a case that some users reported but also cooperates with the repair_inode_no_item() function to salvage heavily damaged inodes into the lost+found dir.
>>
>> repair_inode_no_item():
>> Repair the case of a missing inode_item, which is quite common when an fs tree leaf/node is missing. This only does the inode item rebuild. Later recovery, like moving it to the lost+found dir, is done by repair_inode_nlinks().
>>
>> The main helper is the repair_btree() function, which drops the corrupted non-root leaf/node and rebalances the tree to keep the btree correct.
>
> Sounds a bit intrusive, but under the circumstances I don't see anything better to do.

A better non-destructive but less generic method may be introduced later. My dream is to inspect each key and its item to rebuild each member, but it would take a long, long time to implement.

>> With this patchset, even if a non-root leaf/node is corrupted and no duplicate survived, btrfsck can still repair it to a mountable status. (And normal rw should also be OK.) The remaining unfixable problems will be inode nbytes error with file extent discounts error, which may be fixed in the next patchset.
>>
>> Cc David:
>> Sorry for the huge change in the patchset and for merging the old inode nlink repair with the new inode item rebuild patchset.
>
> No problem, the incremental changelogs helped a lot.
>
>> When developing the inode item rebuild patchset, I found that the old nlink repair cooperated very badly with the item rebuild and that there was some duplicated code between the two patchsets, not to mention the math lib introduced by the nlink repair patch. So I decided to somewhat rebase the nlink repair patchset to provide better generality.
>
> Great, the patchset looks good for merge, I'm adding it to 3.18. From now on please send only incremental changes and not the whole patchset. Thanks.

Thanks, this should be the last large update patchset. Later work will focus on file extent recovery and should not interfere with this patch.

Thanks.
Qu
out of space warning?
I would like to avoid running out of space. Is there a way to know that I am getting close? I'd like to make a script that runs as part of my bash prompt and lets me know when I am getting close. I know there are several ways you can run out of space and I'd like to avoid all of them.
Re: Fixing Btrfs Filesystem Full Problems typo?
On 12/10/2014 10:56 AM, Patrik Lundquist wrote:
> On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>> Assuming no snapshots still contain the file, of course, and that the ext* saved subvolume has already been deleted.
>
> Got no snapshots or subvolumes. Keeping it simple for now.

Does that mean that you have already manually removed the subvolume that was automatically created by btrfs-convert?
mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]
So I started looking at the mkfs.btrfs manual page with an eye towards documenting some of the tidbits, like metadata automatically switching from dup to raid1 when more than one device is used.

In experimenting I ended up with some questions...

(1) Why is the dup profile for data restricted to only one device, and only if it's mixed mode?

Gust t # mkfs.btrfs -f /dev/loop{0..1} -d dup
Error: unable to create FS with data profile 16 (have 2 devices)
Gust t # mkfs.btrfs -f /dev/loop0 -d dup
Error: dup for data is allowed only in mixed mode

(2) Why is the metadata dup profile restricted to only one device on creation when it will run that way just fine after a device add?

Gust t # mkfs.btrfs -f /dev/loop{0..1} -m dup
Error: unable to create FS with metadata profile 32 (have 2 devices)

(3) Why can I make a raid5 out of two devices? (I understand that we are currently just making mirrors, but the standard requires three devices in the geometry etc. So I would expect a two-device RAID5 to be considered degraded with all that entails. It just looks like it's asking for trouble to allow this: once the support is finalized, a working RAID5 that's really a mirror would suddenly become something that can only be mounted with the degraded flag.)

Gust t # mkfs.btrfs -f /dev/loop{0..1} -d raid5 -m raid5
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Turning ON incompat feature 'raid56': raid56 extended format
Performing full device TRIM (2.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
	nodesize 16384 leafsize 16384 sectorsize 4096 size 4.00GiB

(4) Same question for raid6, but with three drives instead of the mandated four.

(5) If I can make a RAID5 or RAID6 device with one missing element, why can't I make a RAID1 out of one drive, e.g. with one missing element?

(6) If I make a RAID1 out of three devices, are there three copies of every extent, or are there always two copies that are semi-randomly spread across the three devices? (Likewise for more than three.)

---

It seems to me (very dangerous words in computer science, I know) that we need a "failed" device designator so that a device can be in the geometry (e.g. have a device ID) but not actually exist. Reads/writes to the failed device would always be treated as error returns.

The failed device would be subject to replacement with "btrfs dev replace", and could be the source of said replacement to drop a problematic device out of an array.

EXAMPLE:

Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Processing explicitly missing device
adding device (failed) id 2 (phantom device)

mount /dev/loop0 /mountpoint
btrfs replace start 2 /dev/loop1 /mountpoint
(and so on)

Being able to "replace" a faulty device with a phantom "failed" device would nicely disambiguate the whole device add/remove versus replace mistake. It would make the degraded status less mysterious.

A filesystem with an explicitly failed element would also make the future roll-out of full RAID5/6 less confusing.
Balance & scrub & defrag
I am working on a script that I can run daily that will do maintenance on my btrfs mountpoints. Is there any reason not to do all of the above concurrently, possibly including discards as well?

Also, is there anything existing that will do maintenance on btrfs so I don't have to reinvent the wheel?

#!/bin/bash
btrfs filesystem defragment -r -v /media/btrfs/ &
btrfs scrub start /media/btrfs/ &
btrfs balance start /media/btrfs/ &
watch -d -n 30 "btrfs balance status /media/btrfs/; btrfs scrub status /media/btrfs/"
Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> The main memory usage in btrfsck is the extent records, which we can't free until we have read them all in and checked them, so even if we mmap/unmap, it can only help with the extent_buffer (which is already freed if not used, according to refs).

I'm thinking aloud here, but is it *really* necessary to read everything into memory? Maybe a multiple-pass algorithm might be possible, e.g. one pass to find free space by eliminating any areas that are occupied by extents, then other passes to rebuild the metadata in the free space. Or, one pass to verify the connectivity of references and collect dangling refs, then a second pass which fixes only the dangling refs.

Usually sequential reads are significantly faster than swapping, even if swapping on solid-state media. It could be that reading 260GB of metadata sequentially two or three times is still faster than thrashing through random lookups in 20GB of swap on a 4GB machine.
[PATCH 10/18] btrfs restore: hide "offset is X" messages
From: Martin Wilck

Almost everyone who cares about her data will run btrfs restore with -v. The "offset is" messages displayed will irritate users because they reveal only btrfs internals. Users will think that "offset" refers to a file offset and suspect severe corruption.

Limit these messages to verbose > 1.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 10bb8be..f1c63ed 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -315,7 +315,7 @@ static int copy_one_extent(struct btrfs_root *root, int fd,
 	if (compress == BTRFS_COMPRESS_NONE)
 		bytenr += offset;
 
-	if (verbose && offset)
+	if (verbose > 1 && offset)
 		printf("offset is %Lu\n", offset);
 	/* we found a hole */
 	if (disk_size == 0)
-- 
1.7.3.4
[PATCH 03/18] btrfs-progs: btrfs-debug-tree: fix usage message
From: Martin Wilck

Adapt usage message to the additional options introduced.

Signed-off-by: Martin Wilck
---
 btrfs-debug-tree.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 7cdc368..4c1e835 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -31,22 +31,23 @@
 static int print_usage(void)
 {
-	fprintf(stderr, "usage: btrfs-debug-tree [-e] [-d] [-r] [-R] [-u]\n");
-	fprintf(stderr, "[-b block_num ] device\n");
+	fprintf(stderr, "usage: btrfs-debug-tree [-e] [-d] [-r] [-R] [-u] [-B]\n");
+	fprintf(stderr, "[-t tree_id] device\n");
+	fprintf(stderr, " btrfs-debug-tree [-b block_num [-f]] device\n");
 	fprintf(stderr, "\t-e : print detailed extents info\n");
 	fprintf(stderr, "\t-d : print info of btrfs device and root tree dirs"
 		" only\n");
 	fprintf(stderr, "\t-r : print info of roots only\n");
 	fprintf(stderr, "\t-R : print info of roots and root backups\n");
 	fprintf(stderr, "\t-u : print info of uuid tree only\n");
-	fprintf(stderr, "\t-b block_num : print info of the specified block"
-		" only\n");
-	fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
-		" block\n");
 	fprintf(stderr,
 		"\t-t tree_id : print only the tree with the given id\n");
 	fprintf(stderr, "\t-B nr: use root backup instead of real root\n");
+	fprintf(stderr, "\t-b block_num : print info of the specified block"
+		" only\n");
+	fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
+		" block\n");
 	fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
 	exit(1);
 }
-- 
1.7.3.4
[PATCH 13/18] btrfs restore: improve user-asking logic for files with many extents
From: Martin Wilck

The logic to ask after 1024 extents is broken. It unnecessarily confuses users if big files are being restored, making them think something is going wrong.

Change it to two cases:
 1) no or little progress restoring,
 2) writing beyond the file size.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 80081b8..8ae3337 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -661,7 +661,7 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 #define MAYBE_NL (verbose && (next_pos >> display_shift) ? "\n" : "")
 	const u64 display_shift = 16;
 	struct stat st;
-
+	int dont_ask = 0;
 	path = btrfs_alloc_path();
 	if (!path) {
 		fprintf(stderr, "Ran out of memory\n");
@@ -697,9 +697,21 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 	}
 
 	while (1) {
-		if (loops >= 0 && loops++ >= 1024) {
+		int problem = 0;
+		if (st.st_size == _INVALID_SIZE && next_pos > st.st_size) {
+			fprintf(stderr, "%swriting at offset %llu beyond size "
+				"of file (%llu)\n",
+				MAYBE_NL, next_pos, st.st_size);
+			problem = 1;
+		}
+		if ((++loops % 1024) == 0 && (next_pos / loops < 4096)) {
+			fprintf(stderr, "%smany loops (%d) and little progress "
+				"(%llu bytes)\n",
+				MAYBE_NL, loops, next_pos);
+			problem = 1;
+		}
+		if (problem && !dont_ask && loops++) {
 			enum loop_response resp;
-
 			resp = ask_to_continue(file);
 			if (resp == LOOP_STOP)
 				break;
-- 
1.7.3.4
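The "little progress" case can be read in isolation as a small predicate. The following is only a sketch of that one condition, not code from the patch: the helper name little_progress() is made up for illustration, and the loop counter is passed in already incremented, whereas the patch increments it inside the condition.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of cmds-restore.c.
 * Returns 1 when, at every 1024th loop iteration, the average progress
 * per iteration is below 4096 bytes; this mirrors the condition
 *     (++loops % 1024) == 0 && (next_pos / loops < 4096)
 * with `loops` taken as the value after the increment.
 */
static int little_progress(unsigned int loops, uint64_t next_pos)
{
	if (loops == 0 || loops % 1024 != 0)
		return 0;	/* only evaluated every 1024 iterations */
	return next_pos / loops < 4096;
}

/* A few worked cases for the predicate above. */
static void little_progress_examples(void)
{
	/* exactly 4096 bytes per loop on average: not flagged */
	assert(little_progress(1024, 1024ULL * 4096) == 0);
	/* one byte less: average rounds down to 4095, flagged */
	assert(little_progress(1024, 1024ULL * 4096 - 1) == 1);
	/* not a multiple of 1024: never flagged */
	assert(little_progress(1000, 0) == 0);
}
```

Under these assumptions a file made of many extents is only flagged when the extents are on average smaller than one 4 KiB block each, which is what distinguishes a large healthy file from a restore that is going in circles.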
[PATCH 11/18] btrfs restore: print progress marks for big files
From: Martin Wilck

print a '+' for every 64k restored. This gives people more confidence in long-running restore processes.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |    8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index f1c63ed..004c82e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -658,6 +658,8 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 	int loops = 0;
 	u64 bytes_written, next_pos = 0ULL;
 	u64 total_written = 0ULL;
+#define MAYBE_NL (verbose && (next_pos >> display_shift) ? "\n" : "")
+	const u64 display_shift = 16;
 	struct stat st;
 
 	path = btrfs_alloc_path();
@@ -751,6 +753,10 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 			printf("Weird extent type %d\n", extent_type);
 		}
 		total_written += bytes_written;
+		if (verbose &&
+		    ((next_pos + bytes_written) >> display_shift) >
+		    (next_pos >> display_shift))
+			fprintf(stderr, "+");
 		next_pos = found_key.offset + bytes_written;
 		if (ret) {
 			fprintf(stderr, "ERROR after writing %llu bytes\n",
@@ -764,6 +770,8 @@ next:
 
 set_size:
 	btrfs_free_path(path);
+
+	printf(MAYBE_NL);
 	if (get_xattrs) {
 		ret = set_file_xattrs(root, key->objectid, fd, file);
 		if (ret)
-- 
1.7.3.4
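The boundary test in the second hunk compares the 64 KiB "bucket" of the position before and after a write. A minimal standalone sketch of that comparison, assuming positions only ever move forward; the names marks_crossed() and should_print_mark() are hypothetical and do not appear in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* 64 KiB granularity, matching display_shift = 16 in the patch. */
#define DISPLAY_SHIFT 16u

/*
 * Hypothetical helper: number of 64 KiB boundaries crossed when the
 * write position advances from old_pos to new_pos.
 */
static uint64_t marks_crossed(uint64_t old_pos, uint64_t new_pos)
{
	assert(new_pos >= old_pos);	/* sketch assumes forward progress */
	return (new_pos >> DISPLAY_SHIFT) - (old_pos >> DISPLAY_SHIFT);
}

/* Same truth value as the condition in the patch: any boundary crossed? */
static int should_print_mark(uint64_t old_pos, uint64_t new_pos)
{
	return marks_crossed(old_pos, new_pos) > 0;
}
```

Read this way, the condition as posted yields at most one '+' per extent written, so a single 1 MiB extent would produce one mark rather than sixteen; a loop over marks_crossed() would be one way to get one mark per 64 KiB if that is the intent.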
[PATCH 12/18] btrfs restore: check progress of file restoration
From: Martin Wilck

extents should be ordered by file offset. Expect no overlaps, and report holes.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |    8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 004c82e..80081b8 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -739,6 +739,14 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 			ret = -1;
 			goto set_size;
 		}
+		if (found_key.offset < next_pos) {
+			fprintf(stderr, "extent overlap, %llu < %llu\n",
+				found_key.offset, next_pos);
+			ret = -1;
+			goto set_size;
+		} else if (found_key.offset > next_pos)
+			fprintf(stderr, "hole at %llu (%llu bytes)\n",
+				next_pos, found_key.offset - next_pos);
 
 		bytes_written = 0ULL;
 		if (extent_type == BTRFS_FILE_EXTENT_PREALLOC)
-- 
1.7.3.4
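The three-way comparison between the next extent's file offset and the position reached so far can be sketched as a small classifier. This is an illustration only; the enum and the function classify_extent() are assumed names, not part of btrfs-progs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type for the sketch. */
enum extent_gap {
	EXTENT_CONTIGUOUS,	/* extent starts exactly where the last one ended */
	EXTENT_HOLE,		/* extent starts later: unwritten gap in between */
	EXTENT_OVERLAP		/* extent starts before the current position */
};

/*
 * Classify an extent starting at extent_offset against next_pos, the
 * file position expected after the previous extent. For a hole, the
 * gap length is stored in *hole_len; otherwise *hole_len is set to 0.
 */
static enum extent_gap classify_extent(uint64_t extent_offset,
				       uint64_t next_pos,
				       uint64_t *hole_len)
{
	assert(hole_len != NULL);
	*hole_len = 0;
	if (extent_offset < next_pos)
		return EXTENT_OVERLAP;
	if (extent_offset > next_pos) {
		*hole_len = extent_offset - next_pos;
		return EXTENT_HOLE;
	}
	return EXTENT_CONTIGUOUS;
}
```

In the patch, the overlap case aborts the file with ret = -1, while a hole is only reported and the restore continues.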
[PATCH 09/18] btrfs restore: more graceful error handling in copy_file
From: Martin Wilck

Setting size and attributes of a file makes sense even if some errors have occurred during recovery. Also, do something useful with the number of bytes written.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |   27 ++-
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 8ecd896..10bb8be 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -715,7 +715,7 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 				return ret;
 			} else if (ret) {
 				/* No more leaves to search */
-				btrfs_free_path(path);
+				ret = 0;
 				goto set_size;
 			}
 			leaf = path->nodes[0];
@@ -734,35 +734,36 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 		if (compression >= BTRFS_COMPRESS_LAST) {
 			fprintf(stderr, "Don't support compression yet %d\n",
 				compression);
-			btrfs_free_path(path);
-			return -1;
+			ret = -1;
+			goto set_size;
 		}
 
+		bytes_written = 0ULL;
 		if (extent_type == BTRFS_FILE_EXTENT_PREALLOC)
 			goto next;
 		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
 			ret = copy_one_inline(fd, path, found_key.offset,
 					      &bytes_written);
-			if (ret) {
-				btrfs_free_path(path);
-				return -1;
-			}
 		} else if (extent_type == BTRFS_FILE_EXTENT_REG) {
 			ret = copy_one_extent(root, fd, leaf, fi,
 					      found_key.offset, &bytes_written);
-			if (ret) {
-				btrfs_free_path(path);
-				return ret;
-			}
 		} else {
 			printf("Weird extent type %d\n", extent_type);
 		}
+		total_written += bytes_written;
+		next_pos = found_key.offset + bytes_written;
+		if (ret) {
+			fprintf(stderr, "ERROR after writing %llu bytes\n",
+				total_written);
+			ret = -1;
+			goto set_size;
+		}
 next:
 		path->slots[0]++;
 	}
 
-	btrfs_free_path(path);
 set_size:
+	btrfs_free_path(path);
 	if (get_xattrs) {
 		ret = set_file_xattrs(root, key->objectid, fd, file);
 		if (ret)
@@ -771,7 +772,7 @@ set_size:
 	}
 
 	set_fd_attrs(fd, &st, file);
-	return 0;
+	return ret;
 }
 
 static int search_dir(struct btrfs_root *root, struct btrfs_key *key,
-- 
1.7.3.4
[PATCH 14/18] btrfs restore: report mismatch in file size
From: Martin Wilck

A mismatch between the file size stored in the inode and the number of bytes restored may indicate a problem. restore reads data in 4k chunks, so it's normal that files are truncated. Only emit the warning in unusual cases.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |    6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 8ae3337..3c4dc7a 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -799,6 +799,12 @@ set_size:
 				file);
 	}
 
+	if (st.st_size != _INVALID_SIZE && (st.st_size > next_pos ||
+					    (st.st_size < next_pos &&
+					     (st.st_size >> 12) !=
+					     (next_pos >> 12) - 1)))
+		fprintf(stderr, "size mismatch: extpected %llu, got %llu (written %llu)\n",
+			st.st_size, next_pos, total_written);
 	set_fd_attrs(fd, &st, file);
 	return ret;
 }
-- 
1.7.3.4
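The warning condition is dense, so here is a sketch that spells it out as a predicate with a few worked values. It is not the patch's code: size_mismatch() is a made-up name, both values are treated as unsigned 64-bit (in the patch st.st_size is an off_t), and SIZE_UNKNOWN is a stand-in for _INVALID_SIZE, whose real definition is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SHIFT	12u		/* 4 KiB blocks, as in the patch */
#define SIZE_UNKNOWN	UINT64_MAX	/* assumed stand-in for _INVALID_SIZE */

/*
 * Hypothetical helper mirroring the posted condition:
 * warn if the inode size is larger than what was restored, or if it is
 * smaller and the restored end is not in the block right after the one
 * containing the inode size.
 */
static int size_mismatch(uint64_t inode_size, uint64_t next_pos)
{
	if (inode_size == SIZE_UNKNOWN)
		return 0;	/* nothing to compare against */
	if (inode_size > next_pos)
		return 1;	/* restored less than the inode claims */
	if (inode_size < next_pos &&
	    (inode_size >> BLOCK_SHIFT) != (next_pos >> BLOCK_SHIFT) - 1)
		return 1;	/* overshoot not explained by 4 KiB rounding */
	return 0;
}

/* Worked cases for the predicate above. */
static void size_mismatch_examples(void)
{
	assert(size_mismatch(5000, 8192) == 0);	 /* rounded up to next 4 KiB */
	assert(size_mismatch(5000, 4096) == 1);	 /* short restore */
	assert(size_mismatch(4096, 4096) == 0);	 /* exact match */
	assert(size_mismatch(5000, 12288) == 1); /* one block too many */
}
```

One property of the expression as written, at least in this unsigned reading: when next_pos is below 4096 and larger than the inode size, (next_pos >> 12) - 1 wraps around, so the case is reported as a mismatch.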
[PATCH 17/18] btrfs-progs: ctree.c: make bin_search non-static
From: Martin Wilck

I need it in btrfs-search-metadata

Signed-off-by: Martin Wilck
---
 ctree.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ctree.c b/ctree.c
index 23399e2..1137312 100644
--- a/ctree.c
+++ b/ctree.c
@@ -602,8 +602,8 @@ static int generic_bin_search(struct extent_buffer *eb, unsigned long p,
  * simple bin_search frontend that does the right thing for
  * leaves vs nodes
  */
-static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
-		      int level, int *slot)
+int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
+	       int level, int *slot)
 {
 	if (level == 0)
 		return generic_bin_search(eb,
-- 
1.7.3.4
[PATCH 18/18] btrfs-progs: documentation for btrfs-raw and btrfs-search-metadata
From: Martin Wilck

Update documentation for btrfs-debug-tree, and add pages for btrfs-search-metadata and btrfs-raw.

Signed-off-by: Martin Wilck
---
 Documentation/Makefile                  |    2 +
 Documentation/btrfs-debug-tree.txt      |   10 +
 Documentation/btrfs-raw.txt             |   54 +
 Documentation/btrfs-search-metadata.txt |   57 +++
 4 files changed, 123 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/btrfs-raw.txt
 create mode 100644 Documentation/btrfs-search-metadata.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ef4f1bd..354c412 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -5,6 +5,8 @@ MAN8_TXT =
 MAN8_TXT += btrfs.txt
 MAN8_TXT += btrfs-convert.txt
 MAN8_TXT += btrfs-debug-tree.txt
+MAN8_TXT += btrfs-raw.txt
+MAN8_TXT += btrfs-search-metadata.txt
 MAN8_TXT += btrfs-find-root.txt
 MAN8_TXT += btrfs-image.txt
 MAN8_TXT += btrfs-map-logical.txt
diff --git a/Documentation/btrfs-debug-tree.txt b/Documentation/btrfs-debug-tree.txt
index 23fc115..69a547d 100644
--- a/Documentation/btrfs-debug-tree.txt
+++ b/Documentation/btrfs-debug-tree.txt
@@ -25,8 +25,18 @@ Print detailed extents info.
 Print info of btrfs device and root tree dirs only.
 -r::
 Print info of roots only.
+-u::
+Print info of UUID tree only.
+-R::
+Print info of roots and root backups.
+-t ::
+Only print the subvolume tree with given object ID.
+-B ::
+Start at backup root from superblock rather than current root.
 -b ::
 Print info of the specified block only.
+-f::
+Follow (descend) the (sub)tree rooted at the block given with -b.
 
 EXIT STATUS
 ---
diff --git a/Documentation/btrfs-raw.txt b/Documentation/btrfs-raw.txt
new file mode 100644
index 000..ae7bd2d
--- /dev/null
+++ b/Documentation/btrfs-raw.txt
@@ -0,0 +1,54 @@
+btrfs-raw(8)
+
+
+NAME
+
+btrfs-raw - low-level manipulation of btrfs meta data blocks.
+
+SYNOPSIS
+
+*btrfs-raw* [[-r|-w] ]
+
+DESCRIPTION
+---
+*btrfs-raw* is used to dump the raw contents of the given logical block
+of a btrfs device to stdout, or write raw data read from stdin to the
+given logical block.
+
+*THIS TOOL IS DANGEROUS; IT MAY CORRUPT YOUR FILESYSTEM BEYOND REPAIR!!*
+
+Please exert extreme caution when writing modified blocks to disk. You
+should make a backup copy of the entire file system before doing so, even
+if the file system is already corrupted. Make sure you have a thorough
+understanding of the btrfs disk data structures before making any changes.
+
+*YOU USE THIS TOOL AT YOUR OWN RISK!*
+
+OPTIONS
+---
+-r ::
+Read the logical block starting at and write the raw contents
+to stdout (caution, binary data).
+-w ::
+Write the logical block starting at to disk, reading data from
+stdin. The tool will adjust the header checksum before writing to disk.
+
+EXIT STATUS
+---
+*btrfs-raw* will return 0 if no error happened.
+If any problems happened, 1 will be returned.
+
+EXAMPLE
+---
+
+`btrfs-raw -r 874991616 /dev/sda >/tmp/blob`
+
+`btrfs-raw -w 874991616 /dev/sda
[...]
+
+DESCRIPTION
+---
+*btrfs-search-metadata* is used to dump the meta data of a device, or to
+selectively dump nodes or leaves matching certain conditions.
+
+Unlike `btrfs-dump-tree`, this tool will also find tree "branches" that
+are disconnected from the root tree, and previous meta data copies. If a
+corruption occurs, this may be useful for finding old, still healthy copies.
+
+This is maybe useful for analyzing filesystem state or inconsistence and has
+a positive educational effect on understanding the internal structure.
+ is the device file where the filesystem is stored.
+
+OPTIONS
+---
+-k //::
+Search for leaves and nodes containing the given key.
+-g ::
+Search for leaves and nodes with the given generation (transid).
+-l ::
+Search for nodes with the given level, or leaves (level 0).
+-t ::
+Search for leaves and nodes with the given owner object ID.
+-L::
+dump full content of found nodes or leaves, like btrfs-debug-tree.
+
+EXIT STATUS
+---
+*btrfs-search-metadata* will return 0 if no error happened.
+If any problems happened, 1 will be returned.
+
+EXAMPLE
+---
+
+`btrfs-search-metadata -t 260 -l 0 -k 256/1/0 /dev/sda`
+
+Search the btrfs file system on `/dev/sda` for leaves belonging to
+subvolume 260 and containing the first inode item (type 1: inode item,
+object ID 256: first available object ID).
+
+See `ctree.h` in the btrfs source code and the btrfs Wiki for
+assigned object IDs and key types.
+
+SEE ALSO
+
+`mkfs.btrfs`(8), `btrfs-debug-tree`(8)
-- 
1.7.3.4
[PATCH 08/18] btrfs restore: track number of bytes restored
From: Martin Wilck

Track the number of bytes read from extents and restored. This is useful for detecting errors and corruptions.

Signed-off-by: Martin Wilck
---
 cmds-restore.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index f9dab7e..8ecd896 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -222,7 +222,8 @@ again:
 	return 0;
 }
 
-static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
+static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos,
+			   u64 *bytes_written)
 {
 	struct extent_buffer *leaf = path->nodes[0];
 	struct btrfs_file_extent_item *fi;
@@ -246,6 +247,7 @@ static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
 	compress = btrfs_file_extent_compression(leaf, fi);
 	if (compress == BTRFS_COMPRESS_NONE) {
 		done = pwrite(fd, buf, len, pos);
+		*bytes_written += done;
 		if (done < len) {
 			fprintf(stderr, "Short inline write, wanted %d, did "
 				"%zd: %d\n", len, done, errno);
@@ -269,6 +271,7 @@ static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
 
 	done = pwrite(fd, outbuf, ram_size, pos);
 	free(outbuf);
+	*bytes_written += done;
 	if (done < ram_size) {
 		fprintf(stderr, "Short compressed inline write, wanted %Lu, "
 			"did %zd: %d\n", ram_size, done, errno);
@@ -280,7 +283,8 @@ static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
 
 static int copy_one_extent(struct btrfs_root *root, int fd,
 			   struct extent_buffer *leaf,
-			   struct btrfs_file_extent_item *fi, u64 pos)
+			   struct btrfs_file_extent_item *fi, u64 pos,
+			   u64 *bytes_written)
 {
 	struct btrfs_multi_bio *multi = NULL;
 	struct btrfs_device *device;
@@ -410,6 +414,7 @@ again:
 		total += done;
 	}
 out:
+	*bytes_written += total;
 	free(inbuf);
 	free(outbuf);
 	return ret;
@@ -651,6 +656,8 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 	int extent_type;
 	int compression;
 	int loops = 0;
+	u64 bytes_written, next_pos = 0ULL;
+	u64 total_written = 0ULL;
 	struct stat st;
 
 	path = btrfs_alloc_path();
@@ -734,14 +741,15 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 		if (extent_type == BTRFS_FILE_EXTENT_PREALLOC)
 			goto next;
 		if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
-			ret = copy_one_inline(fd, path, found_key.offset);
+			ret = copy_one_inline(fd, path, found_key.offset,
+					      &bytes_written);
 			if (ret) {
 				btrfs_free_path(path);
 				return -1;
 			}
 		} else if (extent_type == BTRFS_FILE_EXTENT_REG) {
 			ret = copy_one_extent(root, fd, leaf, fi,
-					      found_key.offset);
+					      found_key.offset, &bytes_written);
 			if (ret) {
 				btrfs_free_path(path);
 				return ret;
-- 
1.7.3.4
[PATCH 16/18] btrfs-progs: NEW: brtfs-search-metadata
From: Martin Wilck

A new tool for dumping all meta data (also unlinked nodes and leaves) and searching nodes or leaves with certain properties.

Signed-off-by: Martin Wilck
---
 Makefile                |    2 +-
 btrfs-search-metadata.c |  224 +++
 2 files changed, 225 insertions(+), 1 deletions(-)
 create mode 100644 btrfs-search-metadata.c

diff --git a/Makefile b/Makefile
index fe65867..c670f67 100644
--- a/Makefile
+++ b/Makefile
@@ -48,7 +48,7 @@ MAKEOPTS = --no-print-directory Q=$(Q)
 
 progs = mkfs.btrfs btrfs-debug-tree btrfs-raw btrfsck \
 	btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \
-	btrfs-find-root btrfstune btrfs-show-super
+	btrfs-find-root btrfstune btrfs-show-super btrfs-search-metadata
 
 progs_extra = btrfs-corrupt-block btrfs-fragments btrfs-calc-size \
 	btrfs-select-super
diff --git a/btrfs-search-metadata.c b/btrfs-search-metadata.c
new file mode 100644
index 000..80dc326
--- /dev/null
+++ b/btrfs-search-metadata.c
@@ -0,0 +1,224 @@
+/*
+ * Copyright (C) 2007 Oracle. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include
+#include
+#include
+#include
+#include "kerncompat.h"
+#include "radix-tree.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "print-tree.h"
+#include "version.h"
+#include "utils.h"
+#include "volumes.h"
+
+static int print_usage(void)
+{
+	fprintf(stderr, "usage: btrfs-search-metadata [options] device\n");
+	fprintf(stderr, "\t-k //: search for given key\n");
+	fprintf(stderr, "\t-g : search for given generation (transid)\n");
+	fprintf(stderr, "\t-t : search for given tree\n");
+	fprintf(stderr, "\t-l : search for node level (0=leaf)\n");
+	fprintf(stderr, "\t-L: print full listing of matching leaf/node contents\n");
+	fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
+	exit(1);
+}
+
+int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
+	       int level, int *slot);
+
+static int do_one_block(struct btrfs_root *root, u64 block_nr, u64 tree_id,
+			u64 gen_id, int level, struct btrfs_key *key, int brief)
+{
+	struct extent_buffer *leaf;
+	int ret;
+	int slot = -1;
+	struct btrfs_disk_key disk_key;
+
+	leaf = read_tree_block(root,
+			       block_nr,
+			       root->leafsize, 0);
+
+	if (leaf && btrfs_header_level(leaf) != 0) {
+		free_extent_buffer(leaf);
+		leaf = NULL;
+	}
+
+	if (!leaf) {
+		leaf = read_tree_block(root,
+				       block_nr,
+				       root->nodesize, 0);
+	}
+	if (!leaf) {
+		fprintf(stderr, "failed to read %llu\n",
+			(unsigned long long)block_nr);
+		return -1;
+	}
+
+	ret = btrfs_is_leaf(leaf);
+	if (tree_id != 0 && tree_id != btrfs_header_owner(leaf))
+		goto out;
+	if (gen_id != 0 && gen_id != btrfs_header_generation(leaf))
+		goto out;
+	if (level != -1 && level != (int)btrfs_header_level(leaf))
+		goto out;
+
+	if (key && key->type != 0ULL) {
+		if (bin_search(leaf, key, btrfs_header_level(leaf), &slot))
+			goto out;
+	}
+
+	if (brief)
+		printf("%s %llu level %u items %d free %lu generation %llu owner %llu\n",
+		       (ret ? "leaf" : "node"),
+		       (unsigned long long)btrfs_header_bytenr(leaf),
+		       btrfs_header_level(leaf),
+		       btrfs_header_nritems(leaf),
+		       (ret ? btrfs_leaf_free_space(root, leaf) :
+			(unsigned long)BTRFS_NODEPTRS_PER_BLOCK(root) -
+			btrfs_header_nritems(leaf)),
+		       (u64)btrfs_header_generation(leaf),
+		       (u64)btrfs_header_owner(leaf));
+	else
+		btrfs_print_tree(root, leaf, 0);
+
+	if (key->objectid != 0ULL) {
+		btrfs_cpu_key_to_disk(&disk_key, key);
+		printf("\t");
+		btrfs_print_key(&disk_key);
+		printf(" found @ slot %d in %s %llu\n", slot,
+		       (ret
[PATCH 15/18] btrfs-progs: NEW: btrfs-raw
From: Martin Wilck This program can be used to dump a meta data block, fix it e.g. using a hex editor, and write it back to disk, adapting the check sum. Signed-off-by: Martin Wilck --- Makefile|2 +- btrfs-raw.c | 143 +++ 2 files changed, 144 insertions(+), 1 deletions(-) create mode 100644 btrfs-raw.c diff --git a/Makefile b/Makefile index 4cae30c..fe65867 100644 --- a/Makefile +++ b/Makefile @@ -46,7 +46,7 @@ endif MAKEOPTS = --no-print-directory Q=$(Q) -progs = mkfs.btrfs btrfs-debug-tree btrfsck \ +progs = mkfs.btrfs btrfs-debug-tree btrfs-raw btrfsck \ btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \ btrfs-find-root btrfstune btrfs-show-super diff --git a/btrfs-raw.c b/btrfs-raw.c new file mode 100644 index 000..1dfeed9 --- /dev/null +++ b/btrfs-raw.c @@ -0,0 +1,143 @@ +/* + * Copyright (C) 2007 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. 
+ */ + +#include +#include +#include +#include "kerncompat.h" +#include "radix-tree.h" +#include "ctree.h" +#include "utils.h" +#include "disk-io.h" + +static int print_usage(void) +{ + fprintf(stderr, "usage: btrfs-raw [ -r block|-w block] device\n"); + exit(1); +} + +static int read_block(struct btrfs_root *root, u64 block_nr, + struct extent_buffer **eb) +{ + struct extent_buffer *leaf; + leaf = read_tree_block(root, + block_nr, + root->leafsize, 0); + + if (leaf && btrfs_header_level(leaf) != 0) { + free_extent_buffer(leaf); + leaf = NULL; + } + + if (!leaf) { + leaf = read_tree_block(root, + block_nr, + root->nodesize, 0); + } + if (!leaf) { + fprintf(stderr, "failed to read %llu\n", + (unsigned long long)block_nr); + return -1; + } + + *eb = leaf; + return btrfs_is_leaf(leaf) ? root->leafsize : root->nodesize; +} + +int main(int ac, char **av) +{ + struct btrfs_root *root; + struct btrfs_fs_info *info; + struct extent_buffer *eb = NULL; + struct btrfs_trans_handle *trans = NULL; + u64 block = ~0ULL; + int len; + enum btrfs_open_ctree_flags flags = OPEN_CTREE_PARTIAL; + radix_tree_init(); + + while(1) { + int c; + c = getopt(ac, av, "r:w:"); + if (c < 0) + break; + switch(c) { + case 'r': + block = arg_strtou64(optarg); + break; + case 'w': + flags |= OPEN_CTREE_WRITES; + block = arg_strtou64(optarg); + break; + default: + print_usage(); + } + } + set_argv0(av); + ac = ac - optind; + if (check_argc_exact(ac, 1) || block == ~0ULL) + print_usage(); + + info = open_ctree_fs_info(av[optind], 0, 0, flags); + if (!info) { + fprintf(stderr, "unable to open %s\n", av[optind]); + exit(1); + } + + root = info->fs_root; + if (!root) { + fprintf(stderr, "unable to open %s\n", av[optind]); + exit(1); + } + + len = read_block(root, block, &eb); + if (eb->len != len) { + fprintf(stderr, "length mismatch: %u %d\n", eb->len, len); + return 1; + } + + if (flags & OPEN_CTREE_WRITES) { + char buf[4]; + int ret; + fprintf(stderr, "*** THIS MAY CORRUPT YOUR FILE SYSTEM ***\n"); + 
fprintf(stderr, "*** Do you want to write logical block %llu " + "on device %s ?\n", block, av[optind]); + fprintf(stderr, "*** Type upper case \"yes\" to continue: "); + memset(buf, 0, 4); + ret = read(fileno(stderr), buf, 3); + if (strcmp(buf, "YES")) { + fprintf(stderr, "*** Aborted.\n"); +
[PATCH 06/18] btrfs restore: set uid/gid/mode/times
From: Martin Wilck current btrfs restore will discard file attributes. This patch sets them regular files and directories, as found in the meta data. Signed-off-by: Martin Wilck --- cmds-restore.c | 116 --- 1 files changed, 101 insertions(+), 15 deletions(-) diff --git a/cmds-restore.c b/cmds-restore.c index 2f9b72d..5aa2167 100644 --- a/cmds-restore.c +++ b/cmds-restore.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #include @@ -549,6 +550,95 @@ out: return ret; } +#define _INVALID_SIZE ((off_t)~0ULL) +static int stat_from_inode(struct stat *st, struct btrfs_root *root, + struct btrfs_key *key) +{ + static struct btrfs_path *path; + struct btrfs_inode_item *inode_item; + struct btrfs_timespec *ts; + struct extent_buffer *eb; + + if (!path) + path = btrfs_alloc_path(); + + memset(st, 0, sizeof(*st)); + st->st_size = _INVALID_SIZE; + + if (!path) { + fprintf(stderr, "Ran out of memory\n"); + return -ENOMEM; + } + + if (btrfs_lookup_inode(NULL, root, path, key, 0)) { + btrfs_release_path(path); + return -ENOENT; + } + + inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0], + struct btrfs_inode_item); + eb = path->nodes[0]; + + st->st_size = btrfs_inode_size(eb, inode_item); + st->st_uid = btrfs_inode_uid(eb, inode_item); + st->st_gid = btrfs_inode_gid(eb, inode_item); + st->st_mode = btrfs_inode_mode(eb, inode_item); + + ts = btrfs_inode_atime(eb, inode_item); + st->st_atim.tv_sec = ts->sec; + st->st_atim.tv_nsec = ts->nsec; + + ts = btrfs_inode_mtime(eb, inode_item); + st->st_mtim.tv_sec = ts->sec; + st->st_mtim.tv_nsec = ts->nsec; + + ts = btrfs_inode_ctime(eb, inode_item); + st->st_ctim.tv_sec = ts->sec; + st->st_ctim.tv_nsec = ts->nsec; + + btrfs_release_path(path); + return 0; +} + +static void set_fd_attrs(int fd, const struct stat *st, const char *file) +{ + struct timeval tv[2]; + if (st->st_size == _INVALID_SIZE) + return; + + tv[0].tv_sec = st->st_atim.tv_sec; + tv[0].tv_usec = st->st_atim.tv_nsec/1000; + tv[1].tv_sec = 
st->st_mtim.tv_sec; + tv[1].tv_usec = st->st_mtim.tv_nsec/1000; + if (S_ISREG(st->st_mode) && ftruncate(fd, st->st_size) == -1) + fprintf(stderr, "failed to set file size on %s\n", + file); + if (fchown(fd, st->st_uid, st->st_gid) == -1) + fprintf(stderr, "failed to set uid/gid on %s\n", + file); + if (fchmod(fd, st->st_mode) == -1) + fprintf(stderr, "failed to set permissions on %s\n", + file); + if (futimes(fd, tv) == -1) + fprintf(stderr, "failed to set file times on %s\n", + file); +} + +static int set_file_attrs(const char *output_rootdir, const char *file, + const struct stat *st) +{ + int fd; + static char path[4096]; + snprintf(path, sizeof(path), "%s%s", output_rootdir, file); + fd = open(path, O_RDONLY|O_NOATIME); + if (fd == -1) { + fprintf(stderr, "failed to open %s\n", path_name); + return -1; + } + set_fd_attrs(fd, st, path); + close(fd); + return 0; +} static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key, const char *file) @@ -556,13 +646,12 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key, struct extent_buffer *leaf; struct btrfs_path *path; struct btrfs_file_extent_item *fi; - struct btrfs_inode_item *inode_item; struct btrfs_key found_key; int ret; int extent_type; int compression; int loops = 0; - u64 found_size = 0; + struct stat st; path = btrfs_alloc_path(); if (!path) { @@ -570,13 +659,7 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key, return -ENOMEM; } - ret = btrfs_lookup_inode(NULL, root, path, key, 0); - if (ret == 0) { - inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0], - struct btrfs_inode_item); - found_size = btrfs_inode_size(path->nodes[0], inode_item); - } - btrfs_release_path(path); + stat_from_inode(&st, root, key); key->offset = 0; key->type = BTRFS_EXTENT_DATA_KEY; @@ -672,16 +755,14 @@ next: btrfs_free_path(path); set_size: - if (found_size) { - ret = ftruncate(fd, (loff_t)found_size); - if (ret) - return ret; -
[PATCH 00/18] Patch series related to my btrfs recovery
From: Martin Wilck

This patch series contains all changes I made to the btrfs tools in the
course of analyzing and repairing the corruption I described in my other
mail to linux-btrfs titled "A story of btrfs corruption and recovery".

The bottom line of this patch set is:
 1) have the tools continue with error messages instead of aborting in
    certain error cases; and
 2) look for meta data outside the current trees.
Both are useful if the tree is internally corrupted in the way I
described.

I have also added support for extracting inode meta data (times,
permissions) in "btrfs restore"; this was also useful for my recovery
case.

Please review and apply what you find useful.

Martin Wilck (18):
  btrfs-progs: btrfs-debug-tree: add option -f for "block only"
  btrfs-progs: btrfs-debug-tree: add option -B (backup root)
  btrfs-progs: btrfs-debug-tree: fix usage message
  btrfs-progs: btrfs-debug-tree: handle corruption more gracefully
  btrfs-progs: ctree.h: fix btrfs_inode_[amc]time
  btrfs restore: set uid/gid/mode/times
  btrfs restore: better output readability
  btrfs restore: track number of bytes restored
  btrfs restore: more graceful error handling in copy_file
  btrfs restore: hide "offset is X" messages
  btrfs restore: print progress marks for big files
  btrfs restore: check progress of file restoration
  btrfs restore: improve user-asking logic for files with many extents
  btrfs restore: report mismatch in file size
  btrfs-progs: NEW: btrfs-raw
  btrfs-progs: NEW: btrfs-search-metadata
  btrfs-progs: ctree.c: make bin_search non-static
  btrfs-progs: documentation for btrfs-raw and btrfs-search-metadata

 Documentation/Makefile                  |   2 +
 Documentation/btrfs-debug-tree.txt      |  10 ++
 Documentation/btrfs-raw.txt             |  54
 Documentation/btrfs-search-metadata.txt |  57
 Makefile                                |   4 +-
 btrfs-debug-tree.c                      |  78 +-
 btrfs-raw.c                             | 143
 btrfs-search-metadata.c                 | 224 +++
 cmds-restore.c                          | 205 +++-
 ctree.c                                 |   4 +-
 ctree.h                                 |  15 ++-
 print-tree.c                            |  22 +++-
 12 files changed, 752 insertions(+), 66 deletions(-)
 create mode 100644 Documentation/btrfs-raw.txt
 create mode 100644 Documentation/btrfs-search-metadata.txt
 create mode 100644 btrfs-raw.c
 create mode 100644 btrfs-search-metadata.c

-- 
1.7.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/18] btrfs-progs: btrfs-debug-tree: handle corruption more gracefully
From: Martin Wilck

This patch fixes the same thing in two different places. First, the
first of the two BUG() tests is just a special case of the second one
and can therefore be omitted. Second, instead of bailing out with
BUG(), just print a reasonable error message and check the next child.

Signed-off-by: Martin Wilck
---
 btrfs-debug-tree.c | 22 --
 print-tree.c       | 22 --
 2 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 4c1e835..d7c1155 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -73,13 +73,23 @@ static void print_extents(struct btrfs_root *root, struct extent_buffer *eb)
 					     btrfs_node_blockptr(eb, i),
 					     size,
 					     btrfs_node_ptr_generation(eb, i));
-			if (btrfs_is_leaf(next) &&
-			    btrfs_header_level(eb) != 1)
-				BUG();
 			if (btrfs_header_level(next) !=
-				btrfs_header_level(eb) - 1)
-				BUG();
-			print_extents(root, next);
+				btrfs_header_level(eb) - 1) {
+				fprintf(stderr, "EXTENT TREE CORRUPTION detected at %llu, "
+					"slot %d pointing at %llu.\n"
+					"\tExpected child level: %d, found %d\n"
+					"\tExpected tree/transid: %llu/%llu,"
+					" found %llu/%llu\n",
+					eb->start, i, next->start,
+					btrfs_header_level(eb) - 1,
+					btrfs_header_level(next),
+					(unsigned long long)btrfs_header_owner(eb),
+					(unsigned long long)btrfs_header_generation(eb),
+					(unsigned long long)btrfs_header_owner(next),
+					(unsigned long long)
+					btrfs_header_generation(next));
+			} else
+				print_extents(root, next);
 			free_extent_buffer(next);
 		}
 	}
diff --git a/print-tree.c b/print-tree.c
index 70a7acc..6769e20 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -1066,13 +1066,23 @@ void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *eb, int fol
 				(unsigned long long)btrfs_header_owner(eb));
 			continue;
 		}
-		if (btrfs_is_leaf(next) &&
-		    btrfs_header_level(eb) != 1)
-			BUG();
 		if (btrfs_header_level(next) !=
-			btrfs_header_level(eb) - 1)
-			BUG();
-		btrfs_print_tree(root, next, 1);
+			btrfs_header_level(eb) - 1) {
+			fprintf(stderr, "TREE CORRUPTION detected at %llu, "
+				"slot %d pointing at %llu.\n"
+				"\tExpected child level: %d, found %d\n"
+				"\tExpected tree/transid: %llu/%llu,"
+				" found %llu/%llu\n",
+				eb->start, i, next->start,
+				btrfs_header_level(eb) - 1,
+				btrfs_header_level(next),
+				(unsigned long long)btrfs_header_owner(eb),
+				(unsigned long long)btrfs_header_generation(eb),
+				(unsigned long long)btrfs_header_owner(next),
+				(unsigned long long)
+				btrfs_header_generation(next));
+		} else
+			btrfs_print_tree(root, next, 1);
 		free_extent_buffer(next);
 	}
 }
-- 
1.7.3.4
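For readers who want to try the idea outside btrfs-progs: the check in
this patch comes down to "a child must sit exactly one level below its
parent; if not, report and move on to the next slot". Below is a
minimal standalone sketch of that rule. The struct blk_hdr type and the
child_level_ok() helper are hypothetical stand-ins invented for this
illustration and are not part of btrfs-progs.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-in for a tree block header. */
struct blk_hdr {
	unsigned long long bytenr;
	unsigned long long owner;
	unsigned long long generation;
	int level;
};

/*
 * Return 1 if the child sits exactly one level below its parent.
 * Otherwise print a diagnostic and return 0, so that the caller can
 * skip this child and continue with the next slot instead of aborting.
 */
static int child_level_ok(const struct blk_hdr *parent,
			  const struct blk_hdr *child, int slot)
{
	if (child->level == parent->level - 1)
		return 1;

	fprintf(stderr, "corruption at %llu, slot %d pointing at %llu: "
		"expected child level %d, found %d\n",
		parent->bytenr, slot, child->bytenr,
		parent->level - 1, child->level);
	return 0;
}
```

A caller walking a node would descend only into children for which
child_level_ok() returns 1, which mirrors the "print and check the next
child" behaviour described in the commit message.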
[PATCH 05/18] btrfs-progs: ctree.h: fix btrfs_inode_[amc]time
From: Martin Wilck

Make btrfs_inode_[amc]time work like the other btrfs_inode_xxx
functions. The current definition appears broken to me; it never returns
a valid pointer unless an extent buffer address is added.

Signed-off-by: Martin Wilck
---
 ctree.h | 15 +--
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/ctree.h b/ctree.h
index 89036de..1d5a5fc 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1414,27 +1414,30 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item,
 			 flags, 64);
 
 static inline struct btrfs_timespec *
-btrfs_inode_atime(struct btrfs_inode_item *inode_item)
+btrfs_inode_atime(struct extent_buffer *eb,
+		  struct btrfs_inode_item *inode_item)
 {
 	unsigned long ptr = (unsigned long)inode_item;
 	ptr += offsetof(struct btrfs_inode_item, atime);
-	return (struct btrfs_timespec *)ptr;
+	return (struct btrfs_timespec *)(ptr + eb->data);
 }
 
 static inline struct btrfs_timespec *
-btrfs_inode_mtime(struct btrfs_inode_item *inode_item)
+btrfs_inode_mtime(struct extent_buffer *eb,
+		  struct btrfs_inode_item *inode_item)
 {
 	unsigned long ptr = (unsigned long)inode_item;
 	ptr += offsetof(struct btrfs_inode_item, mtime);
-	return (struct btrfs_timespec *)ptr;
+	return (struct btrfs_timespec *)(ptr + eb->data);
 }
 
 static inline struct btrfs_timespec *
-btrfs_inode_ctime(struct btrfs_inode_item *inode_item)
+btrfs_inode_ctime(struct extent_buffer *eb,
+		  struct btrfs_inode_item *inode_item)
 {
 	unsigned long ptr = (unsigned long)inode_item;
 	ptr += offsetof(struct btrfs_inode_item, ctime);
-	return (struct btrfs_timespec *)ptr;
+	return (struct btrfs_timespec *)(ptr + eb->data);
 }
 
 static inline struct btrfs_timespec *
-- 
1.7.3.4
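To illustrate why the buffer address has to be added: the "item
pointer" handed to these helpers is really a byte offset into the
buffer's data area, stored in a pointer-typed variable. The sketch
below shows that pattern in isolation. All types and names in it
(struct ts, struct inode_item, struct buf, item_atime) are simplified,
hypothetical stand-ins for this illustration, not the btrfs-progs
definitions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the btrfs-progs structures. */
struct ts {
	unsigned long long sec;
	unsigned int nsec;
};

struct inode_item {
	unsigned long long size;
	struct ts atime;
};

struct buf {
	_Alignas(8) char data[256];	/* raw block contents */
};

/*
 * "item" is not a usable address: it holds the byte offset of the item
 * inside b->data. Only after the buffer base has been added does the
 * result point at memory that may be dereferenced.
 */
static struct ts *item_atime(struct buf *b, struct inode_item *item)
{
	unsigned long off = (unsigned long)item;

	off += offsetof(struct inode_item, atime);
	return (struct ts *)(b->data + off);
}
```

Without the "b->data +" term the function would return the bare offset
as if it were an address, which is the behaviour the commit message
calls broken.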
[PATCH 07/18] btrfs restore: better output readability
From: Martin Wilck

Don't print whole path for files, which will mangle output for long
path names. Rather distinguish between directories and files.

Signed-off-by: Martin Wilck
---
 cmds-restore.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 5aa2167..f9dab7e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -908,7 +908,7 @@ static int search_dir(struct btrfs_root *root, struct btrfs_key *key,
 				ret = 0;
 			}
 			if (verbose)
-				printf("Restoring %s\n", path_name);
+				printf("Restoring %s\n", filename);
 			if (dry_run)
 				goto next;
 			fd = open(path_name, O_CREAT|O_WRONLY, 0644);
@@ -982,7 +982,7 @@ static int search_dir(struct btrfs_root *root, struct btrfs_key *key,
 			}
 
 			if (verbose)
-				printf("Restoring %s\n", path_name);
+				printf("Searching directory %s\n", path_name);
 
 			errno = 0;
 			if (dry_run)
-- 
1.7.3.4
[PATCH 02/18] btrfs-progs: btrfs-debug-tree: add option -B (backup root)
From: Martin Wilck

Option -B causes btrfs-debug-tree to dump the tree rooted at the backup
root number given instead of the real root.

Signed-off-by: Martin Wilck
---
 btrfs-debug-tree.c | 39 ++-
 1 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index e61c71c..7cdc368 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -45,6 +45,8 @@ static int print_usage(void)
 		    " block\n");
 	fprintf(stderr,
 		"\t-t tree_id : print only the tree with the given id\n");
+	fprintf(stderr,
+		"\t-B nr: use root backup instead of real root\n");
 	fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
 	exit(1);
 }
@@ -140,6 +142,7 @@ int main(int ac, char **av)
 	int root_backups = 0;
 	u64 block_only = 0;
 	int block_follow = 0;
+	int use_backup = -1;
 	struct btrfs_root *tree_root_scan;
 	u64 tree_id = 0;
 
@@ -147,7 +150,7 @@ int main(int ac, char **av)
 
 	while(1) {
 		int c;
-		c = getopt(ac, av, "defb:rRut:");
+		c = getopt(ac, av, "defb:rRut:B:");
 		if (c < 0)
 			break;
 		switch(c) {
@@ -176,6 +179,9 @@ int main(int ac, char **av)
 		case 't':
 			tree_id = arg_strtou64(optarg);
 			break;
+		case 'B':
+			use_backup = arg_strtou64(optarg);
+			break;
 		default:
 			print_usage();
 		}
@@ -221,6 +227,37 @@ int main(int ac, char **av)
 		goto close_root;
 	}
 
+	if (use_backup >= BTRFS_NUM_BACKUP_ROOTS) {
+		fprintf(stderr, "Invalid backup root number %d\n",
+			use_backup);
+		exit(1);
+	} else if (use_backup >= 0) {
+		u64 bytenr, generation;
+		u32 blocksize;
+		struct btrfs_super_block *sb = info->super_copy;
+		struct btrfs_root_backup *backup = sb->super_roots + use_backup;
+		struct extent_buffer *eb;
+		bytenr = btrfs_backup_tree_root(backup);
+		generation = btrfs_backup_tree_root_gen(backup);
+		blocksize = btrfs_level_size(info->tree_root,
+					     btrfs_super_root_level(sb));
+		eb = info->tree_root->node;
+		info->tree_root->node = read_tree_block(root, bytenr,
+							blocksize, generation);
+		free_extent_buffer(eb);
+		bytenr = btrfs_backup_chunk_root(backup);
+		generation = btrfs_backup_chunk_root_gen(backup);
+		eb = info->chunk_root->node;
+		info->chunk_root->node = read_tree_block(root, bytenr,
+							 blocksize, generation);
+		free_extent_buffer(eb);
+		if (!extent_buffer_uptodate(info->tree_root->node) ||
+		    !extent_buffer_uptodate(info->chunk_root->node)) {
+			fprintf(stderr, "Couldn't read backup root\n");
+			return 1;
+		}
+	}
+
 	if (!(extent_only || uuid_tree_only || tree_id)) {
 		if (roots_only) {
 			printf("root tree: %llu level %d\n",
-- 
1.7.3.4
[PATCH 01/18] btrfs-progs: btrfs-debug-tree: add option -f for "block only"
From: Martin Wilck

btrfs-debug-tree prints only the given block. It is sometimes useful to
be able to print the subtree under this block. This patch enables this
behavior with the option "-f".

Signed-off-by: Martin Wilck
---
 btrfs-debug-tree.c | 10 --
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index e46500d..e61c71c 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -41,6 +41,8 @@ static int print_usage(void)
 	fprintf(stderr, "\t-u : print info of uuid tree only\n");
 	fprintf(stderr, "\t-b block_num : print info of the specified block"
 		    " only\n");
+	fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
+		    " block\n");
 	fprintf(stderr,
 		"\t-t tree_id : print only the tree with the given id\n");
 	fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
@@ -137,6 +139,7 @@ int main(int ac, char **av)
 	int roots_only = 0;
 	int root_backups = 0;
 	u64 block_only = 0;
+	int block_follow = 0;
 	struct btrfs_root *tree_root_scan;
 	u64 tree_id = 0;
 
@@ -144,7 +147,7 @@ int main(int ac, char **av)
 
 	while(1) {
 		int c;
-		c = getopt(ac, av, "deb:rRut:");
+		c = getopt(ac, av, "defb:rRut:");
 		if (c < 0)
 			break;
 		switch(c) {
@@ -167,6 +170,9 @@ int main(int ac, char **av)
 		case 'b':
 			block_only = arg_strtou64(optarg);
 			break;
+		case 'f':
+			block_follow = 1;
+			break;
 		case 't':
 			tree_id = arg_strtou64(optarg);
 			break;
@@ -211,7 +217,7 @@ int main(int ac, char **av)
 				(unsigned long long)block_only);
 			goto close_root;
 		}
-		btrfs_print_tree(root, leaf, 0);
+		btrfs_print_tree(root, leaf, block_follow);
 		goto close_root;
 	}
 
-- 
1.7.3.4
Re: [PATCH] Btrfs: get more accurate output in fd command.
On 12/10/2014 09:36 PM, Robert White wrote:
[...]
> I tested it and sure enough, it's RAID1...
>
> I also noticed that the default for data goes from single to RAID0 in
> a two slice build.
>
> I generally don't expect defaults to change in undocumented ways.
> Particularly since that makes make-plus-add orthogonal to
> make-as-multi.
>
> Without other guidance I'd been assuming that
>
> mkfs.btrfs d1 d2 d3 ...
> --vs--
> mkfs.btrfs d1
> btrfs dev add d2
> btrfs dev add d3 ...
>
> would net the same resultant system. I have only ever done the latter
> until today.
>
> Does/Will the defaults change when three, four, or more slices are
> used to build the system?
>
> I'll take a stab at updating the manual page.

Why not have mkfs.btrfs print the RAID profiles it used?

>
> -- Rob.

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
A story of btrfs corruption and recovery
In April 2014, I reported a btrfs corruption on the linux-btrfs mailing
list (http://www.spinics.net/lists/linux-btrfs/msg33318.html). Eight
months later, I am happy to be able to say I've been able to recover
the data with a combination of persistence and luck. I want to share
some of my insight with this list in the hope that it may be useful in
future cases. I also did some work on the btrfs tools to be able to
better understand what was wrong; I will submit the additions and
changes I made for review later.

1. The history

I had created this file system in late 2012 when I installed OpenSUSE
12.2 on a friend's laptop. "btrfs was still unstable at that time", I
imagine you say. That's easy to say in hindsight. OpenSUSE's installer
offered btrfs as a tier-1 choice, as far as I remember. Articles
written at the time (e.g.
http://rainbowtux.blogspot.de/2012/09/to-btrfs-or-not-to-btrfs.html)
suggest that I wasn't the only person considering it worth a serious
try. Today I wish I hadn't incautiously put my friend's /home on that
FS, too - I've certainly paid for that carelessness. So, /home was
subvolume 263 in this file system. Complicating matters further, I had
created encrypted home file systems using ecryptfs on top of btrfs.

2. The disaster

It all went well until April 14, 2014. On that day, the laptop suddenly
crashed. OpenSUSE Kernel 3.4.11-2.16 was running at the time of the
crash. Subsequent reboot attempts failed. I described the phenomena in
my posting to linux-btrfs, desperately hoping someone would give me an
easy recipe for recovery. It didn't happen. I got the recommendation to
use a newer version of the kernel and btrfs tools, but they didn't get
me any further. Whatever tool I tried, /home appeared to be completely
empty. I had to dig deeper.

3. The quest

After quite some time, I found the hint, looking at the root of the
/home subvolume, which was a level 2 node:

  # ./btrfs-debug-tree -b 980717568 /dev/XX
  node 980717568 level 2 items 78 free 43 generation 39637 owner 263
          key (256 INODE_ITEM 0) block 1012207616 (247121) gen 35754

Looking at the supposed level-1 subnode at 1012207616, I found that it
contained data of the wrong level (0), owner (2 - the extent tree), and
generation:

  leaf 1012207616 items 26 free space 1967 generation 39622 owner 2
          item 0 key (8266870784 EXTENT_ITEM 12288) itemoff 3942 itemsize 53

So, the tree was massively corrupted at this crucial point; the top
inode of the subvolume couldn't be found, explaining why /home had
appeared empty on every recovery attempt. I looked at the other
children of the children of the tree root, and was pleasantly surprised
that these didn't look bad; I saw inodes and directory entries of
ecryptfs-encrypted home directories, as I had expected.

The obvious next thing to try was to look for previous generations of
the root of the /home subvolume, hoping they weren't corrupted. I
started with the super block root backups, with no luck. Later I went
back all the way from generation 39637 to 38081 (the oldest copy of
this root node I could find), but it was just as corrupted as the last
one - they all pointed to the same wrong level 1 block 1012207616.

I began to wonder whether the all-important level 1 and leaf meta data
of this part of the file system had survived somewhere at all. I hacked
together a tool to search for a specific btrfs key in all of the meta
data, and used it to search for the key 256-1-0 of the subvolume 263
(the first inode of the /home file system). Luckily, I found exactly
one copy of a leaf containing this key, and a handful of level 1 nodes
referring to it.

At this point I didn't yet dare to even think of repairing the file
system. Rather, I took additional debugging steps. One strange thing I
found was that beyond the 603 top (level 2) copies of /home's root
node, there were several instances with the same generation number:

  node 1037123584 level 2 items 78 free 43 generation 39636 owner 263
  node 1041215488 level 2 items 78 free 43 generation 39636 owner 263
  node 980566016 level 2 items 78 free 43 generation 39636 owner 263
  node 980717568 level 2 items 78 free 43 generation 39637 owner 263

Looking at the details of these blocks, I found that the various
level-2-gen-39636-owner-263 nodes were actually different. I have no
idea if this can happen under any circumstances, but it gave me another
hint towards the final solution. Out of the generation 39636 roots
listed above, only the last one showed the original corruption I
described - the others actually had reasonable data in slot 1.

My first hope that these root copies might actually be healthy was
quickly destroyed - a tree dump showed other errors. But, and that was
key, these other corruptions were at different points of the tree.
Taking the three gen-39636 roots together, I was able to find sane data
for every part of the tree. I was lucky insofar as the total number of
corruptions I needed to fix turned out to be so low that it was doable
Re: [PATCH] Btrfs: get more accurate output in fd command.
On 12/10/2014 05:21 AM, Duncan wrote:
Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:
On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:

Hi Dongsheng

On 12/09/2014 12:20 PM, Dongsheng Yang wrote:

When function btrfs_statfs() calculates the total size of fs, it is
calculating the total size of disks and then dividing it by a factor.
But in some use cases, the result is not good to user.

Example:
  # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
  # mount /dev/vdf1 /mnt
  # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
  # df -h /mnt
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/vdf1       3.0G 1018M  1.3G  45% /mnt
  # btrfs fi show /dev/vdf1
  Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
          Total devices 2 FS bytes used 1001.53MiB
          devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
          devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2

a. df -h should report Size as 2GiB rather than as 3GiB.

Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.

I agree

NOPE.

The model you propose is too simple.

While the data portion of the file system is set to RAID1 the metadata
portion of the filesystem is still set to the default of DUP.

Well my bad... /D'oh...

Though I'd say the documentation needs to be updated. The only mention
of changes from the default is this bit. From man mkfs.btrfs as
distributed in the source tree:

[QUOTE]
-m|--metadata
    Specify how metadata must be spanned across the devices specified.
    Valid values are raid0, raid1, raid5, raid6, raid10, single or dup.

    Single device will have dup set by default except in the case of
    SSDs which will default to single. This is because SSDs can remap
    blocks internally so duplicate blocks could end up in the same
    erase block which negates the benefits of doing metadata
    duplication.
[/QUOTE]

No mention is made of RAID1 for a multi-device FS, the two defaults are
listed as DUP and Single.

ASIDE: The wiki page mentions RAID1 but doesn't mention the SSD
fallback to single; and it's annotated as potentially out of date. But
I never looked there because I had the manual page locally.

I tested it and sure enough, it's RAID1...

I also noticed that the default for data goes from single to RAID0 in a
two slice build.

I generally don't expect defaults to change in undocumented ways.
Particularly since that makes make-plus-add orthogonal to
make-as-multi.

Without other guidance I'd been assuming that

  mkfs.btrfs d1 d2 d3 ...
  --vs--
  mkfs.btrfs d1
  btrfs dev add d2
  btrfs dev add d3 ...

would net the same resultant system. I have only ever done the latter
until today.

Does/Will the defaults change when three, four, or more slices are used
to build the system?

I'll take a stab at updating the manual page.

-- Rob.
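As a side note on the "df should report 2GiB" argument in this thread:
the reasoning that the smaller device limits a two-device raid1 can be
written down as a small calculation. The sketch below is only a rough
model under stated assumptions: it assumes a pure two-copy raid1
profile and ignores chunk granularity, metadata and space that is
already allocated. The raid1_usable() helper is a hypothetical name
made up for this illustration; it is neither what btrfs_statfs() does
nor code from the patch under discussion.

```c
#include <stddef.h>

/*
 * Rough estimate of usable raid1 (two-copy) capacity for a set of
 * device sizes, all given in the same unit. Every chunk needs space on
 * two different devices, so the result is capped both by half of the
 * total and by what the devices other than the largest one can mirror.
 */
static unsigned long long raid1_usable(const unsigned long long *sizes,
				       size_t n)
{
	unsigned long long total = 0, largest = 0, rest;
	size_t i;

	if (n < 2)
		return 0;	/* raid1 needs at least two devices */

	for (i = 0; i < n; i++) {
		total += sizes[i];
		if (sizes[i] > largest)
			largest = sizes[i];
	}

	rest = total - largest;
	return rest < total / 2 ? rest : total / 2;
}
```

With the 2GiB and 4GiB devices from the example this model gives 2GiB,
which matches the size argued for in point "a." of the quoted mail.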
Re: Fixing Btrfs Filesystem Full Problems typo?
On 10 December 2014 at 13:47, Duncan <1i5t5.dun...@cox.net> wrote:
>
> The recursive btrfs defrag after deleting the saved ext* subvolume
> _should_ have split up any such > 1 GiB extents so balance could deal
> with them, but either it failed for some reason on at least one such
> file, or there's some other weird corner-case going on, very likely
> something else having to do with the conversion.

I've run defrag several times again and it doesn't do anything additional.

> Patrik, assuming no btrfs snapshots yet, can you do a
> du --all --block-size=1M | sort -n (or similar), then take a look at
> all results over 1024 (1 GiB since the du specified 1 MiB blocks), and
> see if it's reasonable to move all those files out of the filesystem
> and back?

Good idea, but it's quite a lot of files. I'd rather start over.
But I've identified 46 files from Btrfs errors in syslog and will try
to move them to another disk. They're ranging from 41KiB to 6.6GiB in
size.

Is btrfs-debug-tree -e useful in finding problematic files?
Re: Possibility to have a "transient" snapshot?
I was just looking into using overlayfs, and although it has some
promise, I think its biggest drawback is that the upperdir will have to
be some sort of storage-backed filesystem. From my limited
understanding of tmpfs, it's not supposed to be the greatest with many
large files (and my system in particular would be downloading many
large movies/videos, and doing any kind of OS update to test it would
involve many changes all over the volume, which could be problematic to
commit to a golden state).

I could partition the main drive in 2 parts, and dynamically zero out
then create the volume in the second partition on each boot, but I'm
still saving no drive writes, and not really extending the life of the
hardware (which is one of my premises).

On 05/12/2014 11:12 PM, Chris Murphy wrote:
> On Fri, Dec 5, 2014 at 11:27 AM, James West wrote:
>> General idea would be to have a transient snapshot (optional quota
>> support possibility here) on top of a base snapshot (possibly
>> readonly). On system start/restart (whether clean or dirty), the
>> transient snapshot would be flushed, and the system would restart the
>> snapshot, basically restarting from the base snapshot.
>
> Sounds similar to this idea:
> http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html
>
> About 1/3 of the way down it gets to a proposal to Btrfs as a way to
> get to a stateless system, which is basically what you want to be able
> to rollback to.
>
> A variation on this that might serve the use case better is seed
> device. You can either drop the added device that stores changes to
> the seed device, or the volume (seed+added device) can become another
> seed if you want to make the current state persistent at next boot.
>
> And still another possibility is overlayfs, which isn't Btrfs specific.
Re: btrfs scrub status misreports as "interrupted"
On 10/12/2014 9:28 PM, Marc Joliet wrote:
> On Wed, 10 Dec 2014 10:51:15 +0800, Anand Jain wrote:
>> Is there any relevant log in dmesg?
>
> Not in my case; at least, nothing that made it into the syslog.

Same here, no messages at all.
Re: btrfs scrub status misreports as "interrupted"
On Wed, 10 Dec 2014 10:51:15 +0800, Anand Jain wrote:
> Is there any relevant log in dmesg?

Not in my case; at least, nothing that made it into the syslog.

--
Marc Joliet
--
"People who think they know everything really annoy those of us who know we don't" - Bjarne Stroustrup
Re: [PATCH] Btrfs: get more accurate output in fd command.
On 12/10/2014 04:02 PM, Dongsheng Yang wrote:
> On Wed, Dec 10, 2014 at 9:21 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:
[...]
>> And in the example, the mkfs was supplied with two devices, so there's no
>> dup metadata remaining from a formerly single-device filesystem, either.
>> (Tho there will be the small single-mode stubs, empty, remaining from the
>> mkfs process, as no balance has been run to delete them yet, but those
>> are much smaller and empty.)
>
> Yes. One question not related here: how about deleting them at the end of mkfs?
>
> Thanx

A btrfs balance should remove them. If you don't want to balance the whole filesystem, you can filter the chunks by usage (set a low usage threshold). This was discussed recently in another thread...

BR
Goffredo

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Fixing Btrfs Filesystem Full Problems typo?
On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>
> From there... I've never used it but I /think/ btrfs inspect-internal
> logical-resolve should let you map the 182109... address to a filename.
> From there, moving that file out of the filesystem and back in should
> eliminate that issue.

btrfs inspect-internal logical-resolve 1821099687936 /mnt gives me the filename, and it's only a 54175-byte file.

> Assuming no snapshots still contain the file, of course, and that the
> ext* saved subvolume has already been deleted.

Got no snapshots or subvolumes. Keeping it simple for now.
Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device
On 12/10/2014 08:52 AM, Anand Jain wrote:
>
>
>> This patch allows btrfs to skip LVM snapshot during the device scan
>> phase.
>
> Its better to generalize the problem and fix it. The fix here is very
> specific to LVM use case. This does not work in cases where device is
> cloned using dd (device is unmounted).

See patch #5; it aborts btrfs[progs] when two devices have the same dev.uuid and fsid. Unfortunately this patch doesn't work with "btrfs dev scan", because each device is discovered/registered on its own. See my patches about mount.btrfs for an alternative approach.

> As mentioned we need to depend on the device wwn as provided by the
> device target driver.
>
> Thanks, Anand
>

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH] Btrfs: get more accurate output in fd command.
On 12/10/2014 11:53 AM, Robert White wrote:
> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>> Hi Dongsheng
>>> On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>>> When function btrfs_statfs() calculate the total size of fs, it is
>>>> calculating the total size of disks and then dividing it by a factor.
>>>> But in some usecase, the result is not good to user.
>>>>
>>>> Example:
>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>> # mount /dev/vdf1 /mnt
>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>> # df -h /mnt
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> /dev/vdf1       3.0G 1018M  1.3G  45% /mnt
>>>>
>>>> # btrfs fi show /dev/vdf1
>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>>         Total devices 2 FS bytes used 1001.53MiB
>>>>         devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>>         devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>>
>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>>
>>> I agree
>
> NOPE.
>
> The model you propose is too simple.
>
> While the data portion of the file system is set to RAID1 the
> metadata portion of the filesystem is still set to the default of
> DUP. As such it is impossible to guess how much space is "free" since
> it is unknown how the space will be used before hand.

Hi Robert,

sorry, but you are talking about a different problem. Yang is trying to solve a problem where it is impossible to fill all the disk space because some portion of it cannot be raid1 protected. So it is incorrect to report (all space)/2 as free space.

You are instead stating that *if* the metadata is stored as DUP (which is not the case here, because the metadata is raid1; see below), it is possible to fill all the disk space.

This is a complex problem. The fact that BTRFS allows different raid levels makes it very difficult to evaluate the free space (as space directly available to the user). There is no simple answer.

I am still convinced that the best free space *estimation* is to consider the ratio disk-space-consumed/file-allocated constant, and to evaluate the free space as disk-space-unused*file-allocated/disk-space-consumed.

Of course there are pathological cases that make this prediction fail completely. But I consider it the best estimation possible for the average user.

But again, this is a different problem from the one raised by Yang.

[...]

> IF you wanted everything to be RAID-1 you should have instead done
> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 -m raid1
>
> The mistake is yours, rest of you analysis is, therefore, completely
> inapplicable. Please read all the documentation before making that
> sort of filesystem. Your data will thank you later.
>
> DSCLAIMER: I have _not_ looked at the numbers you would get if you
> used the corrected command.

Sorry, but you are wrong. Doing mkfs.btrfs -d raid1 /dev/loop[01] results in both data and metadata being raid1. IIRC, if you have more than one disk, the metadata switches to raid1 automatically.

$ sudo mkfs.btrfs -d raid1 /dev/loop[01]
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (10.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Performing full device TRIM (30.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
	nodesize 16384 leafsize 16384 sectorsize 4096 size 40.00GiB
ghigo@venice:/tmp$ sudo mount /dev/loop0 t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/fill bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.018853 s, 2.2 GB/s
ghigo@venice:/tmp$ sync
ghigo@venice:/tmp$ sudo btrfs fi df t/
Data, RAID1: total=1.00GiB, used=40.50MiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=160.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

[...]

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
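The ratio-based estimate described in this message can be written down in a few lines. This is only a sketch of the idea as stated in the mail, not code from btrfs or btrfs-progs: the function name and parameters are hypothetical, and the fallback for a filesystem with no consumed space (treating raw space as usable 1:1) is an assumption added here so the division is always defined.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate user-visible free space by assuming the
 * ratio raw_consumed / file_allocated observed so far stays constant.
 *
 *   estimate = raw_unused * file_allocated / raw_consumed
 *
 * raw_unused     - raw disk space not yet consumed
 * raw_consumed   - raw disk space consumed so far (all copies included)
 * file_allocated - bytes of file data those consumed bytes represent
 */
uint64_t estimate_free_bytes(uint64_t raw_unused, uint64_t raw_consumed,
			     uint64_t file_allocated)
{
	/* Assumption: with no history, treat raw space as usable 1:1. */
	if (raw_consumed == 0)
		return raw_unused;

	/* Use long double so the intermediate product cannot overflow. */
	return (uint64_t)((long double)raw_unused * file_allocated /
			  raw_consumed);
}
```

For a filesystem that has so far consumed two raw bytes per file byte (as raid1 does), the estimate comes out as half of the unused raw space. As the mail itself notes, pathological usage patterns can make such a prediction fail.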
Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots
On 12/9/2014 10:10 PM, Anand Jain wrote:
> In the test case provided earlier who is triggering the scan ?
> grub-probe ?

The scan is initiated by udev. grub-probe only comes into it because it looks at /proc/mounts to find out what device is mounted, and /proc/mounts is lying.

> But we had to revert, Since btrfs bug become a feature for the
> system boot process and fixing that breaks mount at boot with
> subvol.

How is this? Also, are we talking about updating the cached list of devices that *can* be mounted, or what device already *is* mounted? I can see doing the former, but the latter should never happen.

> if the device is already mounted, just the device path is updated
> but still the original device will be still in use (bug).

Yep, that is the bug that started all of this.
[PATCH 2/3] btrfs-progs: subvol delete: add verbosity option
Add the option -v and use it for the transaction commit mode message.

Signed-off-by: David Sterba
---
 cmds-subvolume.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 4e452f4f4eb7..b14f86e06cb4 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -66,7 +66,7 @@ static int cmd_subvol_create(int argc, char **argv)
 	optind = 1;
 	while (1) {
-		int c = getopt(argc, argv, "c:i:");
+		int c = getopt(argc, argv, "c:i:v");
 		if (c < 0)
 			break;
@@ -217,6 +217,7 @@ static int cmd_subvol_delete(int argc, char **argv)
 	char	*dupvname = NULL;
 	char	*path;
 	DIR	*dirstream = NULL;
+	int verbose = 0;
 	int sync_mode = 0;
 	struct option long_options[] = {
 		{"commit-after", no_argument, NULL, 'c'},  /* sync mode 1 */
@@ -239,6 +240,9 @@ static int cmd_subvol_delete(int argc, char **argv)
 		case 'C':
 			sync_mode = 2;
 			break;
+		case 'v':
+			verbose++;
+			break;
 		default:
 			usage(cmd_subvol_delete_usage);
 		}
@@ -247,9 +251,11 @@ static int cmd_subvol_delete(int argc, char **argv)
 	if (check_argc_min(argc - optind, 1))
 		usage(cmd_subvol_delete_usage);
-	printf("Transaction commit: %s\n",
-		!sync_mode ? "none (default)" :
-		sync_mode == 1 ? "at the end" : "after each");
+	if (verbose > 0) {
+		printf("Transaction commit: %s\n",
+			!sync_mode ? "none (default)" :
+			sync_mode == 1 ? "at the end" : "after each");
+	}
 	cnt = optind;
--
2.1.3
[PATCH 1/3] btrfs-progs: let subvol delete print commit mode inline
There are options to specify whether the subvolume deletion should wait for a commit after each subvolume or at the end. This is reported at the beginning and is considered noise. We'd like to report the mode for each subvolume instead.

http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg34617.html

Reported-by: Marc MERLIN
---
 cmds-subvolume.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 53eec467251d..4e452f4f4eb7 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -303,7 +303,9 @@ again:
 		goto out;
 	}
-	printf("Delete subvolume '%s/%s'\n", dname, vname);
+	printf("Delete subvolume (%s): '%s/%s'\n",
+		sync_mode == 2 || (sync_mode == 1 && cnt + 1 == argc)
+		? "commit" : "no-commit", dname, vname);
 	strncpy_null(args.name, vname);
 	res = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args);
 	e = errno;
--
2.1.3
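The condition that picks the per-subvolume label in this patch is compact, so it may help to see it isolated. The sketch below restates that expression as a small helper; the enum and function names are hypothetical (the patch uses a plain int named sync_mode), and only the mode values 0, 1 and 2 and the two label strings come from the patch.

```c
#include <stdbool.h>

/* Hypothetical names; the values mirror sync_mode in the patch. */
enum commit_mode {
	COMMIT_NONE = 0,	/* default: no explicit commit */
	COMMIT_AFTER = 1,	/* --commit-after: once, at the end */
	COMMIT_EACH = 2		/* --commit-each: after every subvolume */
};

/*
 * index is the position of the current subvolume argument and count is
 * the total argument count, matching cnt and argc in the patch.
 */
bool labelled_as_commit(enum commit_mode mode, int index, int count)
{
	return mode == COMMIT_EACH ||
	       (mode == COMMIT_AFTER && index + 1 == count);
}

const char *commit_label(enum commit_mode mode, int index, int count)
{
	return labelled_as_commit(mode, index, count) ? "commit" : "no-commit";
}
```

With --commit-after, only the last subvolume on the command line is labelled "commit"; with --commit-each every one is, and with neither option none are.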
[PATCH 0/3] Btrfs-progs: subvolume deletion commit mode update
Minor change in the output of the 'subvolume delete' command: the commit mode is printed inline with the subvolume, and the global message is moved under the newly added 'verbose' option.

David Sterba (3):
  btrfs-progs: let subvol delete print commit mode inline
  btrfs-progs: subvol delete: add verbosity option
  btrfs-progs: subvol delete: rename variable to match the option name

 cmds-subvolume.c | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

--
2.1.3
[PATCH 3/3] btrfs-progs: subvol delete: rename variable to match the option name
Signed-off-by: David Sterba --- cmds-subvolume.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/cmds-subvolume.c b/cmds-subvolume.c index b14f86e06cb4..15d4b975a916 100644 --- a/cmds-subvolume.c +++ b/cmds-subvolume.c @@ -218,10 +218,10 @@ static int cmd_subvol_delete(int argc, char **argv) char*path; DIR *dirstream = NULL; int verbose = 0; - int sync_mode = 0; + int commit_mode = 0; struct option long_options[] = { - {"commit-after", no_argument, NULL, 'c'}, /* sync mode 1 */ - {"commit-each", no_argument, NULL, 'C'}, /* sync mode 2 */ + {"commit-after", no_argument, NULL, 'c'}, /* commit mode 1 */ + {"commit-each", no_argument, NULL, 'C'}, /* commit mode 2 */ {NULL, 0, NULL, 0} }; @@ -235,10 +235,10 @@ static int cmd_subvol_delete(int argc, char **argv) switch(c) { case 'c': - sync_mode = 1; + commit_mode = 1; break; case 'C': - sync_mode = 2; + commit_mode = 2; break; case 'v': verbose++; @@ -253,8 +253,8 @@ static int cmd_subvol_delete(int argc, char **argv) if (verbose > 0) { printf("Transaction commit: %s\n", - !sync_mode ? "none (default)" : - sync_mode == 1 ? "at the end" : "after each"); + !commit_mode ? "none (default)" : + commit_mode == 1 ? "at the end" : "after each"); } cnt = optind; @@ -310,7 +310,7 @@ again: } printf("Delete subvolume (%s): '%s/%s'\n", - sync_mode == 2 || (sync_mode == 1 && cnt + 1 == argc) + commit_mode == 2 || (commit_mode == 1 && cnt + 1 == argc) ? 
"commit" : "no-commit", dname, vname); strncpy_null(args.name, vname); res = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args); @@ -323,7 +323,7 @@ again: goto out; } - if (sync_mode == 1) { + if (commit_mode == 1) { res = wait_for_commit(fd); if (res < 0) { fprintf(stderr, @@ -347,7 +347,7 @@ out: goto again; } - if (sync_mode == 2 && fd != -1) { + if (commit_mode == 2 && fd != -1) { res = wait_for_commit(fd); if (res < 0) { fprintf(stderr, -- 2.1.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: get more accurate output in fd command.
On Wed, Dec 10, 2014 at 9:21 PM, Duncan <1i5t5.dun...@cox.net> wrote: > Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted: > >> On 12/09/2014 05:08 PM, Dongsheng Yang wrote: >>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote: Hi Dongsheng On 12/09/2014 12:20 PM, Dongsheng Yang wrote: > When function btrfs_statfs() calculate the tatol size of fs, it is > calculating the total size of disks and then dividing it by a factor. > But in some usecase, the result is not good to user. > > Example: > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 > # mount /dev/vdf1 /mnt > # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 > # df -h /mnt > Filesystem Size Used Avail Use% Mounted on > /dev/vdf1 3.0G 1018M 1.3G 45% /mnt > > # btrfs fi show /dev/vdf1 > Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 > Total devices 2 FS bytes used 1001.53MiB > devid1 size 2.00GiB used 1.85GiB path /dev/vdf1 > devid2 size 4.00GiB used 1.83GiB path /dev/vdf2 > > a. df -h should report Size as 2GiB rather than as 3GiB. > Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. I agree >> >> NOPE. >> >> The model you propose is too simple. >> >> While the data portion of the file system is set to RAID1 the metadata >> portion of the filesystem is still set to the default of DUP. > > Metadata defaults to DUP only on a single-device filesystem. On a multi- > device filesystem, metadata defaults to raid1. (FWIW, for both, data > defaults to single.) Exactly. Thanx for your clarification. :) > > And in the example, the mkfs was supplied with two devices, so there's no > dup metadata remaining from a formerly single-device filesystem, either. > (Tho there will be the small single-mode stubs, empty, remaining from the > mkfs process, as no balance has been run to delete them yet, but those > are much smaller and empty.) Yes. One question not related here: how about delete them in the end of mkfs? Thanx > > -- > Duncan - List replies preferred. No HTML msgs. 
> "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: get more accurate output in fd command.
On Wed, Dec 10, 2014 at 9:59 PM, Shriramana Sharma wrote:
> On Tue, Dec 9, 2014 at 4:50 PM, Dongsheng Yang wrote:
>> # df -h /mnt
>> Filesystem Size Used Avail Use% Mounted on
>> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt
>
> LOL -- not being a user of RAID I can't comment on the patch, but I
> was somewhat wondering what the "fd" command in the subject line is...
> :-)

Yea, it should be "df". :)

>
> --
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
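The 3.0G Size in the df output quoted here is the raw total divided by two, while the 2GiB figure argued for elsewhere in this thread comes from the smaller device limiting how much data can be mirrored. A rough sketch of that second calculation follows. It assumes ideal chunk allocation, exactly two copies on distinct devices, and no metadata or system overhead; the function name is hypothetical and this is not the kernel's btrfs_statfs() logic.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on data that can be stored with two
 * copies on distinct devices. If the largest device is bigger than all
 * the others combined, only the others' combined size can be mirrored;
 * otherwise half of the raw total is reachable.
 */
uint64_t raid1_usable_bytes(const uint64_t *dev_sizes, size_t ndevs)
{
	uint64_t total = 0;
	uint64_t largest = 0;
	uint64_t rest;
	size_t i;

	if (ndevs < 2)
		return 0;	/* raid1 needs at least two devices */

	for (i = 0; i < ndevs; i++) {
		total += dev_sizes[i];
		if (dev_sizes[i] > largest)
			largest = dev_sizes[i];
	}

	rest = total - largest;
	return largest > rest ? rest : total / 2;
}
```

For the 2GiB and 4GiB devices in the example, the larger device exceeds the rest, so the bound is the 2GiB of the smaller one rather than the 3GiB that dividing the raw total by two would suggest.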
Re: [PATCH] Btrfs: get more accurate output in fd command.
On Wed, Dec 10, 2014 at 6:53 PM, Robert White wrote: > On 12/09/2014 05:08 PM, Dongsheng Yang wrote: >> >> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote: >>> >>> Hi Dongsheng >>> On 12/09/2014 12:20 PM, Dongsheng Yang wrote: When function btrfs_statfs() calculate the tatol size of fs, it is calculating the total size of disks and then dividing it by a factor. But in some usecase, the result is not good to user. Example: # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 # mount /dev/vdf1 /mnt # dd if=/dev/zero of=/mnt/zero bs=1M count=1000 # df -h /mnt Filesystem Size Used Avail Use% Mounted on /dev/vdf1 3.0G 1018M 1.3G 45% /mnt # btrfs fi show /dev/vdf1 Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294 Total devices 2 FS bytes used 1001.53MiB devid1 size 2.00GiB used 1.85GiB path /dev/vdf1 devid2 size 4.00GiB used 1.83GiB path /dev/vdf2 a. df -h should report Size as 2GiB rather than as 3GiB. Because this is 2 device raid1, the limiting factor is devid 1 @2GiB. >>> >>> I agree > > > NOPE. > > The model you propose is too simple. > > While the data portion of the file system is set to RAID1 the metadata > portion of the filesystem is still set to the default of DUP. As such it is > impossible to guess how much space is "free" since it is unknown how the > space will be used before hand. > > IF, say, this were used as a typical mail spool, web cache, or any number of > similar smal-file applications virtually all of the data may end up in the > metadata chunks. The "blocks free" in this usage are indistinguisable from > any other file system. > > For all that DUP data the correct size is 3GiB because there will be two > copies of all metadata but they could _all_ end up on /dev/vdf2. > > So you have a RAID-1 region that is constrained to 2Gib. 
You have 2GiB more > storage for all your metadata, but the constraint is DUP (so everything is > written twice "somewhere") > > So the space breakdown is, if optimally packed, actually The issue you pointed here really exists. If the all data is stored inline, the raid level will probably be different with the raid level we set by "-d". If we want to give an exactly guess of the future use, I would say it's impossible. But, 2G of the @size is more proper than 3G in this case I think. Let's compare them as below: 2G: a). It's readable to user, we build a btrfs with two devices of 2G and 4G. Then we got an fs of 2G. That's what raid1 should be understood. b). Even if all data is stored in inline extent, the @size will also grows at the same time. That said, if as you said, we got 3G data in it. The @size will also be reported as 3G in df command. 3G: a). It is strange to user, why we got a fs of 3G in raid1 with 2G and 4G device? And why I can not use the all the 3G capacity df reported (we can not assume a user understand what's inline extent.)? So, I prefer 2G to 3G here. Furthermore, I have cooked a new patch to treat space in metadata chunk and system chunk more properly. shown as below. # df -h /mnt Filesystem Size Used Avail Use% Mounted on /dev/vdf1 2.0G 1.3G 713M 66% /mnt # df /mnt Filesystem 1K-blocksUsed Available Use% Mounted on /dev/vdf12097152 1359424729536 66% /mnt # btrfs fi show /dev/vdf1 Label: none uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4 Total devices 2 FS bytes used 1001.55MiB devid1 size 2.00GiB used 1.85GiB path /dev/vdf1 devid2 size 4.00GiB used 1.83GiB path /dev/vdf2 Does this makes more sense to you, Robert? Thanx Yang > > 2GiB mirrored, for _data_, takes up 4GiB total spread evenly across > /dev/vdf2 (2Gib) and /dev/vdf1 (2Gib). > > _AND_ 1GiB of metadata, written twice to /dev/vdf2 (2Gib) > > So free space is 3Gib on the presumption that data and metadata will be > equally used. 
> > The program, not being psychic, can only make a fair-usage guess about > future use. > > Now we have accounted for all 6GiB of raw storage _and_ the report of 3GiB > free. > > IF you wanted everything to be RAID-1 you should have instead done > > # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 -m raid1 > > The mistake is yours, rest of you analysis is, therefore, completely > inapplicable. Please read all the documentation before making that sort of > filesystem. Your data will thank you later. > > DSCLAIMER: I have _not_ looked at the numbers you would get if you used the > corrected command. > > > >>> b. df -h should report Avail as 0.15GiB or less, rather than as 1.3GiB. 2 - 1.85 = 0.15 >>> >>> I cannot agree; the avail should be: >>> 1.85 (the capacity of the allocated chunk) >>> -1.018 (the file stored) >>> +(2-1.85=0.15) (the residual capacity of the disks >>> considering a raid1 fs) >>> --- >>> =
Re: systemd.setenv and a mount.unit
On Thu, Nov 20, 2014 at 11:39:19AM -0700, Chris Murphy wrote:
> On Thu, Nov 20, 2014 at 4:14 AM, Goffredo Baroncelli wrote:
>
> > Supposing to have the following four subvolumes
> >
> > /root/
> > /root/etc
> > /root/usr
> > /root/var
> >
> > When you need to snapshot, you should:
> >
> > # btrfs subvolume snapshot /root /backup-root-20141120
> > # btrfs subvolume snapshot /root/etc /backup-root-20141120/etc
> > # btrfs subvolume snapshot /root/usr /backup-root-20141120/usr
> > # btrfs subvolume snapshot /root/var /backup-root-20141120/var
> >
> > So in order to remount an "old" filesystem, you need to make only
> > 1 mount.
>
> I like this layout better than either the openSUSE or Fedora layouts.
> It's easier to mount an old filesystem, where on Fedora each
> subvolume must be explicitly mounted. And it ensures old binaries
> aren't in the current mount path – kinda like running in a chroot –
> where on openSUSE the snapshots containing old binaries are in the
> current mount path.

While the old binaries are in the current mount path, they're not generally accessible due to 0750 on the .snapshots directory.

The 'single mountpoint for whole root' is not perfect when there are files that are independent of the system files, like logs or some application data in well-known paths. The other option is to have separate subvolumes for the selected paths and either mount them in fstab or do more work when the old filesystem has to be rolled back and transformed to the expected layout.

Both have their pros and cons, so this is a matter of user choice. E.g. if the logs are forwarded and not kept locally, and no applications store data on the root partition, going to an older snapshot is trivial and without unexpected consequences.

And of course the layouts are convertible both ways.
Re: [PATCH] Btrfs: get more accurate output in fd command.
On Tue, Dec 9, 2014 at 4:50 PM, Dongsheng Yang wrote:
> # df -h /mnt
> Filesystem Size Used Avail Use% Mounted on
> /dev/vdf1 3.0G 1018M 1.3G 45% /mnt

LOL -- not being a user of RAID I can't comment on the patch, but I was somewhat wondering what the "fd" command in the subject line is... :-)

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
[PATCH] btrfs-progs: mkfs: make skinny-metadata default
According to a public poll, this is desired and deemed to be safe. The feature was introduced in kernel 3.10 (Jun 2013).

Signed-off-by: David Sterba
---
 mkfs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mkfs.c b/mkfs.c
index e10e62d2f2e3..f930a5353f75 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -46,7 +46,8 @@ static u64 index_cnt = 2;
-#define DEFAULT_MKFS_FEATURES (BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
+#define DEFAULT_MKFS_FEATURES (BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF \
+		| BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 #define DEFAULT_MKFS_LEAF_SIZE 16384
--
2.1.3
[PATCH] btrfs-progs: basic support for TREE_SEARCH_V2 ioctl
Add the interface and helper that checks if the v2 ioctl is supported. Signed-off-by: David Sterba --- ioctl.h | 14 ++ utils.c | 40 utils.h | 2 ++ 3 files changed, 56 insertions(+) diff --git a/ioctl.h b/ioctl.h index 67c8de9808a7..2c2c7c1bc57e 100644 --- a/ioctl.h +++ b/ioctl.h @@ -279,6 +279,18 @@ struct btrfs_ioctl_search_args { char buf[BTRFS_SEARCH_ARGS_BUFSIZE]; }; +/* + * Extended version of TREE_SEARCH ioctl that can return more than 4k of bytes. + * The allocated size of the buffer is set in buf_size. + */ +struct btrfs_ioctl_search_args_v2 { +struct btrfs_ioctl_search_key key; /* in/out - search parameters */ +__u64 buf_size; /* in - size of buffer +* out - on EOVERFLOW: needed size +* to store item */ +__u64 buf[0]; /* out - found items */ +}; + #define BTRFS_INO_LOOKUP_PATH_MAX 4080 struct btrfs_ioctl_ino_lookup_args { __u64 treeid; @@ -542,6 +554,8 @@ struct btrfs_ioctl_clone_range_args { struct btrfs_ioctl_defrag_range_args) #define BTRFS_IOC_TREE_SEARCH _IOWR(BTRFS_IOCTL_MAGIC, 17, \ struct btrfs_ioctl_search_args) +#define BTRFS_IOC_TREE_SEARCH_V2 _IOWR(BTRFS_IOCTL_MAGIC, 17, \ + struct btrfs_ioctl_search_args_v2) #define BTRFS_IOC_INO_LOOKUP _IOWR(BTRFS_IOCTL_MAGIC, 18, \ struct btrfs_ioctl_ino_lookup_args) #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, __u64) diff --git a/utils.c b/utils.c index 2a9241619128..d3ec0d4ab467 100644 --- a/utils.c +++ b/utils.c @@ -2450,3 +2450,43 @@ int find_next_key(struct btrfs_path *path, struct btrfs_key *key) } return 1; } + +int btrfs_tree_search2_ioctl_supported(int fd) +{ + struct btrfs_ioctl_search_args_v2 *args2; + struct btrfs_ioctl_search_key *sk; + int args2_size = 1024; + char args2_buf[args2_size]; + int ret; + static int v2_supported = -1; + + if (v2_supported != -1) + return v2_supported; + + args2 = (struct btrfs_ioctl_search_args_v2 *)args2_buf; + sk = &(args2->key); + + /* +* Search for the extent tree item in the root tree. 
+*/ + sk->tree_id = BTRFS_ROOT_TREE_OBJECTID; + sk->min_objectid = BTRFS_EXTENT_TREE_OBJECTID; + sk->max_objectid = BTRFS_EXTENT_TREE_OBJECTID; + sk->min_type = BTRFS_ROOT_ITEM_KEY; + sk->max_type = BTRFS_ROOT_ITEM_KEY; + sk->min_offset = 0; + sk->max_offset = (u64)-1; + sk->min_transid = 0; + sk->max_transid = (u64)-1; + sk->nr_items = 1; + args2->buf_size = args2_size - sizeof(struct btrfs_ioctl_search_args_v2); + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args2); + if (ret == -EOPNOTSUPP) + v2_supported = 0; + else if (ret == 0) + v2_supported = 1; + else + return ret; + + return v2_supported; +} diff --git a/utils.h b/utils.h index 289e86b4b11e..eb917d695f18 100644 --- a/utils.h +++ b/utils.h @@ -161,4 +161,6 @@ static inline u64 btrfs_min_dev_size(u32 leafsize) int find_next_key(struct btrfs_path *path, struct btrfs_key *key); +int btrfs_tree_search2_ioctl_supported(int fd); + #endif -- 2.1.3 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
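One detail worth keeping in mind when reading the helper in this patch: ioctl(2) reports failure by returning -1 and setting errno, so a userspace probe generally has to inspect errno rather than compare the return value against a negative error code. The sketch below shows a cached probe written that way. It is not the btrfs-progs implementation: the callback type and the fake probes are hypothetical stand-ins for the real ioctl call, and treating ENOTTY or EOPNOTSUPP as "not supported" is an assumption about what an older kernel would report.

```c
#include <errno.h>

/* Hypothetical stand-in for the ioctl call: 0 on success, -1 + errno. */
typedef int (*probe_fn)(int fd);

/* Fake probes for demonstration only. */
int fake_probe_ok(int fd) { (void)fd; return 0; }
int fake_probe_unknown_ioctl(int fd) { (void)fd; errno = ENOTTY; return -1; }
int fake_probe_io_error(int fd) { (void)fd; errno = EIO; return -1; }

/*
 * *cache holds -1 (unknown), 0 (unsupported) or 1 (supported).
 * Returns the cached answer, or a negative errno for errors that say
 * nothing about support; those are deliberately not cached.
 */
int probe_cached(int *cache, int fd, probe_fn probe)
{
	if (*cache != -1)
		return *cache;

	if (probe(fd) == 0)
		*cache = 1;
	else if (errno == ENOTTY || errno == EOPNOTSUPP)
		*cache = 0;	/* assumption: kernel does not know the ioctl */
	else
		return -errno;

	return *cache;
}
```

Passing the cache by pointer instead of using a function-local static, as the patch does, is a choice made here only to keep the sketch easy to exercise with different probes.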
Re: Fixing Btrfs Filesystem Full Problems typo?
On 10 December 2014 at 13:17, Robert White wrote:
> On 12/09/2014 11:19 PM, Patrik Lundquist wrote:
>>
> BUT FIRST UNDERSTAND: you do _not_ need to balance a newly converted
> filesystem. That is, the recommended balance (and recursive defrag) is _not_
> a useability issue, it's an efficiency issue.

But if I can't start with an efficient filesystem I'd rather start over
now/soon. I intend to add four more old disks for a RAID1 and it will be
problematic to start over later on (I'd have to buy new, large disks).

I deleted the subvolume after being satisfied with the conversion,
defragged recursively, and balanced. In that order.

> Because you made a backup and everything yes?

Shh!

> So anyway. Your system isn't "bugged" or "broken" it's "full" but it's a
> fragmented fullness that has lots of free sectors but insufficient contiguous
> free sectors, so it cannot satisfy the request.

It's a half full 3TB disk. There _is_ space, somewhere. I can't speak
for contiguous space though.

>> I don't know how to interpret the space_info error. Why is only
>> 4773171200 (4,4GiB) free?
>> Can I inspect block group 1821099687936 to try to find out what makes
>> it problematic?
>>
>> BTRFS info (device sdc1): relocating block group 1821099687936 flags 1
>> BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
>> BTRFS: space_info 1 has 4773171200 free, is not full
>> BTRFS: space_info total=1494648619008, used=1489775505408, pinned=0,
>> reserved=99700736, may_use=2102390784, readonly=241664
>
>
> So it was looking for a single chunk 2013265920 bytes long and it couldn't
> find one because all the spaces were smaller and there was no room to make a
> new suitable space.
>
> The problem is that it wanted 2013265920 bytes and while the system as a
> whole had no way to satisfy that desire. It asked for something just shy of
> two gigs as a single extent. That's a tough order on a full platter.
>
> Since your entire free size is 2102390784 that is an attempt to allocate
> about 80% of your free space as one contiguous block. That's never going to
> happen. 8-)

What about "space_info 1 has 4773171200 free"? Besides the other 1,5TB
free space.

> I don't even know if 2GiB is normally a legal size for an extent. My
> understanding is that data is allocated in 1G chunks, so I'd expect all
> extents to be smaller than 1G.

The 'summary' after the failed balances is always something like "98
enospc errors" which now makes me suspect that I have 98 files with
extents larger than 1GiB that the defrag didn't take care of.

So if I can find out which files have >1GiB extents I can then copy them
back and forth to solve the problem. Maybe running defrag more times can
also solve it? Can I get a list of fragmented files?

Suppose an old file with 2GiB extent isn't fragmented, will btrfs defrag
still try to defrag it?

> After a quick glance at the btrfs-convert, it looks like it might make some
> pretty atypical extents if the underlying donor filesystem needed
> them. It wouldn't have had a choice. So it's easily within the realm of
> reason that you'd have some really fascinating data as a result of
> converting a nearly full EXT4 file system of the Terabyte+ size.

It was about half full at conversion.

> This would
> be quadruply true if you'd tweaked the block group ratios when you made the
> original file system.

Ext4 created with defaults, but I think it has been completely full at
one time.

> So since you have nice backups... you should probably drop the ext2_saved
> subvolume and then get on with your life for good or ill.

Done before defrag and balance attempts.

> Think of the time and worry you'd have saved if you'd copied the thing in
> the first place. 8-)

But then I wouldn't learn as much. :-)

>>> P.S. you should re-balance your System and Metadata as "DUP" for now. Two
>>> copies of that stuff is better than one as right now you have no real
>>> recovery path for that stuff. If you didn't make that change on purpose
>>> it
>>> probably got down-revved from DUP automagically when you tried to RAID
>>> it.
>>
>>
>> Good point. Maybe btrfs-convert should do that by default? I don't
>> think it has ever been DUP.
>
> Eyup.

And the metadata is now DUP. That's ~1.5GB extra metadata that was
allocated just fine after the failed balance.
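
For reference, the numbers in the quoted space_info lines are consistent with "free" being total minus used, pinned, reserved and readonly, with may_use left out. A minimal sketch of that arithmetic follows. The struct and function names are invented for illustration, and the formula is an assumption inferred from the logged values, not copied from the kernel source.

```c
#include <stdint.h>

/* Hypothetical container for the counters printed in the log line. */
struct space_info_counters {
	uint64_t total;
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
	uint64_t may_use;	/* printed in the log, not subtracted below */
	uint64_t readonly;
};

/*
 * Assumed formula: free = total - used - pinned - reserved - readonly.
 * This matches the values quoted in the thread.
 */
static uint64_t space_info_free(const struct space_info_counters *si)
{
	return si->total - si->used - si->pinned - si->reserved - si->readonly;
}
```

With the logged counters this gives 4773171200 bytes, more than twice the 2013265920 bytes that were wanted. That fits the explanation that the failure is about finding one contiguous region, not about the total number of free bytes in the space_info.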
Re: [PATCH] Btrfs: get more accurate output in df command.
Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:

> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>> Hi Dongsheng
>>> On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>>> When function btrfs_statfs() calculate the total size of fs, it is
>>>> calculating the total size of disks and then dividing it by a factor.
>>>> But in some usecase, the result is not good to user.
>>>>
>>>> Example:
>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>> # mount /dev/vdf1 /mnt
>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>> # df -h /mnt
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> /dev/vdf1       3.0G 1018M  1.3G  45% /mnt
>>>>
>>>> # btrfs fi show /dev/vdf1
>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>>     Total devices 2 FS bytes used 1001.53MiB
>>>>     devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>>     devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>>
>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>> I agree
>
> NOPE.
>
> The model you propose is too simple.
>
> While the data portion of the file system is set to RAID1 the metadata
> portion of the filesystem is still set to the default of DUP.

Metadata defaults to DUP only on a single-device filesystem. On a
multi-device filesystem, metadata defaults to raid1. (FWIW, for both, data
defaults to single.)

And in the example, the mkfs was supplied with two devices, so there's no
dup metadata remaining from a formerly single-device filesystem, either.

(Tho there will be the small single-mode stubs, empty, remaining from the
mkfs process, as no balance has been run to delete them yet, but those are
much smaller and empty.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
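
The 2GiB versus 3GiB disagreement in the quoted example comes down to two ways of estimating raid1 capacity on two unequal devices. A rough sketch follows, assuming exactly two devices and ignoring metadata overhead and chunk granularity. The function names are hypothetical and do not correspond to anything in btrfs_statfs().

```c
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/* Naive estimate: sum of the device sizes divided by the number of copies. */
static uint64_t raid1_size_naive(uint64_t dev_a, uint64_t dev_b)
{
	return (dev_a + dev_b) / 2;
}

/*
 * Two-device raid1: every chunk needs one copy on each device, so the
 * smaller device is the limiting factor.
 */
static uint64_t raid1_size_two_devices(uint64_t dev_a, uint64_t dev_b)
{
	return dev_a < dev_b ? dev_a : dev_b;
}
```

For the 2GiB and 4GiB devices in the example, the naive estimate is 3GiB (what df printed) while the two-device limit is 2GiB (what the patch description argues for).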
[RFC PATCH V10 03/19] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size.
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/file.c | 32
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d3afac2..444819d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1366,18 +1366,21 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
 				size_t num_pages, loff_t pos,
+				size_t write_bytes,
 				u64 *lockstart, u64 *lockend,
 				struct extent_state **cached_state)
 {
+	struct btrfs_root *root = BTRFS_I(inode)->root;
 	u64 start_pos;
 	u64 last_pos;
 	int i;
 	int ret = 0;
 
-	start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-	last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+	start_pos = pos & ~((u64)root->sectorsize - 1);
+	last_pos = start_pos
+		+ ALIGN(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
-	if (start_pos < inode->i_size) {
+	if (start_pos < inode->i_size) {
 		struct btrfs_ordered_extent *ordered;
 		lock_extent_bits(&BTRFS_I(inode)->io_tree,
 				 start_pos, last_pos, 0, cached_state);
@@ -1494,6 +1497,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
 	while (iov_iter_count(i) > 0) {
 		size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+		size_t sector_offset;
 		size_t write_bytes = min(iov_iter_count(i),
 					 nrptrs * (size_t)PAGE_CACHE_SIZE -
 					 offset);
@@ -1514,7 +1518,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 			break;
 		}
 
-		reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+		sector_offset = pos & (root->sectorsize - 1);
+		reserve_bytes = ALIGN(write_bytes + sector_offset, root->sectorsize);
+
 		ret = btrfs_check_data_free_space(inode, reserve_bytes);
 		if (ret == -ENOSPC &&
 		    (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
@@ -1529,7 +1535,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 				num_pages = (write_bytes + offset +
 					     PAGE_CACHE_SIZE - 1) >>
 					PAGE_CACHE_SHIFT;
-				reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+
+				reserve_bytes = ALIGN(write_bytes + sector_offset,
+						root->sectorsize);
 				ret = 0;
 			} else {
 				ret = -ENOSPC;
@@ -1564,8 +1572,8 @@ again:
 			break;
 
 		ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
-						      pos, &lockstart, &lockend,
-						      &cached_state);
+						pos, write_bytes, &lockstart, &lockend,
+						&cached_state);
 		if (ret < 0) {
 			if (ret == -EAGAIN)
 				goto again;
@@ -1602,9 +1610,9 @@ again:
 		 * we still have an outstanding extent for the chunk we actually
 		 * managed to copy.
 		 */
-		if (num_pages > dirty_pages) {
-			release_bytes = (num_pages - dirty_pages) <<
-				PAGE_CACHE_SHIFT;
+		if (write_bytes > copied) {
+			release_bytes = (write_bytes - copied)
+				& ~((u64)root->sectorsize - 1);
 			if (copied > 0) {
 				spin_lock(&BTRFS_I(inode)->lock);
 				BTRFS_I(inode)->outstanding_extents++;
@@ -1618,7 +1626,7 @@ again:
 							     release_bytes);
 		}
 
-		release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
+		release_bytes = ALIGN(copied + sector_offset, root->sectorsize);
 
 		if (copied > 0)
 			ret = btrfs_dirty_pages(root, inode, pages,
@@ -1640,7 +1648,7 @@ again:
 		if (only_release_metadata && copied > 0) {
 			u64 lockstart = round_down(pos, root->sectorsize);
 			u64 lockend = lockstart +
-				(dirty_pages << PAGE_CACHE_SHIFT) - 1;
+				ALIGN(copied, root->sectorsize) - 1;
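
To make the effect of the reserve_bytes change concrete, the page-based and sector-based calculations from the diff can be compared in userspace. This is only a sketch: align_up() stands in for the kernel's ALIGN() macro, the function names are made up, and page and sector sizes are passed in as plain parameters (both assumed to be powers of two).

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Old behaviour: reserve whole pages covering [pos, pos + write_bytes). */
static uint64_t reserve_bytes_by_page(uint64_t pos, uint64_t write_bytes,
				      uint64_t page_size)
{
	uint64_t offset = pos & (page_size - 1);
	uint64_t num_pages = (write_bytes + offset + page_size - 1) / page_size;

	return num_pages * page_size;
}

/* Patched behaviour: reserve whole sectors covering the same range. */
static uint64_t reserve_bytes_by_sector(uint64_t pos, uint64_t write_bytes,
					uint64_t sectorsize)
{
	uint64_t sector_offset = pos & (sectorsize - 1);

	return align_up(write_bytes + sector_offset, sectorsize);
}
```

For a 2048-byte write at offset 10240 with 4k pages and 2k sectors, the page-based version reserves 4096 bytes while the sector-based version reserves 2048.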
[RFC PATCH V10 16/19] Btrfs: subpagesize-blocksize: Track blocks of ordered extent submitted for write I/O.
In the subpagesize-blocksize scenario, the following command (with 4k as the
PAGE_SIZE and 2k as the block size) can cause false accounting of blocks of an
ordered extent that is written to disk:

$ xfs_io -f -c "pwrite 0 10240" \
         -c "sync_range 0 4096" \
         -c "sync_range 8192 2048" \
         -c "pwrite 10240 2048" \
         -c "sync_range 10240 2048" \
         /mnt/btrfs/file.bin

To fix this, we would have to explicitly track the blocks of an ordered extent
that have already been submitted for write I/O.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/extent_io.c    | 24 ++--
 fs/btrfs/ordered-data.c |  4 +++-
 fs/btrfs/ordered-data.h |  4
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f9172aa..bc4dd46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3227,6 +3227,8 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 	u64 extent_offset;
 	u64 extent_end;
 	u64 iosize;
+	u64 blk, nr_blks;
+	u64 blk_submitted;
 	sector_t sector;
 	struct extent_state *cached_state = NULL;
 	struct block_device *bdev;
@@ -3293,11 +3295,26 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		iosize = min(extent_end - cur, end - cur + 1);
 		iosize = ALIGN(iosize, blocksize);
 
+		blk = extent_offset >> inode->i_sb->s_blocksize_bits;
+		nr_blks = iosize >> inode->i_sb->s_blocksize_bits;
+
+		blk_submitted = find_next_bit(ordered->blocks_submitted,
+				ordered->len >> inode->i_sb->s_blocksize_bits,
+				blk);
+		if (blk_submitted < blk + nr_blks) {
+			if (blk_submitted == blk) {
+				cur += blocksize;
+				btrfs_put_ordered_extent(ordered);
+				continue;
+			}
+			iosize = (blk_submitted - blk)
+				<< inode->i_sb->s_blocksize_bits;
+			nr_blks = iosize >> inode->i_sb->s_blocksize_bits;
+		}
+
 		sector = (ordered->start + extent_offset) >> 9;
 		bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 		compressed = test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags);
-		btrfs_put_ordered_extent(ordered);
-		ordered = NULL;
 
 		/*
 		 * compressed and inline extents are written through other
@@ -3310,6 +3327,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 			 */
 			nr++;
 			cur += iosize;
+			btrfs_put_ordered_extent(ordered);
 			continue;
 		}
 
@@ -3324,6 +3342,8 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		} else {
 			unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
 
+			bitmap_set(ordered->blocks_submitted, blk, nr_blks);
+			btrfs_put_ordered_extent(ordered);
 			set_range_writeback(tree, cur, cur + iosize - 1);
 			if (!PageWriteback(page)) {
 				btrfs_err(BTRFS_I(inode)->root->fs_info,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4d9832f..59b2544 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -199,13 +199,15 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	nr_longs = BITS_TO_LONGS(len >> inode->i_sb->s_blocksize_bits);
 	if (nr_longs == 1) {
 		entry->blocks_done = &entry->blocks_bitmap;
+		entry->blocks_submitted = &entry->blocks_submitted_bitmap;
 	} else {
-		entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
+		entry->blocks_done = kzalloc(2 * nr_longs * sizeof(unsigned long),
 					GFP_NOFS);
 		if (!entry->blocks_done) {
 			kmem_cache_free(btrfs_ordered_extent_cache, entry);
 			return -ENOMEM;
 		}
+		entry->blocks_submitted = entry->blocks_done + nr_longs;
 	}
 
 	entry->file_offset = file_offset;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 7de3b1e..851914c 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,10 @@ struct btrfs_ordered_extent {
 	/* bitmap to track the blocks that have been written to disk */
 	unsigned long *blocks_done;
 	unsigned long blocks_bitmap;
+
+	/* bitmap to track the blocks that have been submitted for write i/o */
+	unsigned long *blocks_submitted;
+	unsigned
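
The idea of the submitted-blocks bitmap can be illustrated with a small userspace model. This is a simplified sketch, not the kernel code: it assumes an ordered extent of at most 64 blocks held in a single uint64_t, and next_set_bit(), mark_submitted() and blocks_to_submit() are hypothetical helpers that mirror how the patch uses find_next_bit() and bitmap_set().

```c
#include <stdint.h>

/* Index of the first set bit in [from, nbits), or nbits if there is none. */
static unsigned next_set_bit(uint64_t map, unsigned nbits, unsigned from)
{
	for (unsigned i = from; i < nbits; i++)
		if (map & (1ULL << i))
			return i;
	return nbits;
}

/* Record blocks [blk, blk + nr) as submitted for write I/O. */
static void mark_submitted(uint64_t *map, unsigned blk, unsigned nr)
{
	for (unsigned i = blk; i < blk + nr; i++)
		*map |= 1ULL << i;
}

/*
 * How many of the nr blocks starting at blk may still be submitted.
 * Returns 0 if blk itself was already submitted, otherwise the run is
 * trimmed so that it stops before the next already-submitted block.
 */
static unsigned blocks_to_submit(uint64_t map, unsigned nbits,
				 unsigned blk, unsigned nr)
{
	unsigned next = next_set_bit(map, nbits, blk);

	if (next >= blk + nr)
		return nr;
	return next - blk;
}
```

A caller would skip one block when blocks_to_submit() returns 0, and otherwise submit the returned number of blocks and then call mark_submitted() for them, so that a later writeback pass over the same page does not resubmit blocks that are already in flight.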
[RFC PATCH V10 01/19] Btrfs: subpagesize-blocksize: Get rid of whole page reads.
Based on original patch from Aneesh Kumar K.V For the subpagesize-blocksize scenario, a page can contain multiple blocks. This patch handles this case. This patch adds the new EXTENT_READ_IO extent state bit to reliably unlock pages in readpage's end bio function. Signed-off-by: Chandan Rajendra --- fs/btrfs/extent_io.c | 182 --- fs/btrfs/extent_io.h | 5 +- 2 files changed, 89 insertions(+), 98 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index a389820..c98dfd8 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1951,14 +1951,23 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, * helper function to set a given page up to date if all the * extents in the tree for that page are up to date */ -static void check_page_uptodate(struct extent_io_tree *tree, struct page *page) +static void check_page_uptodate(struct extent_io_tree *tree, struct page *page, + struct extent_state *cached) { u64 start = page_offset(page); u64 end = start + PAGE_CACHE_SIZE - 1; - if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL)) + if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, cached)) SetPageUptodate(page); } +static int page_read_complete(struct extent_io_tree *tree, struct page *page) +{ + u64 start = page_offset(page); + u64 end = start + PAGE_CACHE_SIZE - 1; + + return !test_range_bit(tree, start, end, EXTENT_READ_IO, 0, NULL); +} + /* * When IO fails, either with EIO or csum verification fails, we * try other mirrors that might have a good copy of the data. This @@ -2275,7 +2284,9 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, * a) deliver good data to the caller * b) correct the bad sectors on disk */ - if (failed_bio->bi_vcnt > 1) { + if ((failed_bio->bi_vcnt > 1) + || (failed_bio->bi_io_vec->bv_len + > BTRFS_I(inode)->root->sectorsize)) { /* * to fulfill b), we need to know the exact failing sectors, as * we don't want to rewrite any more than the failed ones. 
thus, @@ -2422,18 +2433,6 @@ static void end_bio_extent_writepage(struct bio *bio, int err) bio_put(bio); } -static void -endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len, - int uptodate) -{ - struct extent_state *cached = NULL; - u64 end = start + len - 1; - - if (uptodate && tree->track_uptodate) - set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC); - unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC); -} - /* * after a readpage IO is done, we need to: * clear the uptodate bits on error @@ -2450,14 +2449,15 @@ static void end_bio_extent_readpage(struct bio *bio, int err) struct bio_vec *bvec; int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + struct extent_state *cached = NULL; struct extent_io_tree *tree; + unsigned long flags; u64 offset = 0; u64 start; u64 end; - u64 len; - u64 extent_start = 0; - u64 extent_len = 0; + int nr_sectors; int mirror; + int unlock; int ret; int i; @@ -2467,54 +2467,31 @@ static void end_bio_extent_readpage(struct bio *bio, int err) bio_for_each_segment_all(bvec, bio, i) { struct page *page = bvec->bv_page; struct inode *inode = page->mapping->host; + struct btrfs_root *root = BTRFS_I(inode)->root; pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, " "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err, io_bio->mirror_num); tree = &BTRFS_I(inode)->io_tree; - /* We always issue full-page reads, but if some block -* in a page fails to read, blk_update_request() will -* advance bv_offset and adjust bv_len to compensate. -* Print a warning for nonzero offsets, and an error -* if they don't add up to a full page. 
*/ - if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) { - if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE) - btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info, - "partial page read in btrfs with offset %u and length %u", - bvec->bv_offset, bvec->bv_len); - else - btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info, - "incomplete page read in btrfs with offset %u an
[RFC PATCH V10 15/19] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43.
In subpagesize-blocksize, we have multiple blocks in a page. Checking for existence of a page in the page cache isn't a sufficient check, since we could be truncating a subset of the blocks mapped by the page. Signed-off-by: Chandan Rajendra --- fs/btrfs/btrfs_inode.h | 2 -- fs/btrfs/file.c| 4 ++- fs/btrfs/inode.c | 77 +++--- 3 files changed, 7 insertions(+), 76 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 43527fd..50497bf 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -278,6 +278,4 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode) &BTRFS_I(inode)->runtime_flags); } -bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end); - #endif diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index b1e0d27..3707515 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2314,7 +2314,9 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) if ((!ordered || (ordered->file_offset + ordered->len <= lockstart || ordered->file_offset > lockend)) && -!btrfs_page_exists_in_range(inode, lockstart, lockend)) { +!test_range_bit(&BTRFS_I(inode)->io_tree, lockstart, +lockend, EXTENT_UPTODATE, 0, +cached_state)) { if (ordered) btrfs_put_ordered_extent(ordered); break; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e0dd338..b236417 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6832,76 +6832,6 @@ out: return ret; } -bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end) -{ - struct radix_tree_root *root = &inode->i_mapping->page_tree; - int found = false; - void **pagep = NULL; - struct page *page = NULL; - int start_idx; - int end_idx; - - start_idx = start >> PAGE_CACHE_SHIFT; - - /* -* end is the last byte in the last page. end == start is legal -*/ - end_idx = end >> PAGE_CACHE_SHIFT; - - rcu_read_lock(); - - /* Most of the code in this while loop is lifted from -* find_get_page. 
It's been modified to begin searching from a -* page and return just the first page found in that range. If the -* found idx is less than or equal to the end idx then we know that -* a page exists. If no pages are found or if those pages are -* outside of the range then we're fine (yay!) */ - while (page == NULL && - radix_tree_gang_lookup_slot(root, &pagep, NULL, start_idx, 1)) { - page = radix_tree_deref_slot(pagep); - if (unlikely(!page)) - break; - - if (radix_tree_exception(page)) { - if (radix_tree_deref_retry(page)) { - page = NULL; - continue; - } - /* -* Otherwise, shmem/tmpfs must be storing a swap entry -* here as an exceptional entry: so return it without -* attempting to raise page count. -*/ - page = NULL; - break; /* TODO: Is this relevant for this use case? */ - } - - if (!page_cache_get_speculative(page)) { - page = NULL; - continue; - } - - /* -* Has the page moved? -* This is part of the lockless pagecache protocol. See -* include/linux/pagemap.h for details. -*/ - if (unlikely(page != *pagep)) { - page_cache_release(page); - page = NULL; - } - } - - if (page) { - if (page->index <= end_idx) - found = true; - page_cache_release(page); - } - - rcu_read_unlock(); - return found; -} - static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend, struct extent_state **cached_state, int writing) { @@ -6926,9 +6856,10 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend, * invalidate needs to happen so that reads after a write do not * get stale data. */ - if (!ordered && - (!writing || -!btrfs_page_exists_in_range(inode, lockstart, lockend))) + if (!ordered && (!writing || + !test_range_bit(&BTRFS_I(inode)->io_tree, +
[RFC PATCH V10 10/19] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units.
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized blocks instead of pages. Hence the function has been renamed to btrfs_truncate_block(). Signed-off-by: Chandan Rajendra --- fs/btrfs/ctree.h | 2 +- fs/btrfs/file.c | 41 ++--- fs/btrfs/inode.c | 48 +--- 3 files changed, 48 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5b7b7ca..59779dc 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3815,7 +3815,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct inode *dir, u64 objectid, const char *name, int name_len); -int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len, +int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, int front); int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 444819d..b1e0d27 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2200,21 +2200,24 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) u64 tail_len; u64 orig_start = offset; u64 cur_offset; + unsigned char blocksize_bits; u64 min_size = btrfs_calc_trunc_metadata_size(root, 1); u64 drop_end; int ret = 0; int err = 0; int rsv_count; - bool same_page; + bool same_block; bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES); u64 ino_size; + blocksize_bits = inode->i_sb->s_blocksize_bits; + ret = btrfs_wait_ordered_range(inode, offset, len); if (ret) return ret; mutex_lock(&inode->i_mutex); - ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE); + ino_size = round_up(inode->i_size, root->sectorsize); ret = find_first_non_hole(inode, &offset, &len); if (ret < 0) goto out_only_mutex; @@ -2224,29 +2227,28 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) goto out_only_mutex; } - lockstart = round_up(offset , BTRFS_I(inode)->root->sectorsize); + lockstart = round_up(offset, 
BTRFS_I(inode)->root->sectorsize); lockend = round_down(offset + len, BTRFS_I(inode)->root->sectorsize) - 1; - same_page = ((offset >> PAGE_CACHE_SHIFT) == - ((offset + len - 1) >> PAGE_CACHE_SHIFT)); - + same_block = ((offset >> blocksize_bits) + == ((offset + len - 1) >> blocksize_bits)); /* -* We needn't truncate any page which is beyond the end of the file +* We needn't truncate any block which is beyond the end of the file * because we are sure there is no data there. */ /* -* Only do this if we are in the same page and we aren't doing the -* entire page. +* Only do this if we are in the same block and we aren't doing the +* entire block. */ - if (same_page && len < PAGE_CACHE_SIZE) { + if (same_block && len < root->sectorsize) { if (offset < ino_size) - ret = btrfs_truncate_page(inode, offset, len, 0); + ret = btrfs_truncate_block(inode, offset, len, 0); goto out_only_mutex; } - /* zero back part of the first page */ + /* zero back part of the first block */ if (offset < ino_size) { - ret = btrfs_truncate_page(inode, offset, 0, 0); + ret = btrfs_truncate_block(inode, offset, 0, 0); if (ret) { mutex_unlock(&inode->i_mutex); return ret; @@ -2281,11 +2283,12 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) if (!ret) { /* zero the front end of the last page */ if (tail_start + tail_len < ino_size) { - ret = btrfs_truncate_page(inode, - tail_start + tail_len, 0, 1); + ret = btrfs_truncate_block(inode, + tail_start + tail_len, + 0, 1); if (ret) goto out_only_mutex; - } + } } } @@ -2506,10 +2509,10 @@ static long btrfs_fallocate(struct file *file, int mode, } else { /* * If we are fallocating from the end of the file onward we -* need to zero out the en
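
The switch from same_page to same_block changes which hole-punch requests take the short "partial unit" path. A minimal sketch of the comparison, assuming len > 0 and passing the shift (12 for 4k pages, 11 for 2k blocks) as a parameter; same_unit() is a hypothetical name used only here:

```c
#include <stdint.h>

/*
 * Returns 1 if the byte range [offset, offset + len) lies entirely inside
 * one unit of size (1 << bits), 0 otherwise. len must be non-zero.
 */
static int same_unit(uint64_t offset, uint64_t len, unsigned bits)
{
	return (offset >> bits) == ((offset + len - 1) >> bits);
}
```

A 2048-byte hole at offset 1024 sits inside a single 4k page but spans two 2k blocks, so with a 2k block size it no longer qualifies for the single-block truncate path.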
[RFC PATCH V10 00/19] Btrfs: Subpagesize-blocksize: Get rid of whole page I/O.
This patchset continues with the work posted earlier at
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg38862.html.

Changes from V9:
1. Earlier, in read_extent_buffer_pages(), we used to check for extent buffer
   pages' PG_uptodate flag immediately after the page was unlocked by the endio
   function. However, the PG_uptodate flag is set on the pages only after the
   read operation on all pages completes successfully and the verification of
   the extent buffer's contents is done. Fix this by checking only for
   EXTENT_BUFFER_UPTODATE flag in read_extent_buffer_pages().
2. Add the new EXTENT_READ_IO extent state bit to reliably unlock pages in
   readpage's end bio function.
3. Use (eb->start, seq) as search key for tree modification log.
4. btrfs_submit_direct_hook: Prevent zero length bios from being submitted
   when map_length < bio vector length.
5. Enabled POSIX ACL support.

Changes from V8:
1. In subpagesize-blocksize scenario, prevent writes to an extent buffer when
   the corresponding page's PG_writeback flag is set. This race condition was
   triggered when running xfstests' generic/083 test. With the new patch
   applied, I have run the complete xfstests suite as well as run generic/083
   multiple times on both 4k and 2k block size setups. There were 2 unrelated
   test failures that occurred rarely, but they were reproducible even when the
   patch was not applied.

Changes from V7:
1. Fix a softlockup issue that occurred because the page corresponding to the
   delalloc region did not exist. This bug was introduced by the code added in
   btrfs_invalidatepage and related functions in V7 version.

Changes from V6:
1. Fix softlockup issue that occurred during unmounting a 4k blocksized
   filesystem instance.
2. Track blocks of an ordered extent submitted for write I/O to avoid I/O
   resubmission in certain scenarios.

Changes from V5:
1. Rebased patchset on top of current btrfs-next tree (i.e. commit
   8d875f95da43c6a8f18f77869f2ef26e9594fecc). This involved using "immutable
   biovecs".
2. Deal with partially allocated ordered extents across a page.
3. Explicitly track I/O status of blocks of an ordered extent.

Changes from V4:
1. V2's "Btrfs: subpagesize-blocksize: Get rid of whole page reads" patch was
   incorrectly replaced with an older version when working on V3 patches. Fix
   this.
2. Fix btrfs_endio_direct_read() to compute checksums for all possible blocks
   in a page.

Changes from V3:
1. Get "Hole punching" and "Extent preallocation" to work correctly in
   subpagesize-blocksize scenario.
2. Get btrfs_page_mkwrite() to reserve space in sectorsized units.

Changes from V2:
1. Get __extent_writepage() to write only the dirty blocks of a page.
2. Fix "page private not zero on page" warning message which is printed when
   running xfstests.

Changes from V1:
1. Remove usage of bio_vec->bv_{len,offset} in end_bio_extent_readpage() and
   end_bio_extent_writepage().

Xfstests' generic tests were run on an x86_64 machine with the patches applied
for blocksizes 2k and 4k.

For 2k blocksize, the following xfstests' generic tests failed:
1. generic/068

The following xfstests' generic tests failed for both 2k and 4k blocksize:
1. generic/008 - FALLOC_FL_ZERO_RANGE is not supported by Btrfs.
2. generic/091 - FALLOC_FL_ZERO_RANGE is not supported by Btrfs.
3. generic/224 - This looks mostly like an issue caused by non-btrfs code, as
   the test failed for the exact same reason when run on an ext4 filesystem
   instance.
4. generic/263 - FALLOC_FL_ZERO_RANGE is not supported by Btrfs.
5. generic/274 - This test very rarely results in a hung task.

The following is a list of known TODO items which will be implemented in
future revisions of this patchset:
1. The xfstests suite was based off the corresponding git tree as available in
   March 2014. Pull the latest from the xfstests git tree and execute the
   tests.
2. Rebase the patches on top of the current linux-btrfs/next branch.
3. Get Xfstests' generic tests to successfully run on both 4k and 2k
   blocksizes.
4. Remove PAGE_CACHE_SIZE delalloc reservation in
   btrfs_writepage_fixup_worker().
5. Create separate slab caches for 'extent buffer head' and 'extent buffer'.
6. Add 'leak list' tracking for 'extent buffer' instances.
7. Rename EXTENT_BUFFER_TREE_REF and EXTENT_BUFFER_IN_TREE to
   EXTENT_BUFFER_HEAD_TREE_REF and EXTENT_BUFFER_HEAD_IN_TREE respectively.

Chandan Rajendra (17):
  Btrfs: subpagesize-blocksize: Get rid of whole page reads.
  Btrfs: subpagesize-blocksize: Get rid of whole page writes.
  Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size.
  Btrfs: subpagesize-blocksize: Read tree blocks whose size is start, seq) as search key for tree modification log.
  Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length

Chandra Seetharaman (2):
  Btrfs: subpagesize-blocksize: Define extent_buffer_head.
  B
[RFC PATCH V10 04/19] Btrfs: subpagesize-blocksize: Define extent_buffer_head.
From: Chandra Seetharaman

In order to handle multiple extent buffers per page, first we need to create a
way to handle all the extent buffers that are attached to a page.

This patch creates a new data structure 'struct extent_buffer_head', and moves
fields that are common to all extent buffers in a page from 'struct
extent_buffer' to 'struct extent_buffer_head'.

Also, this patch moves EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags to
extent_buffer_head->bflags.

Signed-off-by: Chandra Seetharaman
Signed-off-by: Chandan Rajendra
---
 fs/btrfs/backref.c           |   2 +-
 fs/btrfs/ctree.c             |   2 +-
 fs/btrfs/ctree.h             |   6 +-
 fs/btrfs/disk-io.c           |  46 --
 fs/btrfs/extent-tree.c       |   6 +-
 fs/btrfs/extent_io.c         | 373 +--
 fs/btrfs/extent_io.h         |  47 --
 fs/btrfs/volumes.c           |   2 +-
 include/trace/events/btrfs.h |   2 +-
 9 files changed, 328 insertions(+), 158 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 54a201d..1d3d5d6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1305,7 +1305,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
 		eb = path->nodes[0];
 		/* make sure we can use eb after releasing the path */
 		if (eb != eb_in) {
-			atomic_inc(&eb->refs);
+			atomic_inc(&eb_head(eb)->refs);
 			btrfs_tree_read_lock(eb);
 			btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
 		}
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 44ee5d2..693b541 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -169,7 +169,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
 		 * the inc_not_zero dance and if it doesn't work then
 		 * synchronize_rcu and try again.
 		 */
-		if (atomic_inc_not_zero(&eb->refs)) {
+		if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
 			rcu_read_unlock();
 			break;
 		}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8e29b61..5b7b7ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2215,14 +2215,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb,	\
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)		\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)		\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb_head(eb)->pages[0]) +			\
+				(eb->start & (PAGE_CACHE_SIZE -1));	\
 	u##bits res = le##bits##_to_cpu(p->member);			\
 	return res;							\
 }									\
 static inline void btrfs_set_##name(struct extent_buffer *eb,		\
 				    u##bits val)			\
 {									\
-	type *p = page_address(eb->pages[0]);				\
+	type *p = page_address(eb_head(eb)->pages[0]) +			\
+				(eb->start & (PAGE_CACHE_SIZE -1));	\
 	p->member = cpu_to_le##bits(val);				\
 }

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d0ed9e6..3a79833 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1030,13 +1030,21 @@ static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
 	struct extent_buffer *eb;
+	int i, dirty = 0;

 	BUG_ON(!PagePrivate(page));
 	eb = (struct extent_buffer *)page->private;
 	BUG_ON(!eb);
-	BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-	BUG_ON(!atomic_read(&eb->refs));
-	btrfs_assert_tree_locked(eb);
+
+	do {
+		dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+		if (dirty)
+			break;
+	} while ((eb = eb->eb_next) != NULL);
+
+	BUG_ON(!dirty);
+	BUG_ON(!atomic_read(&(eb_head(eb)->refs)));
+	btrfs_assert_tree_locked(&ebh->eb);
 #endif
 	return __set_page_dirty_nobuffers(page);
 }
@@ -1080,7 +1088,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
 	if (!buf)
 		return 0;

-	set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
+	set_bit(EXTENT_BUFFER_READAHEAD, &buf->ebflags);

 	ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
 				       btree_get_extent, mirror_num);
@@ -1089,7 +1097,7 @@ int reada_tree_bl
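The split between per-block and per-page state described in this patch can be sketched in user space roughly as follows. This is only an illustration: every sketch_* name is hypothetical, the back pointer from a buffer to its head is an assumption (the diff only shows that the head embeds a first buffer as ebh->eb), and a 4K page is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the example */

struct sketch_eb_head;

/* One block-sized buffer; several of these can share a page. */
struct sketch_eb {
	uint64_t start;			/* logical address of the block */
	unsigned long ebflags;		/* per-block state (dirty, uptodate, ...) */
	struct sketch_eb *eb_next;	/* next buffer in the same page, or NULL */
	struct sketch_eb_head *ebh;	/* assumed back pointer to the shared head */
};

/* State shared by every buffer attached to the page. */
struct sketch_eb_head {
	int refs;			/* shared reference count */
	unsigned long bflags;		/* page-wide flags */
	unsigned char *page;		/* stands in for pages[0] */
	struct sketch_eb eb;		/* the first buffer is embedded */
};

static struct sketch_eb_head *sketch_eb_head(struct sketch_eb *eb)
{
	assert(eb->ebh != NULL);
	return eb->ebh;
}

/* Offset of the block inside its page, as in the header accessors. */
static size_t sketch_offset_in_page(const struct sketch_eb *eb)
{
	return (size_t)(eb->start & (SKETCH_PAGE_SIZE - 1));
}

/* Address of the block header: page base plus in-page offset. */
static unsigned char *sketch_header_ptr(struct sketch_eb *eb)
{
	return sketch_eb_head(eb)->page + sketch_offset_in_page(eb);
}
```

With two 2K blocks in one page, both buffers resolve to the same head and reference count, while their header pointers differ by 2048 bytes.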
[RFC PATCH V10 14/19] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent.
In subpagesize-blocksize scenario a page can have more than one block. So in addition to PagePrivate2 flag, we would have to track the I/O status of each block of a page to reliably mark the ordered extent as complete. Signed-off-by: Chandan Rajendra --- fs/btrfs/extent_io.c| 19 +-- fs/btrfs/extent_io.h| 5 +- fs/btrfs/inode.c| 338 +++- fs/btrfs/ordered-data.c | 17 +++ fs/btrfs/ordered-data.h | 4 + 5 files changed, 285 insertions(+), 98 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 51ab453..f9172aa 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4302,11 +4302,10 @@ int extent_invalidatepage(struct extent_io_tree *tree, * to drop the page. */ static int try_release_extent_state(struct extent_map_tree *map, - struct extent_io_tree *tree, - struct page *page, gfp_t mask) + struct extent_io_tree *tree, + struct page *page, u64 start, u64 end, + gfp_t mask) { - u64 start = page_offset(page); - u64 end = start + PAGE_CACHE_SIZE - 1; int ret = 1; if (test_range_bit(tree, start, end, @@ -4340,12 +4339,12 @@ static int try_release_extent_state(struct extent_map_tree *map, * map records are removed */ int try_release_extent_mapping(struct extent_map_tree *map, - struct extent_io_tree *tree, struct page *page, - gfp_t mask) + struct extent_io_tree *tree, struct page *page, + u64 start, u64 end, gfp_t mask) { struct extent_map *em; - u64 start = page_offset(page); - u64 end = start + PAGE_CACHE_SIZE - 1; + u64 orig_start = start; + u64 orig_end = end; if ((mask & __GFP_WAIT) && page->mapping->host->i_size > 16 * 1024 * 1024) { @@ -4379,7 +4378,9 @@ int try_release_extent_mapping(struct extent_map_tree *map, free_extent_map(em); } } - return try_release_extent_state(map, tree, page, mask); + return try_release_extent_state(map, tree, page, + orig_start, orig_end, + mask); } /* diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 264dfd4..15bb2a7 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -209,8 +209,9 
@@ typedef struct extent_map *(get_extent_t)(struct inode *inode, void extent_io_tree_init(struct extent_io_tree *tree, struct address_space *mapping); int try_release_extent_mapping(struct extent_map_tree *map, - struct extent_io_tree *tree, struct page *page, - gfp_t mask); + struct extent_io_tree *tree, struct page *page, + u64 start, u64 end, + gfp_t mask); int try_release_extent_buffer(struct page *page); int lock_extent(struct extent_io_tree *tree, u64 start, u64 end); int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 4ed78dd..e0dd338 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2827,51 +2827,115 @@ static void finish_ordered_fn(struct btrfs_work *work) btrfs_finish_ordered_io(ordered_extent); } -static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end, +static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered, + u64 blk, u64 nr_blks, int uptodate) +{ + struct inode *inode = ordered->inode; + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_workqueue *workers; + int done; + + while (nr_blks--) { + if (test_and_set_bit(blk, ordered->blocks_done)) { + blk++; + continue; + } + + done = btrfs_dec_test_ordered_pending(inode, &ordered, + ordered->file_offset + + (blk << inode->i_sb->s_blocksize_bits), + root->sectorsize, + uptodate); + if (done) { + btrfs_init_work(&ordered->work, finish_ordered_fn, + NULL, NULL); + + ordered->work.func = finish_ordered_fn; + ordered->work.flags = 0; + + if (btrfs_is_free_space_inode(inode)) + workers = root->fs_info->endio_freespace_worker; + else + workers = root->fs_info->endio_write_workers; + +
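The per-block tracking that mark_blks_io_complete() performs with test_and_set_bit() on ordered->blocks_done can be sketched as below. This is a simplified, non-atomic user-space illustration; the sketch_* names, the fixed bitmap size and the pending counter are assumptions made for the example and are not taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BLOCKS	256
#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

struct sketch_ordered {
	size_t nr_blocks;	/* blocks covered by the ordered extent */
	size_t pending;		/* blocks whose I/O has not completed yet */
	unsigned long blocks_done[(SKETCH_MAX_BLOCKS + SKETCH_BITS_PER_LONG - 1) /
				  SKETCH_BITS_PER_LONG];
};

/* Non-atomic stand-in for the kernel's test_and_set_bit(). */
static bool sketch_test_and_set_bit(size_t nr, unsigned long *map)
{
	unsigned long mask = 1UL << (nr % SKETCH_BITS_PER_LONG);
	unsigned long *word = &map[nr / SKETCH_BITS_PER_LONG];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/*
 * Mark nr_blks blocks starting at blk as complete. A block that was already
 * marked is skipped, so it is never accounted twice. Returns true when this
 * call completed the last pending block of the ordered extent.
 */
static bool sketch_mark_blks_io_complete(struct sketch_ordered *o,
					 size_t blk, size_t nr_blks)
{
	bool done = false;

	while (nr_blks--) {
		assert(blk < o->nr_blocks && blk < SKETCH_MAX_BLOCKS);
		if (!sketch_test_and_set_bit(blk, o->blocks_done)) {
			if (--o->pending == 0)
				done = true;
		}
		blk++;
	}
	return done;
}
```

The point of the bitmap is that a page-level flag such as PagePrivate2 cannot say which of the blocks in the page have finished, whereas one bit per block can.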
[RFC PATCH V10 07/19] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE
From: Chandra Seetharaman

This patch allows mounting filesystems with blocksize smaller than the
PAGE_SIZE.

Signed-off-by: Chandra Seetharaman
Signed-off-by: Chandan Rajendra
---
 fs/btrfs/disk-io.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6c6e8bb..2f3caaf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2634,12 +2634,6 @@ int open_ctree(struct super_block *sb,
 		goto fail_sb_buffer;
 	}

-	if (sectorsize != PAGE_SIZE) {
-		printk(KERN_WARNING "BTRFS: Incompatible sector size(%lu) "
-		       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-		goto fail_sb_buffer;
-	}
-
 	mutex_lock(&fs_info->chunk_mutex);
 	ret = btrfs_read_sys_array(tree_root);
 	mutex_unlock(&fs_info->chunk_mutex);
--
2.1.0
[RFC PATCH V10 17/19] Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set.
In non-subpagesize-blocksize scenario, BTRFS_HEADER_FLAG_WRITTEN flag prevents Btrfs code from writing into an extent buffer whose pages are under writeback. This facility isn't sufficient for achieving the same in subpagesize-blocksize scenario, since we have more than one extent buffer mapped to a page. Hence this patch adds a new flag (i.e. EXTENT_BUFFER_HEAD_WRITEBACK) and corresponding code to track the writeback status of the page and to prevent writes to any of the extent buffers mapped to the page while writeback is going on. Signed-off-by: Chandan Rajendra --- fs/btrfs/ctree.c | 20 ++- fs/btrfs/extent-tree.c | 12 fs/btrfs/extent_io.c | 153 +++-- fs/btrfs/extent_io.h | 2 + 4 files changed, 157 insertions(+), 30 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 693b541..75129da 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1543,6 +1543,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, struct extent_buffer *parent, int parent_slot, struct extent_buffer **cow_ret) { + struct extent_buffer_head *ebh = eb_head(buf); u64 search_start; int ret; @@ -1556,6 +1557,13 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, trans->transid, root->fs_info->generation); if (!should_cow_block(trans, root, buf)) { + if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags)) { + if (parent) + btrfs_set_lock_blocking(parent); + btrfs_set_lock_blocking(buf); + wait_on_bit(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK, + eb_wait, TASK_UNINTERRUPTIBLE); + } *cow_ret = buf; return 0; } @@ -2687,6 +2695,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_key *key, struct btrfs_path *p, int ins_len, int cow) { + struct extent_buffer_head *ebh; struct extent_buffer *b; int slot; int ret; @@ -2789,8 +2798,17 @@ again: * then we don't want to set the path blocking, * so we test it here */ - if (!should_cow_block(trans, root, b)) + if (!should_cow_block(trans, root, b)) { + ebh = 
eb_head(b); + if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, + &ebh->bflags)) { + btrfs_set_path_blocking(p); + wait_on_bit(&ebh->bflags, + EXTENT_BUFFER_HEAD_WRITEBACK, + eb_wait, TASK_UNINTERRUPTIBLE); + } goto cow_done; + } btrfs_set_path_blocking(p); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index fbcad82..fb5cc46 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7203,14 +7203,26 @@ static struct extent_buffer * btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 bytenr, u32 blocksize, int level) { + struct extent_buffer_head *ebh; struct extent_buffer *buf; buf = btrfs_find_create_tree_block(root, bytenr, blocksize); if (!buf) return ERR_PTR(-ENOMEM); + + ebh = eb_head(buf); btrfs_set_header_generation(buf, trans->transid); btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level); btrfs_tree_lock(buf); + + if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, + &ebh->bflags)) { + btrfs_set_lock_blocking(buf); + wait_on_bit(&ebh->bflags, + EXTENT_BUFFER_HEAD_WRITEBACK, + eb_wait, TASK_UNINTERRUPTIBLE); + } + clean_tree_block(trans, root, buf); clear_bit(EXTENT_BUFFER_STALE, &buf->ebflags); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index bc4dd46..598923c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3448,7 +3448,7 @@ done_unlocked: return 0; } -static int eb_wait(void *word) +int eb_wait(void *word) { io_schedule(); return 0; @@ -3460,6 +3460,52 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb) TASK_UNINTERRUPTIBLE); } +static void lock_extent_buffers(struct extent_buffer_head *ebh, + struct extent_page_data *epd) +{ + struct extent_buffer *locked_eb = NULL; + struct extent_buffer *eb; +a
[RFC PATCH V10 11/19] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in sectorsized units.
In subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ad7d0f..23ce9ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7812,26 +7812,23 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 	int reserved = 0;
+	u64 delalloc_size;
 	u64 page_start;
 	u64 page_end;

 	sb_start_pagefault(inode->i_sb);
-	ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
-	if (!ret) {
-		ret = file_update_time(vma->vm_file);
-		reserved = 1;
-	}
+
+	ret = file_update_time(vma->vm_file);
 	if (ret) {
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
 			ret = VM_FAULT_SIGBUS;
-		if (reserved)
-			goto out;
-		goto out_noreserve;
+		goto out;
 	}

 	ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
+
 again:
 	lock_page(page);
 	size = i_size_read(inode);
@@ -7862,6 +7859,19 @@ again:
 		goto again;
 	}

+	if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT))
+		delalloc_size = round_up(size - page_start, root->sectorsize);
+	else
+		delalloc_size = PAGE_CACHE_SIZE;
+
+	ret = btrfs_delalloc_reserve_space(inode, delalloc_size);
+	if (ret) {
+		/* -ENOSPC */
+		ret = VM_FAULT_SIGBUS;
+		goto out_unlock;
+	}
+	reserved = 1;
+
 	/*
 	 * XXX - page_mkwrite gets called every time the page is dirtied, even
 	 * if it was already dirty, so for space accounting reasons we need to
@@ -7874,7 +7884,8 @@ again:
 			  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
 			  0, 0, &cached_state, GFP_NOFS);

-	ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+	ret = btrfs_set_extent_delalloc(inode, page_start,
+					page_start + delalloc_size - 1,
 					&cached_state);
 	if (ret) {
 		unlock_extent_cached(io_tree, page_start, page_end,
@@ -7913,8 +7924,8 @@ out_unlock:
 	}
 	unlock_page(page);
 out:
-	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
-out_noreserve:
+	if (reserved)
+		btrfs_delalloc_release_space(inode, delalloc_size);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
--
2.1.0
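The reservation size computed in this patch can be worked through with concrete numbers. The helper below is a minimal user-space sketch of the same arithmetic; the sketch_* names and the 4K page size are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	(UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t sketch_round_up(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Bytes to reserve for a write fault on the page at page_index of a file of
 * i_size bytes: a full page, unless the page contains the end of file, in
 * which case only the blocks up to and including the EOF block are counted.
 */
static uint64_t sketch_delalloc_size(uint64_t page_index, uint64_t i_size,
				     uint64_t sectorsize)
{
	uint64_t page_start = page_index << SKETCH_PAGE_SHIFT;

	assert(i_size > page_start);	/* the page must lie inside the file */
	if (page_index == ((i_size - 1) >> SKETCH_PAGE_SHIFT))
		return sketch_round_up(i_size - page_start, sectorsize);
	return SKETCH_PAGE_SIZE;
}
```

For example, with a 1K sectorsize and i_size = 5000, the second page holds 904 bytes of file data, so only one block (1024 bytes) is reserved instead of a full 4096-byte page.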
[RFC PATCH V10 19/19] Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length
In subpagesize-blocksize scenario, map_length can be less than the length of a
bio vector. Such a condition may cause btrfs_submit_direct_hook() to submit a
zero length bio. Fix this.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/inode.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b236417..0f59c6c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7346,9 +7346,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	u64 file_offset = dip->logical_offset;
 	u64 submit_len = 0;
 	u64 map_length;
-	int nr_pages = 0;
+	u32 blocksize = root->sectorsize;
 	int ret = 0;
 	int async_submit = 0;
+	int nr_sectors;
+	int i;

 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -7378,9 +7380,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	atomic_inc(&dip->pending_bios);

 	while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
-		if (unlikely(map_length < submit_len + bvec->bv_len ||
-		    bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-				 bvec->bv_offset) < bvec->bv_len)) {
+		nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
+		i = 0;
+next_block:
+		if (unlikely(map_length < submit_len + blocksize ||
+		    bio_add_page(bio, bvec->bv_page, blocksize,
+			bvec->bv_offset + (i * blocksize)) < blocksize)) {
 			/*
 			 * inc the count before we submit the bio so
 			 * we know the end IO handler won't happen before
@@ -7401,7 +7406,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 			file_offset += submit_len;

 			submit_len = 0;
-			nr_pages = 0;

 			bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
 						  start_sector, GFP_NOFS);
@@ -7418,9 +7422,14 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 				bio_put(bio);
 				goto out_err;
 			}
+
+			goto next_block;
 		} else {
-			submit_len += bvec->bv_len;
-			nr_pages++;
+			submit_len += blocksize;
+			if (--nr_sectors) {
+				i++;
+				goto next_block;
+			}
 			bvec++;
 		}
 	}
--
2.1.0
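The effect of walking each bio vector in blocksize steps can be illustrated with a small simulation. It is a deliberately simplified sketch: map_length is assumed to stay constant for every bio (the real code calls btrfs_map_block() again for each new bio), bio_add_page() is assumed never to fail, and all sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk the vectors in blocksize steps and start a new bio whenever adding one
 * more block would exceed map_length. Returns the number of bios submitted
 * and stores the length of the smallest one in *min_len.
 */
static size_t sketch_split_vectors(const uint32_t *bv_len, size_t nr_vecs,
				   uint32_t blocksize, uint64_t map_length,
				   uint64_t *min_len)
{
	uint64_t submit_len = 0;
	size_t nr_bios = 0;

	assert(blocksize != 0 && map_length >= blocksize);
	*min_len = UINT64_MAX;

	for (size_t v = 0; v < nr_vecs; v++) {
		uint32_t nr_sectors = bv_len[v] / blocksize;

		for (uint32_t i = 0; i < nr_sectors; i++) {
			if (map_length < submit_len + blocksize) {
				/* The current bio is full: submit it. */
				if (submit_len < *min_len)
					*min_len = submit_len;
				nr_bios++;
				submit_len = 0;
			}
			submit_len += blocksize;
		}
	}
	if (submit_len) {
		/* Submit whatever is left in the last bio. */
		if (submit_len < *min_len)
			*min_len = submit_len;
		nr_bios++;
	}
	return nr_bios;
}
```

With two 4096-byte vectors, a 1K blocksize and a 3072-byte map_length, the sketch produces bios of 3072, 3072 and 2048 bytes. A check made against the whole 4096-byte vector would already fail with submit_len == 0, which is the zero length bio case the patch describes.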
[RFC PATCH V10 18/19] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log.
In subpagesize-blocksize a page can map multiple extent buffers and hence using (page index, seq) as the search key is incorrect. For example, searching through tree modification log tree can return an entry associated with the first extent buffer mapped by the page (if such an entry exists), when we are actually searching for entries associated with extent buffers that are mapped at position 2 or more in the page. Signed-off-by: Chandan Rajendra --- fs/btrfs/ctree.c | 34 +- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 75129da..8344f49 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -314,7 +314,7 @@ struct tree_mod_root { struct tree_mod_elem { struct rb_node node; - u64 index; /* shifted logical */ + u64 logical; u64 seq; enum mod_log_op op; @@ -438,11 +438,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info, /* * key order of the log: - * index -> sequence + * node/leaf start address -> sequence * - * the index is the shifted logical of the *new* root node for root replace - * operations, or the shifted logical of the affected block for all other - * operations. + * The 'start address' is the logical address of the *new* root node + * for root replace operations, or the logical address of the affected + * block for all other operations. * * Note: must be called with write lock (tree_mod_log_write_lock). 
*/ @@ -463,9 +463,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm) while (*new) { cur = container_of(*new, struct tree_mod_elem, node); parent = *new; - if (cur->index < tm->index) + if (cur->logical < tm->logical) new = &((*new)->rb_left); - else if (cur->index > tm->index) + else if (cur->logical > tm->logical) new = &((*new)->rb_right); else if (cur->seq < tm->seq) new = &((*new)->rb_left); @@ -526,7 +526,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot, if (!tm) return NULL; - tm->index = eb->start >> PAGE_CACHE_SHIFT; + tm->logical = eb->start; if (op != MOD_LOG_KEY_ADD) { btrfs_node_key(eb, &tm->key, slot); tm->blockptr = btrfs_node_blockptr(eb, slot); @@ -591,7 +591,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info, goto free_tms; } - tm->index = eb->start >> PAGE_CACHE_SHIFT; + tm->logical = eb->start; tm->slot = src_slot; tm->move.dst_slot = dst_slot; tm->move.nr_items = nr_items; @@ -702,7 +702,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info, goto free_tms; } - tm->index = new_root->start >> PAGE_CACHE_SHIFT; + tm->logical = new_root->start; tm->old_root.logical = old_root->start; tm->old_root.level = btrfs_header_level(old_root); tm->generation = btrfs_header_generation(old_root); @@ -742,16 +742,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq, struct rb_node *node; struct tree_mod_elem *cur = NULL; struct tree_mod_elem *found = NULL; - u64 index = start >> PAGE_CACHE_SHIFT; tree_mod_log_read_lock(fs_info); tm_root = &fs_info->tree_mod_log; node = tm_root->rb_node; while (node) { cur = container_of(node, struct tree_mod_elem, node); - if (cur->index < index) { + if (cur->logical < start) { node = node->rb_left; - } else if (cur->index > index) { + } else if (cur->logical > start) { node = node->rb_right; } else if (cur->seq < min_seq) { node = node->rb_left; @@ -1232,9 +1231,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info, return NULL; /* 
-* the very last operation that's logged for a root is the replacement -* operation (if it is replaced at all). this has the index of the *new* -* root, making it the very first operation that's logged for this root. +* the very last operation that's logged for a root is the +* replacement operation (if it is replaced at all). this has +* the logical address of the *new* root, making it the very +* first operation that's logged for this root. */ while (1) { tm = tree_mod_log_search_oldest(fs_info, root_logical, @@ -1338,7 +1338,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb, if (!next) break; tm = container_of(next, struct tree_mod_elem, node); -
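Why the shifted key is ambiguous can be seen with two blocks that share a page. The comparator below is a minimal sketch of a (logical, seq) ordering; the sketch_* names and the 4K page shift are assumptions, and the sign convention is a plain less/equal/greater one rather than the left/right choice made in the kernel's rb-tree walk.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4K pages */

struct sketch_tm_key {
	uint64_t logical;	/* start address of the node or leaf */
	uint64_t seq;		/* tree modification sequence number */
};

/* Old key component: the page index of the block. */
static uint64_t sketch_old_index(uint64_t eb_start)
{
	return eb_start >> SKETCH_PAGE_SHIFT;
}

/* Order by logical address first, then by sequence number. */
static int sketch_tm_key_cmp(const struct sketch_tm_key *a,
			     const struct sketch_tm_key *b)
{
	if (a->logical != b->logical)
		return a->logical < b->logical ? -1 : 1;
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	return 0;
}
```

Two 2K blocks at offsets 0 and 2048 of the same page get the same page index, so a lookup keyed on the index cannot tell them apart, while their logical addresses remain distinct.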
[RFC PATCH V10 02/19] Btrfs: subpagesize-blocksize: Get rid of whole page writes.
This commit brings back functions that set/clear EXTENT_WRITEBACK bits. These are required to reliably clear PG_writeback page flag. Signed-off-by: Chandan Rajendra --- fs/btrfs/extent_io.c | 47 +++ fs/btrfs/inode.c | 40 +++- 2 files changed, 58 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c98dfd8..57db008 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1300,6 +1300,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end, cached_state, mask); } +static int set_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end, + struct extent_state **cached_state, gfp_t mask) +{ + return set_extent_bit(tree, start, end, EXTENT_WRITEBACK, NULL, + cached_state, mask); +} + +static int clear_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end, + struct extent_state **cached_state, gfp_t mask) +{ + return clear_extent_bit(tree, start, end, EXTENT_WRITEBACK, 1, 0, + cached_state, mask); +} + /* * either insert or lock state struct between start and end use mask to tell * us if waiting is desired. @@ -1406,6 +1420,7 @@ static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end) page_cache_release(page); index++; } + set_extent_writeback(tree, start, end, NULL, GFP_NOFS); return 0; } @@ -2403,31 +2418,23 @@ static void end_bio_extent_writepage(struct bio *bio, int err) bio_for_each_segment_all(bvec, bio, i) { struct page *page = bvec->bv_page; + struct inode *inode = page->mapping->host; + struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; + u64 page_start, page_end; - /* We always issue full-page reads, but if some block -* in a page fails to read, blk_update_request() will -* advance bv_offset and adjust bv_len to compensate. -* Print a warning for nonzero offsets, and an error -* if they don't add up to a full page. 
*/ - if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) { - if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE) - btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info, - "partial page write in btrfs with offset %u and length %u", - bvec->bv_offset, bvec->bv_len); - else - btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info, - "incomplete page write in btrfs with offset %u and " - "length %u", - bvec->bv_offset, bvec->bv_len); - } - - start = page_offset(page); - end = start + bvec->bv_offset + bvec->bv_len - 1; + start = page_offset(page) + bvec->bv_offset; + end = start + bvec->bv_len - 1; if (end_extent_writepage(page, err, start, end)) continue; - end_page_writeback(page); + clear_extent_writeback(tree, start, end, NULL, GFP_ATOMIC); + + page_start = page_offset(page); + page_end = page_offset(page) + PAGE_CACHE_SIZE - 1; + if (!test_range_bit(tree, page_start, page_end, + EXTENT_WRITEBACK, 0, NULL)) + end_page_writeback(page); } bio_put(bio); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7309832..2ffb4df 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2823,22 +2823,44 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end, struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_ordered_extent *ordered_extent = NULL; struct btrfs_workqueue *workers; + u64 ordered_start, ordered_end; + int done; trace_btrfs_writepage_end_io_hook(page, start, end, uptodate); ClearPagePrivate2(page); - if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start, - end - start + 1, uptodate)) - return 0; +loop: + ordered_extent = btrfs_lookup_ordered_range(inode, start, + end - start + 1); + if (!ordered_extent) + goto out; - btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL); + ordered_start = max_t(u64, start, ordered_extent->file_offset); + ordered_end = min_t(u64, end, + o
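The rule applied in end_bio_extent_writepage() above, namely to clear the writeback state of the finished range and to end page writeback only once no block of the page is still under writeback, can be sketched like this. A per-page bitmask stands in for the EXTENT_WRITEBACK bits of the io tree; the sketch_* names and the 4K page size are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

struct sketch_page {
	uint32_t blocksize;
	uint32_t writeback_mask;	/* bit n set: block n is under writeback */
	bool pg_writeback;		/* stands in for the PG_writeback flag */
};

/* Bitmask of the blocks touched by the byte range [off, off + len). */
static uint32_t sketch_range_mask(const struct sketch_page *pg,
				  uint32_t off, uint32_t len)
{
	uint32_t mask = 0;

	assert(pg->blocksize >= 128);	/* at most 32 blocks per page */
	assert(len > 0 && off + len <= SKETCH_PAGE_SIZE);

	for (uint32_t b = off / pg->blocksize;
	     b <= (off + len - 1) / pg->blocksize; b++)
		mask |= UINT32_C(1) << b;
	return mask;
}

static void sketch_set_range_writeback(struct sketch_page *pg,
				       uint32_t off, uint32_t len)
{
	pg->writeback_mask |= sketch_range_mask(pg, off, len);
	pg->pg_writeback = true;
}

/* Clear the range; drop the page flag only when nothing is left. */
static void sketch_end_range_writeback(struct sketch_page *pg,
				       uint32_t off, uint32_t len)
{
	pg->writeback_mask &= ~sketch_range_mask(pg, off, len);
	if (pg->writeback_mask == 0)
		pg->pg_writeback = false;
}
```

If the first half of a page completes while the second half is still in flight, the page stays marked as under writeback until the second completion arrives.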
[RFC PATCH V10 06/19] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page
For the subpagesize-blocksize scenario, This patch adds the ability to write a single extent buffer to the disk. Signed-off-by: Chandan Rajendra --- fs/btrfs/disk-io.c | 20 ++-- fs/btrfs/extent_io.c | 300 --- 2 files changed, 250 insertions(+), 70 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 20168e6..6c6e8bb 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -484,17 +484,23 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root, static int csum_dirty_buffer(struct btrfs_root *root, struct page *page) { - u64 start = page_offset(page); - u64 found_start; struct extent_buffer *eb; + u64 found_start; eb = (struct extent_buffer *)page->private; - if (page != eb->pages[0]) + if (page != eb_head(eb)->pages[0]) return 0; - found_start = btrfs_header_bytenr(eb); - if (WARN_ON(found_start != start || !PageUptodate(page))) - return 0; - csum_tree_block(root, eb, 0); + do { + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) + continue; + if (WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags))) + continue; + found_start = btrfs_header_bytenr(eb); + if (WARN_ON(found_start != eb->start)) + return 0; + csum_tree_block(root, eb, 0); + } while ((eb = eb->eb_next) != NULL); + return 0; } diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 70bc10e..bbb5e980 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3453,33 +3453,54 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb) TASK_UNINTERRUPTIBLE); } -static noinline_for_stack int -lock_extent_buffer_for_io(struct extent_buffer *eb, - struct btrfs_fs_info *fs_info, - struct extent_page_data *epd) +static void lock_extent_buffer_pages(struct extent_buffer_head *ebh, + struct extent_page_data *epd) { + struct extent_buffer *eb = &ebh->eb; unsigned long i, num_pages; - int flush = 0; + + num_pages = num_extent_pages(eb->start, eb->len); + for (i = 0; i < num_pages; i++) { + struct page *p = extent_buffer_page(eb, i); + + if 
(!trylock_page(p)) { + flush_write_bio(epd); + lock_page(p); + } + } + + return; +} + +static int noinline_for_stack +lock_extent_buffer_for_io(struct extent_buffer *eb, + struct btrfs_fs_info *fs_info, + struct extent_page_data *epd) +{ + int dirty; int ret = 0; if (!btrfs_try_tree_write_lock(eb)) { - flush = 1; flush_write_bio(epd); btrfs_tree_lock(eb); } - if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) { + if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) { + dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags); btrfs_tree_unlock(eb); - if (!epd->sync_io) - return 0; - if (!flush) { - flush_write_bio(epd); - flush = 1; + if (!epd->sync_io) { + if (!dirty) + return 1; + else + return 2; } + + flush_write_bio(epd); + while (1) { wait_on_extent_buffer_writeback(eb); btrfs_tree_lock(eb); - if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) + if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) break; btrfs_tree_unlock(eb); } @@ -3490,37 +3511,22 @@ lock_extent_buffer_for_io(struct extent_buffer *eb, * under IO since we can end up having no IO bits set for a short period * of time. */ - spin_lock(&eb->refs_lock); - if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) { - set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags); - spin_unlock(&eb->refs_lock); + spin_lock(&eb_head(eb)->refs_lock); + if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) { + set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags); + spin_unlock(&eb_head(eb)->refs_lock); btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN); __percpu_counter_add(&fs_info->dirty_metadata_bytes, -eb->len, fs_info->dirty_metadata_batch); - ret = 1; + re
[RFC PATCH V10 08/19] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks.
Checksums are applicable to sectorsize units. The current code uses bio->bv_len units to compute and look up checksums. This works on machines where sectorsize == PAGE_CACHE_SIZE. This patch makes the checksum computation and look up code to work with sectorsize units. Signed-off-by: Chandan Rajendra --- fs/btrfs/file-item.c | 87 fs/btrfs/inode.c | 53 +--- 2 files changed, 89 insertions(+), 51 deletions(-) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 54c84da..000418a 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, u64 item_start_offset = 0; u64 item_last_offset = 0; u64 disk_bytenr; + u64 page_bytes_left; u32 diff; int nblocks; int bio_index = 0; @@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, disk_bytenr = (u64)bio->bi_iter.bi_sector << 9; if (dio) offset = logical_offset; + + page_bytes_left = bvec->bv_len; while (bio_index < bio->bi_vcnt) { if (!dio) offset = page_offset(bvec->bv_page) + bvec->bv_offset; @@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, if (BTRFS_I(inode)->root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID) { set_extent_bits(io_tree, offset, - offset + bvec->bv_len - 1, + offset + root->sectorsize - 1, EXTENT_NODATASUM, GFP_NOFS); } else { btrfs_info(BTRFS_I(inode)->root->fs_info, @@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, found: csum += count * csum_size; nblocks -= count; - bio_index += count; + while (count--) { - disk_bytenr += bvec->bv_len; - offset += bvec->bv_len; - bvec++; + disk_bytenr += root->sectorsize; + offset += root->sectorsize; + page_bytes_left -= root->sectorsize; + if (!page_bytes_left) { + bio_index++; + bvec++; + page_bytes_left = bvec->bv_len; + } + } } btrfs_free_path(path); @@ -442,6 +451,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode, struct bio_vec *bvec = bio->bi_io_vec; int 
bio_index = 0; int index; + int nr_sectors; + int i; unsigned long total_bytes = 0; unsigned long this_sum_bytes = 0; u64 offset; @@ -469,41 +480,51 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode, if (!contig) offset = page_offset(bvec->bv_page) + bvec->bv_offset; - if (offset >= ordered->file_offset + ordered->len || - offset < ordered->file_offset) { - unsigned long bytes_left; - sums->len = this_sum_bytes; - this_sum_bytes = 0; - btrfs_add_ordered_sum(inode, ordered, sums); - btrfs_put_ordered_extent(ordered); + data = kmap_atomic(bvec->bv_page); - bytes_left = bio->bi_iter.bi_size - total_bytes; - sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left), - GFP_NOFS); - BUG_ON(!sums); /* -ENOMEM */ - sums->len = bytes_left; - ordered = btrfs_lookup_ordered_extent(inode, offset); - BUG_ON(!ordered); /* Logic error */ - sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) + - total_bytes; - index = 0; + nr_sectors = (bvec->bv_len + root->sectorsize - 1) + >> root->fs_info->sb->s_blocksize_bits; + + + for (i = 0; i < nr_sectors; i++) { + if (offset >= ordered->file_offset + ordered->len || + offset < ordered->file_offset) { + unsigned long bytes_left; + sums->len = this_sum_bytes; + this_sum_bytes = 0; + btrfs_add_ordered_sum(inode, ordered, sums); + btrfs_put_ordered_extent(ordere
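The change from one checksum per bio vector to one checksum per sectorsize unit can be sketched as below. The checksum itself is a toy byte sum standing in for the real checksum function; the sketch_* names are hypothetical, and the handling of a short final sector is an assumption of the sketch rather than something shown in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of sectorsize units needed to cover bv_len bytes. */
static uint32_t sketch_nr_sectors(uint32_t bv_len, uint32_t sectorsize)
{
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	return (bv_len + sectorsize - 1) / sectorsize;
}

/* Toy checksum (plain byte sum), used only to keep the example small. */
static uint32_t sketch_toy_csum(const unsigned char *data, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += data[i];
	return sum;
}

/*
 * Compute one checksum per sector of a vector and store them in sums[].
 * Returns the number of checksums written.
 */
static size_t sketch_csum_vec(const unsigned char *data, uint32_t bv_len,
			      uint32_t sectorsize, uint32_t *sums,
			      size_t max_sums)
{
	size_t n = sketch_nr_sectors(bv_len, sectorsize);

	assert(n <= max_sums);
	for (size_t i = 0; i < n; i++) {
		uint32_t off = (uint32_t)i * sectorsize;
		uint32_t len = bv_len - off < sectorsize ?
			       bv_len - off : sectorsize;

		sums[i] = sketch_toy_csum(data + off, len);
	}
	return n;
}
```

A 4096-byte vector therefore yields four checksums with a 1K sectorsize and a single checksum when sectorsize equals the page size, which matches the behaviour of the old code on such machines.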
[RFC PATCH V10 13/19] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations.
In subpagesize-blocksize scenario, extent allocations for only some of the dirty blocks of a page can succeed, while allocation for rest of the blocks can fail. This patch allows I/O against such partially allocated ordered extents to be submitted. Signed-off-by: Chandan Rajendra --- fs/btrfs/extent_io.c | 24 +--- fs/btrfs/extent_io.h | 1 + fs/btrfs/inode.c | 39 +-- 3 files changed, 39 insertions(+), 25 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3f6bec2..51ab453 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1774,15 +1774,22 @@ int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end, if (page_ops & PAGE_SET_PRIVATE2) SetPagePrivate2(pages[i]); + if (page_ops & PAGE_SET_ERROR) + SetPageError(pages[i]); + if (pages[i] == locked_page) { page_cache_release(pages[i]); continue; } - if (page_ops & PAGE_CLEAR_DIRTY) + + if ((page_ops & PAGE_CLEAR_DIRTY) + && !PagePrivate2(pages[i])) clear_page_dirty_for_io(pages[i]); - if (page_ops & PAGE_SET_WRITEBACK) + if ((page_ops & PAGE_SET_WRITEBACK) + && !PagePrivate2(pages[i])) set_page_writeback(pages[i]); - if (page_ops & PAGE_END_WRITEBACK) + if ((page_ops & PAGE_END_WRITEBACK) + && !PagePrivate2(pages[i])) end_page_writeback(pages[i]); if (page_ops & PAGE_UNLOCK) unlock_page(pages[i]); @@ -2398,7 +2405,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end) uptodate = 0; } - if (!uptodate) { + if (!uptodate || PageError(page)) { ClearPageUptodate(page); SetPageError(page); ret = ret < 0 ? 
ret : -EIO; @@ -3149,7 +3156,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode, nr_written); /* File system has been set read-only */ if (ret) { - SetPageError(page); /* fill_delalloc should be return < 0 for error * but just in case, we use > 0 here meaning the * IO is started, so we don't want to return > 0 @@ -3358,7 +3364,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, struct inode *inode = page->mapping->host; struct extent_page_data *epd = data; u64 start = page_offset(page); - u64 page_end = start + PAGE_CACHE_SIZE - 1; int ret; int nr = 0; size_t pg_offset; @@ -3401,7 +3406,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc, ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written); if (ret == 1) goto done_unlocked; - if (ret) + if (ret && !PagePrivate2(page)) goto done; ret = __extent_writepage_io(inode, page, wbc, epd, @@ -3415,10 +3420,7 @@ done: set_page_writeback(page); end_page_writeback(page); } - if (PageError(page)) { - ret = ret < 0 ? ret : -EIO; - end_extent_writepage(page, ret, start, page_end); - } + unlock_page(page); return ret; diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 39e14fc..264dfd4 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -52,6 +52,7 @@ #define PAGE_SET_WRITEBACK (1 << 2) #define PAGE_END_WRITEBACK (1 << 3) #define PAGE_SET_PRIVATE2 (1 << 4) +#define PAGE_SET_ERROR (1 << 5) /* * page->private values. 
Every page that is controlled by the extent diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 91c5580..4ed78dd 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -880,6 +880,8 @@ static noinline int cow_file_range(struct inode *inode, struct btrfs_key ins; struct extent_map *em; struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + struct btrfs_ordered_extent *ordered; + unsigned long page_ops, extent_ops; int ret = 0; if (btrfs_is_free_space_inode(inode)) { @@ -924,8 +926,6 @@ static noinline int cow_file_range(struct inode *inode, btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0); while (disk_num_bytes > 0) { - unsigned long op; -
[RFC PATCH V10 05/19] Btrfs: subpagesize-blocksize: Read tree blocks whose size is
In the case of subpagesize-blocksize, this patch makes it possible to read only a single metadata block from the disk instead of all the metadata blocks that map into a page. Signed-off-by: Chandan Rajendra --- fs/btrfs/disk-io.c | 45 -- fs/btrfs/disk-io.h | 3 ++ fs/btrfs/extent_io.c | 129 ++- 3 files changed, 140 insertions(+), 37 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 3a79833..20168e6 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -431,7 +431,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root, int mirror_num = 0; int failed_mirror = 0; - clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); + clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags); io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree; while (1) { ret = read_extent_buffer_pages(io_tree, eb, start, @@ -450,7 +450,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root, * there is no reason to read the other copies, they won't be * any less wrong. */ - if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags)) + if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags)) break; num_copies = btrfs_num_copies(root->fs_info, @@ -582,12 +582,13 @@ static noinline int check_leaf(struct btrfs_root *root, return 0; } -static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, - u64 phy_offset, struct page *page, - u64 start, u64 end, int mirror) +int verify_extent_buffer_read(struct btrfs_io_bio *io_bio, + struct page *page, + u64 start, u64 end, int mirror) { u64 found_start; int found_level; + struct extent_buffer_head *ebh; struct extent_buffer *eb; struct btrfs_root *root = BTRFS_I(page->mapping->host)->root; int ret = 0; @@ -597,18 +598,26 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, goto out; eb = (struct extent_buffer *)page->private; + do { + if ((eb->start <= start) && (eb->start + eb->len - 1 > start)) + break; + } while ((eb = eb->eb_next) != NULL); + + BUG_ON(!eb); + + ebh = eb_head(eb); /* the pending IO might 
have been the only thing that kept this buffer * in memory. Make sure we have a ref for all this other checks */ extent_buffer_get(eb); - reads_done = atomic_dec_and_test(&eb->io_pages); + reads_done = atomic_dec_and_test(&ebh->io_bvecs); if (!reads_done) goto err; eb->read_mirror = mirror; - if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) { + if (test_bit(EXTENT_BUFFER_IOERR, &eb->ebflags)) { ret = -EIO; goto err; } @@ -650,7 +659,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, * return -EIO. */ if (found_level == 0 && check_leaf(root, eb)) { - set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags); + set_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags); ret = -EIO; } @@ -658,7 +667,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio, set_extent_buffer_uptodate(eb); err: if (reads_done && - test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) + test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags)) btree_readahead_hook(root, eb, eb->start, ret); if (ret) { @@ -667,7 +676,7 @@ err: * again, we have to make sure it has something * to decrement */ - atomic_inc(&eb->io_pages); + atomic_inc(&eb_head(eb)->io_bvecs); clear_extent_buffer_uptodate(eb); } free_extent_buffer(eb); @@ -675,20 +684,6 @@ out: return ret; } -static int btree_io_failed_hook(struct page *page, int failed_mirror) -{ - struct extent_buffer *eb; - struct btrfs_root *root = BTRFS_I(page->mapping->host)->root; - - eb = (struct extent_buffer *)page->private; - set_bit(EXTENT_BUFFER_IOERR, &eb->bflags); - eb->read_mirror = failed_mirror; - atomic_dec(&eb->io_pages); - if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags)) - btree_readahead_hook(root, eb, eb->start, -EIO); - return -EIO;/* we fixed nothing */ -} - static void end_workqueue_bio(struct bio *bio, int err) { struct end_io_wq *end_io_wq = bio->bi_private; @@ -4156,8 +4151,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root) } static struct extent_io_ops btree_extent_io_ops = {
[RFC PATCH V10 09/19] Btrfs: subpagesize-blocksize: __extent_writepage: Write only dirty blocks of a page.
The code now loops across 'ordered extents' instead of 'extent maps' to figure out the dirty blocks of the page to be submitted for a write operation. Signed-off-by: Chandan Rajendra --- fs/btrfs/extent_io.c | 74 1 file changed, 29 insertions(+), 45 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index bbb5e980..ceaf137 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3212,18 +3212,18 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode, int write_flags, int *nr_ret) { struct extent_io_tree *tree = epd->tree; + struct btrfs_ordered_extent *ordered; u64 start = page_offset(page); u64 page_end = start + PAGE_CACHE_SIZE - 1; u64 end; u64 cur = start; u64 extent_offset; - u64 block_start; + u64 extent_end; u64 iosize; sector_t sector; struct extent_state *cached_state = NULL; - struct extent_map *em; struct block_device *bdev; - size_t pg_offset = 0; + size_t pg_offset; size_t blocksize; int ret = 0; int nr = 0; @@ -3263,59 +3263,46 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode, blocksize = inode->i_sb->s_blocksize; while (cur <= end) { - u64 em_end; if (cur >= i_size) { if (tree->ops && tree->ops->writepage_end_io_hook) tree->ops->writepage_end_io_hook(page, cur, page_end, NULL, 1); break; } - em = epd->get_extent(inode, page, pg_offset, cur, -end - cur + 1, 1); - if (IS_ERR_OR_NULL(em)) { - SetPageError(page); - ret = PTR_ERR_OR_ZERO(em); - break; - } - extent_offset = cur - em->start; - em_end = extent_map_end(em); - BUG_ON(em_end <= cur); + ordered = btrfs_lookup_ordered_extent(inode, cur); + if (!ordered) { + cur += blocksize; + continue; + } + + pg_offset = cur & (PAGE_CACHE_SIZE - 1); + + extent_offset = cur - ordered->file_offset; + extent_end = ordered->file_offset + ordered->len; + extent_end = (extent_end < ordered->file_offset) ? 
-1 : extent_end; + BUG_ON(extent_end <= cur); BUG_ON(end < cur); - iosize = min(em_end - cur, end - cur + 1); + iosize = min(extent_end - cur, end - cur + 1); iosize = ALIGN(iosize, blocksize); - sector = (em->block_start + extent_offset) >> 9; - bdev = em->bdev; - block_start = em->block_start; - compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags); - free_extent_map(em); - em = NULL; + + sector = (ordered->start + extent_offset) >> 9; + bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev; + compressed = test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags); + btrfs_put_ordered_extent(ordered); + ordered = NULL; /* * compressed and inline extents are written through other * paths in the FS */ - if (compressed || block_start == EXTENT_MAP_HOLE || - block_start == EXTENT_MAP_INLINE) { - /* -* end_io notification does not happen here for -* compressed extents -*/ - if (!compressed && tree->ops && - tree->ops->writepage_end_io_hook) - tree->ops->writepage_end_io_hook(page, cur, -cur + iosize - 1, -NULL, 1); - else if (compressed) { - /* we don't want to end_page_writeback on -* a compressed extent. this happens -* elsewhere -*/ - nr++; - } - + if (compressed) { + /* we don't want to end_page_writeback on +* a compressed extent. this happens +* elsewhere +*/ + nr++; cur += iosize; - pg_offset += iosize; continue;
[RFC PATCH V10 12/19] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.
In subpagesize-blocksize scenario it is not sufficient to search using the
first byte of the page to make sure that there are no ordered extents
present across the page. Fix this.

Signed-off-by: Chandan Rajendra
---
 fs/btrfs/extent_io.c | 3 ++-
 fs/btrfs/inode.c     | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ceaf137..3f6bec2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3053,7 +3053,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 	while (1) {
 		lock_extent(tree, start, end);
-		ordered = btrfs_lookup_ordered_extent(inode, start);
+		ordered = btrfs_lookup_ordered_range(inode, start,
+						PAGE_CACHE_SIZE);
 		if (!ordered)
 			break;
 		unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 23ce9ff..91c5580 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1821,7 +1821,7 @@ again:
 	if (PagePrivate2(page))
 		goto out;
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
 	if (ordered) {
 		unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
 				     page_end, &cached_state, GFP_NOFS);
@@ -7724,7 +7724,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 	if (!inode_evicting)
 		lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
 	if (ordered) {
 		/*
 		 * IO on this page will never be started, so we need
@@ -7849,7 +7849,7 @@ again:
 	 * we can't set the delalloc bits if there are pending ordered
 	 * extents. Drop our locks and wait for them to finish
 	 */
-	ordered = btrfs_lookup_ordered_extent(inode, page_start);
+	ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
 	if (ordered) {
 		unlock_extent_cached(io_tree, page_start, page_end,
 				     &cached_state, GFP_NOFS);
--
2.1.0
Re: Fixing Btrfs Filesystem Full Problems typo?
Robert White posted on Wed, 10 Dec 2014 04:17:50 -0800 as excerpted:

>> BTRFS info (device sdc1): relocating block group 1821099687936 flags 1
>> BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
>> BTRFS: space_info 1 has 4773171200 free, is not full
>> BTRFS: space_info total=1494648619008, used=1489775505408, pinned=0,
>> reserved=99700736, may_use=2102390784, readonly=241664
>
> So it was looking for a single chunk 2013265920 bytes long and it
> couldn't find one because all the spaces were smaller and there was no
> room to make a new suitable space.
>
> The problem is that it wanted 2013265920 bytes and while the system as a
> whole had no way to satisfy that desire. It asked for something just shy
> of two gigs as a single extent. That's a tough order on a full platter.
>
> Since your entire free size is 2102390784 that is an attempt to allocate
> about 80% of your free space as one contiguous block. That's never going
> to happen. 8-)
>
> I don't even know if 2GiB is normally a legal size for an extent. My
> understanding is that data is allocated in 1G chunks, so I'd expect all
> extents to be smaller than 1G.

On native btrfs, an extent must fit within the 1 GiB data chunk size, with
extents inherited from an ext* conversion being an obvious non-native
exception.

I hadn't looked at the actual output, but that confirms my earlier
suspicion, that after the ext* saved subvolume delete, the defrag somehow
missed at least one file > 1 GiB with a "super-extent" also > 1 GiB in
size.

From there... I've never used it but I /think/ btrfs inspect-internal
logical-resolve should let you map the 182109... address to a filename.
From there, moving that file out of the filesystem and back in should
eliminate that issue.

Assuming no snapshots still contain the file, of course, and that the ext*
saved subvolume has already been deleted.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: Fixing Btrfs Filesystem Full Problems typo?
Robert White posted on Tue, 09 Dec 2014 16:01:02 -0800 as excerpted:

> On 12/09/2014 03:48 PM, Robert White wrote:
>> On 12/09/2014 02:29 PM, Patrik Lundquist wrote:
>>> (stuff depicting a nearly full file system).
>>
>> Having taken another look at it all, I'd bet (there is not sufficient
>> information to be _sure_ from the output you've provided) that you
>> don't have the necessary 1Gb free on your disk slice to allocate
>> another data extent.

[snip most of both quote levels]

> Full filesystems always get into corner cases.

But, from the content you snipped from his post, this from btrfs fi show:

>>> Label: none uuid: 770fe01d-6a45-42b9-912e-e8f8b413f6a4
>>>     Total devices 1 FS bytes used 1.35TiB
>>>     devid 1 size 2.73TiB used 1.36TiB path /dev/sdc1

Device 2.73 TiB, used only 1.36 TiB. That's over a TiB of entirely
unallocated space, so a mere 1 GiB chunk allocation shouldn't be a problem.

I'm sticking with my original hypothesis (assuming this is a continuation
from the thread I think it was), that there's something about the
conversion from ext* that didn't work correctly; most likely a file larger
than the btrfs 1 GiB data-chunk size, that has an extent larger than that
size as well. Btrfs balance couldn't do anything with that, as it's larger
than the native 1 GiB data-chunk size and balance alone doesn't know how to
split it up.

The recursive btrfs defrag after deleting the saved ext* subvolume _should_
have split up any such > 1 GiB extents so balance could deal with them, but
either it failed for some reason on at least one such file, or there's some
other weird corner-case going on, very likely something else having to do
with the conversion.

Patrik, assuming no btrfs snapshots yet, can you do a
du --all --block-size=1M | sort -n (or similar), then take a look at all
results over 1024 (1 GiB since the du specified 1 MiB blocks), and see if
it's reasonable to move all those files out of the filesystem and back?

Assuming there's not too many of them, the idea is to kill the copy in the
filesystem by moving them elsewhere, then move them back so they get
recreated using native btrfs semantics -- no extents larger than the native
btrfs data chunk size of 1 GiB.

If you have lots of memory to work with, one method would be to create a
tmpfs, then /copy/ the files to tmpfs and /move/ them back to a temporary
tree on the btrfs, deleting the originals on btrfs only after the move back
from tmpfs and a sync (or btrfs fi sync) so there's always a permanent copy
if the machine should crash and take down the tmpfs with it. After all the
files have been processed and the originals deleted you can then move the
contents of the temporary tree back into the original location.

That should ensure no more > 1 GiB file extents and will I hope get rid of
the problem, as this workaround has been demonstrated to fix problems other
people had with converted-from-ext* btrfs, generally where they had failed
to run the defrag right after the conversion, and now had a bunch more data
on the filesystem and didn't want to have to defrag it too. Obviously it
works best when there's only a handful of > 1 GiB files, however, and
snapshots containing references to the affected files will prevent the file
delete from actually deleting the problematic extents.

With luck that'll allow a full 100% balance without error. If not, at
least it should eliminate the > 1 GiB file extents possibility, and the
focus can move to something else.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: [PATCH v4 08/13] btrfs-progs: Add count_digit() function to help calculate filename len.
On Tue, Dec 09, 2014 at 04:27:27PM +0800, Qu Wenruo wrote:
> +static inline int count_digit(u64 num)

FYI, I've renamed it to count_digits, and updated all callers.
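The body of the helper is not quoted in this message, so the following is only a minimal sketch of what a digit-counting function with that signature might look like. It uses `uint64_t` in place of the btrfs-progs `u64` typedef, and treating 0 as one digit is an assumption, not something the thread confirms.

```c
#include <stdint.h>

/*
 * Sketch only: the real btrfs-progs implementation is not shown in the
 * thread. Returns the number of decimal digits needed to print num,
 * counting 0 as a single digit (assumed behaviour).
 */
static inline int count_digits(uint64_t num)
{
	int digits = 1;

	while (num >= 10) {
		num /= 10;
		digits++;
	}
	return digits;
}
```

A caller sizing a filename buffer for a numeric suffix could add `count_digits(n)` to the base name length before allocating.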
Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and
On Tue, Dec 09, 2014 at 04:27:19PM +0800, Qu Wenruo wrote:
> The patchset introduce two new repair function and some helpers to
> archive a huge goal:
> Repair btrfs whose fs tree's non-root leaf/node is corrupted when no
> duplication is valid.
>
> The two new repair functions are:
> repair_inode_nlinks():
> Repair any inode nlink related problem.
> From fixing the nlink number and related
> inode_ref/dir_index/dir_item to recovering file name and file type
> and salvage them into the lost+found dir.
> This does not only fix a case that some users reported but also
> cooperate with repair_inode_no_item() function to salvaged heavily
> damaged inode to lost+found dir.
>
> repair_inode_no_item():
> Repair case for inode_item missing case, which is quite common when
> fs tree leaf/node is missing.
> This only does the inode item rebuild. Later recovery like move it
> to lost+found dir is done by repair_inode_nlinks().
>
> The main helper is the repair_btree() function, which will drops the
> corrupted non-root leaf/node and rebalance the tree to keep the
> correctness of the btree.

Sounds a bit intrusive, but under the circumstances I don't see anything
better to do.

> With this patchset, even a non-root leaf/node is corrupted and no
> duplication survived, btrfsck can still repair it to a mountable status.
> (And normal rw should also be OK,)
>
> The remaining unfixable problems will be inode nbytes error with file
> extent discounts error, which may be fixed in next patchset.
>
> Cc David:
> Sorry for the huge change in the patchset and merge the old inode nlink
> repair with new inode item rebuild patchset.

No problem, the incremental changelogs helped a lot.

> Since when developing inode item rebuild patchset, I found the old nlink
> cooperated very bad with item rebuild and there is some duplicated codes
> between the two patchset, no to mention the math lib introduced by nlink
> repair patch.
> So I decided to somewhat rebase the nlink repair patchset to provide
> better generality.

Great, the patchset looks good for merge, I'm adding it to 3.18. From now
on please send only incremental changes and not the whole patchset. Thanks.
Re: Fixing Btrfs Filesystem Full Problems typo?
On 12/09/2014 11:19 PM, Patrik Lundquist wrote:
> On 10 December 2014 at 00:13, Robert White wrote:
>> On 12/09/2014 02:29 PM, Patrik Lundquist wrote:
>>> Label: none uuid: 770fe01d-6a45-42b9-912e-e8f8b413f6a4
>>>     Total devices 1 FS bytes used 1.35TiB
>>>     devid 1 size 2.73TiB used 1.36TiB path /dev/sdc1
>>>
>>> Data, single: total=1.35TiB, used=1.35TiB
>>> System, single: total=32.00MiB, used=112.00KiB
>>> Metadata, single: total=3.00GiB, used=1.55GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> Are you trying to convert a filesystem on a single device/partition to
>> RAID 1?
>
> Not yet. I'm stuck at the full balance after the conversion from ext4.
> I haven't added the disks for RAID1 and might need them for starting
> over instead.

You are not "stuck" here as this step is not mandatory. (see below)

> A balance with -musage=100 -dusage=99 works but a full fails. It would
> be nice to nail the bug since the fs passes btrfs check and it seems to
> be a clear ENOSPC bug.

Conversion from ext2/3/4 is constrained because it needs to be reversible.
If you are out of space this isn't a "bug", you are just out of space.

So by telling the system to ignore the 100% full clusters it is free to
juggle the fragments. But once you get into moving the fully full extents
the COW features _MUST_ have access to _contiguous_ 1GiB blocks to make the
new extents into which the Copy will be Written. If your file system was
nearly full it's completely likely that there are no such contiguous blocks
available to make the necessary extents.

BUT FIRST UNDERSTAND: you do _not_ need to balance a newly converted
filesystem. That is, the recommended balance (and recursive defrag) is
_not_ a usability issue, it's an efficiency issue.

Check what you've got. Make sure it is good. Make sure you are cool with
it all. When you know everything is usable then remove the undo information
snapshot. That snapshot is pinning a _lot_ of data into exact positions on
disk. It's memorializing your previous fragmentation and the anniversary
positions of all the EXT4 data structures.

Since your system is basically full that undo information has to go. At
that point your balance will probably have the room it needs. _Then_ you
can balance if you feel the desire. If you are _still_ out of space you'll
need to add some, at least temporarily, to give the system enough room to
work.

Since we all _know_ you are a diligent system administrator and architect
with a good, recent, and well tested backup we know we can recommend that
you just dump the undo partition with a nice btrfs subvol delete, right?
Because you made a backup and everything yes?

So anyway. Your system isn't "bugged" or "broken" it's "full" but it's a
fragmented fullness that has lots of free sectors but insufficient
contiguous free sectors, so it cannot satisfy the request.

That Said...

I suspect you _have_ revealed a problem with the error reporting in the
case of "scary and wrong error message". The allocator in extent-tree.c
just tells you the raw free space on the disk and says "hua... there are
lots of bytes out there". Which is _WAY_ different than "there are enough
bytes all in one clump to satisfy my needs". E.g. there is _not_ a lot of
brains behind the message.

	ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins,
			       flags, delalloc);

	if (ret == -ENOSPC) {
		if (!final_tried && ins->offset) {
			num_bytes = min(num_bytes >> 1, ins->offset);
			num_bytes = round_down(num_bytes, root->sectorsize);
			num_bytes = max(num_bytes, min_alloc_size);
			if (num_bytes == min_alloc_size)
				final_tried = true;
			goto again;
		} else if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
			struct btrfs_space_info *sinfo;

			sinfo = __find_space_info(root->fs_info, flags);
			btrfs_err(root->fs_info,
				  "allocation failed flags %llu, wanted %llu",
				  flags, num_bytes);
			if (sinfo)
				dump_space_info(sinfo, num_bytes, 1);
		}
	}

> I don't know how to interpret the space_info error. Why is only
> 4773171200 (4,4GiB) free?
> Can I inspect block group 1821099687936 to try to find out what makes
> it problematic?
>
> BTRFS info (device sdc1): relocating block group 1821099687936 flags 1
> BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
> BTRFS: space_info 1 has 4773171200 free, is not full
> BTRFS: space_info total=1494648619008, used=1489775505408, pinned=0,
> reserved=99700736, may_use=2102390784, readonly=241664

So it was looking for a single chunk 2013265920 bytes long and it couldn't
find one because all the spaces were smaller and there was no room to make
a new suitable spac
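The numbers in the quoted kernel log can be checked by hand. The sketch below is an illustration, not kernel code: the struct and function names are hypothetical, and the formula (total minus used, pinned, reserved and readonly, with may_use left out) is an assumption inferred from the fact that it reproduces the "4773171200 free" figure exactly. It also shows that the "wanted" size is 1920 MiB, which is larger than the 1 GiB data chunk size discussed elsewhere in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Counters copied from the "space_info" line of the quoted log. */
struct space_info_counters {
	uint64_t total;
	uint64_t used;
	uint64_t pinned;
	uint64_t reserved;
	uint64_t may_use;
	uint64_t readonly;
};

static const struct space_info_counters quoted_log = {
	.total    = 1494648619008ULL,
	.used     = 1489775505408ULL,
	.pinned   = 0ULL,
	.reserved = 99700736ULL,
	.may_use  = 2102390784ULL,
	.readonly = 241664ULL,
};

/*
 * Free bytes as the log appears to compute them. may_use is NOT
 * subtracted; that is an inference from the quoted numbers.
 */
static uint64_t space_info_free(const struct space_info_counters *s)
{
	uint64_t committed = s->used + s->pinned + s->reserved + s->readonly;

	assert(committed <= s->total);
	return s->total - committed;
}

/* Size of the failed request in whole MiB. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes >> 20;
}

/* True if a request cannot fit inside a single 1 GiB data chunk. */
static int exceeds_one_gib(uint64_t bytes)
{
	return bytes > (1ULL << 30);
}
```

With the quoted values, `space_info_free(&quoted_log)` yields 4773171200, and `bytes_to_mib(2013265920ULL)` yields 1920, so the request was for a single 1.875 GiB extent.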
Re: [PATCH] Btrfs: get more accurate output in fd command.
On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>> Hi Dongsheng
>> On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>> When function btrfs_statfs() calculate the tatol size of fs, it is
>>> calculating the total size of disks and then dividing it by a factor.
>>> But in some usecase, the result is not good to user.
>>>
>>> Example:
>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>> # mount /dev/vdf1 /mnt
>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>> # df -h /mnt
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/vdf1  3.0G 1018M 1.3G  45% /mnt
>>>
>>> # btrfs fi show /dev/vdf1
>>> Label: none uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>     Total devices 2 FS bytes used 1001.53MiB
>>>     devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>     devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>
>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>
>> I agree

NOPE.

The model you propose is too simple.

While the data portion of the file system is set to RAID1 the metadata
portion of the filesystem is still set to the default of DUP. As such it is
impossible to guess how much space is "free" since it is unknown how the
space will be used beforehand.

IF, say, this were used as a typical mail spool, web cache, or any number
of similar small-file applications virtually all of the data may end up in
the metadata chunks. The "blocks free" in this usage are indistinguishable
from any other file system.

For all that DUP data the correct size is 3GiB because there will be two
copies of all metadata but they could _all_ end up on /dev/vdf2.

So you have a RAID-1 region that is constrained to 2GiB. You have 2GiB
more storage for all your metadata, but the constraint is DUP (so
everything is written twice "somewhere").

So the space breakdown is, if optimally packed, actually

2GiB mirrored, for _data_, takes up 4GiB total spread evenly across
/dev/vdf2 (2GiB) and /dev/vdf1 (2GiB).

_AND_ 1GiB of metadata, written twice to /dev/vdf2 (2GiB).

So free space is 3GiB on the presumption that data and metadata will be
equally used.

The program, not being psychic, can only make a fair-usage guess about
future use.

Now we have accounted for all 6GiB of raw storage _and_ the report of 3GiB
free.

IF you wanted everything to be RAID-1 you should have instead done

# mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 -m raid1

The mistake is yours, the rest of your analysis is, therefore, completely
inapplicable. Please read all the documentation before making that sort of
filesystem. Your data will thank you later.

DISCLAIMER: I have _not_ looked at the numbers you would get if you used
the corrected command.

>>> b. df -h should report Avail as 0.15GiB or less, rather than as 1.3GiB.
>>> 2 - 1.85 = 0.15
>>
>> I cannot agree; the avail should be:
>>   1.85 (the capacity of the allocated chunk)
>>  -1.018 (the file stored)
>>  +(2-1.85=0.15) (the residual capacity of the disks
>>                  considering a raid1 fs)
>>  ---
>>  = 0.97
>
> My bad here. It should be 0.97. My mistake in this changelog.
> I will update it in next version.
>
>>> This patch drops the factor at all and calculate the size observable
>>> to user without considering which raid level the data is in and what's
>>> the size exactly in disk.
>>>
>>> After this patch applied:
>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>> # mount /dev/vdf1 /mnt
>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>> # df -h /mnt
>>> Filesystem Size Used Avail Use% Mounted on
>>> /dev/vdf1  2.0G 1018M 713M  59% /mnt
>>
>> I am confused: in this example you reported as Avail 713MB, when
>> previous you stated that the right value should be 150MB...
>
> As you pointed above, the right value should be 970MB or less (Some
> space is used for metadata and system). And the 713MB is my result of it.
>
>> What happens when the filesystem is RAID5/RAID6 or Linear ?
>
> The original df did not consider the RAID5/6. So it still does not work
> well with this patch applied. But I will update this patch to handle
> these scenarios in V2.
>
> Thanx
> Yang

[...]
Re: [PATCH 1/2] Btrfs: qgroup: free reserved in exceeding quota.
On 12/09/2014 11:42 PM, Josef Bacik wrote:
> On 12/09/2014 06:27 AM, Dongsheng Yang wrote:
>> When we exceed quota limit in writing, we will free some reserved
>> extent when we need to drop but not free account in qgroup. It means,
>> each time we exceed quota in writing, there will be some remain space
>> in qg->reserved we can not use any more. If things go on like this,
>> the all space will be ate up.
>>
>> Signed-off-by: Dongsheng Yang
>> ---
>>  fs/btrfs/extent-tree.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a84e00d..014b7f2 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5262,8 +5262,11 @@ out_fail:
>>  		to_free = 0;
>>  	}
>>  	spin_unlock(&BTRFS_I(inode)->lock);
>> -	if (dropped)
>> +	if (dropped) {
>> +		if (root->fs_info->quota_enabled)
>> +			btrfs_qgroup_free(root, dropped * root->nodesize);
>
> This needs to be num_bytes + dropped * root->nodesize. Thanks,

Let me try to explain why it did not free num_bytes here.

In out_fail, we did not reserve num_bytes in the qgroup successfully, so we
do not need to free it in out_fail. The problem this patch attempts to
solve is that, when we run into out_fail here, we will drop an outstanding
extent. That said, in out_fail here, the extra reserved nodesize for some
extents should be freed.

Example:
1). BTRFS_I(inode)->reserved_extents: 2,
    BTRFS_I(inode)->outstanding_extents: 1.
    In this case, we go into btrfs_delalloc_reserve_metadata().
    outstanding_extents will be increased at first, so
    BTRFS_I(inode)->outstanding_extents is 2. If we want to reserve space
    and fail, it will goto out_fail.
2). In out_fail: reserved_extents is 2, outstanding_extents is 2.
    We will get a dropped of 1 from drop_outstanding_extent().
    And now, reserved_extents: 1, outstanding_extents: 1.

In step 2, we just decrease reserved_extents without freeing the related
nodesize in the qgroup at the same time. So it will cause the problem I
described in the changelog, which will eat the space.

Therefore, this patch will free the nodesize related to the dropped extents
in step 2. About num_bytes: as we did not reserve it successfully, there is
no need to free it.

With my poor English, there must be something confusing in my description.
Please correct me if anything is wrong or not well explained.

Thanx
Yang

> Josef
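The two-step example above can be traced with a small model. This is a simplified sketch, not the kernel implementation: the struct and the `model_` functions are hypothetical stand-ins for the inode counters and the drop logic described in the message, and the 16384-byte nodesize in the usage note is an assumed value.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two per-inode counters in the example. */
struct extent_counters {
	unsigned int reserved_extents;
	unsigned int outstanding_extents;
};

/*
 * Model of the drop on the failure path: give back one outstanding
 * extent, then trim reserved_extents down to outstanding_extents and
 * report how many reservations were dropped.
 */
static unsigned int model_drop_outstanding_extent(struct extent_counters *c)
{
	unsigned int dropped;

	c->outstanding_extents--;
	if (c->outstanding_extents >= c->reserved_extents)
		return 0;

	dropped = c->reserved_extents - c->outstanding_extents;
	c->reserved_extents -= dropped;
	return dropped;
}

/* Bytes the patch would hand back to the qgroup for the dropped extents. */
static uint64_t model_qgroup_bytes_to_free(unsigned int dropped,
					   uint32_t nodesize)
{
	return (uint64_t)dropped * nodesize;
}
```

Starting from reserved_extents = 2 and outstanding_extents = 1, incrementing outstanding_extents (step 1) and then calling `model_drop_outstanding_extent()` (step 2) returns 1 and leaves both counters at 1, so one nodesize worth of qgroup reservation is freed and num_bytes is left alone.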
Re: [PATCH 2/2] Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
On 12/09/2014 11:55 PM, Josef Bacik wrote:
> On 12/09/2014 06:27 AM, Dongsheng Yang wrote:
>> Currently, for pre_alloc or delay_alloc, the bytes will be accounted
>> in space_info by the three guys.
>> space_info->bytes_may_use --- space_info->reserved --- space_info->used.
>> But on the other hand, in qgroup, there are only two counters to account
>> the bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
>> bytes in space_info->bytes_may_use and qg->excl accounts bytes in
>> space_info->used. So the bytes in space_info->reserved is not accounted
>> in qgroup. If so, there is a window we can exceed the quota limit when
>> bytes is in space_info->reserved.
>>
>> Example:
>> # btrfs quota enable /mnt
>> # btrfs qgroup limit -e 10M /mnt
>> # for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
>> # sync
>> # btrfs qgroup show -pcre /mnt
>> qgroupid rfer     excl     max_rfer max_excl parent child
>> -------- ----     ----     -------- -------- ------ -----
>> 0/5      20987904 20987904 0        10485760 ---    ---
>>
>> qg->excl is 20987904 larger than max_excl 10485760.
>>
>> This patch introduce a new counter named may_use to qgroup, then
>> there are three counters in qgroup to account bytes in space_info
>> as below.
>> space_info->bytes_may_use --- space_info->reserved --- space_info->used.
>> qgroup->may_use           --- qgroup->reserved     --- qgroup->excl
>>
>> With this patch applied:
>> # btrfs quota enable /mnt
>> # btrfs qgroup limit -e 10M /mnt
>> # for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
>> fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
>> fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
>> # sync
>> # btrfs qgroup show -pcre /mnt
>> qgroupid rfer    excl    max_rfer max_excl parent child
>> -------- ----    ----    -------- -------- ------ -----
>> 0/5      9453568 9453568 0        10485760 ---    ---
>>
>> Reported-by: Cyril SCETBON
>> Signed-off-by: Dongsheng Yang
>> ---
>>  fs/btrfs/extent-tree.c | 25 ++-
>>  fs/btrfs/inode.c       | 22 +++-
>>  fs/btrfs/qgroup.c      | 68 +++---
>>  fs/btrfs/qgroup.h      | 4 +++
>>  4 files changed, 113 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 014b7f2..9eaf268 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -5500,8 +5500,13 @@ static int pin_down_extent(struct btrfs_root *root,
>>  	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
>>  			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
>> -	if (reserved)
>> +	if (reserved) {
>> +		if (root->fs_info->quota_enabled)
>
> You already have this check in btrfs_qgroup_update_reserved_bytes, just
> call it unconditionally everywhere in this patch. Otherwise this looks
> good, thanks,

Thanx, I will update it in V2.

> Josef
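The before and after numbers in the example are consistent with simple arithmetic, which the sketch below reproduces. It is a toy model, not the patch: the struct, the `model_` functions and the -1 error value are hypothetical, the intermediate `reserved` stage is collapsed into a single commit step, and the 16384-byte starting value of excl is an assumption inferred from 9453568 = 9 * 1048576 + 16384.

```c
#include <stdint.h>

/* Toy qgroup with the three counters named in the patch description. */
struct qgroup_model {
	uint64_t may_use;
	uint64_t reserved;
	uint64_t excl;
	uint64_t max_excl;
};

/* Refuse a reservation if all three counters plus the request exceed the limit. */
static int model_qgroup_reserve(struct qgroup_model *qg, uint64_t bytes)
{
	uint64_t accounted = qg->may_use + qg->reserved + qg->excl;

	if (accounted + bytes > qg->max_excl)
		return -1; /* stands in for "Disk quota exceeded" */
	qg->may_use += bytes;
	return 0;
}

/* Move bytes from may_use to excl, as if the data had reached disk. */
static void model_qgroup_commit(struct qgroup_model *qg, uint64_t bytes)
{
	qg->may_use -= bytes;
	qg->excl += bytes;
}

/*
 * Issue `count` allocations of `size` bytes each. Returns the index of
 * the first one that was refused, or `count` if none was.
 */
static int model_run_fallocates(struct qgroup_model *qg, int count,
				uint64_t size)
{
	int first_failure = count;

	for (int i = 0; i < count; i++) {
		if (model_qgroup_reserve(qg, size) != 0) {
			if (first_failure == count)
				first_failure = i;
			continue;
		}
		model_qgroup_commit(qg, size);
	}
	return first_failure;
}
```

With excl starting at 16384 and max_excl at 10485760, twenty 1 MiB requests succeed for indices 0 to 8 and fail from index 9 onward, leaving excl at 9453568, which matches the data9 to data19 failures and the final qgroup show output quoted above.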