Re: [PATCH 01/18] btrfs-progs: btrfs-debug-tree: add option -f for "block only"

2014-12-10 Thread Qu Wenruo


 Original Message 
Subject: [PATCH 01/18] btrfs-progs: btrfs-debug-tree: add option -f for 
"block only"

From: 
To: 
Date: 2014-12-11 04:51

From: Martin Wilck 

btrfs-debug-tree prints only the given block. It is sometimes
useful to be able to print the subtree under this block.
This patch enables this behavior with the option "-f".

Signed-off-by: Martin Wilck 
---
  btrfs-debug-tree.c |   10 ++++++++--
  1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index e46500d..e61c71c 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -41,6 +41,8 @@ static int print_usage(void)
fprintf(stderr, "\t-u : print info of uuid tree only\n");
fprintf(stderr, "\t-b block_num : print info of the specified block"
  " only\n");
+   fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
+   " block\n");
fprintf(stderr,
"\t-t tree_id : print only the tree with the given id\n");
fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
@@ -137,6 +139,7 @@ int main(int ac, char **av)
int roots_only = 0;
int root_backups = 0;
u64 block_only = 0;
+   int block_follow = 0;
struct btrfs_root *tree_root_scan;
u64 tree_id = 0;
  
@@ -144,7 +147,7 @@ int main(int ac, char **av)
  
  	while(1) {

int c;
-   c = getopt(ac, av, "deb:rRut:");
+   c = getopt(ac, av, "defb:rRut:");
if (c < 0)
break;
switch(c) {
@@ -167,6 +170,9 @@ int main(int ac, char **av)
case 'b':
block_only = arg_strtou64(optarg);
break;
+   case 'f':
+   block_follow = 1;
+   break;
case 't':
tree_id = arg_strtou64(optarg);
break;
@@ -211,7 +217,7 @@ int main(int ac, char **av)
(unsigned long long)block_only);
goto close_root;
}
-   btrfs_print_tree(root, leaf, 0);
+   btrfs_print_tree(root, leaf, block_follow);
Although it is not a bug in your patch, would you please fix the extent
buffer leak by adding a free_extent_buffer(buf)?

Thanks,
Qu

goto close_root;
}
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-10 Thread Duncan
Robert White posted on Wed, 10 Dec 2014 14:18:55 -0800 as excerpted:

> So I started looking at the mkfs.btrfs manual page with an eye towards
> documenting some of the tidbits like metadata automatically switching
> from dup to raid1 when more than one device is used.
> 
> In experimenting I ended up with some questions...
> 
> (1) why is the dup profile for data restricted to only one device and
> only if it's mixed mode?

> (2) why is metadata dup profile restricted to only one device on
> creation when it will run that way just fine after a device add?

1 and 2 together since they both deal with dup mode...

Dup mode was apparently originally considered purely an extra safeguard 
for metadata in the single-device case, where it was made the default 
(except for SSDs, which default to single mode metadata on a single-
device filesystem, because the FTL voids any guarantees on location 
anyway, and because firmware such as sandforce compresses and dedups 
anyway, in which case the hardware/firmware is subverting btrfs' efforts 
to do dup anyway).

In the single-device case, two copies of data was considered simply not 
worth the cost, due both to doubling the size (especially on SSD where 
size is money!) and to the speed penalties on spinning rust due to seeks 
between one 1-GiB data-chunk and its dup.

With multi-device, raid1 metadata, forcing one copy to each of two 
different devices, was considered enough superior to make that the 
default, since that provided device-loss resiliency for the all-important 
metadata, thus enabling recovery of at least /some/ files even with a 
device missing (single-mode data where the file's extents all happened to 
be on available devices, plus of course raid1, etc, data).  Further, dup-
mode metadata was considered a mistake it was better not to even have 
available as an option, since loss of a single device would likely kill 
the filesystem, which made dup mode little better than single mode, 
without the doubled-size-cost.  Further, on spinning rust there'd again 
be the seek penalty, to little benefit since dup mode provides no 
guarantees in case of device loss.

So multi-device defaults to raid1 metadata for safety, but single mode 
metadata remains an option (along with raid0) if you really /don't/ care 
about losing everything due to loss of a single device.  Single-device 
simply makes dup-mode available (and the default) for metadata, as a poor-
man's substitute for the safety of raid1, but single-device-metadata is 
the only case where that poor-man's-raid1-substitute is worth the 
(considered extreme) cost, with usage of that option not even available 
on multi-device as it'd be a near-certain mistake, certainly at the mkfs 
level.  And dup mode isn't ordinarily available for data even on single-
device, because it's considered not worth the cost.

As for dup-mode working after device-add, that's simply a necessary bit 
in order for device add to work from a default-dup-mode single-device 
at all.  And it's only the existing metadata chunks on the original 
device that will be dup-mode.  Once a second device is added, additional 
metadata chunks will be written in raid1 mode, forcing the two chunk 
copies to different devices since there's multiple devices available to 
allow that.  The clear intent and recommendation is to do a rebalance 
ASAP after a device add, to spread usage to the new device as 
appropriate.  And of course that rebalance will use the new raid1 
metadata defaults, unless told otherwise of course, and I don't believe 
dup mode is available to tell it otherwise there, either.


What all that original reasoning fails to account for, however, is the 
btrfs data/metadata checksumming and integrity features and the very high 
value (which the original btrfs mode designers would obviously have 
considered extreme) that some users (including me) place on them.  While a 
multi-device dup-mode-metadata choice at mkfs is arguably still a mistake 
(the cost of raid1 metadata without the benefit: near the risk of single 
metadata, but at double the size), dup-mode data combined with btrfs 
checksumming and data integrity features on a single device has strong 
data integrity benefits that some would definitely consider worth it, even 
at the additional cost in speed on spinning rust due to seeking, and in 
size on expensive SSDs.

Meanwhile, mixed-bg-mode was an after-thought, added much later (after my 
own btrfs journey began) in order to make working with small 
filesystems reasonable.  Before mixed-bg-mode, people attempting to use 
btrfs on sub-GiB devices often found they couldn't use all available 
space (often 25-50% wasted!) as the separate data/metadata chunk 
allocation was simply too large grained to properly deal with the small 
sizes involved.

And small filesystems really _was_ mixed-mode's _entire_ purpose.  That 
it could additionally be used to allow dup-data, using the ability to 
specify mixed-bg-mode even on > 1 GiB filesystems where it

Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Patrik Lundquist
On 10 December 2014 at 23:28, Robert White  wrote:
> On 12/10/2014 10:56 AM, Patrik Lundquist wrote:
>>
>> On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>>>
>>> Assuming no snapshots still contain the file, of course, and that the
>>> ext* saved subvolume has already been deleted.
>>
>> Got no snapshots or subvolumes. Keeping it simple for now.
>
> Does that mean that you have already manually removed the subvolume that was
> automatically created by btrfs-convert?

Yes.


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Duncan
Patrik Lundquist posted on Wed, 10 Dec 2014 21:11:52 +0100 as excerpted:

>> Patrik, assuming no btrfs snapshots yet, can you do a du --all --block-
>> size=1M | sort -n (or similar), then take a look at all results over
>> 1024 (1 GiB since the du specified 1 MiB blocks), and see if it's
>> reasonable to move all those files out of the filesystem and back?
> 
> Good idea, but it's quite a lot of files. I'd rather start over.
> 
> But I've identified 46 files from Btrfs errors in syslog and will try to
> move them to another disk. They're ranging from 41KiB to 6.6GiB in size.

There's one as yet incomplete piece of the puzzle.  I guess the devs 
could probably answer this, but being a simple sysadmin, I don't claim to 
read code well and don't know...

That log snippet you quoted earlier gave block-group addresses.  That's 
the chunks, in this case normally 1 GiB data chunks, but here we're 
dealing with a conversion from ext4 and apparently the extents are 
larger, nearly 2 GiB in this case according to that snippet.

That had me thinking the problem files were all > 1 GiB and had these 
super-extents that btrfs can't work with.

But you say you tracked down the file as I suggested using btrfs-inspect-
internal, and the file is much smaller than that.

Now I don't even know for sure what that log snippet was from, a normal 
dmesg during an attempted balance, or dmesg with btrfs debug turned on in 
the kernel, or the userspace debug you ask about, or...

And not being a dev and not having done anything like this level myself, 
I'm sort of feeling my way along here too, trying to figure things out as 
you report them.

So the missing piece I'm talking about is this.  OK, we have the address 
of a nearly 2 GiB block group reported, and I recalled seeing in an 
earlier post that trick with btrfs-inspect-internal, so I thought to try 
it here.

But with the file being so much smaller than the 2 GiB block group 
reported, something's not matching.  Either the file is somehow using an 
extent much much larger than it is (possible with fallocate, then writing 
a shorter file, I believe), or the referred to block group actually 
contains more than one file -- certainly btrfs data chunks can do so, but 
given that we're dealing with a conversion here, I don't know if the same 
rules apply, or...

Anyway, it's possible that smaller file is simply the first one in the 
block group, thus being the one that was mapped when you plugged that 
address into inspect-internal, and that the problem file is actually a 
much larger file located after it in the same block group.

So if moving the small files doesn't do the trick, try feeding 
inspect-internal with an address after that.  Given that btrfs blocks are 
4 KiB in size, round the size of the small file up to the nearest 4 KiB 
and add that to the address originally obtained from the log, and see if 
inspect-internal, given the new offset address, points at a different, 
presumably much larger file (> 1 GiB, or at least big enough that it'd 
extend more than a GiB beyond the original address).  If so, try moving 
/that/ file, and see if you have any better luck.

I was /hoping/ it would be the simple case and all the problem block-
group addresses would point to > 1 GiB files and moving them would be 
it.  But with a significant number of those addresses pointing at far 
smaller files, either I was wrong about the use of inspect-internal here 
and they're entirely unrelated, or the situation is otherwise rather more 
complex than I was hoping to be the case.

OTOH, if for whatever reason all those smaller files were fallocated to 
some huge size and then written smaller, or something similar happened 
such that they're using huge > 1 GiB extents even while being smaller 
than 1 GiB in size, that COULD go some distance to explaining why defrag 
missed them.  If defrag is looking at filesize and the files happen to be 
small but in huge extents, and it's those extents causing the problem, 
then we just found our bug, and all that's left is figuring out how to 
fix it, which is where I step out and the devs step in.  With a bit of 
luck, that's it, and we're now well on the way to fixing a bug that could 
have otherwise triggered unexplained problems for some people doing 
conversions, but not others, for quite some time to come.  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Duncan
Robert White posted on Wed, 10 Dec 2014 14:28:10 -0800 as excerpted:

> On 12/10/2014 10:56 AM, Patrik Lundquist wrote:
>> On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Assuming no snapshots still contain the file, of course, and that the
>>> ext* saved subvolume has already been deleted.
>>
>> Got no snapshots or subvolumes. Keeping it simple for now.
> 
> Does that mean that you have already manually removed the subvolume that
> was automatically created by btrfs-convert?

Yes, he had.

Patrik, correct me if I have this wrong, but filling in the history as I 
believe I have it...

If I'm keeping my cases straight, he had actually posted a thread some 
weeks ago with the initial problem, saying he had followed the conversion 
instructions to the letter -- conversion, delete-saved, defrag, balance, 
and ran into this problem with balance.  The conclusion at that time was 
that he'd try successively larger balance -dusage=N figures, hoping to 
work thru it that way.
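
For reference, the "successively larger -dusage=N" approach could be 
scripted roughly as follows.  This is a sketch, not what Patrik actually 
ran: the function name incremental_balance and the DRY_RUN switch are 
assumptions made for the example, and the list of percentages is up to 
the caller.

```shell
#!/bin/sh
# Sketch only: incremental_balance and DRY_RUN are hypothetical names.
# Runs a data balance once per usage threshold, in the order given.
# With DRY_RUN=1 (the default here) the commands are only printed.
incremental_balance() {
  mnt=$1
  shift
  for pct in "$@"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "btrfs balance start -dusage=$pct $mnt"
    else
      # Stop at the first threshold that fails so the error is visible.
      btrfs balance start -dusage="$pct" "$mnt" || return 1
    fi
  done
}

# Example: incremental_balance /mnt 5 10 25 50 75 99
```

Stopping at the first failing threshold makes it easy to see which usage 
level is the sticking point.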

That original thread could well have been shortly before you appeared on 
the list, however, and you may not have seen it.  Either that, or you saw 
it but didn't connect that case with this one.

Anyway, yes, assuming I haven't gotten my casefiles mixed up, and 
evidence so far is that I haven't, he did everything he was supposed to 
and still ended up with this issue.  Obviously there's still a bug 
somewhere.

And now he's back.  The incrementally increasing usage= balances reached 
99%, but that last 1% is the sticking point and he, and the rest of us, 
are trying to figure out what happened and how to get him past it.




Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Duncan
Patrik Lundquist posted on Wed, 10 Dec 2014 21:11:52 +0100 as excerpted:

> Is btrfs-debug-tree -e useful in finding problematic files?

Since you were replying directly to me, my answer...

ENOTENOUGHINFO

I don't know enough about it to honestly say, as I've never used it 
myself and haven't seen anyone posting practical usage that I could make 
note of in case I or someone else needed it later.




Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Duncan
Dongsheng Yang posted on Wed, 10 Dec 2014 23:02:15 +0800 as excerpted:

>> And in the example, the mkfs was supplied with two devices, so there's
>> no dup metadata remaining from a formerly single-device filesystem,
>> either. (Tho there will be the small single-mode stubs, empty,
>> remaining from the mkfs process, as no balance has been run to delete
>> them yet, but those are much smaller and empty.)
> 
> Yes. One question not related here: how about delete them in the end of
> mkfs?

GB covered the old, manual balance method.  Do a btrfs balance -dusage=0
-musage=0 (or whatever, someone posted his recipe doing the same thing 
except with the single profiles instead of zero usage), and those stubs 
should disappear, as they're empty so there's nothing to rewrite when the 
balance does its thing and it simply removes them.
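
To check whether the empty stubs are still around, the output of btrfs 
filesystem df can be filtered for profiles with nothing used.  A rough 
sketch, assuming output lines of the form "<Type>, <profile>: total=..., 
used=..." (the exact format may differ between btrfs-progs versions); 
the function name empty_profiles is made up for the example.

```shell
#!/bin/sh
# Sketch only: empty_profiles is a hypothetical helper name.
# Reads `btrfs filesystem df` output on stdin and prints the
# "<Type>, <profile>" part of every line whose used= value is zero.
# GlobalReserve is skipped because it is not a chunk allocation.
empty_profiles() {
  awk -F'used=' '
    $1 !~ /^GlobalReserve/ && $2 ~ /^0\.00B?$/ {
      sub(/:.*/, "", $1)
      print $1
    }'
}

# Intended use (needs a mounted btrfs filesystem):
#   btrfs filesystem df /mnt | empty_profiles
# Anything it lists could then be removed the manual way:
#   btrfs balance start -dusage=0 -musage=0 /mnt
```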

FWIW I actually have a mkfs helper script here that takes care of a bunch 
of site-default options such as dual-device raid1 both data/metadata, 
skinny-metadata, etc, and it actually prompts for a mountpoint (assuming 
it's already setup in fstab) and will do an immediate mount and balance 
usage=0 to eliminate the stubs if that mountpoint is filled in, again 
assuming it appears in fstab as well.  Since I keep fully separate 
filesystems to avoid putting all my data eggs in the same not-yet-fully-
stable btrfs basket, and my backup system includes periodically blowing 
away the backup and (after booting to the new backup) the working copy 
with a fresh mkfs for a clean start, the mkfs helper script is useful, 
and since I was already doing that, it was reasonably simple to extend it 
to handle the mount and stub-killing balance immediately after the mkfs.


But at least in theory, that old manual method shouldn't be necessary 
with a current (IIRC 3.18 required) kernel, since btrfs should now 
automatically detect empty chunks and automatically rebalance to remove 
them as necessary.  However, I've been busy and haven't actually tried 
3.18 yet, and thus obviously haven't done a mkfs and mount of a fresh 
filesystem to see how long it actually takes to trigger and remove those 
stubs, so for all I know it takes awhile to kick in, and if people are 
bothered by the display of the stubs before it does, they can of course 
still do it the old way.




Re: out of space warning?

2014-12-10 Thread Qu Wenruo


 Original Message 
Subject: Re: out of space warning?
From: Robert White 
To: sys.syphus , 
Date: 2014-12-11 09:29

On 12/10/2014 02:54 PM, sys.syphus wrote:

I would like to avoid running out of space. is there a way to know
that I am getting close? i'd like to make a script that runs as part
of my bash prompt and lets me know when i am getting close. i know
there are several ways you can run out of space and I'd like to avoid
all of them.


Don't do that either. 8-)

(1) you'll grow to hate it.

(2) You know when you are doing things that take a lot of storage. Your 
instinct for system fullness is already part of your brain-meat.


(3) The system isn't going to explode if it runs out of disk space. 
(old UNIX systems used to halt with system errors because running out 
of space prevented pipelines from being created, but that's ancient 
history).


(4) The only _real_ way to run out of space is to be a data hoarder, 
and no script in the world is going to help you if that's the case. Ha 
Ha
(5) Possible known/unknown kernel bug may cause strange ENOSPC during 
balance/replace/scrub... :-)


You don't check your car's gas tank every time you put your foot on 
the brake, you don't want to check your free space every time your 
system finishes every tiny command you type.


Scripts like this are possible in bash, but consider that every "ls" 
or even just enter you type would be followed by a "df" and a "grep" 
or whatever in whatever window you are using at the time. etc.


IF you think you are going to run out of space, and you are using 
_any_ kind of window system, then start a system manager display for a 
while until you get the feel for how not out of space you really are.


Nothing gets ignored faster than a text element that essentially never 
changes, and once you get in the habit of ignoring the text you won't 
notice when it actually has something to say.




Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-10 Thread Zygo Blaxell
On Thu, Dec 11, 2014 at 10:05:20AM +0800, Qu Wenruo wrote:
> 
>  Original Message 
> Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
> From: Zygo Blaxell 
> To: Qu Wenruo 
> Date: 2014-12-11 05:57
> >On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> >>The main memory usage in btrfsck is extent record, which
> >>we can't free them until we read them all in and checked, so even we
> >>mmap/unmap, it can only help with
> >>the extent_buffer(which is already freed if not used according to refs).
> >I'm thinking aloud here, but is it *really* necessary to read everything
> >into memory?
> Totally agreed to only read what we need.
> But some backref and counts on refs can only be determined after a
> full scan, especially for leaf/node corruption
> case.

It might be faster (and smaller) to pipe them out to sort (with gzip/lzma
compression on temporary files) than to try to insert them in a tree.
I have used that technique in some of my deduplicating programs.  It can
cut the working set size by several orders of magnitude (trading it for
an O(n log n) sort, which will mostly read and write sequentially).

e.g. duplicate refs will all sort together, so when you are sequentially
reading the sorted data and the current key value changes, you know you've
seen everything that could be a duplicate, and can discard everything
in RAM.
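
A minimal sketch of that technique in shell, assuming the refs have 
already been dumped as one "key value" record per line (that record 
format and the function name duplicate_keys are assumptions for the 
example, not anything btrfsck produces today).  sort does the external 
merge sort with temporary files, and the awk stage keeps only a counter 
for the current key in memory.

```shell
#!/bin/sh
# Sketch only: duplicate_keys is a hypothetical helper name.
# Reads "key value" records on stdin, sorts them by key, then streams
# through the sorted data.  When the key changes, everything that could
# duplicate the previous key has been seen, so its state is dropped.
# Prints "key count" for every key that occurs more than once.
duplicate_keys() {
  LC_ALL=C sort -k1,1 | awk '
    $1 != cur { if (n > 1) print cur, n; cur = $1; n = 0 }
    { n++ }
    END { if (n > 1) print cur, n }
  '
}

# With GNU sort, temporary files can also be compressed, e.g.:
#   sort --compress-program=gzip -k1,1
```

The same loop structure works for any per-key check, as long as all the 
information needed for one key sorts together.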





Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-10 Thread Qu Wenruo


 Original Message 
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Zygo Blaxell 
To: Qu Wenruo 
Date: 2014-12-11 05:57

On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:

The main memory usage in btrfsck is extent record, which
we can't free them until we read them all in and checked, so even we
mmap/unmap, it can only help with
the extent_buffer(which is already freed if not used according to refs).

I'm thinking aloud here, but is it *really* necessary to read everything
into memory?

Totally agreed: we should only read what we need.
But some backrefs and ref counts can only be determined after a full 
scan, especially in the leaf/node corruption case.

   Maybe a multiple-pass algorithm might be possible, e.g. one
to find free space by eliminating any areas that are occupied by extents,
then other passes to rebuild the metadata in the free space.  Or, one
pass to verify the connectivity of references and collect dangling refs,
then a second pass which fixes only the dangling refs.
I have a similar idea, but instead of a multi-pass method it uses a 
per-sector scan plus tree search for the other data.
E.g. in the extent tree check, record only the extents of one block 
group at a time, and check them.
After the check, remove the good extents/block groups and then move on to 
the next block group.
For the fs tree, treat all keys with the same objectid (ino) as a group, 
read only one group at a time, and remove the records already known to be 
healthy. (Records whose info is not fully gathered, and bad records, will 
still stay in memory.)


But I don't think this method can really save much memory, though...


Usually sequential reads are significantly faster than swapping--even
if swapping on solid-state media.  It could be that reading 260GB of
metadata sequentially two or three times is still faster than thrashing
through random lookups in 20GB of swap on a 4GB machine.

Definitely, but if we want to reduce memory usage, it is almost 
unavoidable to do more disk IO, especially random disk IO, so it becomes 
a tradeoff, which may make the already slow fsck even slower.


Thanks,
Qu


Re: out of space warning?

2014-12-10 Thread Robert White

On 12/10/2014 02:54 PM, sys.syphus wrote:

I would like to avoid running out of space. is there a way to know
that I am getting close? i'd like to make a script that runs as part
of my bash prompt and lets me know when i am getting close. i know
there are several ways you can run out of space and I'd like to avoid
all of them.


Don't do that either. 8-)

(1) you'll grow to hate it.

(2) You know when you are doing things that take a lot of storage. Your 
instinct for system fullness is already part of your brain-meat.


(3) The system isn't going to explode if it runs out of disk space. (old 
UNIX systems used to halt with system errors because running out of 
space prevented pipelines from being created, but that's ancient history).


(4) The only _real_ way to run out of space is to be a data hoarder, and 
no script in the world is going to help you if that's the case. Ha Ha


You don't check your car's gas tank every time you put your foot on the 
brake, you don't want to check your free space every time your system 
finishes every tiny command you type.


Scripts like this are possible in bash, but consider that every "ls" or 
even just enter you type would be followed by a "df" and a "grep" or 
whatever in whatever window you are using at the time. etc.
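
For anyone who wants to see what such a check amounts to, here is a rough 
sketch that reads POSIX-format df output and prints a line only above a 
threshold.  The function name space_warning and the 90% figure are 
made up for the example, and plain df does not show btrfs-specific ways 
of running out of space (such as exhausted metadata chunks), so this 
covers only the simplest case.

```shell
#!/bin/sh
# Sketch only: space_warning is a hypothetical helper name.
# $1 is the threshold in percent; stdin is the output of `df -P`.
# In `df -P` output, column 5 is the capacity ("42%") and column 6 is
# the mount point (mount points containing spaces are not handled).
space_warning() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit)
      printf "WARNING: %s is %s%% full\n", $6, use
  }'
}

# Intended use, e.g. from a cron job rather than from every prompt:
#   df -P /home | space_warning 90
```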


IF you think you are going to run out of space, and you are using _any_ 
kind of window system, then start a system manager display for a while 
until you get the feel for how not out of space you really are.


Nothing gets ignored faster than a text element that essentially never 
changes, and once you get in the habit of ignoring the text you won't 
notice when it actually has something to say.




Re: Balance & scrub & defrag

2014-12-10 Thread Robert White

On 12/10/2014 02:15 PM, sys.syphus wrote:

I am working on a script that i can run daily that will do maintenance
on my btrfs mountpoints. is there any reason not to concurrently do
all of the above? possibly including discards as well.


also, is there anything existing currently that will do maintenance on
btrfs so i don't have to reinvent the wheel?

#!/bin/bash
btrfs filesystem defragment -r -v /media/btrfs/  &
btrfs scrub start /media/btrfs/ &
btrfs balance start /media/btrfs/ &


watch -d -n 30 "btrfs balance status /media/btrfs/; btrfs scrub status /media/btrfs/"



I'd recommend doing "none of the above" on a daily basis. One of the 
goals of the filesystem design is to remove the need for any of these 
operations on any regular basis. You are just going to bog down your 
system and increase your heat and wear profiles for no good reason.


Those tools should be used if you notice something fishy like recent 
decreases in efficiency or errors in your log files.


A _monthly_ scrub is maybe worth scheduling if you have a lot of churn 
in your disk contents.


Defragging should be done after significant content additions/changes 
(like replacing a lot of files via package management) and limited to 
the directories most likely changed.


Balancing is almost never necessary and can be anti-helpful if a file 
experiences random updates in batches (because the nicely packed file 
may end up far, far away from the active data extent where its COW 
events are taking place).


Resist the urge to tinker with production systems. The exposure 
(rewriting stable data is just the chance to destabilize your data, 
balancing your drive can take two files that always change together and 
put them far away from one another, etc) is not worth the nearly 
non-existent chance of benefit. Once the system is "good" just leave it 
that way until you notice something "not good" coming on the horizon.


If you feel you _must_ do these tasks then doing them all at once, where 
possible, will just make all of them take longer. If you are transcribing 
a file over on one side of the disk to defrag it, and you are 
transcribing an extent on the other side of the disk to balance it, you 
are just bouncing your disk heads back-and-forth and wasting wall-clock time.


So yea, it's not Windows, it doesn't need the defrag hammer.

Trying to over-manage the system will prevent it from seeking its 
dynamic (and so predictable) equilibrium.



Re: Possibility to have a "transient" snapshot?

2014-12-10 Thread Robert White

On 12/10/2014 11:52 AM, James West wrote:

I was just looking into using overlayfs, and although it has some
promise, I think its biggest drawback is the upperdir will have to be
some sort of storage backed filesystem. From my limited understanding of
tmpfs, it's not supposed to be the greatest with many large files (and
my system in particular would be downloading many large movies/videos,
and doing any kind of os update to test it would involve many changes
all over the volume, which could be problematic to commit to a golden
state.)

I could partition the main drive in 2 parts, and dynamically zero-out
then create the volume in the second partition on each boot, but I'm
still saving no drive writes, and not really extending the life of the
hardware (which is one of my premises.)


You are over-thinking the "transient" part way too much. If the 
underlying device is not an SSD then most of your wear is immaterial. 
And if it is an SSD, your wear is still pretty damn immaterial.


The "full weight" snapshots are plenty "transient" if you delete them 
between uses and they don't do any recursive copying so they are almost 
wear-free.


[If you want to wear out a hard disk, park its heads a lot. (The My 
Book series of WD external enclosures had a _horrible_ default of 
parking the heads after every eight seconds of idle time. Ouch.)]


A normal hard disk's runtime (provided it's not a lemon) is shorter than 
its mean write wear time anyway.


So the best thing to do in your case is to customize your initramfs to 
do what you need and then "hide" your stuff from normal use. Consider 
this (untested but) hyper-simple init script... (assumes busybox in the 
initramfs providing mount and a few of the other basic tools and the 
btrfs command).


--- snip ---
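#!/bin/bash
# Bring up /dev so the root device node exists.
mkdir -p /dev
mount -t devtmpfs none /dev
# Mount the top level of the btrfs volume to reach the subvolumes.
mkdir -p /scratch
mount -t btrfs /dev/sda1 /scratch
# Throw away the /active snapshot left over from the previous boot.
if [ -d /scratch/active ]; then
  btrfs subvol del /scratch/active
fi
# Take a fresh snapshot of the master image.
btrfs subvol snap /scratch/__Master /scratch/active
# Mount the new snapshot as the real root.
mkdir -p /root
mount -t btrfs -o subvol=/active /dev/sda1 /root
umount /scratch
rmdir /scratch
umount /dev
# switch_root has to run as PID 1, so replace this shell with it.
exec busybox switch_root /root /sbin/init "$@"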
--- snip ---


Every time you boot it makes a fresh snapshot of the /__Master subvolume 
of /dev/sda1 into /active, mounts that on /root, and then treats that as 
the root of the file system.

Estimated human-scale time to run this script is in the 
one-second-or-less range.


None of the files in /__Master are then visible to the running system, 
so they won't be subject to search via find or locate etc.


Problem solved.

When you want to do maintenance you can log into your box and do

mount /dev/sda1 /mnt

at which point /__Master is visible as /mnt/__Master.

You can do your backup snapshots and your maintenance via the /mnt view 
without perturbing your running system.


chroot /mnt/__Master /bin/bash

That gives you the "native view" of your master system in that shell. 
From that shell all your package tools will work just fine etc.


You can prep new or covariant systems in snapshots parallel to /__Master
and use rename to select the __Master for the next reboot.


Even better, since snapshots of snapshots are not degenerate in any way 
at all, you can create multiple system roots as /Whatever and 
/OtherThing (and so on) and always do your maintenance there. Then 
before any reboot you can snap /mnt/Whatever into /mnt/__Master (using 
the same technique as for /active) and then reboot. On that reboot the 
new /__Master will be the master for the new /active.
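
A minimal sketch of that promotion step, assuming the volume is mounted at /mnt as described above. The function name promote_master and the BTRFS override variable are illustrative only and not part of btrfs-progs; the only real commands used are "btrfs subvolume delete" and "btrfs subvolume snapshot". Like the init script, this has not been tried on a live system.

```bash
#!/bin/bash
# Sketch: replace <top>/__Master with a fresh snapshot of <top>/<src>.
# BTRFS is a hypothetical override so the commands can be dry-run
# with "echo" instead of touching a real volume.
BTRFS=${BTRFS:-btrfs}

promote_master() {
    local top=$1 src=$2
    if [ ! -d "$top/$src" ]; then
        echo "no such subvolume: $top/$src" >&2
        return 1
    fi
    # Same technique as for /active: drop the old copy, then snapshot.
    if [ -d "$top/__Master" ]; then
        $BTRFS subvolume delete "$top/__Master" || return 1
    fi
    $BTRFS subvolume snapshot "$top/$src" "$top/__Master"
}
```

Calling "promote_master /mnt Whatever" before a reboot would make /Whatever the source of the next /active. Prefixing the call with BTRFS=echo prints the commands instead of running them.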


All of the snapshot activities are almost instant (except for the 
cleanup of the previous /active if it's full of a lot of changes, but 
that will happen in the background so you don't have to care much about 
that time).



ASIDE: And I keep pointing people at it, but I do a lot of experimental 
boot behaviors while testing hardware and such for my job, and my 
baseline initramfs builder at http://underdog.sourceforge.net is easy to 
customize and plenty stable. It already sucks in the btrfs command
and friends, and you can take control of the boot-up to do experimental
tweaks by adding "bash" to the kernel boot args for an individual boot.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and

2014-12-10 Thread Qu Wenruo


 Original Message 
Subject: Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and
From: David Sterba 
To: Qu Wenruo 
Date: 2014-12-10 20:37

On Tue, Dec 09, 2014 at 04:27:19PM +0800, Qu Wenruo wrote:

The patchset introduces two new repair functions and some helpers to
achieve a huge goal:
   Repair btrfs whose fs tree's non-root leaf/node is corrupted when no
   duplication is valid.

The two new repair functions are:
   repair_inode_nlinks():
 Repair any inode nlink related problem.
 From fixing the nlink number and related
 inode_ref/dir_index/dir_item to recovering file name and file type
 and salvaging them into the lost+found dir.
 This not only fixes a case that some users reported but also
 cooperates with the repair_inode_no_item() function to salvage heavily
 damaged inodes into the lost+found dir.

   repair_inode_no_item():
 Repair case for a missing inode_item, which is quite common when
 a fs tree leaf/node is missing.
 This only does the inode item rebuild. Later recovery, like moving it
 to the lost+found dir, is done by repair_inode_nlinks().

The main helper is the repair_btree() function, which drops the
corrupted non-root leaf/node and rebalances the tree to keep the
correctness of the btree.

Sounds a bit intrusive, but under the circumstances I don't see anything
better to do.

A better non-destructive but less generic method may be introduced later.
My dream is to inspect each key and its item to rebuild each member, but
it would take a long, long time to implement.



With this patchset, even if a non-root leaf/node is corrupted and no
duplicate survived, btrfsck can still repair it to a mountable status.
(And normal rw should also be OK.)

The remaining unfixable problems will be inode nbytes error with file
extent discounts error, which may be fixed in next patchset.

Cc David:
Sorry for the huge change in the patchset and for merging the old inode
nlink repair with the new inode item rebuild patchset.

No problem, the incremental changelogs helped a lot.


While developing the inode item rebuild patchset, I found that the old nlink
repair cooperated very badly with item rebuild and that there was some
duplicated code between the two patchsets, not to mention the math lib
introduced by the nlink repair patch.
So I decided to somewhat rebase the nlink repair patchset to provide
better generality.

Great, the patchset looks good for merge, I'm adding it to 3.18. From
now on please send only incremental changes and not the whole patchset.
Thanks.

Thanks, this should be the last large update patchset.
Later work will focus on file extent recovery and should not interfere 
with this patch.


Thanks.
Qu


out of space warning?

2014-12-10 Thread sys.syphus
I would like to avoid running out of space. Is there a way to know
that I am getting close? I'd like to make a script that runs as part
of my bash prompt and lets me know when I am getting close. I know
there are several ways you can run out of space and I'd like to avoid
all of them.


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Robert White

On 12/10/2014 10:56 AM, Patrik Lundquist wrote:

On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:

Assuming no snapshots still contain the file, of course, and that the
ext* saved subvolume has already been deleted.


Got no snapshots or subvolumes. Keeping it simple for now.


Does that mean that you have already manually removed the subvolume that 
was automatically created by btrfs-convert?





mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-10 Thread Robert White
So I started looking at the mkfs.btrfs manual page with an eye towards 
documenting some of the tidbits like metadata automatically switching 
from dup to raid1 when more than one device is used.


In experimenting I ended up with some questions...

(1) why is the dup profile for data restricted to only one device and 
only if it's mixed mode?


Gust t # mkfs.btrfs -f /dev/loop{0..1} -d dup
Error: unable to create FS with data profile 16 (have 2 devices)

Gust t # mkfs.btrfs -f /dev/loop0 -d dup
Error: dup for data is allowed only in mixed mode


(2) why is metadata dup profile restricted to only one device on 
creation when it will run that way just fine after a device add?


Gust t # mkfs.btrfs -f /dev/loop{0..1} -m dup
Error: unable to create FS with metadata profile 32 (have 2 devices)

(3) why can I make a raid5 out of two devices? (I understand that we are 
currently just making mirrors, but the standard requires three devices 
in the geometry etc. So I would expect a two device RAID5 to be 
considered degraded with all that entails. It just looks like it's asking
for trouble to allow this once the support is finalized, as suddenly a
working RAID5 that's really a mirror would become something that can only
be mounted with the degraded flag.)


Gust t # mkfs.btrfs -f /dev/loop{0..1} -d raid5 -m raid5
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536

Turning ON incompat feature 'raid56': raid56 extended format
Performing full device TRIM (2.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 16384 leafsize 16384 sectorsize 4096 size 4.00GiB


(4) Same question for raid6 but with three drives instead of the 
mandated four.


(5) If I can make a RAID5 or RAID6 device with one missing element, why 
can't I make a RAID1 out of one drive, e.g. with one missing element?


(6) If I make a RAID1 out of three devices are there three copies of 
every extent or are there always two copies that are semi-randomly 
spread across three devices? (ibid for more than three).


---

It seems to me (very dangerous words in computer science, I know) that 
we need a "failed" device designator so that a device can be in the 
geometry (e.g. have a device ID) but not actually exist. Reads/writes to 
the failed device would always be treated as error returns.


The failed device would be subject to replacement with "btrfs dev 
replace", and could be the source of said replacement to drop a 
problematic device out of an array.


EXAMPLE:
Gust t # mkfs.btrfs -f /dev/loop0 failed -d raid1 -m raid1
Btrfs v3.17.1
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (2.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536

Processing explicitly missing device
adding device (failed) id 2 (phantom device)

mount /dev/loop0 /mountpoint

btrfs replace start 2 /dev/loop1 /mountpoint

(and so on)

Being able to "replace" a faulty device with a phantom "failed" device 
would nicely disambiguate the whole device add/remove versus replace 
mistake.


It would make the degraded status less mysterious.

A filesystem with an explicitly failed element would also make the 
future roll-out of full RAID5/6 less confusing.



Balance & scrub & defrag

2014-12-10 Thread sys.syphus
I am working on a script that I can run daily that will do maintenance
on my btrfs mountpoints. Is there any reason not to concurrently do
all of the above? Possibly including discards as well.


Also, is there anything existing currently that will do maintenance on
btrfs so I don't have to reinvent the wheel?

#!/bin/bash
btrfs filesystem defragment -r -v /media/btrfs/  &
btrfs scrub start /media/btrfs/ &
btrfs balance start /media/btrfs/ &


watch -d -n 30 "btrfs balance status /media/btrfs/; btrfs scrub status /media/btrfs/"
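
Whether these three operations are sensible to run at the same time is the open question here. Should they turn out to get in each other's way, a sequential variant is a small change. The sketch below is only that: run_maintenance and the BTRFS override variable are illustrative names, and it assumes "btrfs balance start" and "btrfs filesystem defragment" block until they finish, with "btrfs scrub start -B" keeping scrub in the foreground.

```bash
#!/bin/bash
# Sketch: run the three maintenance steps one after another instead of
# in parallel. BTRFS is a hypothetical override for dry runs.
BTRFS=${BTRFS:-btrfs}

run_maintenance() {
    local mnt=$1
    $BTRFS filesystem defragment -r -v "$mnt" || return 1
    # -B keeps scrub in the foreground so the next step waits for it.
    $BTRFS scrub start -B "$mnt" || return 1
    $BTRFS balance start "$mnt"
}
```

"run_maintenance /media/btrfs/" would then replace the three backgrounded commands, and "BTRFS=echo run_maintenance /media/btrfs/" only prints what would be run.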


Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?

2014-12-10 Thread Zygo Blaxell
On Thu, Dec 04, 2014 at 02:56:55PM +0800, Qu Wenruo wrote:
> The main memory usage in btrfsck is extent record, which
> we can't free them until we read them all in and checked, so even we
> mmap/unmap, it can only help with
> the extent_buffer(which is already freed if not used according to refs).

I'm thinking aloud here, but is it *really* necessary to read everything
into memory?  Maybe a multiple-pass algorithm might be possible, e.g. one
to find free space by eliminating any areas that are occupied by extents,
then other passes to rebuild the metadata in the free space.  Or, one
pass to verify the connectivity of references and collect dangling refs,
then a second pass which fixes only the dangling refs.

Usually sequential reads are significantly faster than swapping--even
if swapping on solid-state media.  It could be that reading 260GB of
metadata sequentially two or three times is still faster than thrashing
through random lookups in 20GB of swap on a 4GB machine.





[PATCH 10/18] btrfs restore: hide "offset is X" messages

2014-12-10 Thread mwilck
From: Martin Wilck 

Almost everyone who cares about her data will run btrfs restore
with -v. The "offset is" messages displayed will irritate users
because they reveal only btrfs internals. Users will think that
"offset" refers to a file offset and suspect severe corruption.

Limit these messages to verbose > 1.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 10bb8be..f1c63ed 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -315,7 +315,7 @@ static int copy_one_extent(struct btrfs_root *root, int fd,
if (compress == BTRFS_COMPRESS_NONE)
bytenr += offset;
 
-   if (verbose && offset)
+   if (verbose > 1 && offset)
printf("offset is %Lu\n", offset);
/* we found a hole */
if (disk_size == 0)
-- 
1.7.3.4



[PATCH 03/18] btrfs-progs: btrfs-debug-tree: fix usage message

2014-12-10 Thread mwilck
From: Martin Wilck 

Adapt usage message to the additional options introduced.

Signed-off-by: Martin Wilck 
---
 btrfs-debug-tree.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 7cdc368..4c1e835 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -31,22 +31,23 @@
 
 static int print_usage(void)
 {
-   fprintf(stderr, "usage: btrfs-debug-tree [-e] [-d] [-r] [-R] [-u]\n");
-   fprintf(stderr, "[-b block_num ] device\n");
+   fprintf(stderr, "usage: btrfs-debug-tree [-e] [-d] [-r] [-R] [-u] [-B]\n");
+   fprintf(stderr, "[-t tree_id] device\n");
+   fprintf(stderr, "   btrfs-debug-tree [-b block_num [-f]] device\n");
fprintf(stderr, "\t-e : print detailed extents info\n");
fprintf(stderr, "\t-d : print info of btrfs device and root tree dirs"
 " only\n");
fprintf(stderr, "\t-r : print info of roots only\n");
fprintf(stderr, "\t-R : print info of roots and root backups\n");
fprintf(stderr, "\t-u : print info of uuid tree only\n");
-   fprintf(stderr, "\t-b block_num : print info of the specified block"
-" only\n");
-   fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
-   " block\n");
fprintf(stderr,
"\t-t tree_id : print only the tree with the given id\n");
fprintf(stderr,
"\t-B nr: use root backup  instead of real root\n");
+   fprintf(stderr, "\t-b block_num : print info of the specified block"
+" only\n");
+   fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
+   " block\n");
fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
exit(1);
 }
-- 
1.7.3.4



[PATCH 13/18] btrfs restore: improve user-asking logic for files with many extents

2014-12-10 Thread mwilck
From: Martin Wilck 

The logic to ask after 1024 extents is broken. It unnecessarily
confuses users if big files are being restored, making them think
something is going wrong.

Change it to two cases: 1) no or little progress restoring,
2) writing beyond the file size.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |   18 +++---
 1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 80081b8..8ae3337 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -661,7 +661,7 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 #define MAYBE_NL (verbose && (next_pos >> display_shift) ? "\n" : "")
const u64 display_shift = 16;
struct stat st;
-
+   int dont_ask = 0;
path = btrfs_alloc_path();
if (!path) {
fprintf(stderr, "Ran out of memory\n");
@@ -697,9 +697,21 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
}
 
while (1) {
-   if (loops >= 0 && loops++ >= 1024) {
+   int problem = 0;
+   if (st.st_size == _INVALID_SIZE && next_pos > st.st_size) {
+   fprintf(stderr, "%swriting at offset %llu beyond size "
+   "of file (%llu)\n",
+   MAYBE_NL, next_pos, st.st_size);
+   problem = 1;
+   }
+   if ((++loops % 1024) == 0 && (next_pos / loops < 4096)) {
+   fprintf(stderr, "%smany loops (%d) and little progress "
+   "(%llu bytes)\n", 
+   MAYBE_NL, loops, next_pos);
+   problem = 1;
+   }
+   if (problem && !dont_ask && loops++) {
enum loop_response resp;
-
resp = ask_to_continue(file);
if (resp == LOOP_STOP)
break;
-- 
1.7.3.4



[PATCH 11/18] btrfs restore: print progress marks for big files

2014-12-10 Thread mwilck
From: Martin Wilck 

print a '+' for every 64k restored. This gives people more confidence
in long-running restore processes.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index f1c63ed..004c82e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -658,6 +658,8 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
int loops = 0;
u64 bytes_written, next_pos = 0ULL;
u64 total_written = 0ULL;
+#define MAYBE_NL (verbose && (next_pos >> display_shift) ? "\n" : "")
+   const u64 display_shift = 16;
struct stat st;
 
path = btrfs_alloc_path();
@@ -751,6 +753,10 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
printf("Weird extent type %d\n", extent_type);
}
total_written += bytes_written;
+   if (verbose && 
+   ((next_pos +  bytes_written) >> display_shift) > 
+   (next_pos >> display_shift))
+   fprintf(stderr, "+");
next_pos = found_key.offset + bytes_written;
if (ret) {
fprintf(stderr, "ERROR after writing %llu bytes\n",
@@ -764,6 +770,8 @@ next:
 
 set_size:
btrfs_free_path(path);
+
+   printf(MAYBE_NL);
if (get_xattrs) {
ret = set_file_xattrs(root, key->objectid, fd, file);
if (ret)
-- 
1.7.3.4
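
The boundary test in this patch can be tried out in isolation. The sketch below redoes the same shift comparison in bash arithmetic. crosses_mark is an illustrative name, not something in cmds-restore.c, and the shift of 16 mirrors display_shift from the diff. The check fires at most once per call, so a single large write that spans several 64k blocks still yields one mark.

```bash
#!/bin/bash
# Sketch of the progress-mark test: succeeds when writing "len" bytes
# starting at "pos" reaches or passes the next 64 KiB boundary.
display_shift=16

crosses_mark() {
    local pos=$1 len=$2
    (( ((pos + len) >> display_shift) > (pos >> display_shift) ))
}
```

For example, "crosses_mark 61440 4096" succeeds because the write ends exactly at 65536, while "crosses_mark 0 4096" does not.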



[PATCH 12/18] btrfs restore: check progress of file restoration

2014-12-10 Thread mwilck
From: Martin Wilck 

extents should be ordered by file offset. Expect no overlaps,
and report holes.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 004c82e..80081b8 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -739,6 +739,14 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
ret = -1;
goto set_size;
}
+   if (found_key.offset < next_pos) {
+   fprintf(stderr, "extent overlap, %llu < %llu\n",
+   found_key.offset, next_pos);
+   ret = -1;
+   goto set_size;
+   } else if (found_key.offset > next_pos)
+   fprintf(stderr, "hole at %llu (%llu bytes)\n",
+   next_pos, found_key.offset - next_pos);
 
bytes_written = 0ULL;
if (extent_type == BTRFS_FILE_EXTENT_PREALLOC)
-- 
1.7.3.4
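
The ordering check added here boils down to a three-way comparison between the offset of the extent just found and the position where the previous extent ended. A small sketch in bash arithmetic, with classify_extent as an illustrative name rather than anything in cmds-restore.c:

```bash
#!/bin/bash
# Sketch: compare the offset of the next extent with the end of the
# previous one (next_pos), mirroring the three outcomes in the patch.
classify_extent() {
    local next_pos=$1 offset=$2
    if (( offset < next_pos )); then
        echo "overlap"
    elif (( offset > next_pos )); then
        echo "hole at $next_pos ($(( offset - next_pos )) bytes)"
    else
        echo "contiguous"
    fi
}
```

In the patch itself only the overlap case aborts the copy; a hole is just reported and the loop carries on.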



[PATCH 09/18] btrfs restore: more graceful error handling in copy_file

2014-12-10 Thread mwilck
From: Martin Wilck 

Setting size and attributes of a file makes sense even if some
errors have occurred during recovery.

Also, do something useful with the number of bytes written.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |   27 ++-
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 8ecd896..10bb8be 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -715,7 +715,7 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
return ret;
} else if (ret) {
/* No more leaves to search */
-   btrfs_free_path(path);
+   ret = 0;
goto set_size;
}
leaf = path->nodes[0];
@@ -734,35 +734,36 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
if (compression >= BTRFS_COMPRESS_LAST) {
fprintf(stderr, "Don't support compression yet %d\n",
compression);
-   btrfs_free_path(path);
-   return -1;
+   ret = -1;
+   goto set_size;
}
 
+   bytes_written = 0ULL;
if (extent_type == BTRFS_FILE_EXTENT_PREALLOC)
goto next;
if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
ret = copy_one_inline(fd, path, found_key.offset,
  &bytes_written);
-   if (ret) {
-   btrfs_free_path(path);
-   return -1;
-   }
} else if (extent_type == BTRFS_FILE_EXTENT_REG) {
ret = copy_one_extent(root, fd, leaf, fi,
  found_key.offset, &bytes_written);
-   if (ret) {
-   btrfs_free_path(path);
-   return ret;
-   }
} else {
printf("Weird extent type %d\n", extent_type);
}
+   total_written += bytes_written;
+   next_pos = found_key.offset + bytes_written;
+   if (ret) {
+   fprintf(stderr, "ERROR after writing %llu bytes\n",
+   total_written);
+   ret = -1;
+   goto set_size;
+   }
 next:
path->slots[0]++;
}
 
-   btrfs_free_path(path);
 set_size:
+   btrfs_free_path(path);
if (get_xattrs) {
ret = set_file_xattrs(root, key->objectid, fd, file);
if (ret)
@@ -771,7 +772,7 @@ set_size:
}
 
set_fd_attrs(fd, &st, file);
-   return 0;
+   return ret;
 }
 
 static int search_dir(struct btrfs_root *root, struct btrfs_key *key,
-- 
1.7.3.4



[PATCH 14/18] btrfs restore: report mismatch in file size

2014-12-10 Thread mwilck
From: Martin Wilck 

A mismatch between the file size stored in the inode and the
number of bytes restored may indicate a problem.
restore reads data in 4k chunks, so it's normal that files are
truncated. Only emit the warning in unusual cases.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 8ae3337..3c4dc7a 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -799,6 +799,12 @@ set_size:
file);
}
 
+   if (st.st_size != _INVALID_SIZE && (st.st_size > next_pos || 
+   (st.st_size < next_pos &&
+(st.st_size >> 12) !=
+(next_pos >> 12) - 1)))
+   fprintf(stderr, "size mismatch: expected %llu, got %llu (written %llu)\n",
+   st.st_size, next_pos, total_written);   
set_fd_attrs(fd, &st, file);
return ret;
 }
-- 
1.7.3.4
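
To see which cases the new condition treats as unusual, the comparison can be replayed outside of restore. The sketch below copies the expression from the diff into bash arithmetic. size_mismatch is an illustrative name, and the _INVALID_SIZE guard is left out because its definition is not part of this patch.

```bash
#!/bin/bash
# Sketch of the size check: succeeds (status 0) when the warning would
# be printed. "size" is the file size from the inode, "next_pos" the
# end of the last extent written.
size_mismatch() {
    local size=$1 next_pos=$2
    (( size > next_pos || (size < next_pos && (size >> 12) != (next_pos >> 12) - 1) ))
}
```

A 5000-byte file restored up to position 8192 passes quietly ("size_mismatch 5000 8192" fails), whereas "size_mismatch 9000 8192" and "size_mismatch 100 8192" both succeed and would trigger the message.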



[PATCH 17/18] btrfs-progs: ctree.c: make bin_search non-static

2014-12-10 Thread mwilck
From: Martin Wilck 

I need it in btrfs-search-metadata

Signed-off-by: Martin Wilck 
---
 ctree.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ctree.c b/ctree.c
index 23399e2..1137312 100644
--- a/ctree.c
+++ b/ctree.c
@@ -602,8 +602,8 @@ static int generic_bin_search(struct extent_buffer *eb, unsigned long p,
  * simple bin_search frontend that does the right thing for
  * leaves vs nodes
  */
-static int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
- int level, int *slot)
+int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
+  int level, int *slot)
 {
if (level == 0)
return generic_bin_search(eb,
-- 
1.7.3.4



[PATCH 18/18] btrfs-progs: documentation for btrfs-raw and btrfs-search-metadata

2014-12-10 Thread mwilck
From: Martin Wilck 

Update documentation for btrfs-debug-tree, and add pages for
btrfs-search-metadata and btrfs-raw.

Signed-off-by: Martin Wilck 
---
 Documentation/Makefile  |2 +
 Documentation/btrfs-debug-tree.txt  |   10 +
 Documentation/btrfs-raw.txt |   54 +
 Documentation/btrfs-search-metadata.txt |   57 +++
 4 files changed, 123 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/btrfs-raw.txt
 create mode 100644 Documentation/btrfs-search-metadata.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index ef4f1bd..354c412 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -5,6 +5,8 @@ MAN8_TXT =
 MAN8_TXT += btrfs.txt
 MAN8_TXT += btrfs-convert.txt
 MAN8_TXT += btrfs-debug-tree.txt
+MAN8_TXT += btrfs-raw.txt
+MAN8_TXT += btrfs-search-metadata.txt
 MAN8_TXT += btrfs-find-root.txt
 MAN8_TXT += btrfs-image.txt
 MAN8_TXT += btrfs-map-logical.txt
diff --git a/Documentation/btrfs-debug-tree.txt b/Documentation/btrfs-debug-tree.txt
index 23fc115..69a547d 100644
--- a/Documentation/btrfs-debug-tree.txt
+++ b/Documentation/btrfs-debug-tree.txt
@@ -25,8 +25,18 @@ Print detailed extents info.
 Print info of btrfs device and root tree dirs only.
 -r::
 Print info of roots only.
+-u::
+Print info of UUID tree only.
+-R::
+Print info of roots and root backups.
+-t ::
+Only print the subvolume tree with given object ID.
+-B ::
+Start at backup root from superblock rather than current root.
 -b ::
 Print info of the specified block only.
+-f::
+Follow (descend) the (sub)tree rooted at the block given with -b.
 
 EXIT STATUS
 ---
diff --git a/Documentation/btrfs-raw.txt b/Documentation/btrfs-raw.txt
new file mode 100644
index 000..ae7bd2d
--- /dev/null
+++ b/Documentation/btrfs-raw.txt
@@ -0,0 +1,54 @@
+btrfs-raw(8)
+
+
+NAME
+
+btrfs-raw - low-level manipulation of btrfs meta data blocks.
+
+SYNOPSIS
+
+*btrfs-raw* [[-r|-w] ] 
+
+DESCRIPTION
+---
+*btrfs-raw* is used to dump the raw contents of the given logical block
+of a btrfs device to stdout, or write raw data read from stdin to the
+given logical block.
+
+*THIS TOOL IS DANGEROUS; IT MAY CORRUPT YOUR FILESYSTEM BEYOND REPAIR!!*
+
+Please exert extreme caution when writing modified blocks to disk. You
+should make a backup copy of the entire file system before doing so, even
+if the file system is already corrupted. Make sure you have a thorough
+understanding of the btrfs disk data structures before making any changes.
+
+*YOU USE THIS TOOL AT YOUR OWN RISK!*
+
+OPTIONS
+---
+-r ::
+Read the logical block starting at  and write the raw contents
+to stdout (caution, binary data).
+-w ::
+Write the logical block starting at  to disk, reading data from
+stdin. The tool will adjust the header checksum before writing to disk.
+
+EXIT STATUS
+---
+*btrfs-raw* will return 0 if no error happened.
+If any problems happened, 1 will be returned.
+
+EXAMPLE
+---
+
+`btrfs-raw -r 874991616 /dev/sda >/tmp/blob`
+
+`btrfs-raw -w 874991616 /dev/sda 
+
+DESCRIPTION
+---
+*btrfs-search-metadata* is used to dump the meta data of a device, or to 
+selectively dump nodes or leaves matching certain conditions.
+
+Unlike `btrfs-dump-tree`, this tool will also find tree "branches" that
+are disconnected from the root tree, and previous meta data copies. If a
+corruption occurs, this may be useful for finding old, still healthy copies.
+
+This is maybe useful for analyzing filesystem state or inconsistence and has
+a positive educational effect on understanding the internal structure.
+ is the device file where the filesystem is stored.
+
+OPTIONS
+---
+-k //::
+Search for leaves and nodes containing the given key.
+-g ::
+Search for leaves and nodes with the given generation (transid).
+-l ::
+Search for nodes with the given level, or leaves (level 0).
+-t ::
+Search for leaves and nodes with the given owner object ID.
+-L::
+dump full content of found nodes or leaves, like btrfs-debug-tree.
+
+EXIT STATUS
+---
+*btrfs-search-metadata* will return 0 if no error happened.
+If any problems happened, 1 will be returned.
+
+EXAMPLE
+---
+
+`btrfs-search-metadata -t 260 -l 0 -k 256/1/0 /dev/sda`
+
+Search the btrfs file system on `/dev/sda` for leaves belonging to
+subvolume 260 and containing the first inode item (type 1: inode item,
+object ID 256: first available object ID).
+
+See `ctree.h` in the btrfs source code and the btrfs Wiki for 
+assigned object IDs and key types.
+
+SEE ALSO
+
+`mkfs.btrfs`(8), `btrfs-debug-tree`(8)
-- 
1.7.3.4



[PATCH 08/18] btrfs restore: track number of bytes restored

2014-12-10 Thread mwilck
From: Martin Wilck 

Track the number of bytes read from extents and restored.
This is useful for detecting errors and corruptions.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index f9dab7e..8ecd896 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -222,7 +222,8 @@ again:
return 0;
 }
 
-static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
+static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos,
+  u64 *bytes_written)
 {
struct extent_buffer *leaf = path->nodes[0];
struct btrfs_file_extent_item *fi;
@@ -246,6 +247,7 @@ static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
compress = btrfs_file_extent_compression(leaf, fi);
if (compress == BTRFS_COMPRESS_NONE) {
done = pwrite(fd, buf, len, pos);
+   *bytes_written += done;
if (done < len) {
fprintf(stderr, "Short inline write, wanted %d, did "
"%zd: %d\n", len, done, errno);
@@ -269,6 +271,7 @@ static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
 
done = pwrite(fd, outbuf, ram_size, pos);
free(outbuf);
+   *bytes_written += done;
if (done < ram_size) {
fprintf(stderr, "Short compressed inline write, wanted %Lu, "
"did %zd: %d\n", ram_size, done, errno);
@@ -280,7 +283,8 @@ static int copy_one_inline(int fd, struct btrfs_path *path, u64 pos)
 
 static int copy_one_extent(struct btrfs_root *root, int fd,
   struct extent_buffer *leaf,
-  struct btrfs_file_extent_item *fi, u64 pos)
+  struct btrfs_file_extent_item *fi, u64 pos,
+  u64 *bytes_written)
 {
struct btrfs_multi_bio *multi = NULL;
struct btrfs_device *device;
@@ -410,6 +414,7 @@ again:
total += done;
}
 out:
+   *bytes_written += total;
free(inbuf);
free(outbuf);
return ret;
@@ -651,6 +656,8 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
int extent_type;
int compression;
int loops = 0;
+   u64 bytes_written, next_pos = 0ULL;
+   u64 total_written = 0ULL;
struct stat st;
 
path = btrfs_alloc_path();
@@ -734,14 +741,15 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
if (extent_type == BTRFS_FILE_EXTENT_PREALLOC)
goto next;
if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
-   ret = copy_one_inline(fd, path, found_key.offset);
+   ret = copy_one_inline(fd, path, found_key.offset,
+ &bytes_written);
if (ret) {
btrfs_free_path(path);
return -1;
}
} else if (extent_type == BTRFS_FILE_EXTENT_REG) {
ret = copy_one_extent(root, fd, leaf, fi,
- found_key.offset);
+ found_key.offset, &bytes_written);
if (ret) {
btrfs_free_path(path);
return ret;
-- 
1.7.3.4



[PATCH 16/18] btrfs-progs: NEW: btrfs-search-metadata

2014-12-10 Thread mwilck
From: Martin Wilck 

A new tool for dumping all meta data (also unlinked nodes and leaves)
and searching nodes or leaves with certain properties.

Signed-off-by: Martin Wilck 
---
 Makefile|2 +-
 btrfs-search-metadata.c |  224 +++
 2 files changed, 225 insertions(+), 1 deletions(-)
 create mode 100644 btrfs-search-metadata.c

diff --git a/Makefile b/Makefile
index fe65867..c670f67 100644
--- a/Makefile
+++ b/Makefile
@@ -48,7 +48,7 @@ MAKEOPTS = --no-print-directory Q=$(Q)
 
 progs = mkfs.btrfs btrfs-debug-tree btrfs-raw btrfsck \
btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \
-   btrfs-find-root btrfstune btrfs-show-super
+   btrfs-find-root btrfstune btrfs-show-super btrfs-search-metadata
 
 progs_extra = btrfs-corrupt-block btrfs-fragments btrfs-calc-size \
  btrfs-select-super
diff --git a/btrfs-search-metadata.c b/btrfs-search-metadata.c
new file mode 100644
index 000..80dc326
--- /dev/null
+++ b/btrfs-search-metadata.c
@@ -0,0 +1,224 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include "kerncompat.h"
+#include "radix-tree.h"
+#include "ctree.h"
+#include "disk-io.h"
+#include "print-tree.h"
+#include "version.h"
+#include "utils.h"
+#include "volumes.h"
+
+static int print_usage(void)
+{
+   fprintf(stderr, "usage: btrfs-search-metadata [options] device\n");
+   fprintf(stderr, "\t-k //: search for given key\n");
+   fprintf(stderr, "\t-g : search for given generation (transid)\n");
+   fprintf(stderr, "\t-t : search for given tree\n");
+   fprintf(stderr, "\t-l : search for node level (0=leaf)\n");
+   fprintf(stderr, "\t-L: print full listing of matching leaf/node contents\n");
+   fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
+   exit(1);
+}
+
+int bin_search(struct extent_buffer *eb, struct btrfs_key *key,
+  int level, int *slot);
+
+static int do_one_block(struct btrfs_root *root, u64 block_nr, u64 tree_id,
+   u64 gen_id, int level, struct btrfs_key *key, int brief)
+{
+   struct extent_buffer *leaf;
+   int ret;
+   int slot = -1;
+   struct btrfs_disk_key disk_key;
+
+   leaf = read_tree_block(root,
+  block_nr,
+  root->leafsize, 0);
+
+   if (leaf && btrfs_header_level(leaf) != 0) {
+   free_extent_buffer(leaf);
+   leaf = NULL;
+   }
+
+   if (!leaf) {
+   leaf = read_tree_block(root,
+  block_nr,
+  root->nodesize, 0);
+   }
+   if (!leaf) {
+   fprintf(stderr, "failed to read %llu\n",
+   (unsigned long long)block_nr);
+   return -1;
+   }
+
+   ret = btrfs_is_leaf(leaf);
+   if (tree_id != 0 && tree_id != btrfs_header_owner(leaf))
+   goto out;
+   if (gen_id != 0 && gen_id != btrfs_header_generation(leaf))
+   goto out;
+   if (level != -1 && level != (int)btrfs_header_level(leaf))
+   goto out;
+
+   if (key && key->type != 0ULL) {
+   if (bin_search(leaf, key, btrfs_header_level(leaf), &slot))
+   goto out;
+   }
+
+   if (brief)
+   printf("%s %llu level %u items %d free %lu generation %llu owner %llu\n",
+  (ret ? "leaf" : "node"),
+  (unsigned long long)btrfs_header_bytenr(leaf),
+  btrfs_header_level(leaf),
+  btrfs_header_nritems(leaf),
+  (ret ? btrfs_leaf_free_space(root, leaf) :
+   (unsigned long)BTRFS_NODEPTRS_PER_BLOCK(root) -
+   btrfs_header_nritems(leaf)),
+  (u64)btrfs_header_generation(leaf),
+  (u64)btrfs_header_owner(leaf));
+   else
+   btrfs_print_tree(root, leaf, 0);
+
+   if (key->objectid != 0ULL) {
+   btrfs_cpu_key_to_disk(&disk_key, key);
+   printf("\t");
+   btrfs_print_key(&disk_key);
+   printf(" found @ slot %d in %s %llu\n", slot,
+  (ret 

[PATCH 15/18] btrfs-progs: NEW: btrfs-raw

2014-12-10 Thread mwilck
From: Martin Wilck 

This program can be used to dump a meta data block, fix it e.g.
using a hex editor, and write it back to disk, adapting the check
sum.

Signed-off-by: Martin Wilck 
---
 Makefile|2 +-
 btrfs-raw.c |  143 +++
 2 files changed, 144 insertions(+), 1 deletions(-)
 create mode 100644 btrfs-raw.c

diff --git a/Makefile b/Makefile
index 4cae30c..fe65867 100644
--- a/Makefile
+++ b/Makefile
@@ -46,7 +46,7 @@ endif
 
 MAKEOPTS = --no-print-directory Q=$(Q)
 
-progs = mkfs.btrfs btrfs-debug-tree btrfsck \
+progs = mkfs.btrfs btrfs-debug-tree btrfs-raw btrfsck \
btrfs btrfs-map-logical btrfs-image btrfs-zero-log btrfs-convert \
btrfs-find-root btrfstune btrfs-show-super
 
diff --git a/btrfs-raw.c b/btrfs-raw.c
new file mode 100644
index 000..1dfeed9
--- /dev/null
+++ b/btrfs-raw.c
@@ -0,0 +1,143 @@
+/*
+ * Copyright (C) 2007 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include 
+#include 
+#include 
+#include "kerncompat.h"
+#include "radix-tree.h"
+#include "ctree.h"
+#include "utils.h"
+#include "disk-io.h"
+
+static int print_usage(void)
+{
+   fprintf(stderr, "usage: btrfs-raw [ -r block|-w block] device\n");
+   exit(1);
+}
+
+static int read_block(struct btrfs_root *root, u64 block_nr,
+ struct extent_buffer **eb)
+{
+   struct extent_buffer *leaf;
+   leaf = read_tree_block(root,
+  block_nr,
+  root->leafsize, 0);
+   
+   if (leaf && btrfs_header_level(leaf) != 0) {
+   free_extent_buffer(leaf);
+   leaf = NULL;
+   }
+   
+   if (!leaf) {
+   leaf = read_tree_block(root,
+  block_nr,
+  root->nodesize, 0);
+   }
+   if (!leaf) {
+   fprintf(stderr, "failed to read %llu\n",
+   (unsigned long long)block_nr);
+   return -1;
+   }
+
+   *eb = leaf;
+   return btrfs_is_leaf(leaf) ? root->leafsize : root->nodesize;
+}
+
+int main(int ac, char **av)
+{
+   struct btrfs_root *root;
+   struct btrfs_fs_info *info;
+   struct extent_buffer *eb = NULL;
+   struct btrfs_trans_handle *trans = NULL;
+   u64 block = ~0ULL;
+   int len;
+   enum btrfs_open_ctree_flags flags = OPEN_CTREE_PARTIAL;
+   radix_tree_init();
+
+   while(1) {
+   int c;
+   c = getopt(ac, av, "r:w:");
+   if (c < 0)
+   break;
+   switch(c) {
+   case 'r':
+   block = arg_strtou64(optarg);
+   break;
+   case 'w':
+   flags |= OPEN_CTREE_WRITES;
+   block = arg_strtou64(optarg);
+   break;
+   default:
+   print_usage();
+   }
+   }
+   set_argv0(av);
+   ac = ac - optind;
+   if (check_argc_exact(ac, 1) || block == ~0ULL)
+   print_usage();
+
+   info = open_ctree_fs_info(av[optind], 0, 0, flags);
+   if (!info) {
+   fprintf(stderr, "unable to open %s\n", av[optind]);
+   exit(1);
+   }
+
+   root = info->fs_root;
+   if (!root) {
+   fprintf(stderr, "unable to open %s\n", av[optind]);
+   exit(1);
+   }
+
+   len = read_block(root, block, &eb);
+   if (eb->len != len) {
+   fprintf(stderr, "length mismatch: %u %d\n", eb->len, len);
+   return 1;
+   }
+
+   if (flags & OPEN_CTREE_WRITES) {
+   char buf[4];
+   int ret;
+   fprintf(stderr, "*** THIS MAY CORRUPT YOUR FILE SYSTEM ***\n");
+   fprintf(stderr, "*** Do you want to write logical block %llu "
+   "on device %s ?\n", block, av[optind]);
+   fprintf(stderr, "*** Type upper case \"yes\" to continue: ");
+   memset(buf, 0, 4);
+   ret = read(fileno(stderr), buf, 3);
+   if (strcmp(buf, "YES")) {
+   fprintf(stderr, "*** Aborted.\n");
+

[PATCH 06/18] btrfs restore: set uid/gid/mode/times

2014-12-10 Thread mwilck
From: Martin Wilck 

The current btrfs restore discards file attributes. This patch
sets them for regular files and directories, as found in the
meta data.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |  116 ---
 1 files changed, 101 insertions(+), 15 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 2f9b72d..5aa2167 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -549,6 +550,95 @@ out:
return ret;
 }
 
+#define _INVALID_SIZE ((off_t)~0ULL)
+static int stat_from_inode(struct stat *st, struct btrfs_root *root,
+  struct btrfs_key *key)
+{
+   static struct btrfs_path *path;
+   struct btrfs_inode_item *inode_item;
+   struct btrfs_timespec *ts;
+   struct extent_buffer *eb;
+
+   if (!path)
+   path = btrfs_alloc_path();
+
+   memset(st, 0, sizeof(*st));
+   st->st_size = _INVALID_SIZE;
+
+   if (!path) {
+   fprintf(stderr, "Ran out of memory\n");
+   return -ENOMEM;
+   }
+
+   if (btrfs_lookup_inode(NULL, root, path, key, 0)) {
+   btrfs_release_path(path);
+   return -ENOENT;
+   }
+
+   inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+   struct btrfs_inode_item);
+   eb = path->nodes[0];
+
+   st->st_size = btrfs_inode_size(eb, inode_item);
+   st->st_uid = btrfs_inode_uid(eb, inode_item);
+   st->st_gid = btrfs_inode_gid(eb, inode_item);
+   st->st_mode = btrfs_inode_mode(eb, inode_item);
+
+   ts = btrfs_inode_atime(eb, inode_item);
+   st->st_atim.tv_sec = ts->sec;
+   st->st_atim.tv_nsec = ts->nsec;
+
+   ts = btrfs_inode_mtime(eb, inode_item);
+   st->st_mtim.tv_sec = ts->sec;
+   st->st_mtim.tv_nsec = ts->nsec;
+
+   ts = btrfs_inode_ctime(eb, inode_item);
+   st->st_ctim.tv_sec = ts->sec;
+   st->st_ctim.tv_nsec = ts->nsec;
+
+   btrfs_release_path(path);
+   return 0;
+}
+
+static void set_fd_attrs(int fd, const struct stat *st, const char *file)
+{
+   struct timeval tv[2];
+   if (st->st_size == _INVALID_SIZE)
+   return;
+
+   tv[0].tv_sec = st->st_atim.tv_sec;
+   tv[0].tv_usec = st->st_atim.tv_nsec/1000;
+   tv[1].tv_sec = st->st_mtim.tv_sec;
+   tv[1].tv_usec = st->st_mtim.tv_nsec/1000;
+   if (S_ISREG(st->st_mode) && ftruncate(fd, st->st_size) == -1)
+   fprintf(stderr, "failed to set file size on %s\n",
+   file);
+   if (fchown(fd, st->st_uid, st->st_gid) == -1)
+   fprintf(stderr, "failed to set uid/gid on %s\n",
+   file);
+   if (fchmod(fd, st->st_mode) == -1)
+   fprintf(stderr, "failed to set permissions on %s\n",
+   file);
+   if (futimes(fd, tv) == -1)
+   fprintf(stderr, "failed to set file times on %s\n",
+   file);
+}
+
+static int set_file_attrs(const char *output_rootdir, const char *file,
+ const struct stat *st)
+{
+   int fd;
+   static char path[4096];
+   snprintf(path, sizeof(path), "%s%s", output_rootdir, file);
+   fd = open(path, O_RDONLY|O_NOATIME);
+   if (fd == -1) {
+   fprintf(stderr, "failed to open %s\n", path);
+   return -1;
+   }
+   set_fd_attrs(fd, st, path);
+   close(fd);
+   return 0;
+}
 
 static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
 const char *file)
@@ -556,13 +646,12 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
struct extent_buffer *leaf;
struct btrfs_path *path;
struct btrfs_file_extent_item *fi;
-   struct btrfs_inode_item *inode_item;
struct btrfs_key found_key;
int ret;
int extent_type;
int compression;
int loops = 0;
-   u64 found_size = 0;
+   struct stat st;
 
path = btrfs_alloc_path();
if (!path) {
@@ -570,13 +659,7 @@ static int copy_file(struct btrfs_root *root, int fd, struct btrfs_key *key,
return -ENOMEM;
}
 
-   ret = btrfs_lookup_inode(NULL, root, path, key, 0);
-   if (ret == 0) {
-   inode_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
-   struct btrfs_inode_item);
-   found_size = btrfs_inode_size(path->nodes[0], inode_item);
-   }
-   btrfs_release_path(path);
+   stat_from_inode(&st, root, key);
 
key->offset = 0;
key->type = BTRFS_EXTENT_DATA_KEY;
@@ -672,16 +755,14 @@ next:
 
btrfs_free_path(path);
 set_size:
-   if (found_size) {
-   ret = ftruncate(fd, (loff_t)found_size);
-   if (ret)
-   return ret;
-   

[PATCH 00/18] Patch series related to my btrfs recovery

2014-12-10 Thread mwilck
From: Martin Wilck 

This patch series contains all changes I made to the btrfs tools
in the course of analyzing and repairing the corruption I described
in my other mail to linux-btrfs titled "A story of btrfs corruption
and recovery".

The bottom line of this patch set is: 1) have the tools continue with
error messages instead of aborting in certain error cases; and 
2) look for meta data outside the current trees. Both are useful if
the tree is internally corrupted in the way I described. I have also
added support for extracting inode meta data (times, permissions) 
in "btrfs restore"; this was also useful for my recovery case.

Please review and apply what you find useful.

Martin Wilck (18):
  btrfs-progs: btrfs-debug-tree: add option -f for "block only"
  btrfs-progs: btrfs-debug-tree: add option -B (backup root)
  btrfs-progs: btrfs-debug-tree: fix usage message
  btrfs-progs: btrfs-debug-tree: handle corruption more gracefully
  btrfs-progs: ctree.h: fix btrfs_inode_[amc]time
  btrfs restore: set uid/gid/mode/times
  btrfs restore: better output readability
  btrfs restore: track number of bytes restored
  btrfs restore: more graceful error handling in copy_file
  btrfs restore: hide "offset is X" messages
  btrfs restore: print progress marks for big files
  btrfs restore: check progress of file restoration
  btrfs restore: improve user-asking logic for files with many extents
  btrfs restore: report mismatch in file size
  btrfs-progs: NEW: btrfs-raw
  btrfs-progs: NEW: btrfs-search-metadata
  btrfs-progs: ctree.c: make bin_search non-static
  btrfs-progs: documentation for btrfs-raw and btrfs-search-metadata

 Documentation/Makefile  |2 +
 Documentation/btrfs-debug-tree.txt  |   10 ++
 Documentation/btrfs-raw.txt |   54 
 Documentation/btrfs-search-metadata.txt |   57 
 Makefile|4 +-
 btrfs-debug-tree.c  |   78 +--
 btrfs-raw.c |  143 
 btrfs-search-metadata.c |  224 +++
 cmds-restore.c  |  205 +++-
 ctree.c |4 +-
 ctree.h |   15 ++-
 print-tree.c|   22 +++-
 12 files changed, 752 insertions(+), 66 deletions(-)
 create mode 100644 Documentation/btrfs-raw.txt
 create mode 100644 Documentation/btrfs-search-metadata.txt
 create mode 100644 btrfs-raw.c
 create mode 100644 btrfs-search-metadata.c

-- 
1.7.3.4



[PATCH 04/18] btrfs-progs: btrfs-debug-tree: handle corruption more gracefully

2014-12-10 Thread mwilck
From: Martin Wilck 

This patch fixes the same thing in two different places. First,
the first of the two BUG() tests is just a special case of the
second one and can therefore be omitted. Second, instead of bailing
out with BUG(), just print a reasonable error message and check the
next child.

Signed-off-by: Martin Wilck 
---
 btrfs-debug-tree.c |   22 --
 print-tree.c   |   22 --
 2 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 4c1e835..d7c1155 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -73,13 +73,23 @@ static void print_extents(struct btrfs_root *root, struct extent_buffer *eb)
 btrfs_node_blockptr(eb, i),
 size,
 btrfs_node_ptr_generation(eb, i));
-   if (btrfs_is_leaf(next) &&
-   btrfs_header_level(eb) != 1)
-   BUG();
if (btrfs_header_level(next) !=
-   btrfs_header_level(eb) - 1)
-   BUG();
-   print_extents(root, next);
+   btrfs_header_level(eb) - 1) {
+   fprintf(stderr, "EXTENT TREE CORRUPTION detected at %llu, "
+   "slot %d pointing at %llu.\n"
+   "\tExpected child level: %d, found %d\n"
+   "\tExpected tree/transid: %llu/%llu,"
+   " found %llu/%llu\n",
+   eb->start, i, next->start,
+   btrfs_header_level(eb) - 1,
+   btrfs_header_level(next),
+   (unsigned long long)btrfs_header_owner(eb),
+   (unsigned long long)btrfs_header_generation(eb),
+   (unsigned long long)btrfs_header_owner(next),
+   (unsigned long long)
+   btrfs_header_generation(next));
+   } else
+   print_extents(root, next);
free_extent_buffer(next);
}
 }
diff --git a/print-tree.c b/print-tree.c
index 70a7acc..6769e20 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -1066,13 +1066,23 @@ void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *eb, int fol
(unsigned long long)btrfs_header_owner(eb));
continue;
}
-   if (btrfs_is_leaf(next) &&
-   btrfs_header_level(eb) != 1)
-   BUG();
if (btrfs_header_level(next) !=
-   btrfs_header_level(eb) - 1)
-   BUG();
-   btrfs_print_tree(root, next, 1);
+   btrfs_header_level(eb) - 1) {
+   fprintf(stderr, "TREE CORRUPTION detected at %llu, "
+   "slot %d pointing at %llu.\n"
+   "\tExpected child level: %d, found %d\n"
+   "\tExpected tree/transid: %llu/%llu,"
+   " found %llu/%llu\n",
+   eb->start, i, next->start,
+   btrfs_header_level(eb) - 1,
+   btrfs_header_level(next),
+   (unsigned long long)btrfs_header_owner(eb),
+   (unsigned long long)btrfs_header_generation(eb),
+   (unsigned long long)btrfs_header_owner(next),
+   (unsigned long long)
+   btrfs_header_generation(next));
+   } else
+   btrfs_print_tree(root, next, 1);
free_extent_buffer(next);
}
 }
-- 
1.7.3.4



[PATCH 05/18] btrfs-progs: ctree.h: fix btrfs_inode_[amc]time

2014-12-10 Thread mwilck
From: Martin Wilck 

Make btrfs_inode_[amc]time work like the other btrfs_inode_xxx
functions. The current definition appears broken to me; it never
returns a valid pointer unless an extent buffer address is added.

Signed-off-by: Martin Wilck 
---
 ctree.h |   15 +--
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/ctree.h b/ctree.h
index 89036de..1d5a5fc 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1414,27 +1414,30 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_flags,
 struct btrfs_inode_item, flags, 64);
 
 static inline struct btrfs_timespec *
-btrfs_inode_atime(struct btrfs_inode_item *inode_item)
+btrfs_inode_atime(struct extent_buffer *eb,
+ struct btrfs_inode_item *inode_item)
 {
unsigned long ptr = (unsigned long)inode_item;
ptr += offsetof(struct btrfs_inode_item, atime);
-   return (struct btrfs_timespec *)ptr;
+   return (struct btrfs_timespec *)(ptr + eb->data);
 }
 
 static inline struct btrfs_timespec *
-btrfs_inode_mtime(struct btrfs_inode_item *inode_item)
+btrfs_inode_mtime(struct extent_buffer *eb,
+ struct btrfs_inode_item *inode_item)
 {
unsigned long ptr = (unsigned long)inode_item;
ptr += offsetof(struct btrfs_inode_item, mtime);
-   return (struct btrfs_timespec *)ptr;
+   return (struct btrfs_timespec *)(ptr + eb->data);
 }
 
 static inline struct btrfs_timespec *
-btrfs_inode_ctime(struct btrfs_inode_item *inode_item)
+btrfs_inode_ctime(struct extent_buffer *eb,
+ struct btrfs_inode_item *inode_item)
 {
unsigned long ptr = (unsigned long)inode_item;
ptr += offsetof(struct btrfs_inode_item, ctime);
-   return (struct btrfs_timespec *)ptr;
+   return (struct btrfs_timespec *)(ptr + eb->data);
 }
 
 static inline struct btrfs_timespec *
-- 
1.7.3.4



[PATCH 07/18] btrfs restore: better output readability

2014-12-10 Thread mwilck
From: Martin Wilck 

Don't print the whole path for files, which mangles the output
for long path names. Instead, distinguish between directories
and files.

Signed-off-by: Martin Wilck 
---
 cmds-restore.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 5aa2167..f9dab7e 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -908,7 +908,7 @@ static int search_dir(struct btrfs_root *root, struct btrfs_key *key,
ret = 0;
}
if (verbose)
-   printf("Restoring %s\n", path_name);
+   printf("Restoring %s\n", filename);
if (dry_run)
goto next;
fd = open(path_name, O_CREAT|O_WRONLY, 0644);
@@ -982,7 +982,7 @@ static int search_dir(struct btrfs_root *root, struct btrfs_key *key,
}
 
if (verbose)
-   printf("Restoring %s\n", path_name);
+   printf("Searching directory %s\n", path_name);
 
errno = 0;
if (dry_run)
-- 
1.7.3.4



[PATCH 02/18] btrfs-progs: btrfs-debug-tree: add option -B (backup root)

2014-12-10 Thread mwilck
From: Martin Wilck 

Option -B causes btrfs-debug-tree to dump the tree rooted at
the backup root number given instead of the real root.

Signed-off-by: Martin Wilck 
---
 btrfs-debug-tree.c |   39 ++-
 1 files changed, 38 insertions(+), 1 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index e61c71c..7cdc368 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -45,6 +45,8 @@ static int print_usage(void)
" block\n");
fprintf(stderr,
"\t-t tree_id : print only the tree with the given id\n");
+   fprintf(stderr,
+   "\t-B nr: use root backup  instead of real root\n");
fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
exit(1);
 }
@@ -140,6 +142,7 @@ int main(int ac, char **av)
int root_backups = 0;
u64 block_only = 0;
int block_follow = 0;
+   int use_backup = -1;
struct btrfs_root *tree_root_scan;
u64 tree_id = 0;
 
@@ -147,7 +150,7 @@ int main(int ac, char **av)
 
while(1) {
int c;
-   c = getopt(ac, av, "defb:rRut:");
+   c = getopt(ac, av, "defb:rRut:B:");
if (c < 0)
break;
switch(c) {
@@ -176,6 +179,9 @@ int main(int ac, char **av)
case 't':
tree_id = arg_strtou64(optarg);
break;
+   case 'B':
+   use_backup = arg_strtou64(optarg);
+   break;
default:
print_usage();
}
@@ -221,6 +227,37 @@ int main(int ac, char **av)
goto close_root;
}
 
+   if (use_backup >= BTRFS_NUM_BACKUP_ROOTS) {
+   fprintf(stderr, "Invalid backup root number %d\n",
+   use_backup);
+   exit(1);
+   } else if (use_backup >= 0) {
+   u64 bytenr, generation;
+   u32 blocksize;
+   struct btrfs_super_block *sb = info->super_copy;
+   struct btrfs_root_backup *backup = sb->super_roots + use_backup;
+   struct extent_buffer *eb;
+   bytenr = btrfs_backup_tree_root(backup);
+   generation = btrfs_backup_tree_root_gen(backup);
+   blocksize = btrfs_level_size(info->tree_root,
+btrfs_super_root_level(sb));
+   eb = info->tree_root->node;
+   info->tree_root->node = read_tree_block(root, bytenr,
+   blocksize, generation);
+   free_extent_buffer(eb);
+   bytenr = btrfs_backup_chunk_root(backup);
+   generation = btrfs_backup_chunk_root_gen(backup);
+   eb =  info->chunk_root->node;
+   info->chunk_root->node = read_tree_block(root, bytenr,
+   blocksize, generation);
+   free_extent_buffer(eb);
+   if (!extent_buffer_uptodate(info->tree_root->node) ||
+   !extent_buffer_uptodate(info->chunk_root->node)) {
+   fprintf(stderr, "Couldn't backup root\n");
+   return 1;
+   }
+   }
+
if (!(extent_only || uuid_tree_only || tree_id)) {
if (roots_only) {
printf("root tree: %llu level %d\n",
-- 
1.7.3.4



[PATCH 01/18] btrfs-progs: btrfs-debug-tree: add option -f for "block only"

2014-12-10 Thread mwilck
From: Martin Wilck 

btrfs-debug-tree prints only the given block. It is sometimes
useful to be able to print the subtree under this block.
This patch enables this behavior with the option "-f".

Signed-off-by: Martin Wilck 
---
 btrfs-debug-tree.c |   10 --
 1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index e46500d..e61c71c 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -41,6 +41,8 @@ static int print_usage(void)
fprintf(stderr, "\t-u : print info of uuid tree only\n");
fprintf(stderr, "\t-b block_num : print info of the specified block"
 " only\n");
+   fprintf(stderr, "\t-f : (with -b) follow subtree of the specified"
+   " block\n");
fprintf(stderr,
"\t-t tree_id : print only the tree with the given id\n");
fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
@@ -137,6 +139,7 @@ int main(int ac, char **av)
int roots_only = 0;
int root_backups = 0;
u64 block_only = 0;
+   int block_follow = 0;
struct btrfs_root *tree_root_scan;
u64 tree_id = 0;
 
@@ -144,7 +147,7 @@ int main(int ac, char **av)
 
while(1) {
int c;
-   c = getopt(ac, av, "deb:rRut:");
+   c = getopt(ac, av, "defb:rRut:");
if (c < 0)
break;
switch(c) {
@@ -167,6 +170,9 @@ int main(int ac, char **av)
case 'b':
block_only = arg_strtou64(optarg);
break;
+   case 'f':
+   block_follow = 1;
+   break;
case 't':
tree_id = arg_strtou64(optarg);
break;
@@ -211,7 +217,7 @@ int main(int ac, char **av)
(unsigned long long)block_only);
goto close_root;
}
-   btrfs_print_tree(root, leaf, 0);
+   btrfs_print_tree(root, leaf, block_follow);
goto close_root;
}
 
-- 
1.7.3.4



Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Goffredo Baroncelli
On 12/10/2014 09:36 PM, Robert White wrote:
[...]
> I tested it and sure enough, it's RAID1...
> 
> I also noticed that the default for data goes from single to RAID0 in
> a two slice build.
> 
> I generally don't expect defaults to change in undocumented ways.
> Particularly since that makes make-plus-add orthogonal to
> make-as-multi.
> 
> Without other guidance I'd been assuming that
> 
> mkfs.btrfs d1 d2 d3 ... 
> --vs-- 
> mkfs.btrfs d1 
> btrfs dev add d2 
> btrfs dev add d3 ...
> 
> would net the same resultant system. I have only ever done the latter
> until today.
> 
> Does/Will the defaults change when three, four, or more slices are
> used to build the system?
> 
> I'll take a stab at updating the manual page.

Why not print the RAID profiles used from mkfs.btrfs?


> 
> -- Rob.
> 
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


A story of btrfs corruption and recovery

2014-12-10 Thread Martin Wilck
In April 2014, I reported a btrfs corruption on the linux-btrfs
mailing list (http://www.spinics.net/lists/linux-btrfs/msg33318.html).
8 months later, I am happy to be able to say I've been able to recover
the data with a combination of persistence and luck. I want to share
some of my insight with this list in the hope that it may be useful in
future cases.

I also did some work on the btrfs tools to be able to better
understand what was wrong; I will submit the additions and changes I
made for review later.

1. The history

I had created this file system in late 2012 when I installed OpenSUSE
12.2 on a friend's laptop. "btrfs was still unstable at that time", I
imagine you say.  That's easy to say in hindsight. OpenSUSE's
installer offered btrfs as a tier-1 choice, as far as I
remember. Articles written at the time (e.g.
http://rainbowtux.blogspot.de/2012/09/to-btrfs-or-not-to-btrfs.html)
suggest that I wasn't the only person considering it worth a serious
try. Today I wish I hadn't incautiously put my friend's /home on that
FS, too - I've certainly paid for that carelessness. So, /home was
subvolume 263 in this file system.  Complicating matters further, I
had created encrypted home file systems using ecryptfs on top of
btrfs.

2. The disaster

It all went well until April 14, 2014. On that day, the laptop
suddenly crashed.  OpenSUSE Kernel 3.4.11-2.16 was running at the time
of the crash.  Subsequent reboot attempts failed. I described the
phenomena in my posting to linux-btrfs, desperately hoping someone
would give me an easy recipe for recovery. It didn't happen. I got the
recommendation to use a newer version of the kernel and btrfs tools,
but they didn't get me any further. Whatever tool I tried, /home
appeared to be completely empty. I had to dig deeper.

3. The quest

After quite some time, I found the hint, looking at the root
of the /home subvolume, which was a level 2 node:

# ./btrfs-debug-tree -b 980717568 /dev/XX
node 980717568 level 2 items 78 free 43 generation 39637 owner 263
   key (256 INODE_ITEM 0) block 1012207616 (247121) gen 35754

Looking at the supposed level-1 subnode at 1012207616, I found that it
contained data of the wrong level (0), owner (2 - the extent tree),
and generation:

leaf 1012207616 items 26 free space 1967 generation 39622 owner 2
   item 0 key (8266870784 EXTENT_ITEM 12288) itemoff 3942 itemsize 53

So, the tree was massively corrupted at this crucial point; the top
inode of the subvolume couldn't be found, explaining why /home had
appeared empty on every recovery attempt. I looked at the other
children of the children of the tree root, and was pleasantly
surprised that these didn't look bad; I saw inodes and directory
entries of ecryptfs-encrypted home directories, as I had expected.

The obvious next thing to try was to look for previous generations of
the root of the /home subvolume, hoping they weren't corrupted. I
started with the super block root backups, with no luck. Later I went
back all the way from generation 39637 to 38081 (the oldest copy of
this root node I could find), but it was just as corrupted as the last
one - they all pointed to the same wrong level 1 block 1012207616.

I began to wonder whether the all-important level 1 and leaf meta data
of this part of the file system had survived somewhere at all. I
hacked together a tool to search for a specific btrfs key in all of
the meta data, and used it to search for the key 256-1-0 of the
subvolume 263 (the first inode of the /home file system).  Luckily, I
found exactly one copy of a leaf containing this key, and a handful of
level 1 nodes referring to it.

At this point I didn't yet dare to even think of repairing the file
system.  Rather, I made additional debugging steps. One strange thing
I found was that beyond the 603 top (level 2) copies of /home's root
node, there were several instances with the same generation number:

node 1037123584 level 2 items 78 free 43 generation 39636 owner 263
node 1041215488 level 2 items 78 free 43 generation 39636 owner 263
node 980566016 level 2 items 78 free 43 generation 39636 owner 263
node 980717568 level 2 items 78 free 43 generation 39637 owner 263

Looking at the details of these blocks, I found that the various
level-2-gen-39636-owner-263 were actually different. I have no idea if
this can happen under any circumstances, but it gave me another hint
towards the final solution. Out of the generation 39636 roots listed
above, only the last one showed the original corruption I described -
the others actually had reasonable data in slot 1. My first hope that
these root copies might actually be healthy was quickly destroyed - a
tree dump showed other errors. But, and that was key, these other
corruptions were at different points of the tree. Taking the three
gen-39636 roots together, I was able to find sane data for every part
of the tree.  I was lucky insofar as the total number of corruptions I
needed to fix turned out to be so low that it was doable 

Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Robert White

On 12/10/2014 05:21 AM, Duncan wrote:
> Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:
>
>> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>>> Hi Dongsheng On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>>>> When function btrfs_statfs() calculates the total size of the fs, it is
>>>>> calculating the total size of disks and then dividing it by a factor.
>>>>> But in some use cases, the result is not good for the user.
>>>>>
>>>>> Example:
>>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>>> # mount /dev/vdf1 /mnt
>>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>>> # df -h /mnt
>>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>>> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>>>>>
>>>>> # btrfs fi show /dev/vdf1
>>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>>> Total devices 2 FS bytes used 1001.53MiB
>>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>>>
>>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>>>
>>>> I agree
>>
>> NOPE.
>>
>> The model you propose is too simple.
>>
>> While the data portion of the file system is set to RAID1 the metadata
>> portion of the filesystem is still set to the default of DUP.


Well my bad... /D'oh...

Though I'd say the documentation needs to be updated. The only mention 
of changes from the default is this bit.


From man mkfs.btrfs as distributed in the source tree:

[QUOTE]
   -m|--metadata
       Specify how metadata must be spanned across the devices
       specified. Valid values are raid0, raid1, raid5, raid6, raid10,
       single or dup.

       Single device will have dup set by default except in the case of
       SSDs which will default to single. This is because SSDs can remap
       blocks internally so duplicate blocks could end up in the same
       erase block which negates the benefits of doing metadata
       duplication.
[/QUOTE]

No mention is made of RAID1 for a multi-device FS, the two defaults are 
listed as DUP and Single.


ASIDE: The wiki page mentions RAID1 but doesn't mention the SSD fallback 
to single; and it's annotated as potentially out of date. But I never 
looked there because I had the manual page locally.


I tested it and sure enough, it's RAID1...

I also noticed that the default for data goes from single to RAID0 in a 
two slice build.


I generally don't expect defaults to change in undocumented ways. 
Particularly since that makes make-plus-add orthogonal to make-as-multi.


Without other guidance I'd been assuming that

mkfs.btrfs d1 d2 d3 ...
--vs--
mkfs.btrfs d1
btrfs dev add d2
btrfs dev add d3
...

would net the same resultant system. I have only ever done the latter 
until today.


Do/Will the defaults change when three, four, or more slices are used
to build the system?


I'll take a stab at updating the manual page.

-- Rob.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Patrik Lundquist
On 10 December 2014 at 13:47, Duncan <1i5t5.dun...@cox.net> wrote:
>
> The recursive btrfs defrag after deleting the saved ext* subvolume
> _should_ have split up any such > 1 GiB extents so balance could deal
> with them, but either it failed for some reason on at least one such
> file, or there's some other weird corner-case going on, very likely
> something else having to do with the conversion.

I've run defrag several times again and it doesn't do anything additional.


> Patrik, assuming no btrfs snapshots yet, can you do a du --all --block-
> size=1M | sort -n (or similar), then take a look at all results over 1024
> (1 GiB since the du specified 1 MiB blocks), and see if it's reasonable
> to move all those files out of the filesystem and back?

Good idea, but it's quite a lot of files. I'd rather start over.

But I've identified 46 files from Btrfs errors in syslog and will try
to move them to another disk. They're ranging from 41KiB to 6.6GiB in
size.

Is btrfs-debug-tree -e useful in finding problematic files?


Re: Possibility to have a "transient" snapshot?

2014-12-10 Thread James West
I was just looking into using overlayfs, and although it has some 
promise, I think its biggest drawback is the upperdir will have to be
some sort of storage backed filesystem. From my limited understanding of 
tmpfs, it's not supposed to be the greatest with many large files (and 
my system in particular would be downloading many large movies/videos, 
and doing any kind of os update to test it would involve many changes 
all over the volume, which could be problematic to commit to a golden 
state.)


I could partition the main drive in 2 parts, and dynamically zero-out 
then create the volume in the second partition on each boot, but I'm 
still saving no drive writes, and not really extending the life of the 
hardware (which is one of my premises.)


On 05/12/2014 11:12 PM, Chris Murphy wrote:
> On Fri, Dec 5, 2014 at 11:27 AM, James West wrote:
>
>> General idea would be to have a transient snapshot (optional quota support
>> possibility here) on top of a base snapshot (possibly readonly). On system
>> start/restart (whether clean or dirty), the transient snapshot would be
>> flushed, and the system would restart the snapshot, basically restarting
>> from the base snapshot.
>
> Sounds similar to this idea:
> http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html
>
> About 1/3 of the way down it gets to a proposal to Btrfs as a way to
> get to a stateless system, which is basically what you want to be able
> to rollback to. A variation on this that might serve the use case
> better is seed device. You can either drop the added device that
> stores changes to the seed device, or the volume (seed+added device)
> can become another seed if you want to make the current state
> persistent at next boot.
>
> And still another possibility is overlayfs, which isn't Btrfs specific.







Re: btrfs scrub status misreports as "interrupted"

2014-12-10 Thread Konstantinos Skarlatos

On 10/12/2014 9:28 PM, Marc Joliet wrote:
> On Wed, 10 Dec 2014 10:51:15 +0800, Anand Jain wrote:
>
>>   Is there any relevant log in the dmesg?
>
> Not in my case; at least, nothing that made it into the syslog.

Same with me, no messages at all


Re: btrfs scrub status misreports as "interrupted"

2014-12-10 Thread Marc Joliet
On Wed, 10 Dec 2014 10:51:15 +0800, Anand Jain wrote:

>   Is there any relevant log in the dmesg?

Not in my case; at least, nothing that made it into the syslog.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Goffredo Baroncelli
On 12/10/2014 04:02 PM, Dongsheng Yang wrote:
> On Wed, Dec 10, 2014 at 9:21 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:
[...]
>> And in the example, the mkfs was supplied with two devices, so there's no
>> dup metadata remaining from a formerly single-device filesystem, either.
>> (Tho there will be the small single-mode stubs, empty, remaining from the
>> mkfs process, as no balance has been run to delete them yet, but those
>> are much smaller and empty.)
> 
> Yes. One question not related here: how about deleting them at the end of mkfs?
> 
> Thanx

A btrfs balance should remove them. If you don't want to balance a full
filesystem, you can filter the chunks by usage (set a low usage).
This was recently discussed in a thread...

BR
Goffredo


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Patrik Lundquist
On 10 December 2014 at 14:11, Duncan <1i5t5.dun...@cox.net> wrote:
>
> From there... I've never used it but I /think/ btrfs inspect-internal
> logical-resolve should let you map the 182109... address to a filename.
> From there, moving that file out of the filesystem and back in should
> eliminate that issue.

btrfs inspect-internal logical-resolve 1821099687936 /mnt gives me the
filename and it's only a 54175 bytes file.


> Assuming no snapshots still contain the file, of course, and that the
> ext* saved subvolume has already been deleted.

Got no snapshots or subvolumes. Keeping it simple for now.


Re: [PATCH V2][BTRFS-PROGS] Don't use LVM snapshot device

2014-12-10 Thread Goffredo Baroncelli
On 12/10/2014 08:52 AM, Anand Jain wrote:
> 
> 
>> This patch allows btrfs to skip LVM snapshot during the device scan
>> phase.
> 
>  It's better to generalize the problem and fix it. The fix here is very
>  specific to LVM use case. This does not work in cases where device is
>  cloned using dd (device is unmounted).

See patch #5; this aborts btrfs[progs] when two devices have the same
dev.uuid and fsid.

Unfortunately this patch doesn't work with "btrfs dev scan"; this is
because each device is discovered/registered alone. See my patches about
mount.btrfs for an alternative approach.

> As mentioned we need to depend on the device wwn as provided by the
> device target driver.
> 
> Thanks, Anand
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Goffredo Baroncelli
On 12/10/2014 11:53 AM, Robert White wrote:
> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>> Hi Dongsheng On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>>> When function btrfs_statfs() calculates the total size of the fs, it
>>>> is calculating the total size of disks and then dividing it by
>>>> a factor. But in some use cases, the result is not good for the user.
>>>>
>>>> Example:
>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>> # mount /dev/vdf1 /mnt
>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>> # df -h /mnt
>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>>>>
>>>> # btrfs fi show /dev/vdf1
>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>> Total devices 2 FS bytes used 1001.53MiB
>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>>
>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>> I agree
> 
> NOPE.
> 
> The model you propose is too simple.
> 
> While the data portion of the file system is set to RAID1 the
> metadata portion of the filesystem is still set to the default of
> DUP. As such it is impossible to guess how much space is "free" since
> it is unknown how the space will be used beforehand.


Hi Robert,

sorry, but you are talking about a different problem.
Yang is trying to solve a problem where it is impossible to fill
all the disk space because some portion is not raid1 protected. So
it is incorrect to report all space/2 as free space.

Instead you are stating that *if* the metadata are stored as DUP (which
is not the case here, because the metadata are raid1, see below), it is
possible to fill all the disk space.

This is a complex problem. The fact that BTRFS allows different
raid levels makes it very difficult to evaluate the free space
(as space available directly to the user). There is no simple answer.

I am still convinced that the best free space *estimation* is to consider
the ratio disk-space-consumed/file-allocated constant, and to evaluate
the free space as

disk-space-unused*file-allocated/disk-space-consumed.

Of course there are pathological cases that make this
prediction fail completely. But I consider it the best estimation
possible for the average user.
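
As a rough illustration of the estimate above, the formula can be written as a small C function. This is a sketch only: the name estimate_free and its parameters are made up for this example, and the fallback for an empty filesystem (assume a ratio of 1) is an arbitrary choice that the message does not specify.

```c
#include <stdint.h>

/*
 * Estimate the space still available to the user, assuming the ratio
 * disk-space-consumed / file-allocated stays constant:
 *
 *   free = disk_unused * file_allocated / disk_consumed
 *
 * All values are in the same unit (for example bytes).
 */
uint64_t estimate_free(uint64_t disk_unused, uint64_t file_allocated,
		       uint64_t disk_consumed)
{
	/* Nothing consumed yet: the ratio is undefined, so assume 1:1. */
	if (disk_consumed == 0)
		return disk_unused;

	/* Floating point avoids overflow of the 64-bit product; this is an
	 * estimate anyway, so the rounding does not matter. */
	return (uint64_t)((double)disk_unused * (double)file_allocated /
			  (double)disk_consumed);
}
```

For a filesystem where every file byte has so far cost two bytes of disk (as with raid1), estimate_free(4000, 1000, 2000) gives 2000, i.e. half of the unused raw space.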

But again this is a different problem than the one raised by
Yang.



[...]

> IF you wanted everything to be RAID-1 you should have instead done

> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 -m raid1
> 
> The mistake is yours, the rest of your analysis is, therefore, completely
> inapplicable. Please read all the documentation before making that
> sort of filesystem. Your data will thank you later.
> 
> DISCLAIMER: I have _not_ looked at the numbers you would get if you
> used the corrected command.

Sorry, but you are wrong. Doing mkfs.btrfs -d raid1 /dev/loop[01] puts
both data and metadata in raid1. IIRC if you have more than
one disk, the metadata switches to raid1 automatically.

$ sudo mkfs.btrfs -d raid1 /dev/loop[01]
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (10.00GiB) ...
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
Performing full device TRIM (30.00GiB) ...
adding device /dev/loop1 id 2
fs created label (null) on /dev/loop0
nodesize 16384 leafsize 16384 sectorsize 4096 size 40.00GiB
ghigo@venice:/tmp$ sudo mount /dev/loop0 t/
ghigo@venice:/tmp$ sudo dd if=/dev/zero of=t/fill bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.018853 s, 2.2 GB/s
ghigo@venice:/tmp$ sync
ghigo@venice:/tmp$ sudo btrfs fi df t/
Data, RAID1: total=1.00GiB, used=40.50MiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=160.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-10 Thread Phillip Susi
On 12/9/2014 10:10 PM, Anand Jain wrote:
> In the test case provided earlier who is triggering the scan ? 
> grub-probe ?

The scan is initiated by udev.  grub-probe only comes into it because
it is looking to /proc/mounts to find out what device is mounted, and
/proc/mounts is lying.

> But we had to revert, Since btrfs bug become a feature for the
> system boot process and fixing that breaks mount at boot with
> subvol.

How is this?  Also are we talking about updating the cached list of
devices that *can* be mounted, or what device already *is* mounted?  I
can see doing the former, but the latter should never happen.

> if the device is already mounted, just the device path is updated 
> but still the original device will be still in use (bug).

Yep, that is the bug that started all of this.




[PATCH 2/3] btrfs-progs: subvol delete: add verbosity option

2014-12-10 Thread David Sterba
Add the option -v and use it for the transaction commit mode message.

Signed-off-by: David Sterba 
---
 cmds-subvolume.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 4e452f4f4eb7..b14f86e06cb4 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -66,7 +66,7 @@ static int cmd_subvol_create(int argc, char **argv)
 
optind = 1;
while (1) {
-   int c = getopt(argc, argv, "c:i:");
+   int c = getopt(argc, argv, "c:i:v");
if (c < 0)
break;
 
@@ -217,6 +217,7 @@ static int cmd_subvol_delete(int argc, char **argv)
char*dupvname = NULL;
char*path;
DIR *dirstream = NULL;
+   int verbose = 0;
int sync_mode = 0;
struct option long_options[] = {
{"commit-after", no_argument, NULL, 'c'},  /* sync mode 1 */
@@ -239,6 +240,9 @@ static int cmd_subvol_delete(int argc, char **argv)
case 'C':
sync_mode = 2;
break;
+   case 'v':
+   verbose++;
+   break;
default:
usage(cmd_subvol_delete_usage);
}
@@ -247,9 +251,11 @@ static int cmd_subvol_delete(int argc, char **argv)
if (check_argc_min(argc - optind, 1))
usage(cmd_subvol_delete_usage);
 
-   printf("Transaction commit: %s\n",
-   !sync_mode ? "none (default)" :
-   sync_mode == 1 ? "at the end" : "after each");
+   if (verbose > 0) {
+   printf("Transaction commit: %s\n",
+   !sync_mode ? "none (default)" :
+   sync_mode == 1 ? "at the end" : "after each");
+   }
 
cnt = optind;
 
-- 
2.1.3



[PATCH 1/3] btrfs-progs: let subvol delete print commit mode inline

2014-12-10 Thread David Sterba
There are options to specify whether subvolume deletion should wait for
a commit after each subvol or at the end. This is reported at the
beginning and considered noise. We'd like to report the mode for
each subvolume instead.

http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg34617.html

Reported-by: Marc MERLIN 
---
 cmds-subvolume.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 53eec467251d..4e452f4f4eb7 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -303,7 +303,9 @@ again:
goto out;
}
 
-   printf("Delete subvolume '%s/%s'\n", dname, vname);
+   printf("Delete subvolume (%s): '%s/%s'\n",
+   sync_mode == 2 || (sync_mode == 1 && cnt + 1 == argc)
+   ? "commit" : "no-commit", dname, vname);
strncpy_null(args.name, vname);
res = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args);
e = errno;
-- 
2.1.3



[PATCH 0/3] Btrfs-progs: subvolume deletion commit mode update

2014-12-10 Thread David Sterba
Minor change in the output of 'subvolume delete' command, the commit mode is
printed inline with the subvolume and the global message is moved under the
newly added 'verbose' option.
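
The inline commit-mode label can be sketched in isolation as below. The condition and the format string are taken from the patches in this series; the helper names commit_label and format_delete_line are hypothetical and exist only for this example (the patches build the string directly inside a printf call).

```c
#include <stdio.h>

/*
 * Label printed next to each deleted subvolume.
 * commit_mode: 0 = none (default), 1 = --commit-after, 2 = --commit-each.
 * cnt is the index of the current argument, argc the argument count, so
 * cnt + 1 == argc identifies the last subvolume on the command line.
 */
const char *commit_label(int commit_mode, int cnt, int argc)
{
	return (commit_mode == 2 || (commit_mode == 1 && cnt + 1 == argc))
		? "commit" : "no-commit";
}

/* Build the per-subvolume message; returns what snprintf() returns. */
int format_delete_line(char *buf, size_t len, int commit_mode, int cnt,
		       int argc, const char *dname, const char *vname)
{
	return snprintf(buf, len, "Delete subvolume (%s): '%s/%s'",
			commit_label(commit_mode, cnt, argc), dname, vname);
}
```

With --commit-after and three arguments, only the last subvolume is labelled "commit"; with --commit-each every line is.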

David Sterba (3):
  btrfs-progs: let subvol delete print commit mode inline
  btrfs-progs: subvol delete: add verbosity option
  btrfs-progs: subvol delete: rename variable to match the option name

 cmds-subvolume.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

-- 
2.1.3



[PATCH 3/3] btrfs-progs: subvol delete: rename variable to match the option name

2014-12-10 Thread David Sterba
Signed-off-by: David Sterba 
---
 cmds-subvolume.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index b14f86e06cb4..15d4b975a916 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -218,10 +218,10 @@ static int cmd_subvol_delete(int argc, char **argv)
char*path;
DIR *dirstream = NULL;
int verbose = 0;
-   int sync_mode = 0;
+   int commit_mode = 0;
struct option long_options[] = {
-   {"commit-after", no_argument, NULL, 'c'},  /* sync mode 1 */
-   {"commit-each", no_argument, NULL, 'C'},  /* sync mode 2 */
+   {"commit-after", no_argument, NULL, 'c'},  /* commit mode 1 */
+   {"commit-each", no_argument, NULL, 'C'},  /* commit mode 2 */
{NULL, 0, NULL, 0}
};
 
@@ -235,10 +235,10 @@ static int cmd_subvol_delete(int argc, char **argv)
 
switch(c) {
case 'c':
-   sync_mode = 1;
+   commit_mode = 1;
break;
case 'C':
-   sync_mode = 2;
+   commit_mode = 2;
break;
case 'v':
verbose++;
@@ -253,8 +253,8 @@ static int cmd_subvol_delete(int argc, char **argv)
 
if (verbose > 0) {
printf("Transaction commit: %s\n",
-   !sync_mode ? "none (default)" :
-   sync_mode == 1 ? "at the end" : "after each");
+   !commit_mode ? "none (default)" :
+   commit_mode == 1 ? "at the end" : "after each");
}
 
cnt = optind;
@@ -310,7 +310,7 @@ again:
}
 
printf("Delete subvolume (%s): '%s/%s'\n",
-   sync_mode == 2 || (sync_mode == 1 && cnt + 1 == argc)
+   commit_mode == 2 || (commit_mode == 1 && cnt + 1 == argc)
? "commit" : "no-commit", dname, vname);
strncpy_null(args.name, vname);
res = ioctl(fd, BTRFS_IOC_SNAP_DESTROY, &args);
@@ -323,7 +323,7 @@ again:
goto out;
}
 
-   if (sync_mode == 1) {
+   if (commit_mode == 1) {
res = wait_for_commit(fd);
if (res < 0) {
fprintf(stderr,
@@ -347,7 +347,7 @@ out:
goto again;
}
 
-   if (sync_mode == 2 && fd != -1) {
+   if (commit_mode == 2 && fd != -1) {
res = wait_for_commit(fd);
if (res < 0) {
fprintf(stderr,
-- 
2.1.3



Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Dongsheng Yang
On Wed, Dec 10, 2014 at 9:21 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:
>
>> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>>> Hi Dongsheng On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>>>> When function btrfs_statfs() calculates the total size of the fs, it is
>>>>> calculating the total size of disks and then dividing it by a factor.
>>>>> But in some use cases, the result is not good for the user.
>>>>>
>>>>> Example:
>>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>>> # mount /dev/vdf1 /mnt
>>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>>> # df -h /mnt
>>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>>> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>>>>>
>>>>> # btrfs fi show /dev/vdf1
>>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>>> Total devices 2 FS bytes used 1001.53MiB
>>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>>>
>>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>>> I agree
>>
>> NOPE.
>>
>> The model you propose is too simple.
>>
>> While the data portion of the file system is set to RAID1 the metadata
>> portion of the filesystem is still set to the default of DUP.
>
> Metadata defaults to DUP only on a single-device filesystem.  On a multi-
> device filesystem, metadata defaults to raid1.  (FWIW, for both, data
> defaults to single.)

Exactly. Thanx for your clarification. :)
>
> And in the example, the mkfs was supplied with two devices, so there's no
> dup metadata remaining from a formerly single-device filesystem, either.
> (Tho there will be the small single-mode stubs, empty, remaining from the
> mkfs process, as no balance has been run to delete them yet, but those
> are much smaller and empty.)

Yes. One question not related here: how about deleting them at the end of mkfs?

Thanx
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman


Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Dongsheng Yang
On Wed, Dec 10, 2014 at 9:59 PM, Shriramana Sharma  wrote:
> On Tue, Dec 9, 2014 at 4:50 PM, Dongsheng Yang
>  wrote:
>> # df -h /mnt
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>
> LOL -- not being a user of RAID I can't comment on the patch, but I
> was somewhat wondering what the "fd" command in the subject line is...
> :-)

Yea, it should be "df". :)
>
> --
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Dongsheng Yang
On Wed, Dec 10, 2014 at 6:53 PM, Robert White  wrote:
> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>>
>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>>
>>> Hi Dongsheng
>>> On 12/09/2014 12:20 PM, Dongsheng Yang wrote:

>>>> When function btrfs_statfs() calculates the total size of the fs, it is
>>>> calculating the total size of disks and then dividing it by a factor.
>>>> But in some use cases, the result is not good for the user.
>>>>
>>>> Example:
>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>> # mount /dev/vdf1 /mnt
>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>> # df -h /mnt
>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>>>>
>>>> # btrfs fi show /dev/vdf1
>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>> Total devices 2 FS bytes used 1001.53MiB
>>>> devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>>>> devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
>>>>
>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>>
>>> I agree
>
>
> NOPE.
>
> The model you propose is too simple.
>
> While the data portion of the file system is set to RAID1 the metadata
> portion of the filesystem is still set to the default of DUP. As such it is
> impossible to guess how much space is "free" since it is unknown how the
> space will be used before hand.
>
> IF, say, this were used as a typical mail spool, web cache, or any number of
> similar small-file applications virtually all of the data may end up in the
> metadata chunks. The "blocks free" in this usage are indistinguishable from
> any other file system.
>
> For all that DUP data the correct size is 3GiB because there will be two
> copies of all metadata but they could _all_ end up on /dev/vdf2.
>
> So you have a RAID-1 region that is constrained to 2GiB. You have 2GiB more
> storage for all your metadata, but the constraint is DUP (so everything is
> written twice "somewhere")
>
> So the space breakdown is, if optimally packed, actually

The issue you point out here really exists. If all the data is stored inline,
the effective raid level will probably differ from the one we set with "-d".

If we want to give an exact guess of the future use, I would say
it's impossible.

But I think 2G is more appropriate for @size than 3G in this case.

Let's compare them as below:

2G:
a). It's readable to the user: we build a btrfs with two devices of 2G and 4G,
and we get an fs of 2G. That is how raid1 should be understood.
b). Even if all data is stored in inline extents, @size will grow
at the same time. That is, if, as you said, we get 3G of data in it, @size
will also be reported as 3G by the df command.

3G:
a). It is strange to the user: why do we get an fs of 3G in raid1 with 2G
and 4G devices?
And why can't I use all of the 3G capacity that df reported? (We cannot assume
that a user understands what an inline extent is.)
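
The 2G figure in the comparison above can be reproduced with a small capacity calculation. This is a sketch under stated assumptions, not the code from the patch: it assumes two-copy raid1 where the two copies of every chunk live on different devices, ignores chunk granularity and metadata overhead, and the function name raid1_usable is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Upper bound on the data that fits in a two-copy raid1 layout.
 * Every byte needs a second copy on a different device, so the result is
 * limited both by half of the raw total and by what the devices other
 * than the largest one can mirror.
 */
uint64_t raid1_usable(const uint64_t *dev_sizes, size_t n)
{
	uint64_t total = 0, largest = 0;

	if (n < 2)
		return 0;	/* raid1 needs at least two devices */

	for (size_t i = 0; i < n; i++) {
		total += dev_sizes[i];
		if (dev_sizes[i] > largest)
			largest = dev_sizes[i];
	}

	uint64_t rest = total - largest;	/* space outside the largest device */
	uint64_t half = total / 2;

	return rest < half ? rest : half;
}
```

For the 2G + 4G example this yields 2G, because the 2G device is the limiting factor; the remaining 2G on the larger device have nowhere to be mirrored.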

So, I prefer 2G to 3G here. Furthermore, I have cooked a new patch that treats
space in metadata chunks and system chunks more properly, as shown below.
# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdf1       2.0G  1.3G  713M  66% /mnt
# df /mnt
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/vdf1        2097152 1359424    729536  66% /mnt
# btrfs fi show /dev/vdf1
Label: none  uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4
Total devices 2 FS bytes used 1001.55MiB
devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2

Does this make more sense to you, Robert?

Thanx
Yang

>
> 2GiB mirrored, for _data_, takes up 4GiB total spread evenly across
> /dev/vdf2 (2GiB) and /dev/vdf1 (2GiB).
>
> _AND_ 1GiB of metadata, written twice to /dev/vdf2 (2GiB)
>
> So free space is 3GiB on the presumption that data and metadata will be
> equally used.
>
> The program, not being psychic, can only make a fair-usage guess about
> future use.
>
> Now we have accounted for all 6GiB of raw storage _and_ the report of 3GiB
> free.
>
> IF you wanted everything to be RAID-1 you should have instead done
>
> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 -m raid1
>
> The mistake is yours, the rest of your analysis is, therefore, completely
> inapplicable. Please read all the documentation before making that sort of
> filesystem. Your data will thank you later.
>
> DISCLAIMER: I have _not_ looked at the numbers you would get if you used the
> corrected command.
>
>
>
>>>
>>>> b. df -h should report Avail as 0.15GiB or less, rather than as 1.3GiB.
>>>> 2 - 1.85 = 0.15
>>>
>>> I cannot agree; the avail should be:
>>>  1.85   (the capacity of the allocated chunk)
>>> -1.018  (the file stored)
>>> +(2-1.85=0.15)  (the residual capacity of the disks
>>>  considering a raid1 fs)
>>> ---
>>> = 

Re: systemd.setenv and a mount.unit

2014-12-10 Thread David Sterba
On Thu, Nov 20, 2014 at 11:39:19AM -0700, Chris Murphy wrote:
> On Thu, Nov 20, 2014 at 4:14 AM, Goffredo Baroncelli  
> wrote:
> 
> > Supposing to have the following four subvolumes
> >
> > /root/
> > /root/etc
> > /root/usr
> > /root/var
> >
> > When you need to snapshot, you should:
> >
> > # btrfs subvolume snapshot /root /backup-root-20141120
> > # btrfs subvolume snapshot /root/etc /backup-root-20141120/etc
> > # btrfs subvolume snapshot /root/usr /backup-root-20141120/usr
> > # btrfs subvolume snapshot /root/var /backup-root-20141120/var
> >
> > So in order to remount an "old" filesystem, you need to make only
> > 1 mount.
> 
> I like this layout better than either the openSUSE or Fedora layouts.
> It's easier to mount an old filesystem, where on Fedora each
> subvolume must be explicitly mounted. And it ensures old binaries
> aren't in the current mount path – kinda like running in a chroot –
> where on openSUSE the snapshots containing old binaries are in the
> current mount path.

While the old binaries are in the current mount path, they're not
generally accessible due to 0750 on the .snapshots directory.

The 'single mountpoint for whole root' is not perfect in case there are
files that are independent of the system files, like logs or some
application data in well-known paths.

The other option is to have separate subvolumes for the selected paths
and either mount them in fstab or do more work when the old filesystem
has to be rolled back and transformed to the expected layout.

Both have their pros and cons, so this is a matter of user choice. E.g. if
the logs are forwarded and not kept locally, and no applications store data
on the root partition, going to an older snapshot is trivial and without
unexpected consequences.

And of course the layouts are convertible both ways.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Shriramana Sharma
On Tue, Dec 9, 2014 at 4:50 PM, Dongsheng Yang
 wrote:
> # df -h /mnt
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt

LOL -- not being a user of RAID I can't comment on the patch, but I
was somewhat wondering what the "fd" command in the subject line is...
:-)

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


[PATCH] btrfs-progs: mkfs: make skinny-metadata default

2014-12-10 Thread David Sterba
According to public poll, this is desired and deemed to be safe. Feature
introduced in kernel 3.10 (Jun 2013).

Signed-off-by: David Sterba 
---
 mkfs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mkfs.c b/mkfs.c
index e10e62d2f2e3..f930a5353f75 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -46,7 +46,8 @@
 
 static u64 index_cnt = 2;
 
-#define DEFAULT_MKFS_FEATURES  (BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
+#define DEFAULT_MKFS_FEATURES  (BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF \
+   | BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
 
 #define DEFAULT_MKFS_LEAF_SIZE 16384
 
-- 
2.1.3



[PATCH] btrfs-progs: basic support for TREE_SEARCH_V2 ioctl

2014-12-10 Thread David Sterba
Add the interface and helper that checks if the v2 ioctl is supported.

Signed-off-by: David Sterba 
---
 ioctl.h | 14 ++
 utils.c | 40 
 utils.h |  2 ++
 3 files changed, 56 insertions(+)

diff --git a/ioctl.h b/ioctl.h
index 67c8de9808a7..2c2c7c1bc57e 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -279,6 +279,18 @@ struct btrfs_ioctl_search_args {
char buf[BTRFS_SEARCH_ARGS_BUFSIZE];
 };
 
+/*
+ * Extended version of TREE_SEARCH ioctl that can return more than 4k of bytes.
+ * The allocated size of the buffer is set in buf_size.
+ */
+struct btrfs_ioctl_search_args_v2 {
+struct btrfs_ioctl_search_key key; /* in/out - search parameters */
+__u64 buf_size;   /* in - size of buffer
+* out - on EOVERFLOW: needed size
+*   to store item */
+__u64 buf[0];  /* out - found items */
+};
+
 #define BTRFS_INO_LOOKUP_PATH_MAX 4080
 struct btrfs_ioctl_ino_lookup_args {
__u64 treeid;
@@ -542,6 +554,8 @@ struct btrfs_ioctl_clone_range_args {
struct btrfs_ioctl_defrag_range_args)
 #define BTRFS_IOC_TREE_SEARCH _IOWR(BTRFS_IOCTL_MAGIC, 17, \
   struct btrfs_ioctl_search_args)
+#define BTRFS_IOC_TREE_SEARCH_V2 _IOWR(BTRFS_IOCTL_MAGIC, 17, \
+  struct btrfs_ioctl_search_args_v2)
 #define BTRFS_IOC_INO_LOOKUP _IOWR(BTRFS_IOCTL_MAGIC, 18, \
   struct btrfs_ioctl_ino_lookup_args)
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, __u64)
diff --git a/utils.c b/utils.c
index 2a9241619128..d3ec0d4ab467 100644
--- a/utils.c
+++ b/utils.c
@@ -2450,3 +2450,43 @@ int find_next_key(struct btrfs_path *path, struct btrfs_key *key)
}
return 1;
 }
+
+int btrfs_tree_search2_ioctl_supported(int fd)
+{
+   struct btrfs_ioctl_search_args_v2 *args2;
+   struct btrfs_ioctl_search_key *sk;
+   int args2_size = 1024;
+   char args2_buf[args2_size];
+   int ret;
+   static int v2_supported = -1;
+
+   if (v2_supported != -1)
+   return v2_supported;
+
+   args2 = (struct btrfs_ioctl_search_args_v2 *)args2_buf;
+   sk = &(args2->key);
+
+   /*
+* Search for the extent tree item in the root tree.
+*/
+   sk->tree_id = BTRFS_ROOT_TREE_OBJECTID;
+   sk->min_objectid = BTRFS_EXTENT_TREE_OBJECTID;
+   sk->max_objectid = BTRFS_EXTENT_TREE_OBJECTID;
+   sk->min_type = BTRFS_ROOT_ITEM_KEY;
+   sk->max_type = BTRFS_ROOT_ITEM_KEY;
+   sk->min_offset = 0;
+   sk->max_offset = (u64)-1;
+   sk->min_transid = 0;
+   sk->max_transid = (u64)-1;
+   sk->nr_items = 1;
+   args2->buf_size = args2_size - sizeof(struct btrfs_ioctl_search_args_v2);
+   ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH_V2, args2);
+   if (ret == -EOPNOTSUPP)
+   v2_supported = 0;
+   else if (ret == 0)
+   v2_supported = 1;
+   else
+   return ret;
+
+   return v2_supported;
+}
diff --git a/utils.h b/utils.h
index 289e86b4b11e..eb917d695f18 100644
--- a/utils.h
+++ b/utils.h
@@ -161,4 +161,6 @@ static inline u64 btrfs_min_dev_size(u32 leafsize)
 
 int find_next_key(struct btrfs_path *path, struct btrfs_key *key);
 
+int btrfs_tree_search2_ioctl_supported(int fd);
+
 #endif
-- 
2.1.3
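
A note on the probe above, offered as a hedged sketch rather than as a change to the patch: ioctl(2) reports failure by returning -1 and setting errno, so a caller that wants to tell "not supported" apart from "some other error" has to look at errno rather than at the return value alone. The helper name classify_probe and the enum below are hypothetical. Which errno an older kernel returns for an unknown ioctl is an assumption here: ENOTTY is the usual convention, and EOPNOTSUPP is accepted as well.

```c
#include <errno.h>

/* Hypothetical result codes for a "is this ioctl supported?" probe. */
enum probe_result {
	PROBE_ERROR = -1,	/* the call failed for an unrelated reason */
	PROBE_UNSUPPORTED = 0,	/* the kernel does not know the ioctl */
	PROBE_SUPPORTED = 1	/* the call succeeded */
};

/*
 * Classify the outcome of an ioctl() probe.
 * ret:         value returned by ioctl()
 * saved_errno: errno captured immediately after the call
 *
 * Assumption: an unknown ioctl fails with ENOTTY (or EOPNOTSUPP).
 */
static enum probe_result classify_probe(int ret, int saved_errno)
{
	if (ret == 0)
		return PROBE_SUPPORTED;
	if (ret == -1 &&
	    (saved_errno == ENOTTY || saved_errno == EOPNOTSUPP))
		return PROBE_UNSUPPORTED;
	return PROBE_ERROR;
}
```

A caller would save errno right after the ioctl() call and pass both values in, caching the result only for the two definite outcomes.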



Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Patrik Lundquist
On 10 December 2014 at 13:17, Robert White  wrote:
> On 12/09/2014 11:19 PM, Patrik Lundquist wrote:
>>
> BUT FIRST UNDERSTAND: you do _not_ need to balance a newly converted
> filesystem. That is, the recommended balance (and recursive defrag) is _not_
> a usability issue, it's an efficiency issue.

But if I can't start with an efficient filesystem I'd rather start
over now/soon. I intend to add four more old disks for a RAID1 and it
will be problematic to start over later on (I'd have to buy new, large
disks).

I deleted the subvolume after being satisfied with the conversion,
defragged recursively, and balanced. In that order.


> Because you made a backup and everything yes?

Shh!


> So anyway. Your system isn't "bugged" or "broken", it's "full", but it's a
> fragmented fullness that has lots of free sectors but insufficient contiguous
> free sectors, so it cannot satisfy the request.

It's a half full 3TB disk. There _is_ space, somewhere. I can't speak
for contiguous space though.


>> I don't know how to interpret the space_info error. Why is only
>> 4773171200 (4.4GiB) free?
>> Can I inspect block group 1821099687936 to try to find out what makes
>> it problematic?
>>
>> BTRFS info (device sdc1): relocating block group 1821099687936 flags 1
>> BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
>> BTRFS: space_info 1 has 4773171200 free, is not full
>> BTRFS: space_info total=1494648619008, used=1489775505408, pinned=0,
>> reserved=99700736, may_use=2102390784, readonly=241664
>
>
> So it was looking for a single chunk 2013265920 bytes long and it couldn't
> find one because all the spaces were smaller and there was no room to make a
> new suitable space.
>
> The problem is that it wanted 2013265920 bytes and the system as a
> whole had no way to satisfy that desire. It asked for something just shy of
> two gigs as a single extent. That's a tough order on a full platter.
>
> Since your entire free size is 2102390784 that is an attempt to allocate
> about 80% of your free space as one contiguous block. That's never going to
> happen. 8-)

What about "space_info 1 has 4773171200 free"? Besides the other 1.5TB
free space.


> I don't even know if 2GiB is normally a legal size for an extent. My
> understanding is that data is allocated in 1G chunks, so I'd expect all
> extents to be smaller than 1G.

The 'summary' after the failed balances is always something like "98
enospc errors" which now makes me suspect that I have 98 files with
extents larger than 1GiB that the defrag didn't take care of.

So if I can find out which files have >1GiB extents I can then copy
them back and forth to solve the problem.

Maybe running defrag more times can also solve it? Can I get a list of
fragmented files?

Suppose an old file with 2GiB extent isn't fragmented, will btrfs
defrag still try to defrag it?


> After a quick glance at btrfs-convert, it looks like it might make some
> pretty atypical extents if the underlying donor filesystem needed
> them. It wouldn't have had a choice. So it's easily within the realm of
> reason that you'd have some really fascinating data as a result of
> converting a nearly full EXT4 file system of the Terabyte+ size.

It was about half full at conversion.


> This would
> be quadruply true if you'd tweaked the block group ratios when you made the
> original file system.

Ext4 created with defaults, but I think it has been completely full at one time.


> So since you have nice backups... you should probably drop the ext2_saved
> subvolume and then get on with your life for good or ill.

Done before defrag and balance attempts.


> Think of the time and worry you'd have saved if you'd copied the thing in
> the first place. 8-)

But then I wouldn't learn as much. :-)


>>> P.S. you should re-balance your System and Metadata as "DUP" for now. Two
>>> copies of that stuff is better than one as right now you have no real
>>> recovery path for that stuff. If you didn't make that change on purpose it
>>> probably got down-revved from DUP automagically when you tried to RAID it.
>>
>>
>> Good point. Maybe btrfs-convert should do that by default? I don't
>> think it has ever been DUP.
>
> Eyup.

And the metadata is now DUP. That's ~1.5GB extra metadata that was
allocated just fine after the failed balance.
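
For what it's worth, the "free" figure in the space_info line quoted above can be reproduced by subtracting used, pinned, reserved and readonly from total; may_use does not enter into it. A minimal sketch of that arithmetic follows. The function name is made up for illustration, and the claim that this mirrors how the message is computed is an inference from the numbers, not something stated in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes reported as "free" for a space_info,
 * assuming free = total - used - pinned - reserved - readonly
 * (may_use is deliberately not subtracted).
 */
static uint64_t space_info_free(uint64_t total, uint64_t used,
				uint64_t pinned, uint64_t reserved,
				uint64_t readonly)
{
	return total - used - pinned - reserved - readonly;
}
```

With the values from the log (total=1494648619008, used=1489775505408, pinned=0, reserved=99700736, readonly=241664) this yields 4773171200, which matches the "has 4773171200 free" line. The balance wanted 2013265920 bytes, which is less than that total, so the failure would have to come from how the free space is laid out rather than from the raw amount.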


Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Duncan
Robert White posted on Wed, 10 Dec 2014 02:53:40 -0800 as excerpted:

> On 12/09/2014 05:08 PM, Dongsheng Yang wrote:
>> On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:
>>> Hi Dongsheng On 12/09/2014 12:20 PM, Dongsheng Yang wrote:
>>>> When function btrfs_statfs() calculates the total size of fs, it is
>>>> calculating the total size of disks and then dividing it by a factor.
>>>> But in some use cases, the result is not good for the user.
>>>>
>>>> Example:
>>>> # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>>>> # mount /dev/vdf1 /mnt
>>>> # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>>>> # df -h /mnt
>>>> Filesystem  Size  Used Avail Use% Mounted on
>>>> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>>>>
>>>> # btrfs fi show /dev/vdf1
>>>> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>>>> Total devices 2 FS bytes used 1001.53MiB
>>>> devid 1 size 2.00GiB  used 1.85GiB path /dev/vdf1
>>>> devid 2 size 4.00GiB  used 1.83GiB path /dev/vdf2
>>>>
>>>> a. df -h should report Size as 2GiB rather than as 3GiB.
>>>> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
>>> I agree
> 
> NOPE.
> 
> The model you propose is too simple.
> 
> While the data portion of the file system is set to RAID1 the metadata
> portion of the filesystem is still set to the default of DUP.

Metadata defaults to DUP only on a single-device filesystem.  On a multi-
device filesystem, metadata defaults to raid1.  (FWIW, for both, data 
defaults to single.)

And in the example, the mkfs was supplied with two devices, so there's no 
dup metadata remaining from a formerly single-device filesystem, either.  
(Tho there will be the small single-mode stubs, empty, remaining from the 
mkfs process, as no balance has been run to delete them yet, but those 
are much smaller and empty.)
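
As a rough illustration of the "limiting factor" argument for the 2GiB + 4GiB example: with two copies of every chunk on different devices, the usable size is bounded both by half the total and by what the devices other than the largest can mirror. The sketch below is a simplification under stated assumptions: it ignores metadata, chunk granularity and the single-mode stubs mentioned above, and the function name raid1_usable is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical estimate of usable bytes for a two-copy (raid1) profile:
 * min(total / 2, total - largest device).
 * Assumes every chunk needs two copies on two different devices.
 */
static uint64_t raid1_usable(const uint64_t *dev_sizes, size_t n)
{
	uint64_t total = 0;
	uint64_t largest = 0;

	for (size_t i = 0; i < n; i++) {
		total += dev_sizes[i];
		if (dev_sizes[i] > largest)
			largest = dev_sizes[i];
	}

	uint64_t half = total / 2;
	uint64_t mirrorable = total - largest;

	return half < mirrorable ? half : mirrorable;
}
```

For devices of 2GiB and 4GiB this gives 2GiB, not the 3GiB that df reported in the example; for three 2GiB devices it gives 3GiB.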

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[RFC PATCH V10 03/19] Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release extents aligned to block size.

2014-12-10 Thread Chandan Rajendra
Currently, the code reserves/releases extents in multiples of PAGE_CACHE_SIZE
units. Fix this.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/file.c | 32 
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d3afac2..444819d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1366,18 +1366,21 @@ fail:
 static noinline int
 lock_and_cleanup_extent_if_need(struct inode *inode, struct page **pages,
size_t num_pages, loff_t pos,
+   size_t write_bytes,
u64 *lockstart, u64 *lockend,
struct extent_state **cached_state)
 {
+   struct btrfs_root *root = BTRFS_I(inode)->root;
u64 start_pos;
u64 last_pos;
int i;
int ret = 0;
 
-   start_pos = pos & ~((u64)PAGE_CACHE_SIZE - 1);
-   last_pos = start_pos + ((u64)num_pages << PAGE_CACHE_SHIFT) - 1;
+   start_pos = pos & ~((u64)root->sectorsize - 1);
+   last_pos = start_pos
+   + ALIGN(pos + write_bytes - start_pos, root->sectorsize) - 1;
 
-   if (start_pos < inode->i_size) {
+   if (start_pos < inode->i_size) {
struct btrfs_ordered_extent *ordered;
lock_extent_bits(&BTRFS_I(inode)->io_tree,
 start_pos, last_pos, 0, cached_state);
@@ -1494,6 +1497,7 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
 
while (iov_iter_count(i) > 0) {
size_t offset = pos & (PAGE_CACHE_SIZE - 1);
+   size_t sector_offset;
size_t write_bytes = min(iov_iter_count(i),
 nrptrs * (size_t)PAGE_CACHE_SIZE -
 offset);
@@ -1514,7 +1518,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
break;
}
 
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+   sector_offset = pos & (root->sectorsize - 1);
+   reserve_bytes = ALIGN(write_bytes + sector_offset, root->sectorsize);
+
ret = btrfs_check_data_free_space(inode, reserve_bytes);
if (ret == -ENOSPC &&
(BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
@@ -1529,7 +1535,9 @@ static noinline ssize_t __btrfs_buffered_write(struct file *file,
num_pages = (write_bytes + offset +
 PAGE_CACHE_SIZE - 1) >>
PAGE_CACHE_SHIFT;
-   reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
+
+   reserve_bytes = ALIGN(write_bytes + sector_offset,
+   root->sectorsize);
ret = 0;
} else {
ret = -ENOSPC;
@@ -1564,8 +1572,8 @@ again:
break;
 
ret = lock_and_cleanup_extent_if_need(inode, pages, num_pages,
- pos, &lockstart, &lockend,
- &cached_state);
+   pos, write_bytes, &lockstart, &lockend,
+   &cached_state);
if (ret < 0) {
if (ret == -EAGAIN)
goto again;
@@ -1602,9 +1610,9 @@ again:
 * we still have an outstanding extent for the chunk we actually
 * managed to copy.
 */
-   if (num_pages > dirty_pages) {
-   release_bytes = (num_pages - dirty_pages) <<
-   PAGE_CACHE_SHIFT;
+   if (write_bytes > copied) {
+   release_bytes = (write_bytes - copied)
+   & ~((u64)root->sectorsize - 1);
if (copied > 0) {
spin_lock(&BTRFS_I(inode)->lock);
BTRFS_I(inode)->outstanding_extents++;
@@ -1618,7 +1626,7 @@ again:
 release_bytes);
}
 
-   release_bytes = dirty_pages << PAGE_CACHE_SHIFT;
+   release_bytes = ALIGN(copied + sector_offset, root->sectorsize);
 
if (copied > 0)
ret = btrfs_dirty_pages(root, inode, pages,
@@ -1640,7 +1648,7 @@ again:
if (only_release_metadata && copied > 0) {
u64 lockstart = round_down(pos, root->sectorsize);
u64 lockend = lockstart +
-   (dirty_pages << PAGE_CACHE_SHIFT) - 1;
+   ALIGN(copied, root->sectorsize) - 1;
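
To make the effect of the patch concrete, here is a small sketch of the rounding arithmetic it switches to. It is a sketch under assumptions, not the kernel code: the helper names are hypothetical, the unit (page size or sector size) is assumed to be a power of two, and only the size calculations from the hunks above are modelled.

```c
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Bytes to reserve for a write of write_bytes at file offset pos when
 * reservations are made in multiples of 'unit'. With unit = page size
 * this matches the old num_pages << PAGE_CACHE_SHIFT result; with
 * unit = sector size it matches ALIGN(write_bytes + sector_offset, ...).
 */
static uint64_t reserve_bytes(uint64_t pos, uint64_t write_bytes,
			      uint64_t unit)
{
	uint64_t offset_in_unit = pos & (unit - 1);

	return align_up(write_bytes + offset_in_unit, unit);
}

/* First byte of the locked range: pos rounded down to the unit. */
static uint64_t lock_start(uint64_t pos, uint64_t unit)
{
	return pos & ~(unit - 1);
}

/* Last byte of the locked range, as computed in the first hunk. */
static uint64_t lock_end(uint64_t pos, uint64_t write_bytes, uint64_t unit)
{
	uint64_t start = lock_start(pos, unit);

	return start + align_up(pos + write_bytes - start, unit) - 1;
}
```

For a 100-byte write at offset 5000, page-sized units (4096) reserve 4096 bytes, while 2048-byte sectors reserve 2048 bytes and lock the range 4096..6143.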
 
   

[RFC PATCH V10 16/19] Btrfs: subpagesize-blocksize: Track blocks of ordered extent submitted for write I/O.

2014-12-10 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, the following command (with 4k as the
PAGE_SIZE and 2k as the block size) can cause false accounting of blocks of an
ordered extent that is written to disk:

$ xfs_io -f -c "pwrite 0 10240" \
-c "sync_range 0 4096" \
-c "sync_range 8192 2048" \
-c "pwrite 10240 2048" \
-c "sync_range 10240 2048" \
/mnt/btrfs/file.bin

To fix this, we would have to explicitly track the blocks of an ordered extent
that have already been submitted for write I/O.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c| 24 ++--
 fs/btrfs/ordered-data.c |  4 +++-
 fs/btrfs/ordered-data.h |  4 
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f9172aa..bc4dd46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3227,6 +3227,8 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
u64 extent_offset;
u64 extent_end;
u64 iosize;
+   u64 blk, nr_blks;
+   u64 blk_submitted;
sector_t sector;
struct extent_state *cached_state = NULL;
struct block_device *bdev;
@@ -3293,11 +3295,26 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
iosize = min(extent_end - cur, end - cur + 1);
iosize = ALIGN(iosize, blocksize);
 
+   blk = extent_offset >> inode->i_sb->s_blocksize_bits;
+   nr_blks = iosize >> inode->i_sb->s_blocksize_bits;
+
+   blk_submitted = find_next_bit(ordered->blocks_submitted,
+   ordered->len >> inode->i_sb->s_blocksize_bits,
+   blk);
+   if (blk_submitted < blk + nr_blks) {
+   if (blk_submitted == blk) {
+   cur += blocksize;
+   btrfs_put_ordered_extent(ordered);
+   continue;
+   }
+   iosize = (blk_submitted - blk)
+   << inode->i_sb->s_blocksize_bits;
+   nr_blks = iosize >> inode->i_sb->s_blocksize_bits;
+   }
+
sector = (ordered->start + extent_offset) >> 9;
bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
compressed = test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags);
-   btrfs_put_ordered_extent(ordered);
-   ordered = NULL;
 
/*
 * compressed and inline extents are written through other
@@ -3310,6 +3327,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 */
nr++;
cur += iosize;
+   btrfs_put_ordered_extent(ordered);
continue;
}
 
@@ -3324,6 +3342,8 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
} else {
unsigned long max_nr = (i_size >> PAGE_CACHE_SHIFT) + 1;
 
+   bitmap_set(ordered->blocks_submitted, blk, nr_blks);
+   btrfs_put_ordered_extent(ordered);
set_range_writeback(tree, cur, cur + iosize - 1);
if (!PageWriteback(page)) {
btrfs_err(BTRFS_I(inode)->root->fs_info,
diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4d9832f..59b2544 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -199,13 +199,15 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
nr_longs = BITS_TO_LONGS(len >> inode->i_sb->s_blocksize_bits);
if (nr_longs == 1) {
entry->blocks_done = &entry->blocks_bitmap;
+   entry->blocks_submitted = &entry->blocks_submitted_bitmap;
} else {
-   entry->blocks_done = kzalloc(nr_longs * sizeof(unsigned long),
+   entry->blocks_done = kzalloc(2 * nr_longs * sizeof(unsigned long),
GFP_NOFS);
if (!entry->blocks_done) {
kmem_cache_free(btrfs_ordered_extent_cache, entry);
return -ENOMEM;
}
+   entry->blocks_submitted = entry->blocks_done + nr_longs;
}
 
entry->file_offset = file_offset;
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 7de3b1e..851914c 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,10 @@ struct btrfs_ordered_extent {
/* bitmap to track the blocks that have been written to disk */
unsigned long *blocks_done;
unsigned long blocks_bitmap;
+
+   /* bitmap to track the blocks that have been submitted for write i/o */
+   unsigned long *blocks_submitted;
+   unsigned
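
The idea behind the blocks_submitted bitmap can be sketched outside the kernel as follows. This is an illustration under assumptions, not the patch itself: the bitmap is limited to one 64-bit word, next_set_bit is a stand-in for the kernel's find_next_bit(), and the other helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Stand-in for find_next_bit() on a bitmap that fits in one 64-bit word:
 * index of the first set bit at or after 'from', or 'nbits' if none.
 * Assumes nbits <= 64.
 */
static unsigned next_set_bit(uint64_t map, unsigned nbits, unsigned from)
{
	for (unsigned i = from; i < nbits; i++)
		if (map & (UINT64_C(1) << i))
			return i;
	return nbits;
}

/*
 * How many blocks starting at 'blk' may be submitted now.
 * Returns nr_blks if none of them was submitted before, a shorter count
 * if a later block in the range was already submitted, and 0 if 'blk'
 * itself was already submitted (the caller would skip that block).
 */
static unsigned blocks_to_submit(uint64_t submitted, unsigned total_blks,
				 unsigned blk, unsigned nr_blks)
{
	unsigned first = next_set_bit(submitted, total_blks, blk);

	if (first >= blk + nr_blks)
		return nr_blks;
	return first - blk;
}

/* Record blocks [blk, blk + nr_blks) as submitted for write I/O. */
static uint64_t mark_submitted(uint64_t submitted, unsigned blk,
			       unsigned nr_blks)
{
	for (unsigned i = blk; i < blk + nr_blks; i++)
		submitted |= UINT64_C(1) << i;
	return submitted;
}
```

With an empty bitmap, a request for blocks 0..1 is allowed in full. Once those two blocks are marked, asking for them again yields 0, so they are not accounted a second time.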

[RFC PATCH V10 01/19] Btrfs: subpagesize-blocksize: Get rid of whole page reads.

2014-12-10 Thread Chandan Rajendra
Based on original patch from Aneesh Kumar K.V 

For the subpagesize-blocksize scenario, a page can contain multiple
blocks. This patch handles this case.

This patch adds the new EXTENT_READ_IO extent state bit to reliably unlock
pages in readpage's end bio function.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 182 ---
 fs/btrfs/extent_io.h |   5 +-
 2 files changed, 89 insertions(+), 98 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index a389820..c98dfd8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1951,14 +1951,23 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
  * helper function to set a given page up to date if all the
  * extents in the tree for that page are up to date
  */
-static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
+static void check_page_uptodate(struct extent_io_tree *tree, struct page *page,
+   struct extent_state *cached)
 {
u64 start = page_offset(page);
u64 end = start + PAGE_CACHE_SIZE - 1;
-   if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
+   if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, cached))
SetPageUptodate(page);
 }
 
+static int page_read_complete(struct extent_io_tree *tree, struct page *page)
+{
+   u64 start = page_offset(page);
+   u64 end = start + PAGE_CACHE_SIZE - 1;
+
+   return !test_range_bit(tree, start, end, EXTENT_READ_IO, 0, NULL);
+}
+
 /*
  * When IO fails, either with EIO or csum verification fails, we
  * try other mirrors that might have a good copy of the data.  This
@@ -2275,7 +2284,9 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 *  a) deliver good data to the caller
 *  b) correct the bad sectors on disk
 */
-   if (failed_bio->bi_vcnt > 1) {
+   if ((failed_bio->bi_vcnt > 1)
+   || (failed_bio->bi_io_vec->bv_len
+   > BTRFS_I(inode)->root->sectorsize)) {
/*
 * to fulfill b), we need to know the exact failing sectors, as
 * we don't want to rewrite any more than the failed ones. thus,
@@ -2422,18 +2433,6 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
bio_put(bio);
 }
 
-static void
-endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
- int uptodate)
-{
-   struct extent_state *cached = NULL;
-   u64 end = start + len - 1;
-
-   if (uptodate && tree->track_uptodate)
-   set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
-   unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
-}
-
 /*
  * after a readpage IO is done, we need to:
  * clear the uptodate bits on error
@@ -2450,14 +2449,15 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
struct bio_vec *bvec;
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   struct extent_state *cached = NULL;
struct extent_io_tree *tree;
+   unsigned long flags;
u64 offset = 0;
u64 start;
u64 end;
-   u64 len;
-   u64 extent_start = 0;
-   u64 extent_len = 0;
+   int nr_sectors;
int mirror;
+   int unlock;
int ret;
int i;
 
@@ -2467,54 +2467,31 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
struct inode *inode = page->mapping->host;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
 
pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
 "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
 io_bio->mirror_num);
tree = &BTRFS_I(inode)->io_tree;
 
-   /* We always issue full-page reads, but if some block
-* in a page fails to read, blk_update_request() will
-* advance bv_offset and adjust bv_len to compensate.
-* Print a warning for nonzero offsets, and an error
-* if they don't add up to a full page.  */
-   if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
-   if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
-   btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-  "partial page read in btrfs with offset %u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   else
-   btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-  "incomplete page read in btrfs with offset 
%u an

[RFC PATCH V10 15/19] Btrfs: subpagesize-blocksize: Revert commit fc4adbff823f76577ece26dcb88bf6f8392dbd43.

2014-12-10 Thread Chandan Rajendra
In subpagesize-blocksize, we have multiple blocks in a page. Checking for
existence of a page in the page cache isn't a sufficient check, since we
could be truncating a subset of the blocks mapped by the page.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/btrfs_inode.h |  2 --
 fs/btrfs/file.c|  4 ++-
 fs/btrfs/inode.c   | 77 +++---
 3 files changed, 7 insertions(+), 76 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 43527fd..50497bf 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -278,6 +278,4 @@ static inline void btrfs_inode_resume_unlocked_dio(struct inode *inode)
  &BTRFS_I(inode)->runtime_flags);
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
-
 #endif
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b1e0d27..3707515 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2314,7 +2314,9 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if ((!ordered ||
(ordered->file_offset + ordered->len <= lockstart ||
 ordered->file_offset > lockend)) &&
-!btrfs_page_exists_in_range(inode, lockstart, lockend)) {
+!test_range_bit(&BTRFS_I(inode)->io_tree, lockstart,
+lockend, EXTENT_UPTODATE, 0,
+cached_state)) {
if (ordered)
btrfs_put_ordered_extent(ordered);
break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e0dd338..b236417 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6832,76 +6832,6 @@ out:
return ret;
 }
 
-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
-{
-   struct radix_tree_root *root = &inode->i_mapping->page_tree;
-   int found = false;
-   void **pagep = NULL;
-   struct page *page = NULL;
-   int start_idx;
-   int end_idx;
-
-   start_idx = start >> PAGE_CACHE_SHIFT;
-
-   /*
-* end is the last byte in the last page.  end == start is legal
-*/
-   end_idx = end >> PAGE_CACHE_SHIFT;
-
-   rcu_read_lock();
-
-   /* Most of the code in this while loop is lifted from
-* find_get_page.  It's been modified to begin searching from a
-* page and return just the first page found in that range.  If the
-* found idx is less than or equal to the end idx then we know that
-* a page exists.  If no pages are found or if those pages are
-* outside of the range then we're fine (yay!) */
-   while (page == NULL &&
-  radix_tree_gang_lookup_slot(root, &pagep, NULL, start_idx, 1)) {
-   page = radix_tree_deref_slot(pagep);
-   if (unlikely(!page))
-   break;
-
-   if (radix_tree_exception(page)) {
-   if (radix_tree_deref_retry(page)) {
-   page = NULL;
-   continue;
-   }
-   /*
-* Otherwise, shmem/tmpfs must be storing a swap entry
-* here as an exceptional entry: so return it without
-* attempting to raise page count.
-*/
-   page = NULL;
-   break; /* TODO: Is this relevant for this use case? */
-   }
-
-   if (!page_cache_get_speculative(page)) {
-   page = NULL;
-   continue;
-   }
-
-   /*
-* Has the page moved?
-* This is part of the lockless pagecache protocol. See
-* include/linux/pagemap.h for details.
-*/
-   if (unlikely(page != *pagep)) {
-   page_cache_release(page);
-   page = NULL;
-   }
-   }
-
-   if (page) {
-   if (page->index <= end_idx)
-   found = true;
-   page_cache_release(page);
-   }
-
-   rcu_read_unlock();
-   return found;
-}
-
 static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
  struct extent_state **cached_state, int writing)
 {
@@ -6926,9 +6856,10 @@ static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
 * invalidate needs to happen so that reads after a write do not
 * get stale data.
 */
-   if (!ordered &&
-   (!writing ||
-!btrfs_page_exists_in_range(inode, lockstart, lockend)))
+   if (!ordered && (!writing ||
+   !test_range_bit(&BTRFS_I(inode)->io_tree,
+   

[RFC PATCH V10 10/19] Btrfs: subpagesize-blocksize: fallocate: Work with sectorsized units.

2014-12-10 Thread Chandan Rajendra
While at it, this commit changes btrfs_truncate_page() to truncate sectorsized
blocks instead of pages. Hence the function has been renamed to
btrfs_truncate_block().

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/file.c  | 41 ++---
 fs/btrfs/inode.c | 48 +---
 3 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5b7b7ca..59779dc 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3815,7 +3815,7 @@ int btrfs_unlink_subvol(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct inode *dir, u64 objectid,
const char *name, int name_len);
-int btrfs_truncate_page(struct inode *inode, loff_t from, loff_t len,
+int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len,
int front);
 int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 444819d..b1e0d27 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2200,21 +2200,24 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
u64 tail_len;
u64 orig_start = offset;
u64 cur_offset;
+   unsigned char blocksize_bits;
u64 min_size = btrfs_calc_trunc_metadata_size(root, 1);
u64 drop_end;
int ret = 0;
int err = 0;
int rsv_count;
-   bool same_page;
+   bool same_block;
bool no_holes = btrfs_fs_incompat(root->fs_info, NO_HOLES);
u64 ino_size;
 
+   blocksize_bits = inode->i_sb->s_blocksize_bits;
+
ret = btrfs_wait_ordered_range(inode, offset, len);
if (ret)
return ret;
 
mutex_lock(&inode->i_mutex);
-   ino_size = round_up(inode->i_size, PAGE_CACHE_SIZE);
+   ino_size = round_up(inode->i_size, root->sectorsize);
ret = find_first_non_hole(inode, &offset, &len);
if (ret < 0)
goto out_only_mutex;
@@ -2224,29 +2227,28 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
goto out_only_mutex;
}
 
-   lockstart = round_up(offset , BTRFS_I(inode)->root->sectorsize);
+   lockstart = round_up(offset, BTRFS_I(inode)->root->sectorsize);
lockend = round_down(offset + len,
 BTRFS_I(inode)->root->sectorsize) - 1;
-   same_page = ((offset >> PAGE_CACHE_SHIFT) ==
-   ((offset + len - 1) >> PAGE_CACHE_SHIFT));
-
+   same_block = ((offset >> blocksize_bits)
+   == ((offset + len - 1) >> blocksize_bits));
/*
-* We needn't truncate any page which is beyond the end of the file
+* We needn't truncate any block which is beyond the end of the file
 * because we are sure there is no data there.
 */
/*
-* Only do this if we are in the same page and we aren't doing the
-* entire page.
+* Only do this if we are in the same block and we aren't doing the
+* entire block.
 */
-   if (same_page && len < PAGE_CACHE_SIZE) {
+   if (same_block && len < root->sectorsize) {
if (offset < ino_size)
-   ret = btrfs_truncate_page(inode, offset, len, 0);
+   ret = btrfs_truncate_block(inode, offset, len, 0);
goto out_only_mutex;
}
 
-   /* zero back part of the first page */
+   /* zero back part of the first block */
if (offset < ino_size) {
-   ret = btrfs_truncate_page(inode, offset, 0, 0);
+   ret = btrfs_truncate_block(inode, offset, 0, 0);
if (ret) {
mutex_unlock(&inode->i_mutex);
return ret;
@@ -2281,11 +2283,12 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (!ret) {
/* zero the front end of the last page */
if (tail_start + tail_len < ino_size) {
-   ret = btrfs_truncate_page(inode,
-   tail_start + tail_len, 0, 1);
+   ret = btrfs_truncate_block(inode,
+   tail_start + tail_len,
+   0, 1);
if (ret)
goto out_only_mutex;
-   }
+   }
}
}
 
@@ -2506,10 +2509,10 @@ static long btrfs_fallocate(struct file *file, int mode,
} else {
/*
 * If we are fallocating from the end of the file onward we
-* need to zero out the en
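
The switch from same_page to same_block in btrfs_punch_hole() changes the outcome whenever a hole fits inside one page but straddles a block boundary. A minimal sketch of that check, with a hypothetical helper name and the assumption that len is greater than zero:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if the byte range [offset, offset + len) lies within a single
 * unit of size (1 << bits). Use bits = 12 for 4k pages and bits = 11
 * for 2k blocks. Assumes len > 0.
 */
static bool same_unit(uint64_t offset, uint64_t len, unsigned bits)
{
	return (offset >> bits) == ((offset + len - 1) >> bits);
}
```

A 1024-byte hole at offset 1536 sits inside a single 4k page (bits = 12) but spans two 2k blocks (bits = 11), so it takes the short single-unit path only in the page-based check.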

[RFC PATCH V10 00/19] Btrfs: Subpagesize-blocksize: Get rid of whole page I/O.

2014-12-10 Thread Chandan Rajendra
This patchset continues with the work posted earlier at
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg38862.html.

Changes from V9:
1. Earlier, in read_extent_buffer_pages(), we used to check the extent buffer
   pages' PG_uptodate flag immediately after the page was unlocked by the
   endio function. However, the PG_uptodate flag is set on the pages only
   after the read operation on all pages completes successfully and the
   extent buffer's contents have been verified. Fix this by checking only
   the EXTENT_BUFFER_UPTODATE flag in read_extent_buffer_pages().
2. Add the new EXTENT_READ_IO extent state bit to reliably unlock pages in
   readpage's end bio function.
3. Use (eb->start, seq) as search key for tree modification log.
4. btrfs_submit_direct_hook: Prevent zero length bios from being submitted
   when map_length < bio vector length.
5. Enabled POSIX ACL support.   

Changes from V8:
1. In subpagesize-blocksize scenario, prevent writes to an extent
   buffer when the corresponding page's PG_writeback flag is set. This
   race condition was triggered when running xfstests' generic/083
   test. With the new patch applied, I have run the complete xfstests
   suite as well as run generic/083 multiple times on both 4k and 2k
   block size setups. There were 2 unrelated test failures that
   occurred rarely, but they were reproducible even when the patch was
   not applied.

Changes from V7:
1. Fix a softlockup issue that occurred because the page corresponding
   to the delalloc region did not exist. This bug was introduced by
   the code added in btrfs_invalidatepage and related functions in
   V7.

Changes from V6:
1. Fix a softlockup issue that occurred while unmounting a 4k blocksized
   filesystem instance.
2. Track blocks of an ordered extent submitted for write I/O to avoid
   I/O resubmission in certain scenarios.

Changes from V5:
1. Rebased patchset on top of current btrfs-next tree (i.e. commit
   8d875f95da43c6a8f18f77869f2ef26e9594fecc). This involved using
   "immutable biovecs".
2. Deal with partially allocated ordered extents across a page.
3. Explicitly track I/O status of blocks of an ordered extent.

Changes from V4:
1. V2's "Btrfs: subpagesize-blocksize: Get rid of whole page reads"
   patch was incorrectly replaced with an older version when working
   on V3 patches. Fix this.
2. Fix btrfs_endio_direct_read() to compute checksums for all possible
   blocks in a page.

Changes from V3:
1. Get "Hole punching" and "Extent preallocation" to work correctly in
   subpagesize-blocksize scenario.
2. Get btrfs_page_mkwrite() to reserve space in sectorsized units.

Changes from V2:
1. Get __extent_writepage() to write only the dirty blocks of a page.
2. Fix "page private not zero on page" warning message which is printed
   when running xfstests.

Changes from V1:
1. Remove usage of bio_vec->bv_{len,offset} in end_bio_extent_readpage()
   and end_bio_extent_writepage().

Xfstests' generic tests were run on an x86_64 machine with the patches
applied for blocksizes 2k and 4k.

For 2k blocksize, the following xfstests' generic tests failed:
1. generic/068

The following xfstests' generic tests failed for both 2k and 4k blocksize:
1. generic/008 - FALLOC_FL_ZERO_RANGE is not supported by Btrfs.
2. generic/091 - FALLOC_FL_ZERO_RANGE is not supported by Btrfs.
3. generic/224
   This looks mostly like an issue caused by non-btrfs code, as the test
   failed for the exact same reason when run on an ext4 filesystem instance.
4. generic/263 - FALLOC_FL_ZERO_RANGE is not supported by Btrfs.
5. generic/274
   This test very rarely results in a hung task.

The following is a list of known TODO items which will be implemented in
future revisions of this patchset:
1. The xfstests suite used here was based on the corresponding git tree as of
   March 2014. Pull the latest from the xfstests git tree and execute the
   tests.
2. Rebase the patches on top of the current linux-btrfs/next branch.
3. Get Xfstests' generic tests to successfully run on both 4k and 2k
   blocksizes.
4. Remove PAGE_CACHE_SIZE delalloc reservation in
   btrfs_writepage_fixup_worker().
5. Create separate slab caches for 'extent buffer head' and 'extent buffer'.
6. Add 'leak list' tracking for 'extent buffer' instances.
7. Rename EXTENT_BUFFER_TREE_REF and EXTENT_BUFFER_IN_TREE to
   EXTENT_BUFFER_HEAD_TREE_REF and EXTENT_BUFFER_HEAD_IN_TREE respectively.
   
Chandan Rajendra (17):
  Btrfs: subpagesize-blocksize: Get rid of whole page reads.
  Btrfs: subpagesize-blocksize: Get rid of whole page writes.
  Btrfs: subpagesize-blocksize: __btrfs_buffered_write: Reserve/release
extents aligned to block size.
  Btrfs: subpagesize-blocksize: Read tree blocks whose size is
start, seq) as search key for
tree modification log.
  Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle
map_length < bio vector length

Chandra Seetharaman (2):
  Btrfs: subpagesize-blocksize: Define extent_buffer_head.
  B

[RFC PATCH V10 04/19] Btrfs: subpagesize-blocksize: Define extent_buffer_head.

2014-12-10 Thread Chandan Rajendra
From: Chandra Seetharaman 

In order to handle multiple extent buffers per page, first we need to create a
way to handle all the extent buffers that are attached to a page.

This patch creates a new data structure 'struct extent_buffer_head', and moves
fields that are common to all extent buffers in a page from 'struct
extent_buffer' to 'struct extent_buffer_head'.

Also, this patch moves EXTENT_BUFFER_TREE_REF, EXTENT_BUFFER_DUMMY and
EXTENT_BUFFER_IN_TREE flags from extent_buffer->ebflags  to
extent_buffer_head->bflags.
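
To illustrate why the header accessors changed by this patch add an in-page
offset: once several extent buffers share one page, a buffer's header no
longer starts at the first byte of the page. The following is a minimal
userspace sketch, not kernel code. It assumes a 4096-byte page, and both
SKETCH_PAGE_SIZE and eb_offset_in_page() are hypothetical names chosen for
this example; only the masking expression is taken from the diff below.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel uses PAGE_CACHE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Byte offset of an extent buffer inside its page, derived from the
 * buffer's logical start address. This mirrors the expression
 * (eb->start & (PAGE_CACHE_SIZE - 1)) used by the patched accessors.
 */
static inline uint32_t eb_offset_in_page(uint64_t eb_start)
{
	/* The mask trick is only valid for power-of-two page sizes. */
	assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);
	return (uint32_t)(eb_start & (SKETCH_PAGE_SIZE - 1));
}
```

With a 2k block size, the second extent buffer of a page would start 2048
bytes into that page, which is the value the accessors add to
page_address().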

Signed-off-by: Chandra Seetharaman 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/backref.c   |   2 +-
 fs/btrfs/ctree.c |   2 +-
 fs/btrfs/ctree.h |   6 +-
 fs/btrfs/disk-io.c   |  46 --
 fs/btrfs/extent-tree.c   |   6 +-
 fs/btrfs/extent_io.c | 373 +--
 fs/btrfs/extent_io.h |  47 --
 fs/btrfs/volumes.c   |   2 +-
 include/trace/events/btrfs.h |   2 +-
 9 files changed, 328 insertions(+), 158 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 54a201d..1d3d5d6 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1305,7 +1305,7 @@ char *btrfs_ref_to_path(struct btrfs_root *fs_root, struct btrfs_path *path,
eb = path->nodes[0];
/* make sure we can use eb after releasing the path */
if (eb != eb_in) {
-   atomic_inc(&eb->refs);
+   atomic_inc(&eb_head(eb)->refs);
btrfs_tree_read_lock(eb);
btrfs_set_lock_blocking_rw(eb, BTRFS_READ_LOCK);
}
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 44ee5d2..693b541 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -169,7 +169,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root *root)
 * the inc_not_zero dance and if it doesn't work then
 * synchronize_rcu and try again.
 */
-   if (atomic_inc_not_zero(&eb->refs)) {
+   if (atomic_inc_not_zero(&eb_head(eb)->refs)) {
rcu_read_unlock();
break;
}
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8e29b61..5b7b7ca 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2215,14 +2215,16 @@ static inline void btrfs_set_token_##name(struct extent_buffer *eb, \
 #define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits)\
 static inline u##bits btrfs_##name(struct extent_buffer *eb)   \
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_CACHE_SIZE -1)); \
u##bits res = le##bits##_to_cpu(p->member); \
return res; \
 }  \
 static inline void btrfs_set_##name(struct extent_buffer *eb,  \
u##bits val)\
 {  \
-   type *p = page_address(eb->pages[0]);   \
+   type *p = page_address(eb_head(eb)->pages[0]) + \
+   (eb->start & (PAGE_CACHE_SIZE -1)); \
p->member = cpu_to_le##bits(val);   \
 }
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d0ed9e6..3a79833 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1030,13 +1030,21 @@ static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
struct extent_buffer *eb;
+   int i, dirty = 0;
 
BUG_ON(!PagePrivate(page));
eb = (struct extent_buffer *)page->private;
BUG_ON(!eb);
-   BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-   BUG_ON(!atomic_read(&eb->refs));
-   btrfs_assert_tree_locked(eb);
+
+   do {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
+   if (dirty)
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   BUG_ON(!dirty);
+   BUG_ON(!atomic_read(&(eb_head(eb)->refs)));
+   btrfs_assert_tree_locked(&ebh->eb);
 #endif
return __set_page_dirty_nobuffers(page);
 }
@@ -1080,7 +1088,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
if (!buf)
return 0;
 
-   set_bit(EXTENT_BUFFER_READAHEAD, &buf->bflags);
+   set_bit(EXTENT_BUFFER_READAHEAD, &buf->ebflags);
 
ret = read_extent_buffer_pages(io_tree, buf, 0, WAIT_PAGE_LOCK,
   btree_get_extent, mirror_num);
@@ -1089,7 +1097,7 @@ int reada_tree_bl

[RFC PATCH V10 14/19] Btrfs: subpagesize-blocksize: Explicitly Track I/O status of blocks of an ordered extent.

2014-12-10 Thread Chandan Rajendra
In subpagesize-blocksize scenario a page can have more than one block. So
in addition to PagePrivate2 flag, we would have to track the I/O status of
each block of a page to reliably mark the ordered extent as complete.
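
The per-block tracking described above can be sketched in userspace C as
follows. This is only an illustration under stated assumptions: struct
ordered_sketch, MAX_BLOCKS, test_and_set_block() and mark_blocks_done() are
hypothetical stand-ins, and the bit operation here is not atomic, unlike the
kernel's test_and_set_bit() used in the diff below.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MAX_BLOCKS 64 /* arbitrary limit chosen for this sketch */

/* Hypothetical stand-in for the state kept per ordered extent. */
struct ordered_sketch {
	unsigned long blocks_done[(MAX_BLOCKS + BITS_PER_WORD - 1) / BITS_PER_WORD];
	size_t pending; /* blocks whose I/O has not completed yet */
};

/* Set the bit for blk and report whether it was already set. */
static inline bool test_and_set_block(unsigned long *map, size_t blk)
{
	unsigned long mask = 1UL << (blk % BITS_PER_WORD);
	unsigned long *word = &map[blk / BITS_PER_WORD];
	bool was_set = (*word & mask) != 0;

	*word |= mask;
	return was_set;
}

/*
 * Mark nr_blks blocks starting at blk as completed. A block that was
 * already marked is skipped, so a repeated completion is not counted
 * twice. Returns true once every block of the extent has completed.
 */
static inline bool mark_blocks_done(struct ordered_sketch *o, size_t blk,
				    size_t nr_blks)
{
	while (nr_blks--) {
		assert(blk < MAX_BLOCKS);
		if (!test_and_set_block(o->blocks_done, blk) && o->pending > 0)
			o->pending--;
		blk++;
	}
	return o->pending == 0;
}
```

The point of the bitmap is that a page-level flag alone cannot tell which of
the blocks in a page have finished, while one bit per block can.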

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c|  19 +--
 fs/btrfs/extent_io.h|   5 +-
 fs/btrfs/inode.c| 338 +++-
 fs/btrfs/ordered-data.c |  17 +++
 fs/btrfs/ordered-data.h |   4 +
 5 files changed, 285 insertions(+), 98 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 51ab453..f9172aa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4302,11 +4302,10 @@ int extent_invalidatepage(struct extent_io_tree *tree,
  * to drop the page.
  */
 static int try_release_extent_state(struct extent_map_tree *map,
-   struct extent_io_tree *tree,
-   struct page *page, gfp_t mask)
+   struct extent_io_tree *tree,
+   struct page *page, u64 start, u64 end,
+   gfp_t mask)
 {
-   u64 start = page_offset(page);
-   u64 end = start + PAGE_CACHE_SIZE - 1;
int ret = 1;
 
if (test_range_bit(tree, start, end,
@@ -4340,12 +4339,12 @@ static int try_release_extent_state(struct extent_map_tree *map,
  * map records are removed
  */
 int try_release_extent_mapping(struct extent_map_tree *map,
-  struct extent_io_tree *tree, struct page *page,
-  gfp_t mask)
+   struct extent_io_tree *tree, struct page *page,
+   u64 start, u64 end, gfp_t mask)
 {
struct extent_map *em;
-   u64 start = page_offset(page);
-   u64 end = start + PAGE_CACHE_SIZE - 1;
+   u64 orig_start = start;
+   u64 orig_end = end;
 
if ((mask & __GFP_WAIT) &&
page->mapping->host->i_size > 16 * 1024 * 1024) {
@@ -4379,7 +4378,9 @@ int try_release_extent_mapping(struct extent_map_tree *map,
free_extent_map(em);
}
}
-   return try_release_extent_state(map, tree, page, mask);
+   return try_release_extent_state(map, tree, page,
+   orig_start, orig_end,
+   mask);
 }
 
 /*
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 264dfd4..15bb2a7 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -209,8 +209,9 @@ typedef struct extent_map *(get_extent_t)(struct inode *inode,
 void extent_io_tree_init(struct extent_io_tree *tree,
 struct address_space *mapping);
 int try_release_extent_mapping(struct extent_map_tree *map,
-  struct extent_io_tree *tree, struct page *page,
-  gfp_t mask);
+   struct extent_io_tree *tree, struct page *page,
+   u64 start, u64 end,
+   gfp_t mask);
 int try_release_extent_buffer(struct page *page);
 int lock_extent(struct extent_io_tree *tree, u64 start, u64 end);
 int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4ed78dd..e0dd338 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2827,51 +2827,115 @@ static void finish_ordered_fn(struct btrfs_work *work)
btrfs_finish_ordered_io(ordered_extent);
 }
 
-static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
+static void mark_blks_io_complete(struct btrfs_ordered_extent *ordered,
+   u64 blk, u64 nr_blks, int uptodate)
+{
+   struct inode *inode = ordered->inode;
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_workqueue *workers;
+   int done;
+
+   while (nr_blks--) {
+   if (test_and_set_bit(blk, ordered->blocks_done)) {
+   blk++;
+   continue;
+   }
+
+   done = btrfs_dec_test_ordered_pending(inode, &ordered,
+   ordered->file_offset
+   + (blk << inode->i_sb->s_blocksize_bits),
+   root->sectorsize,
+   uptodate);
+   if (done) {
+   btrfs_init_work(&ordered->work, finish_ordered_fn,
+   NULL, NULL);
+
+   ordered->work.func = finish_ordered_fn;
+   ordered->work.flags = 0;
+
+   if (btrfs_is_free_space_inode(inode))
+   workers = root->fs_info->endio_freespace_worker;
+   else
+   workers = root->fs_info->endio_write_workers;
+
+   

[RFC PATCH V10 07/19] Btrfs: subpagesize-blocksize: Allow mounting filesystems where sectorsize != PAGE_SIZE

2014-12-10 Thread Chandan Rajendra
From: Chandra Seetharaman 

This patch allows mounting filesystems with a block size smaller than
PAGE_SIZE.

Signed-off-by: Chandra Seetharaman 
Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6c6e8bb..2f3caaf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2634,12 +2634,6 @@ int open_ctree(struct super_block *sb,
goto fail_sb_buffer;
}
 
-   if (sectorsize != PAGE_SIZE) {
-   printk(KERN_WARNING "BTRFS: Incompatible sector size(%lu) "
-  "found on %s\n", (unsigned long)sectorsize, sb->s_id);
-   goto fail_sb_buffer;
-   }
-
mutex_lock(&fs_info->chunk_mutex);
ret = btrfs_read_sys_array(tree_root);
mutex_unlock(&fs_info->chunk_mutex);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH V10 17/19] Btrfs: subpagesize-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set.

2014-12-10 Thread Chandan Rajendra
In non-subpagesize-blocksize scenario, BTRFS_HEADER_FLAG_WRITTEN flag prevents
Btrfs code from writing into an extent buffer whose pages are under
writeback. This facility isn't sufficient for achieving the same in
subpagesize-blocksize scenario, since we have more than one extent buffer
mapped to a page.

Hence this patch adds a new flag (i.e. EXTENT_BUFFER_HEAD_WRITEBACK) and
corresponding code to track the writeback status of the page and to prevent
writes to any of the extent buffers mapped to the page while writeback is
going on.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.c   |  20 ++-
 fs/btrfs/extent-tree.c |  12 
 fs/btrfs/extent_io.c   | 153 +++--
 fs/btrfs/extent_io.h   |   2 +
 4 files changed, 157 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 693b541..75129da 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1543,6 +1543,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
struct extent_buffer *parent, int parent_slot,
struct extent_buffer **cow_ret)
 {
+   struct extent_buffer_head *ebh = eb_head(buf);
u64 search_start;
int ret;
 
@@ -1556,6 +1557,13 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans,
   trans->transid, root->fs_info->generation);
 
if (!should_cow_block(trans, root, buf)) {
+   if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK, &ebh->bflags)) {
+   if (parent)
+   btrfs_set_lock_blocking(parent);
+   btrfs_set_lock_blocking(buf);
+   wait_on_bit(&ebh->bflags, EXTENT_BUFFER_HEAD_WRITEBACK,
+   eb_wait, TASK_UNINTERRUPTIBLE);
+   }
*cow_ret = buf;
return 0;
}
@@ -2687,6 +2695,7 @@ int btrfs_search_slot(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_key *key, struct btrfs_path *p, int
  ins_len, int cow)
 {
+   struct extent_buffer_head *ebh;
struct extent_buffer *b;
int slot;
int ret;
@@ -2789,8 +2798,17 @@ again:
 * then we don't want to set the path blocking,
 * so we test it here
 */
-   if (!should_cow_block(trans, root, b))
+   if (!should_cow_block(trans, root, b)) {
+   ebh = eb_head(b);
+   if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+   &ebh->bflags)) {
+   btrfs_set_path_blocking(p);
+   wait_on_bit(&ebh->bflags,
+   EXTENT_BUFFER_HEAD_WRITEBACK,
+   eb_wait, TASK_UNINTERRUPTIBLE);
+   }
goto cow_done;
+   }
 
btrfs_set_path_blocking(p);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fbcad82..fb5cc46 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7203,14 +7203,26 @@ static struct extent_buffer *
 btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
  u64 bytenr, u32 blocksize, int level)
 {
+   struct extent_buffer_head *ebh;
struct extent_buffer *buf;
 
buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
if (!buf)
return ERR_PTR(-ENOMEM);
+
+   ebh = eb_head(buf);
btrfs_set_header_generation(buf, trans->transid);
btrfs_set_buffer_lockdep_class(root->root_key.objectid, buf, level);
btrfs_tree_lock(buf);
+
+   if (test_bit(EXTENT_BUFFER_HEAD_WRITEBACK,
+   &ebh->bflags)) {
+   btrfs_set_lock_blocking(buf);
+   wait_on_bit(&ebh->bflags,
+   EXTENT_BUFFER_HEAD_WRITEBACK,
+   eb_wait, TASK_UNINTERRUPTIBLE);
+   }
+
clean_tree_block(trans, root, buf);
clear_bit(EXTENT_BUFFER_STALE, &buf->ebflags);
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bc4dd46..598923c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3448,7 +3448,7 @@ done_unlocked:
return 0;
 }
 
-static int eb_wait(void *word)
+int eb_wait(void *word)
 {
io_schedule();
return 0;
@@ -3460,6 +3460,52 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
TASK_UNINTERRUPTIBLE);
 }
 
+static void lock_extent_buffers(struct extent_buffer_head *ebh,
+   struct extent_page_data *epd)
+{
+   struct extent_buffer *locked_eb = NULL;
+   struct extent_buffer *eb;
+a

[RFC PATCH V10 11/19] Btrfs: subpagesize-blocksize: btrfs_page_mkwrite: Reserve space in sectorsized units.

2014-12-10 Thread Chandan Rajendra
In subpagesize-blocksize scenario, if i_size occurs in a block which is not
the last block in the page, then the space to be reserved should be calculated
appropriately.
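
As a rough illustration of the calculation described above, here is a
userspace sketch of the reservation size. It assumes a 4096-byte page and
that the faulting page lies within i_size (the kernel function bails out
earlier otherwise). SKETCH_PAGE_SHIFT, round_up_pow2() and
mkwrite_reserve_size() are hypothetical names for this example; the
arithmetic follows the hunk below.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for this sketch (4096-byte pages). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Round x up to a multiple of align, which must be a power of two. */
static inline uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Bytes to reserve for a write fault on page page_index of a file of
 * i_size bytes. For the page that contains the last byte of the file,
 * only the blocks up to and including the one holding i_size are
 * reserved; every other page reserves a whole page.
 */
static inline uint64_t mkwrite_reserve_size(uint64_t i_size,
					    uint64_t page_index,
					    uint64_t sectorsize)
{
	uint64_t page_start = page_index << SKETCH_PAGE_SHIFT;

	/* Precondition of this sketch: the page starts inside the file. */
	assert(i_size > page_start);

	if (page_index == ((i_size - 1) >> SKETCH_PAGE_SHIFT))
		return round_up_pow2(i_size - page_start, sectorsize);
	return SKETCH_PAGE_SIZE;
}
```

For example, with a 2k sector size and a 1000-byte file, a fault on page 0
would reserve 2048 bytes rather than a full 4096-byte page.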

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7ad7d0f..23ce9ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7812,26 +7812,23 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
loff_t size;
int ret;
int reserved = 0;
+   u64 delalloc_size;
u64 page_start;
u64 page_end;
 
sb_start_pagefault(inode->i_sb);
-   ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
-   if (!ret) {
-   ret = file_update_time(vma->vm_file);
-   reserved = 1;
-   }
+
+   ret = file_update_time(vma->vm_file);
if (ret) {
if (ret == -ENOMEM)
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
-   if (reserved)
-   goto out;
-   goto out_noreserve;
+   goto out;
}
 
ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
+
 again:
lock_page(page);
size = i_size_read(inode);
@@ -7862,6 +7859,19 @@ again:
goto again;
}
 
+   if (page->index == ((size - 1) >> PAGE_CACHE_SHIFT))
+   delalloc_size = round_up(size - page_start, root->sectorsize);
+   else
+   delalloc_size = PAGE_CACHE_SIZE;
+
+   ret = btrfs_delalloc_reserve_space(inode, delalloc_size);
+   if (ret) {
+   /* -ENOSPC */
+   ret = VM_FAULT_SIGBUS;
+   goto out_unlock;
+   }
+   reserved = 1;
+
/*
 * XXX - page_mkwrite gets called every time the page is dirtied, even
 * if it was already dirty, so for space accounting reasons we need to
@@ -7874,7 +7884,8 @@ again:
  EXTENT_DO_ACCOUNTING | EXTENT_DEFRAG,
  0, 0, &cached_state, GFP_NOFS);
 
-   ret = btrfs_set_extent_delalloc(inode, page_start, page_end,
+   ret = btrfs_set_extent_delalloc(inode, page_start,
+   page_start + delalloc_size - 1,
&cached_state);
if (ret) {
unlock_extent_cached(io_tree, page_start, page_end,
@@ -7913,8 +7924,8 @@ out_unlock:
}
unlock_page(page);
 out:
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
-out_noreserve:
+   if (reserved)
+   btrfs_delalloc_release_space(inode, delalloc_size);
sb_end_pagefault(inode->i_sb);
return ret;
 }
-- 
2.1.0



[RFC PATCH V10 19/19] Btrfs: subpagesize-blocksize: btrfs_submit_direct_hook: Handle map_length < bio vector length

2014-12-10 Thread Chandan Rajendra
In subpagesize-blocksize scenario, map_length can be less than the length of a
bio vector. Such a condition may cause btrfs_submit_direct_hook() to submit a
zero length bio. Fix this.
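
The effect of walking a bio vector in block-sized steps can be sketched in
userspace C as below. This is a simplified model, not the kernel function: it
assumes map_length stays constant and is at least one block, whereas the real
code remaps after each submission. struct split_stats and split_by_block()
are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of the simulated split: number of bios and the shortest one. */
struct split_stats {
	unsigned int bios;
	uint64_t min_len;
};

/*
 * Walk each vector in blocksize steps and "submit" the current bio only
 * when adding one more block would exceed map_length. Because the check
 * is made per block and map_length >= blocksize, a bio is never
 * submitted while it is still empty.
 */
static inline struct split_stats split_by_block(const uint32_t *bvec_len,
						size_t nr_vecs,
						uint32_t blocksize,
						uint64_t map_length)
{
	struct split_stats st = { 0, UINT64_MAX };
	uint64_t submit_len = 0;

	assert(blocksize > 0 && map_length >= blocksize);

	for (size_t v = 0; v < nr_vecs; v++) {
		/* Sketch assumption: vectors are whole multiples of a block. */
		assert(bvec_len[v] % blocksize == 0);

		for (uint32_t off = 0; off < bvec_len[v]; off += blocksize) {
			if (submit_len + blocksize > map_length) {
				st.bios++;
				if (submit_len < st.min_len)
					st.min_len = submit_len;
				submit_len = 0;
			}
			submit_len += blocksize;
		}
	}

	/* Submit whatever is left in the last bio. */
	if (submit_len) {
		st.bios++;
		if (submit_len < st.min_len)
			st.min_len = submit_len;
	}
	return st;
}
```

With a single 4096-byte vector, a 2048-byte block size and a map_length of
2048, a page-granular check would find map_length < 0 + 4096 on the very
first vector and submit an empty bio; the block-granular walk produces two
2048-byte bios instead.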

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/inode.c | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b236417..0f59c6c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7346,9 +7346,11 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
u64 file_offset = dip->logical_offset;
u64 submit_len = 0;
u64 map_length;
-   int nr_pages = 0;
+   u32 blocksize = root->sectorsize;
int ret = 0;
int async_submit = 0;
+   int nr_sectors;
+   int i;
 
map_length = orig_bio->bi_iter.bi_size;
ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -7378,9 +7380,12 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
atomic_inc(&dip->pending_bios);
 
while (bvec <= (orig_bio->bi_io_vec + orig_bio->bi_vcnt - 1)) {
-   if (unlikely(map_length < submit_len + bvec->bv_len ||
-   bio_add_page(bio, bvec->bv_page, bvec->bv_len,
-bvec->bv_offset) < bvec->bv_len)) {
+   nr_sectors = bvec->bv_len >> inode->i_sb->s_blocksize_bits;
+   i = 0;
+next_block:
+   if (unlikely(map_length < submit_len + blocksize ||
+   bio_add_page(bio, bvec->bv_page, blocksize,
+   bvec->bv_offset + (i * blocksize)) < blocksize)) {
/*
 * inc the count before we submit the bio so
 * we know the end IO handler won't happen before
@@ -7401,7 +7406,6 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
file_offset += submit_len;
 
submit_len = 0;
-   nr_pages = 0;
 
bio = btrfs_dio_bio_alloc(orig_bio->bi_bdev,
  start_sector, GFP_NOFS);
@@ -7418,9 +7422,14 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
bio_put(bio);
goto out_err;
}
+
+   goto next_block;
} else {
-   submit_len += bvec->bv_len;
-   nr_pages++;
+   submit_len += blocksize;
+   if (--nr_sectors) {
+   i++;
+   goto next_block;
+   }
bvec++;
}
}
-- 
2.1.0



[RFC PATCH V10 18/19] Btrfs: subpagesize-blocksize: Use (eb->start, seq) as search key for tree modification log.

2014-12-10 Thread Chandan Rajendra
In the subpagesize-blocksize scenario a page can map multiple extent buffers, so
using (page index, seq) as the search key is incorrect. For example, searching
through tree modification log tree can return an entry associated with the
first extent buffer mapped by the page (if such an entry exists), when we are
actually searching for entries associated with extent buffers that are mapped
at position 2 or more in the page.
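
The collision described above can be shown with a small userspace sketch. It
assumes 4096-byte pages and a 2k block size; struct tm_key,
key_by_page_index() and tm_key_cmp() are hypothetical names for this example
and only model the (logical, seq) ordering used by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page shift for this sketch (4096-byte pages). */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical model of the search key after this patch. */
struct tm_key {
	uint64_t logical; /* eb->start, unshifted */
	uint64_t seq;
};

/* Old key component: the logical address shifted down to a page index. */
static inline uint64_t key_by_page_index(uint64_t eb_start)
{
	return eb_start >> SKETCH_PAGE_SHIFT;
}

/* Three-way comparison on (logical, seq): logical first, then seq. */
static inline int tm_key_cmp(const struct tm_key *a, const struct tm_key *b)
{
	if (a->logical != b->logical)
		return a->logical < b->logical ? -1 : 1;
	if (a->seq != b->seq)
		return a->seq < b->seq ? -1 : 1;
	return 0;
}
```

Two extent buffers starting at 4096 and 6144 share page index 1, so the old
key cannot tell them apart, while their unshifted logical addresses remain
distinct.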

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/ctree.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 75129da..8344f49 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -314,7 +314,7 @@ struct tree_mod_root {
 
 struct tree_mod_elem {
struct rb_node node;
-   u64 index;  /* shifted logical */
+   u64 logical;
u64 seq;
enum mod_log_op op;
 
@@ -438,11 +438,11 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
 
 /*
  * key order of the log:
- *   index -> sequence
+ *   node/leaf start address -> sequence
  *
- * the index is the shifted logical of the *new* root node for root replace
- * operations, or the shifted logical of the affected block for all other
- * operations.
+ * The 'start address' is the logical address of the *new* root node
+ * for root replace operations, or the logical address of the affected
+ * block for all other operations.
  *
  * Note: must be called with write lock (tree_mod_log_write_lock).
  */
@@ -463,9 +463,9 @@ __tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
while (*new) {
cur = container_of(*new, struct tree_mod_elem, node);
parent = *new;
-   if (cur->index < tm->index)
+   if (cur->logical < tm->logical)
new = &((*new)->rb_left);
-   else if (cur->index > tm->index)
+   else if (cur->logical > tm->logical)
new = &((*new)->rb_right);
else if (cur->seq < tm->seq)
new = &((*new)->rb_left);
@@ -526,7 +526,7 @@ alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
if (!tm)
return NULL;
 
-   tm->index = eb->start >> PAGE_CACHE_SHIFT;
+   tm->logical = eb->start;
if (op != MOD_LOG_KEY_ADD) {
btrfs_node_key(eb, &tm->key, slot);
tm->blockptr = btrfs_node_blockptr(eb, slot);
@@ -591,7 +591,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
goto free_tms;
}
 
-   tm->index = eb->start >> PAGE_CACHE_SHIFT;
+   tm->logical = eb->start;
tm->slot = src_slot;
tm->move.dst_slot = dst_slot;
tm->move.nr_items = nr_items;
@@ -702,7 +702,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
goto free_tms;
}
 
-   tm->index = new_root->start >> PAGE_CACHE_SHIFT;
+   tm->logical = new_root->start;
tm->old_root.logical = old_root->start;
tm->old_root.level = btrfs_header_level(old_root);
tm->generation = btrfs_header_generation(old_root);
@@ -742,16 +742,15 @@ __tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
struct rb_node *node;
struct tree_mod_elem *cur = NULL;
struct tree_mod_elem *found = NULL;
-   u64 index = start >> PAGE_CACHE_SHIFT;
 
tree_mod_log_read_lock(fs_info);
tm_root = &fs_info->tree_mod_log;
node = tm_root->rb_node;
while (node) {
cur = container_of(node, struct tree_mod_elem, node);
-   if (cur->index < index) {
+   if (cur->logical < start) {
node = node->rb_left;
-   } else if (cur->index > index) {
+   } else if (cur->logical > start) {
node = node->rb_right;
} else if (cur->seq < min_seq) {
node = node->rb_left;
@@ -1232,9 +1231,10 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
return NULL;
 
/*
-* the very last operation that's logged for a root is the replacement
-* operation (if it is replaced at all). this has the index of the *new*
-* root, making it the very first operation that's logged for this root.
+* the very last operation that's logged for a root is the
+* replacement operation (if it is replaced at all). this has
+* the logical address of the *new* root, making it the very
+* first operation that's logged for this root.
 */
while (1) {
tm = tree_mod_log_search_oldest(fs_info, root_logical,
@@ -1338,7 +1338,7 @@ __tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
if (!next)
break;
tm = container_of(next, struct tree_mod_elem, node);
-   

[RFC PATCH V10 02/19] Btrfs: subpagesize-blocksize: Get rid of whole page writes.

2014-12-10 Thread Chandan Rajendra
This commit brings back functions that set/clear EXTENT_WRITEBACK bits. These
are required to reliably clear PG_writeback page flag.
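
A rough userspace model of the idea: the page-level writeback flag may be
cleared only once no block in the page is still under writeback. In the patch
this state lives in the extent io tree as EXTENT_WRITEBACK bits and is
queried with test_range_bit(); the 32-bit mask and the name
clear_block_writeback() below are simplifying assumptions made for this
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * wb_mask has one bit per block of a page; a set bit means the block is
 * still under writeback. Clear the bits for the blocks whose write just
 * completed and report whether the page as a whole is now done, i.e.
 * whether the page-level writeback flag could be cleared.
 */
static inline bool clear_block_writeback(uint32_t *wb_mask,
					 unsigned int first_blk,
					 unsigned int nr_blks)
{
	assert(first_blk + nr_blks <= 32);

	for (unsigned int i = 0; i < nr_blks; i++)
		*wb_mask &= ~(UINT32_C(1) << (first_blk + i));

	return *wb_mask == 0;
}
```

With two 2k blocks in flight for one page, completing the first block leaves
the page under writeback; only the second completion ends it.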

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 47 +++
 fs/btrfs/inode.c | 40 +++-
 2 files changed, 58 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c98dfd8..57db008 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1300,6 +1300,20 @@ int clear_extent_uptodate(struct extent_io_tree *tree, u64 start, u64 end,
cached_state, mask);
 }
 
+static int set_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+   struct extent_state **cached_state, gfp_t mask)
+{
+   return set_extent_bit(tree, start, end, EXTENT_WRITEBACK, NULL,
+   cached_state, mask);
+}
+
+static int clear_extent_writeback(struct extent_io_tree *tree, u64 start, u64 end,
+   struct extent_state **cached_state, gfp_t mask)
+{
+   return clear_extent_bit(tree, start, end, EXTENT_WRITEBACK, 1, 0,
+   cached_state, mask);
+}
+
 /*
  * either insert or lock state struct between start and end use mask to tell
  * us if waiting is desired.
@@ -1406,6 +1420,7 @@ static int set_range_writeback(struct extent_io_tree *tree, u64 start, u64 end)
page_cache_release(page);
index++;
}
+   set_extent_writeback(tree, start, end, NULL, GFP_NOFS);
return 0;
 }
 
@@ -2403,31 +2418,23 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
 
bio_for_each_segment_all(bvec, bio, i) {
struct page *page = bvec->bv_page;
+   struct inode *inode = page->mapping->host;
+   struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
+   u64 page_start, page_end;
 
-   /* We always issue full-page reads, but if some block
-* in a page fails to read, blk_update_request() will
-* advance bv_offset and adjust bv_len to compensate.
-* Print a warning for nonzero offsets, and an error
-* if they don't add up to a full page.  */
-   if (bvec->bv_offset || bvec->bv_len != PAGE_CACHE_SIZE) {
-   if (bvec->bv_offset + bvec->bv_len != PAGE_CACHE_SIZE)
-   btrfs_err(BTRFS_I(page->mapping->host)->root->fs_info,
-  "partial page write in btrfs with offset %u and length %u",
-   bvec->bv_offset, bvec->bv_len);
-   else
-   btrfs_info(BTRFS_I(page->mapping->host)->root->fs_info,
-  "incomplete page write in btrfs with offset %u and "
-  "length %u",
-   bvec->bv_offset, bvec->bv_len);
-   }
-
-   start = page_offset(page);
-   end = start + bvec->bv_offset + bvec->bv_len - 1;
+   start = page_offset(page) + bvec->bv_offset;
+   end = start + bvec->bv_len - 1;
 
if (end_extent_writepage(page, err, start, end))
continue;
 
-   end_page_writeback(page);
+   clear_extent_writeback(tree, start, end, NULL, GFP_ATOMIC);
+
+   page_start = page_offset(page);
+   page_end = page_offset(page) + PAGE_CACHE_SIZE - 1;
+   if (!test_range_bit(tree, page_start, page_end,
+   EXTENT_WRITEBACK, 0, NULL))
+   end_page_writeback(page);
}
 
bio_put(bio);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7309832..2ffb4df 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2823,22 +2823,44 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_ordered_extent *ordered_extent = NULL;
struct btrfs_workqueue *workers;
+   u64 ordered_start, ordered_end;
+   int done;
 
trace_btrfs_writepage_end_io_hook(page, start, end, uptodate);
 
ClearPagePrivate2(page);
-   if (!btrfs_dec_test_ordered_pending(inode, &ordered_extent, start,
-   end - start + 1, uptodate))
-   return 0;
+loop:
+   ordered_extent = btrfs_lookup_ordered_range(inode, start,
+   end - start + 1);
+   if (!ordered_extent)
+   goto out;
 
-   btrfs_init_work(&ordered_extent->work, finish_ordered_fn, NULL, NULL);
+   ordered_start = max_t(u64, start, ordered_extent->file_offset);
+   ordered_end = min_t(u64, end,
+   o

[RFC PATCH V10 06/19] Btrfs: subpagesize-blocksize: Write only dirty extent buffers belonging to a page

2014-12-10 Thread Chandan Rajendra
For the subpagesize-blocksize scenario, this patch adds the ability to write a
single extent buffer to disk.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c   |  20 ++--
 fs/btrfs/extent_io.c | 300 ---
 2 files changed, 250 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 20168e6..6c6e8bb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -484,17 +484,23 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_root *root, struct page *page)
 {
-   u64 start = page_offset(page);
-   u64 found_start;
struct extent_buffer *eb;
+   u64 found_start;
 
eb = (struct extent_buffer *)page->private;
-   if (page != eb->pages[0])
+   if (page != eb_head(eb)->pages[0])
return 0;
-   found_start = btrfs_header_bytenr(eb);
-   if (WARN_ON(found_start != start || !PageUptodate(page)))
-   return 0;
-   csum_tree_block(root, eb, 0);
+   do {
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
+   continue;
+   if (WARN_ON(!test_bit(EXTENT_BUFFER_UPTODATE, &eb->ebflags)))
+   continue;
+   found_start = btrfs_header_bytenr(eb);
+   if (WARN_ON(found_start != eb->start))
+   return 0;
+   csum_tree_block(root, eb, 0);
+   } while ((eb = eb->eb_next) != NULL);
+
return 0;
 }
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 70bc10e..bbb5e980 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3453,33 +3453,54 @@ void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
TASK_UNINTERRUPTIBLE);
 }
 
-static noinline_for_stack int
-lock_extent_buffer_for_io(struct extent_buffer *eb,
- struct btrfs_fs_info *fs_info,
- struct extent_page_data *epd)
+static void lock_extent_buffer_pages(struct extent_buffer_head *ebh,
+   struct extent_page_data *epd)
 {
+   struct extent_buffer *eb = &ebh->eb;
unsigned long i, num_pages;
-   int flush = 0;
+
+   num_pages = num_extent_pages(eb->start, eb->len);
+   for (i = 0; i < num_pages; i++) {
+   struct page *p = extent_buffer_page(eb, i);
+
+   if (!trylock_page(p)) {
+   flush_write_bio(epd);
+   lock_page(p);
+   }
+   }
+
+   return;
+}
+
+static int noinline_for_stack
+lock_extent_buffer_for_io(struct extent_buffer *eb,
+   struct btrfs_fs_info *fs_info,
+   struct extent_page_data *epd)
+{
+   int dirty;
int ret = 0;
 
if (!btrfs_try_tree_write_lock(eb)) {
-   flush = 1;
flush_write_bio(epd);
btrfs_tree_lock(eb);
}
 
-   if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
+   if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags)) {
+   dirty = test_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags);
btrfs_tree_unlock(eb);
-   if (!epd->sync_io)
-   return 0;
-   if (!flush) {
-   flush_write_bio(epd);
-   flush = 1;
+   if (!epd->sync_io) {
+   if (!dirty)
+   return 1;
+   else
+   return 2;
}
+
+   flush_write_bio(epd);
+
while (1) {
wait_on_extent_buffer_writeback(eb);
btrfs_tree_lock(eb);
-   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
+   if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags))
break;
btrfs_tree_unlock(eb);
}
@@ -3490,37 +3511,22 @@ lock_extent_buffer_for_io(struct extent_buffer *eb,
 * under IO since we can end up having no IO bits set for a short period
 * of time.
 */
-   spin_lock(&eb->refs_lock);
-   if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
-   set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
-   spin_unlock(&eb->refs_lock);
+   spin_lock(&eb_head(eb)->refs_lock);
+   if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->ebflags)) {
+   set_bit(EXTENT_BUFFER_WRITEBACK, &eb->ebflags);
+   spin_unlock(&eb_head(eb)->refs_lock);
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
__percpu_counter_add(&fs_info->dirty_metadata_bytes,
 -eb->len,
 fs_info->dirty_metadata_batch);
-   ret = 1;
+   re

[RFC PATCH V10 08/19] Btrfs: subpagesize-blocksize: Compute and look up csums based on sectorsized blocks.

2014-12-10 Thread Chandan Rajendra
Checksums are applicable to sectorsize units. The current code uses
bio->bv_len units to compute and look up checksums. This works only on
machines where sectorsize == PAGE_CACHE_SIZE. This patch makes the checksum
computation and lookup code work with sectorsize units.

Signed-off-by: Chandan Rajendra 
---
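A minimal sketch of the per-sector arithmetic described above, not part of
the patch itself. The helper names are hypothetical and exist only for
illustration; the round-up mirrors the nr_sectors computation in the
btrfs_csum_one_bio() hunk below.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of sectorsize units covered by one bio segment of seg_len bytes.
 * A partial trailing sector still needs its own checksum, hence the
 * round-up. sectorsize is assumed to be a power of two.
 */
unsigned int sectors_in_segment(uint32_t seg_len, uint32_t sectorsize)
{
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	return (seg_len + sectorsize - 1) / sectorsize;
}

/* Bytes of checksum data needed for that segment (hypothetical helper). */
uint64_t csum_bytes_for_segment(uint32_t seg_len, uint32_t sectorsize,
				uint32_t csum_size)
{
	return (uint64_t)sectors_in_segment(seg_len, sectorsize) * csum_size;
}
```

With a 64 KiB page and 4 KiB sectors, a full-page segment covers 16 sectors,
so it needs 16 checksums rather than one; assuming a 4-byte checksum that is
64 bytes of checksum data.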
 fs/btrfs/file-item.c | 87 
 fs/btrfs/inode.c | 53 +---
 2 files changed, 89 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 54c84da..000418a 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -172,6 +172,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
u64 item_start_offset = 0;
u64 item_last_offset = 0;
u64 disk_bytenr;
+   u64 page_bytes_left;
u32 diff;
int nblocks;
int bio_index = 0;
@@ -220,6 +221,8 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
disk_bytenr = (u64)bio->bi_iter.bi_sector << 9;
if (dio)
offset = logical_offset;
+
+   page_bytes_left = bvec->bv_len;
while (bio_index < bio->bi_vcnt) {
if (!dio)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
@@ -243,7 +246,7 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
if (BTRFS_I(inode)->root->root_key.objectid ==
BTRFS_DATA_RELOC_TREE_OBJECTID) {
set_extent_bits(io_tree, offset,
-   offset + bvec->bv_len - 1,
+   offset + root->sectorsize - 1,
EXTENT_NODATASUM, GFP_NOFS);
} else {

btrfs_info(BTRFS_I(inode)->root->fs_info,
@@ -281,11 +284,17 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
 found:
csum += count * csum_size;
nblocks -= count;
-   bio_index += count;
+
while (count--) {
-   disk_bytenr += bvec->bv_len;
-   offset += bvec->bv_len;
-   bvec++;
+   disk_bytenr += root->sectorsize;
+   offset += root->sectorsize;
+   page_bytes_left -= root->sectorsize;
+   if (!page_bytes_left) {
+   bio_index++;
+   bvec++;
+   page_bytes_left = bvec->bv_len;
+   }
+
}
}
btrfs_free_path(path);
@@ -442,6 +451,8 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
struct bio_vec *bvec = bio->bi_io_vec;
int bio_index = 0;
int index;
+   int nr_sectors;
+   int i;
unsigned long total_bytes = 0;
unsigned long this_sum_bytes = 0;
u64 offset;
@@ -469,41 +480,51 @@ int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
if (!contig)
offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 
-   if (offset >= ordered->file_offset + ordered->len ||
-   offset < ordered->file_offset) {
-   unsigned long bytes_left;
-   sums->len = this_sum_bytes;
-   this_sum_bytes = 0;
-   btrfs_add_ordered_sum(inode, ordered, sums);
-   btrfs_put_ordered_extent(ordered);
+   data = kmap_atomic(bvec->bv_page);
 
-   bytes_left = bio->bi_iter.bi_size - total_bytes;
 
-   sums = kzalloc(btrfs_ordered_sum_size(root, bytes_left),
-  GFP_NOFS);
-   BUG_ON(!sums); /* -ENOMEM */
-   sums->len = bytes_left;
-   ordered = btrfs_lookup_ordered_extent(inode, offset);
-   BUG_ON(!ordered); /* Logic error */
-   sums->bytenr = ((u64)bio->bi_iter.bi_sector << 9) +
-  total_bytes;
-   index = 0;
+   nr_sectors = (bvec->bv_len + root->sectorsize - 1)
+   >> root->fs_info->sb->s_blocksize_bits;
+
+
+   for (i = 0; i < nr_sectors; i++) {
+   if (offset >= ordered->file_offset + ordered->len ||
+   offset < ordered->file_offset) {
+   unsigned long bytes_left;
+   sums->len = this_sum_bytes;
+   this_sum_bytes = 0;
+   btrfs_add_ordered_sum(inode, ordered, sums);
+   btrfs_put_ordered_extent(ordere

[RFC PATCH V10 13/19] Btrfs: subpagesize-blocksize: Deal with partial ordered extent allocations.

2014-12-10 Thread Chandan Rajendra
In the subpagesize-blocksize scenario, extent allocations for only some of the
dirty blocks of a page can succeed, while allocation for the rest of the blocks
can fail. This patch allows I/O against such partially allocated ordered
extents to be submitted.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 24 +---
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c | 39 +--
 3 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3f6bec2..51ab453 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1774,15 +1774,22 @@ int extent_clear_unlock_delalloc(struct inode *inode, u64 start, u64 end,
if (page_ops & PAGE_SET_PRIVATE2)
SetPagePrivate2(pages[i]);
 
+   if (page_ops & PAGE_SET_ERROR)
+   SetPageError(pages[i]);
+
if (pages[i] == locked_page) {
page_cache_release(pages[i]);
continue;
}
-   if (page_ops & PAGE_CLEAR_DIRTY)
+
+   if ((page_ops & PAGE_CLEAR_DIRTY)
+   && !PagePrivate2(pages[i]))
clear_page_dirty_for_io(pages[i]);
-   if (page_ops & PAGE_SET_WRITEBACK)
+   if ((page_ops & PAGE_SET_WRITEBACK)
+   && !PagePrivate2(pages[i]))
set_page_writeback(pages[i]);
-   if (page_ops & PAGE_END_WRITEBACK)
+   if ((page_ops & PAGE_END_WRITEBACK)
+   && !PagePrivate2(pages[i]))
end_page_writeback(pages[i]);
if (page_ops & PAGE_UNLOCK)
unlock_page(pages[i]);
@@ -2398,7 +2405,7 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
uptodate = 0;
}
 
-   if (!uptodate) {
+   if (!uptodate || PageError(page)) {
ClearPageUptodate(page);
SetPageError(page);
ret = ret < 0 ? ret : -EIO;
@@ -3149,7 +3156,6 @@ static noinline_for_stack int writepage_delalloc(struct inode *inode,
   nr_written);
/* File system has been set read-only */
if (ret) {
-   SetPageError(page);
/* fill_delalloc should be return < 0 for error
 * but just in case, we use > 0 here meaning the
 * IO is started, so we don't want to return > 0
@@ -3358,7 +3364,6 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
struct inode *inode = page->mapping->host;
struct extent_page_data *epd = data;
u64 start = page_offset(page);
-   u64 page_end = start + PAGE_CACHE_SIZE - 1;
int ret;
int nr = 0;
size_t pg_offset;
@@ -3401,7 +3406,7 @@ static int __extent_writepage(struct page *page, struct writeback_control *wbc,
ret = writepage_delalloc(inode, page, wbc, epd, start, &nr_written);
if (ret == 1)
goto done_unlocked;
-   if (ret)
+   if (ret && !PagePrivate2(page))
goto done;
 
ret = __extent_writepage_io(inode, page, wbc, epd,
@@ -3415,10 +3420,7 @@ done:
set_page_writeback(page);
end_page_writeback(page);
}
-   if (PageError(page)) {
-   ret = ret < 0 ? ret : -EIO;
-   end_extent_writepage(page, ret, start, page_end);
-   }
+
unlock_page(page);
return ret;
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 39e14fc..264dfd4 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -52,6 +52,7 @@
 #define PAGE_SET_WRITEBACK (1 << 2)
 #define PAGE_END_WRITEBACK (1 << 3)
 #define PAGE_SET_PRIVATE2  (1 << 4)
+#define PAGE_SET_ERROR (1 << 5)
 
 /*
  * page->private values.  Every page that is controlled by the extent
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 91c5580..4ed78dd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -880,6 +880,8 @@ static noinline int cow_file_range(struct inode *inode,
struct btrfs_key ins;
struct extent_map *em;
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
+   struct btrfs_ordered_extent *ordered;
+   unsigned long page_ops, extent_ops;
int ret = 0;
 
if (btrfs_is_free_space_inode(inode)) {
@@ -924,8 +926,6 @@ static noinline int cow_file_range(struct inode *inode,
btrfs_drop_extent_cache(inode, start, start + num_bytes - 1, 0);
 
while (disk_num_bytes > 0) {
-   unsigned long op;
-
  

[RFC PATCH V10 05/19] Btrfs: subpagesize-blocksize: Read tree blocks whose size is

2014-12-10 Thread Chandan Rajendra
In the case of subpagesize-blocksize, this patch makes it possible to read
only a single metadata block from the disk instead of all the metadata blocks
that map into a page.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/disk-io.c   |  45 --
 fs/btrfs/disk-io.h   |   3 ++
 fs/btrfs/extent_io.c | 129 ++-
 3 files changed, 140 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3a79833..20168e6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -431,7 +431,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
int mirror_num = 0;
int failed_mirror = 0;
 
-   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   clear_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
io_tree = &BTRFS_I(root->fs_info->btree_inode)->io_tree;
while (1) {
ret = read_extent_buffer_pages(io_tree, eb, start,
@@ -450,7 +450,7 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 * there is no reason to read the other copies, they won't be
 * any less wrong.
 */
-   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags))
+   if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags))
break;
 
num_copies = btrfs_num_copies(root->fs_info,
@@ -582,12 +582,13 @@ static noinline int check_leaf(struct btrfs_root *root,
return 0;
 }
 
-static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
- u64 phy_offset, struct page *page,
- u64 start, u64 end, int mirror)
+int verify_extent_buffer_read(struct btrfs_io_bio *io_bio,
+   struct page *page,
+   u64 start, u64 end, int mirror)
 {
u64 found_start;
int found_level;
+   struct extent_buffer_head *ebh;
struct extent_buffer *eb;
struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
int ret = 0;
@@ -597,18 +598,26 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
goto out;
 
eb = (struct extent_buffer *)page->private;
+   do {
+   if ((eb->start <= start) && (eb->start + eb->len - 1 > start))
+   break;
+   } while ((eb = eb->eb_next) != NULL);
+
+   BUG_ON(!eb);
+
+   ebh = eb_head(eb);
 
/* the pending IO might have been the only thing that kept this buffer
 * in memory.  Make sure we have a ref for all this other checks
 */
extent_buffer_get(eb);
 
-   reads_done = atomic_dec_and_test(&eb->io_pages);
+   reads_done = atomic_dec_and_test(&ebh->io_bvecs);
if (!reads_done)
goto err;
 
eb->read_mirror = mirror;
-   if (test_bit(EXTENT_BUFFER_IOERR, &eb->bflags)) {
+   if (test_bit(EXTENT_BUFFER_IOERR, &eb->ebflags)) {
ret = -EIO;
goto err;
}
@@ -650,7 +659,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 * return -EIO.
 */
if (found_level == 0 && check_leaf(root, eb)) {
-   set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
+   set_bit(EXTENT_BUFFER_CORRUPT, &eb->ebflags);
ret = -EIO;
}
 
@@ -658,7 +667,7 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
set_extent_buffer_uptodate(eb);
 err:
if (reads_done &&
-   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
+   test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->ebflags))
btree_readahead_hook(root, eb, eb->start, ret);
 
if (ret) {
@@ -667,7 +676,7 @@ err:
 * again, we have to make sure it has something
 * to decrement
 */
-   atomic_inc(&eb->io_pages);
+   atomic_inc(&eb_head(eb)->io_bvecs);
clear_extent_buffer_uptodate(eb);
}
free_extent_buffer(eb);
@@ -675,20 +684,6 @@ out:
return ret;
 }
 
-static int btree_io_failed_hook(struct page *page, int failed_mirror)
-{
-   struct extent_buffer *eb;
-   struct btrfs_root *root = BTRFS_I(page->mapping->host)->root;
-
-   eb = (struct extent_buffer *)page->private;
-   set_bit(EXTENT_BUFFER_IOERR, &eb->bflags);
-   eb->read_mirror = failed_mirror;
-   atomic_dec(&eb->io_pages);
-   if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD, &eb->bflags))
-   btree_readahead_hook(root, eb, eb->start, -EIO);
-   return -EIO;/* we fixed nothing */
-}
-
 static void end_workqueue_bio(struct bio *bio, int err)
 {
struct end_io_wq *end_io_wq = bio->bi_private;
@@ -4156,8 +4151,6 @@ static int btrfs_cleanup_transaction(struct btrfs_root *root)
 }
 
 static struct extent_io_ops btree_extent_io_ops = {

[RFC PATCH V10 09/19] Btrfs: subpagesize-blocksize: __extent_writepage: Write only dirty blocks of a page.

2014-12-10 Thread Chandan Rajendra
The code now loops across 'ordered extents' instead of 'extent maps' to figure
out the dirty blocks of the page to be submitted for a write operation.

Signed-off-by: Chandan Rajendra 
---
 fs/btrfs/extent_io.c | 74 
 1 file changed, 29 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bbb5e980..ceaf137 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3212,18 +3212,18 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 int write_flags, int *nr_ret)
 {
struct extent_io_tree *tree = epd->tree;
+   struct btrfs_ordered_extent *ordered;
u64 start = page_offset(page);
u64 page_end = start + PAGE_CACHE_SIZE - 1;
u64 end;
u64 cur = start;
u64 extent_offset;
-   u64 block_start;
+   u64 extent_end;
u64 iosize;
sector_t sector;
struct extent_state *cached_state = NULL;
-   struct extent_map *em;
struct block_device *bdev;
-   size_t pg_offset = 0;
+   size_t pg_offset;
size_t blocksize;
int ret = 0;
int nr = 0;
@@ -3263,59 +3263,46 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
blocksize = inode->i_sb->s_blocksize;
 
while (cur <= end) {
-   u64 em_end;
if (cur >= i_size) {
if (tree->ops && tree->ops->writepage_end_io_hook)
tree->ops->writepage_end_io_hook(page, cur,
 page_end, NULL, 1);
break;
}
-   em = epd->get_extent(inode, page, pg_offset, cur,
-end - cur + 1, 1);
-   if (IS_ERR_OR_NULL(em)) {
-   SetPageError(page);
-   ret = PTR_ERR_OR_ZERO(em);
-   break;
-   }
 
-   extent_offset = cur - em->start;
-   em_end = extent_map_end(em);
-   BUG_ON(em_end <= cur);
+   ordered = btrfs_lookup_ordered_extent(inode, cur);
+   if (!ordered) {
+   cur += blocksize;
+   continue;
+   }
+
+   pg_offset = cur & (PAGE_CACHE_SIZE - 1);
+
+   extent_offset = cur - ordered->file_offset;
+   extent_end = ordered->file_offset + ordered->len;
+   extent_end = (extent_end < ordered->file_offset) ? -1 : extent_end;
+   BUG_ON(extent_end <= cur);
BUG_ON(end < cur);
-   iosize = min(em_end - cur, end - cur + 1);
+   iosize = min(extent_end - cur, end - cur + 1);
iosize = ALIGN(iosize, blocksize);
-   sector = (em->block_start + extent_offset) >> 9;
-   bdev = em->bdev;
-   block_start = em->block_start;
-   compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
-   free_extent_map(em);
-   em = NULL;
+
+   sector = (ordered->start + extent_offset) >> 9;
+   bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
+   compressed = test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags);
+   btrfs_put_ordered_extent(ordered);
+   ordered = NULL;
 
/*
 * compressed and inline extents are written through other
 * paths in the FS
 */
-   if (compressed || block_start == EXTENT_MAP_HOLE ||
-   block_start == EXTENT_MAP_INLINE) {
-   /*
-* end_io notification does not happen here for
-* compressed extents
-*/
-   if (!compressed && tree->ops &&
-   tree->ops->writepage_end_io_hook)
-   tree->ops->writepage_end_io_hook(page, cur,
-cur + iosize - 1,
-NULL, 1);
-   else if (compressed) {
-   /* we don't want to end_page_writeback on
-* a compressed extent.  this happens
-* elsewhere
-*/
-   nr++;
-   }
-
+   if (compressed) {
+   /* we don't want to end_page_writeback on
+* a compressed extent.  this happens
+* elsewhere
+*/
+   nr++;
cur += iosize;
-   pg_offset += iosize;
continue;

[RFC PATCH V10 12/19] Btrfs: subpagesize-blocksize: Search for all ordered extents that could span across a page.

2014-12-10 Thread Chandan Rajendra
In the subpagesize-blocksize scenario it is not sufficient to search using
the first byte of the page to make sure that there are no ordered extents
present across the page. Search the whole page range instead.

Signed-off-by: Chandan Rajendra 
---
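A minimal sketch of why the point lookup misses, not part of the patch. The
struct and helper names below are hypothetical stand-ins for an ordered
extent and for the two lookup styles; they only model the interval
arithmetic, not the kernel's ordered-extent tree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an ordered extent: [file_offset, file_offset + len) */
struct ordered_range {
	uint64_t file_offset;
	uint64_t len;
};

/* Point lookup: does the ordered extent contain this single byte offset? */
int covers_offset(const struct ordered_range *o, uint64_t off)
{
	assert(o != 0);
	return off >= o->file_offset && off < o->file_offset + o->len;
}

/* Range lookup: does the ordered extent overlap [start, start + len)? */
int overlaps_range(const struct ordered_range *o, uint64_t start, uint64_t len)
{
	assert(o != 0);
	return o->file_offset < start + len && start < o->file_offset + o->len;
}
```

For a 64 KiB page at offset 0 and an ordered extent covering one 4 KiB block
at offset 8192, covers_offset() on the first byte of the page reports nothing,
while overlaps_range() over the whole page finds the extent.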
 fs/btrfs/extent_io.c | 3 ++-
 fs/btrfs/inode.c | 6 +++---
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ceaf137..3f6bec2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3053,7 +3053,8 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 
while (1) {
lock_extent(tree, start, end);
-   ordered = btrfs_lookup_ordered_extent(inode, start);
+   ordered = btrfs_lookup_ordered_range(inode, start,
+   PAGE_CACHE_SIZE);
if (!ordered)
break;
unlock_extent(tree, start, end);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 23ce9ff..91c5580 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1821,7 +1821,7 @@ again:
if (PagePrivate2(page))
goto out;
 
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
if (ordered) {
unlock_extent_cached(&BTRFS_I(inode)->io_tree, page_start,
 page_end, &cached_state, GFP_NOFS);
@@ -7724,7 +7724,7 @@ static void btrfs_invalidatepage(struct page *page, unsigned int offset,
 
if (!inode_evicting)
lock_extent_bits(tree, page_start, page_end, 0, &cached_state);
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, PAGE_CACHE_SIZE);
if (ordered) {
/*
 * IO on this page will never be started, so we need
@@ -7849,7 +7849,7 @@ again:
 * we can't set the delalloc bits if there are pending ordered
 * extents.  Drop our locks and wait for them to finish
 */
-   ordered = btrfs_lookup_ordered_extent(inode, page_start);
+   ordered = btrfs_lookup_ordered_range(inode, page_start, page_end);
if (ordered) {
unlock_extent_cached(io_tree, page_start, page_end,
 &cached_state, GFP_NOFS);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Duncan
Robert White posted on Wed, 10 Dec 2014 04:17:50 -0800 as excerpted:

>> BTRFS info (device sdc1): relocating block group 1821099687936 flags 1
>> BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
>> BTRFS: space_info 1 has 4773171200 free, is not full BTRFS: space_info
>> total=1494648619008, used=1489775505408, pinned=0, reserved=99700736,
>> may_use=2102390784, readonly=241664
> 
> So it was looking for a single chunk 2013265920 bytes long and it
> couldn't find one because all the spaces were smaller and there was no
> room to make a new suitable space.
> 
> The problem is that it wanted 2013265920 bytes and while the system as a
> whole had no way to satisfy that desire. It asked for something just shy
> of two gigs as a single extent. That's a tough order on a full platter.
> 
> Since your entire free size is 2102390784 that is an attempt to allocate
> about 80% of your free space as one contiguous block. That's never going
> to happen. 8-)
> 
> I don't even know if 2GiB is normally a legal size for an extent. My
> understanding is that data is allocated in 1G chunks, so I'd expect all
> extents to be smaller than 1G.

On native btrfs, an extent must fit within the 1 GiB data chunk size, 
with extents inherited from an ext* conversion being an obvious non-
native exception.

I hadn't looked at the actual output, but that confirms my earlier 
suspicion, that after the ext* saved subvolume delete, the defrag somehow 
missed at least one file > 1 GiB with a "super-extent" also > 1 GiB in 
size.
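For scale, the arithmetic behind that suspicion can be sketched as follows.
This is only an illustration: the helper names are made up, and the 1 GiB
figure is the data chunk size discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define GIB_BYTES (1024ULL * 1024ULL * 1024ULL)

/* Convert a byte count to GiB for display purposes. */
double bytes_to_gib(uint64_t bytes)
{
	return (double)bytes / (double)GIB_BYTES;
}

/* Returns 1 when a single request cannot fit inside one chunk. */
int exceeds_chunk(uint64_t wanted, uint64_t chunk_size)
{
	assert(chunk_size > 0);
	return wanted > chunk_size;
}
```

The "wanted 2013265920" in the log is 1.875 GiB, so
exceeds_chunk(2013265920ULL, GIB_BYTES) is true: no single native 1 GiB data
chunk could ever hold that extent, however empty the filesystem is.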

From there... I've never used it but I /think/ btrfs inspect-internal 
logical-resolve should let you map the 182109... address to a filename.  
From there, moving that file out of the filesystem and back in should 
eliminate that issue.

Assuming no snapshots still contain the file, of course, and that the 
ext* saved subvolume has already been deleted.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Duncan
Robert White posted on Tue, 09 Dec 2014 16:01:02 -0800 as excerpted:

> On 12/09/2014 03:48 PM, Robert White wrote:
>> On 12/09/2014 02:29 PM, Patrik Lundquist wrote:
>>> (stuff depicting a nearly full file system).
>>
>> Having taken another look at it all, I'd bet (there is not sufficient
>> information to be _sure_ from the output you've provided) that you
>> don't have the necessary 1Gb free on your disk slice to allocate
>> another data extent.

[snip most of both quote levels]

> Full filesystems always get into corner cases.

But, from the content you snipped from his post, this from btrfs fi show:

>>> Label: none  uuid: 770fe01d-6a45-42b9-912e-e8f8b413f6a4
>>>Total devices 1 FS bytes used 1.35TiB
>>>devid1 size 2.73TiB used 1.36TiB path /dev/sdc1

Device 2.73 TiB, used only 1.36 TiB.

That's over a TiB of entirely unallocated space, so a mere 1 GiB chunk 
allocation shouldn't be a problem.


I'm sticking with my original hypothesis (assuming this is a continuation 
from the thread I think it was), that there's something about the 
conversion from ext* that didn't work correctly; most likely a file 
larger than the btrfs 1 GiB data-chunk size, that has an extent larger 
than that size as well.  Btrfs balance couldn't do anything with that, as 
it's larger than the native 1 GiB data-chunk size and balance alone 
doesn't know how to split it up.

The recursive btrfs defrag after deleting the saved ext* subvolume 
_should_ have split up any such > 1 GiB extents so balance could deal 
with them, but either it failed for some reason on at least one such 
file, or there's some other weird corner-case going on, very likely 
something else having to do with the conversion.

Patrik, assuming no btrfs snapshots yet, can you do a
du --all --block-size=1M | sort -n (or similar), then take a look at all
results over 1024 (1 GiB since the du specified 1 MiB blocks), and see if
it's reasonable to move all those files out of the filesystem and back?
Assuming there are not too many of them, the idea is to kill the copy in the
filesystem by moving them elsewhere, then move them back so they get
recreated using native btrfs semantics -- no extents larger than the native
btrfs data chunk size of 1 GiB.
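The same "over 1024" test can be sketched for a single file in C. This is a
rough equivalent under stated assumptions, not a replacement for du: the
function names are made up, st_blocks is assumed to count 512-byte units (as
it does on Linux), and the round-up to whole MiB is meant to match how du
reports sizes with --block-size=1M.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

#define MIB_BYTES (1024ULL * 1024ULL)

/* Disk usage in 1 MiB blocks, rounded up, from a count of 512-byte blocks. */
uint64_t usage_in_mib(uint64_t blocks_512)
{
	uint64_t bytes = blocks_512 * 512ULL;

	return (bytes + MIB_BYTES - 1) / MIB_BYTES;
}

/*
 * Returns 1 if the file occupies more than 1 GiB on disk, 0 if it does
 * not, and -1 if stat() fails.
 */
int occupies_more_than_1gib(const char *path)
{
	struct stat sb;

	assert(path != NULL);
	if (stat(path, &sb) != 0) {
		perror(path);
		return -1;
	}
	return usage_in_mib((uint64_t)sb.st_blocks) > 1024;
}
```

A file of exactly 1 GiB reports 1024 and is not flagged; anything that spills
into the next MiB reports 1025 and is a candidate for the move-out-and-back
treatment.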

If you have lots of memory to work with, one method would be to create a 
tmpfs, then /copy/ the files to tmpfs and /move/ them back to a temporary 
tree on the btrfs, deleting the originals on btrfs only after the move 
back from tmpfs and a sync (or btrfs fi sync) so there's always a 
permanent copy if the machine should crash and take down the tmpfs with 
it.  After all the files have been processed and the originals deleted 
you can then move the contents of the temporary tree back into the 
original location.

That should ensure no more > 1 GiB file extents and will I hope get rid 
of the problem, as this workaround has been demonstrated to fix problems 
other people had with converted-from-ext* btrfs, generally where they had 
failed to run the defrag right after the conversion, and now had a bunch 
more data on the filesystem and didn't want to have to defrag it too.  
Obviously it works best when there's only a handful of > 1 GiB files, 
however, and snapshots containing references to the affected files will 
prevent the file delete from actually deleting the problematic extents.

With luck that'll allow a full 100% balance without error.  If not, at 
least it should eliminate the > 1 GiB file extents possibility, and the 
focus can move to something else.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v4 08/13] btrfs-progs: Add count_digit() function to help calculate filename len.

2014-12-10 Thread David Sterba
On Tue, Dec 09, 2014 at 04:27:27PM +0800, Qu Wenruo wrote:
> +static inline int count_digit(u64 num)

FYI, I've renamed it to count_digits, and updated all callers.


Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and

2014-12-10 Thread David Sterba
On Tue, Dec 09, 2014 at 04:27:19PM +0800, Qu Wenruo wrote:
> The patchset introduce two new repair function and some helpers to
> archive a huge goal:
>   Repair btrfs whose fs tree's non-root leaf/node is corrupted when no
>   duplication is valid.
> 
> The two new repair functions are:
>   repair_inode_nlinks():
> Repair any inode nlink related problem.
> From fixing the nlink number and related
> inode_ref/dir_index/dir_item to recovering file name and file type
> and salvage them into the lost+found dir.
> This does not only fix a case that some users reported but also
> cooperate with repair_inode_no_item() function to salvaged heavily
> damaged inode to lost+found dir.
> 
>   repair_inode_no_item():
> Repair case for inode_item missing case, which is quite common when
> fs tree leaf/node is missing.
> This only does the inode item rebuild. Later recovery like move it
> to lost+found dir is done by repair_inode_nlinks().
> 
> The main helper is the repair_btree() function, which will drops the
> corrupted non-root leaf/node and rebalance the tree to keep the
> correctness of the btree.

Sounds a bit intrusive, but under the circumstances I don't see anything
better to do.

> With this patchset, even a non-root leaf/node is corrupted and no
> duplication survived, btrfsck can still repair it to a mountable status.
> (And normal rw should also be OK,)
> 
> The remaining unfixable problems will be inode nbytes error with file
> extent discounts error, which may be fixed in next patchset.
> 
> Cc David:
> Sorry for the huge change in the patchset and merge the old inode nlink
> repair with new inode item rebuild patchset.

No problem, the incremental changelogs helped a lot.

> Since when developing inode item rebuild patchset, I found the old nlink
> cooperated very bad with item rebuild and there is some duplicated codes
> between the two patchset, no to mention the math lib introduced by nlink
> repair patch.
> So I decided to somewhat rebase the nlink repair patchset to provide
> better generality.

Great, the patchset looks good for merge, I'm adding it to 3.18. From
now on please send only incremental changes and not the whole patchset.
Thanks.


Re: Fixing Btrfs Filesystem Full Problems typo?

2014-12-10 Thread Robert White

On 12/09/2014 11:19 PM, Patrik Lundquist wrote:

> On 10 December 2014 at 00:13, Robert White  wrote:
>> On 12/09/2014 02:29 PM, Patrik Lundquist wrote:
>>> Label: none  uuid: 770fe01d-6a45-42b9-912e-e8f8b413f6a4
>>>   Total devices 1 FS bytes used 1.35TiB
>>>   devid1 size 2.73TiB used 1.36TiB path /dev/sdc1
>>>
>>> Data, single: total=1.35TiB, used=1.35TiB
>>> System, single: total=32.00MiB, used=112.00KiB
>>> Metadata, single: total=3.00GiB, used=1.55GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> Are you trying to convert a filesystem on a single device/partition to RAID
>> 1?
>
> Not yet. I'm stuck at the full balance after the conversion from ext4.
> I haven't added the disks for RAID1 and might need them for starting
> over instead.

You are not "stuck" here as this step is not mandatory. (see below)

> A balance with -musage=100 -dusage=99 works but a full fails. It would
> be nice to nail the bug since the fs passes btrfs check and it seems
> to be a clear ENOSPC bug.


Conversion from ext2/3/4 is constrained because it needs to be reversible.

If you are out of space this isn't a "bug", you are just out of space. 
So by telling the system to ignore the 100% full clusters it is free to 
juggle the fragments. But once you get into moving the fully full 
extents the COW features _MUST_ have access to _contiguous_ 1 GiB blocks 
to make the new extents into which the Copy will be Written. If your file 
system was nearly full it's quite likely that there are no such 
contiguous blocks available to make the necessary extents.


BUT FIRST UNDERSTAND: you do _not_ need to balance a newly converted 
filesystem. That is, the recommended balance (and recursive defrag) is 
_not_ a usability issue, it's an efficiency issue.


Check what you've got. Make sure it is good. Make sure you are cool with 
it all. When you know everything is usable then remove the undo 
information snapshot. That snapshot is pinning a _lot_ of data into 
exact positions on disk. It's memorializing your previous fragmentation 
and the anniversary positions of all the EXT4 data structures. Since 
your system is basically full that undo information has to go.


At that point your balance will probably have the room it needs.

_Then_ you can balance if you feel the desire.

If you are _still_ out of space you'll need to add some, at least 
temporarily, to give the system enough room to work.


Since we all _know_ you are a diligent system administrator and
architect with a good, recent, and well tested backup we know we can
recommend that you just dump the undo snapshot with a nice btrfs subvol
delete, right? Because you made a backup and everything, yes?


So anyway. Your system isn't "bugged" or "broken", it's "full", but it's a
fragmented fullness that has lots of free sectors but insufficient
contiguous free sectors, so it cannot satisfy the request.


That Said...

I suspect you _have_ revealed a problem with the error reporting in the 
case of "scary and wrong error message".


The allocator in extent-tree.c just tells you the raw free space on the
disk and says "huh... there are lots of bytes out there".


Which is _WAY_ different than "there are enough bytes all in one clump
to satisfy my needs". I.e. there is _not_ a lot of brains behind the message.



	ret = find_free_extent(root, num_bytes, empty_size, hint_byte, ins,
			       flags, delalloc);

	if (ret == -ENOSPC) {
		if (!final_tried && ins->offset) {
			num_bytes = min(num_bytes >> 1, ins->offset);
			num_bytes = round_down(num_bytes, root->sectorsize);
			num_bytes = max(num_bytes, min_alloc_size);
			if (num_bytes == min_alloc_size)
				final_tried = true;
			goto again;
		} else if (btrfs_test_opt(root, ENOSPC_DEBUG)) {
			struct btrfs_space_info *sinfo;

			sinfo = __find_space_info(root->fs_info, flags);
			btrfs_err(root->fs_info,
				  "allocation failed flags %llu, wanted %llu",
				  flags, num_bytes);
			if (sinfo)
				dump_space_info(sinfo, num_bytes, 1);
		}
	}
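
The retry arithmetic in the excerpt above can be tried out in isolation. What follows is a minimal standalone sketch, not kernel code: shrink_request() and retries_until_fit() are hypothetical names, largest_free stands in for ins->offset on the assumption that it holds the largest free extent the search saw, and the driver loop simply stops once the request would fit, which is a simplification of the real "goto again" path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * One retry step, mirroring the min / round_down / max sequence in the
 * quoted excerpt. largest_free is an assumed stand-in for ins->offset.
 */
static uint64_t shrink_request(uint64_t num_bytes, uint64_t largest_free,
                               uint64_t sectorsize, uint64_t min_alloc,
                               bool *final_tried)
{
    uint64_t half, next;

    assert(sectorsize > 0 && largest_free > 0);

    half = num_bytes >> 1;
    next = half < largest_free ? half : largest_free;
    next -= next % sectorsize;      /* round_down(next, sectorsize) */
    if (next < min_alloc)
        next = min_alloc;
    if (next == min_alloc)
        *final_tried = true;        /* no further shrinking after this */
    return next;
}

/*
 * Hypothetical driver: keep shrinking until the request is no larger than
 * the biggest free extent, or until the minimum size has been tried.
 * Returns the number of retries and stores the last size in *final_size.
 */
static unsigned retries_until_fit(uint64_t num_bytes, uint64_t largest_free,
                                  uint64_t sectorsize, uint64_t min_alloc,
                                  uint64_t *final_size)
{
    bool final_tried = false;
    unsigned retries = 0;

    while (num_bytes > largest_free && !final_tried) {
        num_bytes = shrink_request(num_bytes, largest_free, sectorsize,
                                   min_alloc, &final_tried);
        retries++;
    }
    *final_size = num_bytes;
    return retries;
}
```

With a request of 2013265920 bytes and a made-up largest free extent of 256MiB, one retry is enough to bring the request down to something that fits. Whether a caller can accept the smaller size is a separate question that this sketch does not model.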






I don't know how to interpret the space_info error. Why is only
4773171200 (4,4GiB) free?
Can I inspect block group 1821099687936 to try to find out what makes
it problematic?

BTRFS info (device sdc1): relocating block group 1821099687936 flags 1
BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
BTRFS: space_info 1 has 4773171200 free, is not full
BTRFS: space_info total=1494648619008, used=1489775505408, pinned=0,
reserved=99700736, may_use=2102390784, readonly=241664
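
The "free" figure in this log can be reproduced from the other fields. A small sketch, assuming (inferred from the numbers, not quoted from dump_space_info()) that free is total minus used, pinned, reserved and readonly, with may_use left out. The struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Fields copied from the space_info line in the log; names are illustrative. */
struct space_info_sample {
    uint64_t total;
    uint64_t used;
    uint64_t pinned;
    uint64_t reserved;
    uint64_t may_use;
    uint64_t readonly;
};

/* Assumed formula: may_use is deliberately not subtracted. */
static uint64_t space_info_free(const struct space_info_sample *s)
{
    return s->total - s->used - s->pinned - s->reserved - s->readonly;
}

/* Convert bytes to whole MiB for easier reading. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
    return bytes / (1024ULL * 1024ULL);
}
```

Feeding in the logged values gives exactly 4773171200, and the "wanted" size of 2013265920 bytes is 1920MiB. So the total free space is more than twice the request, which fits the explanation that the failure is about contiguity rather than the raw byte count.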


So it was looking for a single chunk 2013265920 bytes long and it 
couldn't find one because all the spaces were smaller and there was no 
room to make a new suitable spac

Re: [PATCH] Btrfs: get more accurate output in fd command.

2014-12-10 Thread Robert White

On 12/09/2014 05:08 PM, Dongsheng Yang wrote:

On 12/10/2014 02:47 AM, Goffredo Baroncelli wrote:

Hi Dongsheng
On 12/09/2014 12:20 PM, Dongsheng Yang wrote:

When function btrfs_statfs() calculates the total size of the fs, it
calculates the total size of the disks and then divides it by a factor.
But in some use cases, the result is not useful to the user.

Example:
# mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
# mount /dev/vdf1 /mnt
# dd if=/dev/zero of=/mnt/zero bs=1M count=1000
# df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/vdf1   3.0G 1018M  1.3G  45% /mnt

# btrfs fi show /dev/vdf1
Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
Total devices 2 FS bytes used 1001.53MiB
devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2

a. df -h should report Size as 2GiB rather than as 3GiB.
Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.

I agree


NOPE.

The model you propose is too simple.

While the data portion of the file system is set to RAID1 the metadata
portion of the filesystem is still set to the default of DUP. As such it
is impossible to guess how much space is "free" since it is unknown how
the space will be used beforehand.


IF, say, this were used as a typical mail spool, web cache, or any
number of similar small-file applications virtually all of the data may
end up in the metadata chunks. The "blocks free" in this usage are
indistinguishable from any other file system.


For all that DUP data the correct size is 3GiB because there will be two 
copies of all metadata but they could _all_ end up on /dev/vdf2.


So you have a RAID-1 region that is constrained to 2GiB. You have 2GiB
more storage for all your metadata, but the constraint is DUP (so
everything is written twice "somewhere").


So the space breakdown is, if optimally packed, actually

2GiB mirrored, for _data_, takes up 4GiB total spread evenly across
/dev/vdf2 (2GiB) and /dev/vdf1 (2GiB).


_AND_ 1GiB of metadata, written twice to /dev/vdf2 (2GiB)

So free space is 3GiB on the presumption that data and metadata will be
equally used.


The program, not being psychic, can only make a fair-usage guess about 
future use.


Now we have accounted for all 6GiB of raw storage _and_ the report of 
3GiB free.
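
The breakdown above can be written down as a few lines of arithmetic. This is only a sketch of the packing described in this message (data mirrored across both devices, metadata stored twice on the larger device); the struct and function names are invented for illustration, and it says nothing about how btrfs_statfs() actually computes its numbers.

```c
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)

/* Raw bytes consumed on each device, plus the logical bytes stored. */
struct layout {
    uint64_t dev1_raw;   /* /dev/vdf1, the 2GiB device */
    uint64_t dev2_raw;   /* /dev/vdf2, the 4GiB device */
    uint64_t usable;     /* data + metadata as seen by the user */
};

/*
 * Packing as described in the message: RAID1 data puts one copy on each
 * device, DUP metadata puts both copies on the second device.
 */
static struct layout pack_as_described(uint64_t data, uint64_t metadata)
{
    struct layout l;

    l.dev1_raw = data;
    l.dev2_raw = data + 2 * metadata;
    l.usable = data + metadata;
    return l;
}
```

With 2GiB of data and 1GiB of metadata this fills /dev/vdf1 (2GiB) and /dev/vdf2 (4GiB) exactly, for 6GiB raw and 3GiB usable.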


IF you wanted everything to be RAID-1 you should have instead done

# mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1 -m raid1

The mistake is yours; the rest of your analysis is, therefore, completely
inapplicable. Please read all the documentation before making that sort
of filesystem. Your data will thank you later.


DISCLAIMER: I have _not_ looked at the numbers you would get if you used
the corrected command.






b. df -h should report Avail as 0.15GiB or less, rather than as 1.3GiB.
2 - 1.85 = 0.15

I cannot agree; the avail should be:
 1.85   (the capacity of the allocated chunk)
-1.018  (the file stored)
+(2-1.85=0.15)  (the residual capacity of the disks
 considering a raid1 fs)
---
=   0.97


My bad here. It should be 0.97. My mistake in this changelog.
I will update it in next version.

This patch drops the factor altogether and calculates the size observable
to the user, without considering which raid level the data is in or what
the exact size on disk is.

After this patch is applied:
# mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
# mount /dev/vdf1 /mnt
# dd if=/dev/zero of=/mnt/zero bs=1M count=1000
# df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/vdf1   2.0G 1018M  713M  59% /mnt

I am confused: in this example you reported Avail as 713MB, when previously
you stated that the right value should be 150MB...


As you pointed above, the right value should be 970MB or less (Some
space is used for metadata and system).
And the 713MB is my result of it.


What happens when the filesystem is RAID5/RAID6 or Linear ?


The original df did not consider RAID5/6, so it still does not work
well with this patch applied. But I will update this patch to handle
these scenarios in V2.

Thanx
Yang

  [...]



Re: [PATCH 1/2] Btrfs: qgroup: free reserved in exceeding quota.

2014-12-10 Thread Dongsheng Yang

On 12/09/2014 11:42 PM, Josef Bacik wrote:

On 12/09/2014 06:27 AM, Dongsheng Yang wrote:

When we exceed the quota limit while writing, we free some
reserved extents when we need to drop them, but we do not free
the matching reservation in the qgroup. This means that each time
we exceed the quota while writing, some space is left behind in
qg->reserved that we can no longer use. If this goes on, all the
space will be eaten up.

Signed-off-by: Dongsheng Yang 
---
  fs/btrfs/extent-tree.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a84e00d..014b7f2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5262,8 +5262,11 @@ out_fail:
 		to_free = 0;
 	}
 	spin_unlock(&BTRFS_I(inode)->lock);
-	if (dropped)
+	if (dropped) {
+		if (root->fs_info->quota_enabled)
+			btrfs_qgroup_free(root, dropped * root->nodesize);


This needs to be num_bytes + dropped * root->nodesize.  Thanks,


Let me try to explain why it does not free num_bytes here.

In out_fail, we did not reserve num_bytes in the qgroup successfully, so
we do not need to free it in out_fail.

The problem this patch attempts to solve is that, when we run into
out_fail here, we will drop an outstanding extent. That is, in out_fail
the extra nodesize reserved for some extents should be freed.

Example:
1). BTRFS_I(inode)->reserved_extents: 2,
    BTRFS_I(inode)->outstanding_extents: 1.
    In this case, we go into btrfs_delalloc_reserve_metadata().
    outstanding_extents is increased first, so
    BTRFS_I(inode)->outstanding_extents is then 2.
    If we then try to reserve space and fail, we goto out_fail.
2). In out_fail: reserved_extents is 2, outstanding_extents is 2.
    We get a dropped of 1 from drop_outstanding_extent(). And now,
    reserved_extents: 1, outstanding_extents: 1.


In step 2, we just decrease reserved_extents without freeing the
related nodesize in the qgroup at the same time. That causes the problem
I described in the changelog, which eats up the space.


Therefore, this patch frees the nodesize related to the dropped
extents in step 2.
As for num_bytes, since we did not reserve it successfully, there is no
need to free it.
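
The two steps of the example can be replayed with a toy model. This is a sketch under stated assumptions, not the kernel code: the toy_* names are invented, the drop rule is simply chosen so that it reproduces the counts given in the example, and the model assumes the qgroup holds one nodesize of reservation per reserved extent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the per-inode counters and the qgroup reservation. */
struct toy_inode {
    unsigned reserved_extents;
    unsigned outstanding_extents;
};

struct toy_qgroup {
    uint64_t reserved;
};

/*
 * Toy drop rule (an assumption that matches the example): give back one
 * outstanding extent, then release any reserved extents beyond what is
 * still outstanding. Returns how many reserved extents were dropped.
 */
static unsigned toy_drop_outstanding_extent(struct toy_inode *in)
{
    unsigned dropped;

    in->outstanding_extents--;
    if (in->outstanding_extents >= in->reserved_extents)
        return 0;
    dropped = in->reserved_extents - in->outstanding_extents;
    in->reserved_extents -= dropped;
    return dropped;
}

/* Replays steps 1 and 2 for a reservation attempt that fails. */
static void toy_failed_reserve(struct toy_inode *in, struct toy_qgroup *qg,
                               uint64_t nodesize, bool patched)
{
    unsigned dropped;

    in->outstanding_extents++;              /* step 1: bumped up front */
    /* ... the reservation fails here and we land in out_fail ... */
    dropped = toy_drop_outstanding_extent(in);   /* step 2 */
    if (dropped && patched)
        qg->reserved -= (uint64_t)dropped * nodesize;
}
```

Starting from reserved_extents 2 and outstanding_extents 1 with two nodesizes reserved in the qgroup, the unpatched path ends with one reserved extent but still two nodesizes accounted, while the patched path ends with one of each.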


With my poor English, there is probably something confusing in my
description. Please correct me if anything is wrong or not well
explained.

Thanx
Yang


Josef





Re: [PATCH 2/2] Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.

2014-12-10 Thread Dongsheng Yang

On 12/09/2014 11:55 PM, Josef Bacik wrote:

On 12/09/2014 06:27 AM, Dongsheng Yang wrote:

Currently, for pre_alloc or delay_alloc, the bytes are accounted
in space_info by three counters:
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
On the other hand, in qgroup there are only two counters to account
the bytes, qgroup->reserved and qgroup->excl. qg->reserved accounts
bytes in space_info->bytes_may_use and qg->excl accounts bytes in
space_info->used. So the bytes in space_info->reserved are not accounted
in qgroup. As a result, there is a window in which we can exceed the
quota limit while bytes are in space_info->reserved.

Example:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer     excl     max_rfer max_excl parent  child
-------- ----     ----     -------- -------- ------  -----
0/5      20987904 20987904 0        10485760 ---     ---

qg->excl is 20987904, which is larger than max_excl 10485760.

This patch introduces a new counter named may_use to qgroup, so that
there are three counters in qgroup to account bytes in space_info,
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use           --- qgroup->reserved     --- qgroup->excl

With this patch applied:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer    excl    max_rfer max_excl parent  child
-------- ----    ----    -------- -------- ------  -----
0/5      9453568 9453568 0        10485760 ---     ---

Reported-by: Cyril SCETBON 
Signed-off-by: Dongsheng Yang 
---
  fs/btrfs/extent-tree.c | 25 ++-
  fs/btrfs/inode.c       | 22 +++-
  fs/btrfs/qgroup.c      | 68 +++---
  fs/btrfs/qgroup.h      |  4 +++
  4 files changed, 113 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 014b7f2..9eaf268 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5500,8 +5500,13 @@ static int pin_down_extent(struct btrfs_root *root,
 	set_extent_dirty(root->fs_info->pinned_extents, bytenr,
 			 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
-	if (reserved)
+	if (reserved) {
+		if (root->fs_info->quota_enabled)


You already have this check in btrfs_qgroup_update_reserved_bytes, 
just call it unconditionally everywhere in this patch.  Otherwise this 
looks good, thanks,


Thanx, I will update it in V2.
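
For readers following the changelog quoted above, the accounting gap can be sketched as a toy model. Everything here is an illustration under assumptions: the toy_* names are invented, the limit check is a simplification rather than the kernel's logic, and the starting excl of 16384 is inferred from the qgroup show output (20987904 minus 20 times 1MiB).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy qgroup with the three stages named in the changelog. When
 * count_reserved is false, bytes sitting in the middle stage are
 * invisible to the limit check, which models the unpatched behaviour.
 */
struct toy_qg {
    uint64_t may_use;
    uint64_t reserved;
    uint64_t excl;
    uint64_t max_excl;
    bool count_reserved;
};

/* fallocate-like request: returns 0 on success, -1 if over quota. */
static int toy_prealloc(struct toy_qg *qg, uint64_t bytes)
{
    uint64_t counted = qg->may_use + qg->excl;

    if (qg->count_reserved)
        counted += qg->reserved;
    if (counted + bytes > qg->max_excl)
        return -1;
    qg->reserved += bytes;      /* preallocated, not yet committed */
    return 0;
}

/* sync: everything in the middle stage becomes exclusive usage. */
static void toy_sync(struct toy_qg *qg)
{
    qg->excl += qg->reserved;
    qg->reserved = 0;
}

/* Issue `files` requests of `size` bytes; returns how many succeeded. */
static unsigned toy_run(struct toy_qg *qg, unsigned files, uint64_t size)
{
    unsigned ok = 0;
    unsigned i;

    for (i = 0; i < files; i++)
        if (toy_prealloc(qg, size) == 0)
            ok++;
    return ok;
}
```

With a 10MiB limit and twenty 1MiB requests, the model lets all twenty through when the middle stage is not counted (excl ends at 20987904 after sync) and stops after nine when it is counted (excl ends at 9453568), matching the two transcripts in the changelog.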


Josef