date:20141212

Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and

2014-12-12 Thread Filipe David Manana

On Fri, Dec 12, 2014 at 12:32 AM, Qu Wenruo  wrote:
>
>  Original Message 
> Subject: Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and
> From: Filipe David Manana 
> To: Qu Wenruo 
> Date: 2014-12-11 19:07
>>
>> On Thu, Dec 11, 2014 at 12:50 AM, Qu Wenruo 
>> wrote:
>>>
>>>  Original Message 
>>> Subject: Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch
>>> and
>>> From: David Sterba 
>>> To: Qu Wenruo 
>>> Date: 2014-12-10 20:37

>>>> On Tue, Dec 09, 2014 at 04:27:19PM +0800, Qu Wenruo wrote:
>>>>>
>>>>> The patchset introduces two new repair functions and some helpers to
>>>>> achieve a big goal:
>>>>> repair a btrfs whose fs tree has a corrupted non-root leaf/node when
>>>>> no valid duplicate exists.
>>>>>
>>>>> The two new repair functions are:
>>>>> repair_inode_nlinks():
>>>>>   Repairs any inode nlink related problem,
>>>>>   from fixing the nlink number and the related
>>>>>   inode_ref/dir_index/dir_item to recovering the file name and file
>>>>>   type and salvaging the inode into the lost+found dir.
>>>>>   This not only fixes a case that some users reported but also
>>>>>   cooperates with repair_inode_no_item() to salvage heavily
>>>>>   damaged inodes into the lost+found dir.
>>>>>
>>>>> repair_inode_no_item():
>>>>>   Repairs the missing inode_item case, which is quite common when
>>>>>   a fs tree leaf/node is missing.
>>>>>   This only rebuilds the inode item. Later recovery, such as moving
>>>>>   it to the lost+found dir, is done by repair_inode_nlinks().
>>>>>
>>>>> The main helper is the repair_btree() function, which drops the
>>>>> corrupted non-root leaf/node and rebalances the tree to keep the
>>>>> btree correct.
>>>>
>>>> Sounds a bit intrusive, but under the circumstances I don't see anything
>>>> better to do.
>>>
>>> A better non-destructive but less generic method may be introduced later.
>>> My dream is to inspect each key and its item to rebuild each member, but
>>> it would take a very long time to implement.
>>>
>>>>> With this patchset, even if a non-root leaf/node is corrupted and no
>>>>> duplicate survived, btrfsck can still repair the fs to a mountable
>>>>> state.
>>>>> (And normal rw should also be OK.)
>>>>>
>>>>> The remaining unfixable problems will be inode nbytes errors with file
>>>>> extent discount errors, which may be fixed in the next patchset.
>>>>>
>>>>> Cc David:
>>>>> Sorry for the huge change in the patchset and for merging the old inode
>>>>> nlink repair with the new inode item rebuild patchset.
>>>>
>>>> No problem, the incremental changelogs helped a lot.
>>>>
>>>>> While developing the inode item rebuild patchset, I found that the old
>>>>> nlink repair cooperated very badly with item rebuild and that there was
>>>>> some duplicated code between the two patchsets, not to mention the math
>>>>> lib introduced by the nlink repair patch.
>>>>> So I decided to somewhat rebase the nlink repair patchset to provide
>>>>> better generality.
>>>>
>>>> Great, the patchset looks good for merge, I'm adding it to 3.18. From
>>>> now on please send only incremental changes and not the whole patchset.
>>>> Thanks.
>>>
>>> Thanks, this should be the last large update patchset.
>>> Later work will focus on file extent recovery and should not interfere
>>> with this patch.
>>>
>>> Thanks.
>>> Qu
>>
>> Can we please get some tests too?
>> Add some broken fs images, document what is broken and the expected
>> result after running the repair code (besides verifying the repair
>> worked for every single inode of course)...
>>
>> thanks
>
> Tests are definitely needed. I tested this by randomly corrupting a leaf of
> the fs tree, which held the contents of my /etc, and then running repair.
>
> The problem is that we can't add tests the way other btrfsck tests do, using
> a btrfs-image dump, since btrfs-image fails to dump a btrfs with a broken
> btree.
> And if we add the test image directly, it may take up several MB as a binary
> image dump.
>
> Any good ideas on how to add a test case without btrfs-image support?

Very simple solution.

Do:

1) Create an empty file;
2) Use it as the backing file for a loop device;
3) Run mkfs.btrfs against the loop device;
4) Mount it;
5) Populate the fs;
6) Umount it;
7) Corrupt some nodes or leaves (e.g. by zeroing them out);
8) Create a tarball from the backing file like this:
   XZ_OPT=-9 tar cJSvf foobar.tar.xz run.sh backing_file
9) Add the tarball to the fsck-tests directory;
10) Make the test run fsck against the backing file extracted from the
tarball - fsck can operate against regular files, and not only against
devices.
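
The file-level steps above (1, 7, 8 and the extraction part of 10) can be sketched in user space as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the mkfs/mount/populate steps are left out because they need root and btrfs-progs, and Python's tarfile module does not record sparse regions the way tar's S flag does.

```python
import os
import shutil
import tarfile


def make_backing_file(path, size):
    """Step 1: create an empty (sparse) file of `size` bytes."""
    with open(path, "wb") as f:
        f.truncate(size)


def zero_range(path, offset, length):
    """Step 7: 'corrupt' a block by overwriting it with zeros."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\0" * length)


def pack_image(image_path, tarball_path):
    """Step 8: store the image in an xz-compressed tarball."""
    with tarfile.open(tarball_path, "w:xz") as tar:
        tar.add(image_path, arcname=os.path.basename(image_path))


def unpack_image(tarball_path, member, dest_path):
    """Step 10 (first half): extract the image so fsck can run on the file."""
    with tarfile.open(tarball_path, "r:xz") as tar, open(dest_path, "wb") as out:
        shutil.copyfileobj(tar.extractfile(member), out)
```

Because the image is mostly zeros, the compressed tarball stays far smaller than the backing file, which is what makes checking such images into the test directory practical.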

I did that a couple months ago, see:

http://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/tree/tests/fsck-tests.sh?h=v3.17.x#n30

Precisely because btrfs-image won't work for some kinds of filesystem damage.

Thanks.

>
> Thanks,
> Qu
>

[PATCH v2 2/2] Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.

2014-12-12 Thread Dongsheng Yang

Currently, for pre_alloc or delay_alloc, the bytes are accounted
in space_info by three counters:
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
In qgroup, on the other hand, there are only two counters to account the
bytes: qgroup->reserved and qgroup->excl. qg->reserved accounts the
bytes in space_info->bytes_may_use and qg->excl accounts the bytes in
space_info->used, so the bytes in space_info->reserved are not accounted
in qgroup. As a result, there is a window in which we can exceed the quota
limit while bytes are in space_info->reserved.

Example:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer     excl     max_rfer max_excl parent  child
-------- ----     ----     -------- -------- ------  -----
0/5      20987904 20987904 0        10485760 ---     ---

qg->excl is 20987904, which is larger than max_excl (10485760).

This patch introduces a new counter named may_use to qgroup, so that
there are three counters in qgroup to account the bytes in space_info,
as below.
space_info->bytes_may_use --- space_info->reserved --- space_info->used.
qgroup->may_use   --- qgroup->reserved --- qgroup->excl

With this patch applied:
# btrfs quota enable /mnt
# btrfs qgroup limit -e 10M /mnt
# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
# sync
# btrfs qgroup show -pcre /mnt
qgroupid rfer    excl    max_rfer max_excl parent  child
-------- ----    ----    -------- -------- ------  -----
0/5      9453568 9453568 0        10485760 ---     ---
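
The accounting window described above can be illustrated with a toy model. This is a sketch, not kernel code: the class and function names are hypothetical, and the only assumption is the counter mapping given in the commit message (bytes move from may_use to reserved to excl, and the limit check either ignores or includes the reserved stage).

```python
class QgroupModel:
    """Toy model of the qgroup byte counters (not kernel code)."""

    def __init__(self, max_excl, count_reserved):
        self.max_excl = max_excl
        # True models the patched behaviour: bytes sitting in the
        # "reserved" stage are charged against the limit as well.
        self.count_reserved = count_reserved
        self.may_use = 0
        self.reserved = 0
        self.excl = 0

    def charged(self):
        total = self.may_use + self.excl
        if self.count_reserved:
            total += self.reserved
        return total

    def reserve(self, nbytes):
        """Limit check at reservation time; bytes enter may_use."""
        if self.charged() + nbytes > self.max_excl:
            return False
        self.may_use += nbytes
        return True

    def allocate(self, nbytes):
        """Preallocation moves bytes from may_use to reserved."""
        self.may_use -= nbytes
        self.reserved += nbytes

    def commit(self):
        """A sync moves everything in reserved to excl."""
        self.excl += self.reserved
        self.reserved = 0


def fallocate_files(qg, count, size):
    """Model `count` fallocate calls followed by one sync."""
    succeeded = 0
    for _ in range(count):
        if qg.reserve(size):
            qg.allocate(size)
            succeeded += 1
    qg.commit()
    return succeeded
```

In this model all 20 one-megabyte allocations pass when the reserved stage is ignored, and only 10 pass when it is counted. The real run above stops one file earlier; the reported excl of 9453568 is 9 MiB plus 16 KiB, so presumably some metadata is charged to the qgroup as well.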

Reported-by: Cyril SCETBON 
Signed-off-by: Dongsheng Yang 
---
Changelog:
v1 -> v2:
Remove the redundant check for fs_info->quota_enabled.

 fs/btrfs/extent-tree.c | 20 ++-
 fs/btrfs/inode.c   | 18 -
 fs/btrfs/qgroup.c  | 68 +++---
 fs/btrfs/qgroup.h  |  4 +++
 4 files changed, 104 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 014b7f2..f4ad737 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5500,8 +5500,12 @@ static int pin_down_extent(struct btrfs_root *root,
 
set_extent_dirty(root->fs_info->pinned_extents, bytenr,
 bytenr + num_bytes - 1, GFP_NOFS | __GFP_NOFAIL);
-   if (reserved)
+   if (reserved) {
+   btrfs_qgroup_update_reserved_bytes(root->fs_info,
+  root->root_key.objectid,
+  num_bytes, -1);
trace_btrfs_reserved_extent_free(root, bytenr, num_bytes);
+   }
return 0;
 }
 
@@ -6230,6 +6234,9 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
btrfs_update_reserved_bytes(cache, buf->len, RESERVE_FREE, 0);
trace_btrfs_reserved_extent_free(root, buf->start, buf->len);
pin = 0;
+   btrfs_qgroup_update_reserved_bytes(root->fs_info,
+  root->root_key.objectid,
+  buf->len, -1);
}
 out:
if (pin)
@@ -6964,7 +6971,11 @@ static int __btrfs_free_reserved_extent(struct btrfs_root *root,
else {
btrfs_add_free_space(cache, start, len);
btrfs_update_reserved_bytes(cache, len, RESERVE_FREE, delalloc);
+   btrfs_qgroup_update_reserved_bytes(root->fs_info,
+  root->root_key.objectid,
+  len, -1);
}
+
btrfs_put_block_group(cache);
 
trace_btrfs_reserved_extent_free(root, start, len);
@@ -7200,6 +7211,9 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
BUG_ON(ret); /* logic error */
ret = alloc_reserved_file_extent(trans, root, 0, root_objectid,
 0, owner, offset, ins, 1);
+   btrfs_qgroup_update_reserved_bytes(root->fs_info,
+  root->root_key.objectid,
+

[PATCH v2 1/2] Btrfs: qgroup: free reserved in exceeding quota.

2014-12-12 Thread Dongsheng Yang

When we exceed the quota limit while writing, we free the
reserved extents that have to be dropped, but we do not free
the corresponding accounting in the qgroup. This means that
each time we exceed the quota while writing, some space is left
in qg->reserved that we can no longer use. If this goes on,
all of the space will be eaten up.
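
The leak can be pictured with a small model. This is a sketch with hypothetical names, not the kernel code: it assumes only what the message above states, namely that a failed write drops its extent without returning the bytes charged to qg->reserved.

```python
class Qgroup:
    """Toy model of qg->reserved (not kernel code)."""

    def __init__(self, limit):
        self.limit = limit
        self.reserved = 0

    def reserve(self, nbytes):
        if self.reserved + nbytes > self.limit:
            return False
        self.reserved += nbytes
        return True

    def free(self, nbytes):
        self.reserved -= nbytes


def failed_write(qg, nbytes, free_on_failure):
    """Model one write that reserves `nbytes` and then fails on quota."""
    if not qg.reserve(nbytes):
        return False  # nothing could be reserved at all
    # ... the write exceeds the quota later on and its extent is dropped ...
    if free_on_failure:
        qg.free(nbytes)  # the fix: give the qgroup charge back
    return True
```

With a 64 KiB limit and 16 KiB per dropped extent, four failed writes without the free leave the whole limit stuck in `reserved`, so a fifth reservation is refused. With the free, `reserved` returns to zero after every failure.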

Signed-off-by: Dongsheng Yang 
Reviewed-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a84e00d..014b7f2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5262,8 +5262,11 @@ out_fail:
to_free = 0;
}
spin_unlock(&BTRFS_I(inode)->lock);
-   if (dropped)
+   if (dropped) {
+   if (root->fs_info->quota_enabled)
+   btrfs_qgroup_free(root, dropped * root->nodesize);
to_free += btrfs_calc_trans_metadata_size(root, dropped);
+   }
 
if (to_free) {
btrfs_block_rsv_release(root, block_rsv, to_free);
-- 
1.8.4.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and

2014-12-12 Thread Qu Wenruo

 Original Message 
Subject: Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and
From: Filipe David Manana 
To: Qu Wenruo 
Date: 2014-12-12 16:34

Oh, thanks for pointing out that the btrfs-progs tests can handle
raw image dumps.

I'll try to pick an image of a suitable size.
(Currently I use a 1G file for testing; I must find a smaller one.)

Thanks,
Qu


Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread David Taylor


On Thu, 11 Dec 2014, Robert White wrote:

> On 12/11/2014 07:56 PM, Zygo Blaxell wrote:
>
>> RAID5 with even parity and two devices should be exactly the same as
>> RAID1 (i.e. disk1 ^ disk2 == 0, therefore disk1 == disk2, the striping
>> is irrelevant because there is no difference in disk contents so the
>> disks are interchangeable), except with different behavior when more
>> devices are added (RAID1 will mirror chunks on pairs of disks, RAID5
>> should start writing new chunks with N stripes instead of two).
>
> That's not correct. A RAID5 with three elements presents two
> _different_ sectors in each stripe. When one element is lost, it would
> still present two different sectors, but the safety is gone.

The above quote is discussing two device RAID5, you are discussing
three device RAID5.

> I understand that the XOR collapses into a mirror if only two datum
> are involved, but that's a mathematical fact that is irrelevant to the
> definition of a RAID5 layout. When you take a wheel off of a tricycle
> it doesn't just become a bike. And you can't make a bicycle into a
> trike by just welding on a wheel somewhere. The infrastructure of the
> two is completely different.

True.  A two-device RAID5 is not the same as a degraded three-device
RAID5.

> So RAID5 with three media M is
>
> MMM  MMM  MMM
> D1   D2   P(a)
> D3   P(b) D4
> P(c) D5   D6
>
> If MMM is lost D1, D2, D3, and D5 are intact
> D4 and D6 can be recreated via D3^P(b) and P(c)^D5
>
> MMM  MMM  X
> D1   D2   .
> D3   P(b) .
> P(c) D5   .
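
The recovery rule quoted above (D4 = D3 ^ P(b)) can be checked with a few lines. This is only a sketch: the block contents are made up and the helper name is hypothetical.

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# Second stripe of the three-device layout: D3, P(b), D4.
d3 = b"third data block"
d4 = b"fourth data blk!"
p_b = xor_blocks(d3, d4)  # parity written alongside the data

# The device holding D4 is lost; rebuild it from the two survivors.
rebuilt_d4 = xor_blocks(d3, p_b)
```

The same call rebuilds any single missing member of a stripe, data or parity, from the remaining ones.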


> So under _no_ circumstances would a two-disk RAID5 be the same as a
> RAID1 since a two disk RAID5 functionally implies disk three because
> the _minimum_ arity of a RAID5 is 3. A two-disk RAID5 has _zero_ data
> protection because the minimum third element is a computational
> phantom.

You again seem to be treating a "two disk RAID5" as synonymous with your
degraded three disk RAID5 above.  It is not.

RAID5 with two media M would be:

MMM  MMM
D1   P(a)
P(b) D2
D3   P(c)

[and each P would be identical to its corresponding D]
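
The bracketed remark follows directly from how XOR parity is computed: with N devices a stripe holds N-1 data blocks, and the parity over a single block is the block itself. A minimal sketch, with hypothetical helper names:

```python
def xor_parity(data_blocks):
    """XOR parity over the data blocks of one stripe."""
    parity = bytearray(len(data_blocks[0]))
    for block in data_blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def build_stripe(data_blocks):
    """One RAID5 stripe for N devices: N-1 data blocks plus parity."""
    return list(data_blocks) + [xor_parity(data_blocks)]


two_device = build_stripe([b"D1"])           # N = 2
three_device = build_stripe([b"D1", b"D2"])  # N = 3
```

For N = 2 both members of the stripe carry the same bytes, which is the sense in which the layout matches a two-device mirror. For N = 3 the parity generally differs from both data blocks.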

> In short it is irrational to have a "two disk" RAID5 that is "not
> degraded" in the same way you cannot have a two-wheeled tricycle
> without scraping some part of something along the asphalt.

There is nothing irrational about it at all, except that it is
exactly equivalent to two disk RAID1.

> A RAID1 with two elements presents one sector along the "stripe".

A RAID5 with N elements presents N-1 sectors along the "stripe",
so I'm not sure what the problem is with setting N=2.

> I realize that what has been implemented is what you call a two drive
> RAID5, and done so by really implementing a RAID1, but it's nonsense.

It's not really, it's merely an argument of semantics if you want
to define it as nonsense.

> I mean I understand what you are saying you've done, but it makes no
> sense according to the definitions of RAID5. There is no circumstance
> where RAID5 falls back to mirroring. Trying to implement RAID5 as an
> extension of a mirroring paradigm would involve a fundamental conflict
> in definitions. Especially when you reached a failure mode.

I have no idea what you mean by "a fundamental conflict in definition".

> This is so fundamental to the design that the "fast" way to assemble a
> RAID5 of N-arity (minimum N being 3) is to just connect the first N-1
> elements, declare the raid valid-but-degraded using (N-1) of the
> media, and then "replacing" the Nth phantom/missing/failed element
> with the real disk and triggering a rebuild. This only works if you
> don't need the initial contents of the array to have a specific value
> like zero. (This involves fewest reads and the array is instantly
> available while it builds.)

There is no reason you could not do exactly this with N=2.

> As soon as you start writing to the array, the stripes you write
> "repair" the extents if the repair process hadn't gotten to them yet.
>
> It's basically impossible to turn a mirror into a RAID5 if you _ever_
> expect the code base to be able to recover an array that's lost an
> element.

Again, I'm not really sure what you mean.

> Uh, no. A raid 6 with three drives, or even two drives, is also
> degraded because the minimum is four.

You're doing your weird semantic dance again.  Just because you
define the minimum to be four does not mean that someone talking
about a three device RAID6 is talking about a degraded four device
RAID6, they're not.

As above, a non-degraded three-device RAID6 can be perfectly
sensibly defined.  Once again, it has exactly the same failure
properties as a three device RAID1 (any two of the devices can
fail), so it's a bit pointless.  But not "impossible"...
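
The three-device RAID6 case can be checked the same way. The sketch below assumes the common P/Q construction (P is the XOR of the data blocks, Q is the XOR of g^i * D_i over GF(2^8) with g = 2 and the polynomial 0x11d); the thread itself does not say which construction is meant, so treat that as an assumption.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(base, exponent):
    """Raise `base` to a non-negative integer power in GF(2^8)."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result


def raid6_syndromes(data_blocks):
    """Return (P, Q) for one stripe of equally sized data blocks."""
    size = len(data_blocks[0])
    p = bytearray(size)
    q = bytearray(size)
    for index, block in enumerate(data_blocks):
        coefficient = gf_pow(2, index)  # g^0 = 1 for the first block
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coefficient, byte)
    return bytes(p), bytes(q)
```

With a single data block per stripe the coefficient is g^0 = 1, so P, Q and the data block are byte-for-byte identical. Under that assumption a three-device RAID6 stores three copies, and any two devices can fail.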



> A   B   C   D
> D1  D2  Pa  Qa
> D3  Pb  Qb  D4
> Pc  Qc  D5  D6
> Qd  D7  D8  Pd
>
> You can lose one or two media but the minimum stripe is again [X1,X2]
> for any read (ABCD)(ABC.)(AB..)(A..D) etc.
>
> Minimum arity for RAID6 is 4, maximum lost-but-functional
> configuration is arity-minus-two.

A   B   C
D1  Pa  Qa
Pb  Qb  D2
Qc  D3  Pc
D4  Pd  Qd


They're only missing i

Re: Balance & scrub & defrag

2014-12-12 Thread Erkki Seppala

Robert White  writes:

> You need to buy better disks. 8-)

Where can one buy these better disks with reasonable prices?-) Disks are
best thought of as consumables.

> I use SMART (smartmontools etc) and its tests to keep track of and
> warn me of such issues. It's way more likely to catch incipient media
> failures long before scrub would.

That may be sort of true, but I think even SMART is helped by the fact
that the media is read through from the beginning to the end*, so it can
detect even the errors that don't bubble through the IO layer. And BTRFS
can indeed note errors that the media doesn't - two checksums is better
than one checksum, assuming they aren't exactly the same algorithm ;).

Do you alternatively execute SMART self tests?

* scrub doesn't do this, it reads only through used data

-- 
  _
 / __// /__   __   http://www.modeemi.fi/~flux/\   \
/ /_ / // // /\ \/ /\  /
   /_/  /_/ \___/ /_/\_\@modeemi.fi  \/


Re: Balance & scrub & defrag

2014-12-12 Thread Tomasz Chmielewski


> I use SMART (smartmontools etc) and its tests to keep track of and warn
> me of such issues. It's way more likely to catch incipient media
> failures long before scrub would. It's also more likely to correct
> situations before they become visible to userspace. It's also a way
> better full-platter scan that involves less real time delay and won't
> bog down a running system.

Don't put too much trust in SMART - sectors can rot unexpectedly even if
SMART thinks everything is fine with the drive.

I had exactly this issue recently:

1) one of the drives in the server failed and was replaced

2) "btrfs device delete missing" (which basically moves data from the
remaining drive to the new one) was failing with an IO error

3) according to SMART, the drive with the IO error was fine (no reallocated
sectors, no warnings etc.)

So, scrub to the rescue - it printed the "broken" files; after removing them
manually, it was possible to finish "btrfs device delete missing".

It probably makes sense to run scrub occasionally (just like mdraid
does on most distributions).



--
Tomasz Chmielewski
http://www.sslrack.com


Re: Announcements for btrfs-progs?

2014-12-12 Thread David Sterba

On Thu, Dec 11, 2014 at 12:37:56PM +, Holger Hoffstätte wrote:
> I was wondering if you could please send out announcements for btrfs-progs
> when you tag a release or -rc? There doesn't seem to be a good mechanism
> to track releases and IMHO the more people are notified, the more
> testing we can get, not to mention faster propagation into distros.

Will do. Sorry for the inconvenience.

Re: [RFC PATCH v3 0/7] btrfs: implement swap file support

2014-12-12 Thread David Sterba

On Tue, Dec 09, 2014 at 05:45:41PM -0800, Omar Sandoval wrote:
> After some discussion on the mailing list, I decided that for simplicity and
> reliability, it's best to simply disallow COW files and files with shared
> extents (like files with extents shared with a snapshot). From a user's
> perspective, this means that a snapshotted subvolume cannot be used for a swap
> file, but keeping the swap file in a separate subvolume that is never
> snapshotted seems entirely reasonable to me.

Well, there are already enough special cases for how to do things on btrfs
and I'd like to avoid introducing another one.

> An alternative suggestion was to
> allow swap files to be snapshotted and to do an implied COW on swap file
> activation, which I was ready to implement until I realized that we can't
> permit snapshotting a subvolume with an active swap file, so this creates a
> surprising inconsistency for users (in my opinion).

I still don't see why it's not possible to do the snapshot with an
active swapfile.

> As with before, this functionality is tenuously tested in a virtual machine
> with some artificial workloads, but it "works for me". I'm pretty happy with
> the results on my end, so please comment away.

The non-btrfs changes can go independently and do not have to wait until
we resolve the swap vs snapshot problem.

I did a simple test and it crashed instantly, lockdep complains:

memory: 2G
swap file: 1G
kernel: 3.17 + v3

[  739.790731] Adding 1054716k swap on /mnt/test-swap/mnt/swapfile.  Priority:-1 extents:1 across:1054716k
[  751.848607]
[  751.851852] =====================================
[  751.852161] [ BUG: bad unlock balance detected! ]
[  751.852161] 3.17.0-default+ #199 Not tainted
[  751.852161] -------------------------------------
[  751.852161] heavy_swap/4119 is trying to release lock (&sb->s_type->i_mutex_key) at:
[  751.852161] [] mutex_unlock+0xe/0x10
[  751.852161] but there are no more locks to release!
[  751.852161]
[  751.852161] other info that might help us debug this:
[  751.852161] 1 lock held by heavy_swap/4119:
[  751.852161]  #0:  (&mm->mmap_sem){++}, at: [] __do_page_fault+0x149/0x560
[  751.852161]
[  751.852161] stack backtrace:
[  751.852161] CPU: 1 PID: 4119 Comm: heavy_swap Not tainted 3.17.0-default+ #199
[  751.852161] Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
[  751.852161]  81a4f0ce 880075dbb3d8 81a4b268 
0001
[  751.852161]  8800775e 880075dbb408 810b51a9 

[  751.852161]  8800775e  8800763c9d00 
880075dbb4a8
[  751.852161] Call Trace:
[  751.852161]  [] ? mutex_unlock+0xe/0x10
[  751.852161]  [] dump_stack+0x51/0x71
[  751.852161]  [] print_unlock_imbalance_bug+0xf9/0x100
[  751.852161]  [] lock_release_non_nested+0x2cf/0x3e0
[  751.852161]  [] ? ftrace_call+0x5/0x2f
[  751.852161]  [] ? mutex_unlock+0xe/0x10
[  751.852161]  [] ? mutex_unlock+0xe/0x10
[  751.852161]  [] lock_release+0xc9/0x240
[  751.852161]  [] __mutex_unlock_slowpath+0x80/0x190
[  751.852161]  [] ? mutex_unlock+0x9/0x10
[  751.852161]  [] mutex_unlock+0xe/0x10
[  751.852161]  [] btrfs_direct_IO+0x2b8/0x310 [btrfs]
[  751.852161]  [] ? __wake_up_bit+0xd/0x50
[  751.852161]  [] __swap_writepage+0x10b/0x270
[  751.852161]  [] ? page_swapcount+0x53/0x70
[  751.852161]  [] swap_writepage+0x37/0x60
[  751.852161]  [] shmem_writepage+0x2a2/0x2e0
[  751.852161]  [] shrink_page_list+0x44e/0x9d0
[  751.852161]  [] ? _raw_spin_unlock_irq+0x30/0x40
[  751.852161]  [] shrink_inactive_list+0x26d/0x4f0
[  751.852161]  [] ? blk_start_plug+0x9/0x50
[  751.852161]  [] shrink_lruvec+0x5c8/0x6c0
[  751.852161]  [] ? compaction_suitable+0x19/0xc0
[  751.852161]  [] ? compaction_suitable+0x19/0xc0
[  751.852161]  [] shrink_zone+0x4d/0x120
[  751.852161]  [] do_try_to_free_pages+0x19a/0x3a0
[  751.852161]  [] ? pfmemalloc_watermark_ok+0xd/0xc0
[  751.852161]  [] try_to_free_pages+0xb2/0x160
[  751.852161]  [] ? _cond_resched+0x9/0x30
[  751.852161]  [] __alloc_pages_nodemask+0x5eb/0xa90
[  751.852161]  [] ? ftrace_call+0x5/0x2f
[  751.852161]  [] ? anon_vma_prepare+0x21/0x190
[  751.852161]  [] do_huge_pmd_anonymous_page+0xe8/0x330
[  751.852161]  [] ? is_vma_temporary_stack+0x9/0x30
[  751.852161]  [] handle_mm_fault+0x135/0xb60
[  751.852161]  [] ? find_vma+0x15/0x80
[  751.852161]  [] ? vmacache_find+0xd/0xd0
[  751.852161]  [] ? __might_sleep+0xe/0x110
[  751.852161]  [] __do_page_fault+0x1ad/0x560
[  751.852161]  [] ? do_fork+0xe0/0x420
[  751.852161]  [] ? error_sti+0x5/0x6
[  751.852161]  [] ? trace_hardirqs_off_thunk+0x3a/0x3c
[  751.852161]  [] do_page_fault+0xc/0x10
[  751.852161]  [] page_fault+0x22/0x30

Re: [RFC PATCH v3 6/7] btrfs: add EXTENT_FLAG_SWAPFILE

2014-12-12 Thread David Sterba

On Tue, Dec 09, 2014 at 05:45:47PM -0800, Omar Sandoval wrote:
> Extents mapping a swap file should remain pinned in memory in order to
> avoid doing allocations to look up an extent when we're already low on
> memory. Rather than overloading EXTENT_FLAG_PINNED, add a new flag
> specifically for this purpose.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: David Sterba 

Re: [RFC PATCH v3 7/7] btrfs: enable swap file support

2014-12-12 Thread David Sterba

On Tue, Dec 09, 2014 at 05:45:48PM -0800, Omar Sandoval wrote:
> +static void __clear_swapfile_extents(struct inode *inode)
> +{
> + u64 isize = inode->i_size;
> + struct extent_map *em;
> + u64 start, len;
> +
> + start = 0;
> + while (start < isize) {
> + len = isize - start;
> + em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> + if (IS_ERR(em))
> + return;

This could transiently fail if there's no memory to allocate the em, and
would leak the following extents.

> +
> + clear_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> +
> + start = extent_map_end(em);
> + free_extent_map(em);
> + }
> +}
> +
> +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> +                               sector_t *span)
> +{
> + struct inode *inode = file_inode(file);
> + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> + int ret = 0;
> + u64 isize = inode->i_size;
> + struct extent_state *cached_state = NULL;
> + struct extent_map *em;
> + u64 start, len;
> +
> + if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> + /* Can't do direct I/O on a compressed file. */
> + btrfs_err(fs_info, "swapfile is compressed");
> + return -EINVAL;
> + }
> + if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> + /*
> +  * Going through the copy-on-write path while swapping pages
> +  * in/out and doing a bunch of allocations could stress the
> +  * memory management code that got us there in the first place,
> +  * and that's sure to be a bad time.
> +  */
> + btrfs_err(fs_info, "swapfile is copy-on-write");
> + return -EINVAL;
> + }
> +
> + lock_extent_bits(io_tree, 0, isize - 1, 0, &cached_state);
> +
> + /*
> +  * All of the extents must be allocated and support direct I/O. Inline
> +  * extents and compressed extents fall back to buffered I/O, so those
> +  * are no good. Additionally, all of the extents must be safe for nocow.
> +  */
> + atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> + start = 0;
> + while (start < isize) {
> + len = isize - start;
> + em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> + if (IS_ERR(em)) {

IS_ERR_OR_NULL(em)

From now on the em is valid and has to be free_extent_map()ed ...

> + ret = PTR_ERR(em);
> + goto out;
> + }
> +
> + if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
> + em->block_start == EXTENT_MAP_HOLE) {
> + btrfs_err(fs_info, "swapfile has holes");
> + ret = -EINVAL;

... and all the error branches would miss it.

> + goto out;
> + }
> + if (em->block_start == EXTENT_MAP_INLINE) {
> + /*
> +  * It's unlikely we'll ever actually find ourselves
> +  * here, as a file small enough to fit inline won't be
> +  * big enough to store more than the swap header, but in
> +  * case something changes in the future, let's catch it
> +  * here rather than later.
> +  */
> + btrfs_err(fs_info, "swapfile is inline");
> + ret = -EINVAL;

here

> + goto out;
> + }
> + if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> + btrfs_err(fs_info, "swapfile is compresed");
> + ret = -EINVAL;

here

> + goto out;
> + }
> + ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> + if (ret < 0) {

here

> + goto out;
> + } else if (ret == 1) {
> + ret = 0;
> + } else {
> + btrfs_err(fs_info, "swapfile has extent requiring COW (%llu-%llu)",
> +   start, start + len - 1);
> + ret = -EINVAL;

here

> + goto out;
> + }
> +
> + set_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> +
> + start = extent_map_end(em);
> + free_extent_map(em);
> + }
> +
> +out:
> + if (ret) {

should be fixed by:

if (!IS_ERR_OR_NULL(em))
free_extent_map(em);

> + __clear_swapfile_extents(inode);
> + atomic_dec(&BTRFS_I(inode)->root->nr_swapfiles);
> + }
> + unlock_extent_cached(io_tree, 0, isize - 1, &cached_state, GFP_NOFS);
> + return ret;
> +}

[PATCH] btrfs-progs: fix typedef

2014-12-12 Thread Karel Zak

Signed-off-by: Karel Zak 
---
 kerncompat.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kerncompat.h b/kerncompat.h
index 8afadc8..5c1cca9 100644
--- a/kerncompat.h
+++ b/kerncompat.h
@@ -123,7 +123,7 @@ typedef unsigned long long u64;
 typedef unsigned char u8;
 typedef unsigned short u16;
 typedef long long s64;
-typedef int s32
+typedef int s32;
 #endif
 
 
-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread Robert White


On 12/12/2014 01:06 AM, David Taylor wrote:
> The above quote is discussing two device RAID5, you are discussing
> three device RAID5.


Heresy! (yes, some humor is required here.)

There is no such thing as a "two device RAID5". That's what RAID1 is for.

Saying "The above quote is discussing a two device RAID5" is exactly 
like saying "The above quote is discussing a two wheeled tricycle".


You might as well be talking about three-octet IP addresses. That is, you 
could make a network address out of three octets, but it wouldn't be an 
IP address. It would be something else with the wrong name attached.


I challenge you... nay I _defy_ you... to find a single authority on 
disk storage anywhere on this planet (except, apparently, this list and 
its directly attached people and materials) that discusses, describes, 
or acknowledges the existence of a "two device RAID5" while not 
discussing a system with an arity of 3 degraded by the absence of one medium.


All these words have standardized definitions.

[That's not hyperbole. I searched for several hours and could not find 
_any_ reference anywhere to construction of a RAID5 array using only two 
devices that did not involve arity-3 and a dummy/missing/failed pseudo 
target. So if you can find any reference to doing this _anywhere_ 
outside of BTRFS I'd like to see it. Genuinely.]


THAT SAID...

I really can find no reason the math wouldn't work using only two 
drives. It would be a terrific waste of CPU cycles and storage space to 
construct the stripe buffers and do the XORs instead of just copying the 
data, but the math would work.


So, um, "well I'll be damned".

Perhaps it's just a tautological belief that someone here didn't buy into. 
Like how people keep partitioning drives into little slices for things 
because that's the preserved wisdom from the early eighties.


I think constructing a non-degraded-mode two device thing and calling it 
RAID5 will surprise virtually _everyone_ on the planet.


In every other system, and I do mean _every_ other system, if I had two 
media and I put them under RAID-5 I'd be required to specify the third 
drive as some sort of failed device (the block device equivalent of 
/dev/null, but one that returns error results for all operations instead 
of successes). See the reserved keyword "missing" in the mdadm 
documentation etc.


That is, if I put two 1TiB disks into a RAID-5 I'd expect to get a 2TiB 
array with no actual redundancy. As in


mdadm --create md0 --level=5 --raid-devices=3 /dev/sda missing /dev/sdc

the resulting array would be the same effective size as a stripe of the 
two drives, but when the third was added later it would just slot in as 
a replacement for the missing device and the arity-3 thing would 
"reestablish" its redundancy. (This is actually what mdadm does 
internally with a normal build: it blesses the first N-1 drives into an 
array with a missing member, adds the Nth drive as a "spare", and 
then the spare is immediately adopted as a replacement for the "missing" 
drive.)


The parity computation on a single value is just a nutty waste of time 
though. "Backing it out" when the array is degraded is double-nuts.


Maybe everybody just decided it was too crazy to consider for the CPU 
time penalty...?


So yeah, semantics... apparently...

btrfs-prog: improve build-system by autoconf

2014-12-12 Thread Karel Zak

This is the first step to make the btrfs-progs build system more conventional
for userspace users and developers. Everything is implemented as small
incremental patches to keep things reviewable.

The Makefile targets and rules are not changed; things like V=1 (verbose), C=1
(sparse), static builds, etc. still work as expected. The changes are mostly
about $LIBS, $CFLAGS and proper detection of libraries (uuid, blkid, lzo2, ...).

Note that there are also strange unused btrfs_convert_libs, btrfs_image_libs and
btrfs_fragments_libs variables with things like "-lgd -lpng -ljpeg -lfreetype".
I guess it's some legacy, right? I didn't touch these variables as I have no
idea what they are for.


[PATCH 01/10] btrfs-progs: add ./configure script
[PATCH 02/10] btrfs-progs: use config.h
[PATCH 03/10] btrfs-progs: use standard PACKAGE_* macros
[PATCH 04/10] btrfs-progs: use ./configure to generate version.h
[PATCH 05/10] btrfs-progs: check for build programs in ./configure
[PATCH 06/10] btrfs-progs: use paths and $*_LIBS from ./configure
[PATCH 07/10] btrfs-progs: cleanup compilation flags usage
[PATCH 08/10] btrfs-progs: clean generated files, make version.h
[PATCH 09/10] btrfs-progs: add --disable-backtrace
[PATCH 10/10] btrfs-progs: add --disable-documentation

The next possible step is automake, but I'd like to merge the ./configure stuff first.

Karel



[PATCH 06/10] btrfs-progs: use paths and $*_LIBS from ./configure

2014-12-12 Thread Karel Zak

Signed-off-by: Karel Zak 
---
 Makefile.in | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index 17eea58..df590ab 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -26,11 +26,11 @@ libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
 TESTS = fsck-tests.sh convert-tests.sh
 
 INSTALL = @INSTALL@
-prefix ?= /usr/local
-bindir = $(prefix)/bin
-lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
-libdir ?= $(prefix)/lib
-incdir = $(prefix)/include/btrfs
+prefix ?= @prefix@
+bindir = @bindir@
+lib_LIBS = @UUID_LIBS@ @BLKID_LIBS@ @ZLIB_LIBS@ @LZO2_LIBS@ -m -L.
+libdir ?= @libdir@
+incdir = @includedir@/btrfs
 LIBS = $(lib_LIBS) $(libs_static)
 
 ifeq ("$(origin V)", "command line")
-- 
1.9.3


[PATCH 05/10] btrfs-progs: check for build programs in ./configure

2014-12-12 Thread Karel Zak

Signed-off-by: Karel Zak 
---
 Makefile.in  | 12 ++--
 configure.ac |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index dad1685..17eea58 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -1,9 +1,9 @@
 # Export all variables to sub-makes by default
 export
 
-CC = gcc
-LN = ln
-AR = ar
+CC = @CC@
+LN_S = @LN_S@
+AR = @AR@
 AM_CFLAGS = -include config.h -Wall \
-D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES \
-fno-strict-aliasing -fPIC
@@ -25,7 +25,7 @@ libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
   extent_io.h ioctl.h ctree.h btrfsck.h version.h
 TESTS = fsck-tests.sh convert-tests.sh
 
-INSTALL = install
+INSTALL = @INSTALL@
 prefix ?= /usr/local
 bindir = $(prefix)/bin
 lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
@@ -165,8 +165,8 @@ $(libs_static): $(libbtrfs_objects)
 
 $(lib_links):
@echo "[LN] $@"
-   $(Q)$(LN) -sf libbtrfs.so.0.1 libbtrfs.so.0
-   $(Q)$(LN) -sf libbtrfs.so.0.1 libbtrfs.so
+   $(Q)$(LN_S) -f libbtrfs.so.0.1 libbtrfs.so.0
+   $(Q)$(LN_S) -f libbtrfs.so.0.1 libbtrfs.so
 
 # keep intermediate files from the below implicit rules around
 .PRECIOUS: $(addsuffix .o,$(progs))
diff --git a/configure.ac b/configure.ac
index 937d50f..662d9ff 100644
--- a/configure.ac
+++ b/configure.ac
@@ -27,6 +27,8 @@ AC_C_BIGENDIAN
 AC_SYS_LARGEFILE
 
 AC_PROG_INSTALL
+AC_PROG_LN_S
+AC_PATH_PROG([AR], [ar])
 
 AC_CHECK_FUNCS([openat], [],
[AC_MSG_ERROR([cannot find openat() function])])
-- 
1.9.3


[PATCH 09/10] btrfs-progs: add --disable-backtrace

2014-12-12 Thread Karel Zak

It's better to use ./configure than manually edit Makefile.

Signed-off-by: Karel Zak 
---
 Makefile.in  |  4 
 configure.ac | 10 ++
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index df752d3..bdd7683 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -82,10 +82,6 @@ BUILDDIRS = $(patsubst %,build-%,$(SUBDIRS))
 INSTALLDIRS = $(patsubst %,install-%,$(SUBDIRS))
 CLEANDIRS = $(patsubst %,clean-%,$(SUBDIRS))
 
-ifeq ($(DISABLE_BACKTRACE),1)
-CFLAGS += -DBTRFS_DISABLE_BACKTRACE
-endif
-
 ifneq ($(DISABLE_DOCUMENTATION),1)
 BUILDDIRS += build-Documentation
 INSTALLDIRS += install-Documentation
diff --git a/configure.ac b/configure.ac
index f6adefb..290d022 100644
--- a/configure.ac
+++ b/configure.ac
@@ -56,6 +56,16 @@ AC_DEFUN([PKG_STATIC], [
   fi
 ])
 
+
+AC_ARG_ENABLE([backtrace],
+  AS_HELP_STRING([--disable-backtrace], [disable btrfs backtrace]),
+  [], [enable_backtrace=yes]
+)
+
+AS_IF([test "x$enable_backtrace" = xno], [
+  AC_DEFINE([BTRFS_DISABLE_BACKTRACE], [1], [disable backtrace stuff in kerncompat.h ])
+])
+
 dnl Define _LIBS= and _CFLAGS= by pkg-config
 dnl
 dnl The default PKG_CHECK_MODULES() action-if-not-found is end the
-- 
1.9.3


[PATCH 04/10] btrfs-progs: use ./configure to generate version.h

2014-12-12 Thread Karel Zak

The original homemade solution is unnecessary, autotools provides better
infrastructure to generate files.

Signed-off-by: Karel Zak 
---
 Makefile.in  |  4 
 configure.ac |  9 +
 version.h.in | 11 +++
 version.sh   | 30 +++---
 4 files changed, 23 insertions(+), 31 deletions(-)
 create mode 100644 version.h.in

diff --git a/Makefile.in b/Makefile.in
index 0dd83ea..dad1685 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -154,10 +154,6 @@ test:
 #
 static: $(progs_static)
 
-version.h:
-   @echo "[SH] $@"
-   $(Q)bash version.sh
-
 $(libs_shared): $(libbtrfs_objects) $(lib_links) send.h
@echo "[LD] $@"
$(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(lib_LIBS) \
diff --git a/configure.ac b/configure.ac
index 7a6c264..937d50f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -3,6 +3,10 @@ AC_INIT([btrfs-progs],
[linux-btrfs@vger.kernel.org],,
[http://btrfs.wiki.kernel.org])
 
+dnl library version
+LIBBTRFS_MAJOR=0
+LIBBTRFS_MINOR=1
+LIBBTRFS_PATCHLEVEL=1
 
 AC_PREREQ([2.60])
 
@@ -74,11 +78,16 @@ AC_SUBST([LZO2_LIBS_STATIC])
 AC_SUBST([LZO2_CFLAGS])
 
 
+dnl library stuff
+AC_SUBST([LIBBTRFS_MAJOR])
+AC_SUBST([LIBBTRFS_MINOR])
+AC_SUBST([LIBBTRFS_PATCHLEVEL])
 
 AC_CONFIG_HEADERS([config.h])
 
 AC_CONFIG_FILES([
 Makefile
+version.h
 ])
 
 AC_OUTPUT
diff --git a/version.h.in b/version.h.in
new file mode 100644
index 0000000..012d265
--- /dev/null
+++ b/version.h.in
@@ -0,0 +1,11 @@
+#ifndef __LIBBTRFS_VERSION_H__
+#define __LIBBTRFS_VERSION_H__
+
+#define BTRFS_LIB_MAJOR        @LIBBTRFS_MAJOR@
+#define BTRFS_LIB_MINOR        @LIBBTRFS_MINOR@
+#define BTRFS_LIB_PATCHLEVEL   @LIBBTRFS_PATCHLEVEL@
+
+#define BTRFS_LIB_VERSION ( BTRFS_LIB_MAJOR * 10000 + \
+                            BTRFS_LIB_MINOR * 100 + \
+                            BTRFS_LIB_PATCHLEVEL )
+#endif
diff --git a/version.sh b/version.sh
index 456853c..42b47c4 100755
--- a/version.sh
+++ b/version.sh
@@ -9,9 +9,6 @@
 v="v3.17.3"
 
 opt=$1
-lib_major=0
-lib_minor=1
-lib_patchlevel=1
 
 which git &> /dev/null
 if [ $? == 0 -a -d .git ]; then
@@ -32,30 +29,9 @@ fi
 if [ "$opt" = "--configure" ]; then
# Omit the trailing newline, so that m4_esyscmd can use the result directly.
echo "$v" | tr -d '\n'
-   exit 0
+else
+   echo "$v"
 fi
 
-echo "/* NOTE: this file is autogenerated by version.sh, do not edit */" > .build-version.h
-echo "#ifndef __BUILD_VERSION" >> .build-version.h
-echo >> .build-version.h
-echo "#define __BUILD_VERSION" >> .build-version.h
-echo >> .build-version.h
-echo "#define BTRFS_LIB_MAJOR $lib_major" >> .build-version.h
-echo "#define BTRFS_LIB_MINOR $lib_minor" >> .build-version.h
-echo "#define BTRFS_LIB_PATCHLEVEL $lib_patchlevel" >> .build-version.h
-echo >> .build-version.h
-echo "#define BTRFS_LIB_VERSION ( BTRFS_LIB_MAJOR * 10000 + \\" >> .build-version.h
-echo "BTRFS_LIB_MINOR * 100 + \\" >> .build-version.h
-echo "BTRFS_LIB_PATCHLEVEL )" >> .build-version.h
-echo >> .build-version.h
-echo "#define BTRFS_BUILD_VERSION \"Btrfs $v\"" >> .build-version.h
-echo "#endif" >> .build-version.h
+exit 0
 
-diff -q version.h .build-version.h >& /dev/null
-
-if [ $? == 0 ]; then
-rm .build-version.h
-exit 0
-fi
-
-mv .build-version.h version.h
-- 
1.9.3


[PATCH 02/10] btrfs-progs: use config.h

2014-12-12 Thread Karel Zak

- the header file is generated by ./configure; the standard autotools
  way is to use -include config.h on the compiler command line rather
  than to include the file directly from code

- remove _GNU_SOURCE from code; the macro is already defined in config.h
  by the AC_USE_SYSTEM_EXTENSIONS autoconf macro

Signed-off-by: Karel Zak 
---
 Makefile.in   | 4 +++-
 btrfs-calc-size.c | 1 -
 btrfs-convert.c   | 1 -
 btrfs-corrupt-block.c | 2 +-
 btrfs-find-root.c | 2 +-
 btrfs-fragments.c | 1 -
 btrfs-image.c | 2 +-
 btrfs-list.c  | 1 -
 btrfs-map-logical.c   | 2 +-
 btrfs-select-super.c  | 2 +-
 btrfs-show-super.c| 2 +-
 btrfs-zero-log.c  | 2 +-
 btrfs.c   | 1 -
 btrfstune.c   | 2 +-
 chunk-recover.c   | 1 -
 cmds-check.c  | 2 +-
 cmds-receive.c| 1 -
 cmds-restore.c| 1 -
 cmds-send.c   | 2 --
 disk-io.c | 2 +-
 mkfs.c| 1 -
 send-test.c   | 2 --
 super-recover.c   | 1 -
 utils-lib.c   | 2 --
 utils.c   | 2 +-
 25 files changed, 14 insertions(+), 28 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index 4cae30c..0dd83ea 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -4,7 +4,9 @@ export
 CC = gcc
 LN = ln
 AR = ar
-AM_CFLAGS = -Wall -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES -fno-strict-aliasing -fPIC
+AM_CFLAGS = -include config.h -Wall \
+   -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES \
+   -fno-strict-aliasing -fPIC
 CFLAGS = -g -O1 -fno-strict-aliasing -rdynamic
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
diff --git a/btrfs-calc-size.c b/btrfs-calc-size.c
index 50c..3ec8230 100644
--- a/btrfs-calc-size.c
+++ b/btrfs-calc-size.c
@@ -17,7 +17,6 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
 #include 
 #include 
 #include 
diff --git a/btrfs-convert.c b/btrfs-convert.c
index 02c5e94..c88acc1 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -17,7 +17,6 @@
  */
 
 #define _XOPEN_SOURCE 600
-#define _GNU_SOURCE 1
 
 #include "kerncompat.h"
 
diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index af9ae4d..e993680 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs-find-root.c b/btrfs-find-root.c
index 6fa61cc..b24dddf 100644
--- a/btrfs-find-root.c
+++ b/btrfs-find-root.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs-fragments.c b/btrfs-fragments.c
index d03c2c3..ca45686 100644
--- a/btrfs-fragments.c
+++ b/btrfs-fragments.c
@@ -14,7 +14,6 @@
  * Boston, MA 021110-1307, USA.
  */
 
-#define _GNU_SOURCE
 #include 
 #include 
 #include 
diff --git a/btrfs-image.c b/btrfs-image.c
index cb17f16..1257966 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs-list.c b/btrfs-list.c
index 50edcf4..3e29cf8 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -16,7 +16,6 @@
  * Boston, MA 021110-1307, USA.
  */
 
-#define _GNU_SOURCE
 #include 
 #include 
 #include "ioctl.h"
diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 47d1104..c34484f 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs-select-super.c b/btrfs-select-super.c
index 6231d42..54ac436 100644
--- a/btrfs-select-super.c
+++ b/btrfs-select-super.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs-show-super.c b/btrfs-show-super.c
index 2b48f44..9702eb0 100644
--- a/btrfs-show-super.c
+++ b/btrfs-show-super.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs-zero-log.c b/btrfs-zero-log.c
index 4154175..411fae3 100644
--- a/btrfs-zero-log.c
+++ b/btrfs-zero-log.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/btrfs.c b/btrfs.c
index e83349c..2451885 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -14,7 +14,6 @@
  * Boston, MA 021110-1307, USA.
  */
 
-#define _GNU_SOURCE
 #include 
 #include 
 #include 
diff --git a/btrfstune.c b/btrfstune.c
index 050418a..899a721 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -17,7 +17,7 @@
  */
 
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE 1
+
 #include 
 #include 
 #include 
diff --git a/chunk-recover.c b/chunk-recover.c
index 6f43066..688a7d7 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -16,7 +16,6 @@
  * Boston, MA 021110-1307, USA.
  */
 #define _XOPEN_SOURCE 500
-#define _GNU_SOURCE
 
 #include 
 #include 
diff -

[PATCH 10/10] btrfs-progs: add --disable-documentation

2014-12-12 Thread Karel Zak

Signed-off-by: Karel Zak 
---
 Makefile.in  | 1 +
 configure.ac | 9 +
 2 files changed, 10 insertions(+)

diff --git a/Makefile.in b/Makefile.in
index bdd7683..5889224 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -5,6 +5,7 @@ CC = @CC@
 LN_S = @LN_S@
 AR = @AR@
 INSTALL = @INSTALL@
+DISABLE_DOCUMENTATION = @DISABLE_DOCUMENTATION@
 
 # Non-static compilation flags
 CFLAGS = @CFLAGS@ \
diff --git a/configure.ac b/configure.ac
index 290d022..79cb591 100644
--- a/configure.ac
+++ b/configure.ac
@@ -66,6 +66,15 @@ AS_IF([test "x$enable_backtrace" = xno], [
   AC_DEFINE([BTRFS_DISABLE_BACKTRACE], [1], [disable backtrace stuff in kerncompat.h ])
 ])
 
+
+AC_ARG_ENABLE([documentation],
+ AS_HELP_STRING([--disable-documentation], [do not build documentation]),
+  [], [enable_documentation=yes]
+)
+AS_IF([test "x$enable_documentation" = xyes], [DISABLE_DOCUMENTATION=0], [DISABLE_DOCUMENTATION=1])
+AC_SUBST([DISABLE_DOCUMENTATION])
+
+
 dnl Define _LIBS= and _CFLAGS= by pkg-config
 dnl
 dnl The default PKG_CHECK_MODULES() action-if-not-found is end the
-- 
1.9.3


[PATCH 07/10] btrfs-progs: cleanup compilation flags usage

2014-12-12 Thread Karel Zak

- define basic default CFLAGS in configure.ac, because:

   * autoconf default is -g -O2, but btrfs uses -g -O1

   * it's better to follow autoconf; standard way to modify
 CFLAGS is to call:  CFLAGS="foo bar" ./configure

- move all flags to one place in Makefile.in

- don't use AM_CFLAGS, the CFLAGS and STATIC_CFLAGS are enough

- don't mix objects and flags in $LIBS, it's more readable to
  add $(libs) to make rules

Signed-off-by: Karel Zak 
---
 Makefile.in  | 67 +++-
 configure.ac |  2 ++
 2 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index df590ab..58200ca 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -4,10 +4,27 @@ export
 CC = @CC@
 LN_S = @LN_S@
 AR = @AR@
-AM_CFLAGS = -include config.h -Wall \
-   -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES \
-   -fno-strict-aliasing -fPIC
-CFLAGS = -g -O1 -fno-strict-aliasing -rdynamic
+INSTALL = @INSTALL@
+
+# Non-static compilation flags
+CFLAGS = @CFLAGS@ \
+-include config.h -Wall \
+-D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES \
+-fno-strict-aliasing -fPIC \
+-rdynamic
+
+LDFLAGS = @LDFLAGS@
+
+LIBS = @UUID_LIBS@ @BLKID_LIBS@ @ZLIB_LIBS@ @LZO2_LIBS@ -lm -L.
+LIBBTRFS_LIBS = $(LIBS)
+
+# Static compilation flags
+STATIC_CFLAGS = $(CFLAGS) -ffunction-sections -fdata-sections
+STATIC_LDFLAGS = -static -Wl,--gc-sections
+STATIC_LIBS = @UUID_LIBS_STATIC@ @BLKID_LIBS_STATIC@ \
+ @ZLIB_LIBS_STATIC@ @LZO2_LIBS_STATIC@ -lpthread
+
+
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
@@ -25,13 +42,10 @@ libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
   extent_io.h ioctl.h ctree.h btrfsck.h version.h
 TESTS = fsck-tests.sh convert-tests.sh
 
-INSTALL = @INSTALL@
 prefix ?= @prefix@
 bindir = @bindir@
-lib_LIBS = @UUID_LIBS@ @BLKID_LIBS@ @ZLIB_LIBS@ @LZO2_LIBS@ -m -L.
 libdir ?= @libdir@
 incdir = @includedir@/btrfs
-LIBS = $(lib_LIBS) $(libs_static)
 
 ifeq ("$(origin V)", "command line")
   BUILD_VERBOSE = $(V)
@@ -69,7 +83,7 @@ INSTALLDIRS = $(patsubst %,install-%,$(SUBDIRS))
 CLEANDIRS = $(patsubst %,clean-%,$(SUBDIRS))
 
 ifeq ($(DISABLE_BACKTRACE),1)
-AM_CFLAGS += -DBTRFS_DISABLE_BACKTRACE
+CFLAGS += -DBTRFS_DISABLE_BACKTRACE
 endif
 
 ifneq ($(DISABLE_DOCUMENTATION),1)
@@ -89,10 +103,6 @@ static_objects = $(patsubst %.o, %.static.o, $(objects))
 static_cmds_objects = $(patsubst %.o, %.static.o, $(cmds_objects))
 static_libbtrfs_objects = $(patsubst %.o, %.static.o, $(libbtrfs_objects))
 
-# Define static compilation flags
-STATIC_CFLAGS = $(CFLAGS) -ffunction-sections -fdata-sections
-STATIC_LDFLAGS = -static -Wl,--gc-sections
-STATIC_LIBS = $(lib_LIBS) -lpthread
 
 libs_shared = libbtrfs.so.0.1
 libs_static = libbtrfs.a
@@ -120,7 +130,7 @@ ifdef C
 else
check = true
check_echo = true
-   AM_CFLAGS += -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2
+   CFLAGS += -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2
 endif
 
 %.o.d: %.c
@@ -128,13 +138,13 @@ endif
 
 .c.o:
@$(check_echo) "[SP] $<"
-   $(Q)$(check) $(AM_CFLAGS) $(CFLAGS) $<
+   $(Q)$(check) $(CFLAGS) $<
@echo "[CC] $@"
-   $(Q)$(CC) $(AM_CFLAGS) $(CFLAGS) -c $<
+   $(Q)$(CC) $(CFLAGS) -c $<
 
 %.static.o: %.c
@echo "[CC] $@"
-   $(Q)$(CC) $(AM_CFLAGS) $(STATIC_CFLAGS) -c $< -o $@
+   $(Q)$(CC) $(STATIC_CFLAGS) -c $< -o $@
 
 all: $(progs) $(BUILDDIRS)
 $(SUBDIRS): $(BUILDDIRS)
@@ -156,7 +166,7 @@ static: $(progs_static)
 
 $(libs_shared): $(libbtrfs_objects) $(lib_links) send.h
@echo "[LD] $@"
-   $(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(lib_LIBS) \
+   $(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(LIBBTRFS_LIBS) \
-shared -Wl,-soname,libbtrfs.so.0 -o libbtrfs.so.0.1
 
 $(libs_static): $(libbtrfs_objects)
@@ -186,12 +196,13 @@ btrfs-%.static: $(static_objects) btrfs-%.static.o $(static_libbtrfs_objects)
 
 btrfs-%: $(objects) $(libs) btrfs-%.o
@echo "[LD] $@"
-   $(Q)$(CC) $(CFLAGS) -o $@ $(objects) $@.o $(LDFLAGS) $(LIBS) $($(subst -,_,$@-libs))
+   $(Q)$(CC) $(CFLAGS) -o $@ $(objects) $@.o $(libs) \
+   $(LDFLAGS) $(LIBS) $($(subst -,_,$@-libs))
 
 btrfs: $(objects) btrfs.o help.o $(cmds_objects) $(libs)
@echo "[LD] $@"
$(Q)$(CC) $(CFLAGS) -o btrfs btrfs.o help.o $(cmds_objects) \
-   $(objects) $(LDFLAGS) $(LIBS) -lpthread
+   $(objects) $(libs) $(LDFLAGS) $(LIBS) -lpthread
 
 btrfs.static: $(static_objects) btrfs.static.o help.static.o $(static_cmds_objects) $(static_libbtrfs_objects)
@echo "[LD] $@"
@@ -201,15 +212,15 @@ btrfs.static: $(static_objects) btrfs.static.o help.static.o $(static_cmds_objec

[PATCH 03/10] btrfs-progs: use standard PACKAGE_* macros

2014-12-12 Thread Karel Zak

- use standard PACKAGE_{NAME,VERSION,STRING,URL,...} autoconf macros
  rather than homemade BTRFS_BUILD_VERSION

- don't #include version.h, now the file is necessary for library API only

Note that "btrfs version" returns "btrfs-progs " instead of
the original confusing "btrfs ".

Signed-off-by: Karel Zak 
---
 btrfs-calc-size.c | 1 -
 btrfs-corrupt-block.c | 1 -
 btrfs-debug-tree.c| 5 ++---
 btrfs-find-root.c | 1 -
 btrfs-image.c | 1 -
 btrfs-map-logical.c   | 1 -
 btrfs-select-super.c  | 3 +--
 btrfs-show-super.c| 3 +--
 btrfs-zero-log.c  | 3 +--
 btrfs.c   | 3 +--
 btrfstune.c   | 1 -
 chunk-recover.c   | 1 -
 cmds-check.c  | 3 +--
 cmds-filesystem.c | 5 ++---
 cmds-restore.c| 1 -
 library-test.c| 1 -
 mkfs.c| 9 -
 17 files changed, 13 insertions(+), 30 deletions(-)

diff --git a/btrfs-calc-size.c b/btrfs-calc-size.c
index 3ec8230..2be0d64 100644
--- a/btrfs-calc-size.c
+++ b/btrfs-calc-size.c
@@ -32,7 +32,6 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "volumes.h"
 #include "utils.h"
 
diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index e993680..ba7358d 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -30,7 +30,6 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "utils.h"
 
 #define FIELD_BUF_LEN 80
diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index e46500d..4468297 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -26,7 +26,6 @@
 #include "disk-io.h"
 #include "print-tree.h"
 #include "transaction.h"
-#include "version.h"
 #include "utils.h"
 
 static int print_usage(void)
@@ -43,7 +42,7 @@ static int print_usage(void)
 " only\n");
fprintf(stderr,
"\t-t tree_id : print only the tree with the given id\n");
-   fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
+   fprintf(stderr, "%s\n", PACKAGE_STRING);
exit(1);
 }
 
@@ -406,7 +405,7 @@ no_node:
uuidbuf[BTRFS_UUID_UNPARSED_SIZE - 1] = '\0';
uuid_unparse(info->super_copy->fsid, uuidbuf);
printf("uuid %s\n", uuidbuf);
-   printf("%s\n", BTRFS_BUILD_VERSION);
+   printf("%s\n", PACKAGE_STRING);
 close_root:
return close_ctree(root);
 }
diff --git a/btrfs-find-root.c b/btrfs-find-root.c
index b24dddf..571c86a 100644
--- a/btrfs-find-root.c
+++ b/btrfs-find-root.c
@@ -30,7 +30,6 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "volumes.h"
 #include "utils.h"
 #include "crc32c.h"
diff --git a/btrfs-image.c b/btrfs-image.c
index 1257966..74681ba 100644
--- a/btrfs-image.c
+++ b/btrfs-image.c
@@ -33,7 +33,6 @@
 #include "disk-io.h"
 #include "transaction.h"
 #include "utils.h"
-#include "version.h"
 #include "volumes.h"
 #include "extent_io.h"
 
diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index c34484f..fc4a29b 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -30,7 +30,6 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "utils.h"
 
 /* we write the mirror info to stdout unless they are dumping the data
diff --git a/btrfs-select-super.c b/btrfs-select-super.c
index 54ac436..492d38d 100644
--- a/btrfs-select-super.c
+++ b/btrfs-select-super.c
@@ -29,13 +29,12 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "utils.h"
 
 static void print_usage(void)
 {
fprintf(stderr, "usage: btrfs-select-super -s number dev\n");
-   fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
+   fprintf(stderr, "%s\n", PACKAGE_STRING);
exit(1);
 }
 
diff --git a/btrfs-show-super.c b/btrfs-show-super.c
index 9702eb0..15b9e36 100644
--- a/btrfs-show-super.c
+++ b/btrfs-show-super.c
@@ -33,7 +33,6 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "utils.h"
 #include "crc32c.h"
 
@@ -51,7 +50,7 @@ static void print_usage(void)
fprintf(stderr, "\t-a : print information of all superblocks\n");
fprintf(stderr, "\t-i  : specify which mirror to print out\n");
fprintf(stderr, "\t-F : attempt to dump superblocks with bad magic\n");
-   fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
+   fprintf(stderr, "%s\n", PACKAGE_STRING);
 }
 
 int main(int argc, char **argv)
diff --git a/btrfs-zero-log.c b/btrfs-zero-log.c
index 411fae3..0859fe2 100644
--- a/btrfs-zero-log.c
+++ b/btrfs-zero-log.c
@@ -29,14 +29,13 @@
 #include "print-tree.h"
 #include "transaction.h"
 #include "list.h"
-#include "version.h"
 #include "utils.h"
 
 static void print_usage(void) __attribute__((noreturn));
 static void print_usage(void)
 {
fprintf(stderr, "usage: btrfs-zero-log dev\n");
-   fprintf(stderr, "%s\n", BTRFS_BUILD_VERSION);
+   fprintf(stderr, "%s\n", PAC

[PATCH 08/10] btrfs-progs: clean generated files, make version.h stuff more robust

2014-12-12 Thread Karel Zak

- add rule to regenerate version.h when any relevant file changes
- add rule to clean generated files on "make clean-all"

Signed-off-by: Karel Zak 
---
 Makefile.in | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/Makefile.in b/Makefile.in
index 58200ca..df752d3 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -164,6 +164,10 @@ test:
 #
 static: $(progs_static)
 
+version.h: version.sh version.h.in configure.ac
+   @echo "[SH] $@"
+   $(Q)bash ./config.status --silent $@
+
 $(libs_shared): $(libbtrfs_objects) $(lib_links) send.h
@echo "[LD] $@"
$(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(LIBBTRFS_LIBS) \
@@ -271,14 +275,15 @@ test-build:
 manpages:
$(Q)$(MAKE) $(MAKEOPTS) -C Documentation
 
-clean-all: clean-doc clean
+
+clean-all: clean clean-doc clean-gen
 
 clean: $(CLEANDIRS)
@echo "Cleaning"
$(Q)rm -f $(progs) cscope.out *.o *.o.d \
  dir-test ioctl-test quick-test send-test library-test library-test-static \
  btrfs.static mkfs.btrfs.static \
- version.h $(check_defs) \
+ $(check_defs) \
  $(libs) $(lib_links) \
  $(progs_static) $(progs_extra)
 
@@ -286,6 +291,11 @@ clean-doc:
@echo "Cleaning Documentation"
$(Q)$(MAKE) $(MAKEOPTS) -C Documentation clean
 
+clean-gen:
+   @echo "Cleaning Generated Files"
+   $(Q)rm -f version.h config.status config.cache config.log \
+   configure.lineno config.status.lineno Makefile
+
 $(CLEANDIRS):
@echo "Cleaning $(patsubst clean-%,%,$@)"
$(Q)$(MAKE) $(MAKEOPTS) -C $(patsubst clean-%,%,$@) clean
-- 
1.9.3


[PATCH 01/10] btrfs-progs: add ./configure script

2014-12-12 Thread Karel Zak

Add ./autogen.sh script, you have to use it after "git clone/clean" to
generate ./configure from configure.ac.

Modify version.sh to be usable from the configure script.

The patch also renames Makefile to Makefile.in, but does NOT change
anything in the file.

Signed-off-by: Karel Zak 
---
 .gitignore   |  28 ++
 Makefile |  87 +
 Makefile.in  | 314 +++
 autogen.sh   |  58 +++
 configure.ac | 102 +++
 version.sh   |   7 ++
 6 files changed, 557 insertions(+), 39 deletions(-)
 create mode 100644 Makefile.in
 create mode 100755 autogen.sh
 create mode 100644 configure.ac
 mode change 100644 => 100755 version.sh

diff --git a/.gitignore b/.gitignore
index e637b17..beddedb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -39,3 +39,31 @@ libbtrfs.so.0
 libbtrfs.so.0.1
 library-test
 library-test-static
+
+
+aclocal.m4
+autom4te.cache
+compile
+config.cache
+config.guess
+config.h
+config.h.in
+config.log
+config.rpath
+config.status
+config.sub
+config/ltmain.sh
+config/py-compile
+config/test-driver
+configure
+cscope.out
+depcomp
+install-sh
+libtool
+m4/*.m4
+Makefile
+missing
+mkinstalldirs
+stamp-h
+stamp-h.in
+stamp-h1
diff --git a/Makefile b/Makefile
index 4cae30c..95700be 100644
--- a/Makefile
+++ b/Makefile
@@ -1,11 +1,30 @@
 # Export all variables to sub-makes by default
 export
 
-CC = gcc
-LN = ln
-AR = ar
-AM_CFLAGS = -Wall -D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES -fno-strict-aliasing -fPIC
-CFLAGS = -g -O1 -fno-strict-aliasing -rdynamic
+CC = gcc -std=gnu99
+LN_S = ln -s
+AR = /usr/bin/ar
+INSTALL = /usr/bin/install -c
+
+# Non-static compilation flags
+CFLAGS = -g -O1 \
+-include config.h -Wall \
+-D_FILE_OFFSET_BITS=64 -DBTRFS_FLAT_INCLUDES \
+-fno-strict-aliasing -fPIC \
+-rdynamic
+
+LDFLAGS = 
+
+LIBS = -luuid  -lblkid  -L/usr/lib64 -lz  -lzo2 -lm -L.
+LIBBTRFS_LIBS = $(LIBS)
+
+# Static compilation flags
+STATIC_CFLAGS = $(CFLAGS) -ffunction-sections -fdata-sections
+STATIC_LDFLAGS = -static -Wl,--gc-sections
+STATIC_LIBS = @UUID_LIBS_STATIC@ @BLKID_LIBS_STATIC@ \
+ @ZLIB_LIBS_STATIC@ @LZO2_LIBS_STATIC@ -lpthread
+
+
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
@@ -23,13 +42,10 @@ libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
   extent_io.h ioctl.h ctree.h btrfsck.h version.h
 TESTS = fsck-tests.sh convert-tests.sh
 
-INSTALL = install
-prefix ?= /usr/local
-bindir = $(prefix)/bin
-lib_LIBS = -luuid -lblkid -lm -lz -llzo2 -L.
-libdir ?= $(prefix)/lib
-incdir = $(prefix)/include/btrfs
-LIBS = $(lib_LIBS) $(libs_static)
+prefix ?= /usr
+bindir = ${exec_prefix}/bin
+libdir ?= ${exec_prefix}/lib
+incdir = ${prefix}/include/btrfs
 
 ifeq ("$(origin V)", "command line")
   BUILD_VERBOSE = $(V)
@@ -67,7 +83,7 @@ INSTALLDIRS = $(patsubst %,install-%,$(SUBDIRS))
 CLEANDIRS = $(patsubst %,clean-%,$(SUBDIRS))
 
 ifeq ($(DISABLE_BACKTRACE),1)
-AM_CFLAGS += -DBTRFS_DISABLE_BACKTRACE
+CFLAGS += -DBTRFS_DISABLE_BACKTRACE
 endif
 
 ifneq ($(DISABLE_DOCUMENTATION),1)
@@ -87,10 +103,6 @@ static_objects = $(patsubst %.o, %.static.o, $(objects))
 static_cmds_objects = $(patsubst %.o, %.static.o, $(cmds_objects))
 static_libbtrfs_objects = $(patsubst %.o, %.static.o, $(libbtrfs_objects))
 
-# Define static compilation flags
-STATIC_CFLAGS = $(CFLAGS) -ffunction-sections -fdata-sections
-STATIC_LDFLAGS = -static -Wl,--gc-sections
-STATIC_LIBS = $(lib_LIBS) -lpthread
 
 libs_shared = libbtrfs.so.0.1
 libs_static = libbtrfs.a
@@ -118,7 +130,7 @@ ifdef C
 else
check = true
check_echo = true
-   AM_CFLAGS += -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2
+   CFLAGS += -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2
 endif
 
 %.o.d: %.c
@@ -126,13 +138,13 @@ endif
 
 .c.o:
@$(check_echo) "[SP] $<"
-   $(Q)$(check) $(AM_CFLAGS) $(CFLAGS) $<
+   $(Q)$(check) $(CFLAGS) $<
@echo "[CC] $@"
-   $(Q)$(CC) $(AM_CFLAGS) $(CFLAGS) -c $<
+   $(Q)$(CC) $(CFLAGS) -c $<
 
 %.static.o: %.c
@echo "[CC] $@"
-   $(Q)$(CC) $(AM_CFLAGS) $(STATIC_CFLAGS) -c $< -o $@
+   $(Q)$(CC) $(STATIC_CFLAGS) -c $< -o $@
 
 all: $(progs) $(BUILDDIRS)
 $(SUBDIRS): $(BUILDDIRS)
@@ -152,13 +164,9 @@ test:
 #
 static: $(progs_static)
 
-version.h:
-   @echo "[SH] $@"
-   $(Q)bash version.sh
-
 $(libs_shared): $(libbtrfs_objects) $(lib_links) send.h
@echo "[LD] $@"
-   $(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(lib_LIBS) \
+   $(Q)$(CC) $(CFLAGS) $(libbtrfs_objects) $(LDFLAGS) $(LIBBTRFS_LIBS) \
-shared -Wl,-soname,libbtrfs.so.0 -o libbtrfs.so.0.1
 
 $(libs_static): $(libbtrfs_objects)
@@ -167,8 +175,8 @@ $(libs_static): $(libbtrfs_objects)
 
 $(lib_links):

Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread Hugo Mills

On Fri, Dec 12, 2014 at 03:16:03AM -0800, Robert White wrote:
> On 12/12/2014 01:06 AM, David Taylor wrote:
> >The above quote is discussing two device RAID5, you are discussing
> >three device RAID5.
> 
> Heresy! (yes, some humor is required here.)
> 
> There is no such thing as a "two device RAID5". That's what RAID1 is for.
> 
> Saying "The above quote is discussing a two device RAID5" is exactly
> like saying "The above quote is discussing a two wheeled tricycle".
> 
> You might as well be talking about three-octet IP addresses. That is
> you could make a network address out of three octets, but it
> wouldn't be an IP address. It would be something else with the
> wrong name attached.

   OK. Sounds like I need to dust off the change-of-nomenclature patch
again.

   The argument here is about the 1c1s1p configuration. Is there a
problem with that?

   Hugo.

> I challenge you... nay I _defy_ you... to find a single authority on
> disk storage anywhere on this planet (except, apparently, this list
> and its directly attached people and materials) that discusses,
> describes, or acknowledges the existence of a "two device RAID5"
> while not discussing a system with an arity of 3 degraded by the
> absence of one media.
> 
> All these words have standardized definitions.
> 
> [That's not hyperbole. I searched for several hours and could not
> find _any_ reference anywhere to construction of a RAID5 array using
> only two devices that did not involve arity-3 and a
> dummy/missing/failed pseudo target. So if you can find any reference
> to doing this _anywhere_ outside of BTRFS I'd like to see it.
> Genuinely.]
> 
> THAT SAID...
> 
> I really can find no reason the math wouldn't work using only two
> drives. It would be a terrific waste of CPU cycles and storage space
> to construct the stripe buffers and do the XORs instead of just
> copying the data, but the math would work.
> 
> So, um, "well I'll be damned".
> 
> Perhaps it's just a tautological belief that someone here didn't buy
> into. Like how people keep partitioning drives into little slices
> for things because that's the preserved wisdom from the early eighties.
> 
> I think constructing a non-degraded-mode two device thing and
> calling it RAID5 will surprise virtually _everyone_ on the planet.
> 
> In every other system. And I do mean _every_ other system, if I had
> two media and I put them under RAID-5 I'd be required to specify the
> third drive as some sort of failed device (the block device equivalent
> of /dev/null but that returns error results for all operations
> instead of successes.) See the reserved keyword "missing" in the
> mdadm documentation etc.
> 
> That is, If I put two 1TiB disks into a RAID-5 I'd expect to get a
> 2TiB array with no actual redundancy. As in
> 
> mdadm --create md0 --level=r5 --raid-devices=3 /dev/sda missing /dev/sdc
> 
> the resulting array would be the same effective size as a stripe of
> the two drives, but when the third was added later it would just
> slot in as a replacement for the missing device and the arity-3
> thing would "reestablish" its redundancy. (this is actually what
> mdadm does internally with a normal build, it blesses the first N-1
> drives into an array with a missing member, and adds the Nth drive
> as a "spare" and then the spare is immediately adopted as a
> replacement for the "missing" drive.)
> 
> The parity computation on a single value is just a nutty waste of time
> though. "Backing it out" when the array is degraded is double-nuts.
> 
> Maybe everybody just decided it was too crazy to consider for the
> CPU time penalty...?
> 
> So yea, semantics... apparently...

-- 
Hugo Mills | There's an infinite number of monkeys outside who
hugo@... carfax.org.uk | want to talk to us about this new script for Hamlet
http://carfax.org.uk/  | they've worked out!
PGP: 65E74AC0  |   Arthur Dent



Re: A note on spotting "bugs" [Was: ENOSPC after conversion]

2014-12-12 Thread Robert White


On 12/11/2014 10:42 PM, Patrik Lundquist wrote:

On 11 December 2014 at 23:00, Robert White  wrote:

On 12/11/2014 12:18 AM, Patrik Lundquist wrote:


* Full balance, that ended with "98 enospc errors during balance."


Assuming that quote is an actual quote from the output of the balance...


It is, from dmesg.



"Bugs" are unexpected things that cause failures and/or damage.


Not all errors are as pretty as

BTRFS info (device sdc1): relocating block group 1756675178496 flags 1
BTRFS error (device sdc1): allocation failed flags 1, wanted 1272844288
BTRFS: space_info 1 has 13703077888 free, is not full
BTRFS: space_info total=1504312295424, used=1487622750208, pinned=0,
reserved=2986196992, may_use=1308749824, readonly=270336

some are

BTRFS info (device sdc1): relocating block group 1780297498624 flags 1
[ cut here ]
WARNING: CPU: 2 PID: 11094 at
/build/linux-Y9HjRe/linux-3.16.7/fs/btrfs/extent-tree.c:7280
btrfs_alloc_free_block+0x219/0x450 [btrfs]()
BTRFS: block rsv returned -28
Modules linked in: nfsd auth_rpcgss oid_registry nfs_acl nfs lockd
fscache sunrpc btrfs xor nls_utf8 nls_cp437 vfat fat kvm_intel
raid6_pq kvm crc32_pclmul jc42 coretemp ghash_clmulni_intel iTCO_wdt
ipmi_watchdog iTCO_vendor_support aesni_intel joydev aes_x86_64
efi_pstore lrw gf128mul evdev glue_helper ast ablk_helper lpc_ich
cryptd ttm pcspkr efivars mfd_core i2c_i801 drm_kms_helper drm tpm_tis
tpm acpi_cpufreq i2c_ismt shpchp button processor thermal_sys ipmi_si
ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 mbcache
jbd2 sg sd_mod crc_t10dif crct10dif_generic hid_generic usbhid hid
ahci libahci crct10dif_pclmul crct10dif_common crc32c_intel igb libata
ehci_pci i2c_algo_bit xhci_hcd ehci_hcd i2c_core dca scsi_mod ptp
usbcore pps_core usb_common
CPU: 2 PID: 11094 Comm: btrfs Tainted: GW 3.16.0-4-amd64
#1 Debian 3.16.7-2
Hardware name: Supermicro A1SAi/A1SAi, BIOS 1.0c 02/27/2014
  0009 81506b43 88032779f780 81065717
  88032d68a640 88032779f7d0 1000 8803117df480
   8106577c a0536338 0020
Call Trace:
  [] ? dump_stack+0x41/0x51
  [] ? warn_slowpath_common+0x77/0x90
  [] ? warn_slowpath_fmt+0x4c/0x50
  [] ? btrfs_alloc_free_block+0x219/0x450 [btrfs]
  [] ? free_hot_cold_page_list+0x46/0x90
  [] ? read_extent_buffer+0xc8/0x120 [btrfs]
  [] ? btrfs_copy_root+0x101/0x2e0 [btrfs]
  [] ? create_reloc_root+0x201/0x2d0 [btrfs]
  [] ? btrfs_init_reloc_root+0x98/0xb0 [btrfs]
  [] ? record_root_in_trans+0xa4/0xf0 [btrfs]
  [] ? btrfs_record_root_in_trans+0x3f/0x70 [btrfs]
  [] ? start_transaction+0x90/0x560 [btrfs]
  [] ? btrfs_evict_inode+0x33a/0x4d0 [btrfs]
  [] ? evict+0xac/0x170
  [] ? btrfs_run_delayed_iputs+0xd2/0xf0 [btrfs]
  [] ? btrfs_commit_transaction+0x922/0x9c0 [btrfs]
  [] ? start_transaction+0x90/0x560 [btrfs]
  [] ? prepare_to_relocate+0xf4/0x1b0 [btrfs]
  [] ? relocate_block_group+0x42/0x670 [btrfs]
  [] ? btrfs_relocate_block_group+0x1c7/0x2d0 [btrfs]
  [] ? btrfs_relocate_chunk.isra.27+0x62/0x700 [btrfs]
  [] ? btrfs_set_path_blocking+0x31/0x70 [btrfs]
  [] ? btrfs_search_slot+0x4ad/0xad0 [btrfs]
  [] ? btrfs_get_token_64+0x55/0xf0 [btrfs]
  [] ? btrfs_balance+0x82b/0xe80 [btrfs]
  [] ? btrfs_ioctl_balance+0x154/0x500 [btrfs]
  [] ? btrfs_ioctl+0x58c/0x2b10 [btrfs]
  [] ? handle_mm_fault+0xa91/0x11a0
  [] ? __do_page_fault+0x1d1/0x4e0
  [] ? vma_link+0xb1/0xc0
  [] ? do_vfs_ioctl+0x2cf/0x4b0
  [] ? SyS_ioctl+0x81/0xa0
  [] ? page_fault+0x28/0x30
  [] ? system_call_fast_compare_end+0x10/0x15
---[ end trace 880987d36ae50245 ]---
BTRFS error (device sdc1): allocation failed flags 1, wanted 2013265920
BTRFS: space_info 1 has 8384299008 free, is not full
BTRFS: space_info total=1500017328128, used=1491533037568, pinned=0,
reserved=99807232, may_use=2147475456, readonly=184320



Interesting but only fractionally so.

The function btrfs_alloc_free_block() has disappeared from the kernel 
sources in Linus' git tree for the kernel. It used to be in 
linux/fs/btrfs/extent-tree.c ... direct allocation seems to have been 
replaced by a reservation system.


This still doesn't say _anything_ is wrong with your filesystem except 
that it doesn't have enough _raw_ space to create a 2-ish gig extent.
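
For what it's worth, the numbers in the quoted log can be checked by hand. The sketch below is only arithmetic on the values copied from the second "space_info" line above. The formula (total minus used, pinned, reserved and readonly) is inferred from the fact that it reproduces the logged "free" figure for both reports; it is an assumption, not something taken from the kernel source, and the function name is made up for illustration.

```python
GIB = 1024 ** 3


def space_info_free(total, used, pinned, reserved, readonly):
    """Free bytes as the quoted log lines appear to compute them.

    Inferred from the logged numbers only; may_use is not subtracted.
    """
    return total - used - pinned - reserved - readonly


# Values copied from the second "space_info" report quoted above.
SECOND_REPORT = dict(total=1500017328128, used=1491533037568, pinned=0,
                     reserved=99807232, readonly=184320)

# From "allocation failed flags 1, wanted 2013265920".
WANTED = 2013265920

if __name__ == "__main__":
    free = space_info_free(**SECOND_REPORT)
    print(f"free   = {free} bytes ({free / GIB:.2f} GiB)")
    print(f"wanted = {WANTED} bytes ({WANTED / GIB:.3f} GiB)")
```

The "wanted" value works out to exactly 1.875 GiB, which is the "2-ish gig" request mentioned above.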


To produce that backtrace as a _WARNING_ (check out the first line) the 
programmer explicitly had to call the function that generates that 
backtrace. That is, it's not an "oops" or other _unforeseen_ critical 
path failure.


So while it's still just a harmless out-of-space condition in terms of 
balance, and it's got nothing to do with being "out of space" at the 
functional level, some work is being done on the way the handling is 
taking place.


Particularly, there was some code that explicitly called WARN() or 
BUG_ON() while it was processing that out of raw space condition. This 
is a normal-ish thing for code to do when the programmer is like "hey, 
I'd like to see what the state ac

Re: Balance & scrub & defrag

2014-12-12 Thread Robert White


On 12/12/2014 01:17 AM, Erkki Seppala wrote:

Robert White  writes:


You need to buy better disks. 8-)


Where can one buy these better disks with reasonable prices?-) Disks are
best thought of as consumables.


A good disk is only about 9% more expensive. So like the WD "green" 
disks were all cheap because they were (essentially) the disks that 
didn't pass the full quality suite for the higher WD lines like "caviar".


"Inexpensive" and "Cheap" are not the same thing.

Disks are not best thought of as consumables unless the data you store 
on them is discardable.



Do you alternatively execute SMART self tests?


Indeed. If you install and activate SMART but you never run the tests 
you've done another one of those half-measures I was talking about.


The "long offline" test reads 100% of the disk surface (well, up until 
it hits an error anyway). But since none of that data has to leave the 
disk controller and go out through the interface etc it doesn't bog the 
rest of the system.


All but the oldest or cheapest drives have controllers that will "resume 
the offline test after any command" so you do


smartctl --test=long /dev/sda # or whatever

every few days and you'll know when things start to get dicey.

The one thing you do have to be watchful of is that the tests _stop_ 
when they hit the first read error, so you do have to keep up with things.


For instance I just had a pair of uncorrectable read errors. When I used 
hdparm to write the sectors, however, the disk didn't need to relocate 
the block(s) as bad. So it was some funky event on the disk itself.


Of course it's a very old disk (1525 days of power-on runtime) so two 
correctable-with-overwrite read errors isn't bad.




Btrfs progs pre-release 3.18-rc1

2014-12-12 Thread David Sterba

Hi,

the 3.18-rc1 has been tagged. There are changes that were pending for too long
(the 'fi usage' and 'dev usage' commands) and I've pulled a lot of changes to
fsck/recovery tools. There are missing pieces of documentation that will be
added in the next round, the point is to release the code first and make it
available for testing.

The ETA for release is ~1 week if everything goes fine. I'm accepting mainly
bugfixes, tests and documentation updates, possibly small and unintrusive
patches if reasonable or reviewed.  New rcs will be tagged when enough changes
accumulate.

I can upload the pre-release tarballs if desired.

Re: A note on spotting "bugs" [Was: ENOSPC after conversion]

2014-12-12 Thread Patrik Lundquist

On 12 December 2014 at 14:29, Robert White  wrote:
>
> You yourself even found the annotation in the wiki that said you should have
> e4defragged the system before conversion.

There's no mention of e4defrag on the Btrfs wiki, it says to btrfs
defrag before balance to avoid ENOSPC, as the last step of conversion.

> What you are experiencing is a little vexing, but it's not a bug. It's not
> even a "huge problem". And if you'd stop banging your head against it it
> wouldn't be any sort of problem at all. Neither of us can change these
> facts.

I stopped banging my head several emails ago. I understand the problem
and I will start over.

> I feel your pain man, but that's about it.

I'm in no pain, it has been interesting. No data loss. No hurry.

> What more can I do?

The conversion wiki is lacking. It would be great if someone (maybe
you?) could expand upon the drawbacks of conversion.

> What is it that you want?

Nothing more.

Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!

2014-12-12 Thread Tomasz Chmielewski


FYI, still seeing this with 3.18 (scrub passes fine on this filesystem).

# time btrfs balance start /mnt/lxc2
Segmentation fault

real    322m32.153s
user    0m0.000s
sys     16m0.930s


[20182.461873] BTRFS info (device sdd1): relocating block group 6915027369984 flags 17

[20194.050641] BTRFS info (device sdd1): found 4819 extents
[20286.243576] BTRFS info (device sdd1): found 4819 extents
[20287.143471] BTRFS info (device sdd1): relocating block group 6468350771200 flags 17

[20295.756934] BTRFS info (device sdd1): found 3613 extents
[20306.981773] BTRFS (device sdd1): parent transid verify failed on 5568935395328 wanted 70315 found 102416
[20306.983962] BTRFS (device sdd1): parent transid verify failed on 5568935395328 wanted 70315 found 102416
[20307.029841] BTRFS (device sdd1): parent transid verify failed on 5568935395328 wanted 70315 found 102416

[20307.030037] [ cut here ]
[20307.030083] kernel BUG at fs/btrfs/relocation.c:242!
[20307.030130] invalid opcode:  [#1] SMP
[20307.030175] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 
iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack ip_tables x_tables cpufreq_ondemand cpufreq_conservative 
cpufreq_powersave cpufreq_stats nfsd auth_rpcgss oid_registry exportfs 
nfs_acl nfs lockd grace fscache sunrpc ipv6 btrfs xor raid6_pq 
zlib_deflate coretemp hwmon loop pcspkr i2c_i801 i2c_core lpc_ich 
mfd_core 8250_fintek battery parport_pc parport tpm_infineon tpm_tis tpm 
ehci_pci ehci_hcd video button acpi_cpufreq ext4 crc16 jbd2 mbcache 
raid1 sg sd_mod r8169 mii ahci libahci libata scsi_mod

[20307.030587] CPU: 3 PID: 4218 Comm: btrfs Not tainted 3.18.0 #1
[20307.030634] Hardware name: System manufacturer System Product 
Name/P8H77-M PRO, BIOS 1101 02/04/2013
[20307.030724] task: 8807f2cac830 ti: 8807e9198000 task.ti: 
8807e9198000
[20307.030811] RIP: 0010:[]  [] 
relocate_block_group+0x432/0x4de [btrfs]

[20307.030914] RSP: 0018:8807e919bb18  EFLAGS: 00010202
[20307.030960] RAX: 8805f06c40f8 RBX: 8805f06c4000 RCX: 
00018023
[20307.031008] RDX: 8805f06c40d8 RSI: 8805f06c40e8 RDI: 
8807ff403900
[20307.031056] RBP: 8807e919bb88 R08: 0001 R09: 

[20307.031105] R10: 0003 R11: a02e43a6 R12: 
8807e637f090
[20307.031153] R13: 8805f06c4108 R14: fff4 R15: 
8805f06c4020
[20307.031201] FS:  7f1bdb4ba880() GS:88081fac() 
knlGS:

[20307.031289] CS:  0010 DS:  ES:  CR0: 80050033
[20307.031336] CR2: 7f5672e18070 CR3: 0007e99cc000 CR4: 
001407e0

[20307.031384] Stack:
[20307.031426]  ea0016296680 8805f06c40e8 ea0016296380 

[20307.031515]  ea0016296400 00ffea0016296440 a805e22b2a30 
1000
[20307.031604]  8804d86963f0 8805f06c4000  
8807f2d785a8

[20307.031693] Call Trace:
[20307.031743]  [] 
btrfs_relocate_block_group+0x158/0x278 [btrfs]
[20307.031838]  [] 
btrfs_relocate_chunk.isra.70+0x35/0xa5 [btrfs]

[20307.031931]  [] btrfs_balance+0xa66/0xc6b [btrfs]
[20307.031981]  [] ? 
__alloc_pages_nodemask+0x137/0x702
[20307.032036]  [] btrfs_ioctl_balance+0x220/0x29f 
[btrfs]

[20307.032089]  [] btrfs_ioctl+0x1134/0x22f6 [btrfs]
[20307.032138]  [] ? handle_mm_fault+0x44d/0xa00
[20307.032186]  [] ? avc_has_perm+0x2e/0xf7
[20307.032234]  [] ? __vm_enough_memory+0x25/0x13c
[20307.032282]  [] do_vfs_ioctl+0x3f2/0x43c
[20307.032329]  [] SyS_ioctl+0x4e/0x7d
[20307.032376]  [] ? do_page_fault+0xc/0x11
[20307.032424]  [] system_call_fastpath+0x12/0x17
[20307.032488] Code: 00 00 00 48 39 83 f8 00 00 00 74 02 0f 0b 4c 39 ab 
08 01 00 00 74 02 0f 0b 48 83 7b 20 00 74 02 0f 0b 83 bb 20 01 00 00 00 
74 02 <0f> 0b 83 bb 24 01 00 00 00 74 02 0f 0b 48 8b 73 18 48 8b 7b 08
[20307.032660] RIP  [] 
relocate_block_group+0x432/0x4de [btrfs]

[20307.032754]  RSP 
[20307.033068] ---[ end trace 18be77360e49d59d ]---



On 2014-11-25 23:33, Tomasz Chmielewski wrote:

I'm still seeing this when running balance with 3.18-rc6:

[95334.066898] BTRFS info (device sdd1): relocating block group
6468350771200 flags 17
[95344.384279] BTRFS info (device sdd1): found 5371 extents
[95373.555640] BTRFS (device sdd1): parent transid verify failed on
5568935395328 wanted 70315 found 89269
[95373.574208] BTRFS (device sdd1): parent transid verify failed on
5568935395328 wanted 70315 found 89269
[95373.574483] [ cut here ]
[95373.574542] kernel BUG at fs/btrfs/relocation.c:242!
[95373.574601] invalid opcode:  [#1] SMP
[95373.574661] Modules linked in: ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables cpufreq_ondemand
cpufreq_conservative cpufreq_powersave cpufreq_stats nfsd auth_rpcgss
oid_registry exportfs nfs_acl nfs lockd grace fscache sunrpc ipv6
btrfs xor raid6_pq zlib_deflat

Re: [PATCH v4 00/13] btrfs-progs:fsck: Add inode nlink mismatch and

2014-12-12 Thread David Sterba

On Fri, Dec 12, 2014 at 08:34:09AM +, Filipe David Manana wrote:
> Very simple solution.
> 
> Do:
> 
> 1) Create an empty file;
> 2) Use it as the backing file for a loop device;
> 3) Run mkfs.btrfs against the loop device;
> 4) Mount it;
> 5) Populate the fs;
> 6) Umount it;
> 7) Corrupt some nodes or leafs (by zeroing them out for e.g.);
> 8) Create a tarball from the backing file like this: XZ_OPT=-9 tar
> cJSvf foobar.tar.xz run.sh backing_file
> 9) Add the tarball to the fsck-tests directory;
> 10) Make the test run fsck against the backing file extracted from the
> tarball - fsck can operate against regular files, and not only against
> devices.

I made a few scripts that help to automate most of the steps (no
populating or fuzzing), attached.

#!/bin/sh -x

SIZE=64m
MKFS_OPTIONS=
MOUNT_OPTIONS=

mkdir -p mnt
truncate -s0 test.img
truncate -s$SIZE test.img
chmod a+w test.img
LODEV=$(sudo losetup -f --show test.img)
if [ $? -ne 0 ]; then
	echo "Cannot create loop device"
	exit 1
fi

mkfs.btrfs -f $MKFS_OPTIONS $LODEV
sudo mount ${MOUNT_OPTIONS:+-o "$MOUNT_OPTIONS"} $LODEV mnt || { echo "Cannot mount image"; exit 1; }

sudo touch mnt/.btrfs-test-image

echo "Mounted test.img as $LODEV into mnt/"
echo "Populate the image and run the next phase"

#!/bin/sh

LODEV=$(losetup -j test.img -O NAME | tail -n +2)
# $? only reflects tail in the pipeline above, so test for an empty result
if [ -z "$LODEV" ]; then
	echo "Cannot detect loop devices for test.img"
	exit 1
fi
echo $LODEV

for dev in $LODEV; do
	sudo umount $dev
	sudo losetup -d $dev
done

echo "Unmounted and all loop devices detached"
echo "Corrupt the image now."

#!/bin/sh -x

if ! [ -f test.img ]; then
	echo "No image found"
	exit 1
fi

# xz takes its default options from XZ_OPT
XZ_OPT=-9 tar cJSvf image.tar.xz --owner=root --group=root test.img

echo "Created image.tar.xz from test.img"
echo "Link it to the test dir, run 'make test'"
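
The scripts above leave the corruption step ("zeroing them out") to be done by hand. A minimal sketch of that one step in Python follows, assuming the byte offset of the node or leaf inside test.img is already known; finding that offset is not shown and would need the btrfs tools. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
import os


def zero_range(path, offset, length):
    """Overwrite `length` bytes at `offset` in an existing file with zeros.

    The file size is left unchanged; ranges outside the file are rejected.
    """
    size = os.path.getsize(path)
    if offset < 0 or length < 0 or offset + length > size:
        raise ValueError("range falls outside the image")
    with open(path, "r+b") as img:
        img.seek(offset)
        img.write(b"\0" * length)
```

A call would look like `zero_range("test.img", node_offset, nodesize)`, where both values have to come from inspecting the unmounted filesystem first.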

btrfs fi df output "unknown"

2014-12-12 Thread sys.syphus

why would there be "unknown" data below? i have 2 btrfs arrays and
both have this going on. neither are active. any idea why and how to
make it go away?


Btrfs v3.12
btrfs fi df /media/btrfs
Data, RAID1: total=314.00GiB, used=313.55GiB
System, RAID1: total=32.00MiB, used=64.00KiB
Metadata, RAID1: total=1.00GiB, used=422.45MiB
unknown, single: total=144.00MiB, used=0.00

Re: btrfs fi df output "unknown"

2014-12-12 Thread Hugo Mills

On Fri, Dec 12, 2014 at 09:38:10AM -0600, sys.syphus wrote:
> why would there be "unknown" data below? i have 2 btrfs arrays and
> both have this going on. neither are active. any idea why and how to
> make it go away?
> 
> 
> Btrfs v3.12
  ^^^
  This is what's "wrong". Upgrade your userspace tools to (IIRC) 3.16
or later, and it'll report it as "BlockReserve", which is what it is.

   The kernel started reporting a new chunk type to account for the
block reserve, and your tools aren't new enough to know about it. For
now, just ignore it -- it's harmless.

   Hugo.

> btrfs fi df /media/btrfs
> Data, RAID1: total=314.00GiB, used=313.55GiB
> System, RAID1: total=32.00MiB, used=64.00KiB
> Metadata, RAID1: total=1.00GiB, used=422.45MiB
> unknown, single: total=144.00MiB, used=0.00

-- 
Hugo Mills | I am but mad north-north-west: when the wind is
hugo@... carfax.org.uk | southerly, I know a hawk from a handsaw.
http://carfax.org.uk/  |
PGP: 65E74AC0  | Hamlet, Prince of Denmark



Re: Announcements for btrfs-progs?

2014-12-12 Thread Eric Sandeen

On 12/11/14 6:37 AM, Holger Hoffstätte wrote:
> 
> David,
> 
> I was wondering if you could please send out announcements for btrfs-progs
> when you tag a release or -rc? There doesn't seem to be a good mechanism
> to track releases and IMHO the more people are notified, the more
> testing we can get, not to mention faster propagation into distros.

Not that it matters for non-Fedora users, but FWIW Fedora has release monitoring
scripts which notify me when a new tarball appears.

So Fedora, at least, gets a prompt update even if there's no formal announcement
(assuming I'm not behind the curve).  ;)

Still, announcements would be great, esp. if they have a summary of what's
contained in the update.

-Eric


[PATCH] btrfs: kill btrfs_inode_*time helpers

2014-12-12 Thread David Sterba

They just opencode taking address of the timespec member.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h | 25 -
 fs/btrfs/delayed-inode.c | 28 
 fs/btrfs/inode.c | 28 
 fs/btrfs/send.c  |  9 +++--
 fs/btrfs/tree-log.c  | 12 ++--
 5 files changed, 33 insertions(+), 69 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7e607416755a..937a19e06793 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2464,31 +2464,6 @@ BTRFS_SETGET_STACK_FUNCS(stack_inode_gid, struct btrfs_inode_item, gid, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_mode, struct btrfs_inode_item, mode, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_rdev, struct btrfs_inode_item, rdev, 64);
 BTRFS_SETGET_STACK_FUNCS(stack_inode_flags, struct btrfs_inode_item, flags, 64);
-
-static inline struct btrfs_timespec *
-btrfs_inode_atime(struct btrfs_inode_item *inode_item)
-{
-   unsigned long ptr = (unsigned long)inode_item;
-   ptr += offsetof(struct btrfs_inode_item, atime);
-   return (struct btrfs_timespec *)ptr;
-}
-
-static inline struct btrfs_timespec *
-btrfs_inode_mtime(struct btrfs_inode_item *inode_item)
-{
-   unsigned long ptr = (unsigned long)inode_item;
-   ptr += offsetof(struct btrfs_inode_item, mtime);
-   return (struct btrfs_timespec *)ptr;
-}
-
-static inline struct btrfs_timespec *
-btrfs_inode_ctime(struct btrfs_inode_item *inode_item)
-{
-   unsigned long ptr = (unsigned long)inode_item;
-   ptr += offsetof(struct btrfs_inode_item, ctime);
-   return (struct btrfs_timespec *)ptr;
-}
-
 BTRFS_SETGET_FUNCS(timespec_sec, struct btrfs_timespec, sec, 64);
 BTRFS_SETGET_FUNCS(timespec_nsec, struct btrfs_timespec, nsec, 32);
 BTRFS_SETGET_STACK_FUNCS(stack_timespec_sec, struct btrfs_timespec, sec, 64);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 054577bddaf2..00f7d08b04c7 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1755,19 +1755,19 @@ static void fill_stack_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_stack_inode_flags(inode_item, BTRFS_I(inode)->flags);
btrfs_set_stack_inode_block_group(inode_item, 0);
 
-   btrfs_set_stack_timespec_sec(btrfs_inode_atime(inode_item),
+   btrfs_set_stack_timespec_sec(&inode_item->atime,
 inode->i_atime.tv_sec);
-   btrfs_set_stack_timespec_nsec(btrfs_inode_atime(inode_item),
+   btrfs_set_stack_timespec_nsec(&inode_item->atime,
  inode->i_atime.tv_nsec);
 
-   btrfs_set_stack_timespec_sec(btrfs_inode_mtime(inode_item),
+   btrfs_set_stack_timespec_sec(&inode_item->mtime,
 inode->i_mtime.tv_sec);
-   btrfs_set_stack_timespec_nsec(btrfs_inode_mtime(inode_item),
+   btrfs_set_stack_timespec_nsec(&inode_item->mtime,
  inode->i_mtime.tv_nsec);
 
-   btrfs_set_stack_timespec_sec(btrfs_inode_ctime(inode_item),
+   btrfs_set_stack_timespec_sec(&inode_item->ctime,
 inode->i_ctime.tv_sec);
-   btrfs_set_stack_timespec_nsec(btrfs_inode_ctime(inode_item),
+   btrfs_set_stack_timespec_nsec(&inode_item->ctime,
  inode->i_ctime.tv_nsec);
 }
 
@@ -1775,7 +1775,6 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 {
struct btrfs_delayed_node *delayed_node;
struct btrfs_inode_item *inode_item;
-   struct btrfs_timespec *tspec;
 
delayed_node = btrfs_get_delayed_node(inode);
if (!delayed_node)
@@ -1802,17 +1801,14 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
*rdev = btrfs_stack_inode_rdev(inode_item);
BTRFS_I(inode)->flags = btrfs_stack_inode_flags(inode_item);
 
-   tspec = btrfs_inode_atime(inode_item);
-   inode->i_atime.tv_sec = btrfs_stack_timespec_sec(tspec);
-   inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(tspec);
+   inode->i_atime.tv_sec = btrfs_stack_timespec_sec(&inode_item->atime);
+   inode->i_atime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->atime);
 
-   tspec = btrfs_inode_mtime(inode_item);
-   inode->i_mtime.tv_sec = btrfs_stack_timespec_sec(tspec);
-   inode->i_mtime.tv_nsec = btrfs_stack_timespec_nsec(tspec);
+   inode->i_mtime.tv_sec = btrfs_stack_timespec_sec(&inode_item->mtime);
+   inode->i_mtime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->mtime);
 
-   tspec = btrfs_inode_ctime(inode_item);
-   inode->i_ctime.tv_sec = btrfs_stack_timespec_sec(tspec);
-   inode->i_ctime.tv_nsec = btrfs_stack_timespec_nsec(tspec);
+   inode->i_ctime.tv_sec = btrfs_stack_timespec_sec(&inode_item->ctime);
+   inode->i_ctime.tv_nsec = btrfs_stack_timespec_nsec(&inode_item->ctime);
 
inode->i_generation = BTRFS_I(inode)->generation;
BTRFS_I(ino

Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread Zygo Blaxell

On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> So RAID5 with three media M is
> 
> M    M    M
> D1   D2   P(a)
> D3   P(b) D4
> P(c) D5   D6

RAID5 with two media is well defined, and looks like this:

M    M
D1   P(a)
P(b) D2
D3   P(c)

With even parity and N disks

P(a) ^ D1 [^ D2 ^ ... ^ DN] = 0

Simplifying for one data disk and one parity stripe:

P(a) ^ D1 = 0

therefore

P(a) = D1

which is effectively (and, in practice, literally) mirroring.
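
The identity above can be checked with a minimal C sketch. The names
raid5_parity() and parity_equals_data() are hypothetical and only illustrate
the XOR arithmetic; this is not btrfs code. It assumes even parity computed
per byte over the data blocks of one stripe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: compute the even-parity block of one stripe,
 * parity = D1 ^ D2 ^ ... ^ DN, so that parity ^ D1 ^ ... ^ DN == 0
 * holds for every byte.
 */
static void raid5_parity(const uint8_t *const *data, size_t ndata,
			 size_t len, uint8_t *parity)
{
	memset(parity, 0, len);
	for (size_t d = 0; d < ndata; d++)
		for (size_t i = 0; i < len; i++)
			parity[i] ^= data[d][i];
}

/*
 * Hypothetical helper: with a single data block per stripe (two media),
 * return 1 if the parity block is byte-for-byte identical to the data.
 * Blocks longer than the 64-byte scratch buffer are rejected.
 */
static int parity_equals_data(const uint8_t *block, size_t len)
{
	const uint8_t *stripe[1] = { block };
	uint8_t parity[64];

	if (len > sizeof(parity))
		return 0;
	raid5_parity(stripe, 1, len, parity);
	return memcmp(parity, block, len) == 0;
}
```

With one data block the loop XORs that block into a zeroed buffer, so the
parity is a copy of the data, which is the mirroring case described above.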


[PATCH 2/9] btrfs: remove blocksize from reada_extent

2014-12-12 Thread David Sterba

Replace with global nodesize instead.

Signed-off-by: David Sterba 
---
 fs/btrfs/reada.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index b63ae20618fb..5c3fde6571bb 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -66,7 +66,6 @@ struct reada_extctl {
 struct reada_extent {
u64 logical;
struct btrfs_key top;
-   u32 blocksize;
int err;
struct list_head extctl;
int refcnt;
@@ -349,7 +348,6 @@ static struct reada_extent *reada_find_extent(struct btrfs_root *root,
 
blocksize = root->nodesize;
re->logical = logical;
-   re->blocksize = blocksize;
re->top = *top;
INIT_LIST_HEAD(&re->extctl);
spin_lock_init(&re->lock);
@@ -660,7 +658,6 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
int mirror_num = 0;
struct extent_buffer *eb = NULL;
u64 logical;
-   u32 blocksize;
int ret;
int i;
int need_kick = 0;
@@ -694,7 +691,7 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
spin_unlock(&fs_info->reada_lock);
return 0;
}
-   dev->reada_next = re->logical + re->blocksize;
+   dev->reada_next = re->logical + fs_info->tree_root->nodesize;
re->refcnt++;
 
spin_unlock(&fs_info->reada_lock);
@@ -709,7 +706,6 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
}
}
logical = re->logical;
-   blocksize = re->blocksize;
 
spin_lock(&re->lock);
if (re->scheduled_for == NULL) {
@@ -724,8 +720,8 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
return 0;
 
atomic_inc(&dev->reada_in_flight);
-   ret = reada_tree_block_flagged(fs_info->extent_root, logical, blocksize,
-mirror_num, &eb);
+   ret = reada_tree_block_flagged(fs_info->extent_root, logical,
+   fs_info->tree_root->nodesize, mirror_num, &eb);
if (ret)
__readahead_hook(fs_info->extent_root, NULL, logical, ret);
else if (eb)
@@ -851,7 +847,7 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int all)
break;
printk(KERN_DEBUG
"  re: logical %llu size %u empty %d for %lld",
-   re->logical, re->blocksize,
+   re->logical, fs_info->tree_root->nodesize,
list_empty(&re->extctl), re->scheduled_for ?
re->scheduled_for->devid : -1);
 
@@ -886,7 +882,8 @@ static void dump_devs(struct btrfs_fs_info *fs_info, int all)
}
printk(KERN_DEBUG
"re: logical %llu size %u list empty %d for %lld",
-   re->logical, re->blocksize, list_empty(&re->extctl),
+   re->logical, fs_info->tree_root->nodesize,
+   list_empty(&re->extctl),
re->scheduled_for ? re->scheduled_for->devid : -1);
for (i = 0; i < re->nzones; ++i) {
printk(KERN_CONT " zone %llu-%llu devs",
-- 
2.1.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PULL] [PATCH 0/9] Cleanup, remove superfluous blocksize parameter, part 2

2014-12-12 Thread David Sterba

Here's the rest of the parameter removal, no warnings anymore and passed
xfstests. There are a few more cleanups that were required to finish the goal.

You can pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git cleanup/blocksize-diet-part2

Based on current next branch 9627aeee3e203e30679549e4962633698a6bf87f,
top commit ce3e69847e3ec79a38421bfd3d6f554d5e481231

David Sterba (9):
  btrfs: sink blocksize parameter to readahead_tree_block
  btrfs: remove blocksize from reada_extent
  btrfs: sink blocksize parameter to reada_tree_block_flagged
  btrfs: sink blocksize parameter to btrfs_init_new_buffer
  btrfs: sink blocksize parameter to btrfs_find_create_tree_block
  btrfs: sink blocksize parameter to tree_block_processed
  btrfs: use GFP_NOFS in __alloc_extent_buffer directly
  btrfs: unify extent buffer allocation api
  btrfs: sink parameter len to alloc_extent_buffer

 fs/btrfs/ctree.c | 13 +
 fs/btrfs/disk-io.c   | 17 -
 fs/btrfs/disk-io.h   |  6 +++---
 fs/btrfs/extent-tree.c   | 13 ++---
 fs/btrfs/extent_io.c | 34 --
 fs/btrfs/extent_io.h |  7 ---
 fs/btrfs/reada.c | 15 ++-
 fs/btrfs/relocation.c| 12 ++--
 fs/btrfs/tests/extent-buffer-tests.c |  2 +-
 fs/btrfs/tests/inode-tests.c |  4 ++--
 fs/btrfs/tests/qgroup-tests.c| 23 +++
 fs/btrfs/tree-log.c  |  2 +-
 fs/btrfs/volumes.c   |  9 +++--
 13 files changed, 84 insertions(+), 73 deletions(-)

-- 
2.1.3


[PATCH 4/9] btrfs: sink blocksize parameter to btrfs_init_new_buffer

2014-12-12 Thread David Sterba

Signed-off-by: David Sterba 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c025751c20d7..50ebc74db508 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7215,11 +7215,11 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 
 static struct extent_buffer *
 btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
- u64 bytenr, u32 blocksize, int level)
+ u64 bytenr, int level)
 {
struct extent_buffer *buf;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return ERR_PTR(-ENOMEM);
btrfs_set_header_generation(buf, trans->transid);
@@ -7338,7 +7338,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
 
if (btrfs_test_is_dummy_root(root)) {
buf = btrfs_init_new_buffer(trans, root, root->alloc_bytenr,
-   blocksize, level);
+   level);
if (!IS_ERR(buf))
root->alloc_bytenr += blocksize;
return buf;
@@ -7355,8 +7355,7 @@ struct extent_buffer *btrfs_alloc_tree_block(struct btrfs_trans_handle *trans,
return ERR_PTR(ret);
}
 
-   buf = btrfs_init_new_buffer(trans, root, ins.objectid,
-   blocksize, level);
+   buf = btrfs_init_new_buffer(trans, root, ins.objectid, level);
BUG_ON(IS_ERR(buf)); /* -ENOMEM */
 
if (root_objectid == BTRFS_TREE_RELOC_OBJECTID) {
-- 
2.1.3


[PATCH 3/9] btrfs: sink blocksize parameter to reada_tree_block_flagged

2014-12-12 Thread David Sterba

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/disk-io.h | 2 +-
 fs/btrfs/reada.c   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index be9d7c612489..8123b03b1f9d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1086,7 +1086,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 bytenr)
free_extent_buffer(buf);
 }
 
-int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
+int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 int mirror_num, struct extent_buffer **eb)
 {
struct extent_buffer *buf = NULL;
@@ -1094,7 +1094,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
struct extent_io_tree *io_tree = &BTRFS_I(btree_inode)->io_tree;
int ret;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return 0;
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 9cf4359ace05..4d4ecdd9f4a2 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -47,7 +47,7 @@ struct btrfs_fs_devices;
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
  u64 parent_transid);
 void readahead_tree_block(struct btrfs_root *root, u64 bytenr);
-int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
+int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 int mirror_num, struct extent_buffer **eb);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
   u64 bytenr, u32 blocksize);
diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 5c3fde6571bb..4d3d4e5287c5 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -721,7 +721,7 @@ static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
 
atomic_inc(&dev->reada_in_flight);
ret = reada_tree_block_flagged(fs_info->extent_root, logical,
-   fs_info->tree_root->nodesize, mirror_num, &eb);
+   mirror_num, &eb);
if (ret)
__readahead_hook(fs_info->extent_root, NULL, logical, ret);
else if (eb)
-- 
2.1.3


[PATCH 1/9] btrfs: sink blocksize parameter to readahead_tree_block

2014-12-12 Thread David Sterba

All callers pass nodesize.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c   | 8 +++-
 fs/btrfs/disk-io.c | 4 ++--
 fs/btrfs/disk-io.h | 2 +-
 fs/btrfs/extent-tree.c | 2 +-
 fs/btrfs/relocation.c  | 3 +--
 5 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 14a72ed14ef7..50eca331812c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2282,7 +2282,7 @@ static void reada_for_search(struct btrfs_root *root,
if ((search <= target && target - search <= 65536) ||
(search > target && search - target <= 65536)) {
gen = btrfs_node_ptr_generation(node, nr);
-   readahead_tree_block(root, search, blocksize);
+   readahead_tree_block(root, search);
nread += blocksize;
}
nscan++;
@@ -2301,7 +2301,6 @@ static noinline void reada_for_balance(struct btrfs_root *root,
u64 gen;
u64 block1 = 0;
u64 block2 = 0;
-   int blocksize;
 
parent = path->nodes[level + 1];
if (!parent)
@@ -2309,7 +2308,6 @@ static noinline void reada_for_balance(struct btrfs_root *root,
 
nritems = btrfs_header_nritems(parent);
slot = path->slots[level + 1];
-   blocksize = root->nodesize;
 
if (slot > 0) {
block1 = btrfs_node_blockptr(parent, slot - 1);
@@ -2334,9 +2332,9 @@ static noinline void reada_for_balance(struct btrfs_root *root,
}
 
if (block1)
-   readahead_tree_block(root, block1, blocksize);
+   readahead_tree_block(root, block1);
if (block2)
-   readahead_tree_block(root, block2, blocksize);
+   readahead_tree_block(root, block2);
 }
 
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 30965120772b..be9d7c612489 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1073,12 +1073,12 @@ static const struct address_space_operations btree_aops = {
.set_page_dirty = btree_set_page_dirty,
 };
 
-void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize)
+void readahead_tree_block(struct btrfs_root *root, u64 bytenr)
 {
struct extent_buffer *buf = NULL;
struct inode *btree_inode = root->fs_info->btree_inode;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, blocksize);
+   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
if (!buf)
return;
read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 414651821fb3..9cf4359ace05 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -46,7 +46,7 @@ struct btrfs_fs_devices;
 
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
  u64 parent_transid);
-void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize);
+void readahead_tree_block(struct btrfs_root *root, u64 bytenr);
 int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr, u32 blocksize,
 int mirror_num, struct extent_buffer **eb);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 222d6aea4a8a..c025751c20d7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7485,7 +7485,7 @@ static noinline void reada_walk_down(struct btrfs_trans_handle *trans,
continue;
}
 reada:
-   readahead_tree_block(root, bytenr, blocksize);
+   readahead_tree_block(root, bytenr);
nread++;
}
wc->reada_slot = slot;
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 74257d6436ad..cb5d4462ebb4 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2965,8 +2965,7 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
while (rb_node) {
block = rb_entry(rb_node, struct tree_block, rb_node);
if (!block->key_ready)
-   readahead_tree_block(rc->extent_root, block->bytenr,
-   block->key.objectid);
+   readahead_tree_block(rc->extent_root, block->bytenr);
rb_node = rb_next(rb_node);
}
 
-- 
2.1.3


[PATCH 5/9] btrfs: sink blocksize parameter to btrfs_find_create_tree_block

2014-12-12 Thread David Sterba

Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 12 ++--
 fs/btrfs/disk-io.h |  2 +-
 fs/btrfs/extent-tree.c |  4 ++--
 fs/btrfs/tree-log.c|  2 +-
 fs/btrfs/volumes.c |  9 +++--
 5 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8123b03b1f9d..548cb540e516 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1078,7 +1078,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 bytenr)
struct extent_buffer *buf = NULL;
struct inode *btree_inode = root->fs_info->btree_inode;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
+   buf = btrfs_find_create_tree_block(root, bytenr);
if (!buf)
return;
read_extent_buffer_pages(&BTRFS_I(btree_inode)->io_tree,
@@ -1094,7 +1094,7 @@ int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
struct extent_io_tree *io_tree = &BTRFS_I(btree_inode)->io_tree;
int ret;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
+   buf = btrfs_find_create_tree_block(root, bytenr);
if (!buf)
return 0;
 
@@ -1125,12 +1125,12 @@ struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
 }
 
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-u64 bytenr, u32 blocksize)
+u64 bytenr)
 {
if (btrfs_test_is_dummy_root(root))
return alloc_test_extent_buffer(root->fs_info, bytenr,
-   blocksize);
-   return alloc_extent_buffer(root->fs_info, bytenr, blocksize);
+   root->nodesize);
+   return alloc_extent_buffer(root->fs_info, bytenr, root->nodesize);
 }
 
 
@@ -1152,7 +1152,7 @@ struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
struct extent_buffer *buf = NULL;
int ret;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
+   buf = btrfs_find_create_tree_block(root, bytenr);
if (!buf)
return NULL;
 
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 4d4ecdd9f4a2..27d44c0fd236 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -50,7 +50,7 @@ void readahead_tree_block(struct btrfs_root *root, u64 bytenr);
 int reada_tree_block_flagged(struct btrfs_root *root, u64 bytenr,
 int mirror_num, struct extent_buffer **eb);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-  u64 bytenr, u32 blocksize);
+  u64 bytenr);
 void clean_tree_block(struct btrfs_trans_handle *trans,
  struct btrfs_root *root, struct extent_buffer *buf);
 int open_ctree(struct super_block *sb,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 50ebc74db508..8ff31f81d870 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7219,7 +7219,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 {
struct extent_buffer *buf;
 
-   buf = btrfs_find_create_tree_block(root, bytenr, root->nodesize);
+   buf = btrfs_find_create_tree_block(root, bytenr);
if (!buf)
return ERR_PTR(-ENOMEM);
btrfs_set_header_generation(buf, trans->transid);
@@ -7825,7 +7825,7 @@ static noinline int do_walk_down(struct btrfs_trans_handle *trans,
 
next = btrfs_find_tree_block(root, bytenr);
if (!next) {
-   next = btrfs_find_create_tree_block(root, bytenr, blocksize);
+   next = btrfs_find_create_tree_block(root, bytenr);
if (!next)
return -ENOMEM;
btrfs_set_buffer_lockdep_class(root->root_key.objectid, next,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9a02da16f2be..4a42edc224a8 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2164,7 +2164,7 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
parent = path->nodes[*level];
root_owner = btrfs_header_owner(parent);
 
-   next = btrfs_find_create_tree_block(root, bytenr, blocksize);
+   next = btrfs_find_create_tree_block(root, bytenr);
if (!next)
return -ENOMEM;
 
diff --git a/fs/btrfs/volumes.c

[PATCH 6/9] btrfs: sink blocksize parameter to tree_block_processed

2014-12-12 Thread David Sterba

Signed-off-by: David Sterba 
---
 fs/btrfs/relocation.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index cb5d4462ebb4..d83085381bcc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2855,9 +2855,10 @@ static void update_processed_blocks(struct reloc_control *rc,
}
 }
 
-static int tree_block_processed(u64 bytenr, u32 blocksize,
-   struct reloc_control *rc)
+static int tree_block_processed(u64 bytenr, struct reloc_control *rc)
 {
+   u32 blocksize = rc->extent_root->nodesize;
+
if (test_range_bit(&rc->processed_blocks, bytenr,
   bytenr + blocksize - 1, EXTENT_DIRTY, 1, NULL))
return 1;
@@ -3352,7 +3353,7 @@ static int __add_tree_block(struct reloc_control *rc,
bool skinny = btrfs_fs_incompat(rc->extent_root->fs_info,
SKINNY_METADATA);
 
-   if (tree_block_processed(bytenr, blocksize, rc))
+   if (tree_block_processed(bytenr, rc))
return 0;
 
if (tree_search(blocks, bytenr))
@@ -3610,7 +3611,7 @@ static int find_data_references(struct reloc_control *rc,
if (added)
goto next;
 
-   if (!tree_block_processed(leaf->start, leaf->len, rc)) {
+   if (!tree_block_processed(leaf->start, rc)) {
block = kmalloc(sizeof(*block), GFP_NOFS);
if (!block) {
err = -ENOMEM;
-- 
2.1.3


[PATCH 7/9] btrfs: use GFP_NOFS in __alloc_extent_buffer directly

2014-12-12 Thread David Sterba

Same mask from all callers.

Signed-off-by: David Sterba 
---
 fs/btrfs/extent_io.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4ebabd237153..619592d86c2a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4598,11 +4598,11 @@ static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 
 static struct extent_buffer *
 __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
- unsigned long len, gfp_t mask)
+ unsigned long len)
 {
struct extent_buffer *eb = NULL;
 
-   eb = kmem_cache_zalloc(extent_buffer_cache, mask);
+   eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
if (eb == NULL)
return NULL;
eb->start = start;
@@ -4643,7 +4643,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
struct extent_buffer *new;
unsigned long num_pages = num_extent_pages(src->start, src->len);
 
-   new = __alloc_extent_buffer(NULL, src->start, src->len, GFP_NOFS);
+   new = __alloc_extent_buffer(NULL, src->start, src->len);
if (new == NULL)
return NULL;
 
@@ -4672,7 +4672,7 @@ struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len)
unsigned long num_pages = num_extent_pages(0, len);
unsigned long i;
 
-   eb = __alloc_extent_buffer(NULL, start, len, GFP_NOFS);
+   eb = __alloc_extent_buffer(NULL, start, len);
if (!eb)
return NULL;
 
@@ -4824,7 +4824,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
if (eb)
return eb;
 
-   eb = __alloc_extent_buffer(fs_info, start, len, GFP_NOFS);
+   eb = __alloc_extent_buffer(fs_info, start, len);
if (!eb)
return NULL;
 
-- 
2.1.3


[PATCH 9/9] btrfs: sink parameter len to alloc_extent_buffer

2014-12-12 Thread David Sterba

Because we're using globally known nodesize. Do the same for the sanity
test function variant.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c| 5 ++---
 fs/btrfs/extent_io.c  | 5 +++--
 fs/btrfs/extent_io.h  | 4 ++--
 fs/btrfs/tests/qgroup-tests.c | 2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 548cb540e516..9c204533fd22 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1128,9 +1128,8 @@ struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
 u64 bytenr)
 {
if (btrfs_test_is_dummy_root(root))
-   return alloc_test_extent_buffer(root->fs_info, bytenr,
-   root->nodesize);
-   return alloc_extent_buffer(root->fs_info, bytenr, root->nodesize);
+   return alloc_test_extent_buffer(root->fs_info, bytenr);
+   return alloc_extent_buffer(root->fs_info, bytenr);
 }
 
 
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dc424e32545a..c4ca90ab687e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4775,7 +4775,7 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
-  u64 start, unsigned long len)
+  u64 start)
 {
struct extent_buffer *eb, *exists = NULL;
int ret;
@@ -4821,8 +4821,9 @@ free_eb:
 #endif
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
- u64 start, unsigned long len)
+ u64 start)
 {
+   unsigned long len = fs_info->tree_root->nodesize;
unsigned long num_pages = num_extent_pages(start, len);
unsigned long i;
unsigned long index = start >> PAGE_CACHE_SHIFT;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index e6553e3d35c8..71268e508b7a 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -262,7 +262,7 @@ int get_state_private(struct extent_io_tree *tree, u64 start, u64 *private);
 void set_page_extent_mapped(struct page *page);
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
- u64 start, unsigned long len);
+ u64 start);
 struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
u64 start);
 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src);
@@ -378,5 +378,5 @@ noinline u64 find_lock_delalloc_range(struct inode *inode,
  u64 *end, u64 max_bytes);
 #endif
 struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
-  u64 start, unsigned long len);
+  u64 start);
 #endif
diff --git a/fs/btrfs/tests/qgroup-tests.c b/fs/btrfs/tests/qgroup-tests.c
index 7336b1c09cd8..73f299ebdabb 100644
--- a/fs/btrfs/tests/qgroup-tests.c
+++ b/fs/btrfs/tests/qgroup-tests.c
@@ -419,7 +419,7 @@ int btrfs_test_qgroups(void)
 * Can't use bytenr 0, some things freak out
 * *cough*backref walking code*cough*
 */
-   root->node = alloc_test_extent_buffer(root->fs_info, 4096, 4096);
+   root->node = alloc_test_extent_buffer(root->fs_info, 4096);
if (!root->node) {
test_msg("Couldn't allocate dummy buffer\n");
ret = -ENOMEM;
-- 
2.1.3


[PATCH 8/9] btrfs: unify extent buffer allocation api

2014-12-12 Thread David Sterba

Make the extent buffer allocation interface consistent.  Cloned eb will
set a valid fs_info.  For dummy eb, we can drop the length parameter and
set it from fs_info.

The built-in sanity checks may pass a NULL fs_info that's queried for
nodesize, but we know it's 4096.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c |  5 ++---
 fs/btrfs/extent_io.c | 23 ++-
 fs/btrfs/extent_io.h |  3 ++-
 fs/btrfs/tests/extent-buffer-tests.c |  2 +-
 fs/btrfs/tests/inode-tests.c |  4 ++--
 fs/btrfs/tests/qgroup-tests.c| 21 ++---
 6 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 50eca331812c..276d4187cbf0 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1363,8 +1363,7 @@ tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
 
if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
BUG_ON(tm->slot != 0);
-   eb_rewin = alloc_dummy_extent_buffer(eb->start,
-   fs_info->tree_root->nodesize);
+   eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start);
if (!eb_rewin) {
btrfs_tree_read_unlock_blocking(eb);
free_extent_buffer(eb);
@@ -1444,7 +1443,7 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
} else if (old_root) {
btrfs_tree_read_unlock(eb_root);
free_extent_buffer(eb_root);
-   eb = alloc_dummy_extent_buffer(logical, root->nodesize);
+   eb = alloc_dummy_extent_buffer(root->fs_info, logical);
} else {
btrfs_set_lock_blocking_rw(eb_root, BTRFS_READ_LOCK);
eb = btrfs_clone_extent_buffer(eb_root);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 619592d86c2a..dc424e32545a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4643,7 +4643,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
struct extent_buffer *new;
unsigned long num_pages = num_extent_pages(src->start, src->len);
 
-   new = __alloc_extent_buffer(NULL, src->start, src->len);
+   new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
if (new == NULL)
return NULL;
 
@@ -4666,13 +4666,26 @@ struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src)
return new;
 }
 
-struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len)
+struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
+   u64 start)
 {
struct extent_buffer *eb;
-   unsigned long num_pages = num_extent_pages(0, len);
+   unsigned long len;
+   unsigned long num_pages;
unsigned long i;
 
-   eb = __alloc_extent_buffer(NULL, start, len);
+   if (!fs_info) {
+   /*
+* Called only from tests that don't always have a fs_info
+* available, but we know that nodesize is 4096
+*/
+   len = 4096;
+   } else {
+   len = fs_info->tree_root->nodesize;
+   }
+   num_pages = num_extent_pages(0, len);
+
+   eb = __alloc_extent_buffer(fs_info, start, len);
if (!eb)
return NULL;
 
@@ -4770,7 +4783,7 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
eb = find_extent_buffer(fs_info, start);
if (eb)
return eb;
-   eb = alloc_dummy_extent_buffer(start, len);
+   eb = alloc_dummy_extent_buffer(fs_info, start);
if (!eb)
return NULL;
eb->fs_info = fs_info;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index ece9ce87edff..e6553e3d35c8 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -263,7 +263,8 @@ void set_page_extent_mapped(struct page *page);
 
 struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
  u64 start, unsigned long len);
-struct extent_buffer *alloc_dummy_extent_buffer(u64 start, unsigned long len);
+struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
+   u64 start);
 struct extent_buffer *btrfs_clone_extent_buffer(struct extent_buffer *src);
 struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 u64 start);
diff --git a/fs/btrfs/tests/extent-buffer-tests.c b/fs/btrfs/tests/extent-buffer-tests.c
index cc286ce97d1e..f51963a8f929 100644
--- a/fs/btrfs/tests/extent-buffer-tests.c
+++ b/fs/btrfs/tests/extent-buffer-tests.c
@@ -53,7 +53,7 @@ static int test_btrfs_split_item(void)
return -ENOMEM;
}
 
-   path->nodes[0] = eb = alloc_dummy_extent_buffer(0, 4096);
+   path->nodes[0]

Re: [PATCH v2 1/3] Btrfs: get more accurate output in df command.

2014-12-12 Thread Goffredo Baroncelli

On 12/11/2014 09:31 AM, Dongsheng Yang wrote:
> When btrfs_statfs() calculates the total size of the fs, it computes
> the total size of the disks and then divides it by a factor. But in some
> use cases, the result is misleading to the user.

I am checking it; to me it seems a good improvement. However
I noticed that df no longer seems to report the space consumed
by the metadata chunk, e.g.:

# I have two disks of 5GB each
$ sudo ~/mkfs.btrfs -f -m raid1 -d raid1 /dev/vgtest/disk /dev/vgtest/disk1

$ df -h /mnt/btrfs1/
Filesystem   Size  Used Avail Use% Mounted on
/dev/mapper/vgtest-disk  5.0G  1.1G  4.0G  21% /mnt/btrfs1

$ sudo btrfs fi show
Label: none  uuid: 884414c6-9374-40af-a5be-3949cdf6ad0b
Total devices 2 FS bytes used 640.00KB
devid 2 size 5.00GB used 2.01GB path /dev/dm-1
devid 1 size 5.00GB used 2.03GB path /dev/dm-0

$ sudo ./btrfs fi df /mnt/btrfs1/
Data, RAID1: total=1.00GiB, used=512.00KiB
Data, single: total=8.00MiB, used=0.00B
System, RAID1: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, RAID1: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

In this case the filesystem is empty (it was a new filesystem!). However, a
1GB metadata chunk was already allocated. This is the reason why the free
space is only 4GB.

On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring the
metadata chunk from the free space may not be a big problem. 





> Example:
>   # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>   # mount /dev/vdf1 /mnt
>   # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>   # df -h /mnt
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/vdf1   3.0G 1018M  1.3G  45% /mnt
>   # btrfs fi show /dev/vdf1
> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>   Total devices 2 FS bytes used 1001.53MiB
>   devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>   devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
> a. df -h should report Size as 2GiB rather than as 3GiB.
> Because this is 2 device raid1, the limiting factor is devid 1 @2GiB.
> 
> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB.
> 1.85   (the capacity of the allocated chunk)
>-1.018  (the file stored)
>+(2-1.85=0.15)  (the residual capacity of the disks
> considering a raid1 fs)
>---
> =   0.97
> 
> This patch drops the factor entirely and calculates the size observable to
> the user, without considering which raid level the data is in or its
> exact size on disk.
> After this patch is applied:
>   # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>   # mount /dev/vdf1 /mnt
>   # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>   # df -h /mnt
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/vdf1   2.0G  1.3G  713M  66% /mnt
>   # df /mnt
> Filesystem 1K-blocks    Used Available Use% Mounted on
> /dev/vdf1    2097152 1359424    729536  66% /mnt
>   # btrfs fi show /dev/vdf1
> Label: none  uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4
>   Total devices 2 FS bytes used 1001.55MiB
>   devid 1 size 2.00GiB used 1.85GiB path /dev/vdf1
>   devid 2 size 4.00GiB used 1.83GiB path /dev/vdf2
> a). The @Size is 2G as we expected.
> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G).
> c). @Used is changed to 1.3G rather than 1018M as above, because
> this patch does not treat the free space in the metadata chunk
> and system chunk as available to the user. Indeed, the user cannot
> use this space to store data, so it should not be
> counted as available. At the same time, it makes
> @Used + @Available == @Size hold as far as possible for the user.
> 
> Signed-off-by: Dongsheng Yang 
> ---
> Changelog:
> v1 -> v2:
> a. account the total_bytes in the metadata chunk and
>    system chunk as used to the user.
> b. set data_stripes to the correct value in RAID0.
> 
>  fs/btrfs/extent-tree.c | 13 ++--
>  fs/btrfs/super.c       | 56 ++
>  2 files changed, 26 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a84e00d..9954d60 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
>  {
>   struct btrfs_block_group_cache *block_group;
>   u64 free_bytes = 0;
> - int factor;
>  
>   list_for_each_entry(block_group, groups_list, list) {
>   spin_lock(&block_group->lock);
> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
>   continue;
>   }
>  
> - if (block_group->flags & (BTRFS_BLOCK_

[GIT PULL] Btrfs for 3.19-rc

2014-12-12 Thread Chris Mason

Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

From a feature point of view, most of the code here comes from Miao Xie
and others at Fujitsu to implement scrubbing and replacing devices on
raid56.  This has been in development for a while, and it's a big
improvement.

Filipe and Josef have a great assortment of fixes, many of which solve
corruption problems either after a crash or in error conditions.  I
still have a round two from Filipe for next week that solves corruptions
with discard and block group removal.

This was based on 3.18-rc5 but does have a merge with rc6 because I
wanted to make sure testers had the use-after-free fix you pulled in
from the networking tree.

Filipe Manana (30) commits (+804/-272):
Btrfs: fix snapshot inconsistency after a file write followed by truncate (+60/-24)
Btrfs: make btrfs_abort_transaction consider existence of new block groups (+4/-3)
Btrfs: report error after failure inlining extent in compressed write path (+4/-0)
Btrfs: fix race between fs trimming and block group remove/allocation (+140/-21)
Btrfs: deal with convert_extent_bit errors to avoid fs corruption (+76/-18)
Btrfs: fix freeing used extents after removing empty block group (+10/-11)
Btrfs: collect only the necessary ordered extents on ranged fsync (+17/-5)
Btrfs: fix invalid block group rbtree access after bg is removed (+13/-0)
Btrfs: avoid returning -ENOMEM in convert_extent_bit() too early (+10/-1)
Btrfs: fix freeing used extent after removing empty block group (+11/-2)
Btrfs: fix race between writing free space cache and trimming (+71/-6)
Btrfs: correctly flush compressed data before/after direct IO (+24/-4)
Btrfs: make find_first_extent_bit be able to cache any state (+15/-4)
Btrfs: don't leak pages and memory on compressed write error (+19/-9)
Btrfs: set page and mapping error on compressed write failure (+8/-1)
Btrfs: process all async extents on compressed write failure (+1/-5)
Btrfs: ensure ordered extent errors aren't missed on fsync (+37/-0)
Btrfs: make inode.c:submit_compressed_extents() return void (+2/-5)
Btrfs: ensure send always works on roots without orphans (+49/-29)
Btrfs: fix unprotected deletion from pending_chunks list (+7/-1)
Btrfs: make inode.c:compress_file_range() return void (+2/-5)
Btrfs: fix memory leak after block remove + trimming (+6/-0)
Btrfs: avoid premature -ENOMEM in clear_extent_bit() (+7/-2)
Btrfs: don't ignore compressed bio write errors (+12/-6)
Btrfs: fix crash caused by block group removal (+28/-0)
Btrfs: don't ignore log btree writeback errors (+15/-6)
Btrfs: make xattr replace operations atomic (+102/-65)
Btrfs: add helper btrfs_fdatawrite_range (+34/-39)
Btrfs: fix hang on compressed write error (+14/-0)
Btrfs: fix fs mapping extent map leak (+6/-0)

David Sterba (8) commits (+122/-44):
btrfs: switch inode_cache option handling to pending changes (+18/-15)
btrfs: fix wrong accounting of raid1 data profile in statfs (+1/-1)
btrfs: do commit in sync_fs if there are pending changes (+11/-3)
btrfs: move commit out of sysfs when changing features (+5/-8)
btrfs: move commit out of sysfs when changing label (+8/-13)
btrfs: add support for processing pending changes (+69/-0)
btrfs: fix typos in btrfs_check_super_valid (+4/-4)
btrfs: introduce pending action: commit (+6/-0)

Miao Xie (7) commits (+1556/-150):
Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56 (+45/-20)
Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted (+235/-33)
Btrfs, replace: write raid56 parity into the replace target device (+24/-1)
Btrfs, replace: write dirty pages into the replace target device (+97/-43)
Btrfs, raid56: use a variant to record the operation type (+17/-14)
Btrfs, raid56: support parity scrub on raid56 (+1115/-20)
Btrfs, raid56: don't change bbio and raid_map (+23/-19)

Josef Bacik (6) commits (+141/-63):
Btrfs: make sure logged extents complete in the current transaction V3 (+72/-6)
Btrfs: make sure we wait on logged extents when fsycning two subvols (+1/-1)
Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2 (+47/-25)
Btrfs: make get_caching_control unconditionally return the ctl (+4/-6)
Btrfs: move read only block groups onto their own list V2 (+17/-23)
Btrfs: do not move em to modified list when unpinning (+0/-2)

Zhao Lei (3) commits (+2/-10):
Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block (+1/-3)
Btrfs: remove noused bbio_ret in __btrfs_map_block in condition (+1/-2)
Btrfs, replace: enable dev-replace for raid56 (+0/-5)

Stefan Behrens (2) commits (+53/-93):
Btrfs: check-int: don't complain about balanced blocks (+42/-38)
Btrfs: check_int: use the known block location (+11/

Re: [GIT PULL] Btrfs for 3.19-rc

2014-12-12 Thread Linus Torvalds

On Fri, Dec 12, 2014 at 11:07 AM, Chris Mason  wrote:
>
> From a feature point of view, most of the code here comes from Miao Xie
> and others at Fujitsu to implement scrubbing and replacing devices on
> raid56.  This has been in development for a while, and it's a big
> improvement.

So this has probably happened before, and I just haven't been looking,
but I thought I'd mention it.

There are merges from github for this feature, and those merges aren't
signed, and don't have merge messages. Maybe you actually verified all
of it other ways, but there's no sign of it. I generally push back on
merging unsigned stuff from random hosting places (to the point where
I just refuse to do it, although it's possible that some pass through
just due to inattention), and I think that's just good practice in
general. And merges that don't explain what the merge does are just
bad merges (they are extra annoying when they are back-merges, but
it's a problem even otherwise).

Now, sometimes the "why did you merge" is obvious in just the merge
itself (maybe the branch name is already sufficient to explain some
trivial pull). But I thought I'd mention this as an area where the
kernel development process can still improve. I strive to make sure
that my merge commits have good messages (generally by asking
submaintainers to explain things to me in email or in the signed tag),
and I'm starting to try to encourage others to do the same.

Again, this is probably something you've done before without me ever
mentioning/noticing it, and I really don't think the btrfs tree is at
all alone in this, but I thought I'd mention it since I happened to
react to it this time.

Regardless - pulled,

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 1/3] Btrfs: get more accurate output in df command.

2014-12-12 Thread Goffredo Baroncelli

On 12/11/2014 09:31 AM, Dongsheng Yang wrote:
> When btrfs_statfs() calculates the total size of the fs, it calculates
> the total size of the disks and then divides it by a factor. But in some use
> cases, the result is not useful to the user.


Hi Yang; during my test I discovered an error:

$ sudo lvcreate -L +10G -n disk vgtest
$ sudo /home/ghigo/mkfs.btrfs -f /dev/vgtest/disk
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vgtest/disk
nodesize 16384 leafsize 16384 sectorsize 4096 size 10.00GiB
$ sudo mount /dev/vgtest/disk /mnt/btrfs1/
$ df /mnt/btrfs1/
Filesystem              1K-blocks    Used Available Use% Mounted on
/dev/mapper/vgtest-disk   9428992 1069312   8359680  12% /mnt/btrfs1
$ sudo ~/btrfs fi df /mnt/btrfs1/
Data, single: total=8.00MiB, used=256.00KiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

What seems strange to me is the 9428992KiB of total disk space as reported
by df. About 1GiB is missing!

Without your patch, I got:

$ df /mnt/btrfs1/
Filesystem              1K-blocks  Used Available Use% Mounted on
/dev/mapper/vgtest-disk  10485760 16896   8359680   1% /mnt/btrfs1
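
The gap between the two df totals can be checked with a few lines of arithmetic. What follows is only an illustrative user-space sketch, not btrfs code: the helper names kib_to_mib() and missing_mib() and the enum constants are made up for this example, and the numbers are copied from the df and "btrfs fi df" output in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a size in KiB (the unit df prints by default) to whole MiB. */
static uint64_t kib_to_mib(uint64_t kib)
{
	return kib / 1024;
}

/* Space that disappeared between the unpatched and patched df totals, in MiB. */
static uint64_t missing_mib(uint64_t total_before_kib, uint64_t total_after_kib)
{
	assert(total_before_kib >= total_after_kib);
	return kib_to_mib(total_before_kib - total_after_kib);
}

/* Values copied from the command output quoted in this message. */
enum {
	DF_TOTAL_UNPATCHED_KIB = 10485760,	/* df 1K-blocks without the patch */
	DF_TOTAL_PATCHED_KIB   = 9428992,	/* df 1K-blocks with the patch */
	METADATA_DUP_MIB       = 1024,		/* Metadata, DUP: total=1.00GiB */
	SYSTEM_DUP_MIB         = 8,		/* System, DUP: total=8.00MiB */
};
```

missing_mib(DF_TOTAL_UNPATCHED_KIB, DF_TOTAL_PATCHED_KIB) evaluates to 1032 MiB, which is numerically the Metadata DUP total plus the System DUP total. One possible reading, not confirmed in this thread, is that the second copy of each DUP chunk is no longer counted in the reported size.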


> Example:
>   # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>   # mount /dev/vdf1 /mnt
>   # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>   # df -h /mnt
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/vdf1       3.0G 1018M  1.3G  45% /mnt
>   # btrfs fi show /dev/vdf1
> Label: none  uuid: f85d93dc-81f4-445d-91e5-6a5cd9563294
>   Total devices 2 FS bytes used 1001.53MiB
>   devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
>   devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2
> a. df -h should report Size as 2GiB rather than as 3GiB.
> Because this is a 2-device raid1, the limiting factor is devid 1 @2GiB.
> 
> b. df -h should report Avail as 0.97GiB or less, rather than as 1.3GiB.
>     1.85           (the capacity of the allocated chunk)
>    -1.018          (the file stored)
>    +(2-1.85=0.15)  (the residual capacity of the disks
>                     considering a raid1 fs)
>    ---------------
>    =   0.97
> 
> This patch drops the factor entirely and calculates the size observable to
> the user without considering which raid level the data is in and what the
> exact size on disk is.
> After this patch is applied:
>   # mkfs.btrfs -f /dev/vdf1 /dev/vdf2 -d raid1
>   # mount /dev/vdf1 /mnt
>   # dd if=/dev/zero of=/mnt/zero bs=1M count=1000
>   # df -h /mnt
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/vdf1       2.0G  1.3G  713M  66% /mnt
>   # df /mnt
> Filesystem     1K-blocks    Used Available Use% Mounted on
> /dev/vdf1        2097152 1359424    729536  66% /mnt
>   # btrfs fi show /dev/vdf1
> Label: none  uuid: e98c1321-645f-4457-b20d-4f41dc1cf2f4
>   Total devices 2 FS bytes used 1001.55MiB
>   devid    1 size 2.00GiB used 1.85GiB path /dev/vdf1
>   devid    2 size 4.00GiB used 1.83GiB path /dev/vdf2
> a). The @Size is 2G as we expected.
> b). @Available is 700M = 1.85G - 1.3G + (2G - 1.85G).
> c). @Used is changed to 1.3G rather than 1018M as above, because
> this patch does not treat the free space in the metadata chunk
> and system chunk as available to the user. Indeed, the user cannot
> use this space to store data, so it should not be counted
> as available. At the same time, it keeps
> @Used + @Available == @Size as far as possible for the user.
> 
> Signed-off-by: Dongsheng Yang 
> ---
> Changelog:
> v1 -> v2:
> a. account the total_bytes in the metadata chunk and
>    system chunk as used to the user.
> b. set data_stripes to the correct value in RAID0.
> 
>  fs/btrfs/extent-tree.c | 13 ++--
>  fs/btrfs/super.c       | 56 ++
>  2 files changed, 26 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index a84e00d..9954d60 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8571,7 +8571,6 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
>  {
>   struct btrfs_block_group_cache *block_group;
>   u64 free_bytes = 0;
> - int factor;
>  
>   list_for_each_entry(block_group, groups_list, list) {
>   spin_lock(&block_group->lock);
> @@ -8581,16 +8580,8 @@ static u64 __btrfs_get_ro_block_group_free_space(struct list_head *groups_list)
>   continue;
>   }
>  
> - if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 |
> -   BTRFS_BLOCK_GROUP_RAID10 |
> -

[patch] Btrfs, scrub: uninitialized variable in scrub_extent_for_parity()

2014-12-12 Thread Dan Carpenter

The only place where "ret" is set is the call to scrub_pages_for_parity(),
so jumping to the "if (ret)" test via the "skip" label doesn't make sense and
causes a static checker warning.

Signed-off-by: Dan Carpenter 
---
Static checker work.  Not tested.  There are some other valid looking
warnings from the same file:

fs/btrfs/scrub.c:2933 scrub_raid56_parity() warn: XXX passing uninitialized 'extent_physical'
fs/btrfs/scrub.c:2933 scrub_raid56_parity() warn: XXX passing uninitialized 'extent_dev'
fs/btrfs/scrub.c:2933 scrub_raid56_parity() warn: XXX passing uninitialized 'extent_mirror_num'

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f2bb13a..9e1569f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2607,9 +2607,9 @@ static int scrub_extent_for_parity(struct scrub_parity *sparity,
ret = scrub_pages_for_parity(sparity, logical, l, physical, dev,
 flags, gen, mirror_num,
 have_csum ? csum : NULL);
-skip:
if (ret)
return ret;
+skip:
len -= l;
logical += l;
physical += l;
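
To illustrate why the label has to move, here is a small stand-alone sketch of the same control flow. It is not the scrub code: walk_chunks(), process_chunk(), skip_mask and fail_at are hypothetical names invented for this example. The point is only that "ret" is tested immediately after the one call that assigns it, and the skip path lands after that test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for scrub_pages_for_parity(): fails on one chosen chunk. */
static int process_chunk(size_t index, size_t fail_at)
{
	return index == fail_at ? -5 : 0;
}

/*
 * Walk `count` chunks, skipping those whose bit is set in `skip_mask`.
 * Returns 0 on success or the first error from process_chunk().
 * *visited counts how many chunks the loop advanced past.
 */
static int walk_chunks(size_t count, unsigned int skip_mask, size_t fail_at,
		       size_t *visited)
{
	size_t i = 0;
	int ret;	/* deliberately left uninitialized, as in the patch */

	assert(visited != NULL);
	assert(count <= 32);	/* skip_mask only has 32 usable bits here */

	*visited = 0;
	while (i < count) {
		if (skip_mask & (1u << i))
			goto skip;	/* ret has not been assigned on this path */

		ret = process_chunk(i, fail_at);
		if (ret)		/* tested only right after it is set */
			return ret;
skip:
		(*visited)++;
		i++;
	}
	return 0;
}
```

With skip_mask = 0x1 and fail_at = 2, the walk skips the first chunk, processes the second, and returns -5 on the third after advancing past two chunks. If the label sat in front of the "if (ret)" test instead, the very first skipped chunk would read an unassigned "ret".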

Re: [GIT PULL] Btrfs for 3.19-rc

2014-12-12 Thread Chris Mason




On Fri, Dec 12, 2014 at 2:24 PM, Linus Torvalds wrote:
> On Fri, Dec 12, 2014 at 11:07 AM, Chris Mason  wrote:
>>
>> From a feature point of view, most of the code here comes from Miao Xie
>> and others at Fujitsu to implement scrubbing and replacing devices on
>> raid56.  This has been in development for a while, and it's a big
>> improvement.
>
> So this has probably happened before, and I just haven't been looking,
> but I thought I'd mention it.
>
> There are merges from github for this feature, and those merges aren't
> signed, and don't have merge messages. Maybe you actually verified all
> of it other ways, but there's no sign of it. I generally push back on
> merging unsigned stuff from random hosting places (to the point where
> I just refuse to do it, although it's possible that some pass through
> just due to inattention), and I think that's just good practice in
> general. And merges that don't explain what the merge does are just
> bad merges (they are extra annoying when they are back-merges, but
> it's a problem even otherwise).


Thanks, in this case he also posted the patches to the btrfs list.  
Using git pull is easier for all the obvious reasons, so I took the 
github tree.  It definitely looked right to me, but I'll compare the 
github code with his patches directly on top of rc6.


Next time I'll make sure they are signed though.

-chris

Re: Bug Fixed?

2014-12-12 Thread Chris Mason




On Fri, Dec 12, 2014 at 12:59 PM, nick  wrote:
> Greetings Chris and Josef,
> I am wondering if the bug at this URL,
> https://bugzilla.kernel.org/show_bug.cgi?id=82251
> is fixed. If it's not I am
> a idea of the issue after tracing a little today :).
> Nick



Hi Nick,

Yes, this was a bug in our workqueue usage and it has been fixed.

-chris



Re: [RFC PATCH v3 7/7] btrfs: enable swap file support

2014-12-12 Thread Omar Sandoval

On Fri, Dec 12, 2014 at 11:51:22AM +0100, David Sterba wrote:
> On Tue, Dec 09, 2014 at 05:45:48PM -0800, Omar Sandoval wrote:
> > +static void __clear_swapfile_extents(struct inode *inode)
> > +{
> > +   u64 isize = inode->i_size;
> > +   struct extent_map *em;
> > +   u64 start, len;
> > +
> > +   start = 0;
> > +   while (start < isize) {
> > +   len = isize - start;
> > +   em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> > +   if (IS_ERR(em))
> > +   return;
> 
> This could transiently fail if there's no memory to allocate the em, and
> would leak the following extents.
> 
This leak I was aware of, and at the time I didn't see a good way to get
around it. After all, if we can't get the current extent, there's no way
to iterate through the rest of them. Now I see that instead of doing
this at the btrfs_get_extent level, I can just go through all of the
extent_maps in the extent_map_tree.

> > +
> > +   clear_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> > +
> > +   start = extent_map_end(em);
> > +   free_extent_map(em);
> > +   }
> > +}
> > +
> > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
> > +  sector_t *span)
> > +{
> > +   struct inode *inode = file_inode(file);
> > +   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > +   struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
> > +   int ret = 0;
> > +   u64 isize = inode->i_size;
> > +   struct extent_state *cached_state = NULL;
> > +   struct extent_map *em;
> > +   u64 start, len;
> > +
> > +   if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> > +   /* Can't do direct I/O on a compressed file. */
> > +   btrfs_err(fs_info, "swapfile is compressed");
> > +   return -EINVAL;
> > +   }
> > +   if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> > +   /*
> > +* Going through the copy-on-write path while swapping pages
> > +* in/out and doing a bunch of allocations could stress the
> > +* memory management code that got us there in the first place,
> > +* and that's sure to be a bad time.
> > +*/
> > +   btrfs_err(fs_info, "swapfile is copy-on-write");
> > +   return -EINVAL;
> > +   }
> > +
> > +   lock_extent_bits(io_tree, 0, isize - 1, 0, &cached_state);
> > +
> > +   /*
> > +* All of the extents must be allocated and support direct I/O. Inline
> > +* extents and compressed extents fall back to buffered I/O, so those
> > +* are no good. Additionally, all of the extents must be safe for nocow.
> > +*/
> > +   atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles);
> > +   start = 0;
> > +   while (start < isize) {
> > +   len = isize - start;
> > +   em = btrfs_get_extent(inode, NULL, 0, start, len, 0);
> > +   if (IS_ERR(em)) {
> 
>   IS_ERR_OR_NULL(em)
> 
> From now on the em is valid and has to be free_extent_map()ed ...
> 
> > +   ret = PTR_ERR(em);
> > +   goto out;
> > +   }
> > +
> > +   if (test_bit(EXTENT_FLAG_VACANCY, &em->flags) ||
> > +   em->block_start == EXTENT_MAP_HOLE) {
> > +   btrfs_err(fs_info, "swapfile has holes");
> > +   ret = -EINVAL;
> 
> ... and all the error branches would miss it.
> 
> > +   goto out;
> > +   }
> > +   if (em->block_start == EXTENT_MAP_INLINE) {
> > +   /*
> > +* It's unlikely we'll ever actually find ourselves
> > +* here, as a file small enough to fit inline won't be
> > +* big enough to store more than the swap header, but in
> > +* case something changes in the future, let's catch it
> > +* here rather than later.
> > +*/
> > +   btrfs_err(fs_info, "swapfile is inline");
> > +   ret = -EINVAL;
> 
> here
> 
> > +   goto out;
> > +   }
> > +   if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> > +   btrfs_err(fs_info, "swapfile is compresed");
> > +   ret = -EINVAL;
> 
> here
> 
> > +   goto out;
> > +   }
> > +   ret = can_nocow_extent(inode, start, &len, NULL, NULL, NULL);
> > +   if (ret < 0) {
> 
> here
> 
> > +   goto out;
> > +   } else if (ret == 1) {
> > +   ret = 0;
> > +   } else {
> > +   btrfs_err(fs_info, "swapfile has extent requiring COW (%llu-%llu)",
> > + start, start + len - 1);
> > +   ret = -EINVAL;
> 
> here
> 
> > +   goto out;
> > +   }
> > +
> > +   set_bit(EXTENT_FLAG_SWAPFILE, &em->flags);
> > +
> > +   start = extent_map_

Re: [RFC PATCH v3 0/7] btrfs: implement swap file support

2014-12-12 Thread Omar Sandoval

On Fri, Dec 12, 2014 at 11:32:13AM +0100, David Sterba wrote:
> On Tue, Dec 09, 2014 at 05:45:41PM -0800, Omar Sandoval wrote:
> > After some discussion on the mailing list, I decided that for simplicity and
> > reliability, it's best to simply disallow COW files and files with shared
> > extents (like files with extents shared with a snapshot). From a user's
> > perspective, this means that a snapshotted subvolume cannot be used for a
> > swap file, but keeping the swap file in a separate subvolume that is never
> > snapshotted seems entirely reasonable to me.
> 
> Well, there are enough special cases how to do things on btrfs and I'd
> like to avoid introducing another one.
> 
> > An alternative suggestion was to
> > allow swap files to be snapshotted and to do an implied COW on swap file
> > activation, which I was ready to implement until I realized that we can't
> > permit snapshotting a subvolume with an active swap file, so this creates a
> > surprising inconsistency for users (in my opinion).
> 
> I still don't see why it's not possible to do the snapshot with an
> active swapfile.
> 
Creating a snapshot of an active swapfile would create shared extents,
so the next time we have to swap out a page, we'd have to do a COW,
which we're already trying pretty hard to avoid. We could allow it, but
it might lead to some unreliable behavior and unhappy emails to the
mailing list. However, I do see your point about wanting to avoid
special cases, so I'd like to get some more input from others on this as
well.

> > As with before, this functionality is tenuously tested in a virtual machine
> > with some artificial workloads, but it "works for me". I'm pretty happy with
> > the results on my end, so please comment away.
> 
> The non-btrfs changes can go independently and do not have to wait until
> we resolve the swap vs snapshot problem.
> 
> I did a simple test and it crashed instantly, lockdep complains:
> 
> memory: 2G
> swap file: 1G
> kernel: 3.17 + v3
> 
[snip]

That's my fault for not running with lockdep enabled. The problem here
is that swap-over-NFS is the only caller of nfs_direct_IO, so
nfs_direct_IO doesn't observe the normal direct_IO locking conventions
and neither does swap_writepage. I'll have to shuffle around some code
on the NFS side to fix that.

It looks like the non-btrfs parts of this might get a bit bigger, so
I'll look into getting that in separately.

Thanks!
-- 
Omar

[PATCH] Btrfs: don't delete csum for free space cache

2014-12-12 Thread Josef Bacik

We unconditionally delete csums for data extents, but we don't have csums for
free space cache, so all this does is force us to recow the csum root, which
will cause us to re-write the block group cache.  This patch fixes this by
noticing if we're a free space cache extent and simply skipping the delete csum
step.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 8 
 fs/btrfs/delayed-ref.h | 1 +
 fs/btrfs/extent-tree.c | 8 ++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 6d16bea..7c729c3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -605,6 +605,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
head_ref->is_data = is_data;
head_ref->ref_root = RB_ROOT;
head_ref->processing = 0;
+   head_ref->no_csums = 0;
 
spin_lock_init(&head_ref->lock);
mutex_init(&head_ref->mutex);
@@ -848,6 +849,13 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
bytenr, num_bytes, action, 1);
 
+   /*
+* If ref_root is the tree root then this is a block group space cache
+* extent and doesn't have csums, so we can set no_csums.
+*/
+   if (ref_root == BTRFS_ROOT_TREE_OBJECTID)
+   head_ref->no_csums = 1;
+
add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
   num_bytes, parent, ref_root, owner, offset,
   action, no_quota);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index a764e23..c464dd3 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -101,6 +101,7 @@ struct btrfs_delayed_ref_head {
 * the free has happened.
 */
unsigned int must_insert_reserved:1;
+   unsigned int no_csums:1;
unsigned int is_data:1;
unsigned int processing:1;
 };
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 74eb29d..23b704e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2312,7 +2312,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
if (insert_reserved) {
btrfs_pin_extent(root, node->bytenr,
 node->num_bytes, 1);
-   if (head->is_data) {
+   if (head->is_data && !head->no_csums) {
ret = btrfs_del_csums(trans, root,
  node->bytenr,
  node->num_bytes);
@@ -6094,7 +6094,11 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
}
btrfs_release_path(path);
 
-   if (is_data) {
+   /*
+* If the ref root is the tree root then this is a nodatasum
+* extent and we can skip the btrfs_del_csums step.
+*/
+   if (is_data && (root_objectid != BTRFS_ROOT_TREE_OBJECTID)) {
ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
if (ret) {
btrfs_abort_transaction(trans, extent_root, ret);
-- 
1.8.3.1
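
The decision this patch makes in both hunks can be condensed into a single predicate. The following is a user-space sketch for illustration only: extent_needs_csum_delete(), check_csum_rules() and SKETCH_ROOT_TREE_OBJECTID are hypothetical names, and the constant simply mirrors the kernel's BTRFS_ROOT_TREE_OBJECTID value of 1. The patch itself carries the information in a no_csums bit on the delayed ref head rather than in a helper like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors BTRFS_ROOT_TREE_OBJECTID, which is 1 in the kernel headers. */
#define SKETCH_ROOT_TREE_OBJECTID 1ULL

/*
 * Hypothetical helper: should freeing this extent also delete checksums?
 * Metadata extents never have csum items, and data extents owned by the
 * tree root are free space cache extents, which have no csums either.
 */
static bool extent_needs_csum_delete(bool is_data, uint64_t ref_root)
{
	if (!is_data)
		return false;
	return ref_root != SKETCH_ROOT_TREE_OBJECTID;
}

/* Small self-check covering the three cases discussed in the patch. */
static void check_csum_rules(void)
{
	const uint64_t example_fs_root = 5;	/* arbitrary non-tree-root owner */

	assert(!extent_needs_csum_delete(false, example_fs_root));
	assert(!extent_needs_csum_delete(true, SKETCH_ROOT_TREE_OBJECTID));
	assert(extent_needs_csum_delete(true, example_fs_root));
}
```

Skipping the delete in the second case matters because, as the commit message says, even a no-op csum delete forces the csum root to be COWed again.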

[PATCH] Btrfs: separate out the extent root update

2014-12-12 Thread Josef Bacik

Depending on where the extent root shows up in the dirty list we could end up
recowing the extent root a few times during commit.  This is inefficient, so
instead only track the other COW only roots and update them all at once, and
then do the extent root/block group update loop by itself to try and reduce the
amount of churn we do at commit time.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c |  1 -
 fs/btrfs/transaction.c | 34 +++---
 2 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2409718..b69402a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2713,7 +2713,6 @@ retry_root_backup:
ret = PTR_ERR(extent_root);
goto recovery_tree_root;
}
-   set_bit(BTRFS_ROOT_TRACK_DIRTY, &extent_root->state);
fs_info->extent_root = extent_root;
 
location.objectid = BTRFS_DEV_TREE_OBJECTID;
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index dcaae36..1b84595 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -933,17 +933,16 @@ int btrfs_write_and_wait_transaction(struct btrfs_trans_handle *trans,
 }
 
 /*
- * this is used to update the root pointer in the tree of tree roots.
+ * Update the extent_root pointer in the tree of tree roots.
  *
- * But, in the case of the extent allocation tree, updating the root
- * pointer may allocate blocks which may change the root of the extent
- * allocation tree.
+ * In the case of the extent allocation tree, updating the root pointer may
+ * allocate blocks which may change the root of the extent allocation tree.
  *
  * So, this loops and repeats and makes sure the cowonly root didn't
  * change while the root pointer was being updated in the metadata.
  */
-static int update_cowonly_root(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root)
+static int update_extent_root(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root)
 {
int ret;
u64 old_root_bytenr;
@@ -986,6 +985,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
 struct btrfs_root *root)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_root *tree_root = fs_info->tree_root;
struct list_head *next;
struct extent_buffer *eb;
int ret;
@@ -1017,26 +1017,30 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
if (ret)
return ret;
 
-   /* run_qgroups might have added some more refs */
-   ret = btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
-   if (ret)
-   return ret;
-
while (!list_empty(&fs_info->dirty_cowonly_roots)) {
next = fs_info->dirty_cowonly_roots.next;
list_del_init(next);
root = list_entry(next, struct btrfs_root, dirty_list);
 
-   if (root != fs_info->extent_root)
-   list_add_tail(&root->dirty_list,
- &trans->transaction->switch_commits);
-   ret = update_cowonly_root(trans, root);
+   list_add_tail(&root->dirty_list,
+ &trans->transaction->switch_commits);
+   btrfs_set_root_node(&root->root_item, root->node);
+   ret = btrfs_update_root(trans, tree_root, &root->root_key,
+   &root->root_item);
if (ret)
return ret;
}
 
+   ret = btrfs_run_delayed_refs(trans, root, (unsigned long)-1);
+   if (ret)
+   return ret;
+
list_add_tail(&fs_info->extent_root->dirty_list,
  &trans->transaction->switch_commits);
+   ret = update_extent_root(trans, fs_info->extent_root);
+   if (ret)
+   return ret;
+
btrfs_after_dev_replace_commit(fs_info);
 
return 0;
-- 
1.8.3.1

[PATCH] Btrfs: track dirty block groups on their own list

2014-12-12 Thread Josef Bacik

Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set.  This function
can get called several times during a commit.  So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time.  Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
free space cache out so it is much cleaner.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h|   5 +-
 fs/btrfs/extent-tree.c  | 167 ++--
 fs/btrfs/free-space-cache.c |   8 ++-
 fs/btrfs/transaction.c  |  13 ++--
 fs/btrfs/transaction.h  |   2 +
 5 files changed, 71 insertions(+), 124 deletions(-)
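
The idea described in the commit message, queueing a block group the first time it is dirtied and later walking only that queue, can be sketched in user space as follows. This is an assumption-laden illustration, not the kernel code: struct block_group, struct transaction, mark_dirty() and write_dirty() are hypothetical, and a hand-rolled singly linked list stands in for the kernel's struct list_head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a block group. */
struct block_group {
	int id;
	bool on_dirty_list;		/* true while linked into the dirty list */
	struct block_group *next_dirty;	/* next entry on the dirty list */
};

/* Hypothetical stand-in for the per-transaction state. */
struct transaction {
	struct block_group *dirty_head;
	struct block_group *dirty_tail;
};

/* Queue a block group the first time it is dirtied; later calls are no-ops. */
static void mark_dirty(struct transaction *trans, struct block_group *bg)
{
	assert(trans != NULL && bg != NULL);
	if (bg->on_dirty_list)
		return;
	bg->on_dirty_list = true;
	bg->next_dirty = NULL;
	if (trans->dirty_tail)
		trans->dirty_tail->next_dirty = bg;
	else
		trans->dirty_head = bg;
	trans->dirty_tail = bg;
}

/*
 * Visit only the queued block groups and empty the list.
 * Returns how many block groups were visited, so the cost depends on
 * the number of dirty block groups rather than on the total number.
 */
static size_t write_dirty(struct transaction *trans)
{
	size_t written = 0;

	assert(trans != NULL);
	while (trans->dirty_head) {
		struct block_group *bg = trans->dirty_head;

		trans->dirty_head = bg->next_dirty;
		bg->next_dirty = NULL;
		bg->on_dirty_list = false;
		written++;	/* a real implementation would write the item here */
	}
	trans->dirty_tail = NULL;
	return written;
}
```

With a thousand block groups of which two are dirtied, write_dirty() touches two entries instead of scanning all thousand, and a second call right afterwards touches none.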

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 334a76e..ee8b8b8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1237,7 +1237,6 @@ enum btrfs_disk_cache_state {
BTRFS_DC_ERROR  = 1,
BTRFS_DC_CLEAR  = 2,
BTRFS_DC_SETUP  = 3,
-   BTRFS_DC_NEED_WRITE = 4,
 };
 
 struct btrfs_caching_control {
@@ -1275,7 +1274,6 @@ struct btrfs_block_group_cache {
unsigned long full_stripe_len;
 
unsigned int ro:1;
-   unsigned int dirty:1;
unsigned int iref:1;
 
int disk_cache_state;
@@ -1309,6 +1307,9 @@ struct btrfs_block_group_cache {
 
/* For read-only block groups */
struct list_head ro_list;
+
+   /* For dirty block groups */
+   struct list_head dirty_list;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 23b704e..4ddc838 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -74,8 +74,9 @@ enum {
RESERVE_ALLOC_NO_ACCOUNT = 2,
 };
 
-static int update_block_group(struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, int alloc);
+static int update_block_group(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root, u64 bytenr,
+ u64 num_bytes, int alloc);
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
u64 bytenr, u64 num_bytes, u64 parent,
@@ -3315,120 +3316,42 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
   struct btrfs_root *root)
 {
struct btrfs_block_group_cache *cache;
-   int err = 0;
+   struct btrfs_transaction *cur_trans = trans->transaction;
+   int ret = 0;
struct btrfs_path *path;
-   u64 last = 0;
+
+   if (list_empty(&cur_trans->dirty_bgs))
+   return 0;
 
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
 
-again:
-   while (1) {
-   cache = btrfs_lookup_first_block_group(root->fs_info, last);
-   while (cache) {
-   if (cache->disk_cache_state == BTRFS_DC_CLEAR)
-   break;
-   cache = next_block_group(root, cache);
-   }
-   if (!cache) {
-   if (last == 0)
-   break;
-   last = 0;
-   continue;
-   }
-   err = cache_save_setup(cache, trans, path);
-   last = cache->key.objectid + cache->key.offset;
-   btrfs_put_block_group(cache);
-   }
-
-   while (1) {
-   if (last == 0) {
-   err = btrfs_run_delayed_refs(trans, root,
-(unsigned long)-1);
-   if (err) /* File system offline */
-   goto out;
-   }
-
-   cache = btrfs_lookup_first_block_group(root->fs_info, last);
-   while (cache) {
-   if (cache->disk_cache_state == BTRFS_DC_CLEAR) {
-   btrfs_put_block_group(cache);
-   goto again;
-   }
-
-   if (cache->dirty)
-   break;
-   cache = next_block_group(root, cache);
-   }
-   if (!cache) {
-   if (last == 0)
-   break;
-   last = 0;
-   continue;
-   }
-
-   if (cache->disk_cache_state == BTRFS_DC_SETUP)
-   cache->disk_cache_state = BTRFS_DC_NEED_

[PATCH] Btrfs: abort transaction if we don't find the block group

2014-12-12 Thread Josef Bacik

We shouldn't BUG_ON() if there is corruption.  I hit this while testing my block
group patch and the abort worked properly.  Thanks,
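
The change boils down to treating "item not found" as an error to propagate rather than a reason to crash. A minimal Python sketch of that mapping, assuming the usual convention that the search returns a negative errno on failure, 0 when the key is found and a positive value when it is not (the function name is hypothetical):

```python
import errno


def normalize_search_result(ret):
    """Turn a positive 'not found' result into -ENOENT; pass 0 and errors through."""
    if ret > 0:
        return -errno.ENOENT
    return ret
```

Any non-zero value coming out of this helper would then take the abort-transaction path instead of hitting BUG_ON().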

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4ddc838..a86e55a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3152,9 +3152,11 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 	struct extent_buffer *leaf;
 
 	ret = btrfs_search_slot(trans, extent_root, &cache->key, path, 0, 1);
-	if (ret < 0)
+	if (ret) {
+		if (ret > 0)
+			ret = -ENOENT;
 		goto fail;
-	BUG_ON(ret); /* Corruption */
+	}
 
 	leaf = path->nodes[0];
 	bi = btrfs_item_ptr_offset(leaf, path->slots[0]);
@@ -3162,11 +3164,9 @@ static int write_one_cache_group(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(leaf);
 	btrfs_release_path(path);
 fail:
-	if (ret) {
+	if (ret)
 		btrfs_abort_transaction(trans, root, ret);
-		return ret;
-	}
-	return 0;
+	return ret;
 
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!

2014-12-12 Thread Robert White


On 12/12/2014 06:37 AM, Tomasz Chmielewski wrote:
> FYI, still seeing this with 3.18 (scrub passes fine on this filesystem).
>
> # time btrfs balance start /mnt/lxc2
> Segmentation fault
>
> real    322m32.153s
> user    0m0.000s
> sys     16m0.930s
>
> (...)
>
> [20306.981773] BTRFS (device sdd1): parent transid verify failed on
> 5568935395328 wanted 70315 found 102416
> [20306.983962] BTRFS (device sdd1): parent transid verify failed on
> 5568935395328 wanted 70315 found 102416


Uh... isn't fixing an invalid transaction id a job for btrfsck? I don't 
see anything in linux/fs/btrfs/*.c that would fix this sort of semantic 
error, like ever.


I think that this is a case of thing_a points to thing_b and thing_b is 
much newer (transaction 102416) than thing_a thinks it should be 
(transaction 70315).
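
To make that concrete, here is a small Python sketch of the kind of check being described: the parent records the generation it expects, the child carries the generation it was actually written in, and a mismatch is reported. The function name is hypothetical and the message format only imitates the kernel log line quoted above.

```python
def check_parent_transid(wanted, found):
    """Return None when the generations agree, otherwise a description of the mismatch."""
    if wanted == found:
        return None
    return "parent transid verify failed: wanted %d found %d" % (wanted, found)


# Values taken from the log lines quoted above.
message = check_parent_transid(70315, 102416)
```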


In another thread [that was discussing SMART] you talked about replacing 
a drive and then needing to do some patching-up of the result because of 
drive failures. Is this the same filesystem where that happened? That 
kind of work could leave you in this state if thing_a was one of the 
damaged bits and the system had to fall back to an earlier version.


So I'd run a btrfsck from the very recent btrfs-tools package. If it 
tells you to run it again with --repair, then do that.


By my reading balance is simply refusing to touch an extent that doesn't 
seem to make sense because it can't be sure it wouldn't undermine some 
active data if it relocated the block.



Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!

2014-12-12 Thread Tomasz Chmielewski


On 2014-12-12 22:36, Robert White wrote:
> In another thread [that was discussing SMART] you talked about
> replacing a drive and then needing to do some patching-up of the
> result because of drive failures. Is this the same filesystem where
> that happened?


Nope, it was on a different server.

--
Tomasz Chmielewski
http://www.sslrack.com


Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread Robert White


On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
>> So RAID5 with three media M is
>>
>> M    M    M
>> D1   D2   P(a)
>> D3   P(b) D4
>> P(c) D5   D6
>
> RAID5 with two media is well defined, and looks like this:
>
> M    M
> D1   P(a)
> P(b) D2
> D3   P(c)

Like I said in the other fork of this thread... I see (now) that the 
math works but I can find no trace of anyone having ever implemented 
this for an arity less than 3 in any RAID-greater-than-one paradigm 
(outside btrfs and its associated materials).


It's like talking about a two-wheeled tricycle. 8-)

I would _genuinely_ like to see any third party discussion of this. It 
just isn't done (probably because, as you've shown, it is just a really 
complicated and CPU intensive way to end up with a simple mirror). I 
spent several hours looking. I can see the math works, and I understand 
what you are doing (as I said at some length in the grandparent message) 
but it "just isn't done".
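
The "complicated way to end up with a mirror" point can be checked with a few lines of Python. This is a sketch of plain XOR parity, assuming equal-length strips; it is not btrfs code and the function name is made up.

```python
def xor_parity(strips):
    """XOR equal-length data strips into a parity strip, starting from zeros."""
    parity = bytes(len(strips[0]))
    for strip in strips:
        parity = bytes(a ^ b for a, b in zip(parity, strip))
    return parity


# Two-device RAID5: one data strip per stripe, so parity is a copy of the data.
two_device_parity = xor_parity([b"btrfs"])

# Three-device RAID5: two data strips; either one can be rebuilt from the rest.
d1, d2 = b"abcde", b"12345"
p = xor_parity([d1, d2])
rebuilt_d1 = xor_parity([p, d2])
```

With a single data strip the XOR against the zeroed buffer is the identity, which is why the two-device layout behaves like a mirror.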


The reason I use the tricycle example is that, while most people know 
this instinctively few are aware of the fact that going from two wheels 
to three-or-more wheels reverses the steering paradigm. On a bike you 
push-left lean-left and go-left. At the higher arity vehicles (including 
adding a side-car to a bike) you push-right go left (you lean left too, 
but that's just to keep from nosing over 8-). I find that quite apt in 
the whole RAID1 vs RAID5 discussion since the former is about copying 
one-or-more times and the latter is about starting with a theoretically 
zeroed buffer and doing reversible checksumming into it.


I doubt that I will be the last person to be confused by BTRFS' 
implementation of a two-wheeled tricycle.


You're going to get a lot of mail over the years. 8-)


MEANWHILE

the system really needs to be able to explicitly express and support the 
"missing" media paradigm.


 M     x     M
 D1    .     P(a)
 D3    .     D4
 P(c)  .     D6

The correct logic here to "remove" (e.g. "replace with nothing" instead 
of "delete") a media just doesn't seem to exist. And it's already 
painfully missing in the RAID1 situation.


If I have a system with N SATA ports, and I have connected N drives, and 
device M is starting to fail... I need to be able to disconnect M and 
then connect M(new). Possibly with a non-trivial amount of time in 
there. For all RAID levels greater than zero this is a natural operation 
in a degraded mode. And for a nearly full filesystem the shrink 
operation that is btrfs device delete would not work. And for any 
nontrivially occupied filesystem it would be way slow, and need to be 
reversed for another way-slow interval.


So I need to be able to "replace" a drive with a "nothing" so that the 
number of active media becomes N-1 but the arity remains N.
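
A tiny Python sketch of the bookkeeping being asked for, purely hypothetical (btrfs exposes no such table in this form): the device slot stays in place and is marked missing, so the arity is unchanged while the count of active media drops by one.

```python
def mark_missing(devices, devid):
    """Return a copy of the device table with one slot marked missing (None)."""
    return {d: (None if d == devid else path) for d, path in devices.items()}


def arity(devices):
    return len(devices)


def active(devices):
    return sum(1 for path in devices.values() if path is not None)


table = {1: "/dev/sda", 2: "/dev/sdb", 3: "/dev/sdc"}
degraded = mark_missing(table, 3)
```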


mdadm has the "missing" keyword. The Device Mapper has the "zero" 
target. As near as I can tell btrfs has got nothing in this functional slot.


Imagine, if you will, a block device that is the anti-/dev/null. All 
operations on this block device return EFAULT. Let's call it 
/dev/nothing. And let's say I have a /dev/sdc that has to come out 
immediately (and all my stuff is RAID1/5/6).  The operational chain would be


btrfs replace start /dev/sdc /dev/nothing /
(time passes, physical device is removed and replaced)
btrfs replace start /dev/nothing /dev/sdc /

Now that's good-ish, but really the first replace is pernicious. The 
internal state for the filesystem should just be able to record that 
device id 3 (assuming /dev/sda is devid1 and b is 2 etc for this 
example) is just gone. The replace-with-nothing becomes more-or-less 
instant.


The first replace is also pernicious if it's the second media failure on 
a fully RAID6 array since that would be trying to put the same kernel level 
device in the array twice.


The restore operation, the replace of the nothing with the something, 
remains fully elaborate.


The "nothing" devices need to show up in the device id tables for a 
running array in their geographically correct positions and all that.


Without this "missing" status as a first-class part of the system, 
dealing with failures and communicating about those failures with the 
operator will become vexatious.



[The use of "device delete" and "device add" as changes in arity and 
size, and its inapplicability to cases where failure is being dealt with 
absent a change of arity, could be clearer in the documentation.]


Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!

2014-12-12 Thread Robert White


On 12/12/2014 01:46 PM, Tomasz Chmielewski wrote:
> On 2014-12-12 22:36, Robert White wrote:
>> In another thread [that was discussing SMART] you talked about
>> replacing a drive and then needing to do some patching-up of the
>> result because of drive failures. Is this the same filesystem where
>> that happened?
>
> Nope, it was on a different server.

okay, so how did the btrfsck turn out?



Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!

2014-12-12 Thread Tomasz Chmielewski


On 2014-12-12 23:34, Robert White wrote:
> On 12/12/2014 01:46 PM, Tomasz Chmielewski wrote:
>> On 2014-12-12 22:36, Robert White wrote:
>>> In another thread [that was discussing SMART] you talked about
>>> replacing a drive and then needing to do some patching-up of the
>>> result because of drive failures. Is this the same filesystem where
>>> that happened?
>>
>> Nope, it was on a different server.
>
> okay, so how did the btrfsck turn out?


# time btrfsck /dev/sdc1 &>/root/btrfsck.log

real    22m0.140s
user    0m3.090s
sys     0m6.120s

root@bkp010 /usr/src/btrfs-progs # echo $?
1

# cat /root/btrfsck.log
root item for root 8681, current bytenr 5568935395328, current gen 
70315, current level 2, new bytenr 5569014104064, new gen 70316, new 
level 2

Found 1 roots with an outdated root item.
Please run a filesystem check with the option --repair to fix them.


Now, I'm a bit afraid to run --repair - as far as I remember, some time 
ago, it used to do all weird things except the actual repair.
Is it better nowadays? I'm using latest clone from 
git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git



--
Tomasz Chmielewski
http://www.sslrack.com


RAID0 extent sizes?

2014-12-12 Thread Robert White

I've seen it mentioned here that generally data extents are 1G and 
metadata extents are 256M.


Is that per-drive or per-stripe in the case of RAID0?

That is, if I have data mode raid0 across N drives does the system 
allocate one 1G extent on each drive making the full stripe allocation 
N-gigs; or does it allocate 1/Nth(gig) on each drive making the total 
new allocation 1G?


Does the raid0 have any arity constraints (like how raid1 is always 
arity-2)?



Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!

2014-12-12 Thread Robert White


On 12/12/2014 02:46 PM, Tomasz Chmielewski wrote:
> On 2014-12-12 23:34, Robert White wrote:
>> On 12/12/2014 01:46 PM, Tomasz Chmielewski wrote:
>>> On 2014-12-12 22:36, Robert White wrote:
>>>> In another thread [that was discussing SMART] you talked about
>>>> replacing a drive and then needing to do some patching-up of the
>>>> result because of drive failures. Is this the same filesystem where
>>>> that happened?
>>>
>>> Nope, it was on a different server.
>>
>> okay, so how did the btrfsck turn out?
>
> # time btrfsck /dev/sdc1 &>/root/btrfsck.log
>
> real    22m0.140s
> user    0m3.090s
> sys     0m6.120s
>
> root@bkp010 /usr/src/btrfs-progs # echo $?
> 1
>
> # cat /root/btrfsck.log
> root item for root 8681, current bytenr 5568935395328, current gen
> 70315, current level 2, new bytenr 5569014104064, new gen 70316, new
> level 2
> Found 1 roots with an outdated root item.
> Please run a filesystem check with the option --repair to fix them.
>
> Now, I'm a bit afraid to run --repair - as far as I remember, some time
> ago, it used to do all weird things except the actual repair.
> Is it better nowadays? I'm using latest clone from
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git




I don't have the history to answer this definitively, but I don't think 
you have a choice. Nothing else is going to touch that error.


I have not seen any "oh my god, btrfsck just ate my filesystem" errors 
since I joined the list -- but I am a relative newcomer.


I know that you, of course, as a conscientious and well-traveled system 
administrator, already have a current backup since you are doing storage 
maintenance... right? 8-)



Re: RAID0 extent sizes?

2014-12-12 Thread Hugo Mills

On Fri, Dec 12, 2014 at 02:54:24PM -0800, Robert White wrote:
> I've seen it mentioned here that generally data extents are 1G and
> metadata extents are 256M.
> 
> Is that per-drive or per-stripe in the case of RAID0?
> 
> That is, if I have data mode raid0 across N drives does the system
> allocate one 1G extent on each drive making the full stripe
> allocation N-gigs; or does it allocate 1/Nth(gig) on each drive
> making the total new allocation 1G?
> 
> Does the raid0 have any arity constraints (like how raid1 is always
> arity-2)?

   The 1 GiB (or 256 MiB for metadata) is the allocation unit. So for
striped RAID levels (like 0, 10, 5, 6), the FS will allocate as many
as it can across all the available devices, and stripe within those.

   Now on to your question -- the stripes within the allocation unit
are 64 KiB in size, so the first 64k goes on the first device, the
next 64k on the second device, and so on.

   The minimum stripe width (i.e. number of devices) is 2 for RAID-0,
4 for RAID-10, 2 for RAID-5 and 3 for RAID-6.
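
A short Python sketch of the round-robin placement just described, assuming plain RAID-0 with 64 KiB stripe elements and devices numbered from 0. The function name is hypothetical and the real allocator tracks considerably more state.

```python
STRIPE_LEN = 64 * 1024  # 64 KiB per stripe element, as stated above


def device_for_offset(offset, num_devices):
    """Index of the device that holds the byte at `offset` within a RAID-0 chunk."""
    return (offset // STRIPE_LEN) % num_devices
```

On five devices, bytes 0 to 65535 land on device 0, the next 64 KiB on device 1, and the sixth 64 KiB block wraps around to device 0 again.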

   Hugo.

-- 
Hugo Mills | I get nervous when I see words like 'mayhaps' in a
hugo@... carfax.org.uk | novel, because I fear that just round the corner is
http://carfax.org.uk/  | lurking 'forsooth'
PGP: 65E74AC0  |  GRRM's UK editor


Re: RAID0 extent sizes?

2014-12-12 Thread Robert White


On 12/12/2014 02:59 PM, Hugo Mills wrote:
> On Fri, Dec 12, 2014 at 02:54:24PM -0800, Robert White wrote:
>> I've seen it mentioned here that generally data extents are 1G and
>> metadata extents are 256M.
>>
>> Is that per-drive or per-stripe in the case of RAID0?
>>
>> That is, if I have data mode raid0 across N drives does the system
>> allocate one 1G extent on each drive making the full stripe
>> allocation N-gigs; or does it allocate 1/Nth(gig) on each drive
>> making the total new allocation 1G?
>>
>> Does the raid0 have any arity constraints (like how raid1 is always
>> arity-2)?
>
> The 1 GiB (or 256 MiB for metadata) is the allocation unit. So for
> striped RAID levels (like 0, 10, 5, 6), the FS will allocate as many
> as it can across all the available devices, and stripe within those.
>
> Now on to your question -- the stripes within the allocation unit
> are 64 KiB in size, so the first 64k goes on the first device, the
> next 64k on the second device, and so on.
>
> The minimum stripe width (e.g. number of devices) is 2 for RAID-0,
> 4 for RAID-10, 2 for RAID-5 and 3 for RAID-6.
>
> Hugo.



[So to check my understanding, and just sticking to RAID-0 data only].

So for RAID-0 data on 5 drives with ample space, the expected outcome of 
allocating more data space is 5GiB, one 1GiB allocated on each drive.


If one drive is too full (say it was smaller) and didn't have 1G of 
contiguous space available, the allocation would simply fail.


The net effect is to create an association of allocations, one on each 
available drive that had "enough space", each of which will contribute 
exactly 1GiB to the association. So every time the data space allocation 
expands it's going to expand by N-GiB total on an N-drive data=raid0 system.


Since data and metadata are separate you can end up being "out of space" 
for big files but still be able to create files small enough to fit into 
the metadata with the inode.


Am I correct?

Re: RAID0 extent sizes?

2014-12-12 Thread Hugo Mills

On Fri, Dec 12, 2014 at 03:25:19PM -0800, Robert White wrote:
> On 12/12/2014 02:59 PM, Hugo Mills wrote:
> >On Fri, Dec 12, 2014 at 02:54:24PM -0800, Robert White wrote:
> >>I've seen it mentioned here that generally data extents are 1G and
> >>metadata extents are 256M.
> >>
> >>Is that per-drive or per-stripe in the case of RAID0?
> >>
> >>That is, if I have data mode raid0 across N drives does the system
> >>allocate one 1G extent on each drive making the full stripe
> >>allocation N-gigs; or does it allocate 1/Nth(gig) on each drive
> >>making the total new allocation 1G?
> >>
> >>Does the raid0 have any arity constraints (like how raid1 is always
> >>arity-2)?
> >
> >The 1 GiB (or 256 MiB for metadata) is the allocation unit. So for
> >striped RAID levels (like 0, 10, 5, 6), the FS will allocate as many
> >as it can across all the available devices, and stripe within those.
> >
> >Now on to your question -- the stripes within the allocation unit
> >are 64 KiB in size, so the first 64k goes on the first device, the
> >next 64k on the second device, and so on.
> >
> >The minimum stripe width (e.g. number of devices) is 2 for RAID-0,
> >4 for RAID-10, 2 for RAID-5 and 3 for RAID-6.
> >
> >Hugo.
> >
> 
> [So to check my understanding, and just sticking to RAID-0 data only].
> 
> So for RAID-0 data on 5 drives with ample space, the expected
> outcome of allocating more data space is 5GiB, one 1GiB allocated on
> each drive.

   Correct.

> If one drive is too full (say it was smaller) and didn't have 1G of
> contiguous space available, the allocation would simply fail.

   No, it would allocate on the remaining 4 devices instead, with a
total of 4 GiB of space. The allocation in these cases is the maximum
feasible, not precisely the number of devices.

> The net effect is to create an association of allocations, one on
> each available drive that had "enough space", each of which will
> contribute exactly 1GiB to the association.

   Yes.

> So every time the data
> space allocation expands its going to expand by N-GiB total on an
> N-drive data=raid0 system.

   Not necessarily -- if one device is already full (because it's
smaller), then the number of devices will decrease as appropriate,
down to the minimum of 2.
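
The "maximum feasible" rule can be sketched in Python as follows. This is a simplification under stated assumptions: every device with at least 1 GiB free contributes exactly 1 GiB, and the allocation fails below the RAID-0 minimum of 2 devices. The function name is hypothetical and the real allocator takes more into account.

```python
GIB = 1024 ** 3
MIN_RAID0_DEVICES = 2  # minimum stripe width for RAID-0, per the reply above


def raid0_allocation(free_bytes_per_device):
    """Bytes a new RAID-0 data allocation would add, or 0 if it cannot be made."""
    usable = [free for free in free_bytes_per_device if free >= GIB]
    if len(usable) < MIN_RAID0_DEVICES:
        return 0
    return len(usable) * GIB
```

Five devices with ample space yield 5 GiB; if one of them is full, the same call yields 4 GiB instead of failing.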

> Since data and metadata are separate you can end up being "out of
> space" for big files but still be able to create files small enough
> to fit into the metadata with the inode.

   Yes, but this isn't related to the number of devices in striped
RAID allocations.

> Am I correct?

   Partially. :)

   Hugo.

-- 
Hugo Mills | "How deep will this sub go?"
hugo@... carfax.org.uk | "Oh, she'll go all the way to the bottom if we don't
http://carfax.org.uk/  | stop her."
PGP: 65E74AC0  |  U571



Re: RAID0 extent sizes?

2014-12-12 Thread Chris Murphy

Based on looking at how identically sized, empty, qcow2 files grow
when they're added to a Btrfs volume, the 1GiB Btrfs chunk or
allocation unit, doesn't have an immediate physical allocation. It's
more of a virtual thing, but it has a physical manifestation.

Single profile, 5 disks: As data is copied, one drive has one chunk
allocated to it, and data is copied into that chunk and thus into one
qcow2 file until the qcow2 file is about 1GiB in size. Then it stops
growing and another qcow2 file starts to grow, again up to 1GiB in
size. Until all qcow2s are 1GiB. Even when they all look identical, they
actually aren't: chances are one of them holds some little bit of extra
metadata, so the allocator is going to pick the block device with the
most free space next, which is also how this plays out on unevenly sized
devices.

For raid0,5,6 I'm not sure if my interpretation is correct. But what I
see is, at the time the chunk is allocated, the block devices with
sufficient free space belong to it; and grow in 64KB increments. e.g.
5 qcow2's in a Btrfs data raid0 will grow to ~1GiB in size each as I
copy 5GiB of data to the volume. Since I used raid1 metadata in all
cases, the qcow2's are a bit uneven in practice. If I then add a 6th
qcow2, I don't immediately notice it grow. I *think* it's because the
most recent chunk is still only writing to the block devices available
at the time that chunk was created; shortly though I start seeing that
6th qcow2 grow. This suggests that this volume has both 5-strip
(5-device) chunks and 6-strip chunks.

*sigh* for what it's worth, Btrfs chunk is not the same thing as mdadm
chunk. The mdadm chunk is a strip, since that's what SNIA's dictionary
calls it. A stripe is strip x numdevices. So if you have a 5 device
raid0, that's 5 strips of 64KB each, or a stripe size of 320KB meaning
it takes a file of at least 320KB to write to all 5 disks at the same
time.
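
The strip/stripe arithmetic above, as a small Python sketch. It assumes 64 KiB strips and a write that starts at the beginning of a stripe; the function names are made up for illustration.

```python
STRIP_KIB = 64  # one strip per device, per the terminology above


def full_stripe_kib(num_devices):
    """Size of one full stripe: strip size times number of devices."""
    return STRIP_KIB * num_devices


def devices_touched(file_kib, num_devices):
    """How many devices a stripe-aligned write of `file_kib` KiB reaches."""
    strips_needed = -(-file_kib // STRIP_KIB)  # ceiling division
    return min(num_devices, strips_needed)
```

For a 5-device raid0 this gives a 320 KiB stripe, and a 100 KiB file only reaches two of the five devices.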

And as a side note, in the latest Phoronix raid tests, Btrfs is
kicking ass compared to most everything else.

Chris Murphy

Re: [PATCH v2 1/3] Btrfs: get more accurate output in df command.

2014-12-12 Thread Duncan

Goffredo Baroncelli posted on Fri, 12 Dec 2014 19:00:20 +0100 as
excerpted:

> $ sudo ./btrfs fi df /mnt/btrfs1/
> Data, RAID1: total=1.00GiB, used=512.00KiB
> Data, single: total=8.00MiB, used=0.00B
> System, RAID1: total=8.00MiB, used=16.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, RAID1: total=1.00GiB, used=112.00KiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=16.00MiB, used=0.00B
> 
> In this case the filesystem is empty (it was a new filesystem !).
> However a 1G metadata chunk was already allocated. This is the reasons
> why the free space is only 4Gb.

Trivial(?) correction.

Metadata chunks are quarter-gig, not 1 gig.  So that's 4 quarter-gig 
metadata chunks allocated, not a (one/single) 1-gig metadata chunk.

> On my system the ratio metadata/data is 234MB/8.82GB = ~3%, so ignoring
> the metadata chunk from the free space may not be a big problem. 

Presumably your use-case is primarily reasonably large files; too large 
for their data to be tucked directly into metadata instead of allocating 
an extent from a data chunk.

That's not always going to be the case.  And given the multi-device 
default allocation of raid1 metadata, single data, files small enough to 
fit into metadata have a default size effect double their actual size!  
(Tho it can be noted that given btrfs' 4 KiB standard block size, without 
metadata packing there'd still be an outsized effect for files smaller 
than half that, 2 KiB or under, but there it'd be in data chunks, not 
metadata.)
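
A back-of-the-envelope Python sketch of the two cases. It assumes a 4 KiB block size, counts only the file data itself (no item headers or other metadata overhead), and treats raid1 as exactly two copies. The function names are hypothetical.

```python
BLOCK = 4096  # btrfs' standard 4 KiB block size, as mentioned above


def inline_raid1_footprint(size):
    """Raw bytes used when the file data is packed into raid1 metadata (two copies)."""
    return 2 * size


def data_block_footprint(size):
    """Raw bytes used in single-profile data, rounded up to whole blocks."""
    blocks = -(-size // BLOCK)  # ceiling division
    return blocks * BLOCK
```

A 1000-byte file costs 2000 raw bytes when packed into raid1 metadata, versus a full 4096-byte block if it had to live in a data chunk; at 2 KiB the data-chunk cost is exactly double the file size.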

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: A note on spotting "bugs" [Was: ENOSPC after conversion]

2014-12-12 Thread Duncan

Robert White posted on Fri, 12 Dec 2014 05:29:58 -0800 as excerpted:

> This still doesnt say _anything_ is wrong with your filesystem except
> that it doesn't have enough _raw_ space to create a 2-ish gig extent.

What's wrong with the filesystem is that there shouldn't /be/ a need to 
create a 2-ish gig extent.  All btrfs native structures are 1 GiB each or 
smaller, and the completed-without-error btrfs fi defrag should have 
eliminated any > 1 GiB structures remaining from the conversion from 
ext*, such that btrfs balance only has to deal with <= 1 GiB structures.

So that balance is having to deal with a 2-ish gig extent at all is 
indicative of a bug.  Balance isn't prepared to have to allocate 2-ish 
GiB extents in the first place as that's beyond its design specs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread Duncan

Robert White posted on Fri, 12 Dec 2014 03:16:03 -0800 as excerpted:

> Perhaps is just a tautological belief that someone here didn't buy into.
> Like how people keep partitioning drives into little slices for things
> because thats the preserved wisdom from early eighties.

While I absolutely agree with your raid5 sentiments (which is exactly 
what I suppose they might be; I'm getting a bit of an education in that 
regard myself, here)...

In the context of the 80s, or even the 90s, nothing about multi-gigabyte 
could be considered "little"! =:^)

In fact, while it most assuredly dates me, it /still/ feels a bit odd 
referring to the 1 GiB btrfs default threshold for mixed-bg-mode as 
"small", given that I distinctly remember wondering how long it might 
take me to fill my first 1 GB (not GiB, unfortunately) drive, tho by that 
time I did have enough experience to know I'd eventually be dealing with 
multi-gig as at the time I was dealing with multi-meg.

More to the point, however...

Those partitions have saved my a** quite a few times over the years.  
Among other things, partitioning allows me to keep my (8 GiB) rootfs an 
entirely separate filesystem that's mounted read-only by default, which 
has kept it undamaged and the tools on it still available to help recover 
my other filesystems, when /var/log and /home were damaged due to a hard 
shutdown recently.

And some years ago I had an AC failure here in Phoenix in the middle of 
the summer, resulting in a physical head-crash and loss of the operating 
partitions on my disk in use at the time, while the backup partitions on 
the same device remained intact, such that after cooldown I actually 
continued to use that disk for some time, mounting the damaged partitions 
only to recover the most recent copies of what I could, updating the 
backups which were now promoted to operational.

Sure, technology such as LVM can do similar and is more flexible in some 
ways, but unfortunately it requires userspace and thus an initr* in 
order to handle a root on the same technology.  Otherwise, root must be 
treated differently, and then you have partitioning again.

Additionally, LVM is yet another layer of software that can and does go 
wrong and itself need fixing.  Partitioning is too, to some extent, but in 
practice it has been pretty bullet-proof compared to technologies such as 
LVM and btrfs-subvolumes.  LVM has some way to go before it's as robust 
as partitioning, and of course btrfs with its subvolumes isn't really 
even completely stable yet.  Further, btrfs doesn't well limit damage of 
a subvolume to just that subvolume (that head-crash scenario would have 
almost certainly been a total loss on btrfs subvolumes), the way 
partitioning tends to do.  And LVM's very flexibility means it doesn't 
normally have that sort of damage limitation either.  It certainly can, 
but doing so severely reduces its flexibility, making going back to 
regular partitions to avoid the complexity and additional points of 
failure entirely a rather viable and often better choice.

Meanwhile, technology such as EFI and GPT is breathing new life into 
partitioning, making it more reliable (checksummed redundant partition 
tables), more useful/flexible (killing the primary/secondary/logical 
divisions and adding partition names/labels and a far larger typing 
space), and creating yet more uses for partitioning in the first place, 
due to separate reserved EFI and legacy-BIOS partition types.

Tho of course these days those partition "slices" are often tens or 
hundreds of gigs, and are now sometimes "teras"[1], bringing up my 
initial point once again; that's NOT actually so small!

But to each his own, of course, and I definitely do agree with you on 
raid5, the larger point.  FWIW, I still consider allowing a two-device 
"raid5" or a three-device "raid6" a bug, particularly given that a single-
device "raid1" is /not/ allowed, nor is a 3-device "raid10".

---
[1] Hmm, K, megs, gigs, "ters", "teras", simply "T" to match K ???

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


Re: A note on spotting "bugs" [Was: ENOSPC after conversion]

2014-12-12 Thread Robert White


On 12/12/2014 05:12 PM, Duncan wrote:
> Robert White posted on Fri, 12 Dec 2014 05:29:58 -0800 as excerpted:
>
>> This still doesnt say _anything_ is wrong with your filesystem except
>> that it doesn't have enough _raw_ space to create a 2-ish gig extent.
>
> What's wrong with the filesystem is that there shouldn't /be/ a need to
> create a 2-ish gig extent.  All btrfs native structures are 1 GiB each or
> smaller, and the completed-without-error btrfs fi defrag should have
> eliminated any > 1 GiB structures remaining from the conversion from
> ext*, such that btrfs balance only has to deal with <= 1 GiB structures.
>
> So that balance is having to deal with a 2-ish gig extent at all is
> indicative of a bug.  Balance isn't prepared to have to allocate 2-ish
> GiB extents in the first place as that's beyond it's design specs.



Looking at the error message and the code, I would strongly disagree.

Balance has no hard-coded limit of 1-Gig.

The superblock has entries for the various sizes the filesystem uses.

Balance is _clearly_ asking for the chunk of that size and the system
configuration is _clearly_ considering that size legal.


The structures are only limited by their 64bit unsigned integer 
representation which is _way_ bigger than 2-ish gigs.


The actual code that controls the 1GiB extent allocation does a 
(paraphrased) :: min(requested,system_minimum_for_this_purpose)


Indeed, if it were a hard limit of 1GiB, then the message wouldn't even
bother to say how much was requested, it would just say "Could not
allocate data extent" because that extent would have both a minimum and
maximum of the same known value.


Nowhere in or near the allocator does it do a test that I can find that
would force the number to be smaller. And if it did, that wouldn't be a
"ENOSPC" etc, it'd come back as a "too big" result such as ENOTSUP.


So your statement that this is "beyond its design specs" isn't supported
by the code. (That's why I looked it up before I said "this is not a
bug").


I went through the code until I accounted for _every_ _word_ in the error
message.


I don't see any kind of bug.

And now that we know that btrfs-convert has to make larger-than-one-GiB
extents to encompass the block groups that are preexistent in EXT4, we
also know that the sliding-block puzzle it's trying to solve doesn't have
nice round numbers of 1GiB and 256MiB, so the holes it leaves behind
when it does move something are not nice and square and evenly sized etc.
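
The sliding-block point can be illustrated with a toy model. This is only a sketch under stated assumptions: the hole sizes are invented for illustration, and `can_allocate` is a hypothetical helper, not anything from the btrfs allocator. It shows how a request can fail for lack of a single large enough hole even when the aggregate free space is ample.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def can_allocate(free_holes, request):
    """Return True if any single contiguous hole can hold the request."""
    return any(hole >= request for hole in free_holes)


# Hypothetical, irregular holes left behind after moving oddly sized extents.
free_holes = [1536 * MIB, 900 * MIB, 1200 * MIB, 700 * MIB]
request = 2 * GIB

total_free = sum(free_holes)
print("aggregate free space covers the request:", total_free >= request)
print("a single hole covers the request:", can_allocate(free_holes, request))
```

In this model the only ways out are adding raw space or waiting for enough data to move that a larger hole opens up.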


So yeah, he's jammed up and needs more space. As his filesystem churns,
all its new extents will be the 1GiB or 256MiB sized things and he'd
slowly, but perhaps asymptotically, approach a clean layout of customary
extent sizes.


It's not broken, it's just not pretty.

And _IF_ I had been saying it's "not a bug" and any of the actual code 
contributors had disagreed, they'd have jumped in here and shut me down 
(as they should in that case). e.g. If I was wrong about its non-bug 
status we'd have heard by now.




Re: mkfs.btrfs limits "odd" [and maybe a "failed" phantom device?]

2014-12-12 Thread Zygo Blaxell

On Fri, Dec 12, 2014 at 02:28:06PM -0800, Robert White wrote:
> On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
> >On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
> >>So RAID5 with three media M is
> >>
> >>M    M    M
> >>D1   D2   P(a)
> >>D3   P(b) D4
> >>P(c) D5   D6
> >
> >RAID5 with two media is well defined, and looks like this:
> >
> >M    M
> >D1   P(a)
> >P(b) D2
> >D3   P(c)
> 
> Like I said in the other fork of this thread... I see (now) that the
> math works but I can find no trace of anyone having ever implemented
> this for arity less than 3 RAID greater than one paradigm (outside
> btrfs and its associated materials).

I've set up mdadm that way (though it does ask you to use '--force'
when you set it up).  mdadm will also ask for --force if you try to set
up RAID1 with one disk.
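
The "math works" remark about two-device RAID5 can be checked with a small sketch. It assumes plain XOR parity over equal-length blocks; `xor_parity` and `rebuild` are hypothetical helper names for illustration, not mdadm or btrfs code.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length data blocks together to form the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(surviving_blocks):
    """Recover the one missing block of a stripe from all surviving blocks."""
    return xor_parity(surviving_blocks)


# Three devices: two data blocks plus one parity block per stripe.
d1, d2 = b"\x0f\xf0", b"\x33\x55"
p = xor_parity([d1, d2])
print(rebuild([d2, p]) == d1)

# Two devices: one data block plus one parity block per stripe.
d = b"\xde\xad"
p2 = xor_parity([d])
print(p2 == d)
```

With a single data block per stripe, the XOR parity is a byte-for-byte copy of the data, so a two-device RAID5 stripe stores the same bytes on both devices, with the parity position rotating from stripe to stripe.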

I don't know of a RAID implementation that _doesn't_ do these modes,
excluding a few ancient proprietary implementations which have no way to
change a layout once created (usually because they shoot themselves in
the foot with bad choices early on, e.g. by picking odd parity for RAID5).

The reason to allow it is future expansion:  below-3-disk RAID5 ensures
that you have the layout constraints *now* for stripe/chunk size so you
can add more disks later.  If RAID5 has a 512K chunk size, and you start
with a linear or RAID1 array and add another disk later, you might lose
part of the last 512K when you switch to RAID5.  So you start with RAID5
on one or two disks so you can scale up without losing any data.
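
The chunk-size constraint is simple arithmetic, sketched here under assumptions: the device size is an arbitrary illustrative figure, and rounding the usable size down to a whole number of chunks is a simplified model of the constraint, not how any particular RAID implementation lays out its metadata.

```python
CHUNK = 512 * 1024  # 512K chunk size, as in the example above


def usable_after_alignment(device_bytes, chunk=CHUNK):
    """Round the usable size down to a whole number of chunks."""
    return device_bytes - device_bytes % chunk


# Hypothetical device size that is not a multiple of the chunk size.
device_bytes = 1_000_204_886_016
usable = usable_after_alignment(device_bytes)
lost = device_bytes - usable

print("usable bytes:", usable)
print("tail bytes that no longer fit a whole chunk:", lost)
```

An array created as RAID5 from the start never places data in that tail, so nothing needs to be given up when more disks are added later.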

Also, mdadm can grow a two-disk RAID5, but if you try to grow a two-disk
mdadm RAID1 you just get a three-disk RAID1 (i.e. two redundant copies
with no additional capacity).

btrfs doesn't really need this capability for expansion, since it can
just create new RAID5 profile chunks whenever it wants to; however, I'd
expect a complete btrfs RAID5 implementation to borrow some ideas from
ZFS, and dynamically change the number of disks per chunk to maintain
write integrity as drives are added/removed/missing.  That would imply
btrfs-RAID56 profile chunks would have to be able to exist on two or even
one disk, if that was all that was available for writing at the time.
Simply using btrfs-RAID1 chunks wouldn't work since they'd behave the
wrong way when more disks were added later.

> MEANWHILE
> 
> the system really needs to be able to explicitly express and support
> the "missing" media paradigm.
> 
>  M    x    M
>  D1   .    P(a)
>  D3   .    D4
>  P(c) .    D6
> 
> The correct logic here to "remove" (e.g. "replace with nothing"
> instead of "delete") a media just doesn't seem to exist. And it's
> already painfully missing in the RAID1 situation.

There are a number of permanent mistakes a naive admin can make when
dealing with a broken array.  I've destroyed arrays (made them permanently
read-only beyond the ability of btrfs kernel or user tools to recover)
by getting "add" and "replace" confused, or by allowing an offline drive
to rejoin an array that had been mounted read-write,degraded for some time.

The basic functionality works.  btrfs does track missing devices and
can replace them relatively quickly (not as fast as mdadm, but less
than an order of magnitude slower) in RAID1.  The reporting is full
of out-of-date cached data, but when a disk is really failing,
there is usually little doubt which one needs to be replaced.

> If I have a system with N SATA ports, and I have connected N drives,
> and device M is starting to fail... I need to be able to disconnect
> M and then connect M(new). Possibly with a non-trivial amount of
> time in there. For all RAID levels greater than zero this is a
> natural operation in a degraded mode. And for a nearly full
> filesystem the shrink operation that is btrfs device delete would
> not work. And for any nontrivially occupied filesystem it would be
> way slow, and need to be reversed for another way-slow interval.
> 
> So I need to be able to "replace" a drive with a "nothing" so that
> the number of active media becomes N-1 but the arity remains N.

btrfs already does that, but it sucks.  In a naive RAID5 implementation,
a write in degraded mode will corrupt your data if it is interrupted.
This is a general property of all RAID5 implementations that don't have
NVRAM journalling or some other way to solve the atomic update problem.

ZFS does this well:  when a device is missing, it leaves old data in
degraded mode, but writes new data striped across the existing disks
in non-degraded mode.  If you have 5 disks, and one dies, your writes
are then spread across 4 disks (3 data + parity) while your reads are
reconstructed from 4 disks (4 data + 1 parity - 1 missing).  This prevents
the degraded mode write data integrity problem.
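
The stripe arithmetic in that description can be written out as a short sketch. It only reproduces the counting from the paragraph above for single-parity stripes; the function names are hypothetical and this is not ZFS code.

```python
def new_write_layout(total_disks, missing, parity=1):
    """Data/parity split for stripes written only across the disks still present."""
    present = total_disks - missing
    return present - parity, parity


def old_read_sources(total_disks, missing, parity=1):
    """Number of surviving blocks in a stripe written before the failure."""
    if missing > parity:
        raise ValueError("more disks missing than parity can cover")
    return total_disks - missing


data, par = new_write_layout(5, 1)
print(f"new writes: {data} data + {par} parity")
print(f"old reads: reconstructed from {old_read_sources(5, 1)} surviving blocks")
```

New stripes are complete on the disks that remain, so an interrupted write never leaves a stripe that depends on a missing device.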

When the dead disk is replaced you would have the 3 data + parity promoted
to 4 data + parity, or you can elect not to replace the dead disk and
get 3 data + parity everywhere (with a loss of capacity).  btrfs could
presumably do

Re: Balance & scrub & defrag

2014-12-12 Thread Zygo Blaxell

On Fri, Dec 12, 2014 at 11:17:58AM +0200, Erkki Seppala wrote:
> That may be sort of true, but I think even SMART is helped by the fact
> that the media is read through from the beginning to the end*, so it can
> detect even the errors that don't bubble through the IO layer. And BTRFS
> can indeed note errors that the media doesn't - two checksums is better
> than one checksum, assuming they aren't exactly the same algorithm ;).
> 
> Do you alternatively execute SMART self tests?
> 
> * scrub doesn't do this, it reads only through used data

I do both.  They operate at different layers of the storage stack, and have
access to different information.  They also have different (and hopefully
non-overlapping) bugs.

scrub pros:

+ can compare data with the other copies in RAID1 or DUP mode

+ can fix bad data when good copies available

+ slows down when other processes want to use the disk

+ can be suspended and resumed at will by software

+ error data is impervious to drive firmware bugs

+ straightforward error reports

+ only scans allocated data

scrub cons:

- only scans allocated data

- btrfs filesystems only

- CPU and I/O burden

- error sources are not localized:  scrub errors could be software
bugs, bad RAM, bad CPU cooling, bad cabling, bad power supply,
or bad hard drive

smart pros:

+ runs in the background

+ no CPU or I/O required, just read results from previous run
and launch new test daily

+ access to electrical and mechanical data from the drive
that are otherwise unavailable to the host

+ 100% surface scan (including bad sector count)

+ logs host I/O errors that OS might miss
(e.g. because they occur during BIOS booting)

+ works with any filesystems, partitions, swap, etc.

+ error sources are localized to the drive in test

smart cons:

- buggy firmware does not detect or report error events when
significant failures occur

- buggy firmware does detect and report error events when
significant failures do not occur

- buggy firmware will make host accesses painfully slow during
scan (WD Green is very bad for this)

- firmware does not implement useful subset of SMART command set

- SMART command set can be inaccessible through some SATA bridge
chips (especially USB)

- cannot fix anything, only report quantities of data already lost

- cannot reliably detect RAM or CPU failure (on host or drive)

- requires the drive to spin for 1-2 continuous hours during test

- interpreting the raw data is a black art


