Allocator behaviour during device delete

2016-06-09 Thread Brendan Hide

Hey, all

I noticed this odd behaviour while migrating from a 1TB spindle to SSD 
(in this case on a LUKS-encrypted 200GB partition) - and am curious if 
this behaviour I've noted below is expected or known. I figure it is a 
bug. Depending on the situation, it *could* be severe. In my case it was 
simply annoying.


---
Steps

After having added the new device (btrfs dev add), I deleted the old 
device (btrfs dev del)
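
For reference, the commands were essentially the following (paths as in
the output further down; treat this as a sketch rather than a pasted
transcript):

  btrfs device add /dev/mapper/cryptroot /
  btrfs device delete /dev/sda2 /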


Then, whilst waiting for that to complete, I started a watch of "btrfs 
fi show /". Note that the below is very close to the output at the time 
- but is not actually copy/pasted from the output.


> Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
> Total devices 2 FS bytes used 115.03GiB
> devid1 size 0.00GiB used 298.06GiB path /dev/sda2
> devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot


devid1 is the old disk while devid2 is the new SSD

After a few minutes, I saw that the numbers had changed - but that the 
SSD still had no data:


> Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
> Total devices 2 FS bytes used 115.03GiB
> devid1 size 0.00GiB used 284.06GiB path /dev/sda2
> devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot

The "FS bytes used" amount was changing a lot - but mostly stayed near 
the original total, which is expected since there was very little 
happening other than the "migration".


I'm not certain of the exact point where it started using the new disk's 
space. I figure pinpointing that may have been helpful. :-/
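
In hindsight, something as simple as a timestamped log alongside the
watch would probably have caught the switchover (a sketch, not what I
actually ran):

  # log the per-device numbers once a minute while the delete runs
  while sleep 60; do date; btrfs filesystem show /; done | tee -a delete.log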


---
Educated guess as to what was happening:

Key: Though the available space on devid1 is displayed as 0 GiB, 
internally the allocator still sees most of the device's space as 
available. The allocator will continue writing to the old disk even 
though the intention is to remove it.


The dev delete operation goes through the chunks in sequence and does a 
"normal" balance operation on each, which the kernel simply sends to the 
"normal" single allocator. At the start of the operation, the allocator 
will see that the 1TB device has more space available than the 200GB 
device, thus it writes the data to a new chunk on the 1TB spindle.


Only after the chunk is balanced away does the operation mark *only* 
that "source" chunk as being unavailable. As each chunk is subsequently 
balanced away, eventually the allocator will see that there is more 
space available on the new device than on the old device (1:199/2:200), 
thus the next chunk gets allocated to the new device. The same occurs 
for the next chunk (1:198/2:199) and so on, until the device finally has 
zero usage and is removed completely.
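
If that guess is right, the old device should keep accumulating new
allocations early in the operation. Assuming a btrfs-progs recent enough
to have the usage subcommand, a per-device view would make this visible
(again a sketch, not output from the actual run):

  # per-device allocated vs. unallocated space, refreshed while the delete runs
  watch -n 30 btrfs device usage /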


---
Naive approach for a fix (assuming my assessment above is correct)

At the start:
1. "Balance away"/Mark-as-Unavailable empty space
2. Balance away the *current* chunks (data+metadata) that would 
otherwise be written to if the device was still available

3. As before, balance in whatever order is applicable.

---
Severity

I figure that, for my use-case, this isn't a severe issue. However, in 
the case where you want to quickly remove a potentially failing disk 
(common use case for dev delete), I'd much rather that btrfs does *not* 
write data to the disk I'm trying to remove, making this a potentially 
severe bug.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
--


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Hans van Kranenburg

On 06/09/2016 10:52 AM, Marc Haber wrote:

> On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
>> So, instead of being the cause, apt-get update causing a new chunk to be
>> allocated might as well be the result of existing ones already filled up
>> with too many fragments.
>>
>> The next question is what files these extents belong to. To find out, I need
>> to open up the extent items I get back and follow a backreference to an
>> inode object. Might do that tomorrow, fun.
>
> Does your apt use pdiffs to update the packages lists? If yes, I'd try
> turning it off just for the fun of it and to see whether this changes
> btrfs' allocation behavior. I have never looked at apt's pdiff stuff
> in detail, but I guess that it creates many tiny temporary files.


No, it does not:

Acquire::Pdiffs "false";
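
For completeness, the effective setting can be double-checked with
apt-config (just a sanity check, nothing btrfs-specific):

  apt-config dump | grep -i pdiff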

--
Hans van Kranenburg - System / Network Engineer
Mendix | Driving Digital Innovation | www.mendix.com
--


Re: Recommended way to use btrfs for production?

2016-06-09 Thread Austin S. Hemmelgarn

On 2016-06-09 02:16, Duncan wrote:

> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
> excerpted:
>
>> As far as BTRFS raid10 mode in general, there are a few things that are
>> important to remember about it:
>> 1. It stores exactly two copies of everything, any extra disks just add
>> to the stripe length on each copy.
>
> I'll add one more, potentially very important, related to this one:
>
> Btrfs raid mode (any of them) works in relation to individual chunks,
> *NOT* individual devices.
>
> What that means for btrfs raid10 in combination with the above exactly
> two copies rule, is that it works rather differently than a standard
> raid10, which can tolerate loss of two devices as long as they're from
> the same mirror set, as the other mirror set will then still be whole.
> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
> a second device close to assures loss of data, because the very likely
> true assumption is that both mirror sets will be affected for some
> chunks, but not others.

Actually, that's not _quite_ the case.  Assuming that you have an even 
number of devices, BTRFS raid10 will currently always span all the 
available devices with two striped copies of the data (if there's an odd 
number, it spans one less than the total, and rotates which one gets 
left out of each chunk).  This means that as long as all the devices are 
the same size and you have have stripes that are the full width of the 
array (you can end up with shorter ones if you have run in degraded mode 
or expanded the array), your probability of data loss per-chunk goes 
down as you add more devices (because the probability of a two device 
failure affecting both copies of a stripe in a given chunk decreases), 
but goes up as you add more chunks (because you then have to apply that 
probability for each individual chunk).  Once you've lost one disk, the 
probability that losing another will compromise a specific chunk is:

1/(N - 1)
Where N is the total number of devices.
The probability that it will compromise _any_ chunk is:
1 - (1 - 1/(N - 1))^C
Where C is the total number of chunks
BTRFS raid1 mode actually has the exact same probabilities, but they 
apply even if you have an odd number of disks.
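
As a hypothetical worked example (numbers purely for illustration), take
N = 6 devices and C = 300 data chunks after the first disk has failed:

  P(one specific chunk compromised by a second failure) = 1/(6 - 1) = 0.2
  P(at least one chunk compromised) = 1 - (1 - 0.2)^300, effectively 1

which is why, in practice, a second device loss almost always costs some
data even though the per-chunk odds improve as devices are added.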


> By using a layered approach, btrfs raid1 on top (for its error correction
> from the other copy feature) of a pair of mdraid0s, you force one of the
> btrfs raid1 copies to each of the mdraid0s, thus making allocation more
> deterministic than btrfs raid10, and can thus again tolerate loss of two
> devices, as long as they're from the same underlying mdraid0.
>
> (Traditionally, raid1 on top of raid0 is called raid01, and is
> discouraged compared to raid10, raid0 on top of raid1, because device
> failure and replacement with the latter triggers a much more localized
> rebuild than the former, across the pair of devices in the raid1 when
> it's closest to the physical devices, across the whole array, one raid0
> to the other, when the raid1 is on top.  However, btrfs raid1's data
> integrity and error repair from the good mirror feature is generally
> considered to be useful enough to be worth the rebuild-inefficiency of
> the raid01 design.)
>
> So in regard to failure tolerance, btrfs raid10 is far closer to
> traditional raid5, loss of a single device is tolerated, loss of a second
> before a repair is complete generally means data loss -- there's not the
> chance of it being on the same mirror set to save you that traditional
> raid10 has.
>
> Similarly, btrfs raid10 doesn't have the cleanly separate pair of mirrors
> on raid0 arrays that traditional raid10 does, thus doesn't have the fault
> tolerance of losing say the connection or power to one entire device
> bank, as long as it's all one mirror set, that traditional raid10 has.
>
> And again, doing the layered thing with btrfs raid1 on top and mdraid0
> (or whatever else) underneath gets that back for you, if you set it up
> that way, of course.

And will get you better performance than just BTRFS most of the time too.

--


Re: [PATCH 00/21] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-09 Thread Felipe Balbi

Hi,

Deepa Dinamani  writes:
>  drivers/usb/gadget/function/f_fs.c |  2 +-
>  drivers/usb/gadget/legacy/inode.c  |  2 +-

for drivers/usb/gadget:

Acked-by: Felipe Balbi 

-- 
balbi




Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Marc Haber
On Thu, Jun 09, 2016 at 01:10:46AM +0200, Hans van Kranenburg wrote:
> So, instead of being the cause, apt-get update causing a new chunk to be
> allocated might as well be the result of existing ones already filled up
> with too many fragments.
> 
> The next question is what files these extents belong to. To find out, I need
> to open up the extent items I get back and follow a backreference to an
> inode object. Might do that tomorrow, fun.

Does your apt use pdiffs to update the packages lists? If yes, I'd try
turning it off just for the fun of it and to see whether this changes
btrfs' allocation behavior. I have never looked at apt's pdiff stuff
in detail, but I guess that it creates many tiny temporary files.

Greetings
Marc

-- 
-
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Leimen, Germany|  lose things."Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
--


Re: [PATCH v3] btrfs: fix check_shared for fiemap ioctl

2016-06-09 Thread David Sterba
On Wed, Jun 08, 2016 at 08:53:00AM -0700, Mark Fasheh wrote:
> On Wed, Jun 08, 2016 at 01:13:03PM +0800, Lu Fengqi wrote:
> > Only in the case of a different root_id or a different object_id did
> > check_shared identify an extent as shared. However, if an extent is
> > referred to from different offsets of the same file, it should also be
> > identified as shared. In addition, check_shared's loop scales as at least
> > n^3, so if an extent has too many references it can even cause a soft hang.
> > 
> > First, add all delayed_refs to the ref_tree and calculate the unique_refs;
> > if unique_refs is greater than one, return BACKREF_FOUND_SHARED. Then
> > individually add the on-disk references (inline/keyed) to the ref_tree and
> > calculate the unique_refs of the ref_tree to check whether it is greater
> > than one. Because we return SHARED as soon as there are two references,
> > the time complexity is close to constant.
> > 
> > Reported-by: Tsutomu Itoh 
> > Signed-off-by: Lu Fengqi 
> > ---
> > The caller is fiemap that called from an ioctl. Because it isn't on a
> > writeout path, so we temporarily use GFP_KERNEL in ref_root_alloc() and
> > ref_tree_add(). If we use ref_tree replace the existing backref structure
> > later, we can consider whether to use GFP_NOFS again.
> 
> NACK.
> 
> You don't need to be on a writeout path to deadlock, you simply need to be
> holding locks that the writeout path takes when you allocate. If the
> allocator does writeout to free memory then you deadlock. Fiemap is locking 
> down extents which may also get locked down when you allocate within those  
> locks. See my e-mail here for details,
> 
> http://www.spinics.net/lists/linux-btrfs/msg55789.html

There seems to be a code path that leads to a deadlock scenario; it
depends on the extent state of the file under fiemap. If there's a
delalloc range in the middle of the file, fiemap locks the whole range
and asks for memory that could trigger writeout; the delalloc range would
then need to be written out, which requests locking the delalloc range
again. There's no direct locking, but rather waiting for the extent
LOCKED bit to be unset.

So, use GFP_NOFS for now, though I still hope we can find a way to avoid
the rb-tree and allocations altogether.
--


Re: Allocator behaviour during device delete

2016-06-09 Thread Austin S. Hemmelgarn

On 2016-06-09 08:34, Brendan Hide wrote:

Hey, all

I noticed this odd behaviour while migrating from a 1TB spindle to SSD
(in this case on a LUKS-encrypted 200GB partition) - and am curious if
this behaviour I've noted below is expected or known. I figure it is a
bug. Depending on the situation, it *could* be severe. In my case it was
simply annoying.

---
Steps

After having added the new device (btrfs dev add), I deleted the old
device (btrfs dev del)

Then, whilst waiting for that to complete, I started a watch of "btrfs
fi show /". Note that the below is very close to the output at the time
- but is not actually copy/pasted from the output.


Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
Total devices 2 FS bytes used 115.03GiB
devid1 size 0.00GiB used 298.06GiB path /dev/sda2
devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot



devid1 is the old disk while devid2 is the new SSD

After a few minutes, I saw that the numbers had changed - but that the
SSD still had no data:


Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
Total devices 2 FS bytes used 115.03GiB
devid1 size 0.00GiB used 284.06GiB path /dev/sda2
devid2 size 200.88GiB used 0.00GiB path /dev/mapper/cryptroot


The "FS bytes used" amount was changing a lot - but mostly stayed near
the original total, which is expected since there was very little
happening other than the "migration".

I'm not certain of the exact point where it started using the new disk's
space. I figure pinpointing that may have been helpful. :-/

OK, I'm pretty sure I know what was going on in this case.  Your 
assumption that device delete uses the balance code is correct, and that 
is why you see the behavior you're seeing.  There are two key bits that 
are missing though:

1. Balance will never allocate chunks when it doesn't need to.
2. The space usage listed in fi show is how much space is allocated to 
chunks, not how much is used in those chunks.


In this case, based on what you've said, you had a lot of empty or 
mostly empty chunks.  As a result of this, the device delete was both 
copying data, and consolidating free space.  If you have a lot of empty 
or mostly empty chunks, it's not unusual for a device delete to look 
like this until you start hitting chunks that have actual data in them. 
The primary point of this behavior is that it makes it possible to 
directly switch to a smaller device without having to run a balance and 
then a resize before replacing the device, and then resize again afterwards.
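
A quick way to see that distinction on a live filesystem, assuming a
reasonably current btrfs-progs (a sketch, not output from your system):

  # space allocated to chunks, per device
  btrfs filesystem show /
  # space actually used within those chunks, per profile
  btrfs filesystem df /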

--


Re: [PATCH] btrfs-progs: add the error message when open fails

2016-06-09 Thread David Sterba
On Thu, Jun 09, 2016 at 10:23:15AM +0900, Tsutomu Itoh wrote:
> When the open in btrfs_open_devices fails, only the following message is
> displayed. Therefore the user doesn't understand the reason why open
> failed.
> 
>   # btrfs check /dev/sdb8
>   Couldn't open file system
> 
> This patch adds the error message when open fails.

I think the message should be printed by the caller, not by the helper.
--


Re: [PATCH] btrfs: account for non-CoW'd blocks in btrfs_abort_transaction

2016-06-09 Thread Filipe Manana
On Wed, Jun 8, 2016 at 5:36 AM, Jeff Mahoney  wrote:
> The test for !trans->blocks_used in btrfs_abort_transaction is
> insufficient to determine whether it's safe to drop the transaction
> handle on the floor.  btrfs_cow_block, informed by should_cow_block,
> can return blocks that have already been CoW'd in the current
> transaction.  trans->blocks_used is only incremented for new block
> allocations. If an operation overlaps the blocks in the current
> transaction entirely and must abort the transaction, we'll happily
> let it clean up the trans handle even though it may have modified
> the blocks and will commit an incomplete operation.

IMHO it wouldn't hurt to be more explicit here and mention that if the
transaction handle ends up not COWing any nodes/leafs, calling abort
against the handle won't abort the transaction.

>
> In the long-term, I'd like to do closer tracking of when the fs
> is actually modified so we can still recover as gracefully as possible,
> but that approach will need some discussion.  In the short term,
> since this is the only code using trans->blocks_used, let's just
> switch it to a bool indicating whether any blocks were used and set
> it when should_cow_block returns false.
>
> Cc: sta...@vger.kernel.org # 3.4+
> Signed-off-by: Jeff Mahoney 

But anyway,

Reviewed-by: Filipe Manana 


> ---
>  fs/btrfs/ctree.c   | 5 -
>  fs/btrfs/extent-tree.c | 2 +-
>  fs/btrfs/super.c   | 2 +-
>  fs/btrfs/transaction.h | 2 +-
>  4 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index 427c36b..135af4e 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -1552,6 +1552,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle 
> *trans,
>trans->transid, root->fs_info->generation);
>
> if (!should_cow_block(trans, root, buf)) {
> +   trans->dirty = true;
> *cow_ret = buf;
> return 0;
> }
> @@ -2773,8 +2774,10 @@ again:
>  * then we don't want to set the path blocking,
>  * so we test it here
>  */
> -   if (!should_cow_block(trans, root, b))
> +   if (!should_cow_block(trans, root, b)) {
> +   trans->dirty = true;
> goto cow_done;
> +   }
>
> /*
>  * must have write locks on this node and the
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 689d25a..1ed31eb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -8044,7 +8044,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, 
> struct btrfs_root *root,
> set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
>  buf->start + buf->len - 1, GFP_NOFS);
> }
> -   trans->blocks_used++;
> +   trans->dirty = true;
> /* this returns a buffer locked for blocking */
> return buf;
>  }
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 4e59a91..bd07e01 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -235,7 +235,7 @@ void __btrfs_abort_transaction(struct btrfs_trans_handle 
> *trans,
> trans->aborted = errno;
> /* Nothing used. The other threads that have joined this
>  * transaction may be able to continue. */
> -   if (!trans->blocks_used && list_empty(&trans->new_bgs)) {
> +   if (!trans->dirty && list_empty(&trans->new_bgs)) {
> const char *errstr;
>
> errstr = btrfs_decode_error(errno);
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 9fe0ec2..c5abee4 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -110,7 +110,6 @@ struct btrfs_trans_handle {
> u64 chunk_bytes_reserved;
> unsigned long use_count;
> unsigned long blocks_reserved;
> -   unsigned long blocks_used;
> unsigned long delayed_ref_updates;
> struct btrfs_transaction *transaction;
> struct btrfs_block_rsv *block_rsv;
> @@ -121,6 +120,7 @@ struct btrfs_trans_handle {
> bool can_flush_pending_bgs;
> bool reloc_reserved;
> bool sync;
> +   bool dirty;
> unsigned int type;
> /*
>  * this root is only needed to validate that the root passed to
> --
> 2.7.1
>
>
> --
> Jeff Mahoney
> SUSE Labs
> --



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--

Re: [PATCH v2] Btrfs: check if extent buffer is aligned to sectorsize

2016-06-09 Thread David Sterba
On Mon, Jun 06, 2016 at 12:01:23PM -0700, Liu Bo wrote:
> Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
> via alloc_extent_buffer().  An unaligned eb can have more pages than it
> should have, which ends up leaking the extent buffer or leaving corrupted
> content in the extent buffer.
> 
> This adds a warning to let us quickly know what was happening.
> 
> Now that alloc_extent_buffer() no longer returns NULL, this changes its
> caller and callers of its caller to match with the new error
> handling.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 
--


Re: WARNING: at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]

2016-06-09 Thread Duncan
Fugou Nashi posted on Sun, 05 Jun 2016 10:12:31 +0900 as excerpted:

> Hi,
> 
> Do I need to worry about this?
> 
> Thanks.
> 
> Linux nakku 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59
> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

There's a patch for it that didn't quite make 4.6.0, but is in 4.7-rc1 
and should appear in later 4.6.x stable releases (tho I don't track 
stable myself so couldn't tell you if it's there yet or not).

In general, some people were reporting many of these warnings during the 
4.6 rcs with no visible damage or further problems, so it doesn't tend to 
be anything but the irritation of the warnings in practice, tho under the 
right conditions, in theory, it could be a bigger issue.  But I know of 
no one who actually experienced a problem beyond the warn noise.

So in practice it's a get the patch as soon as you can thing, but not 
something to be hugely worried about in the mean time.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--


Re: WARNING: at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]

2016-06-09 Thread Noah Massey
On Thu, Jun 9, 2016 at 10:52 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Fugou Nashi posted on Sun, 05 Jun 2016 10:12:31 +0900 as excerpted:
>
>> Hi,
>>
>> Do I need to worry about this?
>>
>> Thanks.
>>
>> Linux nakku 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59
>> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> There's a patch for it that didn't quite make 4.6.0, but is in 4.7-rc1
> and should appear in later 4.6.x stable releases (tho I don't track
> stable myself so couldn't tell you if it's there yet or not).
>

It's in 4.6.1

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=4704fa547224fd49041c26f7fca3710a47fc449f
--


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Duncan
Hans van Kranenburg posted on Thu, 09 Jun 2016 01:10:46 +0200 as
excerpted:

> The next question is what files these extents belong to. To find out, I
> need to open up the extent items I get back and follow a backreference
> to an inode object. Might do that tomorrow, fun.
> 
> To be honest, I suspect /var/log and/or the file storage of mailman to
> be the cause of the fragmentation, since there's logging from postfix,
> mailman and nginx going on all day long in a slow but steady tempo.
> While using btrfs for a number of use cases at work now, we normally
> don't use it for the root filesystem. And the cases where it's used as
> root filesystem don't do much logging or mail.

FWIW, that's one reason I have a dedicated partition (and filesystem) for 
logs, here.  (The other reason is that should something go runaway log-
spewing, I get a warning much sooner when my log filesystem fills up, not 
much later, with much worse implications, when the main filesystem fills 
up!)

> And no, autodefrag is not in the mount options currently. Would that be
> helpful in this case?

It should be helpful, yes.  Be aware that autodefrag works best with 
smaller (sub-half-gig) files, however, and that it used to cause 
performance issues with larger database and VM files, in particular.  
There used to be a warning on the wiki about that, that was recently 
removed, so apparently it's not the issue that it was, but you might wish 
to monitor any databases or VMs with gig-plus files to see if it's going 
to be a performance issue, once you turn on autodefrag.

The other issue with autodefrag is that if it hasn't been on and things 
are heavily fragmented, it can at first drive down performance as it 
rewrites all these heavily fragmented files, until it catches up and is 
mostly dealing only with the normal refragmentation load.  Of course the 
best way around that is to run autodefrag from the first time you mount 
the filesystem and start writing to it, so it never gets overly 
fragmented in the first place.  For a currently in-use and highly 
fragmented filesystem, you have two choices: either back up and do a fresh 
mkfs.btrfs so you can start with a clean filesystem and autodefrag from 
the beginning, or do a manual defrag.

However, be aware that if you have snapshots locking down the old extents 
in their fragmented form, a manual defrag will copy the data to new 
extents without releasing the old ones as they're locked in place by the 
snapshots, thus using additional space.  Worse, if the filesystem is 
already heavily fragmented and snapshots are locking most of those 
fragments in place, defrag likely won't help a lot, because the free 
space as well will be heavily fragmented.   So starting off with a clean 
and new filesystem and using autodefrag from the beginning really is your 
best bet.
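
For reference, a rough sketch of the two routes (mount option vs. a
one-off manual defrag), assuming the filesystem is mounted at /:

  # enable autodefrag on the running system (add it to fstab to make it stick)
  mount -o remount,autodefrag /
  # or do a one-off recursive defrag of an existing, fragmented tree
  btrfs filesystem defragment -r /path/to/data

Keep the snapshot caveat above in mind: a manual defrag of snapshotted
data unshares extents and costs additional space.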


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-09 Thread Duncan
boli posted on Wed, 08 Jun 2016 20:55:13 +0200 as excerpted:

> Recently I had the idea to replace the 6 TB HDDs with 8 TB ones ("WD
> Red"), because their price is now acceptable.

Are those the 8 TB SMR "archive" drives?

I haven't been following the issue very closely, but be aware that there 
were serious issues with those drives a few kernels back, and that while 
those issues are now fixed, the drives themselves operate rather 
differently than normal drives, and simply don't work well in normal 
usage.

The short version is that they really are designed for archiving and work 
well when used for that purpose -- a mostly write once and leave it there 
for archiving and retrieval but rarely if ever rewrite it, type usage.  
However, they work rather poorly in normal usage where data is rewritten, 
because they have to rewrite entire zones of data, and that takes much 
longer than simply rewriting individual sectors on normal drives does.

With the kernel patches to fix the initial problems they do work well 
enough, tho performance may not be what you expect, but the key to 
keeping them working well is being aware that they continue to do 
rewrites in the background for long after they are done with the initial 
write, and shutting them down while they are doing them can be an issue.

Due to btrfs' data checksumming feature, small variances to data that 
wouldn't normally be detected on non-checksumming filesystems were 
detected far sooner on btrfs, making it far more sensitive to these small 
errors.  However, if you use the drives for their intended nearly write-
only purpose, and/or very seldom power down the drives at all or do so 
only long after (give it half an hour, say) any writes have completed, as 
long as you're running a current kernel with the initial issues patched, 
you should be fine.  Just don't treat them like normal drives.

If OTOH you need more normal drive usage including lots of data rewrites, 
especially if you frequently poweroff the devices, strongly consider 
avoiding those 8 TB SMR drives, at least until the technology has a few 
more years to mature.

There's more information on other threads on the list and on other lists, 
if you need it and nobody posts more direct information (such as the 
specific patches in question and what specific kernel versions they hit) 
here.  I could find it but I'd have to do a search in my own list 
archives, and now that you are aware of the problem, you can of course do 
the search as well, if you need to. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--


Re: WARNING: at /home/kernel/COD/linux/fs/btrfs/inode.c:9261 btrfs_destroy_inode+0x247/0x2c0 [btrfs]

2016-06-09 Thread g6094199
Am 09.06.2016 um 16:52 schrieb Duncan:
> Fugou Nashi posted on Sun, 05 Jun 2016 10:12:31 +0900 as excerpted:
>
>> Hi,
>>
>> Do I need to worry about this?
>>
>> Thanks.
>>
>> Linux nakku 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59
>> UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> There's a patch for it that didn't quite make 4.6.0, but is in 4.7-rc1 
> and should appear in later 4.6.x stable releases (tho I don't track 
> stable myself so couldn't tell you if it's there yet or not).
>
> In general, some people were reporting many of these warnings during the 
> 4.6 rcs with no visible damage or further problems, so it doesn't tend to 
> be anything but the irritation of the warnings in practice, tho under the 
> right conditions, in theory, it could be a bigger issue.  But I know of 
> no one who actually experienced a problem beyond the warn noise.
>
> So in practice it's a get the patch as soon as you can thing, but not 
> something to be hugely worried about in the mean time.
>
FYI

I ran into the same error on Debian sid with 4.6.0, and Holger pointed
out:

This is 4.6.0-whatever, not 4.6.1. The problem is fixed in the real
4.6.1 (released last week), which you can find described in commit 4704fa54..
at https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.6.1

-h


--


Re: Replacing drives with larger ones in a 4 drive raid1

2016-06-09 Thread bOli
On 09.06.2016, at 17:20, Duncan <1i5t5.dun...@cox.net> wrote:

> Are those the 8 TB SMR "archive" drives?

No, they are Western Digital Red drives.

Thanks for the detailed follow-up anyway. :)

Half a year ago, when I evaluated hard drives, in the 8 TB category there were 
only the Hitachi 8 TB Helium drives for 800 bucks, and the Seagate SMR for 250 
bucks.

I bought myself one of the Seagate SMR ones for testing, and figured out it 
wouldn't work for my use case (I now use it in a write-very-seldom context).

For my two NASes I went with 6 TB WD Red drives all around.

Nowadays there are more choices of 8 TB drives, such as the WD Reds I'm 
switching my backup NAS to.

> I haven't been following the issue very closely, but be aware that there 
> were serious issues with those drives a few kernels back, and that while 
> those issues are now fixed, the drives themselves operate rather 
> differently than normal drives, and simply don't work well in normal 
> usage.
> 
> The short version is that they really are designed for archiving and work 
> well when used for that purpose -- a mostly write once and leave it there 
> for archiving and retrieval but rarely if ever rewrite it, type usage.  
> However, they work rather poorly in normal usage where data is rewritten, 
> because they have to rewrite entire zones of data, and that takes much 
> longer than simply rewriting individual sectors on normal drives does.
> 
> With the kernel patches to fix the initial problems they do work well 
> enough, tho performance may not be what you expect, but the key to 
> keeping them working well is being aware that they continue to do 
> rewrites in the background for long after they are done with the initial 
> write, and shutting them down while they are doing them can be an issue.
> 
> Due to btrfs' data checksumming feature, small variances to data that 
> wouldn't normally be detected on non-checksumming filesystems were 
> detected far sooner on btrfs, making it far more sensitive to these small 
> errors.  However, if you use the drives for their intended nearly write-
> only purpose, and/or very seldom power down the drives at all or do so 
> only long after (give it half an hour, say) any writes have completed, as 
> long as you're running a current kernel with the initial issues patched, 
> you should be fine.  Just don't treat them like normal drives.
> 
> If OTOH you need more normal drive usage including lots of data rewrites, 
> especially if you frequently poweroff the devices, strongly consider 
> avoiding those 8 TB SMR drives, at least until the technology has a few 
> more years to mature.
> 
> There's more information on other threads on the list and on other lists, 
> if you need it and nobody posts more direct information (such as the 
> specific patches in question and what specific kernel versions they hit) 
> here.  I could find it but I'd have to do a search in my own list 
> archives, and now that you are aware of the problem, you can of course do 
> the search as well, if you need to. =:^)
> 
> -- 
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
> 
> --

--


[PATCH 1/2] Btrfs: be more precise on errors when getting an inode from disk

2016-06-09 Thread fdmanana
From: Filipe Manana 

When we attempt to read an inode from disk, we end up always returning an
-ESTALE error to the caller regardless of the actual failure reason, which
can be an out of memory problem (when allocating a path), some error found
when reading from the fs/subvolume btree (like a genuine IO error) or the
inode simply not existing. So let's start returning the real error code to the
callers so that they don't treat all -ESTALE errors as meaning that the
inode does not exist (such as during orphan cleanup). This will also be
needed for a subsequent patch in the same series dealing with a special
fsync case.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e5558d9..1ae0fc8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3419,10 +3419,10 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
found_key.offset = 0;
inode = btrfs_iget(root->fs_info->sb, &found_key, root, NULL);
ret = PTR_ERR_OR_ZERO(inode);
-   if (ret && ret != -ESTALE)
+   if (ret && ret != -ENOENT)
goto out;
 
-   if (ret == -ESTALE && root == root->fs_info->tree_root) {
+   if (ret == -ENOENT && root == root->fs_info->tree_root) {
struct btrfs_root *dead_root;
struct btrfs_fs_info *fs_info = root->fs_info;
int is_dead_root = 0;
@@ -3458,7 +3458,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 * Inode is already gone but the orphan item is still there,
 * kill the orphan item.
 */
-   if (ret == -ESTALE) {
+   if (ret == -ENOENT) {
trans = btrfs_start_transaction(root, 1);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -3617,7 +3617,7 @@ static noinline int acls_after_inode_item(struct 
extent_buffer *leaf,
 /*
  * read an inode from the btree into the in-memory inode
  */
-static void btrfs_read_locked_inode(struct inode *inode)
+static int btrfs_read_locked_inode(struct inode *inode)
 {
struct btrfs_path *path;
struct extent_buffer *leaf;
@@ -3636,14 +3636,19 @@ static void btrfs_read_locked_inode(struct inode *inode)
filled = true;
 
path = btrfs_alloc_path();
-   if (!path)
+   if (!path) {
+   ret = -ENOMEM;
goto make_bad;
+   }
 
memcpy(&location, &BTRFS_I(inode)->location, sizeof(location));
 
ret = btrfs_lookup_inode(NULL, root, path, , 0);
-   if (ret)
+   if (ret) {
+   if (ret > 0)
+   ret = -ENOENT;
goto make_bad;
+   }
 
leaf = path->nodes[0];
 
@@ -3796,11 +3801,12 @@ cache_acl:
}
 
btrfs_update_iflags(inode);
-   return;
+   return 0;
 
 make_bad:
btrfs_free_path(path);
make_bad_inode(inode);
+   return ret;
 }
 
 /*
@@ -5585,7 +5591,9 @@ struct inode *btrfs_iget(struct super_block *s, struct 
btrfs_key *location,
return ERR_PTR(-ENOMEM);
 
if (inode->i_state & I_NEW) {
-   btrfs_read_locked_inode(inode);
+   int ret;
+
+   ret = btrfs_read_locked_inode(inode);
if (!is_bad_inode(inode)) {
inode_tree_add(inode);
unlock_new_inode(inode);
@@ -5594,7 +5602,8 @@ struct inode *btrfs_iget(struct super_block *s, struct 
btrfs_key *location,
} else {
unlock_new_inode(inode);
iput(inode);
-   inode = ERR_PTR(-ESTALE);
+   ASSERT(ret < 0);
+   inode = ERR_PTR(ret < 0 ? ret : -ESTALE);
}
}
 
-- 
2.7.0.rc3

--


Re: [PATCH] btrfs: chunk_width_limit mount option

2016-06-09 Thread Andrew Armenia
On Mon, Jun 6, 2016 at 11:00 AM, Hugo Mills  wrote:
> On Mon, Jun 06, 2016 at 09:43:19AM -0400, Andrew Armenia wrote:
>> On Mon, Jun 6, 2016 at 5:17 AM, David Sterba  wrote:
>> > On Thu, Jun 02, 2016 at 09:50:15PM -0400, Andrew Armenia wrote:
>> >> This patch adds mount option 'chunk_width_limit=X', which when set forces
>> >> the chunk allocator to use only up to X devices when allocating a chunk.
>> >> This may help reduce the seek penalties seen in filesystems with large
>> >> numbers of devices.
>> >
>> > There is a use for such limit but passing it as a mount is not the right
>> > way to do it. Changing the value requires remount, different raid levels
>> > might need different values, it's not persistent etc.
>>
>> Thanks for the feedback. I agree that it's less than an ideal way to
>> do it. I guess I was a bit quick to release proof-of-concept code.
>> I'll try to do some more work on the configuration side of things and
>> resubmit.
>
>Last time something like this came up in discussion, the consensus
> was to put the configuration in xattrs in the btrfs namespace. There
> should be one with a name to indicate "global" configuration, applied
> to the top of subvol 5. If the feature allows configuration on a
> per-subvol or per-object basis, then there should also be a name for
> the relevant xattr (also in the btrfs namespace) that can be created
> on each object as required.
>
>Hugo.
>
> --
> Hugo Mills | Klytus, I'm bored. What plaything can you offer me
> hugo@... carfax.org.uk | today?
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |  Ming the Merciless, Flash Gordon

OK, since I'm new here I'll throw my thoughts out before writing and
submitting another patch. I wouldn't want to have to read the xattrs
from disk every time a chunk needs to be allocated. That means the
allocator configuration will need to be cached in the fs_info
structure. My current line of thinking is to create a
btrfs_allocator_config structure containing all of the allocator
parameters - for now a user-set devs_max for each RAID level. I'd add
a couple ioctls to allow the userspace tools to get and set these
parameters. The "set" ioctl would also persist this configuration to
one or more xattrs on the root of subvol 5 per Hugo's suggestion.
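
From userspace that might end up looking something like this; the
attribute name here is purely hypothetical, just to illustrate the shape
of it:

  # set a per-RAID-level device limit on the top-level subvolume (hypothetical name)
  setfattr -n btrfs.chunk_width_limit.raid10 -v 4 /mnt/fs
  # read it back
  getfattr -n btrfs.chunk_width_limit.raid10 /mnt/fs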

I'm still a bit hung up on the proper way to read the config back at
mount time. Patching the mount routines doesn't seem like a good way
to do it. Perhaps just doing it at the first invocation of
__btrfs_alloc_chunk is best? Any other thoughts?

-Andrew
--


Re: Recommended way to use btrfs for production?

2016-06-09 Thread Chris Murphy
On Thu, Jun 9, 2016 at 5:38 AM, Austin S. Hemmelgarn
 wrote:
> On 2016-06-09 02:16, Duncan wrote:
>>
>> Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
>> excerpted:
>>
>>> As far as BTRFS raid10 mode in general, there are a few things that are
>>> important to remember about it:
>>> 1. It stores exactly two copies of everything, any extra disks just add
>>> to the stripe length on each copy.
>>
>>
>> I'll add one more, potentially very important, related to this one:
>>
>> Btrfs raid mode (any of them) works in relation to individual chunks,
>> *NOT* individual devices.
>>
>> What that means for btrfs raid10 in combination with the above exactly
>> two copies rule, is that it works rather differently than a standard
>> raid10, which can tolerate loss of two devices as long as they're from
>> the same mirror set, as the other mirror set will then still be whole.
>> Because with btrfs raid10 the mirror sets are dynamic per-chunk, loss of
>> a second device close to assures loss of data, because the very likely
>> true assumption is that both mirror sets will be affected for some
>> chunks, but not others.
>
> Actually, that's not _quite_ the case.  Assuming that you have an even
> number of devices, BTRFS raid10 will currently always span all the available
> devices with two striped copies of the data (if there's an odd number, it
> spans one less than the total, and rotates which one gets left out of each
> chunk).  This means that as long as all the devices are the same size and
> you have stripes that are the full width of the array (you can end up
> with shorter ones if you have run in degraded mode or expanded the array),
> your probability of data loss per-chunk goes down as you add more devices
> (because the probability of a two device failure affecting both copies of a
> stripe in a given chunk decreases), but goes up as you add more chunks
> (because you then have to apply that probability for each individual chunk).
> Once you've lost one disk, the probability that losing another will
> compromise a specific chunk is:
> 1/(N - 1)
> Where N is the total number of devices.
> The probability that it will compromise _any_ chunk is:
> 1 - (1 - 1/(N - 1))^C
> Where C is the total number of chunks
> BTRFS raid1 mode actually has the exact same probabilities, but they apply
> even if you have an odd number of disks.

Yeah but somewhere there's a chunk that's likely affected by two
losses, with a probability much higher than for conventional raid10
where such a loss is very binary: if the loss is a mirrored pair, the
whole array and filesystem implodes; if the loss does not affect an
entire mirrored pair, the whole array survives.

The thing with Btrfs raid 10 is you can't really tell in advance to
what degree you have loss. It's not a binary condition, it has a gray
area where a lot of data can still be retrieved, but the instant you
hit missing data it's a loss, and if you hit missing metadata then the
fs will either go read only or crash, it just can't continue. So that
"walking on egg shells" behavior in a 2+ drive loss is really
different from a conventional raid10 where it's either gonna
completely work or completely fail.

-- 
Chris Murphy
--


Re: Recommended way to use btrfs for production?

2016-06-09 Thread Duncan
Chris Murphy posted on Thu, 09 Jun 2016 11:39:23 -0600 as excerpted:

> Yeah but somewhere there's a chunk that's likely affected by two losses,
> with a probability much higher than for conventional raid10 where such a
> loss is very binary: if the loss is a mirrored pair, the whole array and
> filesystem implodes; if the loss does not affect an entire mirrored
> pair, the whole array survives.
> 
> The thing with Btrfs raid 10 is you can't really tell in advance to what
> degree you have loss. It's not a binary condition, it has a gray area
> where a lot of data can still be retrieved, but the instant you hit
> missing data it's a loss, and if you hit missing metadata then the fs
> will either go read only or crash, it just can't continue. So that
> "walking on egg shells" behavior in a 2+ drive loss is really different
> from a conventional raid10 where it's either gonna completely work or
> completely fail.

Yes, thanks, CMurphy.  That's exactly what I was trying to explain. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2016-06-09 Thread Chris Murphy
On Wed, Jun 8, 2016 at 5:10 PM, Hans van Kranenburg
 wrote:
> Hi list,
>
>
> On 05/31/2016 03:36 AM, Qu Wenruo wrote:
>>
>>
>>
>> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>>>
>>> Hi,
>>>
>>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>>> somewhere that shows interesting behaviour: while no interesting disk
>>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>>
>>> A picture, telling more than 1000 words:
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>>> (when the amount of allocated/unused goes down, I did a btrfs balance)
>>
>>
>> Nice picture.
>> Really better than 1000 words.
>>
>> AFAIK, the problem may be caused by fragments.
>>
>> And even I saw some early prototypes inside the codes to allow btrfs do
>> allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
>>
>> But it's still prototype and seems no one is really working on it now.
>>
>> So when btrfs is writing new data, for example, to write about 16M data,
>> it will need to allocate a 16M continuous extent, and if it can't find
>> large enough space to allocate, then create a new data chunk.
>>
>> Despite the already awesome chunk level usage pricutre, I hope there is
>> info about extent level allocation to confirm my assumption.
>>
>> You could dump it by calling "btrfs-debug-tree -t 2 ".
>> It's normally recommended to do it unmounted, but it's still possible to
>> call it mounted, although not 100% perfect though.
>> (Then I'd better find a good way to draw a picture of
>> allocate/unallocate space and how fragments the chunks are)
>
>
> So, I finally found some spare time to continue investigating. In the
> meantime, the filesystem has happily been allocating new chunks every few
> days, filling them up way below 10% with data before starting a new one.
>
> The chunk allocation primarily seems to happen during cron.daily. But,
> manually executing all the cronjobs that are in there, even multiple times,
> does not result in newly allocated chunks. Yay. :(
>
> After the previous post, I put a little script in between every two jobs in
> /etc/cron.daily that prints the output of btrfs fi df to syslog and sleeps
> for 10 minutes so I can easily find out afterwards during which one it
> happened.
>
> Bingo! The "apt" cron.daily, which refreshes package lists and triggers
> unattended-upgrades.
>
> Jun 7 04:01:46 ichiban root: Data, single: total=12.00GiB, used=5.65GiB
> [...]
> 2016-06-07 04:01:56,552 INFO Starting unattended upgrades script
> [...]
> Jun 7 04:12:10 ichiban root: Data, single: total=13.00GiB, used=5.64GiB
>
> And, this thing is clever enough to do things once a day, even if you would
> execute it multiple times... (Hehehe...)
>
> Ok, let's try doing some apt-get update then.
>
> Today, the latest added chunks look like this:
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 115499008 used_pct 10
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length
> 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length
> 268435456 used 72663040 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length
> 1073741824 used 86986752 used_pct 8
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length
> 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length
> 1073741824 used 64032768 used_pct 5
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length
> 1073741824 used 71712768 used_pct 6
>
> Now I apt-get update...
>
> before: Data, single: total=13.00GiB, used=5.64GiB
> during: Data, single: total=13.00GiB, used=5.59GiB
> after : Data, single: total=14.00GiB, used=5.64GiB
>
> # ./show_usage.py /
> [...]
> chunk vaddr 63495471104 type 1 stripe 0 devid 1 offset 9164554240 length
> 1073741824 used 119279616 used_pct 11
> chunk vaddr 64569212928 type 1 stripe 0 devid 1 offset 12079595520 length
> 1073741824 used 36585472 used_pct 3
> chunk vaddr 65642954752 type 1 stripe 0 devid 1 offset 14227079168 length
> 1073741824 used 17510400 used_pct 1
> chunk vaddr 66716696576 type 4 stripe 0 devid 1 offset 3275751424 length
> 268435456 used 73170944 used_pct 27
> chunk vaddr 66985132032 type 1 stripe 0 devid 1 offset 15300820992 length
> 1073741824 used 82251776 used_pct 7
> chunk vaddr 68058873856 type 1 stripe 0 devid 1 offset 16374562816 length
> 1073741824 used 21188608 used_pct 1
> chunk vaddr 69132615680 type 1 stripe 0 devid 1 offset 17448304640 length
> 1073741824 used 6041600 used_pct 0
> chunk vaddr 70206357504 type 1 stripe 0 devid 1 offset 18522046464 length
> 1073741824 used 46178304 

[PATCH 1/2] btrfs: plumb fs_info into btrfs_work

2016-06-09 Thread jeffm
From: Jeff Mahoney 

In order to provide an fsid for trace events, we'll need a btrfs_fs_info
pointer.  The most lightweight way to do that for btrfs_work structures
is to associate it with the __btrfs_workqueue structure.  Each queued
btrfs_work structure has a workqueue associated with it, so that's
a natural fit.  It's a privately defined structure, so we add accessors
to retrieve the fs_info pointer.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/async-thread.c | 31 +--
 fs/btrfs/async-thread.h |  6 +-
 fs/btrfs/disk-io.c  | 47 ---
 fs/btrfs/scrub.c| 10 +-
 4 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 5fb60ea..e0f071f 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -34,6 +34,10 @@
 
 struct __btrfs_workqueue {
struct workqueue_struct *normal_wq;
+
+   /* File system this workqueue services */
+   struct btrfs_fs_info *fs_info;
+
/* List head pointing to ordered work list */
struct list_head ordered_list;
 
@@ -70,6 +74,18 @@ void btrfs_##name(struct work_struct *arg)   
\
normal_work_helper(work);   \
 }
 
+struct btrfs_fs_info *
+btrfs_workqueue_owner(struct __btrfs_workqueue *wq)
+{
+   return wq->fs_info;
+}
+
+struct btrfs_fs_info *
+btrfs_work_owner(struct btrfs_work *work)
+{
+   return work->wq->fs_info;
+}
+
 BTRFS_WORK_HELPER(worker_helper);
 BTRFS_WORK_HELPER(delalloc_helper);
 BTRFS_WORK_HELPER(flush_delalloc_helper);
@@ -94,14 +110,15 @@ BTRFS_WORK_HELPER(scrubnc_helper);
 BTRFS_WORK_HELPER(scrubparity_helper);
 
 static struct __btrfs_workqueue *
-__btrfs_alloc_workqueue(const char *name, unsigned int flags, int limit_active,
-int thresh)
+__btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info, const char *name,
+   unsigned int flags, int limit_active, int thresh)
 {
struct __btrfs_workqueue *ret = kzalloc(sizeof(*ret), GFP_KERNEL);
 
if (!ret)
return NULL;
 
+   ret->fs_info = fs_info;
ret->limit_active = limit_active;
atomic_set(&ret->pending, 0);
if (thresh == 0)
@@ -143,7 +160,8 @@ __btrfs_alloc_workqueue(const char *name, unsigned int 
flags, int limit_active,
 static inline void
 __btrfs_destroy_workqueue(struct __btrfs_workqueue *wq);
 
-struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
+struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
+ const char *name,
  unsigned int flags,
  int limit_active,
  int thresh)
@@ -153,7 +171,8 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(const char 
*name,
if (!ret)
return NULL;
 
-   ret->normal = __btrfs_alloc_workqueue(name, flags & ~WQ_HIGHPRI,
+   ret->normal = __btrfs_alloc_workqueue(fs_info, name,
+ flags & ~WQ_HIGHPRI,
  limit_active, thresh);
if (!ret->normal) {
kfree(ret);
@@ -161,8 +180,8 @@ struct btrfs_workqueue *btrfs_alloc_workqueue(const char 
*name,
}
 
if (flags & WQ_HIGHPRI) {
-   ret->high = __btrfs_alloc_workqueue(name, flags, limit_active,
-   thresh);
+   ret->high = __btrfs_alloc_workqueue(fs_info, name, flags,
+   limit_active, thresh);
if (!ret->high) {
__btrfs_destroy_workqueue(ret->normal);
kfree(ret);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index ad4d064..8e52484 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -21,6 +21,7 @@
 #define __BTRFS_ASYNC_THREAD_
 #include 
 
+struct btrfs_fs_info;
 struct btrfs_workqueue;
 /* Internal use only */
 struct __btrfs_workqueue;
@@ -67,7 +68,8 @@ BTRFS_WORK_HELPER_PROTO(scrubnc_helper);
 BTRFS_WORK_HELPER_PROTO(scrubparity_helper);
 
 
-struct btrfs_workqueue *btrfs_alloc_workqueue(const char *name,
+struct btrfs_workqueue *btrfs_alloc_workqueue(struct btrfs_fs_info *fs_info,
+ const char *name,
  unsigned int flags,
  int limit_active,
  int thresh);
@@ -80,4 +82,6 @@ void btrfs_queue_work(struct btrfs_workqueue *wq,
 void btrfs_destroy_workqueue(struct btrfs_workqueue *wq);
 void btrfs_workqueue_set_max(struct btrfs_workqueue *wq, int max);
 void btrfs_set_work_high_priority(struct 

[PATCH 2/2] btrfs: prefix fsid to all trace events

2016-06-09 Thread jeffm
From: Jeff Mahoney 

When using trace events to debug a problem, it's impossible to determine
which file system generated a particular event.  This patch adds a
macro to prefix standard information to the head of a trace event.

The extent_state alloc/free events are all that's left without an
fs_info available.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/delayed-ref.c   |   9 +-
 fs/btrfs/extent-tree.c   |  10 +-
 fs/btrfs/qgroup.c|  19 +--
 fs/btrfs/qgroup.h|   9 +-
 fs/btrfs/super.c |   2 +-
 include/trace/events/btrfs.h | 282 ---
 6 files changed, 182 insertions(+), 149 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..e7b1ec0 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -606,7 +606,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
qrecord->num_bytes = num_bytes;
qrecord->old_roots = NULL;
 
-   qexisting = btrfs_qgroup_insert_dirty_extent(delayed_refs,
+   qexisting = btrfs_qgroup_insert_dirty_extent(fs_info,
+delayed_refs,
 qrecord);
if (qexisting)
kfree(qrecord);
@@ -615,7 +616,7 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
	spin_lock_init(&head_ref->lock);
	mutex_init(&head_ref->mutex);
 
-   trace_add_delayed_ref_head(ref, head_ref, action);
+   trace_add_delayed_ref_head(fs_info, ref, head_ref, action);
 
	existing = htree_insert(&delayed_refs->href_root,
				&head_ref->href_node);
@@ -682,7 +683,7 @@ add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
ref->type = BTRFS_TREE_BLOCK_REF_KEY;
full_ref->level = level;
 
-   trace_add_delayed_tree_ref(ref, full_ref, action);
+   trace_add_delayed_tree_ref(fs_info, ref, full_ref, action);
 
ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);
 
@@ -739,7 +740,7 @@ add_delayed_data_ref(struct btrfs_fs_info *fs_info,
full_ref->objectid = owner;
full_ref->offset = offset;
 
-   trace_add_delayed_data_ref(ref, full_ref, action);
+   trace_add_delayed_data_ref(fs_info, ref, full_ref, action);
 
ret = add_delayed_ref_tail_merge(trans, delayed_refs, head_ref, ref);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 689d25a..ecb68bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2194,7 +2194,7 @@ static int run_delayed_data_ref(struct btrfs_trans_handle 
*trans,
ins.type = BTRFS_EXTENT_ITEM_KEY;
 
ref = btrfs_delayed_node_to_data_ref(node);
-   trace_run_delayed_data_ref(node, ref, node->action);
+   trace_run_delayed_data_ref(root->fs_info, node, ref, node->action);
 
if (node->type == BTRFS_SHARED_DATA_REF_KEY)
parent = ref->parent;
@@ -2349,7 +2349,7 @@ static int run_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
 SKINNY_METADATA);
 
ref = btrfs_delayed_node_to_tree_ref(node);
-   trace_run_delayed_tree_ref(node, ref, node->action);
+   trace_run_delayed_tree_ref(root->fs_info, node, ref, node->action);
 
if (node->type == BTRFS_SHARED_BLOCK_REF_KEY)
parent = ref->parent;
@@ -2413,7 +2413,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
 */
BUG_ON(extent_op);
head = btrfs_delayed_node_to_head(node);
-   trace_run_delayed_ref_head(node, head, node->action);
+   trace_run_delayed_ref_head(root->fs_info, node, head,
+  node->action);
 
if (insert_reserved) {
btrfs_pin_extent(root, node->bytenr,
@@ -8316,7 +8317,8 @@ static int record_one_subtree_extent(struct 
btrfs_trans_handle *trans,
 
	delayed_refs = &trans->transaction->delayed_refs;
	spin_lock(&delayed_refs->lock);
-   if (btrfs_qgroup_insert_dirty_extent(delayed_refs, qrecord))
+   if (btrfs_qgroup_insert_dirty_extent(trans->root->fs_info,
+delayed_refs, qrecord))
kfree(qrecord);
	spin_unlock(&delayed_refs->lock);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9d4c05b..13e28d8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1453,9 +1453,10 @@ int btrfs_qgroup_prepare_account_extents(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
-struct btrfs_qgroup_extent_record
-*btrfs_qgroup_insert_dirty_extent(struct btrfs_delayed_ref_root *delayed_refs,
- struct btrfs_qgroup_extent_record *record)
+struct btrfs_qgroup_extent_record *
+btrfs_qgroup_insert_dirty_extent(struct btrfs_fs_info *fs_info,
+   

Re: [PATCH] btrfs-progs: add the error message when open fails

2016-06-09 Thread Tsutomu Itoh

Hi, David,

On 2016/06/09 22:13, David Sterba wrote:

On Thu, Jun 09, 2016 at 10:23:15AM +0900, Tsutomu Itoh wrote:

When an open inside btrfs_open_devices fails, only the following message
is displayed, so the user has no way of telling why the open failed.

  # btrfs check /dev/sdb8
  Couldn't open file system

This patch adds an error message when the open fails.


I think the message should be printed by the caller, not by the helper.


However, in this case the caller doesn't know which device failed to open.

Thanks,
Tsutomu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs-progs: add check-only option for balance

2016-06-09 Thread Ashish Samant
From: Liu Bo 

This aims to decide whether a balance can reduce the number of
data block groups and, if it can, prints the objectid of the block
group to pass to '-dvrange'.

With this, you can run
'btrfs balance start -c mnt' or 'btrfs balance start --check-only mnt'

 --
$ btrfs balance start -c /mnt/btrfs
Checking data block groups...
block_group        12582912 (len     8388608 used      786432)
block_group  1103101952 (len  1073741824 used   536870912)
block_group  2176843776 (len  1073741824 used  1073741824)
total bgs 3 total_free 544473088 min_used bg 12582912 has (min_used 786432 free 7602176)
run 'btrfs balance start -dvrange=12582912..12582913 your_mnt'

$ btrfs balance start -dvrange=12582912..12582913 /mnt/btrfs
Done, had to relocate 1 out of 5 chunks

$ btrfs balance start -c /mnt/btrfs
Checking data block groups...
block_group  1103101952 (len  1073741824 used   537395200)
block_group  2176843776 (len  1073741824 used  1073741824)
total bgs 2 total_free 536346624 min_used bg 1103101952 has (min_used 537395200 free 536346624)
 --

So you now know how to babysit your btrfs in a smart way.
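
(For comparison, and not part of this patch: the existing usage filter gets
a similar effect by relocating only mostly-empty data block groups, without
inspecting them first:)

  # relocate only data block groups that are less than 10% used
  btrfs balance start -dusage=10 /mnt/btrfs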

Signed-off-by: Liu Bo 
Signed-off-by: Ashish Samant 
---
 cmds-balance.c |  127 +++-
 1 files changed, 126 insertions(+), 1 deletions(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 8f3bf5b..e2aab6c 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -493,6 +493,116 @@ out:
return ret;
 }
 
+/* return 0 if balance can remove a data block group, otherwise return 1 */
+static int search_data_bgs(const char *path)
+{
+   struct btrfs_ioctl_search_args args;
+   struct btrfs_ioctl_search_key *sk;
+   struct btrfs_ioctl_search_header *header;
+   struct btrfs_block_group_item *bg;
+   unsigned long off = 0;
+   DIR *dirstream = NULL;
+   int e;
+   int fd;
+   int i;
+   u64 total_free = 0;
+   u64 min_used = (u64)-1;
+   u64 free_of_min_used = 0;
+   u64 bg_of_min_used = 0;
+   u64 flags;
+   u64 used;
+   int ret = 0;
+   int nr_data_bgs = 0;
+
+	fd = btrfs_open_dir(path, &dirstream, 1);
+   if (fd < 0)
+   return 1;
+
+	memset(&args, 0, sizeof(args));
+	sk = &args.key;
+
+   sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
+   sk->min_objectid = sk->min_offset = sk->min_transid = 0;
+   sk->max_objectid = sk->max_offset = sk->max_transid = (u64)-1;
+   sk->max_type = sk->min_type = BTRFS_BLOCK_GROUP_ITEM_KEY;
+   sk->nr_items = 65536;
+
+   while (1) {
+		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+   e = errno;
+   if (ret < 0) {
+   fprintf(stderr, "ret %d error '%s'\n", ret,
+   strerror(e));
+   return ret;
+   }
+   /*
+* it should not happen.
+*/
+   if (sk->nr_items == 0)
+   break;
+
+   off = 0;
+   for (i = 0; i < sk->nr_items; i++) {
+   header = (struct btrfs_ioctl_search_header *)(args.buf
+ + off);
+
+   off += sizeof(*header);
+   if (header->type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
+   bg = (struct btrfs_block_group_item *)
+   (args.buf + off);
+   flags = btrfs_block_group_flags(bg);
+   if (flags & BTRFS_BLOCK_GROUP_DATA) {
+   nr_data_bgs++;
+   used = btrfs_block_group_used(bg);
+   printf(
+				"block_group %15llu (len %11llu used %11llu)\n",
+   header->objectid,
+   header->offset, used);
+   total_free += header->offset - used;
+   if (min_used >= used) {
+   min_used = used;
+   free_of_min_used =
+   header->offset - used;
+   bg_of_min_used =
+   header->objectid;
+   }
+   }
+   }
+
+   off += header->len;
+   sk->min_objectid = header->objectid;
+   sk->min_type = header->type;
+   

Re: How to map extents to files

2016-06-09 Thread Nikolaus Rath
On Jun 01 2016, Nikolaus Rath  wrote:
> Hello,
>
> For one of my btrfs volumes, btrfsck reports a lot of the following
> warnings:
>
> [...]
> checking extents
> bad extent [138477568, 138510336), type mismatch with chunk
> bad extent [140091392, 140148736), type mismatch with chunk
> bad extent [140148736, 140201984), type mismatch with chunk
> bad extent [140836864, 140865536), type mismatch with chunk
> [...]
>
> Is there a way to discover which files are affected by this (in
> particular so that I can take a look at them before and after a btrfsck
> --repair)?

Hmm. Anyone? Is there really no way to figure out what might be affected
if I attempt a repair?
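
(An aside, in case it helps: assuming the numbers in the "bad extent" lines
are logical byte addresses, btrfs inspect-internal can usually map them back
to the files that reference them. A rough sketch, with /mnt as a placeholder
for the mount point:)

  # resolve each reported logical address to the file paths using it;
  # the addresses here are just the ones from the example output above
  for addr in 138477568 140091392 140148736 140836864; do
      btrfs inspect-internal logical-resolve "$addr" /mnt
  done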


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fsck: to repair or not to repair

2016-06-09 Thread Nikolaus Rath
On May 11 2016, Nikolaus Rath  wrote:
> Hello,
>
> I recently ran btrfsck on one of my file systems, and got the following
> messages:
>
> checking extents
> checking free space cache
> checking fs roots
> root 5 inode 3149867 errors 400, nbytes wrong
> root 5 inode 3150237 errors 400, nbytes wrong
> root 5 inode 3150238 errors 400, nbytes wrong
> root 5 inode 3150242 errors 400, nbytes wrong
> root 5 inode 3150260 errors 400, nbytes wrong
> [ lots of similar message with different inode numbers ]
> root 5 inode 15595011 errors 400, nbytes wrong
> root 5 inode 15595016 errors 400, nbytes wrong
> Checking filesystem on /dev/mapper/vg0-nikratio_crypt
> UUID: 8742472d-a9b0-4ab6-b67a-5d21f14f7a38
> found 263648960636 bytes used err is 1
> total csum bytes: 395314372
> total tree bytes: 908644352
> total fs tree bytes: 352735232
> total extent tree bytes: 95039488
> btree space waste bytes: 156301160
> file data blocks allocated: 675209801728
>  referenced 410351722496
> Btrfs v3.17
>
>
> Can someone explain to me the risk that I run by attempting a repair,
> and (conversely) what I put at stake when continuing to use this file
> system as-is?
 
To follow up on this: after finding out which files were affected (using
btrfs inspect-internal), I was able to fix the problem without using
btrfsck, by simply copying the data out, deleting the affected files, and
restoring them:

cat affected-files.txt | while read -r name; do
rsync -a "${name}" "/backup/location/${name}"
rm -f "${name}"
cp -a "/backup/location/${name}" "${name}"
done

(I used rsync to avoid cp making use of reflinks.) After this procedure,
btrfsck reported no more problems.


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Allocator behaviour during device delete

2016-06-09 Thread Brendan Hide



On 06/09/2016 03:07 PM, Austin S. Hemmelgarn wrote:

On 2016-06-09 08:34, Brendan Hide wrote:

Hey, all

I noticed this odd behaviour while migrating from a 1TB spindle to SSD
(in this case on a LUKS-encrypted 200GB partition) - and am curious if
this behaviour I've noted below is expected or known. I figure it is a
bug. Depending on the situation, it *could* be severe. In my case it was
simply annoying.

---
Steps

After having added the new device (btrfs dev add), I deleted the old
device (btrfs dev del)

Then, whilst waiting for that to complete, I started a watch of "btrfs
fi show /". Note that the below is very close to the output at the time
- but is not actually copy/pasted from the output.


Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
Total devices 2 FS bytes used 115.03GiB
devid1 size 0.00GiB used 298.06GiB path /dev/sda2
devid2 size 200.88GiB used 0.00GiB path
/dev/mapper/cryptroot



devid1 is the old disk while devid2 is the new SSD

After a few minutes, I saw that the numbers have changed - but that the
SSD still had no data:


Label: 'tricky-root'  uuid: bcbe47a5-bd3f-497a-816b-decb4f822c42
Total devices 2 FS bytes used 115.03GiB
devid1 size 0.00GiB used 284.06GiB path /dev/sda2
devid2 size 200.88GiB used 0.00GiB path
/dev/mapper/cryptroot


The "FS bytes used" amount was changing a lot - but mostly stayed near
the original total, which is expected since there was very little
happening other than the "migration".

I'm not certain of the exact point where it started using the new disk's
space. I figure that may have been helpful to pinpoint. :-/

OK, I'm pretty sure I know what was going on in this case.  Your
assumption that device delete uses the balance code is correct, and that
is why you're seeing what you're seeing.  There are two key bits missing
from that picture though:
1. Balance will never allocate chunks when it doesn't need to.
2. The space usage listed in fi show is how much space is allocated to
chunks, not how much is used in those chunks.

In this case, based on what you've said, you had a lot of empty or
mostly empty chunks.  As a result, the device delete was both copying
data and consolidating free space.  If you have a lot of empty or
mostly empty chunks, it's not unusual for a device delete to look like
this until you start hitting chunks that have actual data in them.
The primary point of this behavior is that it makes it possible to
switch directly to a smaller device without having to run a balance and
a resize before replacing the device, and then resize again afterwards.
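
(As an aside, the allocated-vs-used distinction is easy to see on a live
filesystem; a sketch, with "/" as a placeholder mount point:)

  # "total" is space allocated to chunks, "used" is what the chunks
  # actually contain -- the same distinction as in 'btrfs fi show'
  btrfs filesystem df /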


Thanks, Austin. Your explanation is along the lines of my own thinking, though.

The new disk should have had *some* data written to it at that point, as
the old device started out with over 600GiB allocated (I should probably
have mentioned that already). Consolidating or not, I would consider data
being written to the old disk to be a bug, even if only a minor one.


I'll set up a reproducible test later today to prove/disprove the theory. :)
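
A rough sketch of such a test using loop devices (sizes, paths and device
names are arbitrary, and this is untested as written):

  # create a "big" and a "small" backing file and attach loop devices
  truncate -s 10G /tmp/big.img
  truncate -s 3G  /tmp/small.img
  big=$(losetup --find --show /tmp/big.img)
  small=$(losetup --find --show /tmp/small.img)

  # single-device fs on the big device, with some data on it
  mkdir -p /mnt/test
  mkfs.btrfs -f "$big"
  mount "$big" /mnt/test
  dd if=/dev/zero of=/mnt/test/fill bs=1M count=2048

  # add the small device, then delete the big one while watching
  # where new chunk allocations land
  btrfs device add "$small" /mnt/test
  btrfs device delete "$big" /mnt/test &
  watch -n1 btrfs filesystem show /mnt/test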

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Btrfs: improve performance on fsync against new inode after rename/unlink

2016-06-09 Thread fdmanana
From: Filipe Manana 

With commit 56f23fdbb600 ("Btrfs: fix file/data loss caused by fsync after
rename and new inode") we got a simple fix for a functional issue when the
following sequence of actions is performed:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  move/rename existing file A from directory D to directory E
  create a new file named A at directory D
  fsync the new file
  power fail

The solution was to simply detect such a scenario and fall back to a full
transaction commit when we detect it. However, this turned out to have a
significant impact on throughput (and a bit on latency too) for benchmarks
using the dbench tool, which simulates real workloads from smbd (Samba)
servers. For example, on a test vm (with a debug kernel):

Unpatched:
Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms

Patched:
Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms

The patched results (this patch is applied) are similar to the results of
a kernel with the commit 56f23fdbb600 ("Btrfs: fix file/data loss caused
by fsync after rename and new inode") reverted.

This change avoids the fallback to a transaction commit and instead makes
sure all the names of the conflicting inode (the one that had a name in a
past transaction that matches the name of the new file in the same parent
directory) are logged, so that at log replay time we lose neither the
new file nor the old file, and the old file keeps the name it was renamed
to.

This also ends up avoiding a full transaction commit for a similar case
that involves an unlink instead of a rename of the old file:

  at transaction N
  create file A at directory D
  at transaction N + M (where M >= 1)
  remove file A
  create a new file named A at directory D
  fsync the new file
  power fail
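
A rough reproduction sketch of the sequences above (device, directories and
file names are placeholders; xfs_io is used here only as a convenient way
to issue the fsync):

  mkfs.btrfs -f /dev/sdc
  mount /dev/sdc /mnt

  mkdir /mnt/d /mnt/e
  touch /mnt/d/A
  sync                            # transaction N ends here

  mv /mnt/d/A /mnt/e/A            # or: rm -f /mnt/d/A  (the unlink case)
  touch /mnt/d/A                  # new file with the old name in d
  xfs_io -c "fsync" /mnt/d/A      # previously fell back to a full commit

  # <power fail here; after the next mount, log replay should leave both
  #  the new /mnt/d/A and the renamed /mnt/e/A in place>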

Signed-off-by: Filipe Manana 
---
 fs/btrfs/inode.c| 19 +++-
 fs/btrfs/tree-log.c | 85 -
 2 files changed, 95 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 1ae0fc8..6dcd38b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4194,6 +4194,7 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
int err = 0;
struct btrfs_root *root = BTRFS_I(dir)->root;
struct btrfs_trans_handle *trans;
+   u64 last_unlink_trans;
 
if (inode->i_size > BTRFS_EMPTY_DIR_SIZE)
return -ENOTEMPTY;
@@ -4216,11 +4217,27 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
if (err)
goto out;
 
+   last_unlink_trans = BTRFS_I(inode)->last_unlink_trans;
+
/* now the directory is empty */
err = btrfs_unlink_inode(trans, root, dir, d_inode(dentry),
 dentry->d_name.name, dentry->d_name.len);
-   if (!err)
+   if (!err) {
btrfs_i_size_write(inode, 0);
+   /*
+* Propagate the last_unlink_trans value of the deleted dir to
+* its parent directory. This is to prevent an unrecoverable
+* log tree in the case we do something like this:
+* 1) create dir foo
+* 2) create snapshot under dir foo
+* 3) delete the snapshot
+* 4) rmdir foo
+* 5) mkdir foo
+* 6) fsync foo or some file inside foo
+*/
+   if (last_unlink_trans >= trans->transid)
+   BTRFS_I(dir)->last_unlink_trans = last_unlink_trans;
+   }
 out:
btrfs_end_transaction(trans, root);
btrfs_btree_balance_dirty(root);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index aa8fed1..0c39afb 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4469,7 +4469,8 @@ static int btrfs_log_trailing_hole(struct 
btrfs_trans_handle *trans,
 static int btrfs_check_ref_name_override(struct extent_buffer *eb,
 const int slot,
 const struct btrfs_key *key,
-struct inode *inode)
+struct inode *inode,
+u64 *other_ino)
 {
int ret;
struct btrfs_path *search_path;
@@ -4528,7 +4529,16 @@ static int btrfs_check_ref_name_override(struct 
extent_buffer *eb,
   search_path, parent,
   name, this_name_len, 0);
if (di && !IS_ERR(di)) {
-   ret = 1;
+   struct btrfs_key di_key;
+
+   btrfs_dir_item_key_to_cpu(search_path->nodes[0],
+					  di, &di_key);
+   if (di_key.type == BTRFS_INODE_ITEM_KEY) {
+   ret = 

Re: Recommended way to use btrfs for production?

2016-06-09 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 03 Jun 2016 10:21:12 -0400 as
excerpted:

> As far as BTRFS raid10 mode in general, there are a few things that are 
> important to remember about it:
> 1. It stores exactly two copies of everything, any extra disks just add 
> to the stripe length on each copy.

I'll add one more, potentially very important, related to this one:

Btrfs raid (in any of its modes) operates on individual chunks,
*NOT* on individual devices.

What that means for btrfs raid10, in combination with the exactly-two-copies
rule above, is that it works rather differently than a standard raid10,
which can tolerate the loss of two devices as long as they're from the
same mirror set, since the other mirror set will then still be whole.
Because with btrfs raid10 the mirror sets are dynamic per-chunk, losing a
second device all but assures loss of data, because it's very likely that
for at least some chunks both copies will sit on the two failed devices,
while other chunks are unaffected.

By using a layered approach, btrfs raid1 (for its ability to correct errors
from the other copy) on top of a pair of mdraid0s, you force one of the
btrfs raid1 copies onto each of the mdraid0s. That makes allocation more
deterministic than btrfs raid10, and you can again tolerate the loss of
two devices, as long as both are in the same underlying mdraid0.

(Traditionally, raid1 on top of raid0 is called raid01, and is discouraged
compared to raid10, raid0 on top of raid1, because device failure and
replacement triggers a much more localized rebuild with raid10 than with
raid01: only the pair of devices in the affected raid1 when the raid1s sit
closest to the physical devices, versus the whole array, one raid0 copied
to the other, when the raid1 is on top.  However, btrfs raid1's data
integrity and error repair from the good mirror is generally considered
useful enough to be worth the rebuild inefficiency of the raid01 design.)

So in regard to failure tolerance, btrfs raid10 is far closer to
traditional raid5: loss of a single device is tolerated, while loss of a
second device before the repair completes generally means data loss --
there's no chance of the second failure landing on the same mirror set to
save you, as there is with traditional raid10.

Similarly, btrfs raid10 doesn't have the cleanly separated pair of mirror
sets that traditional raid10 does, so it can't tolerate losing, say, the
connection or power to one entire device bank (even when that bank is all
one mirror set) the way traditional raid10 can.

And again, doing the layered thing with btrfs raid1 on top and mdraid0 
(or whatever else) underneath gets that back for you, if you set it up 
that way, of course.
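
A rough sketch of that layered setup (device names are placeholders, and
you'd obviously size and partition to taste):

  # two mdraid0 "halves", each built from its own bank of devices
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd

  # btrfs raid1 across the two halves: one copy per mdraid0, so losing
  # both devices of the same half still leaves a complete copy
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
  mount /dev/md0 /mnt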

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html