Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Duncan
Chris Mason posted on Thu, 20 Mar 2014 01:01:35 +0000 as excerpted:

> Message-ID: 
> Accept-Language: en-US
> Content-Type: text/plain; charset=euc-kr
> Content-Transfer-Encoding: base64
> Content-Language: en-US

>>> Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes.
>>> Honestly though, it©ös just a bug in the mvs driver.  Atomic 8K
>>> allocations are doomed to fail eventually.

> The process is a btrfs worker, and the IO was started by btrfs, but the
> allocation failure is all inside the mvs driver.  There¡¯s even the
> printk in there from mvs about the allocation failing.

> -chris
> 
> [the auto-appended list signature, rendered here as a run of garbage bytes]


Chris, you might wish to take a look at this and/or have one of the FB 
techs familiar with your mail transport layers (and/or your mail client, 
and/or perhaps it's vger's list-serv bot) look at it.  That list-sig came 
thru as garbage at least here, and your "it's" and "there's" appear to 
have strange apostrophes as well.  As you can see I also included the 
headers I think might be relevant, plus the Message-ID.

The problem wasn't a big one here, but if it starts corrupting patches or 
something too...

I'd guess it has something to do with that content-type charset=euc-kr in 
the headers, which looks really strange combined with the accept-language 
and content-language both being en-US.

The other messages of yours (with an uncorrupted list sig) I checked had 
a content-type charset=iso-8859-1 header, which seems to be most common 
in English messages, anyway.  Why this one had euc-kr I don't know.

However, it's also worth noting that the third-level quoting above, also 
from you, has an "it's", and when I checked /that/ message, the 
apostrophe was a super-script-1, as it was when quoted in Marc's reply. 
But it got changed to the copyright symbol (plus something else) in 
your quote, which, at least as I'm posting the quote, is showing the 
same way.

Meanwhile, the apostrophe in the "there's" in your message's new content 
(which is thus a first-level quote in this message) is different: an i 
followed by an over-line.  I'd /guess/ that's what actually triggered the 
Korean (?) charset in your message, in order to handle that or perhaps 
some other character I missed, while your other messages are iso-8859-1.  
That in turn changed the auto-inserted list-sig into garbage, since the 
list-bot presumably inserted it as the usual iso-8859-1 while your 
message claimed to be in euc-kr.
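
(For anyone curious, the failure mode is easy to reproduce with iconv(3).
A minimal editorial sketch, not from the thread: feed bytes that were
written as iso-8859-1 to a decoder that believes they're euc-kr, where any
byte >= 0xA1 is taken as the lead byte of a two-byte Hangul pair:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* "it's" with a Latin-1 SUPERSCRIPT ONE (0xB9) as the apostrophe.
	 * In euc-kr, 0xB9 expects a trail byte in 0xA1-0xFE, so 0xB9
	 * followed by 's' (0x73) is an illegal pair. */
	char in[] = { 'i', 't', (char)0xB9, 's', 0 };
	char out[64];
	char *ip = in, *op = out;
	size_t il = strlen(in), ol = sizeof(out) - 1;
	iconv_t cd = iconv_open("UTF-8", "EUC-KR");

	if (cd == (iconv_t)-1) {
		perror("iconv_open");
		return 1;
	}
	if (iconv(cd, &ip, &il, &op, &ol) == (size_t)-1)
		perror("iconv");	/* EILSEQ: not valid euc-kr */
	*op = '\0';
	printf("decoded as euc-kr: \"%s\"\n", out);
	iconv_close(cd);
	return 0;
}

Strict decoders error out like this; sloppier ones substitute whatever
Hangul pair the bytes happen to form, which is the garbage above.)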

(It'll be interesting to see if my message, with both quotes, looks the 
same when I read it on the list via gmane, as it does when I send it, or 
if it's further garbled, and if my charset gets set to something exotic 
too.  FWIW, my client, which should be visible in my headers, is pan, via 
gmane.org's list2news service.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6 EARLY RFC] Btrfs: Get rid of whole page I/O.

2014-03-19 Thread Aneesh Kumar K.V
David Sterba  writes:

> On Tue, Mar 18, 2014 at 01:48:00PM +0630, chandan wrote:
>> The earlier patchset posted by Chandra Seethraman was to get 4k
>> blocksize to work with ppc64's 64k PAGE_SIZE.
>
> Are we talking about metadata block sizes or data block sizes?
>
>> The root node of "tree root" tree has 1957 bytes being written by
>> make_btrfs() (in btrfs-progs).  Hence I chose to do 2k blocksize for
>> the initial subpagesize-blocksize work. So with this patchset the
>> supported blocksizes would be in the range 2k-64k.
>
> So it's metadata blocks, and in this case 2k looks like the only
> allowed size that's smaller than 4k, and thus can demonstrate sub-page
> size allocations. I'm not sure if this is limiting for potential future
> extensions of metadata structures that could be larger.
>
> 2k is ok for testing purposes, but I think a 4k-page machine will hardly
> use a smaller block size, all the more so now that 16k metadata blocks are
> the default.

The goal is to remove the assumption that the supported block size is >=
the page size. The primary reason to do that is to support migration of disk
devices across different architectures. If we have a btrfs disk created
on an x86 box with a 4K data block size and a 16K metadata block size, we
should make sure that the disk can be read/written from a ppc64 box (which
has a page size of 64K). To enable easy testing and community development we
are now focusing on achieving a 2K data block size and 2K metadata block size
on x86. As you said, this will never be used in production.

To achieve that, we did the following:

*) Add offset and len to btrfs_io_bio. These are the file offset and
length covered by the bio, later used to unlock the extent io tree.

*) Make sure that submit_extent_page only submits a contiguous range of
 file offsets, i.e. if there are holes in between, we split the I/O into
 two submit_extent_page calls. This ensures that the btrfs_io_bio offset
 and len always represent one contiguous range.

Please let us know whether the above approach is acceptable.
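
A minimal sketch of the end-io side of that bookkeeping, assuming the
start_offset/len fields from the accompanying patch (schematic, not the
posted code):

/*
 * Because submit_extent_page never lets one bio span a hole,
 * [start_offset, start_offset + len) is a single contiguous file range
 * and the read end-io handler can unlock exactly that range in the
 * extent io tree.
 */
static void end_io_unlock_range(struct extent_io_tree *tree,
				struct btrfs_io_bio *io_bio)
{
	unlock_extent(tree, io_bio->start_offset,
		      io_bio->start_offset + io_bio->len - 1);
}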

 -aneesh
 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: remove unnecessary inode generation lookup in send

2014-03-19 Thread Liu Bo
On Tue, Mar 18, 2014 at 05:56:06PM +0000, Filipe David Borba Manana wrote:
> No need to search in the send tree for the generation number of the inode,
> we already have it in the recorded_ref structure passed to us.
> 

Reviewed-by: Liu Bo 

-liubo

> Signed-off-by: Filipe David Borba Manana 
> ---
>  fs/btrfs/send.c |9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 5d757ee..db4b10c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3179,7 +3179,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
>   int ret;
>   u64 ino = parent_ref->dir;
>   u64 parent_ino_before, parent_ino_after;
> - u64 new_gen, old_gen;
> + u64 old_gen;
>   struct fs_path *path_before = NULL;
>   struct fs_path *path_after = NULL;
>   int len1, len2;
> @@ -3198,12 +3198,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
>   else if (ret < 0)
>   return ret;
>  
> - ret = get_inode_info(sctx->send_root, ino, NULL, &new_gen,
> -  NULL, NULL, NULL, NULL);
> - if (ret < 0)
> - return ret;
> -
> - if (new_gen != old_gen)
> + if (parent_ref->dir_gen != old_gen)
>   return 0;
>  
>   path_before = fs_path_alloc();
> -- 
> 1.7.10.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Chris Mason


On 3/19/14, 8:20 PM, "Marc MERLIN"  wrote:

>On Thu, Mar 20, 2014 at 12:13:36AM +0000, Chris Mason wrote:
>> >Should I double it?
>> >
>> >For now, I have the copy running again, and it's been going for 8 hours
>> >without failure on the old kernel but of course that doesn't mean my
>>2TB
>> >copy will complete without hitting the bug again.
>> 
>> Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes.
>>Honestly
>> though, it¹s just a bug in the mvs driver.  Atomic 8K allocations are
>> doomed to fail eventually.
>
>Gotcha
>polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
>45056
>polgara:/mnt/btrfs_backupcopy# echo 100000 > /proc/sys/vm/min_free_kbytes
>polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
>100000
>polgara:/mnt/btrfs_backupcopy#
> 
>> The driver should either busy loop until the allocation completes
>>(really
>> not a great choice), gracefully deal with the failure (looks tricky), or
>> preallocate the space (like the rest of the block layer).
>
>Gotcha. I'll report this to the folks maintaining the Marvell driver.
>
>So just to make sure I got you right, although the page allocation failure
>was shown in btrfs, it's really the underlying Marvell driver at fault
>here,
>and there isn't really anything to change on the btrfs side, correct?

The process is a btrfs worker, and the IO was started by btrfs, but the
allocation failure is all inside the mvs driver.  There’s even the printk
in there from mvs about the allocation failing.

The only reason it’s btrfs instead of a regular process is because for
raid5/6 the rmw is farmed out to helper threads.

-chris

[the auto-appended list signature, rendered here as a run of garbage bytes]

Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds

2014-03-19 Thread Marc MERLIN
On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote:
> I tried the RAID6 implementation of btrfs and it looks like I had the
> same problem. Rebuild with "balance" worked, but when a drive was
> removed while mounted and then re-added, the chaos began. I tried it a
> few times. So when a drive fails (and this is just because of a lost
> connection or similar non-severe problem), it is necessary to wipe the
> disk first before re-adding it, so btrfs will add it as a new disk and
> not try to re-add the old one.

Good to know you got this too.

Just to confirm: did you get it to rebuild, or once a drive is lost/gets
behind, you're in degraded mode forever for those blocks?

Or were you able to balance?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Marc MERLIN
On Thu, Mar 20, 2014 at 12:13:36AM +0000, Chris Mason wrote:
> >Should I double it?
> >
> >For now, I have the copy running again, and it's been going for 8 hours
> >without failure on the old kernel but of course that doesn't mean my 2TB
> >copy will complete without hitting the bug again.
> 
> Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes.  Honestly
> though, it¹s just a bug in the mvs driver.  Atomic 8K allocations are
> doomed to fail eventually.

Gotcha
polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
45056
polgara:/mnt/btrfs_backupcopy# echo 100000 > /proc/sys/vm/min_free_kbytes
polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
100000
polgara:/mnt/btrfs_backupcopy# 
 
> The driver should either busy loop until the allocation completes (really
> not a great choice), gracefully deal with the failure (looks tricky), or
> preallocate the space (like the rest of the block layer).

Gotcha. I'll report this to the folks maintaining the Marvell driver.

So just to make sure I got you right, although the page allocation failure
was shown in btrfs, it's really the underlying Marvell driver at fault here,
and there isn't really anything to change on the btrfs side, correct?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Chris Mason


On 3/19/14, 6:37 PM, "Marc MERLIN"  wrote:

>On Wed, Mar 19, 2014 at 12:20:08PM -0400, Chris Mason wrote:
>> On 03/19/2014 11:45 AM, Marc MERLIN wrote:
>> >My server died last night during a btrfs send/receive to a btrfs raid5
>> >array
>> >
>> >Here are the logs. Is this anything known or with a possible
>>workaround?
>> >
>> >Thanks,
>> >Marc
>> >
>> >btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
>> 
>> This is an order 1 atomic allocation from the mvs driver, we really
>> should not be depending on that to get IO done.  A quick search and it
>> looks like we're allocating MVS_SLOT_BUF_SZ (8192) bytes.
>> 
>> You could try bumping the lowmem reserves.
>
>Thanks for the info.
>
>So for now, I have
>CONFIG_X86_RESERVE_LOW=64
>
>This is the option we're talking about, right?
>
>Should I double it?
>
>For now, I have the copy running again, and it's been going for 8 hours
>without failure on the old kernel but of course that doesn't mean my 2TB
>copy will complete without hitting the bug again.

Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes.  Honestly
though, it¹s just a bug in the mvs driver.  Atomic 8K allocations are
doomed to fail eventually.

The driver should either busy loop until the allocation completes (really
not a great choice), gracefully deal with the failure (looks tricky), or
preallocate the space (like the rest of the block layer).

-chris
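
[A rough sketch of the preallocation option, for illustration; the slot
bookkeeping and NR_SLOTS below are hypothetical, only the 8K buffer size
comes from the trace. Grabbing the per-slot DMA buffers with GFP_KERNEL at
probe time means the I/O path never has to do the failing order-1
GFP_ATOMIC allocation:

#define SLOT_BUF_SZ	8192	/* MVS_SLOT_BUF_SZ from the trace */
#define NR_SLOTS	64	/* hypothetical slot count */

struct slot_bufs {
	void		*cpu[NR_SLOTS];
	dma_addr_t	dma[NR_SLOTS];
};

/* Called once at probe time, where sleeping and failing are both fine. */
static int prealloc_slot_bufs(struct dma_pool *pool, struct slot_bufs *sb)
{
	int i;

	for (i = 0; i < NR_SLOTS; i++) {
		sb->cpu[i] = dma_pool_alloc(pool, GFP_KERNEL, &sb->dma[i]);
		if (!sb->cpu[i])
			return -ENOMEM;	/* fail at init, not mid-I/O */
	}
	return 0;
}

mvs_task_prep() would then take a free preallocated slot instead of
calling dma_pool_alloc(..., GFP_ATOMIC, ...) in the submission path.]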

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds

2014-03-19 Thread Marc MERLIN
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
> > Yes, although it's limited, you apparently only lose new data that was added
> > after you went into degraded mode and only if you add another drive where
> > you write more data.
> > In real life this shouldn't be too common, even if it is indeed a bug.
> 
> It's entirely plausible a drive power/data cable becomes loose, runs for hours 
> degraded before the wayward device is reseated. It'll be common enough. It's 
> definitely not OK for all of that data in the interim to vanish just because 
> the volume has resumed from degraded to normal. Two states of data, normal vs 
> degraded, is scary. It sounds like totally silent data loss. So yeah if it's 
> reproducible it's worthy of a separate bug.

Actually, what I did was more complex: I first added a drive to a degraded
array, and then re-added the drive that had been removed.
I don't know if re-adding the same drive that was removed would cause the
bug I saw.

For now, my array is back to actually trying to store the backup I had meant
for it, and the drives seem stable now that I fixed the power issue.

Does someone else want to try? :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Marc MERLIN
On Wed, Mar 19, 2014 at 12:20:08PM -0400, Chris Mason wrote:
> On 03/19/2014 11:45 AM, Marc MERLIN wrote:
> >My server died last night during a btrfs send/receive to a btrfs raid5
> >array
> >
> >Here are the logs. Is this anything known or with a possible workaround?
> >
> >Thanks,
> >Marc
> >
> >btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
> 
> This is an order 1 atomic allocation from the mvs driver, we really 
> should not be depending on that to get IO done.  A quick search and it 
> looks like we're allocating MVS_SLOT_BUF_SZ (8192) bytes.
> 
> You could try bumping the lowmem reserves.

Thanks for the info.

So for now, I have
CONFIG_X86_RESERVE_LOW=64

This is the option we're talking about, right?

Should I double it?

For now, I have the copy running again, and it's been going for 8 hours
without failure on the old kernel but of course that doesn't mean my 2TB
copy will complete without hitting the bug again.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: take into account total references when doing backref lookup V2

2014-03-19 Thread Hugo Mills
On Wed, Mar 19, 2014 at 01:35:14PM -0400, Josef Bacik wrote:
> I added an optimization for large files where we would stop searching for
> backrefs once we had looked at the number of references we currently had for
> this extent.  This works great most of the time, but for snapshots that point
> to this extent and have changes in the original root, this assumption falls
> on its face.  So keep track of any delayed ref mods made and add in the
> actual ref count as reported by the extent item, and use that to limit how
> far down an inode we'll search for extents.  Thanks,
> 
> Reportedy-by: Hugo Mills 

Reported-by: Hugo Mills 

> Signed-off-by: Josef Bacik 

Tested-by: Hugo Mills 

   Looks like it's worked. (Modulo the above typo in the metadata ;) )
I'll do a more complete test overnight.

   Hugo.

> ---
> V1->V2: Just use the extent ref count and any delayed ref counts, this will
> work out right, whereas the shared thing doesn't work out in some cases.
> 
>  fs/btrfs/backref.c | 29 ++---
>  1 file changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 0be0e94..10db21f 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -220,7 +220,8 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id,
>  
>  static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
>  struct ulist *parents, struct __prelim_ref *ref,
> -int level, u64 time_seq, const u64 *extent_item_pos)
> +int level, u64 time_seq, const u64 *extent_item_pos,
> +u64 total_refs)
>  {
>   int ret = 0;
>   int slot;
> @@ -249,7 +250,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
>   if (path->slots[0] >= btrfs_header_nritems(path->nodes[0]))
>   ret = btrfs_next_old_leaf(root, path, time_seq);
>  
> - while (!ret && count < ref->count) {
> + while (!ret && count < total_refs) {
>   eb = path->nodes[0];
>   slot = path->slots[0];
>  
> @@ -306,7 +307,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
> struct btrfs_path *path, u64 time_seq,
> struct __prelim_ref *ref,
> struct ulist *parents,
> -   const u64 *extent_item_pos)
> +   const u64 *extent_item_pos, u64 total_refs)
>  {
>   struct btrfs_root *root;
>   struct btrfs_key root_key;
> @@ -364,7 +365,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
>   }
>  
>   ret = add_all_parents(root, path, parents, ref, level, time_seq,
> -   extent_item_pos);
> +   extent_item_pos, total_refs);
>  out:
>   path->lowest_level = 0;
>   btrfs_release_path(path);
> @@ -377,7 +378,7 @@ out:
>  static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>  struct btrfs_path *path, u64 time_seq,
>  struct list_head *head,
> -const u64 *extent_item_pos)
> +const u64 *extent_item_pos, u64 total_refs)
>  {
>   int err;
>   int ret = 0;
> @@ -403,7 +404,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>   if (ref->count == 0)
>   continue;
>   err = __resolve_indirect_ref(fs_info, path, time_seq, ref,
> -  parents, extent_item_pos);
> +  parents, extent_item_pos,
> +  total_refs);
>   /*
>* we can only tolerate ENOENT,otherwise,we should catch error
>* and return directly.
> @@ -560,7 +562,7 @@ static void __merge_refs(struct list_head *head, int mode)
>   * smaller or equal that seq to the list
>   */
>  static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
> -   struct list_head *prefs)
> +   struct list_head *prefs, u64 *total_refs)
>  {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
>   struct rb_node *n = &head->node.rb_node;
> @@ -596,6 +598,7 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
>   default:
>   BUG_ON(1);
>   }
> + *total_refs += (node->ref_mod * sgn);
>   switch (node->type) {
>   case BTRFS_TREE_BLOCK_REF_KEY: {
>   struct btrfs_delayed_tree_ref *ref;
> @@ -656,7 +659,8 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
>   */
>  static int __add_inline_refs(struct btrfs_fs_info *fs_info,
>   

Re: [PATCH] Btrfs: fix a crash of clone with inline extents's split

2014-03-19 Thread David Sterba
On Tue, Mar 18, 2014 at 06:55:13PM +0800, Liu Bo wrote:
> On Mon, Mar 17, 2014 at 03:41:31PM +0100, David Sterba wrote:
> > There are enough EINVALs that verify correctness of the input
> > parameters and it's not always clear which one fails. The EOPNOTSUPP
> > errorcode is close to the true reason of the failure, but it could be
> > misinterpreted as if the whole clone operation is not supported, so it's
> > not entirely correct, but IMO better than EINVAL.
> 
> Yep, I was hesitating on these two errors while making the patch, but I
> prefer EINVAL rather than EOPNOTSUPP because of the reason you've stated.
> 
> I think it'd be good to add one more btrfs_printk message to clarify what's
> happening here, agree?

I don't think a printk is the right thing here: it means that if an
error happens, somebody has to look into the log to see what happened and
act accordingly.

The EOPNOTSUPP errorcode would allow an application to do a fallback
action, i.e. copy the data instead of cloning - the same as if the clone
ioctl did not exist at all.

EINVAL says "you didn't give me valid arguments to work with".
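
[As a sketch of the fallback being described, editorial: the ioctl number
below is BTRFS_IOCTL_MAGIC (0x94), nr 9, from the kernel headers of the
era; a real application would use its own kernel headers instead.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <unistd.h>

#ifndef BTRFS_IOC_CLONE
#define BTRFS_IOC_CLONE	_IOW(0x94, 9, int)
#endif

/* Try to clone src_fd into dst_fd; on EOPNOTSUPP fall back to a plain
 * byte copy, exactly as if the clone ioctl did not exist at all. */
static int clone_or_copy(int src_fd, int dst_fd)
{
	char buf[65536];
	ssize_t n;

	if (ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd) == 0)
		return 0;
	if (errno != EOPNOTSUPP)
		return -1;	/* EINVAL etc.: report, don't paper over */

	while ((n = read(src_fd, buf, sizeof(buf))) > 0)
		if (write(dst_fd, buf, (size_t)n) != n)
			return -1;
	return n < 0 ? -1 : 0;
}

This is exactly the property being weighed here: EOPNOTSUPP lets the
caller degrade to a copy, while EINVAL tells it the arguments were bad.]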
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6 EARLY RFC] Btrfs: Get rid of whole page I/O.

2014-03-19 Thread David Sterba
On Tue, Mar 18, 2014 at 01:48:00PM +0630, chandan wrote:
> The earlier patchset posted by Chandra Seethraman was to get 4k
> blocksize to work with ppc64's 64k PAGE_SIZE.

Are we talking about metadata block sizes or data block sizes?

> The root node of "tree root" tree has 1957 bytes being written by
> make_btrfs() (in btrfs-progs).  Hence I chose to do 2k blocksize for
> the initial subpagesize-blocksize work. So with this patchset the
> supported blocksizes would be in the range 2k-64k.

So it's metadata blocks, and in this case 2k looks like the only
allowed size that's smaller than 4k, and thus can demonstrate sub-page
size allocations. I'm not sure if this is limiting for potential future
extensions of metadata structures that could be larger.

2k is ok for testing purposes, but I think a 4k-page machine will hardly
use a smaller block size, all the more so now that 16k metadata blocks are
the default.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Chris Mason

On 03/19/2014 11:45 AM, Marc MERLIN wrote:

> My server died last night during a btrfs send/receive to a btrfs raid5 array
>
> Here are the logs. Is this anything known or with a possible workaround?
>
> Thanks,
> Marc
>
> btrfs-rmw-2: page allocation failure: order:1, mode:0x8020


This is an order 1 atomic allocation from the mvs driver, we really 
should not be depending on that to get IO done.  A quick search and it 
looks like we're allocating MVS_SLOT_BUF_SZ (8192) bytes.


You could try bumping the lowmem reserves.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: take into account total references when doing backref lookup V2

2014-03-19 Thread Josef Bacik
I added an optimization for large files where we would stop searching for
backrefs once we had looked at the number of references we currently had for
this extent.  This works great most of the time, but for snapshots that point to
this extent and have changes in the original root, this assumption falls on its
face.  So keep track of any delayed ref mods made and add in the actual ref
count as reported by the extent item, and use that to limit how far down an inode
we'll search for extents.  Thanks,

Reportedy-by: Hugo Mills 
Signed-off-by: Josef Bacik 
---
V1->V2: Just use the extent ref count and any delayed ref counts, this will work
out right, whereas the shared thing doesn't work out in some cases.
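
[Editorial sketch of the new bound: the walk is limited by the on-disk ref
count from the extent item plus the signed sum of pending delayed-ref
modifications, instead of by ref->count alone; leaf/ei/delayed_mods are
schematic names:

	u64 total_refs = btrfs_extent_refs(leaf, ei);	/* on-disk count */

	total_refs += delayed_mods;	/* accumulated node->ref_mod * sgn */

	while (!ret && count < total_refs) {	/* was: count < ref->count */
		/* walk file extent items collecting parents, as before */
	}
]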

 fs/btrfs/backref.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 0be0e94..10db21f 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -220,7 +220,8 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id,
 
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
   struct ulist *parents, struct __prelim_ref *ref,
-  int level, u64 time_seq, const u64 *extent_item_pos)
+  int level, u64 time_seq, const u64 *extent_item_pos,
+  u64 total_refs)
 {
int ret = 0;
int slot;
@@ -249,7 +250,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
if (path->slots[0] >= btrfs_header_nritems(path->nodes[0]))
ret = btrfs_next_old_leaf(root, path, time_seq);
 
-   while (!ret && count < ref->count) {
+   while (!ret && count < total_refs) {
eb = path->nodes[0];
slot = path->slots[0];
 
@@ -306,7 +307,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
  struct btrfs_path *path, u64 time_seq,
  struct __prelim_ref *ref,
  struct ulist *parents,
- const u64 *extent_item_pos)
+ const u64 *extent_item_pos, u64 total_refs)
 {
struct btrfs_root *root;
struct btrfs_key root_key;
@@ -364,7 +365,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
}
 
ret = add_all_parents(root, path, parents, ref, level, time_seq,
- extent_item_pos);
+ extent_item_pos, total_refs);
 out:
path->lowest_level = 0;
btrfs_release_path(path);
@@ -377,7 +378,7 @@ out:
 static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
   struct btrfs_path *path, u64 time_seq,
   struct list_head *head,
-  const u64 *extent_item_pos)
+  const u64 *extent_item_pos, u64 total_refs)
 {
int err;
int ret = 0;
@@ -403,7 +404,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
if (ref->count == 0)
continue;
err = __resolve_indirect_ref(fs_info, path, time_seq, ref,
-parents, extent_item_pos);
+parents, extent_item_pos,
+total_refs);
/*
 * we can only tolerate ENOENT,otherwise,we should catch error
 * and return directly.
@@ -560,7 +562,7 @@ static void __merge_refs(struct list_head *head, int mode)
  * smaller or equal that seq to the list
  */
 static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
- struct list_head *prefs)
+ struct list_head *prefs, u64 *total_refs)
 {
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
struct rb_node *n = &head->node.rb_node;
@@ -596,6 +598,7 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
default:
BUG_ON(1);
}
+   *total_refs += (node->ref_mod * sgn);
switch (node->type) {
case BTRFS_TREE_BLOCK_REF_KEY: {
struct btrfs_delayed_tree_ref *ref;
@@ -656,7 +659,8 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
  */
 static int __add_inline_refs(struct btrfs_fs_info *fs_info,
 struct btrfs_path *path, u64 bytenr,
-int *info_level, struct list_head *prefs)
+int *info_level, struct list_head *prefs,
+u64 *total_refs)
 {
int ret = 0;
int slot;
@@ -680,6 +684,7 @@ static int __add_inline_refs(struct b

Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds

2014-03-19 Thread Chris Murphy

On Mar 19, 2014, at 9:40 AM, Marc MERLIN  wrote:
> 
> After adding a drive, I couldn't quite tell if it was striping over 11
> > drives or 10, but it felt that at least at times, it was striping over 11
> drives with write failures on the missing drive.
> I can't prove it, but I'm thinking the new data I was writing was being
> striped in degraded mode.

Well it does sound fragile after all to add a drive to a degraded array, 
especially when it's not expressly treating the faulty drive as faulty. I think 
iotop will show what block devices are being written to. And in a VM it's easy 
(albeit rudimentary) with sparse files, as you can see them grow.

> 
> Yes, although it's limited, you apparently only lose new data that was added
> after you went into degraded mode and only if you add another drive where
> you write more data.
> In real life this shouldn't be too common, even if it is indeed a bug.

It's entirely plausible a drive power/data cable becomes loose, runs for hours 
degraded before the wayward device is reseated. It'll be common enough. It's 
definitely not OK for all of that data in the interim to vanish just because 
the volume has resumed from degraded to normal. Two states of data, normal vs 
degraded, is scary. It sounds like totally silent data loss. So yeah if it's 
reproducible it's worthy of a separate bug.


Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] Btrfs: subpagesize-blocksize: Get rid of whole page reads.

2014-03-19 Thread chandan Rajendra
> These should be put in front of "struct bio bio",
> otherwise, it might lead to errors, according to bioset_create()'s comments,
> 
> --
> "Note that the bio must be embedded at the END of that structure always,
> or things will break badly."
> --
> 

Thank you for pointing that out. I will fix it.

Thanks,
chandan
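
[For context, the layout rule that bioset comment is pointing at, shown
schematically; the field names are illustrative, not the actual btrfs
struct:

struct btrfs_io_bio {
	u64		start_offset;	/* new members go up here... */
	u64		len;
	struct bio	bio;		/* ...because the bio must stay last:
					 * it is carved out of a bioset with
					 * this struct as front pad, and the
					 * inline bio_vecs live past the end
					 * of struct bio */
};
]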

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

2014-03-19 Thread Marc MERLIN
My server died last night during a btrfs send/receive to a btrfs raid5 array

Here are the logs. Is this anything known or with a possible workaround?

Thanks,
Marc

btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
CPU: 1 PID: 12499 Comm: btrfs-rmw-2 Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007
  88000549d780 816090b3 
 88000549d808 811037b0 0001fffe 88007ff7ce00
  0002 0030 88007ff7ce00
Call Trace:
 [] dump_stack+0x4e/0x7a
 [] warn_alloc_failed+0x111/0x125
 [] __alloc_pages_nodemask+0x707/0x854
 [] ? dma_generic_alloc_coherent+0xa7/0x11c
 [] dma_generic_alloc_coherent+0xa7/0x11c
 [] dma_pool_alloc+0x10a/0x1cb
 [] mvs_task_prep+0x192/0xa42 [mvsas]
 [] ? blkg_path.isra.80.constprop.90+0x17/0x38
 [] ? cache_alloc+0x1c/0x29b
 [] mvs_task_exec.isra.9+0x5d/0xc9 [mvsas]
 [] mvs_queue_command+0x3d/0x29b [mvsas]
 [] ? kmem_cache_alloc+0xe3/0x161
 [] sas_ata_qc_issue+0x1cd/0x235 [libsas]
 [] ata_qc_issue+0x291/0x2f1
 [] ? ata_scsiop_mode_sense+0x29c/0x29c
 [] __ata_scsi_queuecmd+0x184/0x1e0
 [] ata_sas_queuecmd+0x31/0x4d
 [] sas_queuecommand+0x98/0x1fe [libsas]
 [] scsi_dispatch_cmd+0x14f/0x22e
 [] scsi_request_fn+0x4da/0x507
 [] ? blk_recount_segments+0x1e/0x2e
 [] __blk_run_queue_uncond+0x22/0x2b
 [] __blk_run_queue+0x19/0x1b
 [] blk_queue_bio+0x23f/0x256
 [] generic_make_request+0x9c/0xdb
 [] submit_bio+0x112/0x131
 [] rmw_work+0x112/0x162
 [] worker_loop+0x168/0x4d8
 [] ? btrfs_queue_worker+0x283/0x283
 [] kthread+0xae/0xb6
 [] ? __kthread_parkme+0x61/0x61
 [] ret_from_fork+0x7c/0xb0
 [] ? __kthread_parkme+0x61/0x61
Mem-Info:
Node 0 DMA per-cpu:
CPU0: hi:0, btch:   1 usd:   0
CPU1: hi:0, btch:   1 usd:   0
Node 0 DMA32 per-cpu:
CPU0: hi:  186, btch:  31 usd: 171
CPU1: hi:  186, btch:  31 usd: 190
active_anon:17298 inactive_anon:21061 isolated_anon:0
 active_file:67491 inactive_file:94189 isolated_file:32
 unevictable:1260 dirty:38914 writeback:49596 unstable:0
 free:15999 slab_reclaimable:8198 slab_unreclaimable:9741
 mapped:12981 shmem:1661 pagetables:2711 bounce:0
 free_cma:0
Node 0 DMA free:8084kB min:348kB low:432kB high:520kB active_anon:360kB 
inactive_anon:764kB active_file:288kB inactive_file:2040kB unevictable:100kB 
isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB 
mlocked:100kB dirty:0kB writeback:1272kB mapped:252kB shmem:8kB 
slab_reclaimable:168kB slab_unreclaimable:336kB kernel_stack:88kB 
pagetables:128kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1987 1987 1987
Node 0 DMA32 free:56080kB min:44704kB low:55880kB high:67056kB 
active_anon:68832kB inactive_anon:83480kB active_file:269676kB 
inactive_file:374588kB unevictable:4940kB isolated(anon):0kB 
isolated(file):128kB present:2080256kB managed:2039064kB mlocked:4940kB 
dirty:155668kB writeback:197112kB mapped:51672kB shmem:6636kB 
slab_reclaimable:32624kB slab_unreclaimable:38628kB kernel_stack:2912kB 
pagetables:10716kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB 
pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 85*4kB (UEM) 22*8kB (UEM) 62*16kB (UEM) 6*32kB (UM) 2*64kB (UE) 
5*128kB (UEM) 6*256kB (UEM) 4*512kB (EM) 0*1024kB 1*2048kB (R) 0*4096kB = 8100kB
Node 0 DMA32: 13004*4kB (M) 16*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 
0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 56240kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
164139 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 9255932kB
Total swap = 9255932kB
524058 pages RAM
0 pages HighMem/MovableOnly
10298 pages reserved
0 pages hwpoisoned
mvsas :01:00.0: mvsas prep failed[0]!
btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
CPU: 1 PID: 12499 Comm: btrfs-rmw-2 Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007
  88000549d690 816090b3 
 88000549d718 811037b0 0001fffe 88000549d6c8
 8160e9c4 0002 0030 88007ff7ce00
Call Trace:
 [] dump_stack+0x4e/0x7a
 [] warn_alloc_failed+0x111/0x125
 [] ? _raw_spin_trylock+0x20/0x50
 [] __alloc_pages_nodemask+0x707/0x854
 [] ? console_unlock+0x2f6/0x302
 [] dma_generic_alloc_coherent+0xa7/0x11c
 [] dma_pool_alloc+0x10a/0x1cb
 [] mvs_task_prep+0x192/0xa42 [mvsas]
 [] ? get_page_from_freelist+0x549/0x71d
 [] ? cache_alloc+0x1c/0x29b
 [] mvs_task_exec.isra.9+0x5d/0xc9 [mvsas]
 [] mvs_queue_command+0x3d/0x29b [mvsas]
 [] ? kmem_cache_alloc+0xe3/0x161
 [] sas_ata_qc_issue+0x1cd/0x235 [libsas]
 [] ata_qc_issue+0x291/0x2f1
 [] ? ata_scsiop_mode_sense+0x29c/0x29c
 [] __ata_scsi_queuecmd+0x184/0x1e0
 [] ata_sas_queuecmd+0x31/0x4d
 [

Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds

2014-03-19 Thread Marc MERLIN
On Wed, Mar 19, 2014 at 12:32:55AM -0600, Chris Murphy wrote:
> 
> On Mar 19, 2014, at 12:09 AM, Marc MERLIN  wrote:
> > 
> > 7) you can remove a drive from an array, add files, and then if you plug
> >   the drive in, it apparently gets auto sucked in back in the array.
> > There is no rebuild that happens, you now have an inconsistent array where
> > one drive is not at the same level than the other ones (I lost all files I 
> > added 
> > after the drive was removed when I added the drive back).
> 
> Seems worthy of a dedicated bug report and keeping an eye on in the future, 
> not good.
 
Since it's not supposed to be working, I didn't file a bug, but I figured
it'd be good for people to know about it in the meantime.

> >> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 
> >> /mnt/btrfs_backupcopy/
> >> polgara:/mnt/btrfs_backupcopy# df -h .
> >> Filesystem  Size  Used Avail Use% Mounted on
> >> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
> > 
> > Oh look it's bigger now. We need to manual rebalance to use the new drive:
> 
> You don't have to. As soon as you add the additional drive, newly allocated 
> chunks will stripe across all available drives. e.g. 1 GB allocations striped 
> across 3x drives, if I add a 4th drive, initially any additional writes are 
> only to the first three drives but once a new data chunk is allocated it gets 
> striped across 4 drives.
 
That's the thing though. If the bad device hadn't been forcibly removed
(and apparently the only way to do this was to unmount, make the device
node disappear, and remount in degraded mode), it looked to me like btrfs
was still considering that the drive was part of the array and trying to
write to it.
After adding a drive, I couldn't quite tell if it was striping over 11
drives or 10, but it felt that at least at times, it was striping over 11
drives with write failures on the missing drive.
I can't prove it, but I'm thinking the new data I was writing was being
striped in degraded mode.

> Sure the whole thing isn't corrupt. But if anything written while degraded 
> vanishes once the missing device is reattached, and you remount normally 
> (non-degraded), that's data loss. Yikes!

Yes, although it's limited, you apparently only lose new data that was added
after you went into degraded mode and only if you add another drive where
you write more data.
In real life this shouldn't be too common, even if it is indeed a bug.

Cheers,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4] Btrfs: part 2, fix incremental send's decision to delay a dir move/rename

2014-03-19 Thread Filipe David Borba Manana
For an incremental send, fix the process of determining whether the directory
inode we're currently processing needs to have its move/rename operation
delayed.

We were ignoring the fact that if the inode's new immediate ancestor has a
higher inode number than ours but wasn't renamed/moved, we might still need
to delay our move/rename, because some other ancestor directory higher in the
hierarchy might have an inode number higher than ours *and* was renamed/moved
too - in this case we have to wait for rename/move of that ancestor to happen
before our current directory's rename/move operation.
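
(A compressed sketch of the rule, not the patch itself: walk up the new
ancestor chain and delay our own move/rename if any ancestor with a higher
inode number is itself still waiting to be moved; parent_of() stands in
for the get_first_ref() lookup the real code does:

static int must_delay_move(struct send_ctx *sctx, u64 ino)
{
	while (ino > sctx->cur_ino) {
		if (is_waiting_for_move(sctx, ino))
			return 1;	/* our move waits on this ancestor */
		ino = parent_of(ino);	/* hypothetical helper */
	}
	return 0;
}
)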

Simple steps to reproduce this issue:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt

  $ mkdir -p /mnt/a/x1/x2
  $ mkdir /mnt/a/Z
  $ mkdir -p /mnt/a/x1/x2/x3/x4/x5

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1
  $ btrfs send /mnt/snap1 -f /tmp/base.send

  $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
  $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The incremental send caused the kernel code to enter an infinite loop when
building the path string for directory Z after its references are processed.

A more complex scenario:

  $ mkfs.btrfs -f /dev/sdd
  $ mount /dev/sdd /mnt

  $ mkdir -p /mnt/a/b/c/d
  $ mkdir /mnt/a/b/c/d/e
  $ mkdir /mnt/a/b/c/d/f
  $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
  $ mkdir /mnt/a/b/c/g
  $ mv /mnt/a/b/c/d /mnt/a/b/D2

  $ btrfs subvolume snapshot -r /mnt /mnt/snap1
  $ btrfs send /mnt/snap1 -f /tmp/base.send

  $ mkdir /mnt/a/o
  $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
  $ mv /mnt/a/b/D2 /mnt/a/b/dd
  $ mv /mnt/a/b/c /mnt/a/C2
  $ mv /mnt/a/b/dd/f /mnt/a/o/FF
  $ mv /mnt/a/b /mnt/a/o/FF/E2/BB

  $ btrfs subvolume snapshot -r /mnt /mnt/snap2
  $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana 
---

V2: Added missing error handling and fixed typo in commit message.
V3: Updated the algorithm to deal with more complex cases, hopefully all
cases are nailed down now.
V4: Pass the right generation number to add_pending_dir_move.

 fs/btrfs/send.c |   71 +++
 1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index d869079..b27b3e1 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2916,7 +2916,10 @@ static void free_waiting_dir_move(struct send_ctx *sctx,
kfree(dm);
 }
 
-static int add_pending_dir_move(struct send_ctx *sctx, u64 parent_ino)
+static int add_pending_dir_move(struct send_ctx *sctx,
+   u64 ino,
+   u64 ino_gen,
+   u64 parent_ino)
 {
struct rb_node **p = &sctx->pending_dir_moves.rb_node;
struct rb_node *parent = NULL;
@@ -2929,8 +2932,8 @@ static int add_pending_dir_move(struct send_ctx *sctx, u64 parent_ino)
if (!pm)
return -ENOMEM;
pm->parent_ino = parent_ino;
-   pm->ino = sctx->cur_ino;
-   pm->gen = sctx->cur_inode_gen;
+   pm->ino = ino;
+   pm->gen = ino_gen;
INIT_LIST_HEAD(&pm->list);
INIT_LIST_HEAD(&pm->update_refs);
RB_CLEAR_NODE(&pm->node);
@@ -3183,6 +3186,8 @@ static int wait_for_parent_move(struct send_ctx *sctx,
struct fs_path *path_before = NULL;
struct fs_path *path_after = NULL;
int len1, len2;
+   int register_upper_dirs;
+   u64 gen;
 
if (is_waiting_for_move(sctx, ino))
return 1;
@@ -3225,7 +3230,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
}
 
ret = get_first_ref(sctx->send_root, ino, &parent_ino_after,
-   NULL, path_after);
+   &gen, path_after);
if (ret == -ENOENT) {
ret = 0;
goto out;
@@ -3242,6 +3247,60 @@ static int wait_for_parent_move(struct send_ctx *sctx,
}
ret = 0;
 
+   /*
+* Ok, our new most direct ancestor has a higher inode number but
+* wasn't moved/renamed. So maybe some of the new ancestors higher in
+* the hierarchy have a higher inode number too *and* were renamed
+* or moved - in this case we need to wait for the ancestor's rename
+* or move operation before we can do the move/rename for the current
+* inode.
+*/
+   register_upper_dirs = 0;
+   ino = parent_ino_after;
+again:
+   while ((ret == 0 || register_upper_dirs) && ino > sctx->cur_ino) {
+   u64 parent_gen;
+
+   fs_path_reset(path_before);
+   fs_path_reset(path_after);
+
+   ret = get_first_ref(sctx->send_root, ino, &parent_ino_after,
+   &parent_gen, path

Re: Please help me to contribute to btrfs project

2014-03-19 Thread Ajesh js
Thank you very much Ben :)

I did go through the links you sent and got the complete details for
submitting the kernel component.

My change also includes a patch for btrfs-tools. It would be nice if you
could share the process for submitting that patch as well.

Regards,
Ajesh

On Tue, Mar 18, 2014 at 7:17 PM, Ben Gamari  wrote:
> Ajesh js  writes:
>
>> Hi,
>>
>> I have used the btrfs filesystem in one of my projects and I have
>> added a small feature to it. I feel that the same feature will be
>> useful for others too. Hence I would like to contribute the same to
>> open source.
>>
> Excellent!
>
>> If everything works fine and this feature is not already added by
>> somebody else, this will be my first contribution to the opensource &
>> I am excited to join the huge family of opensource :)
>>
>> Please help me with a precise steps to do the same.
>>
> In general the way to contribute is to send a patch for review. You
> should have a look at the code style guidelines[1] and patch submission
> guidelines[2] in the kernel tree. For nontrivial changes the patch
> should be accompanied by a cover letter describing the change and the
> motivations for any non-obvious design decisions.
>
> It is possible that your change is acceptable as-is. More likely,
> however, is that there will be some discussion and requests for
> changes. Eventually the review process will produce a merge-worthy
> patch. The first step, however, is sending something concrete for
> community review.
>
> Cheers,
>
> - Ben
>
>
> [1] 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/CodingStyle
> [2] 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Another crash when deleting a large number of snapshots

2014-03-19 Thread Liu Bo
On Wed, Mar 19, 2014 at 08:52:37AM +0100, Juan Orti Alcaine wrote:
> This kind of crash happens to me very often when I delete a large
> number (+200) of snapshots at once.
> There is very high IO for a while, and after that, the system
> freezes at intervals. I have to reboot the system to get it responsive
> again.
> 
> Versions used:
> kernel-3.13.6-200.fc20.x86_64
> btrfs-progs-3.12-1.fc20.x86_64
> 
> The log:
> http://ur1.ca/gvr3j

Not sure if this has been fixed...

Can you try btrfs-next or compile the kernel with CONFIG_DEBUG_SPINLOCK=y?

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Another crash when deleting a large number of snapshots

2014-03-19 Thread Juan Orti Alcaine
This kind of crash happens to me very often when I delete a large
number (+200) of snapshots at once.
There is very high IO for a while, and after that, the system
freezes at intervals. I have to reboot the system to get it responsive
again.

Versions used:
kernel-3.13.6-200.fc20.x86_64
btrfs-progs-3.12-1.fc20.x86_64

The log:
http://ur1.ca/gvr3j
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/6] Btrfs: subpagesize-blocksize: Get rid of whole page reads.

2014-03-19 Thread Liu Bo
On Wed, Mar 12, 2014 at 07:50:28PM +0530, Chandan Rajendra wrote:
> bio_vec->{bv_offset, bv_len} cannot be relied upon by the end bio functions
> to track the file offset range operated on by the bio. Hence this patch adds
> two new members to 'struct btrfs_io_bio' to track the file offset range.
> 
> This patch also brings back check_page_locked() to reliably unlock pages in
> readpage's end bio function.
> 
> Signed-off-by: Chandan Rajendra 
> ---
>  fs/btrfs/extent_io.c | 122 +--
>  fs/btrfs/volumes.h   |   3 ++
>  2 files changed, 82 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fbe501d..5a65aee 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1943,15 +1943,31 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
>   * helper function to set a given page up to date if all the
>   * extents in the tree for that page are up to date
>   */
> -static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
> +static void check_page_uptodate(struct extent_io_tree *tree, struct page *page,
> + struct extent_state *cached)
>  {
>   u64 start = page_offset(page);
>   u64 end = start + PAGE_CACHE_SIZE - 1;
> - if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
> + if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, cached))
>   SetPageUptodate(page);
>  }
>  
>  /*
> + * helper function to unlock a page if all the extents in the tree
> + * for that page are unlocked
> + */
> +static void check_page_locked(struct extent_io_tree *tree, struct page *page)
> +{
> + u64 start = page_offset(page);
> + u64 end = start + PAGE_CACHE_SIZE - 1;
> +
> + if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL)) {
> + unlock_page(page);
> + }
> +}
> +
> +
> +/*
>   * When IO fails, either with EIO or csum verification fails, we
>   * try other mirrors that might have a good copy of the data.  This
>   * io_failure_record is used to record state as we go through all the
> @@ -2414,16 +2430,33 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
>   bio_put(bio);
>  }
>  
> -static void
> -endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 
> len,
> -   int uptodate)
> +static void unlock_extent_and_page(struct address_space *mapping,
> +struct extent_io_tree *tree,
> +struct btrfs_io_bio *io_bio)
>  {
> - struct extent_state *cached = NULL;
> - u64 end = start + len - 1;
> + pgoff_t index;
> + u64 offset, len;
> + /*
> +  * This btrfs_io_bio may span multiple pages.
> +  * We need to unlock the pages covered by them
> +  * if we got endio callback for all the blocks in the page.
> +  * btrfs_io_bio also contains "contiguous blocks of the file"
> +  * look at submit_extent_page for more details.
> +  */
>  
> - if (uptodate && tree->track_uptodate)
> - set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
> - unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
> + offset = io_bio->start_offset;
> + len = io_bio->len;
> + unlock_extent(tree, offset, offset + len - 1);
> +
> + index = offset >> PAGE_CACHE_SHIFT;
> + while (offset < io_bio->start_offset + len) {
> + struct page *page;
> + page = find_get_page(mapping, index);
> + check_page_locked(tree, page);
> + page_cache_release(page);
> + index++;
> + offset += PAGE_CACHE_SIZE;
> + }
>  }
>  
>  /*
> @@ -2443,13 +2476,13 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>   struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
>   struct bio_vec *bvec = bio->bi_io_vec;
>   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> + struct address_space *mapping = bio->bi_io_vec->bv_page->mapping;
>   struct extent_io_tree *tree;
> + struct extent_state *cached = NULL;
>   u64 offset = 0;
>   u64 start;
>   u64 end;
>   u64 len;
> - u64 extent_start = 0;
> - u64 extent_len = 0;
>   int mirror;
>   int ret;
>  
> @@ -2482,8 +2515,8 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>   bvec->bv_offset, bvec->bv_len);
>   }
>  
> - start = page_offset(page);
> - end = start + bvec->bv_offset + bvec->bv_len - 1;
> + start = page_offset(page) + bvec->bv_offset;
> + end = start + bvec->bv_len - 1;
>   len = bvec->bv_len;
>  
>   if (++bvec <= bvec_end)
> @@ -2540,40 +2573,24 @@ readpage_ok:
>   offset = i_size & (PAGE_CACHE_SIZE-1);
>   if (page->index == end_