Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
Chris Mason posted on Thu, 20 Mar 2014 01:01:35 + as excerpted:

> Message-ID:
> Accept-Language: en-US
> Content-Type: text/plain; charset=euc-kr
> Content-Transfer-Encoding: base64
> Content-Language: en-US

>>> Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes.
>>> Honestly though, it©ös just a bug in the mvs driver. Atomic 8K
>>> allocations are doomed to fail eventually.

> The process is a btrfs worker, and the IO was started by btrfs, but the
> allocation failure is all inside the mvs driver. There¡¯s even the
> printk in there from mvs about the allocation failing.
>
> -chris
>
> N§²æìržyúèØb²X¬¶Ç§vØ^)Þº{.nÇ+·¥{±nÚß²)í æèw*jg¬±š¶Ý¢ j/êäz¹Þà2ÞšèÚ&¢)ß¡«a¶Úþø®G«éh®æj:+všwèÙ¥

Chris, you might wish to take a look at this and/or have one of the FB techs
familiar with your mail transport layers (and/or your mail client, and/or
perhaps vger's list-serv bot) look at it. That list sig came through as
garbage, at least here, and your "it's" and "there's" appear to have strange
apostrophes as well. As you can see, I also included the headers I think
might be relevant, plus the Message-ID. The problem wasn't a big one here,
but it would be worse if it started corrupting patches too.

I'd guess it has something to do with that content-type charset=euc-kr in
the headers, which looks really strange combined with the accept-language
and content-language both being en-US. The other messages of yours (with an
uncorrupted list sig) I checked had a content-type charset=iso-8859-1
header, which seems to be the most common in English messages anyway. Why
this one had euc-kr I don't know.

However, it's also worth noting that the third-level quote above, also from
you, has an "it's", and when I checked /that/ message, the apostrophe was a
superscript-1, as it was when quoted in Marc's reply, but it got changed to
the copyright symbol (plus something else) in your quote, which is showing
the same way at least as I'm posting this quote.

Meanwhile, the apostrophe in the "there's" in your message's new content
(which is thus a first-level quote in this message) is different: an i
followed by an over-line. I'd /guess/ that's what actually triggered the
Korean (?) charset in your message, in order to handle that or perhaps some
other character I missed, while your other messages are iso-8859-1. That in
turn changed the auto-inserted list sig into garbage, since the list bot
presumably inserted it as the usual iso-8859-1 while your message claimed
to be in euc-kr.

(It'll be interesting to see whether my message, with both quotes, looks the
same when I read it on the list via gmane as it does when I send it, or
whether it's further garbled, and whether my charset gets set to something
exotic too. FWIW, my client, which should be visible in my headers, is pan,
via gmane.org's list2news service.)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6 EARLY RFC] Btrfs: Get rid of whole page I/O.
David Sterba writes:
> On Tue, Mar 18, 2014 at 01:48:00PM +0630, chandan wrote:
>> The earlier patchset posted by Chandra Seethraman was to get 4k
>> blocksize to work with ppc64's 64k PAGE_SIZE.
>
> Are we talking about metadata block sizes or data block sizes?
>
>> The root node of "tree root" tree has 1957 bytes being written by
>> make_btrfs() (in btrfs-progs). Hence I chose to do 2k blocksize for
>> the initial subpagesize-blocksize work. So with this patchset the
>> supported blocksizes would be in the range 2k-64k.
>
> So it's metadata blocks, and in this case 2k looks like the only
> allowed size that's smaller than 4k, and thus can demonstrate sub-page
> size allocations. I'm not sure if this is limiting for potential future
> extensions of metadata structures that could be larger.
>
> 2k is ok for testing purposes, but I think a 4k-page machine will hardly
> use a smaller block size. All the more so since 16k metadata blocks are
> now the default.

The goal is to remove the assumption that the supported block size is >=
page size. The primary reason to do that is to support migration of disk
devices across different architectures. If we have a btrfs disk created on
an x86 box with a 4K data block size and a 16K metadata block size, we
should make sure that the disk can be read/written from a ppc64 box (which
has a page size of 64K).

To enable easy testing and community development we are now focusing on
achieving a 2K data block size and a 2K metadata block size on x86. As you
said, this will never be used in production.

To achieve that we did the below:

*) Add offset and len to btrfs_io_bio. These are file offsets and lengths,
   later used to unlock the extent io tree.

*) Make sure that submit_extent_page only submits a contiguous range of
   file offsets; i.e. if we have holes in between, we split the submission
   into two submit_extent_page calls. This ensures that the btrfs_io_bio
   offset and len represent a contiguous range.

Please let us know whether the above approach is acceptable.

-aneesh
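The second bullet above is the part that is easy to get wrong, so here is a small userspace sketch of the splitting rule. The names and types below are invented for the example (this is not the kernel's extent mapping code): a requested range is cut at every hole so that each resulting piece, and therefore each bio's recorded (offset, len), is contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the splitting rule: a submission covering
 * [start, start + len) must be cut at every hole so each piece maps to
 * one contiguous file range.  Extents are given as sorted,
 * non-overlapping [offset, offset + len) pairs. */
struct file_extent {
	unsigned long long offset;
	unsigned long long len;
};

/* Write the contiguous pieces of [start, start + len) that are backed
 * by an extent into 'out' (capacity 'cap'); return how many pieces.
 * Each piece would become its own submit_extent_page() call, so the
 * (offset, len) recorded per bio stays contiguous. */
static int split_at_holes(const struct file_extent *map, int nr,
			  unsigned long long start, unsigned long long len,
			  struct file_extent *out, int cap)
{
	unsigned long long end = start + len;
	int i, n = 0;

	for (i = 0; i < nr && n < cap; i++) {
		unsigned long long s = map[i].offset;
		unsigned long long e = s + map[i].len;

		if (e <= start || s >= end)
			continue;	/* extent outside the request */
		if (s < start)
			s = start;	/* clip to the requested range */
		if (e > end)
			e = end;
		out[n].offset = s;
		out[n].len = e - s;
		n++;
	}
	return n;
}
```

With a hole between 4k and 8k, a 12k request comes back as two pieces, i.e. two separate submissions.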
Re: [PATCH] Btrfs: remove unnecessary inode generation lookup in send
On Tue, Mar 18, 2014 at 05:56:06PM +, Filipe David Borba Manana wrote:
> No need to search in the send tree for the generation number of the inode,
> we already have it in the recorded_ref structure passed to us.

Reviewed-by: Liu Bo

-liubo

> Signed-off-by: Filipe David Borba Manana
> ---
>  fs/btrfs/send.c | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 5d757ee..db4b10c 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3179,7 +3179,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
>  	int ret;
>  	u64 ino = parent_ref->dir;
>  	u64 parent_ino_before, parent_ino_after;
> -	u64 new_gen, old_gen;
> +	u64 old_gen;
>  	struct fs_path *path_before = NULL;
>  	struct fs_path *path_after = NULL;
>  	int len1, len2;
> @@ -3198,12 +3198,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
>  	else if (ret < 0)
>  		return ret;
>
> -	ret = get_inode_info(sctx->send_root, ino, NULL, &new_gen,
> -			     NULL, NULL, NULL, NULL);
> -	if (ret < 0)
> -		return ret;
> -
> -	if (new_gen != old_gen)
> +	if (parent_ref->dir_gen != old_gen)
>  		return 0;
>
>  	path_before = fs_path_alloc();
> --
> 1.7.10.4
Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
On 3/19/14, 8:20 PM, "Marc MERLIN" wrote:

>On Thu, Mar 20, 2014 at 12:13:36AM +, Chris Mason wrote:
>> >Should I double it?
>> >
>> >For now, I have the copy running again, and it's been going for 8 hours
>> >without failure on the old kernel but of course that doesn't mean my 2TB
>> >copy will complete without hitting the bug again.
>>
>> Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes. Honestly
>> though, it's just a bug in the mvs driver. Atomic 8K allocations are
>> doomed to fail eventually.
>
>Gotcha
>polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
>45056
>polgara:/mnt/btrfs_backupcopy# echo 10 > /proc/sys/vm/min_free_kbytes
>polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
>10
>polgara:/mnt/btrfs_backupcopy#
>
>> The driver should either busy loop until the allocation completes (really
>> not a great choice), gracefully deal with the failure (looks tricky), or
>> preallocate the space (like the rest of the block layer).
>
>Gotcha. I'll report this to the folks maintaining the Marvell driver.
>
>So just to make sure I got you right, although the page allocation failure
>was shown in btrfs, it's really the underlying Marvell driver at fault
>here, and there isn't really anything to change on the btrfs side, correct?

The process is a btrfs worker, and the IO was started by btrfs, but the
allocation failure is all inside the mvs driver. There's even the printk in
there from mvs about the allocation failing.

The only reason it's btrfs instead of a regular process is because for
raid5/6 the rmw is farmed out to helper threads.

-chris
Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds
On Thu, Mar 20, 2014 at 01:44:20AM +0100, Tobias Holst wrote:
> I tried the RAID6 implementation of btrfs and it looks like I had the
> same problem. Rebuild with "balance" worked, but when a drive was
> removed while mounted and then readded, the chaos began. I tried it a
> few times. So when a drive fails (and this is just because of a lost
> connection or similar non-severe problems), it is necessary to wipe the
> disc first before readding it, so btrfs will add it as a new disk and
> not try to readd the old one.

Good to know you got this too.

Just to confirm: did you get it to rebuild, or once a drive is lost/gets
behind, are you in degraded mode forever for those blocks? Or were you able
to balance?

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
On Thu, Mar 20, 2014 at 12:13:36AM +, Chris Mason wrote:
> >Should I double it?
> >
> >For now, I have the copy running again, and it's been going for 8 hours
> >without failure on the old kernel but of course that doesn't mean my 2TB
> >copy will complete without hitting the bug again.
>
> Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes. Honestly
> though, it's just a bug in the mvs driver. Atomic 8K allocations are
> doomed to fail eventually.

Gotcha

polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
45056
polgara:/mnt/btrfs_backupcopy# echo 10 > /proc/sys/vm/min_free_kbytes
polgara:/mnt/btrfs_backupcopy# cat /proc/sys/vm/min_free_kbytes
10
polgara:/mnt/btrfs_backupcopy#

> The driver should either busy loop until the allocation completes (really
> not a great choice), gracefully deal with the failure (looks tricky), or
> preallocate the space (like the rest of the block layer).

Gotcha. I'll report this to the folks maintaining the Marvell driver.

So just to make sure I got you right: although the page allocation failure
was shown in btrfs, it's really the underlying Marvell driver at fault here,
and there isn't really anything to change on the btrfs side, correct?

Thanks,
Marc
Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
On 3/19/14, 6:37 PM, "Marc MERLIN" wrote:

>On Wed, Mar 19, 2014 at 12:20:08PM -0400, Chris Mason wrote:
>> On 03/19/2014 11:45 AM, Marc MERLIN wrote:
>> >My server died last night during a btrfs send/receive to a btrfs raid5
>> >array
>> >
>> >Here are the logs. Is this anything known or with a possible
>> >workaround?
>> >
>> >Thanks,
>> >Marc
>> >
>> >btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
>>
>> This is an order 1 atomic allocation from the mvs driver, we really
>> should not be depending on that to get IO done. A quick search and it
>> looks like we're allocating MVS_SLOT_BUF_SZ (8192) bytes.
>>
>> You could try bumping the lowmem reserves.
>
>Thanks for the info.
>
>So for now, I have
>CONFIG_X86_RESERVE_LOW=64
>
>This is the option we're talking about, right?
>
>Should I double it?
>
>For now, I have the copy running again, and it's been going for 8 hours
>without failure on the old kernel but of course that doesn't mean my 2TB
>copy will complete without hitting the bug again.

Sorry, I misspoke, you should bump /proc/sys/vm/min_free_kbytes. Honestly
though, it's just a bug in the mvs driver. Atomic 8K allocations are doomed
to fail eventually.

The driver should either busy loop until the allocation completes (really
not a great choice), gracefully deal with the failure (looks tricky), or
preallocate the space (like the rest of the block layer).

-chris
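Of the three options, the last one can be sketched in plain userspace C. This is only a model of the idea, with invented names (in the kernel this job is done by the mempool/dma_pool machinery, not code like this): the allocation that must not fail is done once at setup time, and the hot path can only hand out a preallocated slot or return NULL for the caller to defer gracefully.

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace sketch of "preallocate the space": a small pool of
 * fixed-size buffers is allocated up front, so the hot path never
 * depends on an atomic allocation succeeding. */
#define POOL_SLOTS 16
#define SLOT_SIZE  8192		/* mirrors MVS_SLOT_BUF_SZ from the report */

struct slot_pool {
	void *slots[POOL_SLOTS];
	int free;		/* number of slots still available */
};

static int pool_init(struct slot_pool *p)
{
	int i;

	p->free = 0;
	for (i = 0; i < POOL_SLOTS; i++) {
		p->slots[i] = malloc(SLOT_SIZE);
		if (!p->slots[i])
			return -1;	/* fail at setup, not in the I/O path */
		p->free++;
	}
	return 0;
}

/* Hot path: hand out a preallocated slot, or NULL so the caller can
 * defer the request instead of failing the I/O outright. */
static void *pool_get(struct slot_pool *p)
{
	return p->free > 0 ? p->slots[--p->free] : NULL;
}

static void pool_put(struct slot_pool *p, void *buf)
{
	p->slots[p->free++] = buf;
}
```

The design point is that exhaustion becomes an ordinary, recoverable condition (NULL from the pool) rather than an order-1 GFP_ATOMIC allocation that is "doomed to fail eventually" under memory pressure.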
Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds
On Wed, Mar 19, 2014 at 10:53:33AM -0600, Chris Murphy wrote:
> > Yes, although it's limited, you apparently only lose new data that was
> > added after you went into degraded mode, and only if you add another
> > drive where you write more data.
> > In real life this shouldn't be too common, even if it is indeed a bug.
>
> It's entirely plausible a drive power/data cable becomes loose and the
> array runs for hours degraded before the wayward device is reseated.
> It'll be common enough. It's definitely not OK for all of that data in
> the interim to vanish just because the volume has resumed from degraded
> to normal. Two states of data, normal vs degraded, is scary. It sounds
> like totally silent data loss. So yeah, if it's reproducible it's worthy
> of a separate bug.

Actually what I did is more complex: I first added a drive to a degraded
array, and then re-added the drive that had been removed. I don't know if
re-adding the same drive that was removed would cause the bug I saw.

For now, my array is back to actually trying to store the backup I had
meant for it, and the drives seem stable now that I fixed the power issue.

Does someone else want to try? :)

Marc
Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
On Wed, Mar 19, 2014 at 12:20:08PM -0400, Chris Mason wrote:
> On 03/19/2014 11:45 AM, Marc MERLIN wrote:
> >My server died last night during a btrfs send/receive to a btrfs raid5
> >array
> >
> >Here are the logs. Is this anything known or with a possible workaround?
> >
> >Thanks,
> >Marc
> >
> >btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
>
> This is an order 1 atomic allocation from the mvs driver, we really
> should not be depending on that to get IO done. A quick search and it
> looks like we're allocating MVS_SLOT_BUF_SZ (8192) bytes.
>
> You could try bumping the lowmem reserves.

Thanks for the info.

So for now, I have
CONFIG_X86_RESERVE_LOW=64

This is the option we're talking about, right?

Should I double it?

For now, I have the copy running again, and it's been going for 8 hours
without failure on the old kernel, but of course that doesn't mean my 2TB
copy will complete without hitting the bug again.

Thanks,
Marc
Re: [PATCH] Btrfs: take into account total references when doing backref lookup V2
On Wed, Mar 19, 2014 at 01:35:14PM -0400, Josef Bacik wrote:
> I added an optimization for large files where we would stop searching for
> backrefs once we had looked at the number of references we currently had
> for this extent. This works great most of the time, but for snapshots
> that point to this extent and have changes in the original root this
> assumption falls on its face. So keep track of any delayed ref mods made
> and add in the actual ref count as reported by the extent item, and use
> that to limit how far down an inode we'll search for extents. Thanks,
>
> Reportedy-by: Hugo Mills

Reported-by: Hugo Mills

> Signed-off-by: Josef Bacik

Tested-by: Hugo Mills

Looks like it's worked. (Modulo the above typo in the metadata ;) )

I'll do a more complete test overnight.

Hugo.

> ---
> V1->V2: Just use the extent ref count and any delayed ref counts, this
> will work out right, whereas the shared thing doesn't work out in some
> cases.
>
>  fs/btrfs/backref.c | 29 ++---
>  1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 0be0e94..10db21f 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -220,7 +220,8 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id,
>
>  static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
>  			   struct ulist *parents, struct __prelim_ref *ref,
> -			   int level, u64 time_seq, const u64 *extent_item_pos)
> +			   int level, u64 time_seq, const u64 *extent_item_pos,
> +			   u64 total_refs)
>  {
>  	int ret = 0;
>  	int slot;
> @@ -249,7 +250,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
>  	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0]))
>  		ret = btrfs_next_old_leaf(root, path, time_seq);
>
> -	while (!ret && count < ref->count) {
> +	while (!ret && count < total_refs) {
>  		eb = path->nodes[0];
>  		slot = path->slots[0];
>
> @@ -306,7 +307,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
>  				  struct btrfs_path *path, u64 time_seq,
>  				  struct __prelim_ref *ref,
>  				  struct ulist *parents,
> -				  const u64 *extent_item_pos)
> +				  const u64 *extent_item_pos, u64 total_refs)
>  {
>  	struct btrfs_root *root;
>  	struct btrfs_key root_key;
> @@ -364,7 +365,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
>  	}
>
>  	ret = add_all_parents(root, path, parents, ref, level, time_seq,
> -			      extent_item_pos);
> +			      extent_item_pos, total_refs);
>  out:
>  	path->lowest_level = 0;
>  	btrfs_release_path(path);
> @@ -377,7 +378,7 @@ out:
>  static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>  				   struct btrfs_path *path, u64 time_seq,
>  				   struct list_head *head,
> -				   const u64 *extent_item_pos)
> +				   const u64 *extent_item_pos, u64 total_refs)
>  {
>  	int err;
>  	int ret = 0;
> @@ -403,7 +404,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>  		if (ref->count == 0)
>  			continue;
>  		err = __resolve_indirect_ref(fs_info, path, time_seq, ref,
> -					     parents, extent_item_pos);
> +					     parents, extent_item_pos,
> +					     total_refs);
>  		/*
>  		 * we can only tolerate ENOENT,otherwise,we should catch error
>  		 * and return directly.
> @@ -560,7 +562,7 @@ static void __merge_refs(struct list_head *head, int mode)
>   * smaller or equal that seq to the list
>   */
>  static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
> -			      struct list_head *prefs)
> +			      struct list_head *prefs, u64 *total_refs)
>  {
>  	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
>  	struct rb_node *n = &head->node.rb_node;
> @@ -596,6 +598,7 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
>  		default:
>  			BUG_ON(1);
>  		}
> +		*total_refs += (node->ref_mod * sgn);
>  		switch (node->type) {
>  		case BTRFS_TREE_BLOCK_REF_KEY: {
>  			struct btrfs_delayed_tree_ref *ref;
> @@ -656,7 +659,8 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
>   */
>  static int __add_inline_refs(struct btrfs_fs_info
Re: [PATCH] Btrfs: fix a crash of clone with inline extents's split
On Tue, Mar 18, 2014 at 06:55:13PM +0800, Liu Bo wrote:
> On Mon, Mar 17, 2014 at 03:41:31PM +0100, David Sterba wrote:
> > There are enough EINVALs that verify correctness of the input
> > parameters, and it's not always clear which one fails. The EOPNOTSUPP
> > error code is close to the true reason for the failure, but it could be
> > misinterpreted as if the whole clone operation were not supported, so
> > it's not entirely correct, but IMO better than EINVAL.
>
> Yep, I was hesitating between these two errors while making the patch,
> but I prefer EINVAL rather than EOPNOTSUPP because of the reason you've
> stated.
>
> I think it'd be good to add one more btrfs_printk message to clarify
> what's happening here, agree?

I don't think a printk is the right thing here; it means that if an error
happens, somebody has to look into the log for what happened and act
accordingly. The EOPNOTSUPP error code would allow an application to do a
fallback action, i.e. copy the data instead of cloning, the same as if the
clone ioctl did not exist at all. EINVAL says "you didn't give me valid
arguments to work with".
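The fallback David describes would look roughly like this from an application's point of view. This is a hedged sketch, not btrfs-progs code: BTRFS_IOC_CLONE is defined by hand so the example is self-contained, and only the "unsupported here" errors trigger the byte copy, while EINVAL is left to mean "bad arguments" — exactly the distinction being argued for.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Defined by hand to keep the example self-contained; this matches
 * _IOW(BTRFS_IOCTL_MAGIC, 9, int) from the kernel headers. */
#ifndef BTRFS_IOC_CLONE
#define BTRFS_IOC_CLONE _IOW(0x94, 9, int)
#endif

/* Plain byte-copy fallback: rewind the source and copy everything. */
static int copy_fallback(int src_fd, int dst_fd)
{
	char buf[65536];
	ssize_t n;

	if (lseek(src_fd, 0, SEEK_SET) < 0)
		return -1;
	while ((n = read(src_fd, buf, sizeof(buf))) > 0)
		if (write(dst_fd, buf, n) != n)
			return -1;
	return n < 0 ? -1 : 0;
}

/* Try to clone; on "not supported here" fall back to copying.  EINVAL
 * is deliberately NOT treated as "unsupported" -- per the discussion,
 * it should mean the caller passed bad arguments. */
static int clone_or_copy(int src_fd, int dst_fd)
{
	if (ioctl(dst_fd, BTRFS_IOC_CLONE, src_fd) == 0)
		return 0;
	if (errno == EOPNOTSUPP || errno == ENOTTY || errno == EXDEV)
		return copy_fallback(src_fd, dst_fd);
	return -1;
}
```

On a filesystem without reflink support the ioctl fails with EOPNOTSUPP (or ENOTTY on old kernels) and the copy path runs; on btrfs the clone itself succeeds. Either way the destination ends up with the source's contents.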
Re: [PATCH 0/6 EARLY RFC] Btrfs: Get rid of whole page I/O.
On Tue, Mar 18, 2014 at 01:48:00PM +0630, chandan wrote:
> The earlier patchset posted by Chandra Seethraman was to get 4k
> blocksize to work with ppc64's 64k PAGE_SIZE.

Are we talking about metadata block sizes or data block sizes?

> The root node of "tree root" tree has 1957 bytes being written by
> make_btrfs() (in btrfs-progs). Hence I chose to do 2k blocksize for
> the initial subpagesize-blocksize work. So with this patchset the
> supported blocksizes would be in the range 2k-64k.

So it's metadata blocks, and in this case 2k looks like the only allowed
size that's smaller than 4k, and thus can demonstrate sub-page size
allocations. I'm not sure if this is limiting for potential future
extensions of metadata structures that could be larger.

2k is ok for testing purposes, but I think a 4k-page machine will hardly
use a smaller block size. All the more so since 16k metadata blocks are now
the default.
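For what it's worth, the constraint being discussed reduces to a very small check. The helper name below is invented for illustration; it just models the stated 2k-64k, power-of-two range, with the old ">= page size" assumption dropped:

```c
#include <assert.h>

/* Hypothetical validity check for the proposed range: block sizes from
 * 2k to 64k, powers of two only, independent of the machine's PAGE_SIZE. */
static int blocksize_valid(unsigned int bs)
{
	if (bs < 2048 || bs > 65536)
		return 0;
	return (bs & (bs - 1)) == 0;	/* power of two */
}
```

So a 4k-block filesystem created on x86 stays valid on a 64k-page ppc64 box, which is the migration scenario aneesh describes.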
Re: btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
On 03/19/2014 11:45 AM, Marc MERLIN wrote:
> My server died last night during a btrfs send/receive to a btrfs raid5
> array
>
> Here are the logs. Is this anything known or with a possible workaround?
>
> Thanks,
> Marc
>
> btrfs-rmw-2: page allocation failure: order:1, mode:0x8020

This is an order 1 atomic allocation from the mvs driver, we really should
not be depending on that to get IO done. A quick search and it looks like
we're allocating MVS_SLOT_BUF_SZ (8192) bytes.

You could try bumping the lowmem reserves.

-chris
[PATCH] Btrfs: take into account total references when doing backref lookup V2
I added an optimization for large files where we would stop searching for
backrefs once we had looked at the number of references we currently had
for this extent. This works great most of the time, but for snapshots that
point to this extent and have changes in the original root this assumption
falls on its face. So keep track of any delayed ref mods made and add in
the actual ref count as reported by the extent item, and use that to limit
how far down an inode we'll search for extents. Thanks,

Reportedy-by: Hugo Mills
Signed-off-by: Josef Bacik
---
V1->V2: Just use the extent ref count and any delayed ref counts, this will
work out right, whereas the shared thing doesn't work out in some cases.

 fs/btrfs/backref.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 0be0e94..10db21f 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -220,7 +220,8 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id,

 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 			   struct ulist *parents, struct __prelim_ref *ref,
-			   int level, u64 time_seq, const u64 *extent_item_pos)
+			   int level, u64 time_seq, const u64 *extent_item_pos,
+			   u64 total_refs)
 {
 	int ret = 0;
 	int slot;
@@ -249,7 +250,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 	if (path->slots[0] >= btrfs_header_nritems(path->nodes[0]))
 		ret = btrfs_next_old_leaf(root, path, time_seq);

-	while (!ret && count < ref->count) {
+	while (!ret && count < total_refs) {
 		eb = path->nodes[0];
 		slot = path->slots[0];

@@ -306,7 +307,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 				  struct btrfs_path *path, u64 time_seq,
 				  struct __prelim_ref *ref,
 				  struct ulist *parents,
-				  const u64 *extent_item_pos)
+				  const u64 *extent_item_pos, u64 total_refs)
 {
 	struct btrfs_root *root;
 	struct btrfs_key root_key;
@@ -364,7 +365,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 	}

 	ret = add_all_parents(root, path, parents, ref, level, time_seq,
-			      extent_item_pos);
+			      extent_item_pos, total_refs);
 out:
 	path->lowest_level = 0;
 	btrfs_release_path(path);
@@ -377,7 +378,7 @@ out:
 static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
 				   struct btrfs_path *path, u64 time_seq,
 				   struct list_head *head,
-				   const u64 *extent_item_pos)
+				   const u64 *extent_item_pos, u64 total_refs)
 {
 	int err;
 	int ret = 0;
@@ -403,7 +404,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
 		if (ref->count == 0)
 			continue;
 		err = __resolve_indirect_ref(fs_info, path, time_seq, ref,
-					     parents, extent_item_pos);
+					     parents, extent_item_pos,
+					     total_refs);
 		/*
 		 * we can only tolerate ENOENT,otherwise,we should catch error
 		 * and return directly.
@@ -560,7 +562,7 @@ static void __merge_refs(struct list_head *head, int mode)
  * smaller or equal that seq to the list
  */
 static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
-			      struct list_head *prefs)
+			      struct list_head *prefs, u64 *total_refs)
 {
 	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
 	struct rb_node *n = &head->node.rb_node;
@@ -596,6 +598,7 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
 		default:
 			BUG_ON(1);
 		}
+		*total_refs += (node->ref_mod * sgn);
 		switch (node->type) {
 		case BTRFS_TREE_BLOCK_REF_KEY: {
 			struct btrfs_delayed_tree_ref *ref;
@@ -656,7 +659,8 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
  */
 static int __add_inline_refs(struct btrfs_fs_info *fs_info,
 			     struct btrfs_path *path, u64 bytenr,
-			     int *info_level, struct list_head *prefs)
+			     int *info_level, struct list_head *prefs,
+			     u64 *total_refs)
 {
 	int ret = 0;
 	int slot;
@@ -680,6 +684,7 @@ static int __add_inline_refs(struct b
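The accounting this patch introduces can be modeled in a few lines. The structure and function names below are invented for illustration: the point is that the search bound becomes the extent item's reference count plus the signed sum of pending delayed-ref modifications, mirroring the `*total_refs += (node->ref_mod * sgn)` line in the diff.

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for a pending (delayed) reference modification. */
struct delayed_ref {
	unsigned long long ref_mod;	/* how many refs this entry changes */
	int sgn;			/* +1 for an add, -1 for a drop */
};

/* Bound for the backref search: the extent item's on-disk ref count
 * plus all pending delayed modifications, clamped at zero. */
static unsigned long long total_refs(unsigned long long extent_item_refs,
				     const struct delayed_ref *refs, int nr)
{
	long long total = (long long)extent_item_refs;
	int i;

	for (i = 0; i < nr; i++)
		total += (long long)refs[i].ref_mod * refs[i].sgn;
	return total > 0 ? (unsigned long long)total : 0;
}
```

Using this sum instead of the per-ref count is what keeps the early-exit optimization correct when a snapshot still points at the extent while the original root has diverged.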
Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds
On Mar 19, 2014, at 9:40 AM, Marc MERLIN wrote:
>
> After adding a drive, I couldn't quite tell if it was striping over 11
> drives or 10, but it felt that at least at times, it was striping over 11
> drives with write failures on the missing drive.
> I can't prove it, but I'm thinking the new data I was writing was being
> striped in degraded mode.

Well, it does sound fragile after all to add a drive to a degraded array,
especially when it's not expressly treating the faulty drive as faulty. I
think iotop will show what block devices are being written to. And in a VM
it's easy (albeit rudimentary) with sparse files, as you can see them grow.

> Yes, although it's limited, you apparently only lose new data that was
> added after you went into degraded mode, and only if you add another
> drive where you write more data.
> In real life this shouldn't be too common, even if it is indeed a bug.

It's entirely plausible a drive power/data cable becomes loose and the
array runs for hours degraded before the wayward device is reseated. It'll
be common enough. It's definitely not OK for all of that data in the
interim to vanish just because the volume has resumed from degraded to
normal. Two states of data, normal vs degraded, is scary. It sounds like
totally silent data loss. So yeah, if it's reproducible it's worthy of a
separate bug.

Chris Murphy
Re: [PATCH 1/6] Btrfs: subpagesize-blocksize: Get rid of whole page reads.
> These should be put in front of "struct bio bio",
> otherwise, it might lead to errors, according to bioset_create()'s
> comments:
>
> --
> "Note that the bio must be embedded at the END of that structure always,
> or things will break badly."
> --

Thank you for pointing that out. I will fix it.

Thanks,
chandan
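The rule quoted from bioset_create()'s comment can be illustrated with simplified stand-in structs (these are not the real kernel definitions): any per-bio private fields go in front, the embedded bio goes strictly last, and the container is recovered by fixed offset arithmetic from the bio pointer, which is why trailing members after the bio would break things.

```c
#include <assert.h>
#include <stddef.h>

struct bio {			/* simplified stand-in for struct bio */
	int bi_flags;
};

/* Per-bio private data in front, struct bio strictly last, matching the
 * bioset_create() layout rule quoted above. */
struct my_io_bio {
	unsigned long long offset;	/* private fields first ... */
	unsigned long long len;
	struct bio bio;			/* ... embedded bio at the END */
};

/* The container is found the way the kernel finds it: by subtracting
 * the bio member's fixed offset from the embedded bio's address. */
static struct my_io_bio *my_io_bio_from_bio(struct bio *b)
{
	return (struct my_io_bio *)((char *)b -
				    offsetof(struct my_io_bio, bio));
}
```

The reason the bio must be last is that a bioset hands out allocations sized front_pad + sizeof(struct bio) + inline biovecs, so anything placed after the embedded bio would overlap the biovec array.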
btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
My server died last night during a btrfs send/receive to a btrfs raid5 array. Here are the logs. Is this anything known, or is there a possible workaround?

Thanks,
Marc

btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
CPU: 1 PID: 12499 Comm: btrfs-rmw-2 Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007
 88000549d780 816090b3 88000549d808 811037b0 0001fffe 88007ff7ce00
 0002 0030 88007ff7ce00
Call Trace:
 [] dump_stack+0x4e/0x7a
 [] warn_alloc_failed+0x111/0x125
 [] __alloc_pages_nodemask+0x707/0x854
 [] ? dma_generic_alloc_coherent+0xa7/0x11c
 [] dma_generic_alloc_coherent+0xa7/0x11c
 [] dma_pool_alloc+0x10a/0x1cb
 [] mvs_task_prep+0x192/0xa42 [mvsas]
 [] ? blkg_path.isra.80.constprop.90+0x17/0x38
 [] ? cache_alloc+0x1c/0x29b
 [] mvs_task_exec.isra.9+0x5d/0xc9 [mvsas]
 [] mvs_queue_command+0x3d/0x29b [mvsas]
 [] ? kmem_cache_alloc+0xe3/0x161
 [] sas_ata_qc_issue+0x1cd/0x235 [libsas]
 [] ata_qc_issue+0x291/0x2f1
 [] ? ata_scsiop_mode_sense+0x29c/0x29c
 [] __ata_scsi_queuecmd+0x184/0x1e0
 [] ata_sas_queuecmd+0x31/0x4d
 [] sas_queuecommand+0x98/0x1fe [libsas]
 [] scsi_dispatch_cmd+0x14f/0x22e
 [] scsi_request_fn+0x4da/0x507
 [] ? blk_recount_segments+0x1e/0x2e
 [] __blk_run_queue_uncond+0x22/0x2b
 [] __blk_run_queue+0x19/0x1b
 [] blk_queue_bio+0x23f/0x256
 [] generic_make_request+0x9c/0xdb
 [] submit_bio+0x112/0x131
 [] rmw_work+0x112/0x162
 [] worker_loop+0x168/0x4d8
 [] ? btrfs_queue_worker+0x283/0x283
 [] kthread+0xae/0xb6
 [] ? __kthread_parkme+0x61/0x61
 [] ret_from_fork+0x7c/0xb0
 [] ? __kthread_parkme+0x61/0x61
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 171
CPU 1: hi: 186, btch: 31 usd: 190
active_anon:17298 inactive_anon:21061 isolated_anon:0
 active_file:67491 inactive_file:94189 isolated_file:32
 unevictable:1260 dirty:38914 writeback:49596 unstable:0
 free:15999 slab_reclaimable:8198 slab_unreclaimable:9741
 mapped:12981 shmem:1661 pagetables:2711 bounce:0
 free_cma:0
Node 0 DMA free:8084kB min:348kB low:432kB high:520kB active_anon:360kB inactive_anon:764kB active_file:288kB inactive_file:2040kB unevictable:100kB isolated(anon):0kB isolated(file):0kB present:15976kB managed:15892kB mlocked:100kB dirty:0kB writeback:1272kB mapped:252kB shmem:8kB slab_reclaimable:168kB slab_unreclaimable:336kB kernel_stack:88kB pagetables:128kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1987 1987 1987
Node 0 DMA32 free:56080kB min:44704kB low:55880kB high:67056kB active_anon:68832kB inactive_anon:83480kB active_file:269676kB inactive_file:374588kB unevictable:4940kB isolated(anon):0kB isolated(file):128kB present:2080256kB managed:2039064kB mlocked:4940kB dirty:155668kB writeback:197112kB mapped:51672kB shmem:6636kB slab_reclaimable:32624kB slab_unreclaimable:38628kB kernel_stack:2912kB pagetables:10716kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 85*4kB (UEM) 22*8kB (UEM) 62*16kB (UEM) 6*32kB (UM) 2*64kB (UE) 5*128kB (UEM) 6*256kB (UEM) 4*512kB (EM) 0*1024kB 1*2048kB (R) 0*4096kB = 8100kB
Node 0 DMA32: 13004*4kB (M) 16*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 56240kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
164139 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap  = 9255932kB
Total swap = 9255932kB
524058 pages RAM
0 pages HighMem/MovableOnly
10298 pages reserved
0 pages hwpoisoned
mvsas :01:00.0: mvsas prep failed[0]!
btrfs-rmw-2: page allocation failure: order:1, mode:0x8020
CPU: 1 PID: 12499 Comm: btrfs-rmw-2 Not tainted 3.14.0-rc5-amd64-i915-preempt-20140216c #1
Hardware name: System manufacturer P5KC/P5KC, BIOS 0502 05/24/2007
 88000549d690 816090b3 88000549d718 811037b0 0001fffe 88000549d6c8
 8160e9c4 0002 0030 88007ff7ce00
Call Trace:
 [] dump_stack+0x4e/0x7a
 [] warn_alloc_failed+0x111/0x125
 [] ? _raw_spin_trylock+0x20/0x50
 [] __alloc_pages_nodemask+0x707/0x854
 [] ? console_unlock+0x2f6/0x302
 [] dma_generic_alloc_coherent+0xa7/0x11c
 [] dma_pool_alloc+0x10a/0x1cb
 [] mvs_task_prep+0x192/0xa42 [mvsas]
 [] ? get_page_from_freelist+0x549/0x71d
 [] ? cache_alloc+0x1c/0x29b
 [] mvs_task_exec.isra.9+0x5d/0xc9 [mvsas]
 [] mvs_queue_command+0x3d/0x29b [mvsas]
 [] ? kmem_cache_alloc+0xe3/0x161
 [] sas_ata_qc_issue+0x1cd/0x235 [libsas]
 [] ata_qc_issue+0x291/0x2f1
 [] ? ata_scsiop_mode_sense+0x29c/0x29c
 [] __ata_scsi_queuecmd+0x184/0x1e0
 [] ata_sas_queuecmd+0x31/0x4d
 [
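The failure above is an atomic order-1 (8 KiB) DMA allocation inside the mvs driver, and the buddy lists show why it fails: the DMA32 zone has plenty of order-0 pages but almost nothing at order 1 or above. The mitigation suggested elsewhere in this thread is to raise /proc/sys/vm/min_free_kbytes so the kernel keeps more memory free for atomic allocations. A minimal sketch for inspecting the relevant state (reading only; writing the watermark requires root):

```python
# Sketch: show per-order free-page counts and the current watermark.
# In /proc/buddyinfo, each "Node N, zone ZZZ" row lists free blocks per
# order (order 0 first); an order-1 allocation needs the second column
# (or a higher one to split) to be non-zero in a usable zone.
with open("/proc/buddyinfo") as f:
    print(f.read())

with open("/proc/sys/vm/min_free_kbytes") as f:
    print("min_free_kbytes:", f.read().strip())

# As root, raising the watermark (value illustrative, tune per RAM size):
#   echo 65536 > /proc/sys/vm/min_free_kbytes
```

This only makes the failures less likely; as noted in the thread, the real bug is the driver relying on order-1 atomic allocations at all.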
Re: How to handle a RAID5 array with a failing drive? -> raid5 mostly works, just no rebuilds
On Wed, Mar 19, 2014 at 12:32:55AM -0600, Chris Murphy wrote:
> On Mar 19, 2014, at 12:09 AM, Marc MERLIN wrote:
> >
> > 7) you can remove a drive from an array, add files, and then if you plug
> > the drive back in, it apparently gets auto-sucked back into the array.
> > There is no rebuild that happens; you now have an inconsistent array where
> > one drive is not at the same level as the other ones (I lost all the files
> > I added after the drive was removed, once I added the drive back).
>
> Seems worthy of a dedicated bug report and keeping an eye on in the future,
> not good.

Since it's not supposed to be working, I didn't file a bug, but I figured
it'd be good for people to know about it in the meantime.

> >> polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 /mnt/btrfs_backupcopy/
> >> polgara:/mnt/btrfs_backupcopy# df -h .
> >> Filesystem              Size  Used Avail Use% Mounted on
> >> /dev/mapper/crypt_sdb1  4.6T  3.0M  4.6T   1% /mnt/btrfs_backupcopy
> >
> > Oh look, it's bigger now. We need to manually rebalance to use the new drive:
>
> You don't have to. As soon as you add the additional drive, newly allocated
> chunks will stripe across all available drives. E.g. with 1 GB allocations
> striped across 3 drives, if I add a 4th drive, initially any additional
> writes go only to the first three drives, but once a new data chunk is
> allocated it gets striped across 4 drives.

That's the thing, though. If the bad device hadn't been forcibly removed
(and apparently the only way to do this was to unmount, make the device
node disappear, and remount in degraded mode), it looked to me like btrfs
was still considering that the drive was part of the array and trying to
write to it. After adding a drive, I couldn't quite tell whether it was
striping over 11 drives or 10, but it felt like, at least at times, it was
striping over 11 drives with write failures on the missing drive.
I can't prove it, but I'm thinking the new data I was writing was being
striped in degraded mode.

> Sure, the whole thing isn't corrupt. But if anything written while degraded
> vanishes once the missing device is reattached, and you remount normally
> (non-degraded), that's data loss.

Yikes! Yes, although it's limited: you apparently only lose new data that
was added after you went into degraded mode, and only if you add another
drive where you write more data. In real life this shouldn't be too
common, even if it is indeed a bug.

Cheers,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
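Chris's point about chunk allocation above — existing chunks keep the stripe width they were allocated with, and only newly allocated chunks see an added drive (until a balance restripes everything) — can be illustrated with a toy model. This is illustrative Python, not btrfs code; the function name is hypothetical:

```python
# Toy model of btrfs data-chunk allocation: each chunk is striped across
# all devices present at allocation time. Adding a device never restripes
# existing chunks -- that requires an explicit balance.
def allocate_chunk(devices):
    """Record the stripe width a newly allocated chunk gets."""
    return tuple(devices)

devices = ["sdb", "sdc", "sdd"]
chunks = [allocate_chunk(devices)]      # chunk written while 3 devices exist

devices.append("sde")                   # btrfs device add
chunks.append(allocate_chunk(devices))  # the next new chunk uses all 4

assert len(chunks[0]) == 3              # old chunk stays 3-wide until a balance
assert len(chunks[1]) == 4              # new chunk stripes across 4 drives
```

This is why `df` shows the extra space immediately even though nothing has been rewritten onto the new drive yet.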
[PATCH v4] Btrfs: part 2, fix incremental send's decision to delay a dir move/rename
For an incremental send, fix the process of determining whether the directory inode we're currently processing needs to have its move/rename operation delayed. We were ignoring the fact that, if the inode's new immediate ancestor has a higher inode number than ours but wasn't renamed/moved, we might still need to delay our move/rename: some other ancestor directory higher in the hierarchy might have an inode number higher than ours *and* have been renamed/moved too. In that case we have to wait for the rename/move of that ancestor before our current directory's rename/move operation.

Simple steps to reproduce this issue:

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/x1/x2
$ mkdir /mnt/a/Z
$ mkdir -p /mnt/a/x1/x2/x3/x4/x5
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33
$ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

The incremental send caused the kernel code to enter an infinite loop when building the path string for directory Z after its references are processed.

A more complex scenario:

$ mkfs.btrfs -f /dev/sdd
$ mount /dev/sdd /mnt
$ mkdir -p /mnt/a/b/c/d
$ mkdir /mnt/a/b/c/d/e
$ mkdir /mnt/a/b/c/d/f
$ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2
$ mkdir /mnt/a/b/c/g
$ mv /mnt/a/b/c/d /mnt/a/b/D2
$ btrfs subvolume snapshot -r /mnt /mnt/snap1
$ btrfs send /mnt/snap1 -f /tmp/base.send
$ mkdir /mnt/a/o
$ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2
$ mv /mnt/a/b/D2 /mnt/a/b/dd
$ mv /mnt/a/b/c /mnt/a/C2
$ mv /mnt/a/b/dd/f /mnt/a/o/FF
$ mv /mnt/a/b /mnt/a/o/FF/E2/BB
$ btrfs subvolume snapshot -r /mnt /mnt/snap2
$ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send

A test case for xfstests follows.

Signed-off-by: Filipe David Borba Manana
---
V2: Added missing error handling and fixed a typo in the commit message.
V3: Updated the algorithm to deal with more complex cases; hopefully all cases are nailed down now.
V4: Pass the right generation number to add_pending_dir_move.

 fs/btrfs/send.c | 71 +++
 1 file changed, 66 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index d869079..b27b3e1 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -2916,7 +2916,10 @@ static void free_waiting_dir_move(struct send_ctx *sctx,
 	kfree(dm);
 }
 
-static int add_pending_dir_move(struct send_ctx *sctx, u64 parent_ino)
+static int add_pending_dir_move(struct send_ctx *sctx,
+				u64 ino,
+				u64 ino_gen,
+				u64 parent_ino)
 {
 	struct rb_node **p = &sctx->pending_dir_moves.rb_node;
 	struct rb_node *parent = NULL;
@@ -2929,8 +2932,8 @@ static int add_pending_dir_move(struct send_ctx *sctx, u64 parent_ino)
 	if (!pm)
 		return -ENOMEM;
 	pm->parent_ino = parent_ino;
-	pm->ino = sctx->cur_ino;
-	pm->gen = sctx->cur_inode_gen;
+	pm->ino = ino;
+	pm->gen = ino_gen;
 	INIT_LIST_HEAD(&pm->list);
 	INIT_LIST_HEAD(&pm->update_refs);
 	RB_CLEAR_NODE(&pm->node);
@@ -3183,6 +3186,8 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 	struct fs_path *path_before = NULL;
 	struct fs_path *path_after = NULL;
 	int len1, len2;
+	int register_upper_dirs;
+	u64 gen;
 
 	if (is_waiting_for_move(sctx, ino))
 		return 1;
@@ -3225,7 +3230,7 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 	}
 
 	ret = get_first_ref(sctx->send_root, ino, &parent_ino_after,
-			    NULL, path_after);
+			    &gen, path_after);
 	if (ret == -ENOENT) {
 		ret = 0;
 		goto out;
@@ -3242,6 +3247,60 @@ static int wait_for_parent_move(struct send_ctx *sctx,
 	}
 
 	ret = 0;
+	/*
+	 * Ok, our new most direct ancestor has a higher inode number but
+	 * wasn't moved/renamed. So maybe some of the new ancestors higher in
+	 * the hierarchy have a higher inode number too *and* were renamed
+	 * or moved - in this case we need to wait for the ancestor's rename
+	 * or move operation before we can do the move/rename for the current
+	 * inode.
+	 */
+	register_upper_dirs = 0;
+	ino = parent_ino_after;
+again:
+	while ((ret == 0 || register_upper_dirs) && ino > sctx->cur_ino) {
+		u64 parent_gen;
+
+		fs_path_reset(path_before);
+		fs_path_reset(path_after);
+
+		ret = get_first_ref(sctx->send_root, ino, &parent_ino_after,
+				    &parent_gen, path
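The core check this patch adds — walk up the new-tree ancestors, and delay the current directory's move if any higher-numbered ancestor was itself moved/renamed — can be sketched with a toy model. This is a simplified illustration only (hypothetical helper, ignoring the generation and path comparisons the real `wait_for_parent_move()` performs):

```python
# Toy sketch: a directory's move/rename must be delayed if ANY ancestor
# in the send (new) tree has a higher inode number AND was itself
# moved/renamed -- not just the immediate parent.
def must_delay_move(ino, parent_of, moved):
    """parent_of maps inode -> parent inode in the new tree;
    moved is the set of inodes renamed/moved between the snapshots."""
    cur = parent_of.get(ino)
    while cur is not None and cur > ino:   # only higher-numbered ancestors matter
        if cur in moved:
            return True                    # wait for that ancestor's rename first
        cur = parent_of.get(cur)
    return False

parent_of = {258: 260, 260: 259, 259: 256}

# Immediate parent 260 has a higher inode number and was moved: delay.
assert must_delay_move(258, parent_of, moved={260}) is True
# No higher-numbered ancestor moved: safe to emit the rename now.
assert must_delay_move(258, parent_of, moved=set()) is False
```

The pre-patch code effectively stopped at the immediate ancestor, which is what allowed the infinite path-building loop in the reproducer.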
Re: Please help me to contribute to btrfs project
Thank you very much, Ben :) I did go through the links you sent and got the complete details for submitting the kernel component. My change also includes a patch to btrfs-tools; it would be nice if you could share the process for submitting that patch as well.

Regards,
Ajesh

On Tue, Mar 18, 2014 at 7:17 PM, Ben Gamari wrote:
> Ajesh js writes:
>
>> Hi,
>>
>> I have used the btrfs filesystem in one of my projects and I have
>> added a small feature to it. I feel that the same feature will be
>> useful for others too. Hence I would like to contribute the same to
>> open source.
>>
> Excellent!
>
>> If everything works fine and this feature is not already added by
>> somebody else, this will be my first contribution to open source &
>> I am excited to join the huge family of open source :)
>>
>> Please help me with precise steps to do the same.
>>
> In general the way to contribute is to send a patch for review. You
> should have a look at the code style guidelines[1] and patch submission
> guidelines[2] in the kernel tree. For nontrivial changes the patch
> should be accompanied by a cover letter describing the change and the
> motivations for any non-obvious design decisions.
>
> It is possible that your change is acceptable as-is. More likely,
> however, is that there will be some discussion and requests for
> changes. Eventually the review process will produce a merge-worthy
> patch. The first step, however, is sending something concrete for
> community review.
>
> Cheers,
>
> - Ben
>
> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/CodingStyle
> [2] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches
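For the btrfs-tools (btrfs-progs) patch Ajesh asks about, the mechanics are the same as for kernel patches: commit the change in a clone of the btrfs-progs tree and mail the generated patch to the linux-btrfs list. A sketch of the usual flow, driven from Python purely for illustration (normally these are typed at the shell; the subject prefix is a common convention, not a hard requirement):

```python
# Sketch: generate an emailable patch for the most recent commit in a
# btrfs-progs clone. The result (0001-*.patch) is then sent with
# git-send-email; that step is shown as a comment because it needs a
# configured SMTP setup.
import subprocess

def format_latest_patch(repo="."):
    """Run git format-patch -1 and return the generated patch filename."""
    out = subprocess.run(
        ["git", "format-patch", "-1", "--subject-prefix=PATCH btrfs-progs"],
        cwd=repo, capture_output=True, text=True, check=True)
    return out.stdout.strip()

# Then, from the shell:
#   git send-email --to=linux-btrfs@vger.kernel.org 0001-*.patch
```

The same review cycle Ben describes for kernel patches applies; btrfs-progs patches are discussed on the same list.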
Re: Another crash when deleting a large number of snapshots
On Wed, Mar 19, 2014 at 08:52:37AM +0100, Juan Orti Alcaine wrote:
> This kind of crash happens to me very often when I delete a large
> number (200+) of snapshots at once.
> There is very high IO for a while, and after that the system froze
> at intervals. I had to reboot the system to get it responsive again.
>
> Versions used:
> kernel-3.13.6-200.fc20.x86_64
> btrfs-progs-3.12-1.fc20.x86_64
>
> The log:
> http://ur1.ca/gvr3j

Not sure if this has been fixed... Can you try btrfs-next, or compile the kernel with CONFIG_DEBUG_SPINLOCK=y?

thanks,
-liubo
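Liu Bo's suggestion translates to a kernel .config fragment like the one below before rebuilding. CONFIG_DEBUG_SPINLOCK is what he asked for; the other two are optional additions often enabled alongside it when chasing lock-related freezes (my suggestion, not part of his request):

```
CONFIG_DEBUG_SPINLOCK=y
# optional extras for lock debugging:
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
```

Note these options add runtime overhead, so they are for reproducing the bug, not for production kernels.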
Another crash when deleting a large number of snapshots
This kind of crash happens to me very often when I delete a large number (200+) of snapshots at once. There is very high IO for a while, and after that the system froze at intervals. I had to reboot the system to get it responsive again.

Versions used:
kernel-3.13.6-200.fc20.x86_64
btrfs-progs-3.12-1.fc20.x86_64

The log: http://ur1.ca/gvr3j
Re: [PATCH 1/6] Btrfs: subpagesize-blocksize: Get rid of whole page reads.
On Wed, Mar 12, 2014 at 07:50:28PM +0530, Chandan Rajendra wrote:
> bio_vec->{bv_offset, bv_len} cannot be relied upon by the end bio functions
> to track the file offset range operated on by the bio. Hence this patch adds
> two new members to 'struct btrfs_io_bio' to track the file offset range.
>
> This patch also brings back check_page_locked() to reliably unlock pages in
> readpage's end bio function.
>
> Signed-off-by: Chandan Rajendra
> ---
>  fs/btrfs/extent_io.c | 122 +--
>  fs/btrfs/volumes.h   |   3 ++
>  2 files changed, 82 insertions(+), 43 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index fbe501d..5a65aee 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -1943,15 +1943,31 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
>   * helper function to set a given page up to date if all the
>   * extents in the tree for that page are up to date
>   */
> -static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
> +static void check_page_uptodate(struct extent_io_tree *tree, struct page *page,
> +				struct extent_state *cached)
>  {
>  	u64 start = page_offset(page);
>  	u64 end = start + PAGE_CACHE_SIZE - 1;
> -	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
> +	if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, cached))
>  		SetPageUptodate(page);
>  }
>  
>  /*
> + * helper function to unlock a page if all the extents in the tree
> + * for that page are unlocked
> + */
> +static void check_page_locked(struct extent_io_tree *tree, struct page *page)
> +{
> +	u64 start = page_offset(page);
> +	u64 end = start + PAGE_CACHE_SIZE - 1;
> +
> +	if (!test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL)) {
> +		unlock_page(page);
> +	}
> +}
> +
> +
> +/*
>   * When IO fails, either with EIO or csum verification fails, we
>   * try other mirrors that might have a good copy of the data. This
>   * io_failure_record is used to record state as we go through all the
> @@ -2414,16 +2430,33 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
>  	bio_put(bio);
>  }
>  
> -static void
> -endio_readpage_release_extent(struct extent_io_tree *tree, u64 start, u64 len,
> -			      int uptodate)
> +static void unlock_extent_and_page(struct address_space *mapping,
> +				   struct extent_io_tree *tree,
> +				   struct btrfs_io_bio *io_bio)
>  {
> -	struct extent_state *cached = NULL;
> -	u64 end = start + len - 1;
> +	pgoff_t index;
> +	u64 offset, len;
> +	/*
> +	 * This btrfs_io_bio may span multiple pages.
> +	 * We need to unlock the pages covered by them
> +	 * if we got the endio callback for all the blocks in the page.
> +	 * btrfs_io_bio also contains "contiguous blocks of the file";
> +	 * look at submit_extent_page for more details.
> +	 */
>  
> -	if (uptodate && tree->track_uptodate)
> -		set_extent_uptodate(tree, start, end, &cached, GFP_ATOMIC);
> -	unlock_extent_cached(tree, start, end, &cached, GFP_ATOMIC);
> +	offset = io_bio->start_offset;
> +	len = io_bio->len;
> +	unlock_extent(tree, offset, offset + len - 1);
> +
> +	index = offset >> PAGE_CACHE_SHIFT;
> +	while (offset < io_bio->start_offset + len) {
> +		struct page *page;
> +		page = find_get_page(mapping, index);
> +		check_page_locked(tree, page);
> +		page_cache_release(page);
> +		index++;
> +		offset += PAGE_CACHE_SIZE;
> +	}
>  }
>  
>  /*
> @@ -2443,13 +2476,13 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  	struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
>  	struct bio_vec *bvec = bio->bi_io_vec;
>  	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	struct address_space *mapping = bio->bi_io_vec->bv_page->mapping;
>  	struct extent_io_tree *tree;
> +	struct extent_state *cached = NULL;
>  	u64 offset = 0;
>  	u64 start;
>  	u64 end;
>  	u64 len;
> -	u64 extent_start = 0;
> -	u64 extent_len = 0;
>  	int mirror;
>  	int ret;
>  
> @@ -2482,8 +2515,8 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
>  			bvec->bv_offset, bvec->bv_len);
>  	}
>  
> -	start = page_offset(page);
> -	end = start + bvec->bv_offset + bvec->bv_len - 1;
> +	start = page_offset(page) + bvec->bv_offset;
> +	end = start + bvec->bv_len - 1;
>  	len = bvec->bv_len;
>  
>  	if (++bvec <= bvec_end)
> @@ -2540,40 +2573,24 @@ readpage_ok:
>  		offset = i_size & (PAGE_CACHE_SIZE-1);
>  		if (page->index == end_
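The idea behind the quoted check_page_locked() hunk — with blocksize smaller than page size, a page is covered by several blocks, and readpage completion for one block must not unlock the page until every block range in that page has completed — can be sketched with a toy model (illustrative Python, not the kernel code; the helper name is hypothetical):

```python
# Toy model of the subpagesize-blocksize unlock rule: a page may only be
# unlocked when no block range inside it is still locked, mirroring
# check_page_locked()'s !test_range_bit(..., EXTENT_LOCKED, ...) check.
PAGE_SIZE = 4096
BLOCK_SIZE = 1024

def page_fully_unlocked(locked_blocks, page_index):
    """True when no block offset inside the page is still locked."""
    start = page_index * PAGE_SIZE
    blocks_per_page = PAGE_SIZE // BLOCK_SIZE
    return not any(start + i * BLOCK_SIZE in locked_blocks
                   for i in range(blocks_per_page))

locked = {0, 1024, 2048, 3072}    # all four blocks of page 0 under IO
locked.discard(0)                 # end_io for the first block arrives
assert not page_fully_unlocked(locked, 0)   # page must stay locked

for off in (1024, 2048, 3072):    # remaining blocks complete
    locked.discard(off)
assert page_fully_unlocked(locked, 0)       # now safe to unlock_page()
```

This is exactly why the patch drops the whole-page assumptions (`page_offset(page)` / `bv_offset == 0`) and tracks the file offset range per btrfs_io_bio instead.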