date:20160519




David Sterba wrote on 2016/05/19 16:51 +0200:

On Wed, May 18, 2016 at 08:58:57AM +0800, Qu Wenruo wrote:

Josef Bacik wrote on 2016/05/17 11:29 -0400:

On 05/17/2016 05:38 AM, Qu Wenruo wrote:


Thanks for the review.

I'll add it to github branch to avoid unneeded mail bombing.


Thanks.  I did one more pass and fixed the error messages and some minor
formatting. Branch merged in devel, any fixups, please send as separate
patches. I'd like to give enough time for testing, so ETA for 4.6 will
be the end of the next week.


Thank you a lot, for all the work.

We'll enrich the test cases for current low memory mode.

And only until current extent tree part is completely OK, then we will 
try to implement later fs tree one.
(Currently fs tree check still eats a lot of memory, making the memory 
save in extent tree check a little meaningless)


Thanks,
Qu


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: fsck: Add more explanation on low memory mode

It seems the initial support for low memory mode will come in recent
release, it's better to add more explanation on the unsupported
functions, to avoid user reporting known bugs.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-check.asciidoc | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc 
b/Documentation/btrfs-check.asciidoc
index 4e27863..6b09662 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -97,7 +97,14 @@ NOTE: Do not use unless you know what you're doing.
 check fs in low memory usage mode(experimental)
 May takes longer time than normal check.
 +
-NOTE: Doesn't work with '--repair' option yet.
+[NOTE]
+==
+As this feature is still under development, some functions are not implemented
+yet
+
+- No support for '--repair' yet.
+- Only reduce memory usage of extent tree check. No effect for fs tree yet.
+==
 
 EXIT STATUS
 ---
-- 
2.8.2



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [1/1 v2] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

Hi David,

Sorry for the noise.  Please disregard my v1 patch and subsequent
emails.  This patch is for upstream linux-next.  From now on I think
that's what I'm going to work from, to keep things simple, because it
seems I'm still inept with git.

Regards,
Nicholas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 0/1 v2] for linux-next! String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

Hi David,

There.  I decided to just check out a fresh copy of linux-next.git from a URL 
provided in a previous email, because I'm clearly still inept with git.  Please 
disregard my last thread.

I took a look at what you merged and what you didn't for my review of 
btrfs-progs, and revised my approach accordingly.  There are six edits that 
require your review.  Please search/grep for "Steeves" to find them.

Cheers!
Nicholas

Nicholas D Steeves (1):
  String and comment review: Fix typos; fix a couple of mandatory
grammatical issues for clarity.

 fs/btrfs/backref.c|  2 +-
 fs/btrfs/btrfs_inode.h|  2 +-
 fs/btrfs/check-integrity.c|  2 +-
 fs/btrfs/ctree.c  | 14 +++---
 fs/btrfs/ctree.h  |  6 +++---
 fs/btrfs/delayed-ref.h|  2 +-
 fs/btrfs/dev-replace.c|  2 +-
 fs/btrfs/disk-io.c| 10 +-
 fs/btrfs/extent-tree.c| 32 
 fs/btrfs/extent_io.c  |  4 ++--
 fs/btrfs/extent_map.c |  2 +-
 fs/btrfs/file.c   |  4 ++--
 fs/btrfs/free-space-cache.c   |  2 +-
 fs/btrfs/free-space-cache.h   |  2 +-
 fs/btrfs/inode.c  | 22 +++---
 fs/btrfs/ioctl.c  | 10 +-
 fs/btrfs/ordered-data.h   |  2 +-
 fs/btrfs/qgroup.c | 16 
 fs/btrfs/raid56.c |  6 +++---
 fs/btrfs/relocation.c | 12 ++--
 fs/btrfs/root-tree.c  |  4 ++--
 fs/btrfs/scrub.c  |  4 ++--
 fs/btrfs/send.c   |  6 +++---
 fs/btrfs/struct-funcs.c   |  2 +-
 fs/btrfs/super.c  |  8 
 fs/btrfs/sysfs.c  |  2 +-
 fs/btrfs/tests/extent-io-tests.c  |  2 +-
 fs/btrfs/tests/free-space-tests.c |  7 ---
 fs/btrfs/tests/inode-tests.c  |  2 +-
 fs/btrfs/tests/qgroup-tests.c |  2 +-
 fs/btrfs/transaction.h|  2 +-
 fs/btrfs/tree-log.c   |  8 
 fs/btrfs/ulist.c  |  2 +-
 fs/btrfs/volumes.c|  8 
 34 files changed, 107 insertions(+), 106 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[1/1 v2] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

Signed-off-by: Nicholas D Steeves 
---
 fs/btrfs/backref.c|  2 +-
 fs/btrfs/btrfs_inode.h|  2 +-
 fs/btrfs/check-integrity.c|  2 +-
 fs/btrfs/ctree.c  | 14 +++---
 fs/btrfs/ctree.h  |  6 +++---
 fs/btrfs/delayed-ref.h|  2 +-
 fs/btrfs/dev-replace.c|  2 +-
 fs/btrfs/disk-io.c| 10 +-
 fs/btrfs/extent-tree.c| 32 
 fs/btrfs/extent_io.c  |  4 ++--
 fs/btrfs/extent_map.c |  2 +-
 fs/btrfs/file.c   |  4 ++--
 fs/btrfs/free-space-cache.c   |  2 +-
 fs/btrfs/free-space-cache.h   |  2 +-
 fs/btrfs/inode.c  | 22 +++---
 fs/btrfs/ioctl.c  | 10 +-
 fs/btrfs/ordered-data.h   |  2 +-
 fs/btrfs/qgroup.c | 16 
 fs/btrfs/raid56.c |  6 +++---
 fs/btrfs/relocation.c | 12 ++--
 fs/btrfs/root-tree.c  |  4 ++--
 fs/btrfs/scrub.c  |  4 ++--
 fs/btrfs/send.c   |  6 +++---
 fs/btrfs/struct-funcs.c   |  2 +-
 fs/btrfs/super.c  |  8 
 fs/btrfs/sysfs.c  |  2 +-
 fs/btrfs/tests/extent-io-tests.c  |  2 +-
 fs/btrfs/tests/free-space-tests.c |  7 ---
 fs/btrfs/tests/inode-tests.c  |  2 +-
 fs/btrfs/tests/qgroup-tests.c |  2 +-
 fs/btrfs/transaction.h|  2 +-
 fs/btrfs/tree-log.c   |  8 
 fs/btrfs/ulist.c  |  2 +-
 fs/btrfs/volumes.c|  8 
 34 files changed, 107 insertions(+), 106 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index d309018..8bb3509 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1939,7 +1939,7 @@ static int inode_to_path(u64 inum, u32 name_len, unsigned 
long name_off,
  * from ipath->fspath->val[i].
  * when it returns, there are ipath->fspath->elem_cnt number of paths available
  * in ipath->fspath->val[]. when the allocated space wasn't sufficient, the
- * number of missed paths in recored in ipath->fspath->elem_missed, otherwise,
+ * number of missed paths is recorded in ipath->fspath->elem_missed, otherwise,
  * it's zero. ipath->fspath->bytes_missing holds the number of bytes that would
  * have been needed to return all paths.
  */
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 1da5753..4919aed 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -313,7 +313,7 @@ struct btrfs_dio_private {
struct bio *dio_bio;
 
/*
-* The original bio may be splited to several sub-bios, this is
+* The original bio may be split to several sub-bios, this is
 * done during endio of sub-bios
 */
int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 516e19d..b677a6e 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1939,7 +1939,7 @@ again:
/*
 * Clear all references of this block. Do not free
 * the block itself even if is not referenced anymore
-* because it still carries valueable information
+* because it still carries valuable information
 * like whether it was ever written and IO completed.
 */
list_for_each_entry_safe(l, tmp, >ref_to_list,
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index decd0a3..ba4dc5c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -156,7 +156,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root 
*root)
 
/*
 * RCU really hurts here, we could free up the root node because
-* it was cow'ed but we may not get the new root node yet so do
+* it was COWed but we may not get the new root node yet so do
 * the inc_not_zero dance and if it doesn't work then
 * synchronize_rcu and try again.
 */
@@ -955,7 +955,7 @@ int btrfs_block_can_be_shared(struct btrfs_root *root,
  struct extent_buffer *buf)
 {
/*
-* Tree blocks not in refernece counted trees and tree roots
+* Tree blocks not in reference counted trees and tree roots
 * are never shared. If a block was allocated after the last
 * snapshot and the block was not allocated by tree relocation,
 * we know the block is not shared.
@@ -1270,7 +1270,7 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
 
 /*
  * tm is a pointer to the first operation to rewind within eb. then, all
- * previous operations will be rewinded (until we reach something older than
+ * previous operations will be rewound (until we reach something older than
  * time_seq).
  */
 static void
@@

Re: [PATCH 0/4] Add support to clear v1 free space cache for btrfs check

Ivan P wrote on 2016/05/19 18:54 +0200:

Not sure why I was so fixated on installing it - thanks for reminding.

I was able to run it, but it fails:
===
# ./btrfsck --repair --clear-space-cache -p /dev/sda
enabling repair mode
Clearing all free space cache
btrfs unable to find ref byte nr 812846673920 parent 0 root 1 owner

Error output here means btrfsck find the free space cache inode and its
data extent, but failed to delete the backref.

In that case, the only successfully cleaned space cache will be commited
to disk, and the deletion failure won't be commited to disk.

Would please run "btrfs check --readonly" on the image after clear space
cache and paste the output?

I'm a little interested in what the root problem is.
If it's just a normal backref mismatch problem, it's my patch to blame
as it's too restrict to handle such error. (but for a good safety reason)
If that's the case, I can add some option to allow clear cache to be a
little more aggressive.

And it's better to paste "btrfs-debug-tree -t 1" and "btrfs-debug-tree
-t 2" on your btrfs image for better debugging.

Such output won't leak much personal info, except the subvolume name.
And feel free to fuzz the subvolume name.

1050 offset 0
ERROR: failed to remove backref for disk bytenr 812846673920
ERROR: failed to clear free space cache
transaction.h:41: btrfs_start_transaction: Assertion `root->commit_root` failed.
btrfs check[0x4437ee]
btrfs check(close_ctree_fs_info+0x213)[0x445ab3]
btrfs check(cmd_check+0x638)[0x42b718]
btrfs check(main+0x7b)[0x40fbdb]
/usr/lib/libc.so.6(__libc_start_main+0xf1)[0x7fe4ebe1a741]
btrfs check(_start+0x29)[0x40fcd9]
===

I also tried running it WITHOUT --repair first, as well as ONLY with
--repair, followed by --clearcache and same thing happened.

--clear-space-cache implies --repair, but override dangerous repair
behavior and only do space cache clean things.

So without or without --repair, it's the same.

I must add that I feel cheated by the "-p" option, it's more of a
"print heartbeat" than "print progress" IMHO :p

Thanks for the -p option, clear space cache doesn't support progress
report yet.

I'll add them in next version.

Thanks,
Qu

@David Sterba: you're welcome.

Regards,
Ivan.

On Wed, May 18, 2016 at 3:00 AM, Qu Wenruo wrote:

Not familiar about the problem, but you don't really need to install the
patch.

You can just execute '/btrfsck' directly to
call the new fsck.

And there is still some time before the patch is merged, there is really no
need to install.

Thanks,
Qu

Ivan P wrote on 2016/05/17 20:12 +0200:

Thank you, however I can't seem to be able to compile that snapshot, I'm
getting

===
/usr/bin/install -c -m644 -d 64-btrfs-dm.rules
/home/myuser/aur/btrfs-progs-git/pkg/btrfs-progs-git/usr/lib/udev/rules.d
/usr/bin/install: cannot create directory ‘64-btrfs-dm.rules’: File exists
Makefile:400: recipe for target 'install' failed
make: *** [install] Error 1
==> ERROR: A failure occurred in package()
===

Just to make sure I wasn't screwing up somewhere, I tried the
btrfs-progs-git AUR package and I'm getting the same thing.
It's not only me, however: according to [1] it could be that this
commit has introduced it: [2]

Regards,
Ivan.

[1] https://aur.archlinux.org/packages/btrfs-progs-git/
[2]
http://repo.or.cz/btrfs-progs-unstable/devel.git?a=commit;h=ebe5b1cc7885027521db3c1d16d84bd54cc1321b

On Tue, May 17, 2016 at 5:56 AM, Qu Wenruo
wrote:

Also to Ivan P, the initial reporter of the problem.

Now "btrfsck --clear-cache" should help you to prevent kernel warning on
free space cache problem.

Thanks,
Qu

Qu Wenruo wrote on 2016/05/17 11:47 +0800:

The patchset can be also fetched from github, just in case mail list
blocks the last patch, which contains binary file.
https://github.com/adam900710/btrfs-progs.git fsck_clear_cache

Just as describe in the first patch, btrfs kernel "clear_cache" mount
option can't rebuild all free space cache, due to the design of free
space cache.
(Although I pretty like the idea to delay the load of free space cache,
as it hugely reduce the mount time of large fs)

So this makes users to report error in mail list, complaining
"clear_cache" doesn't wipe the annoying kernel warning on corrupted free
space cache.

Since kernel can't handle it, and user consider it more like an error,
we'd better handle it like an error, to fix it in btrfs check.

So this patchset adds the ability to clear free space cache, and add
test case for it.

The clear procedure is different from kernel, it will remove all free
space cache inodes and its contents(with backrefs), and set
cache_generation of super block to -1.
So there will be no free space cache at all and kernel will be happy
with that.

This patch also enhances btrfs_previous_item() to use min_objectid to
return as early as possible.

Re: [PATCH 0/1] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

On 19 May 2016 at 19:47, Nicholas D Steeves  wrote:
> On 19 May 2016 at 19:13, Nicholas D Steeves  wrote:
>> Nicholas D Steeves (1):
>>   String and comment review: Fix typos; fix a couple of mandatory
>> grammatical issues for clarity.
>>
>>  fs/btrfs/backref.c|  2 +-
>>  fs/btrfs/btrfs_inode.h|  2 +-
>>  fs/btrfs/check-integrity.c|  2 +-
>>  fs/btrfs/ctree.c  | 14 +++---
>>  fs/btrfs/ctree.h  | 10 +-
>>  fs/btrfs/delayed-ref.h|  2 +-
>>  fs/btrfs/dev-replace.c|  2 +-
>>  fs/btrfs/disk-io.c| 10 +-
>>  fs/btrfs/extent-tree.c| 34 +-
>>  fs/btrfs/extent_io.c  |  4 ++--
>>  fs/btrfs/extent_map.c |  2 +-
>>  fs/btrfs/file.c   |  4 ++--
>>  fs/btrfs/free-space-cache.c   |  2 +-
>>  fs/btrfs/free-space-cache.h   |  2 +-
>>  fs/btrfs/inode.c  | 22 +++---
>>  fs/btrfs/ioctl.c  | 10 +-
>>  fs/btrfs/ordered-data.h   |  2 +-
>>  fs/btrfs/qgroup.c | 16 
>>  fs/btrfs/raid56.c |  6 +++---
>>  fs/btrfs/relocation.c | 12 ++--
>>  fs/btrfs/root-tree.c  |  4 ++--
>>  fs/btrfs/scrub.c  |  4 ++--
>>  fs/btrfs/send.c   |  6 +++---
>>  fs/btrfs/struct-funcs.c   |  2 +-
>>  fs/btrfs/super.c  | 10 +-
>>  fs/btrfs/sysfs.c  |  2 +-
>>  fs/btrfs/tests/extent-io-tests.c  |  2 +-
>>  fs/btrfs/tests/free-space-tests.c |  7 ---
>>  fs/btrfs/tests/inode-tests.c  |  2 +-
>>  fs/btrfs/tests/qgroup-tests.c |  2 +-
>>  fs/btrfs/transaction.h|  2 +-
>>  fs/btrfs/tree-log.c   |  8 
>>  fs/btrfs/ulist.c  |  2 +-
>>  fs/btrfs/volumes.c|  8 
>>  34 files changed, 111 insertions(+), 110 deletions(-)
>>
>> --
>> 2.1.4
>>
>
> Argh.  I'm sorry, I wasn't quite careful enough, and was working off
> of master rather than for-next.  Please disregard this patch.
>
> Sincerely,
> Nicholas

Oh my, rebasing off of for-next includes everyone else's patches too!
Maybe I did it correctly by using master after all?  There were only a
couple of minor rejects affecting the following: ctree.h,
extent-tree.c, and super.c.

Or is it preferred that I delete all patches excepting my cover letter
and 0119-For-linux-next-String-and-comment-review-Fix-typos-f.patch,
and then git send-email?

Best regards,
Nicholas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: sharing page cache pages between multiple mappings

2016-05-19 Thread Dave Chinner

On Thu, May 19, 2016 at 12:17:14PM +0200, Miklos Szeredi wrote:
> On Thu, May 19, 2016 at 11:05 AM, Michal Hocko  wrote:
> > On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
> >> Has anyone thought about sharing pages between multiple files?
> >>
> >> The obvious application is for COW filesytems where there are
> >> logically distinct files that physically share data and could easily
> >> share the cache as well if there was infrastructure for it.
> >
> > FYI this has been discussed at LSFMM this year[1]. I wasn't at the
> > session so cannot tell you any details but the LWN article covers it at
> > least briefly.
> 
> Cool, so it's not such a crazy idea.

Oh, it most certainly is crazy. :P

> Darrick, would you mind briefly sharing your ideas regarding this?

The current line of though is that we'll only attempt this in XFS on
inodes that are known to share underlying physical extents. i.e.
files that have blocks that have been reflinked or deduped.  That
way we can overload the breaking of reflink blocks (via copy on
write) with unsharing the pages in the page cache for that inode.
i.e. shared pages can propagate upwards in overlay if it uses
reflink for copy-up and writes will then break the sharing with the
underlying source without overlay having to do anything special.

Right now I'm not sure what mechanism we will use - we want to
support files that have a mix of private and shared pages, so that
implies we are not going to be sharing mappings but sharing pages
instead.  However, we've been looking at this as being completely
encapsulated within the filesystem because it's tightly linked to
changes in the physical layout of the filesystem, not as general
"share this mapping between two unrelated inodes" infrastructure.
That may change as we dig deeper into it...

> The use case I have is fixing overlayfs weird behavior. The following
> may result in "buf" not matching "data":
> 
> int fr = open("foo", O_RDONLY);
> int fw = open("foo", O_RDWR);
> write(fw, data, sizeof(data));
> read(fr, buf, sizeof(data));
> 
> The reason is that "foo" is on a read-only layer, and opening it for
> read-write triggers copy-up into a read-write layer.  However the old,
> read-only open still refers to the unmodified file.
>
> Fixing this properly requires that when opening a file, we don't
> delegate operations fully to the underlying file, but rather allow
> sharing of pages from underlying file until the file is copied up.  At
> that point we switch to sharing pages with the read-write copy.

Unless I'm missing something here (quite possible!), I'm not sure
we can fix that problem with page cache sharing or reflink. It
implies we are sharing pages in a downwards direction - private
overlay pages/mappings from multiple inodes would need to be shared
with a single underlying shared read-only inode, and I lack the
imagination to see how that works...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/1] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

On 19 May 2016 at 19:13, Nicholas D Steeves  wrote:
> Nicholas D Steeves (1):
>   String and comment review: Fix typos; fix a couple of mandatory
> grammatical issues for clarity.
>
>  fs/btrfs/backref.c|  2 +-
>  fs/btrfs/btrfs_inode.h|  2 +-
>  fs/btrfs/check-integrity.c|  2 +-
>  fs/btrfs/ctree.c  | 14 +++---
>  fs/btrfs/ctree.h  | 10 +-
>  fs/btrfs/delayed-ref.h|  2 +-
>  fs/btrfs/dev-replace.c|  2 +-
>  fs/btrfs/disk-io.c| 10 +-
>  fs/btrfs/extent-tree.c| 34 +-
>  fs/btrfs/extent_io.c  |  4 ++--
>  fs/btrfs/extent_map.c |  2 +-
>  fs/btrfs/file.c   |  4 ++--
>  fs/btrfs/free-space-cache.c   |  2 +-
>  fs/btrfs/free-space-cache.h   |  2 +-
>  fs/btrfs/inode.c  | 22 +++---
>  fs/btrfs/ioctl.c  | 10 +-
>  fs/btrfs/ordered-data.h   |  2 +-
>  fs/btrfs/qgroup.c | 16 
>  fs/btrfs/raid56.c |  6 +++---
>  fs/btrfs/relocation.c | 12 ++--
>  fs/btrfs/root-tree.c  |  4 ++--
>  fs/btrfs/scrub.c  |  4 ++--
>  fs/btrfs/send.c   |  6 +++---
>  fs/btrfs/struct-funcs.c   |  2 +-
>  fs/btrfs/super.c  | 10 +-
>  fs/btrfs/sysfs.c  |  2 +-
>  fs/btrfs/tests/extent-io-tests.c  |  2 +-
>  fs/btrfs/tests/free-space-tests.c |  7 ---
>  fs/btrfs/tests/inode-tests.c  |  2 +-
>  fs/btrfs/tests/qgroup-tests.c |  2 +-
>  fs/btrfs/transaction.h|  2 +-
>  fs/btrfs/tree-log.c   |  8 
>  fs/btrfs/ulist.c  |  2 +-
>  fs/btrfs/volumes.c|  8 
>  34 files changed, 111 insertions(+), 110 deletions(-)
>
> --
> 2.1.4
>

Argh.  I'm sorry, I wasn't quite careful enough, and was working off
of master rather than for-next.  Please disregard this patch.

Sincerely,
Nicholas
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hot data tracking / hybrid storage

2016-05-19 Thread Henk Slager

On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn
 wrote:
> On 2016-05-19 14:09, Kai Krakow wrote:
>>
>> Am Wed, 18 May 2016 22:44:55 + (UTC)
>> schrieb Ferry Toth :
>>
>>> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
>>>
 Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
 :

> On 2016-05-17 02:27, Ferry Toth wrote:
>>>
>>>  [...]
>>>  [...]
>
>  [...]
>>>
>>>  [...]
>>>  [...]
>>>  [...]
>
> On the other hand, it's actually possible to do this all online
> with BTRFS because of the reshaping and device replacement tools.
>
> In fact, I've done even more complex reprovisioning online before
> (for example, my home server system has 2 SSD's and 4 HDD's,
> running BTRFS on top of LVM, I've at least twice completely
> recreated the LVM layer online without any data loss and minimal
> performance degradation).
>>>
>>>  [...]
>
> I have absolutely no idea how bcache handles this, but I doubt
> it's any better than BTRFS.


 Bcache should in theory fall back to write-through as soon as an
 error counter exceeds a threshold. This is adjustable with sysfs
 io_error_halftime and io_error_limit. Tho I never tried what
 actually happens when either the HDD (in bcache writeback-mode) or
 the SSD fails. Actually, btrfs should be able to handle this (tho,
 according to list reports, it doesn't handle errors very well at
 this point).

 BTW: Unnecessary copying from SSD to HDD doesn't take place in
 bcache default mode: It only copies from HDD to SSD in writeback
 mode (data is written to the cache first, then persisted to HDD in
 the background). You can also use "write through" (data is written
 to SSD and persisted to HDD at the same time, reporting persistence
 to the application only when both copies were written) and "write
 around" mode (data is written to HDD only, and only reads are
 written to the SSD cache device).

 If you want bcache behave as a huge IO scheduler for writes, use
 writeback mode. If you have write-intensive applications, you may
 want to choose write-around to not wear out the SSDs early. If you
 want writes to be cached for later reads, you can choose
 write-through mode. The latter two modes will ensure written data
 is always persisted to HDD with the same guaranties you had without
 bcache. The last mode is default and should not change behavior of
 btrfs if the HDD fails, and if the SSD fails bcache would simply
 turn off and fall back to HDD.
>>>
>>>
>>> Hello Kai,
>>>
>>> Yeah, lots of modes. So that means, none works well for all cases?
>>
>>
>> Just three, and they all work well. It's just a decision wearing vs.
>> performance/safety. Depending on your workload you might benefit more or
>> less from write-behind caching - that's when you want to turn the knob.
>> Everything else works out of the box. In case of an SSD failure,
>> write-back is just less safe while the other two modes should keep your
>> FS intact in that case.
>>
>>> Our server has lots of old files, on smb (various size), imap
>>> (1's small, 1000's large), postgresql server, virtualbox images
>>> (large), 50 or so snapshots and running synaptics for system upgrades
>>> is painfully slow.
>>
>>
>> I don't think that bcache even cares to cache imap accesses to mail
>> bodies - it won't help performance. Network is usually much slower than
>> SSD access. But it will cache fs meta data which will improve imap
>> performance a lot.
>
> Bcache caches anything that falls within it's heuristics as candidates for
> caching.  It pays no attention to what type of data you're accessing, just
> the access patterns.  This is also the case for dm-cache, and for Windows
> ReadyBoost (or whatever the hell they're calling it these days).  Unless
> you're shifting very big e-mails, it's pretty likely that ones that get
> accessed more than once in a short period of time will end up being cached.
>>
>>
>>> We are expecting slowness to be caused by fsyncs which appear to be
>>> much worse on a raid10 with snapshots. Presumably the whole thing
>>> would be fast enough with ssd's but that would be not very cost
>>> efficient.
>>>
>>> All the overhead of the cache layer could be avoided if btrfs would
>>> just prefer to write small, hot, files to the ssd in the first place
>>> and clean up while balancing. A combination of 2 ssd's and 4 hdd's
>>> would be very nice (the mobo has 6 x sata, which is pretty common)
>>
>>
>> Well, I don't want to advertise bcache. But there's nothing you
>> couldn't do with it in your particular case:
>>
>> Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
>> here, you can use 1:n where n is the backing devices. There's no need
>> to clean up using balancing because bcache will track hot data by
>> default.

[PATCH 1/1] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

Signed-off-by: Nicholas D Steeves 
---
 fs/btrfs/backref.c|  2 +-
 fs/btrfs/btrfs_inode.h|  2 +-
 fs/btrfs/check-integrity.c|  2 +-
 fs/btrfs/ctree.c  | 14 +++---
 fs/btrfs/ctree.h  | 10 +-
 fs/btrfs/delayed-ref.h|  2 +-
 fs/btrfs/dev-replace.c|  2 +-
 fs/btrfs/disk-io.c| 10 +-
 fs/btrfs/extent-tree.c| 34 +-
 fs/btrfs/extent_io.c  |  4 ++--
 fs/btrfs/extent_map.c |  2 +-
 fs/btrfs/file.c   |  4 ++--
 fs/btrfs/free-space-cache.c   |  2 +-
 fs/btrfs/free-space-cache.h   |  2 +-
 fs/btrfs/inode.c  | 22 +++---
 fs/btrfs/ioctl.c  | 10 +-
 fs/btrfs/ordered-data.h   |  2 +-
 fs/btrfs/qgroup.c | 16 
 fs/btrfs/raid56.c |  6 +++---
 fs/btrfs/relocation.c | 12 ++--
 fs/btrfs/root-tree.c  |  4 ++--
 fs/btrfs/scrub.c  |  4 ++--
 fs/btrfs/send.c   |  6 +++---
 fs/btrfs/struct-funcs.c   |  2 +-
 fs/btrfs/super.c  | 10 +-
 fs/btrfs/sysfs.c  |  2 +-
 fs/btrfs/tests/extent-io-tests.c  |  2 +-
 fs/btrfs/tests/free-space-tests.c |  7 ---
 fs/btrfs/tests/inode-tests.c  |  2 +-
 fs/btrfs/tests/qgroup-tests.c |  2 +-
 fs/btrfs/transaction.h|  2 +-
 fs/btrfs/tree-log.c   |  8 
 fs/btrfs/ulist.c  |  2 +-
 fs/btrfs/volumes.c|  8 
 34 files changed, 111 insertions(+), 110 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 80e8472..b8b5987 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -1939,7 +1939,7 @@ static int inode_to_path(u64 inum, u32 name_len, unsigned 
long name_off,
  * from ipath->fspath->val[i].
  * when it returns, there are ipath->fspath->elem_cnt number of paths available
  * in ipath->fspath->val[]. when the allocated space wasn't sufficient, the
- * number of missed paths in recored in ipath->fspath->elem_missed, otherwise,
+ * number of missed paths is recorded in ipath->fspath->elem_missed, otherwise,
  * it's zero. ipath->fspath->bytes_missing holds the number of bytes that would
  * have been needed to return all paths.
  */
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 61205e3..c0a2018 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -303,7 +303,7 @@ struct btrfs_dio_private {
struct bio *dio_bio;
 
/*
-* The original bio may be splited to several sub-bios, this is
+* The original bio may be split to several sub-bios, this is
 * done during endio of sub-bios
 */
int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 516e19d..b677a6e 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1939,7 +1939,7 @@ again:
/*
 * Clear all references of this block. Do not free
 * the block itself even if is not referenced anymore
-* because it still carries valueable information
+* because it still carries valuable information
 * like whether it was ever written and IO completed.
 */
list_for_each_entry_safe(l, tmp, >ref_to_list,
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index ec7928a..6487c9c 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -156,7 +156,7 @@ struct extent_buffer *btrfs_root_node(struct btrfs_root 
*root)
 
/*
 * RCU really hurts here, we could free up the root node because
-* it was cow'ed but we may not get the new root node yet so do
+* it was COWed but we may not get the new root node yet so do
 * the inc_not_zero dance and if it doesn't work then
 * synchronize_rcu and try again.
 */
@@ -955,7 +955,7 @@ int btrfs_block_can_be_shared(struct btrfs_root *root,
  struct extent_buffer *buf)
 {
/*
-* Tree blocks not in refernece counted trees and tree roots
+* Tree blocks not in reference counted trees and tree roots
 * are never shared. If a block was allocated after the last
 * snapshot and the block was not allocated by tree relocation,
 * we know the block is not shared.
@@ -1270,7 +1270,7 @@ __tree_mod_log_oldest_root(struct btrfs_fs_info *fs_info,
 
 /*
  * tm is a pointer to the first operation to rewind within eb. then, all
- * previous operations will be rewinded (until we reach something older than
+ * previous operations will be rewound (until we reach something older than
  * time_seq).
  */
 static

[PATCH 0/1] String and comment review: Fix typos; fix a couple of mandatory grammatical issues for clarity.

Hi David,

I took a look at what you merged and what you didn't for my review of 
btrfs-progs, and revised my approach accordingly.  There are six edits that 
require your review.  Please search/grep for "Steeves" to find them.

Cheers!
Nicholas

Nicholas D Steeves (1):
  String and comment review: Fix typos; fix a couple of mandatory
grammatical issues for clarity.

 fs/btrfs/backref.c|  2 +-
 fs/btrfs/btrfs_inode.h|  2 +-
 fs/btrfs/check-integrity.c|  2 +-
 fs/btrfs/ctree.c  | 14 +++---
 fs/btrfs/ctree.h  | 10 +-
 fs/btrfs/delayed-ref.h|  2 +-
 fs/btrfs/dev-replace.c|  2 +-
 fs/btrfs/disk-io.c| 10 +-
 fs/btrfs/extent-tree.c| 34 +-
 fs/btrfs/extent_io.c  |  4 ++--
 fs/btrfs/extent_map.c |  2 +-
 fs/btrfs/file.c   |  4 ++--
 fs/btrfs/free-space-cache.c   |  2 +-
 fs/btrfs/free-space-cache.h   |  2 +-
 fs/btrfs/inode.c  | 22 +++---
 fs/btrfs/ioctl.c  | 10 +-
 fs/btrfs/ordered-data.h   |  2 +-
 fs/btrfs/qgroup.c | 16 
 fs/btrfs/raid56.c |  6 +++---
 fs/btrfs/relocation.c | 12 ++--
 fs/btrfs/root-tree.c  |  4 ++--
 fs/btrfs/scrub.c  |  4 ++--
 fs/btrfs/send.c   |  6 +++---
 fs/btrfs/struct-funcs.c   |  2 +-
 fs/btrfs/super.c  | 10 +-
 fs/btrfs/sysfs.c  |  2 +-
 fs/btrfs/tests/extent-io-tests.c  |  2 +-
 fs/btrfs/tests/free-space-tests.c |  7 ---
 fs/btrfs/tests/inode-tests.c  |  2 +-
 fs/btrfs/tests/qgroup-tests.c |  2 +-
 fs/btrfs/transaction.h|  2 +-
 fs/btrfs/tree-log.c   |  8 
 fs/btrfs/ulist.c  |  2 +-
 fs/btrfs/volumes.c|  8 
 34 files changed, 111 insertions(+), 110 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hot data tracking / hybrid storage

2016-05-19 Thread Kai Krakow

Am Thu, 19 May 2016 14:51:01 -0400
schrieb "Austin S. Hemmelgarn" :

> For a point of reference, I've 
> got a pair of 250GB Crucial MX100's (they cost less than 0.50 USD per
> GB when I got them and provide essentially the same power-loss
> protections that the high end Intel SSD's do) which have seen more
> than 2.5TB of data writes over their lifetime, combined from at least
> three different filesystem formats (BTRFS, FAT32, and ext4), swap
> space, and LVM management, and the wear-leveling indicator on each
> still says they have 100% life remaining, and the similar 500GB one I
> just recently upgraded in my laptop had seen over 50TB of writes and
> was still saying 95% life remaining (and had been for months).

The smaller Crucials are much worse at that: The MX100 128GB version I
had was specified for 85TB writes which I hit after about 12 months (97%
lifetime used according to smartctl) due to excessive write patterns.
I'm not sure how long it would have lasted but I decided to swap it for
a Samsung 500GB drive, and reconfigure my system for much less write
patterns.

What should I say: I liked the Crucial more, first: It has an easy
lifetime counter in smartctl, Samsung doesn't. And it had powerloss
protection which Samsung doesn't explicitly mention (tho I think it has
it).

At least, according to endurance tests, my Samsung SSD should take
about 1 PB of writes. I've already written 7 TB if I can trust the
smartctl raw value.

But I think you cannot compare specification values to a real endurance
test... I think it says 150TBW for 500GB 850 EVO.

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix handling of faults from btrfs_copy_from_user

2016-05-19 Thread Adam Borowski

On Thu, May 19, 2016 at 04:10:20PM -0400, Chris Mason wrote:
> On Tue, May 17, 2016 at 10:47:27PM +0200, Adam Borowski wrote:
> > On Tue, May 17, 2016 at 12:23:44PM -0400, Chris Mason wrote:
> > > On Tue, May 17, 2016 at 05:14:51PM +0200, Adam Borowski wrote:
> > > > On Mon, May 16, 2016 at 05:06:55PM -0400, Chris Mason wrote:
> > > > > And now for the patch:
> > > > [...]
> > > > 
> > > > I then tried your test case, and alas:
> > > > 
> > > > Here's another run on an untainted kernel built with frame pointers etc:
> > > > 
> > > > ./dammitdave foo
> > > > [  236.500280] WARNING: CPU: 3 PID: 2940 at fs/btrfs/extent-tree.c:4233 
> > > > btrfs_free_reserved_data_space_noquota+0xdd/0x160
> > > > rm foo
> > > > [  323.851851] BTRFS info (device sda1): btrfs_destroy_inode: leftover 
> > > > csum_bytes
> > > 
> > > Hmmm, some of your traces mentioned compression, do you have compression
> > > enabled?
> > 
> > Yeah, I mount with noatime,compress=lzo.
> > 
> > > I'll try to reproduce here, but could you try the same test on v4.5?
> > 
> > I've ran it for half an hour on vanilla 4.5.4 without any patches, no
> > failures of any kind.
>  >
> > > gdb> list *__btrfs_buffered_write+0x748
> > 
> > 0x8152eb78 is in __btrfs_buffered_write (fs/btrfs/file.c:1564).
> 
> Hmpf, even with forcing btrfs_delalloc_reserve_metadata to randomly
> fail, I can't trigger this warning.  Something else is going on.
> 
> For now I'm going to send in the fault patch, I'm confident this new
> warning is something unrelated.

After having balanced my sda1, I can't reproduce this anymore, even after
having re-allocated all chunks.

As the original bug produced different results (like, no warnings other than
btrfs_destroy_inode:csum_bytes), it indeed looks like an unrelated
regression of some kind.

I re-checked both real loads and dammitdave yesterday:
* 4.5.4 works ok
* 4.6.0 without your patch warns btrfs_destroy_inode
* 4.6.0 with the patch works ok
thus, it at least fixes that one.

Tested-by: Adam Borowski 

-- 
An imaginary friend squared is a real enemy.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix handling of faults from btrfs_copy_from_user

2016-05-19 Thread Chris Mason

On Tue, May 17, 2016 at 10:47:27PM +0200, Adam Borowski wrote:
> On Tue, May 17, 2016 at 12:23:44PM -0400, Chris Mason wrote:
> > On Tue, May 17, 2016 at 05:14:51PM +0200, Adam Borowski wrote:
> > > On Mon, May 16, 2016 at 05:06:55PM -0400, Chris Mason wrote:
> > > > And now for the patch:
> > > [...]
> > > 
> > > I then tried your test case, and alas:
> > > 
> > > Here's another run on an untainted kernel built with frame pointers etc:
> > > 
> > > ./dammitdave foo
> > > [  236.500257] [ cut here ]
> > > [  236.500280] WARNING: CPU: 3 PID: 2940 at fs/btrfs/extent-tree.c:4233 
> > > btrfs_free_reserved_data_space_noquota+0xdd/0x160
> > > [  236.500285] Modules linked in:
> > > [  236.500295] CPU: 3 PID: 2940 Comm: dammit Not tainted 4.6.0debug+ #1
> > > [  236.500301] Hardware name: System manufacturer System Product 
> > > Name/M4A77T, BIOS 240105/18/2011
> > > [  236.500306]   42ec2fb0 88022d5ffbd8 
> > > 816b1920
> > > [  236.500315]  81f86e3f 42ec2fb0  
> > > 
> > > [  236.500322]  88022d5ffc20 81118c2c 88022dfd4000 
> > > 1089
> > > [  236.500330] Call Trace:
> > > [  236.500342]  [] dump_stack+0x60/0xa0
> > > [  236.500352]  [] __warn+0x10c/0x150
> > > [  236.500360]  [] warn_slowpath_null+0x18/0x20
> > > [  236.500368]  [] 
> > > btrfs_free_reserved_data_space_noquota+0xdd/0x160
> > > [  236.500376]  [] 
> > > btrfs_free_reserved_data_space+0x22/0x40
> > > [  236.500385]  [] __btrfs_buffered_write+0x748/0xa20
> > > [  236.500394]  [] ? 
> > > security_inode_need_killpriv+0x3c/0x60
> > > [  236.500401]  [] btrfs_file_write_iter+0x4ff/0xb90
> > > [  236.500410]  [] __vfs_write+0x117/0x1d0
> > > [  236.500417]  [] vfs_write+0xdd/0x290
> > > [  236.500425]  [] ? __fget_light+0x4d/0x120
> > > [  236.500432]  [] SyS_pwrite64+0x9e/0xc0
> > > [  236.500441]  [] entry_SYSCALL_64_fastpath+0x17/0x93
> > > [  236.500446] ---[ end trace 7df747a6a0962ae6 ]---
> > > rm foo
> > > [  323.851851] BTRFS info (device sda1): btrfs_destroy_inode: leftover 
> > > csum_bytes
> > 
> > Hmmm, some of your traces mentioned compression, do you have compression
> > enabled?
> 
> Yeah, I mount with noatime,compress=lzo.
> 
> > I'll try to reproduce here, but could you try the same test on v4.5?
> 
> I've ran it for half an hour on vanilla 4.5.4 without any patches, no
> failures of any kind.
> 
> > Also, if you can gdb your vmlinux (or btrfs.ko) and find this line:
> > 
> > gdb> list *__btrfs_buffered_write+0x748
> 
> 0x8152eb78 is in __btrfs_buffered_write (fs/btrfs/file.c:1564).
> 1559  
> 1560  reserve_metadata:
> 1561  ret = btrfs_delalloc_reserve_metadata(inode, 
> reserve_bytes);
> 1562  if (ret) {
> 1563  if (!only_release_metadata)
> 1564  btrfs_free_reserved_data_space(inode, 
> pos,
> 1565 
> write_bytes);
> 1566  else
> 1567  btrfs_end_write_no_snapshoting(root);
> 1568  break;
> 

Hmpf, even with forcing btrfs_delalloc_reserve_metadata to randomly
fail, I can't trigger this warning.  Something else is going on.

For now I'm going to send in the fault patch, I'm confident this new
warning is something unrelated.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Hot data tracking / hybrid storage

2016-05-19 Thread Austin S. Hemmelgarn


On 2016-05-19 14:09, Kai Krakow wrote:

Am Wed, 18 May 2016 22:44:55 + (UTC)
schrieb Ferry Toth :


Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:


Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
:


On 2016-05-17 02:27, Ferry Toth wrote:

 [...]
 [...]

 [...]

 [...]
 [...]
 [...]

On the other hand, it's actually possible to do this all online
with BTRFS because of the reshaping and device replacement tools.

In fact, I've done even more complex reprovisioning online before
(for example, my home server system has 2 SSD's and 4 HDD's,
running BTRFS on top of LVM, I've at least twice completely
recreated the LVM layer online without any data loss and minimal
performance degradation).

 [...]

I have absolutely no idea how bcache handles this, but I doubt
it's any better than BTRFS.


Bcache should in theory fall back to write-through as soon as an
error counter exceeds a threshold. This is adjustable with sysfs
io_error_halftime and io_error_limit. Tho I never tried what
actually happens when either the HDD (in bcache writeback-mode) or
the SSD fails. Actually, btrfs should be able to handle this (tho,
according to list reports, it doesn't handle errors very well at
this point).

BTW: Unnecessary copying from SSD to HDD doesn't take place in
bcache default mode: It only copies from HDD to SSD in writeback
mode (data is written to the cache first, then persisted to HDD in
the background). You can also use "write through" (data is written
to SSD and persisted to HDD at the same time, reporting persistence
to the application only when both copies were written) and "write
around" mode (data is written to HDD only, and only reads are
written to the SSD cache device).

If you want bcache behave as a huge IO scheduler for writes, use
writeback mode. If you have write-intensive applications, you may
want to choose write-around to not wear out the SSDs early. If you
want writes to be cached for later reads, you can choose
write-through mode. The latter two modes will ensure written data
is always persisted to HDD with the same guaranties you had without
bcache. The last mode is default and should not change behavior of
btrfs if the HDD fails, and if the SSD fails bcache would simply
turn off and fall back to HDD.


Hello Kai,

Yeah, lots of modes. So that means, none works well for all cases?


Just three, and they all work well. It's just a decision wearing vs.
performance/safety. Depending on your workload you might benefit more or
less from write-behind caching - that's when you want to turn the knob.
Everything else works out of the box. In case of an SSD failure,
write-back is just less safe while the other two modes should keep your
FS intact in that case.


Our server has lots of old files, on smb (various size), imap
(1's small, 1000's large), postgresql server, virtualbox images
(large), 50 or so snapshots and running synaptics for system upgrades
is painfully slow.


I don't think that bcache even cares to cache imap accesses to mail
bodies - it won't help performance. Network is usually much slower than
SSD access. But it will cache fs meta data which will improve imap
performance a lot.
Bcache caches anything that falls within it's heuristics as candidates 
for caching.  It pays no attention to what type of data you're 
accessing, just the access patterns.  This is also the case for 
dm-cache, and for Windows ReadyBoost (or whatever the hell they're 
calling it these days).  Unless you're shifting very big e-mails, it's 
pretty likely that ones that get accessed more than once in a short 
period of time will end up being cached.



We are expecting slowness to be caused by fsyncs which appear to be
much worse on a raid10 with snapshots. Presumably the whole thing
would be fast enough with ssd's but that would be not very cost
efficient.

All the overhead of the cache layer could be avoided if btrfs would
just prefer to write small, hot, files to the ssd in the first place
and clean up while balancing. A combination of 2 ssd's and 4 hdd's
would be very nice (the mobo has 6 x sata, which is pretty common)


Well, I don't want to advertise bcache. But there's nothing you
couldn't do with it in your particular case:

Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
here, you can use 1:n where n is the backing devices. There's no need
to clean up using balancing because bcache will track hot data by
default. You just have to decide which balance between wearing the SSD
vs. performance you prefer. If slow fsyncs are you primary concern, I'd
go with write-back caching. The small file contents are propably not
your performance problem anyways but the meta data management btrfs has
to do in the background. Bcache will help a lot here, especially in
write-back mode. I'd recommend against using balance too often and too
intensive (don't use too big usage% filters), it will invalidate your
block cache and probably

Re: Hot data tracking / hybrid storage

2016-05-19 Thread Kai Krakow

Am Wed, 18 May 2016 22:44:55 + (UTC)
schrieb Ferry Toth :

> Op Tue, 17 May 2016 20:33:35 +0200, schreef Kai Krakow:
> 
> > Am Tue, 17 May 2016 07:32:11 -0400 schrieb "Austin S. Hemmelgarn"
> > :
> >   
> >> On 2016-05-17 02:27, Ferry Toth wrote:  
>  [...]  
>  [...]  
> >>  [...]  
>  [...]  
>  [...]  
>  [...]  
> >> On the other hand, it's actually possible to do this all online
> >> with BTRFS because of the reshaping and device replacement tools.
> >> 
> >> In fact, I've done even more complex reprovisioning online before
> >> (for example, my home server system has 2 SSD's and 4 HDD's,
> >> running BTRFS on top of LVM, I've at least twice completely
> >> recreated the LVM layer online without any data loss and minimal
> >> performance degradation).  
>  [...]  
> >> I have absolutely no idea how bcache handles this, but I doubt
> >> it's any better than BTRFS.  
> > 
> > Bcache should in theory fall back to write-through as soon as an
> > error counter exceeds a threshold. This is adjustable with sysfs
> > io_error_halftime and io_error_limit. Tho I never tried what
> > actually happens when either the HDD (in bcache writeback-mode) or
> > the SSD fails. Actually, btrfs should be able to handle this (tho,
> > according to list reports, it doesn't handle errors very well at
> > this point).
> > 
> > BTW: Unnecessary copying from SSD to HDD doesn't take place in
> > bcache default mode: It only copies from HDD to SSD in writeback
> > mode (data is written to the cache first, then persisted to HDD in
> > the background). You can also use "write through" (data is written
> > to SSD and persisted to HDD at the same time, reporting persistence
> > to the application only when both copies were written) and "write
> > around" mode (data is written to HDD only, and only reads are
> > written to the SSD cache device).
> > 
> > If you want bcache behave as a huge IO scheduler for writes, use
> > writeback mode. If you have write-intensive applications, you may
> > want to choose write-around to not wear out the SSDs early. If you
> > want writes to be cached for later reads, you can choose
> > write-through mode. The latter two modes will ensure written data
> > is always persisted to HDD with the same guaranties you had without
> > bcache. The last mode is default and should not change behavior of
> > btrfs if the HDD fails, and if the SSD fails bcache would simply
> > turn off and fall back to HDD. 
> 
> Hello Kai,
> 
> Yeah, lots of modes. So that means, none works well for all cases?

Just three, and they all work well. It's just a decision wearing vs.
performance/safety. Depending on your workload you might benefit more or
less from write-behind caching - that's when you want to turn the knob.
Everything else works out of the box. In case of an SSD failure,
write-back is just less safe while the other two modes should keep your
FS intact in that case.

> Our server has lots of old files, on smb (various size), imap
> (1's small, 1000's large), postgresql server, virtualbox images
> (large), 50 or so snapshots and running synaptics for system upgrades
> is painfully slow. 

I don't think that bcache even cares to cache imap accesses to mail
bodies - it won't help performance. Network is usually much slower than
SSD access. But it will cache fs meta data which will improve imap
performance a lot.

> We are expecting slowness to be caused by fsyncs which appear to be
> much worse on a raid10 with snapshots. Presumably the whole thing
> would be fast enough with ssd's but that would be not very cost
> efficient.
> 
> All the overhead of the cache layer could be avoided if btrfs would
> just prefer to write small, hot, files to the ssd in the first place
> and clean up while balancing. A combination of 2 ssd's and 4 hdd's
> would be very nice (the mobo has 6 x sata, which is pretty common)

Well, I don't want to advertise bcache. But there's nothing you
couldn't do with it in your particular case:

Just attach two HDDs to one SSD. Bcache doesn't use a 1:1 relation
here, you can use 1:n where n is the backing devices. There's no need
to clean up using balancing because bcache will track hot data by
default. You just have to decide which balance between wearing the SSD
vs. performance you prefer. If slow fsyncs are you primary concern, I'd
go with write-back caching. The small file contents are propably not
your performance problem anyways but the meta data management btrfs has
to do in the background. Bcache will help a lot here, especially in
write-back mode. I'd recommend against using balance too often and too
intensive (don't use too big usage% filters), it will invalidate your
block cache and probably also invalidate bcache if bcache is too small.
It will hurt performance more than you gain. You may want to increase
nr_requests in the IO scheduler for your situation.

> Moreover increasing the ssd's size in the future would then be just
> as simple as

Re: [PATCH 0/4] Add support to clear v1 free space cache for btrfs check

2016-05-19 Thread Ivan P

Not sure why I was so fixated on installing it - thanks for reminding.

I was able to run it, but it fails:
===
# ./btrfsck --repair --clear-space-cache -p /dev/sda
enabling repair mode
Clearing all free space cache
btrfs unable to find ref byte nr 812846673920 parent 0 root 1  owner
1050 offset 0
ERROR: failed to remove backref for disk bytenr 812846673920
ERROR: failed to clear free space cache
transaction.h:41: btrfs_start_transaction: Assertion `root->commit_root` failed.
btrfs check[0x4437ee]
btrfs check(close_ctree_fs_info+0x213)[0x445ab3]
btrfs check(cmd_check+0x638)[0x42b718]
btrfs check(main+0x7b)[0x40fbdb]
/usr/lib/libc.so.6(__libc_start_main+0xf1)[0x7fe4ebe1a741]
btrfs check(_start+0x29)[0x40fcd9]
===

I also tried running it WITHOUT --repair first, as well as ONLY with
--repair, followed by --clearcache and same thing happened.
I must add that I feel cheated by the "-p" option, it's more of a
"print heartbeat" than "print progress" IMHO :p

@David Sterba: you're welcome.

Regards,
Ivan.

On Wed, May 18, 2016 at 3:00 AM, Qu Wenruo  wrote:
> Not familiar about the problem, but you don't really need to install the
> patch.
>
> You can just execute '/btrfsck' directly to
> call the new fsck.
>
> And there is still some time before the patch is merged, there is really no
> need to install.
>
> Thanks,
> Qu
>
>
> Ivan P wrote on 2016/05/17 20:12 +0200:
>>
>> Thank you, however I can't seem to be able to compile that snapshot, I'm
>> getting
>>
>> ===
>> /usr/bin/install -c -m644 -d 64-btrfs-dm.rules
>> /home/myuser/aur/btrfs-progs-git/pkg/btrfs-progs-git/usr/lib/udev/rules.d
>> /usr/bin/install: cannot create directory ‘64-btrfs-dm.rules’: File exists
>> Makefile:400: recipe for target 'install' failed
>> make: *** [install] Error 1
>> ==> ERROR: A failure occurred in package()
>> ===
>>
>> Just to make sure I wasn't screwing up somewhere, I tried the
>> btrfs-progs-git AUR package and I'm getting the same thing.
>> It's not only me, however: according to [1] it could be that this
>> commit has introduced it: [2]
>>
>> Regards,
>> Ivan.
>>
>> [1] https://aur.archlinux.org/packages/btrfs-progs-git/
>> [2]
>> http://repo.or.cz/btrfs-progs-unstable/devel.git?a=commit;h=ebe5b1cc7885027521db3c1d16d84bd54cc1321b
>>
>> On Tue, May 17, 2016 at 5:56 AM, Qu Wenruo 
>> wrote:
>>>
>>> Also to Ivan P, the initial reporter of the problem.
>>>
>>> Now "btrfsck --clear-cache" should help you to prevent kernel warning on
>>> free space cache problem.
>>>
>>> Thanks,
>>> Qu
>>>
>>> Qu Wenruo wrote on 2016/05/17 11:47 +0800:

 The patchset can be also fetched from github, just in case mail list
 blocks the last patch, which contains binary file.
 https://github.com/adam900710/btrfs-progs.git fsck_clear_cache

 Just as describe in the first patch, btrfs kernel "clear_cache" mount
 option can't rebuild all free space cache, due to the design of free
 space cache.
 (Although I pretty like the idea to delay the load of free space cache,
 as it hugely reduce the mount time of large fs)

 So this makes users to report error in mail list, complaining
 "clear_cache" doesn't wipe the annoying kernel warning on corrupted free
 space cache.

 Since kernel can't handle it, and user consider it more like an error,
 we'd better handle it like an error, to fix it in btrfs check.

 So this patchset adds the ability to clear free space cache, and add
 test case for it.

 The clear procedure is different from kernel, it will remove all free
 space cache inodes and its contents(with backrefs), and set
 cache_generation of super block to -1.
 So there will be no free space cache at all and kernel will be happy
 with that.

 This patch also enhances btrfs_previous_item() to use min_objectid to
 return as early as possible.

 Lu Fengqi (1):
   btrfs-progs: tests: add 020-bad-free-space-cache

 Qu Wenruo (3):
   btrfs-progs: corrupt-block: Add ability to corrupt free space cache
 file
   btrfs-progs: ctree: return earlier for btrfs_previous_item
   btrfs-progs: fsck: Add support to clear free space cache

  Documentation/btrfs-check.asciidoc |   8 ++
  btrfs-corrupt-block.c  | 124
 -
  cmds-check.c   |  58 +-
  ctree.c|   2 +
  free-space-cache.c | 124
 +
  free-space-cache.h |   4 +
  .../020-bad-free-space-cache/default_case.raw.xz   | Bin 0 -> 164068
 bytes
  tests/fsck-tests/020-bad-free-space-cache/test.sh  |  16 +++
  8 files changed, 334

Re: Crash after WARN_ON(BTRFS_I(inode)->csum_bytes)

2016-05-19 Thread Chris Mason

On Thu, May 19, 2016 at 12:25:53PM +0200, M G Berberich wrote:
> Hello,
> 
> a near total system crash (reboot via magic-sysreq-keys was possible)
> which seems to be associated to btrfs.
> 
> btrfs check (v 4.5.2) does not find any problem with any filesystems
> on the system.
> 
> Kernel is stock 4.6, booting without initramfs, system is Debian
> stretch (testing).

This patch should fix things:

https://patchwork.kernel.org/patch/9106001/

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH v2.1 16/16] btrfs-progs: fsck: Introduce low memory mode

On Wed, May 18, 2016 at 08:58:57AM +0800, Qu Wenruo wrote:
> Josef Bacik wrote on 2016/05/17 11:29 -0400:
> > On 05/17/2016 05:38 AM, Qu Wenruo wrote:
> >
> Thanks for the review.
> 
> I'll add it to github branch to avoid unneeded mail bombing.

Thanks.  I did one more pass and fixed the error messages and some minor
formatting. Branch merged in devel, any fixups, please send as separate
patches. I'd like to give enough time for testing, so ETA for 4.6 will
be the end of the next week.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH V2] Btrfs: introduce ticketed enospc infrastructure

2016-05-19 Thread Austin S. Hemmelgarn


On 2016-05-18 07:24, Austin S. Hemmelgarn wrote:

On 2016-05-17 13:30, Josef Bacik wrote:

Our enospc flushing sucks.  It is born from a time where we were early
enospc'ing constantly because multiple threads would race in for the same
reservation and randomly starve other ones out.  So I came up with
this solution
to block any other reservations from happening while one guy tried to
flush
stuff to satisfy his reservation.  This gives us pretty good
correctness, but
completely crap latency.

The solution I've come up with is ticketed reservations.  Basically we
try to
make our reservation, and if we can't we put a ticket on a list in
order and
kick off an async flusher thread.  This async flusher thread does the
same old
flushing we always did, just asynchronously.  As space is freed and
added back
to the space_info it checks and sees if we have any tickets that need
satisfying, and adds space to the tickets and wakes up anything we've
satisfied.

Once the flusher thread stops making progress it wakes up all the current
tickets and tells them to take a hike.

There is a priority list for things that can't flush, since the async
flusher
could do anything we need to avoid deadlocks.  These guys get priority
for
having their reservation made, and will still do manual flushing
themselves in
case the async flusher isn't running.

This patch gives us significantly better latencies.  Thanks,

Signed-off-by: Josef Bacik 

I've had this running on my test system (which is _finally_ working
again) for about 16 hours now, nothing is breaking, and a number of the
tests are actually completing marginally faster, so you can add:

Tested-by: Austin S. Hemmelgarn 
Hmm, this is troubling, I just did some manual testing, and hit the 
exact same warning that David did, so it looks like I need to go do some 
more thorough testing of my test system...


FWIW though, everything else does appear to be working correctly.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: Return earlier for previous item

On Thu, May 19, 2016 at 10:54:39AM +0800, Qu Wenruo wrote:
> Follow kernel code to return earlier for btrfs_previous_item() function.
> 
> Before this patch, btrfs_previous_item() doesn't use its min_objectid to
> exit, this makes caller to check key to exit, and if caller doesn't
> check, it will iterate all previous item.
> 
> This patch will check min_objectid and type, to early return and save
> some time.
> 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs-progs: Enhance tree block check by checking empty leaf or node

On Thu, May 19, 2016 at 04:44:35PM +0800, Qu Wenruo wrote:
> For btrfs, it's possible to have empty leaf, but empty node is not
> possible.
> 
> Add check for empty node for tree blocks.
> 
> Suggested-by: Josef Bacik 
> Signed-off-by: Qu Wenruo 

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/3] btrfs-progs: fixes on top of latest integration branch

On Thu, May 19, 2016 at 01:13:11PM +0800, Anand Jain wrote:
>   Your latest branch integration-20160517 is failing with
>   make install, 1/3 will fix it. While here can you also
>   apply 2/3 and 3/3 as below, however they aren't related
>   though.

>   btrfs-progs: fix make install failure
>   btrfs-progs: add clean-all to the usage
>   btrfs-progs: clean up commands.h

2 and 3 applied, thanks. It's absolutelly ok to group such small misc
fixups.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/3] btrfs-progs: fix make install failure

On Thu, May 19, 2016 at 01:13:12PM +0800, Anand Jain wrote:
> /usr/bin/install -c -m644 -d 64-btrfs-dm.rules /usr/lib/udev/rules.d
> /usr/bin/install: cannot create directory ‘64-btrfs-dm.rules’: File exists
> Makefile:400: recipe for target 'install' failed
> make: *** [install] Error 1
> 
> fixes: btrfs-progs: udev: add rules for dm devices

>  ifneq ($(udevdir),)
> - $(INSTALL) -m644 -d $(udev_rules) $(DESTDIR)$(udevruledir)
> + $(INSTALL) -m644 $(udev_rules) $(DESTDIR)$(udevruledir)
>  endif

We have to conditionally create the directory, so extra -d on the
udevrule or -D is needed. Looking at the patch we have in openSUSE, it
does the former. I'll update the patch.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] btrfs: correct inode's outstanding_extents computation

2016-05-19 Thread Filipe Manana

On Thu, May 19, 2016 at 11:49 AM, Wang Xiaoguang
 wrote:
> This issue was revealed by modifing BTRFS_MAX_EXTENT_SIZE(128MB) to 64KB,
> When modifing BTRFS_MAX_EXTENT_SIZE(128MB) to 64KB, fsstress test often gets
> these warnings from btrfs_destroy_inode():
> WARN_ON(BTRFS_I(inode)->outstanding_extents);
> WARN_ON(BTRFS_I(inode)->reserved_extents);
>
> Simple test program below can reproduce this issue steadily.
> #include 
> #include 
> #include 
> #include 
> #include 
>
> int main(void)
> {
> int fd;
> char buf[1024*1024];
>
> memset(buf, 0, 1024 * 1024);
> fd = open("testfile", O_CREAT | O_EXCL | O_RDWR);
> pwrite(fd, buf, 69954, 693581);
> return;
> }
>
> Assume the BTRFS_MAX_EXTENT_SIZE is 64KB, and data range is:
> 692224
>  765951
> |--|
>  len(73728)
> 1) for the above data range, btrfs_delalloc_reserve_metadata() will reserve
> metadata and BTRFS_I(inode)->outstanding_extents will be 2.
> (73728 + 65535) / 65536 == 2
>
> 2) then btrfs_dirty_page() will be called to dirty pages and set 
> EXTENT_DELALLOC
> flag. In this case, btrfs_set_bit_hook will be called 3 times. For first call,
> there will be such extent io map.
> 692224 696319 696320  
>   765951
> |--|  
> |-|
>len(4096)len(69632)
> have EXTENT_DELALLOC
> and because of having EXTENT_FIRST_DELALLOC, btrfs_set_bit_hook() won't change
> BTRFS_I(inode)->outstanding_extents, still be 2. see code logic in 
> btrfs_set_bit_hook();
>
> 3) second btrfs_set_bit_hook() call.
> Because of EXTENT_FIRST_DELALLOC have been unset by previous 
> btrfs_set_bit_hook(),
> btrfs_set_bit_hook will increase BTRFS_I(inode)->outstanding_extents by one, 
> so now
> BTRFS_I(inode)->outstanding_extents, sitll is 3. There will be such extent_io 
> map:
> 692224   696319 696320761855 761856   
>   765951
> ||  |-|  
> |--|
> len(4096) len(65536) len(4096)
> have EXTENT_DELALLOC  have EXTENT_DELALLOC
>
> And because (692224, 696319) and (696320, 761855) is adjacent, 
> btrfs_merge_extent_hook()
> will merge them into one delalloc extent, but according to the compulation 
> logic in
> btrfs_merge_extent_hook(), BTRFS_I(inode)->outstanding_extents will still be 
> 3.
> After merge, tehre will bu such extent_io map:
> 692224761855 761856   
>   765951
> |-|  
> |--|
>len(69632) len(4096)
>   have EXTENT_DELALLOC
>
> 4) third btrfs_set_bit_hook() call.
> Also because of EXTENT_FIRST_DELALLOC have not been set, btrfs_set_bit_hook 
> will increase
> BTRFS_I(inode)->outstanding_extents by one, so now 
> BTRFS_I(inode)->outstanding_extents is 4.
> The extent io map is:
> 692224761855 761856   
>   765951
> |-|  
> |--|
>len(69632) len(4096)
>   have EXTENT_DELALLOChave 
> EXTENT_DELALLOC
>
> Also because (692224, 761855) and (761856, 765951) is adjacent, 
> btrfs_merge_extent_hook()
> will merge them into one delalloc extent, according to the compulation logic 
> in
> btrfs_merge_extent_hook(), BTRFS_I(inode)->outstanding_extents will decrease 
> by one, be 3.
> so after merge, tehre will bu such extent_io map:
> 692224
>   765951
> |---|
>  len(73728)
>have EXTENT_DELALLOC
>
> But indeed for original data range(start:692224 end:765951 len:73728), we 
> just should
> have 2 outstanding extents, so it will trigger the above WARNINGs.
>
> The root casue is that btrfs_delalloc_reserve_metadata() will always add 
> needed outstanding
> extents first, and if later btrfs_set_extent_delalloc call multiple 
> btrfs_set_bit_hook(),
> it may wrongly update BTRFS_I(inode)->outstanding_extents, This patch choose 
> to also add
> BTRFS_I(inode)->outstanding_extents in btrfs_set_bit_hook() according to the 
> data range length,

Re: sharing page cache pages between multiple mappings

2016-05-19 Thread Michal Hocko

On Thu 19-05-16 12:17:14, Miklos Szeredi wrote:
> On Thu, May 19, 2016 at 11:05 AM, Michal Hocko  wrote:
> > On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
> >> Has anyone thought about sharing pages between multiple files?
> >>
> >> The obvious application is for COW filesytems where there are
> >> logically distinct files that physically share data and could easily
> >> share the cache as well if there was infrastructure for it.
> >
> > FYI this has been discussed at LSFMM this year[1]. I wasn't at the
> > session so cannot tell you any details but the LWN article covers it at
> > least briefly.
> 
> Cool, so it's not such a crazy idea.

FWIW it is ;)
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC PATCH] btrfs: correct inode's outstanding_extents computation

2016-05-19 Thread Wang Xiaoguang

This issue was revealed by modifing BTRFS_MAX_EXTENT_SIZE(128MB) to 64KB,
When modifing BTRFS_MAX_EXTENT_SIZE(128MB) to 64KB, fsstress test often gets
these warnings from btrfs_destroy_inode():
WARN_ON(BTRFS_I(inode)->outstanding_extents);
WARN_ON(BTRFS_I(inode)->reserved_extents);

Simple test program below can reproduce this issue steadily.
#include 
#include 
#include 
#include 
#include 

int main(void)
{
int fd;
char buf[1024*1024];

memset(buf, 0, 1024 * 1024);
fd = open("testfile", O_CREAT | O_EXCL | O_RDWR);
pwrite(fd, buf, 69954, 693581);
return;
}

Assume the BTRFS_MAX_EXTENT_SIZE is 64KB, and data range is:
692224  
   765951
|--|
 len(73728)
1) for the above data range, btrfs_delalloc_reserve_metadata() will reserve
metadata and BTRFS_I(inode)->outstanding_extents will be 2.
(73728 + 65535) / 65536 == 2

2) then btrfs_dirty_page() will be called to dirty pages and set EXTENT_DELALLOC
flag. In this case, btrfs_set_bit_hook will be called 3 times. For first call,
there will be such extent io map.
692224 696319 696320
765951
|--|  
|-|
   len(4096)len(69632)
have EXTENT_DELALLOC
and because of having EXTENT_FIRST_DELALLOC, btrfs_set_bit_hook() won't change
BTRFS_I(inode)->outstanding_extents, still be 2. see code logic in 
btrfs_set_bit_hook();

3) second btrfs_set_bit_hook() call.
Because of EXTENT_FIRST_DELALLOC have been unset by previous 
btrfs_set_bit_hook(),
btrfs_set_bit_hook will increase BTRFS_I(inode)->outstanding_extents by one, so 
now
BTRFS_I(inode)->outstanding_extents, sitll is 3. There will be such extent_io 
map:
692224   696319 696320761855 761856 
765951
||  |-|  
|--|
len(4096) len(65536) len(4096)
have EXTENT_DELALLOC  have EXTENT_DELALLOC

And because (692224, 696319) and (696320, 761855) is adjacent, 
btrfs_merge_extent_hook()
will merge them into one delalloc extent, but according to the compulation 
logic in
btrfs_merge_extent_hook(), BTRFS_I(inode)->outstanding_extents will still be 3.
After merge, tehre will bu such extent_io map:
692224761855 761856 
765951
|-|  
|--|
   len(69632) len(4096)
  have EXTENT_DELALLOC

4) third btrfs_set_bit_hook() call.
Also because of EXTENT_FIRST_DELALLOC have not been set, btrfs_set_bit_hook 
will increase
BTRFS_I(inode)->outstanding_extents by one, so now 
BTRFS_I(inode)->outstanding_extents is 4.
The extent io map is:
692224761855 761856 
765951
|-|  
|--|
   len(69632) len(4096)
  have EXTENT_DELALLOChave 
EXTENT_DELALLOC

Also because (692224, 761855) and (761856, 765951) is adjacent, 
btrfs_merge_extent_hook()
will merge them into one delalloc extent, according to the compulation logic in
btrfs_merge_extent_hook(), BTRFS_I(inode)->outstanding_extents will decrease by 
one, be 3.
so after merge, tehre will bu such extent_io map:
692224  
765951
|---|
 len(73728)
   have EXTENT_DELALLOC

But indeed for original data range(start:692224 end:765951 len:73728), we just 
should
have 2 outstanding extents, so it will trigger the above WARNINGs.

The root casue is that btrfs_delalloc_reserve_metadata() will always add needed 
outstanding
extents first, and if later btrfs_set_extent_delalloc call multiple 
btrfs_set_bit_hook(),
it may wrongly update BTRFS_I(inode)->outstanding_extents, This patch choose to 
also add
BTRFS_I(inode)->outstanding_extents in btrfs_set_bit_hook() according to the 
data range length,
and the added value is the correct number of outstanding_extents for this data 
range, then
decrease the value which was added in btrfs_delalloc_reserve_metadata().

As for why BTRFS_MAX_EXTENT_SIZE(128M) does not trigger above WARNINGs, this is 
because
__btrfs_buffered_write() internally have write

Re: [PATCH] btrfs: remove unnecessary btrfs_delalloc_release_metadata() call

2016-05-19 Thread Wang Xiaoguang


Hi,

On 05/19/2016 06:29 PM, Filipe Manana wrote:

On Thu, May 19, 2016 at 10:44 AM, Wang Xiaoguang
 wrote:

In __btrfs_write_out_cache(), we don't call btrfs_delalloc_reserve_metadata()
or btrfs_delalloc_reserve_space() to reserve metadata space, so we also should
not call btrfs_delalloc_release_metadata(), in which 
BTRFS_I(inode)->outstanding_extents
will be decreased, then WARN_ON(BTRFS_I(inode)->outstanding_extents) in
btrfs_destroy_inode() will be triggered.

To be honest, I do not see this WARNING in real, but I think we should fix it.

No, this is wrong.

We do this call to release reserved metadata space at
btrfs_save_ino_cache() through btrfs_delalloc_reserve_space().
And it's btrfs_save_ino_cache() that calls btrfs_write_out_ino_cache().

Oh, you're right, sorry for the noise.



If what you said were true, then it would be trivial to write a test
case (using the inode cache feature) to trigger the problem.

OK.

Regards,
Xiaoguang Wang





Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/free-space-cache.c | 17 +++--
  1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 8f835bf..ababf56 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3512,7 +3512,6 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 struct btrfs_free_space_ctl *ctl = root->free_ino_ctl;
 int ret;
 struct btrfs_io_ctl io_ctl;
-   bool release_metadata = true;

 if (!btrfs_test_opt(root, INODE_MAP_CACHE))
 return 0;
@@ -3520,26 +3519,16 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
 memset(_ctl, 0, sizeof(io_ctl));
 ret = __btrfs_write_out_cache(root, inode, ctl, NULL, _ctl,
   trans, path, 0);
-   if (!ret) {
-   /*
-* At this point writepages() didn't error out, so our metadata
-* reservation is released when the writeback finishes, at
-* inode.c:btrfs_finish_ordered_io(), regardless of it finishing
-* with or without an error.
-*/
-   release_metadata = false;
+   if (!ret)
 ret = btrfs_wait_cache_io(root, trans, NULL, _ctl, path, 0);
-   }

-   if (ret) {
-   if (release_metadata)
-   btrfs_delalloc_release_metadata(inode, inode->i_size);
  #ifdef DEBUG
+   if (ret) {
 btrfs_err(root->fs_info,
 "failed to write free ino cache for root %llu",
 root->root_key.objectid);
-#endif
 }
+#endif

 return ret;
  }
--
1.8.3.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html







--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: remove unnecessary btrfs_delalloc_release_metadata() call

2016-05-19 Thread Filipe Manana

On Thu, May 19, 2016 at 10:44 AM, Wang Xiaoguang
 wrote:
> In __btrfs_write_out_cache(), we don't call btrfs_delalloc_reserve_metadata()
> or btrfs_delalloc_reserve_space() to reserve metadata space, so we also should
> not call btrfs_delalloc_release_metadata(), in which 
> BTRFS_I(inode)->outstanding_extents
> will be decreased, then WARN_ON(BTRFS_I(inode)->outstanding_extents) in
> btrfs_destroy_inode() will be triggered.
>
> To be honest, I do not see this WARNING in real, but I think we should fix it.

No, this is wrong.

We do this call to release reserved metadata space at
btrfs_save_ino_cache() through btrfs_delalloc_reserve_space().
And it's btrfs_save_ino_cache() that calls btrfs_write_out_ino_cache().

If what you said were true, then it would be trivial to write a test
case (using the inode cache feature) to trigger the problem.



>
> Signed-off-by: Wang Xiaoguang 
> ---
>  fs/btrfs/free-space-cache.c | 17 +++--
>  1 file changed, 3 insertions(+), 14 deletions(-)
>
> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
> index 8f835bf..ababf56 100644
> --- a/fs/btrfs/free-space-cache.c
> +++ b/fs/btrfs/free-space-cache.c
> @@ -3512,7 +3512,6 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
> struct btrfs_free_space_ctl *ctl = root->free_ino_ctl;
> int ret;
> struct btrfs_io_ctl io_ctl;
> -   bool release_metadata = true;
>
> if (!btrfs_test_opt(root, INODE_MAP_CACHE))
> return 0;
> @@ -3520,26 +3519,16 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
> memset(_ctl, 0, sizeof(io_ctl));
> ret = __btrfs_write_out_cache(root, inode, ctl, NULL, _ctl,
>   trans, path, 0);
> -   if (!ret) {
> -   /*
> -* At this point writepages() didn't error out, so our 
> metadata
> -* reservation is released when the writeback finishes, at
> -* inode.c:btrfs_finish_ordered_io(), regardless of it 
> finishing
> -* with or without an error.
> -*/
> -   release_metadata = false;
> +   if (!ret)
> ret = btrfs_wait_cache_io(root, trans, NULL, _ctl, path, 
> 0);
> -   }
>
> -   if (ret) {
> -   if (release_metadata)
> -   btrfs_delalloc_release_metadata(inode, inode->i_size);
>  #ifdef DEBUG
> +   if (ret) {
> btrfs_err(root->fs_info,
> "failed to write free ino cache for root %llu",
> root->root_key.objectid);
> -#endif
> }
> +#endif
>
> return ret;
>  }
> --
> 1.8.3.1
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Crash after WARN_ON(BTRFS_I(inode)->csum_bytes)

2016-05-19 Thread M G Berberich

Hello,

a near total system crash (reboot via magic-sysreq-keys was possible)
which seems to be associated to btrfs.

btrfs check (v 4.5.2) does not find any problem with any filesystems
on the system.

Kernel is stock 4.6, booting without initramfs, system is Debian
stretch (testing).

[154452.371396] [ cut here ]
[154452.376261] WARNING: CPU: 4 PID: 22333 at fs/btrfs/inode.c:9261 
btrfs_destroy_inode+0x23e/0x2b0
[154452.385319] Modules linked in: cp210x usbserial rc_digittrade mt2060 af9013 
uas intel_rapl kvm_intel kvm irqbypass ghash_clmulni_intel input_leds serio_raw 
dvb_usb_af9015 dvb_usb_v2 dvb_core rc_core usb_storage it87 hwmon_vid 
firewire_sbp2 firewire_core ppdev lp autofs4
[154452.410449] CPU: 4 PID: 22333 Comm: debtags Tainted: GW   4.6.0 
#1
[154452.417983] Hardware name: Gigabyte Technology Co., Ltd. H87-HD3/H87-HD3, 
BIOS F10 08/18/2015
[154452.426856]   8801078d3d48 813f634d 

[154452.434616]   8801078d3d88 8108daf6 
242d819df8b1
[154452.442376]  88028cfdf0e8 88028cfdf0e8 8800ca3e3800 
88028cfdf170
[154452.450137] Call Trace:
[154452.452753]  [] dump_stack+0x4f/0x72
[154452.458140]  [] __warn+0xc6/0xe0
[154452.463167]  [] warn_slowpath_null+0x18/0x20
[154452.469270]  [] btrfs_destroy_inode+0x23e/0x2b0
[154452.475649]  [] destroy_inode+0x36/0x60
[154452.481304]  [] evict+0x12d/0x190
[154452.486421]  [] iput+0x1d1/0x240
[154452.491448]  [] __dentry_kill+0x16d/0x1d0
[154452.497280]  [] dput+0x12b/0x220
[154452.502306]  [] SyS_rename+0x2c7/0x360
[154452.507872]  [] entry_SYSCALL_64_fastpath+0x13/0x8f
[154452.514632] ---[ end trace 108fc543c76a3117 ]---
[154460.456586] general protection fault:  [#1] PREEMPT SMP
[154460.462559] Modules linked in: cp210x usbserial rc_digittrade mt2060 af9013 
uas intel_rapl kvm_intel kvm irqbypass ghash_clmulni_intel input_leds serio_raw 
dvb_usb_af9015 dvb_usb_v2 dvb_core rc_core usb_storage it87 hwmon_vid 
firewire_sbp2 firewire_core ppdev lp autofs4
[154460.487670] CPU: 6 PID: 22513 Comm: find Tainted: GW   4.6.0 #1
[154460.494948] Hardware name: Gigabyte Technology Co., Ltd. H87-HD3/H87-HD3, 
BIOS F10 08/18/2015
[154460.503840] task: 880107208000 ti: 8807cdd24000 task.ti: 
8807cdd24000
[154460.511655] RIP: 0010:[]  [] 
find_inode+0x3f/0xc0
[154460.519863] RSP: 0018:8807cdd27ad8  EFLAGS: 00010206
[154460.525434] RAX:  RBX: 582677919991aa21 RCX: 
8807cdd27b90
[154460.532899] RDX: 812f7fb0 RSI: 88080b1ee930 RDI: 
582677919991ab01
[154460.540367] RBP: 8807cdd27b08 R08: 8807cdd27b90 R09: 
812cca61
[154460.547837] R10: 88080a2e1100 R11: 0003 R12: 
8807cdd27b90
[154460.555302] R13: 812f7fb0 R14: 88080b1ee930 R15: 
88080838b800
[154460.562760] FS:  7f9459528800() GS:88082e38() 
knlGS:
[154460.571213] CS:  0010 DS:  ES:  CR0: 80050033
[154460.577235] CR2: 7f945954d000 CR3: 0003e5342000 CR4: 
001406e0
[154460.584692] DR0:  DR1:  DR2: 

[154460.592147] DR3:  DR6: fffe0ff0 DR7: 
0400
[154460.599605] Stack:
[154460.601770]  8807cdd27af0 88080838b800 88080b1ee930 
8807cdd27b90
[154460.609576]  812f7fb0 880231a2e1c0 8807cdd27b50 
811d7e7c
[154460.617356]  8801c3b0ed20 812f8280  
88010eaef780
[154460.625134] Call Trace:
[154460.627752]  [] ? do_async_commit+0x40/0x40
[154460.633775]  [] iget5_locked+0x7c/0x1c0
[154460.639440]  [] ? create_pinned_em+0x130/0x130
[154460.645733]  [] btrfs_iget+0x4b/0x6a0
[154460.651218]  [] ? release_extent_buffer+0x26/0xc0
[154460.657781]  [] ? free_extent_buffer+0x46/0x90
[154460.664072]  [] btrfs_lookup_dentry+0x3ee/0x520
[154460.670454]  [] btrfs_lookup+0xd/0x30
[154460.675938]  [] lookup_real+0x18/0x60
[154460.681422]  [] lookup_slow+0x58/0xe0
[154460.686908]  [] walk_component+0x1c5/0x430
[154460.692860]  [] ? path_init+0x266/0x370
[154460.698525]  [] path_lookupat+0x58/0x100
[154460.704278]  [] filename_lookup+0x99/0x150
[154460.710211]  [] ? tlb_flush_mmu_free+0x31/0x50
[154460.716502]  [] ? tlb_finish_mmu+0x17/0x50
[154460.722433]  [] ? getname_flags+0x6d/0x1f0
[154460.728363]  [] user_path_at_empty+0x31/0x40
[154460.734475]  [] vfs_fstatat+0x4e/0xa0
[154460.739960]  [] SYSC_newfstatat+0x15/0x30
[154460.745817]  [] ? vm_munmap+0x47/0x60
[154460.751314]  [] SyS_newfstatat+0x9/0x10
[154460.756979]  [] entry_SYSCALL_64_fastpath+0x13/0x8f
[154460.763718] Code: 48 8b 3e 48 85 ff 74 35 49 89 f6 49 89 d5 49 89 cc 48 81 
ef e0 00 00 00 48 89 fb 75 0e eb 1e 48 89 fb 48 81 eb e0 00 00 00 74 12 <4c> 39 
7b 28 74 1d 48 8b bb e0 00 00 00 48 85 ff 75 e2 31 c0 48
[154460.784196] RIP  [] find_inode+0x3f/0xc0
[154460.789965]  RSP 
[154460.799258] ---[ end trace 108fc543c76a3118 ]---

Re: sharing page cache pages between multiple mappings

2016-05-19 Thread Miklos Szeredi

On Thu, May 19, 2016 at 11:05 AM, Michal Hocko  wrote:
> On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
>> Has anyone thought about sharing pages between multiple files?
>>
>> The obvious application is for COW filesytems where there are
>> logically distinct files that physically share data and could easily
>> share the cache as well if there was infrastructure for it.
>
> FYI this has been discussed at LSFMM this year[1]. I wasn't at the
> session so cannot tell you any details but the LWN article covers it at
> least briefly.

Cool, so it's not such a crazy idea.

Darrick, would you mind briefly sharing your ideas regarding this?

The use case I have is fixing overlayfs weird behavior. The following
may result in "buf" not matching "data":

int fr = open("foo", O_RDONLY);
int fw = open("foo", O_RDWR);
write(fw, data, sizeof(data));
read(fr, buf, sizeof(data));

The reason is that "foo" is on a read-only layer, and opening it for
read-write triggers copy-up into a read-write layer.  However the old,
read-only open still refers to the unmodified file.

Fixing this properly requires that when opening a file, we don't
delegate operations fully to the underlying file, but rather allow
sharing of pages from underlying file until the file is copied up.  At
that point we switch to sharing pages with the read-write copy.

Another use case is direct access in fuse:  people often want I/O
operations on a fuse file to go directly to an underlying file.  Doing
this properly requires sharing pages between the real, underlying file
and the fuse file.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Not TLS] Re: Reducing impact of periodic btrfs balance

2016-05-19 Thread Graham Cobb

On 19/05/16 05:09, Duncan wrote:
> So to Graham, are these 1.5K snapshots all of the same subvolume, or 
> split into snapshots of several subvolumes?  If it's all of the same 
> subvolume or of only 2-3 subvolumes, you still have some work to do in 
> terms of getting down to recommended snapshot levels.  Also, if you have 
> quotas on and don't specifically need them, try turning them off and see 
> if that alone makes it workable.

I have just under 20 subvolumes but the snapshots are only taken if
something has changed (actually I use btrbk: I am not sure if it takes
the snapshot and then removes it if nothing changed or whether it knows
not to even take it).  The most frequently changing subvolumes have just
under 400 snapshots each.  I have played with snapshot retention and
think it unlikely I would want to reduce it further.

I have quotas turned off.  At least, I am not using quotas -- how can I
double check it is really turned off?

I know that very large numbers of snapshots are not recommended, and I
expected the balance to be slow.  I was quite prepared for it to take
many days.  My full backups take several days and even incrementals take
several hours. What I did not expect, and think is a MUCH more serious
problem, is that the balance prevented use of the disk, holding up all
writes to the disk for (quite literally) hours each.  I have not seen
that effect mentioned anywhere!

That means that for a large, busy data disk, it is impossible to do a
balance unless the server is taken down to single-user mode for the time
the balance takes (presumably still days).  I assume this would also
apply to doing a RAID rebuild (I am not using multiple disks at the moment).

At the moment I am still using my previous backup strategy, alongside
the snapshots (that is: rsync-based rsnapshots to another disk daily and
with fairly long retentions, and separate daily full/incremental backups
using dar to a nas in another building).  I was hoping the btrfs
snapshots might replace the daily rsync snapshots but it doesn't look
like that will work out.

Thanks to all for the replies.

Graham
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs: remove unnecessary btrfs_delalloc_release_metadata() call

2016-05-19 Thread Wang Xiaoguang

In __btrfs_write_out_cache(), we don't call btrfs_delalloc_reserve_metadata()
or btrfs_delalloc_reserve_space() to reserve metadata space, so we also should
not call btrfs_delalloc_release_metadata(), in which 
BTRFS_I(inode)->outstanding_extents
will be decreased, then WARN_ON(BTRFS_I(inode)->outstanding_extents) in
btrfs_destroy_inode() will be triggered.

To be honest, I do not see this WARNING in real, but I think we should fix it.

Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/free-space-cache.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 8f835bf..ababf56 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -3512,7 +3512,6 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
struct btrfs_free_space_ctl *ctl = root->free_ino_ctl;
int ret;
struct btrfs_io_ctl io_ctl;
-   bool release_metadata = true;
 
if (!btrfs_test_opt(root, INODE_MAP_CACHE))
return 0;
@@ -3520,26 +3519,16 @@ int btrfs_write_out_ino_cache(struct btrfs_root *root,
memset(_ctl, 0, sizeof(io_ctl));
ret = __btrfs_write_out_cache(root, inode, ctl, NULL, _ctl,
  trans, path, 0);
-   if (!ret) {
-   /*
-* At this point writepages() didn't error out, so our metadata
-* reservation is released when the writeback finishes, at
-* inode.c:btrfs_finish_ordered_io(), regardless of it finishing
-* with or without an error.
-*/
-   release_metadata = false;
+   if (!ret)
ret = btrfs_wait_cache_io(root, trans, NULL, _ctl, path, 0);
-   }
 
-   if (ret) {
-   if (release_metadata)
-   btrfs_delalloc_release_metadata(inode, inode->i_size);
 #ifdef DEBUG
+   if (ret) {
btrfs_err(root->fs_info,
"failed to write free ino cache for root %llu",
root->root_key.objectid);
-#endif
}
+#endif
 
return ret;
 }
-- 
1.8.3.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: sharing page cache pages between multiple mappings

2016-05-19 Thread Michal Hocko

On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
> Has anyone thought about sharing pages between multiple files?
> 
> The obvious application is for COW filesytems where there are
> logically distinct files that physically share data and could easily
> share the cache as well if there was infrastructure for it.

FYI this has been discussed at LSFMM this year[1]. I wasn't at the
session so cannot tell you any details but the LWN article covers it at
least briefly.

[1] https://lwn.net/Articles/684826/
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs-progs: Enhance tree block check by checking empty leaf or node