[PATCH] Btrfs: limit thread pool size when remounting

2012-09-25 Thread Liu Bo
For some asynchronous threads, such as submit worker and cache worker, we
limit their thread pool size when mounting.

So we also need to do such things when remounting.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/super.c |   13 -
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 83d6f9f..a58e834 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1158,17 +1158,20 @@ static void btrfs_resize_thread_pool(struct 
btrfs_fs_info *fs_info,
printk(KERN_INFO btrfs: resize thread pool %d - %d\n,
   old_pool_size, new_pool_size);
 
-   btrfs_set_max_workers(fs_info-generic_worker, new_pool_size);
+   btrfs_set_max_workers(fs_info-generic_worker, min(1, new_pool_size));
btrfs_set_max_workers(fs_info-workers, new_pool_size);
btrfs_set_max_workers(fs_info-delalloc_workers, new_pool_size);
-   btrfs_set_max_workers(fs_info-submit_workers, new_pool_size);
-   btrfs_set_max_workers(fs_info-caching_workers, new_pool_size);
-   btrfs_set_max_workers(fs_info-fixup_workers, new_pool_size);
+   btrfs_set_max_workers(fs_info-submit_workers,
+ min_t(u64, fs_info-fs_devices-num_devices,
+ new_pool_size));
+   btrfs_set_max_workers(fs_info-caching_workers, min(2, new_pool_size));
+   btrfs_set_max_workers(fs_info-fixup_workers, min(1, new_pool_size));
btrfs_set_max_workers(fs_info-endio_workers, new_pool_size);
btrfs_set_max_workers(fs_info-endio_meta_workers, new_pool_size);
btrfs_set_max_workers(fs_info-endio_meta_write_workers, 
new_pool_size);
btrfs_set_max_workers(fs_info-endio_write_workers, new_pool_size);
-   btrfs_set_max_workers(fs_info-endio_freespace_worker, new_pool_size);
+   btrfs_set_max_workers(fs_info-endio_freespace_worker,
+ min(1, new_pool_size));
btrfs_set_max_workers(fs_info-delayed_workers, new_pool_size);
btrfs_set_max_workers(fs_info-readahead_workers, new_pool_size);
btrfs_set_max_workers(fs_info-scrub_workers, new_pool_size);
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   One root structure hot_info is defined, is hooked
 up in super_block, and will be used to hold rb trees
 root, hash list root and some other information, etc.
   Adds hot_inode_tree struct to keep track of
 frequently accessed files, and be keyed by {inode, offset}.
 Trees contain hot_inode_items representing those files
 and ranges.
   Having these trees means that vfs can quickly determine the
 temperature of some data by doing some calculations on the
 hot_freq_data struct that hangs off of the tree item.
   Define two items hot_inode_item and hot_range_item,
 one of them represents one tracked file
 to keep track of its access frequency and the tree of
 ranges in this file, while the latter represents
 a file range of one inode.
   Each of the two structures contains a hot_freq_data
 struct with its frequency of access metrics (number of
 {reads, writes}, last {read,write} time, frequency of
 {reads,writes}).
   Also, each hot_inode_item contains one hot_range_tree
 struct which is keyed by {inode, offset, length}
 and used to keep track of all the ranges in this file.
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com

Just a coupl eof minor formatting things first up - I'll have more
comments as I get deeper into the series.


 +/*
 + * Initialize the inode tree. Should be called for each new inode
 + * access or other user of the hot_inode interface.
 + */
 +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree)

The names of these are a bit clunky. You probably don't need the
_rb_ in the function name. i.e. hot_inode_tree_init() is
sufficient, and if we every want to change in the tree type we don't
have to rename every single function...

.
 +/*
 + * Initialize a new hot_inode_item structure. The new structure is
 + * returned with a reference count of one and needs to be
 + * freed using free_inode_item()
 + */
 +void hot_rb_inode_item_init(void *_item)
 +{

The usual naming convention for slab initialiser functions is to use
a suffix of _once to indicate it is only ever called once per
slab object instantiation, not every time the object is allocated
fom the slab. See, for example, inode_init_once() and
inode_init_always().

so, that would make this function hot_inode_item_init_once().


 +/* init hot_inode_item and hot_range_item kmem cache */
 +static int __init hot_rb_item_cache_init(void)
 +{
 + hot_inode_item_cache = kmem_cache_create(hot_inode_item,
 + sizeof(struct hot_inode_item), 0,
 + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
 + hot_rb_inode_item_init);
 + if (!hot_inode_item_cache)
 + goto inode_err;
 +
 + hot_range_item_cache = kmem_cache_create(hot_range_item,
 + sizeof(struct hot_range_item), 0,
 + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
 + hot_rb_range_item_init);
 + if (!hot_range_item_cache)
 + goto range_err;
 +
 + return 0;
 +
 +range_err:
 + kmem_cache_destroy(hot_inode_item_cache);
 +inode_err:
 + return -ENOMEM;
 +}
 +
 +/*
 + * Initialize kmem cache for hot_inode_item
 + * and hot_range_item
 + */
 +void __init hot_track_cache_init(void)
 +{
 + if (hot_rb_item_cache_init())
 + return;

No real need to have a hot_rb_item_cache_init() function here - just
open code it all in the hot_track_cache_init() function.

 +}
 diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
 new file mode 100644
 index 000..269b67a
 --- /dev/null
 +++ b/fs/hot_tracking.h
 @@ -0,0 +1,27 @@
 +/*
 + * fs/hot_tracking.h
 + *
 + * Copyright (C) 2012 IBM Corp. All rights reserved.
 + * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
 + *Ben Chociej bchoc...@gmail.com
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public
 + * License v2 as published by the Free Software Foundation.
 + */
 +
 +#ifndef __HOT_TRACKING__
 +#define __HOT_TRACKING__
 +
 +#include linux/rbtree.h
 +#include linux/hot_tracking.h
 +
 +/* values for hot_freq_data flags */
 +/* freq data struct is for an inode */
 +#define FREQ_DATA_TYPE_INODE (1  0)
 +/* freq data struct is for a range */
 +#define FREQ_DATA_TYPE_RANGE (1  1)

The comments are redundant - the name of the object documents it's
use sufficiently.  ie.

/* values for hot_freq_data flags */
#define FREQ_DATA_TYPE_INODE (1  0)
#define FREQ_DATA_TYPE_RANGE (1  1)

is just fine by itself.


 +/* A frequency data struct holds values that are used to
 + * determine temperature of files and file ranges. These structs
 + * are members of hot_inode_item and hot_range_item
 + */

/*
 * This is a
 * multiline comment. ;)
 */

 +struct hot_freq_data {
 + struct timespec last_read_time;
 + struct timespec 

Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Zhi Yong Wu
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   One root structure hot_info is defined, is hooked
 up in super_block, and will be used to hold rb trees
 root, hash list root and some other information, etc.
   Adds hot_inode_tree struct to keep track of
 frequently accessed files, and be keyed by {inode, offset}.
 Trees contain hot_inode_items representing those files
 and ranges.
   Having these trees means that vfs can quickly determine the
 temperature of some data by doing some calculations on the
 hot_freq_data struct that hangs off of the tree item.
   Define two items hot_inode_item and hot_range_item,
 one of them represents one tracked file
 to keep track of its access frequency and the tree of
 ranges in this file, while the latter represents
 a file range of one inode.
   Each of the two structures contains a hot_freq_data
 struct with its frequency of access metrics (number of
 {reads, writes}, last {read,write} time, frequency of
 {reads,writes}).
   Also, each hot_inode_item contains one hot_range_tree
 struct which is keyed by {inode, offset, length}
 and used to keep track of all the ranges in this file.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com

 Just a coupl eof minor formatting things first up - I'll have more
 comments as I get deeper into the series.
All comments are very reasonable, and will be applied.
thanks for your review.


 
 +/*
 + * Initialize the inode tree. Should be called for each new inode
 + * access or other user of the hot_inode interface.
 + */
 +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree)

 The names of these are a bit clunky. You probably don't need the
 _rb_ in the function name. i.e. hot_inode_tree_init() is
 sufficient, and if we every want to change in the tree type we don't
 have to rename every single function...

 .
 +/*
 + * Initialize a new hot_inode_item structure. The new structure is
 + * returned with a reference count of one and needs to be
 + * freed using free_inode_item()
 + */
 +void hot_rb_inode_item_init(void *_item)
 +{

 The usual naming convention for slab initialiser functions is to use
 a suffix of _once to indicate it is only ever called once per
 slab object instantiation, not every time the object is allocated
 fom the slab. See, for example, inode_init_once() and
 inode_init_always().

 so, that would make this function hot_inode_item_init_once().

 
 +/* init hot_inode_item and hot_range_item kmem cache */
 +static int __init hot_rb_item_cache_init(void)
 +{
 + hot_inode_item_cache = kmem_cache_create(hot_inode_item,
 + sizeof(struct hot_inode_item), 0,
 + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
 + hot_rb_inode_item_init);
 + if (!hot_inode_item_cache)
 + goto inode_err;
 +
 + hot_range_item_cache = kmem_cache_create(hot_range_item,
 + sizeof(struct hot_range_item), 0,
 + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
 + hot_rb_range_item_init);
 + if (!hot_range_item_cache)
 + goto range_err;
 +
 + return 0;
 +
 +range_err:
 + kmem_cache_destroy(hot_inode_item_cache);
 +inode_err:
 + return -ENOMEM;
 +}
 +
 +/*
 + * Initialize kmem cache for hot_inode_item
 + * and hot_range_item
 + */
 +void __init hot_track_cache_init(void)
 +{
 + if (hot_rb_item_cache_init())
 + return;

 No real need to have a hot_rb_item_cache_init() function here - just
 open code it all in the hot_track_cache_init() function.

 +}
 diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
 new file mode 100644
 index 000..269b67a
 --- /dev/null
 +++ b/fs/hot_tracking.h
 @@ -0,0 +1,27 @@
 +/*
 + * fs/hot_tracking.h
 + *
 + * Copyright (C) 2012 IBM Corp. All rights reserved.
 + * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
 + *Ben Chociej bchoc...@gmail.com
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public
 + * License v2 as published by the Free Software Foundation.
 + */
 +
 +#ifndef __HOT_TRACKING__
 +#define __HOT_TRACKING__
 +
 +#include linux/rbtree.h
 +#include linux/hot_tracking.h
 +
 +/* values for hot_freq_data flags */
 +/* freq data struct is for an inode */
 +#define FREQ_DATA_TYPE_INODE (1  0)
 +/* freq data struct is for a range */
 +#define FREQ_DATA_TYPE_RANGE (1  1)

 The comments are redundant - the name of the object documents it's
 use sufficiently.  ie.

 /* values for hot_freq_data flags */
 #define FREQ_DATA_TYPE_INODE (1  0)
 #define FREQ_DATA_TYPE_RANGE (1  1)

 is just fine by itself.

 
 +/* A frequency data struct holds values that are used to
 + * determine temperature of files and file ranges. These structs
 + * are members of 

Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Zhi Yong Wu
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   One root structure hot_info is defined, is hooked
 up in super_block, and will be used to hold rb trees
 root, hash list root and some other information, etc.
   Adds hot_inode_tree struct to keep track of
 frequently accessed files, and be keyed by {inode, offset}.
 Trees contain hot_inode_items representing those files
 and ranges.
   Having these trees means that vfs can quickly determine the
 temperature of some data by doing some calculations on the
 hot_freq_data struct that hangs off of the tree item.
   Define two items hot_inode_item and hot_range_item,
 one of them represents one tracked file
 to keep track of its access frequency and the tree of
 ranges in this file, while the latter represents
 a file range of one inode.
   Each of the two structures contains a hot_freq_data
 struct with its frequency of access metrics (number of
 {reads, writes}, last {read,write} time, frequency of
 {reads,writes}).
   Also, each hot_inode_item contains one hot_range_tree
 struct which is keyed by {inode, offset, length}
 and used to keep track of all the ranges in this file.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com

 Just a coupl eof minor formatting things first up - I'll have more
 comments as I get deeper into the series.
OK, very look forward to seeing more on other patches, indeed thanks again.


 
 +/*
 + * Initialize the inode tree. Should be called for each new inode
 + * access or other user of the hot_inode interface.
 + */
 +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree)

 The names of these are a bit clunky. You probably don't need the
 _rb_ in the function name. i.e. hot_inode_tree_init() is
 sufficient, and if we every want to change in the tree type we don't
 have to rename every single function...

 .
 +/*
 + * Initialize a new hot_inode_item structure. The new structure is
 + * returned with a reference count of one and needs to be
 + * freed using free_inode_item()
 + */
 +void hot_rb_inode_item_init(void *_item)
 +{

 The usual naming convention for slab initialiser functions is to use
 a suffix of _once to indicate it is only ever called once per
 slab object instantiation, not every time the object is allocated
 fom the slab. See, for example, inode_init_once() and
 inode_init_always().

 so, that would make this function hot_inode_item_init_once().

 
 +/* init hot_inode_item and hot_range_item kmem cache */
 +static int __init hot_rb_item_cache_init(void)
 +{
 + hot_inode_item_cache = kmem_cache_create(hot_inode_item,
 + sizeof(struct hot_inode_item), 0,
 + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
 + hot_rb_inode_item_init);
 + if (!hot_inode_item_cache)
 + goto inode_err;
 +
 + hot_range_item_cache = kmem_cache_create(hot_range_item,
 + sizeof(struct hot_range_item), 0,
 + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
 + hot_rb_range_item_init);
 + if (!hot_range_item_cache)
 + goto range_err;
 +
 + return 0;
 +
 +range_err:
 + kmem_cache_destroy(hot_inode_item_cache);
 +inode_err:
 + return -ENOMEM;
 +}
 +
 +/*
 + * Initialize kmem cache for hot_inode_item
 + * and hot_range_item
 + */
 +void __init hot_track_cache_init(void)
 +{
 + if (hot_rb_item_cache_init())
 + return;

 No real need to have a hot_rb_item_cache_init() function here - just
 open code it all in the hot_track_cache_init() function.

 +}
 diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
 new file mode 100644
 index 000..269b67a
 --- /dev/null
 +++ b/fs/hot_tracking.h
 @@ -0,0 +1,27 @@
 +/*
 + * fs/hot_tracking.h
 + *
 + * Copyright (C) 2012 IBM Corp. All rights reserved.
 + * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
 + *Ben Chociej bchoc...@gmail.com
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public
 + * License v2 as published by the Free Software Foundation.
 + */
 +
 +#ifndef __HOT_TRACKING__
 +#define __HOT_TRACKING__
 +
 +#include linux/rbtree.h
 +#include linux/hot_tracking.h
 +
 +/* values for hot_freq_data flags */
 +/* freq data struct is for an inode */
 +#define FREQ_DATA_TYPE_INODE (1  0)
 +/* freq data struct is for a range */
 +#define FREQ_DATA_TYPE_RANGE (1  1)

 The comments are redundant - the name of the object documents it's
 use sufficiently.  ie.

 /* values for hot_freq_data flags */
 #define FREQ_DATA_TYPE_INODE (1  0)
 #define FREQ_DATA_TYPE_RANGE (1  1)

 is just fine by itself.

 
 +/* A frequency data struct holds values that are used to
 + * determine temperature of files and file ranges. These structs
 + * are members of 

Re: [RFC v2 02/10] vfs: add support for updating access frequency

2012-09-25 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Add some utils helpers to update access frequencies
 for one file or its range.
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/hot_tracking.c |  359 
 +
  fs/hot_tracking.h |   15 +++
  2 files changed, 374 insertions(+), 0 deletions(-)
 
 diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
 index 173054b..52ed926 100644
 --- a/fs/hot_tracking.c
 +++ b/fs/hot_tracking.c
 @@ -106,6 +106,365 @@ inode_err:
  }
  
  /*
 + * Drops the reference out on hot_inode_item by one and free the structure
 + * if the reference count hits zero
 + */
 +void hot_rb_free_hot_inode_item(struct hot_inode_item *he)

hot_inode_item_put()

 +{
 + if (!he)
 + return;

It's a bug to call a put function on a kref counted item with a null
pointer. Let the kernel crash so it is noticed and fixed.

 +
 + if (atomic_dec_and_test(he-refs.refcount)) {
 + WARN_ON(he-in_tree);
 + kmem_cache_free(hot_inode_item_cache, he);
 + }

Isn't this abusing krefs? i.e. this should be:

hot_inode_item_free()
{
WARN_ON(he-in_tree);
kmem_cache_free(hot_inode_item_cache, he);
}

hot_inode_item_put()
{
kref_put(he-refs, hot_inode_item_free)
}

 +/*
 + * Drops the reference out on hot_range_item by one and free the structure
 + * if the reference count hits zero
 + */
 +static void hot_rb_free_hot_range_item(struct hot_range_item *hr)

same comments as above.

 +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root,
 + unsigned long inode_num,
 + struct rb_node *node)

static struct rb_node *
hot_inode_item_find(. )

 +{
 + struct rb_node **p = root-rb_node;
 + struct rb_node *parent = NULL;
 + struct hot_inode_item *entry;
 +
 + /* walk tree to find insertion point */
 + while (*p) {
 + parent = *p;
 + entry = rb_entry(parent, struct hot_inode_item, rb_node);
 +
 + if (inode_num  entry-i_ino)
 + p = (*p)-rb_left;
 + else if (inode_num  entry-i_ino)
 + p = (*p)-rb_right;
 + else
 + return parent;
 + }
 +
 + entry = rb_entry(node, struct hot_inode_item, rb_node);
 + entry-in_tree = 1;
 + rb_link_node(node, parent, p);
 + rb_insert_color(node, root);

So the hot inode item search key is the inode number? Why use an
rb-tree then? Wouldn't a btree be a more memory efficient way to
hold a sparse index that has fixed key and pointer sizes?

Also, the API seems quite strange. you pass in the the rb_node and
the inode number which instead of passing in the hot inode item that
already holds this information. You then convert the rb_node back to
a hot inode item to set the in_tree variable. So why not just pass
in the struct hot_inode_item in the first place?

 +static u64 hot_rb_range_end(struct hot_range_item *hr)

hot_range_item_end()

 +{
 + if (hr-start + hr-len  hr-start)
 + return (u64)-1;
 +
 + return hr-start + hr-len - 1;
 +}
 +
 +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root,

hot_range_item_find()

 + u64 start,
 + struct rb_node *node)
 +{
 + struct rb_node **p = root-rb_node;
 + struct rb_node *parent = NULL;
 + struct hot_range_item *entry;
 +
 + /* ensure start is on a range boundary */
 + start = start  RANGE_SIZE_MASK;
 + /* walk tree to find insertion point */
 + while (*p) {
 + parent = *p;
 + entry = rb_entry(parent, struct hot_range_item, rb_node);
 +
 + if (start  entry-start)
 + p = (*p)-rb_left;
 + else if (start = hot_rb_range_end(entry))

Shouldn't an aligned end always be one byte short of the start
offset of the next aligned region? i.e. start ==
hot_rb_range_end(entry) is an indication of an off-by one bug
somewhere?

 + p = (*p)-rb_right;
 + else
 + return parent;
 + }
 +
 + entry = rb_entry(node, struct hot_range_item, rb_node);
 + entry-in_tree = 1;
 + rb_link_node(node, parent, p);
 + rb_insert_color(node, root);

Same comments as the hot_inode_range.

Also, the start offset in the hot_range_item is already aligned, so
why do you need to align it again? 

 +
 + return NULL;
 +}
 +
 +/*
 + * Add a hot_inode_item to a hot_inode_tree. If the tree already contains
 + * an item with the index given, return -EEXIST
 + */
 +static int hot_rb_add_hot_inode_item(struct hot_inode_tree *tree,
 + struct hot_inode_item *he)
 +{
 + int ret = 0;
 + struct rb_node *rb;
 +
 + rb = 

Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-25 Thread Dave Chinner
On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Introduce one new mount option '-o hottrack',
 and add its parsing support.
   Its usage looks like:
mount -o hottrack
mount -o nouser,hottrack
mount -o nouser,hottrack,loop
mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem,
even though the tracking functionality is in the VFS. That way ony
the filesystems that can use the tracking information will turn it
on, rather than being able to turn it on for everything regardless
of whether it is useful or not.

Along those lines, just using a normal superblock flag to indicate
it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you
don't need to allocate the sb-s_hot_info structure just to be able
to check whether we are tracking hot inodes or not.

This then means the hot inode tracking for the superblock can be
initialised by the filesystem as part of it's fill_super method,
along with the filesystem specific code that will use the hot
tracking information the VFS gathers

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-25 Thread Ram Pai
On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Adds a hash table structure which contains
 a lot of hash list and is used to efficiently
 look up the data temperature of a file or its
 ranges.
   In each hash list of hash table, the hash node
 will keep track of temperature info.
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/hot_tracking.c|   77 -
  include/linux/hot_tracking.h |   35 +++
  2 files changed, 110 insertions(+), 2 deletions(-)
 
 diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
 index fa89f70..5f96442 100644
 --- a/fs/hot_tracking.c
 +++ b/fs/hot_tracking.c
 @@ -16,6 +16,7 @@
  #include linux/module.h
  #include linux/spinlock.h
  #include linux/hardirq.h
 +#include linux/hash.h
  #include linux/fs.h
  #include linux/blkdev.h
  #include linux/types.h
 @@ -24,6 +25,9 @@

...snip...

 +/* Hash list heads for hot hash table */
 +struct hot_hash_head {
 + struct hlist_head hashhead;
 + rwlock_t rwlock;
 + u32 temperature;
 +};
 +
 +/* Nodes stored in each hash list of hash table */
 +struct hot_hash_node {
 + struct hlist_node hashnode;
 + struct list_head node;
 + struct hot_freq_data *hot_freq_data;
 + struct hot_hash_head *hlist;
 + spinlock_t lock; /* protects hlist */
 +
 + /*
 +  * number of references to this node
 +  * equals 1 (hashlist entry)
 +  */
 + struct kref refs;
 +};

Dont see why you need yet another datastructure to hash the inode_item
and the range_item into a hash list.  You can just add another
hlist_node in the inode_item and range_item. This field can be then used
to link into the corresponding hash list.

You can use the container_of() get to the inode_item or the range_item
using the hlist_node field. 

You can thus eliminate a lot of code.

 +
  /* An item representing an inode and its access frequency */
  struct hot_inode_item {
   /* node for hot_inode_tree rb_tree */
 @@ -68,6 +93,8 @@ struct hot_inode_item {
   spinlock_t lock;
   /* prevents kfree */
   struct kref refs;
 + /* hashlist node for this inode */
 + struct hot_hash_node *heat_node;

this can be just
struct hlist_node head_node; /* lookup hot_inode hash list */

Use this field to link it into the corresponding hashlist.

  };
 
this can be just 
  /*
 @@ -91,6 +118,8 @@ struct hot_range_item {
   spinlock_t lock;
   /* prevents kfree */
   struct kref refs;
 + /* hashlist node for this range */
 + struct hot_hash_node *heat_node;

this can be just 
struct hlist_node head_node; /* lookup hot_range hash list */


  };
 
  struct hot_info {
 @@ -98,6 +127,12 @@ struct hot_info {
 
   /* red-black tree that keeps track of fs-wide hot data */
   struct hot_inode_tree hot_inode_tree;
 +
 + /* hash map of inode temperature */
 + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE];
 +
 + /* hash map of range temperature */
 + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE];
  };
 
  #endif  /* _LINUX_HOTTRACK_H */

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] btrfs-progs: Close file descriptor on exit

2012-09-25 Thread David Sterba
On Tue, Sep 25, 2012 at 10:02:15AM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Need to close fd on exit.

Strictly you don't need to, kernel will do that at exit() time.

david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage

2012-09-25 Thread David Sterba
On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   Some code pathes forget to free memory on exit.

Same as with the fd's, kernel will free all memory for us at exit().

If there's lots of memory allocated, it may be even faster to leave the
unallocation process to kernel as it will do it in one go, while the
application would unnecessarily free it chunk by chunk.

david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Ram Pai
On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com
 
   One root structure hot_info is defined, is hooked
 up in super_block, and will be used to hold rb trees
 root, hash list root and some other information, etc.
   Adds hot_inode_tree struct to keep track of
 frequently accessed files, and be keyed by {inode, offset}.
 Trees contain hot_inode_items representing those files
 and ranges.
   Having these trees means that vfs can quickly determine the
 temperature of some data by doing some calculations on the
 hot_freq_data struct that hangs off of the tree item.
   Define two items hot_inode_item and hot_range_item,
 one of them represents one tracked file
 to keep track of its access frequency and the tree of
 ranges in this file, while the latter represents
 a file range of one inode.
   Each of the two structures contains a hot_freq_data
 struct with its frequency of access metrics (number of
 {reads, writes}, last {read,write} time, frequency of
 {reads,writes}).
   Also, each hot_inode_item contains one hot_range_tree
 struct which is keyed by {inode, offset, length}
 and used to keep track of all the ranges in this file.
 
 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
 +
..snip..

 +/* A tree that sits on the hot_info */
 +struct hot_inode_tree {
 + struct rb_root map;
 + rwlock_t lock;
 +};
 +
 +/* A tree of ranges for each inode in the hot_inode_tree */
 +struct hot_range_tree {
 + struct rb_root map;
 + rwlock_t lock;
 +};

Can as well have a generic datastructure called hot_tree instead
of having two different datastructure which basically are the same.

 +
 +/* A frequency data struct holds values that are used to
 + * determine temperature of files and file ranges. These structs
 + * are members of hot_inode_item and hot_range_item
 + */
 +struct hot_freq_data {
 + struct timespec last_read_time;
 + struct timespec last_write_time;
 + u32 nr_reads;
 + u32 nr_writes;
 + u64 avg_delta_reads;
 + u64 avg_delta_writes;
 + u8 flags;
 + u32 last_temperature;
 +};
 +
 +/* An item representing an inode and its access frequency */
 +struct hot_inode_item {
 + /* node for hot_inode_tree rb_tree */
 + struct rb_node rb_node;
 + /* tree of ranges in this inode */
 + struct hot_range_tree hot_range_tree;
 + /* frequency data for this inode */
 + struct hot_freq_data hot_freq_data;
 + /* inode number, copied from inode */
 + unsigned long i_ino;
 + /* used to check for errors in ref counting */
 + u8 in_tree;
 + /* protects hot_freq_data, i_no, in_tree */
 + spinlock_t lock;
 + /* prevents kfree */
 + struct kref refs;
 +};
 +
 +/*
 + * An item representing a range inside of an inode whose frequency
 + * is being tracked
 + */
 +struct hot_range_item {
 + /* node for hot_range_tree rb_tree */
 + struct rb_node rb_node;
 + /* frequency data for this range */
 + struct hot_freq_data hot_freq_data;
 + /* the hot_inode_item associated with this hot_range_item */
 + struct hot_inode_item *hot_inode;
 + /* starting offset of this range */
 + u64 start;
 + /* length of this range */
 + u64 len;
 + /* used to check for errors in ref counting */
 + u8 in_tree;
 + /* protects hot_freq_data, start, len, and in_tree */
 + spinlock_t lock;
 + /* prevents kfree */
 + struct kref refs;
 +};

might as well have just one generic datastructure called hot_item with 
all the common fields and then have 

struct hot_inode_item  {
struct hot_item hot_inode;
struct hot_tree hot_range_tree;
unsigned long i_ino;
}

and 

struct hot_range_item {
struct hot_item hot_range;
u64 start;
u64 len;/* length of this range */
}

This should help you eliminate some duplicate code as well.


RP

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 1/2] Btrfs: cleanup duplicated division functions

2012-09-25 Thread Miao Xie
On  mon, 24 Sep 2012 21:42:42 +0300, Ilya Dryomov wrote:
 On Mon, Sep 24, 2012 at 06:47:42PM +0200, David Sterba wrote:
 On Mon, Sep 24, 2012 at 10:05:02AM +0800, Miao Xie wrote:
 Generally, we should check the value when it is input. If not, we might
 run our program with the wrong value, and it is possible to cause unknown
 problems. So I think the above chunk is necessary.

 The difference is that giving an invalid value will exit early (your
 version), while Ilya's version will clamp the 'usage' to the allowed
 range during processing. From usability POV, I'd prefer to let the
 command fail early with a verbose error eg. that the range is invalid.
 
 I think that's the job for progs, and that's where this and a few other
 checks are currently implemented.
 
 There is no possibility for unknown problems, that's exactly how it's
 been implemented before the cleanup.  The purpose of balance filters
 (and the usage filter in particular) is to lower the number of chunks we
 have to relocate.  If someone decides to use raw ioctls, and supplies us
 with the invalid value, we just cut it down to a 100 and proceed.
 usage=100 is the equivalent of not using the usage filter at all, so the
 raw ioctl user will get what he deserves.
 
 This is very similar to what we currently do with other filters.  For
 example, soft can only be used with convert, and this is checked by
 progs.  But, if somebody were to set a soft flag without setting
 convert we would simply ignore that soft.  Same goes for drange
 and devid, invalid ranges, invalid devids, etc.  Pulling all these
 checks into the kernel is questionable at best, and, if enough people
 insist on it, should be discussed separately.

I think the usage is a special case that doesn't cause critical problem and
are not used everywhere. But if there is a variant which is referenced at
several places and the kernel would crash if it is invalid, in this case,
we would add the check everywhere and make the code more complex and ugly
if we don't check it when it is passed in. Besides that if we have done
lots of work before the check, we must roll back when we find the variant
is invalid, it wastes time. So I think doing the check and returning the
error number as early as possible is a rule we should follow.

Thanks
Miao
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c

2012-09-25 Thread David Sterba
On Mon, Sep 24, 2012 at 11:03:49PM +0200, David Sterba wrote:
 Could you please put the check into a separate helper

Please note that checksum will become a variable per-filesystem
property, stored within the superblock, so the helper should be passed a
fs_info pointer.

thanks,
david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: limit thread pool size when remounting

2012-09-25 Thread David Sterba
On Tue, Sep 25, 2012 at 02:48:33PM +0800, Liu Bo wrote:
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1158,17 +1158,20 @@ static void btrfs_resize_thread_pool(struct 
 btrfs_fs_info *fs_info,
   printk(KERN_INFO btrfs: resize thread pool %d - %d\n,
  old_pool_size, new_pool_size);
  
 - btrfs_set_max_workers(fs_info-generic_worker, new_pool_size);
 + btrfs_set_max_workers(fs_info-generic_worker, min(1, new_pool_size));

How could new_pool_size be  1 ?

There's a check in super.c to pick only values  0

   btrfs_set_max_workers(fs_info-workers, new_pool_size);
   btrfs_set_max_workers(fs_info-delalloc_workers, new_pool_size);
 - btrfs_set_max_workers(fs_info-submit_workers, new_pool_size);
 - btrfs_set_max_workers(fs_info-caching_workers, new_pool_size);
 - btrfs_set_max_workers(fs_info-fixup_workers, new_pool_size);
 + btrfs_set_max_workers(fs_info-submit_workers,
 +   min_t(u64, fs_info-fs_devices-num_devices,
 +   new_pool_size));

This ask for update also when a new device is added/removed.

 + btrfs_set_max_workers(fs_info-caching_workers, min(2, new_pool_size));
 + btrfs_set_max_workers(fs_info-fixup_workers, min(1, new_pool_size));

Same as above, is it expected to be  1 ?

   btrfs_set_max_workers(fs_info-endio_workers, new_pool_size);
   btrfs_set_max_workers(fs_info-endio_meta_workers, new_pool_size);
   btrfs_set_max_workers(fs_info-endio_meta_write_workers, 
 new_pool_size);
   btrfs_set_max_workers(fs_info-endio_write_workers, new_pool_size);
 - btrfs_set_max_workers(fs_info-endio_freespace_worker, new_pool_size);
 + btrfs_set_max_workers(fs_info-endio_freespace_worker,
 +   min(1, new_pool_size));

Not sure, do we actually need more than 1 free space worker?

   btrfs_set_max_workers(fs_info-delayed_workers, new_pool_size);
   btrfs_set_max_workers(fs_info-readahead_workers, new_pool_size);
   btrfs_set_max_workers(fs_info-scrub_workers, new_pool_size);
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c

2012-09-25 Thread ching
On 09/25/2012 06:51 PM, David Sterba wrote:
 On Mon, Sep 24, 2012 at 11:03:49PM +0200, David Sterba wrote:
 Could you please put the check into a separate helper
 Please note that checksum will become a variable per-filesystem
 property, stored within the superblock, so the helper should be passed a
 fs_info pointer.

 thanks,
 david
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


How about enhancing the *thread_pool=/number mount option instead?

thread_pool=n enable threadpool/**/for compression and checksum, MAY 
improve bandwidth/*
*/thread_pool=0 disable threadpool for compression and checksum, MIGHT 
reduce latency
thread_pool=-1 or not providedautomatically managed (current behaviour and 
default choice)


This should allow user to tradeoff between latency and bandwidth, furthermore, 
you do not need to assume that btrfs may use crc32c algorithm only forever.




/*
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c

2012-09-25 Thread ching
On 09/25/2012 06:51 PM, David Sterba wrote:
 On Mon, Sep 24, 2012 at 11:03:49PM +0200, David Sterba wrote:
 Could you please put the check into a separate helper
 Please note that checksum will become a variable per-filesystem
 property, stored within the superblock, so the helper should be passed a
 fs_info pointer.

 thanks,
 david
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


How about enhancing the thread_pool=number mount option instead?

thread_pool=n enable threadpool for compression and 
checksum, MAY improve bandwidth
thread_pool=0 disable threadpool for compression and 
checksum, MIGHT reduce latency
thread_pool=-1 or not providedautomatically managed (current behavior and 
default choice)


This should allow user to tradeoff between latency and bandwidth, furthermore, 
you do not need to assume that btrfs may use crc32c algorithm only forever.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: limit thread pool size when remounting

2012-09-25 Thread Liu Bo
On 09/25/2012 07:39 PM, David Sterba wrote:
 On Tue, Sep 25, 2012 at 02:48:33PM +0800, Liu Bo wrote:
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1158,17 +1158,20 @@ static void btrfs_resize_thread_pool(struct 
 btrfs_fs_info *fs_info,
  printk(KERN_INFO btrfs: resize thread pool %d - %d\n,
 old_pool_size, new_pool_size);
  
 -btrfs_set_max_workers(fs_info-generic_worker, new_pool_size);
 +btrfs_set_max_workers(fs_info-generic_worker, min(1, new_pool_size));
 
 How could new_pool_size be  1 ?
 
 There's a check in super.c to pick only values  0
 

I think we just need only 1 generic_worker

  btrfs_set_max_workers(fs_info-workers, new_pool_size);
  btrfs_set_max_workers(fs_info-delalloc_workers, new_pool_size);
 -btrfs_set_max_workers(fs_info-submit_workers, new_pool_size);
 -btrfs_set_max_workers(fs_info-caching_workers, new_pool_size);
 -btrfs_set_max_workers(fs_info-fixup_workers, new_pool_size);
 +btrfs_set_max_workers(fs_info-submit_workers,
 +  min_t(u64, fs_info-fs_devices-num_devices,
 +  new_pool_size));
 
 This ask for update also when a new device is added/removed.
 

Oh, yes, but we should do it in another new patch instead.

 +btrfs_set_max_workers(fs_info-caching_workers, min(2, new_pool_size));
 +btrfs_set_max_workers(fs_info-fixup_workers, min(1, new_pool_size));
 
 Same as above, is it expected to be  1 ?
 
  btrfs_set_max_workers(fs_info-endio_workers, new_pool_size);
  btrfs_set_max_workers(fs_info-endio_meta_workers, new_pool_size);
  btrfs_set_max_workers(fs_info-endio_meta_write_workers, 
 new_pool_size);
  btrfs_set_max_workers(fs_info-endio_write_workers, new_pool_size);
 -btrfs_set_max_workers(fs_info-endio_freespace_worker, new_pool_size);
 +btrfs_set_max_workers(fs_info-endio_freespace_worker,
 +  min(1, new_pool_size));
 
 Not sure, do we actually need more than 1 free space worker?
 

Same as generic_worker and fixup_workers, I think only one is enough, that' why 
I make
the minimum limitation, or we can set it as 1 directly.


thanks,
liubo

  btrfs_set_max_workers(fs_info-delayed_workers, new_pool_size);
  btrfs_set_max_workers(fs_info-readahead_workers, new_pool_size);
  btrfs_set_max_workers(fs_info-scrub_workers, new_pool_size);
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: limit thread pool size when remounting

2012-09-25 Thread David Sterba
On Tue, Sep 25, 2012 at 08:07:00PM +0800, Liu Bo wrote:
 On 09/25/2012 07:39 PM, David Sterba wrote:
  On Tue, Sep 25, 2012 at 02:48:33PM +0800, Liu Bo wrote:
  --- a/fs/btrfs/super.c
  +++ b/fs/btrfs/super.c
  @@ -1158,17 +1158,20 @@ static void btrfs_resize_thread_pool(struct 
  btrfs_fs_info *fs_info,
 printk(KERN_INFO btrfs: resize thread pool %d - %d\n,
old_pool_size, new_pool_size);
   
  -  btrfs_set_max_workers(fs_info-generic_worker, new_pool_size);
  +  btrfs_set_max_workers(fs_info-generic_worker, min(1, new_pool_size));
  
  How could new_pool_size be  1 ?
  
  There's a check in super.c to pick only values  0
 
 I think we just need only 1 generic_worker

I got it reversed, sorry for the noise.

 btrfs_set_max_workers(fs_info-workers, new_pool_size);
 btrfs_set_max_workers(fs_info-delalloc_workers, new_pool_size);
  -  btrfs_set_max_workers(fs_info-submit_workers, new_pool_size);
  -  btrfs_set_max_workers(fs_info-caching_workers, new_pool_size);
  -  btrfs_set_max_workers(fs_info-fixup_workers, new_pool_size);
  +  btrfs_set_max_workers(fs_info-submit_workers,
  +min_t(u64, fs_info-fs_devices-num_devices,
  +new_pool_size));
  
  This ask for update also when a new device is added/removed.
  
 
 Oh, yes, but we should do it in another new patch instead.

Yes, this was just a hint to pair the remount/thread_pool capabilities.

david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c

2012-09-25 Thread David Sterba
On Tue, Sep 25, 2012 at 07:40:17PM +0800, ching wrote:
 How about enhancing the *thread_pool=/number mount option instead?
 
 thread_pool=n enable threadpool/**/for compression and checksum, MAY 
 improve bandwidth/*
 */thread_pool=0 disable threadpool for compression and checksum, MIGHT 
 reduce latency
 thread_pool=-1 or not providedautomatically managed (current behaviour 
 and default choice)

Sorry, I don't understand the syntax, can you please write it more
clearly? Thanks.

 This should allow user to tradeoff between latency and bandwidth,
 furthermore, you do not need to assume that btrfs may use crc32c
 algorithm only forever.

Some sort of finer control over the threads makes sense, we should
distinguish for cpu-bound processing where paralelism wins and io-bound
where it is not helpful to add more and more threads that hammer a
single device (namely in case of HDD, so this should be tunable for the
devices with cheap seeks).

david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage

2012-09-25 Thread Zhi Yong Wu
On Tue, Sep 25, 2012 at 6:14 PM, David Sterba d...@jikos.cz wrote:
 On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Some code pathes forget to free memory on exit.

 Same as with the fd's, kernel will free all memory for us at exit().
hi, can you let me know the pointer to exit() said by you?


 If there's lots of memory allocated, it may be even faster to leave the
 unallocation process to kernel as it will do it in one go, while the
 application would unnecessarily free it chunk by chunk.
got it, thanks.

 david



-- 
Regards,

Zhi Yong Wu
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ENOSPC design issues

2012-09-25 Thread David Sterba
On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
 I'm going to look at fixing some of the performance issues that crop up 
 because
 of our reservation system.  Before I go and do a whole lot of work I want some
 feedback.  I've done a brain dump here
 https://btrfs.wiki.kernel.org/index.php/ENOSPC

Thanks for writing it down, much appreciated.

My first and probably naive approach is described in the page, quoting
here:

 Attempt to address how to flush less stated below. The
 over-reservation of a 4k block can go up to 96k as the worst case
 calculation (see above). This accounts for splitting the full tree path
 from 8th level root down to the leaf plus the node splits. My question:
 how often do we need to go up to the level N+1 from current level N?
 for levels 0 and 1 it may happen within one transaction, maybe not so
 often for level 2 and with exponentially decreasing frequency for the
 higher levels. Therefore, is it possible to check the tree level first
 and adapt the calculation according to that? Let's say we can reduce
 the 4k reservation size from 96k to 32k on average (for a many-gigabyte
 filesystem), thus increasing the space available for reservations by
 some factor. The expected gain is less pressure to the flusher because
 more reservations will succeed immediately.
 The idea behind is to make the initial reservation more accurate to
 current state than blindly overcommitting by some random factor (1/2).
 Another hint to the tree root level may be the usage of the root node:
 eg. if the root is less than half full, splitting will not happen
 unless there are K concurrent reservations running where K is
 proportional to overwriting the whole subtree (same exponential
 decrease with increasing level) and this will not be possible within
 one transaction or there will not be enough space to satisfy all
 reservations. (This attempts to fine-tune the currently hardcoded level
 8 up to the best value). The safe value for the level in the
 calculations would be like N+1, ie. as if all the possible splits
 happen with respect to current tree height.

implemented as follows on top of next/master, in short:
* disable overcommit completely
* do the optimistically best guess for the metadata and reserve only up
  to the current tree height

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2794,7 +2794,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct 
address_space *mapping)
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
 unsigned num_items)
 {
-   return (root-leafsize + root-nodesize * (BTRFS_MAX_LEVEL - 1)) *
+   int level = btrfs_header_level(root-node);
+
+   level = min(level, BTRFS_MAX_LEVEL);
+
+   return (root-leafsize + root-nodesize * (level - 1)) *
3 * num_items;
 }

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index efb044e..c9fa7ed 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3649,6 +3649,8 @@ static int can_overcommit(struct btrfs_root *root,
u64 avail;
u64 used;

+   return 0;
+
used = space_info-bytes_used + space_info-bytes_reserved +
space_info-bytes_pinned + space_info-bytes_readonly +
space_info-bytes_may_use;
---

I must be doing something wrong, because I don't see any unexpected
ENOSPC while performing some tests on a 2G, 4G and 10G partitions with
following loads:

fs_mark -F -k -S 0 -D 20 -N 100
- dumb, no file contents

fs_mark -F -k -S 0 -D 2 -N 40 -s 2048 -t 8
- metadata intense, files and inline contents

fs_mark -F -k -S 0 -D 20 -N 100 -s 3900 -t 24
- ~dtto, lots writers

fs_mark -F -k -S 0 -D 20 -N 100 -s 8192 -t 24
- lots writers, no inlines

tar xf linux-3.2.tar.bz2(1G in total)
- simple untar


The fs_mark loads do not do any kind of sync, as this should stress the
number of data in flight. After each load finishes with ENOSPC, the rest
of the filesystem is filled with a file full of zeros. Then 'fi df'
reports all the space is used, no unexpectedly large files can be found
(ie a few hundred KBs if it was data-intense, or using whole remaining
data space if it was meta-intense, eg. 8MB).

mkfs was default, so are the mount options, did not push it through
xfstests. but at least verified md5sums of the kernel tree.

Sample tree height for extent tree was 2 and 3 for fs tree, so I think
it exercised the case where the tree height increased during the load
(but maybe it was just lucky to get the +1 block from somewhere else,
dunno).

I'm running the tests on a 100G filesystem now.


david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ENOSPC design issues

2012-09-25 Thread Josef Bacik
On Tue, Sep 25, 2012 at 10:43:36AM -0600, David Sterba wrote:
 On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
  I'm going to look at fixing some of the performance issues that crop up 
  because
  of our reservation system.  Before I go and do a whole lot of work I want 
  some
  feedback.  I've done a brain dump here
  https://btrfs.wiki.kernel.org/index.php/ENOSPC
 
 Thanks for writing it down, much appreciated.
 
 My first and probably naive approach is described in the page, quoting
 here:
 
  Attempt to address how to flush less stated below. The
  over-reservation of a 4k block can go up to 96k as the worst case
  calculation (see above). This accounts for splitting the full tree path
  from 8th level root down to the leaf plus the node splits. My question:
  how often do we need to go up to the level N+1 from current level N?
  for levels 0 and 1 it may happen within one transaction, maybe not so
  often for level 2 and with exponentially decreasing frequency for the
  higher levels. Therefore, is it possible to check the tree level first
  and adapt the calculation according to that? Let's say we can reduce
  the 4k reservation size from 96k to 32k on average (for a many-gigabyte
  filesystem), thus increasing the space available for reservations by
  some factor. The expected gain is less pressure to the flusher because
  more reservations will succeed immediately.
  The idea behind is to make the initial reservation more accurate to
  current state than blindly overcommitting by some random factor (1/2).
  Another hint to the tree root level may be the usage of the root node:
  eg. if the root is less than half full, splitting will not happen
  unless there are K concurrent reservations running where K is
  proportional to overwriting the whole subtree (same exponential
  decrease with increasing level) and this will not be possible within
  one transaction or there will not be enough space to satisfy all
  reservations. (This attempts to fine-tune the currently hardcoded level
  8 up to the best value). The safe value for the level in the
  calculations would be like N+1, ie. as if all the possible splits
  happen with respect to current tree height.
 
 implemented as follows on top of next/master, in short:
 * disable overcommit completely
 * do the optimistically best guess for the metadata and reserve only up
   to the current tree height
 

So I had tried to do this before, the problem is when height changes our reserve
changes.  So for things like delalloc we say we have X number of extents and we
reserve that much space, but then when we run delalloc we re-calculate the
metadata size for X number extents we've removed and that number could come out
differently since the height of the tree would have changed.  One thing we could
do is to store the actual reservation with the extent in the io_tree, but I
think we already use the private for something else so we'd have to add it
somewhere else.  Thanks,

Josef
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage

2012-09-25 Thread Goffredo Baroncelli

On 09/25/2012 12:14 PM, David Sterba wrote:

On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wuwu...@linux.vnet.ibm.com

   Some code pathes forget to free memory on exit.


Same as with the fd's, kernel will free all memory for us at exit().


I strongly disagree with this approach. The callee often don't know what 
happen after and before the call. The same is true for the programmer, 
because the code is quite often updated by several people. A clean 
exit() is the right thing to do as general rule. I don't see any valid 
reason (in the btrfs context) to do otherwise.


Relying on the exit() for a proper clean-up increase the likelihood of 
bug when the code evolves (see my patch   [RESPOST][BTRFS-PROGS][PATCH] 
btrfs_read_dev_super(): uninitialized variable for an example of what 
means an incorrect deallocation of resource).



If there's lots of memory allocated, it may be even faster to leave the
unallocation process to kernel as it will do it in one go, while the
application would unnecessarily free it chunk by chunk.


May be I am wrong, but I don't think that the increase of speed of the 
btrfs command is even measurable relying on exit instead of free()-ing 
each chunk of memory one at time The same should be true for the 
open()/close()


My 2¢

BR
G.Baroncelli



david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
.



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2 v3] Btrfs: snapshot-aware defrag

2012-09-25 Thread Mitch Harder
On Mon, Sep 17, 2012 at 4:58 AM, Liu Bo bo.li@oracle.com wrote:
 This comes from one of btrfs's project ideas,
 As we defragment files, we break any sharing from other snapshots.
 The balancing code will preserve the sharing, and defrag needs to grow this
 as well.

 Now we're able to fill the blank with this patch, in which we make full use of
 backref walking stuff.

 Here is the basic idea,
 o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
 o  at endio, after we finish updating fs tree, we use backref walking to find
all parents of the ranges and re-link them with the new COWed file layout 
 by
adding corresponding backrefs.

 Originally patch by Li Zefan l...@cn.fujitsu.com
 Signed-off-by: Liu Bo bo.li@oracle.com

I'm hitting the WARN_ON in record_extent_backrefs() indicating a
problem with the return value from iterate_inodes_from_logical().

[ 6865.184782] [ cut here ]
[ 6865.184819] WARNING: at fs/btrfs/inode.c:2062
record_extent_backrefs+0xe5/0xe7 [btrfs]()
[ 6865.184823] Hardware name: OptiPlex 745
[ 6865.184825] Modules linked in: lpc_ich mfd_core xts gf128mul cryptd
aes_x86_64 sha256_generic btrfs libcrc32c
[ 6865.184841] Pid: 4239, comm: btrfs-endio-wri Not tainted 3.5.4-git-local+ #1
[ 6865.184844] Call Trace:
[ 6865.184856]  [81031d6a] warn_slowpath_common+0x74/0xa2
[ 6865.184862]  [81031db2] warn_slowpath_null+0x1a/0x1c
[ 6865.184884]  [a003356b] record_extent_backrefs+0xe5/0xe7 [btrfs]
[ 6865.184908]  [a003cf3a] btrfs_finish_ordered_io+0x131/0xa4b [btrfs]
[ 6865.184930]  [a003d869] finish_ordered_fn+0x15/0x17 [btrfs]
[ 6865.184951]  [a005882f] worker_loop+0x145/0x516 [btrfs]
[ 6865.184959]  [81059727] ? __wake_up_common+0x54/0x84
[ 6865.184983]  [a00586ea] ? btrfs_queue_worker+0x2d3/0x2d3 [btrfs]
[ 6865.184989]  [810516bb] kthread+0x93/0x98
[ 6865.184996]  [817d7934] kernel_thread_helper+0x4/0x10
[ 6865.185001]  [81051628] ? kthread_freezable_should_stop+0x6a/0x6a
[ 6865.185021]  [817d7930] ? gs_change+0xb/0xb
[ 6865.185025] ---[ end trace 26cc0e186efc79d8 ]---


I'm testing a 3.5.4 kernel merged with 3.6_rc patchset as well as the
send_recv patches and most of the btrfs-next patches.

I'm running into this issue when mounting with autodefrag, and running
some snapshot tests.

This may be related to a problem elsewhere, because I've been
encountering other backref issues even before testing this patch.


[PATCH] Btrfs: do not hold the file extent leaf locked when adding extent item

2012-09-25 Thread Josef Bacik
For some reason we unlock everything except the leaf we are on, set the path
blocking and then add the extent item for the extent we just finished
writing.  I can't for the life of me figure out why we would want to do
this, and the history doesn't really indicate that there was a real reason
for it, so just remove it.  This will reduce our tree lock contention on
heavy writes.  Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/inode.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cad0c57..50a54b3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1823,10 +1823,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
btrfs_set_file_extent_encryption(leaf, fi, encryption);
btrfs_set_file_extent_other_encoding(leaf, fi, other_encoding);
 
-   btrfs_unlock_up_safe(path, 1);
-   btrfs_set_lock_blocking(leaf);
-
btrfs_mark_buffer_dirty(leaf);
+   btrfs_release_path(path);
 
inode_add_bytes(inode, num_bytes);
 
-- 
1.7.7.6



various processes blocked in btrfs_tree_read_lock for more than 120 secs

2012-09-25 Thread Jim Schutt

Hi,

I'm hitting the following on a btrfs filesystem used as
a Ceph OSD data store, under a heavy write load.

My kernel is current Linus master (commit 56d27adcb536)
merged with Josef Bacik's btrfs-next master (commit d5b04fb3bbb6).

What can I do to help resolve this?

[ 1558.754105] INFO: task btrfs-endio-wri:28012 blocked for more than 120 seconds.
[ 1558.761569] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[ 1558.769575] btrfs-endio-wri D 8160b9c0 0 28012  2 0x
[ 1558.777063]  8806230b39c0 0046 0002 
8806230b3fd8
[ 1558.785725]  8806230b2010 8806230b2000 8806230b2000 
8806230b2000
[ 1558.793776]  8806230b3fd8 8806230b2000 8806245a9740 
880615825d00
[ 1558.802396] Call Trace:
[ 1558.805018]  [8148b51d] schedule+0x5d/0x60
[ 1558.810189]  [a05c7423] btrfs_tree_read_lock+0xf3/0x140 [btrfs]
[ 1558.817223]  [810660a0] ? wake_up_bit+0x40/0x40
[ 1558.822851]  [a056cf53] btrfs_read_lock_root_node+0x23/0x50 [btrfs]
[ 1558.830150]  [a0575000] btrfs_search_slot+0x1f0/0x730 [btrfs]
[ 1558.837722]  [a0588dbd] btrfs_lookup_file_extent+0x3d/0x40 [btrfs]
[ 1558.845099]  [a05a80e1] __btrfs_drop_extents+0x171/0xad0 [btrfs]
[ 1558.852506]  [8114b3eb] ? kmem_cache_alloc+0xcb/0x160
[ 1558.858602]  [a05a9443] btrfs_drop_extents+0x73/0xa0 [btrfs]
[ 1558.865457]  [a059a1c0] 
insert_reserved_file_extent.clone.0+0x80/0x2a0 [btrfs]
[ 1558.873999]  [a059640b] ? start_transaction+0x3cb/0x450 [btrfs]
[ 1558.881239]  [a05a4bc9] btrfs_finish_ordered_io+0x339/0x4d0 [btrfs]
[ 1558.888608]  [a05a4d75] finish_ordered_fn+0x15/0x20 [btrfs]
[ 1558.895352]  [a05bfbd2] worker_loop+0x1a2/0x400 [btrfs]
[ 1558.901753]  [a05bfa30] ? check_pending_worker_creates+0xe0/0xe0 
[btrfs]
[ 1558.910128]  [a05bfa30] ? check_pending_worker_creates+0xe0/0xe0 
[btrfs]
[ 1558.918228]  [81065b66] kthread+0x96/0xa0
[ 1558.923339]  [81495c24] kernel_thread_helper+0x4/0x10
[ 1558.929821]  [8148ca46] ? retint_restore_args+0xe/0xe
[ 1558.935920]  [81065ad0] ? __init_kthread_worker+0x40/0x40
[ 1558.942694]  [81495c20] ? gs_change+0xb/0xb
[ 1558.948170] INFO: task flush-btrfs-21:31685 blocked for more than 120 seconds.
[ 1558.955408] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[ 1558.963274] flush-btrfs-21  D 8160b9c0 0 31685  2 0x
[ 1558.970602]  8806074d3470 0046 0002 
8806074d3fd8
[ 1558.978333]  8806074d2010 8806074d2000 8806074d2000 
8806074d2000
[ 1558.986052]  8806074d3fd8 8806074d2000 8806245aae80 
8804cf048000
[ 1558.993744] Call Trace:
[ 1558.996238]  [8148b51d] schedule+0x5d/0x60
[ 1559.001243]  [a05c7423] btrfs_tree_read_lock+0xf3/0x140 [btrfs]
[ 1559.008057]  [810660a0] ? wake_up_bit+0x40/0x40
[ 1559.013554]  [a056cf53] btrfs_read_lock_root_node+0x23/0x50 [btrfs]
[ 1559.020788]  [a0575000] btrfs_search_slot+0x1f0/0x730 [btrfs]
[ 1559.027451]  [a05aa4c7] ? free_extent_map+0x87/0x90 [btrfs]
[ 1559.033957]  [a0588dbd] btrfs_lookup_file_extent+0x3d/0x40 [btrfs]
[ 1559.041062]  [a05a80e1] __btrfs_drop_extents+0x171/0xad0 [btrfs]
[ 1559.047977]  [8114b3eb] ? kmem_cache_alloc+0xcb/0x160
[ 1559.053948]  [a05a9443] btrfs_drop_extents+0x73/0xa0 [btrfs]
[ 1559.060698]  [a059cc85] cow_file_range_inline+0xe5/0x1c0 [btrfs]
[ 1559.067807]  [a059ced2] cow_file_range+0x172/0x4a0 [btrfs]
[ 1559.074296]  [a059f27b] run_delalloc_range+0x7b/0xa0 [btrfs]
[ 1559.080967]  [a05b406d] __extent_writepage+0x22d/0x740 [btrfs]
[ 1559.087868]  [a0597757] ? btrfs_add_delayed_iput+0x77/0xe0 [btrfs]
[ 1559.095034]  [810fcad8] ? find_get_pages_tag+0x148/0x190
[ 1559.101329]  [a05b4972] 
extent_write_cache_pages.clone.3+0x242/0x3d0 [btrfs]
[ 1559.109465]  [a05b4b47] extent_writepages+0x47/0x60 [btrfs]
[ 1559.116008]  [a05a2790] ? btrfs_update_time+0xb0/0xb0 [btrfs]
[ 1559.122716]  [a0599987] btrfs_writepages+0x27/0x30 [btrfs]
[ 1559.129211]  [81108a03] do_writepages+0x23/0x40
[ 1559.134701]  [8117f94b] __writeback_single_inode+0x4b/0x180
[ 1559.141198]  [81065f77] ? bit_waitqueue+0x17/0xc0
[ 1559.147201]  [811820f6] writeback_sb_inodes+0x286/0x390
[ 1559.153365]  [81182286] __writeback_inodes_wb+0x86/0xd0
[ 1559.159509]  [811824fb] wb_writeback+0x18b/0x320
[ 1559.165082]  [81182884] wb_do_writeback+0x1f4/0x290
[ 1559.170988]  [814897b2] ? schedule_timeout+0x1d2/0x240
[ 1559.177249]  [81182a03] bdi_writeback_thread+0xe3/0x2c0
[ 1559.183685]  [81182920] ? wb_do_writeback+0x290/0x290
[ 1559.189740]  [81182920] ? wb_do_writeback+0x290/0x290
[ 1559.195878]  

Re: [PATCH v4 0/4] btrfs: extended inode refs

2012-09-25 Thread Chris Mason
On Mon, Aug 20, 2012 at 02:29:17PM -0600, Mark Fasheh wrote:
 
 Testing wise, the basic namespace operations work well (link, unlink, etc).
 The rest has gotten less debugging (and I really don't have a great way of
 testing the code in tree-log.c). Attached to this e-mail are btrfs-progs
 patches which make testing of the changes possible.

Hi Mark,

I hit a few problems testing this, so I have the patch below that I plan
on folding into your commits (to keep bisect from crashing in tree log).

Just let me know if this is a problem, or if you see any bugs in there.
I'm still doing a last round of checks on it, but I wanted to send along
early for comments.

The biggest change in here is to always check the ref_objectid when
returning a backref.  Hash collisions mean we may return a ref for a
completely different parent id otherwise.  I think I caught all the
places missing that logic, but please double check me.
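
To illustrate the collision Chris describes: extended inode refs are keyed by a
hash of the backref name, and the same name can hang off two different parent
directories, so a lookup that matches only the name can hand back a ref that
belongs to the wrong parent. A self-contained sketch of the required check (the
hash below is a stand-in for illustration, not the actual btrfs extref hash):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct extref {
	uint64_t parent_objectid;
	char name[32];
};

static uint64_t name_hash(const char *name)	/* stand-in hash */
{
	uint64_t h = 5381;
	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h;
}

/* Return a ref only if the name AND the parent objectid match;
 * matching on the hashed name alone is not enough. */
static const struct extref *find_ref(const struct extref *refs, int n,
				     uint64_t parent, const char *name)
{
	for (int i = 0; i < n; i++) {
		if (name_hash(refs[i].name) == name_hash(name) &&
		    refs[i].parent_objectid == parent &&	/* the check the patch adds */
		    strcmp(refs[i].name, name) == 0)
			return &refs[i];
	}
	return NULL;
}

int main(void)
{
	struct extref refs[] = {
		{ 256, "foo" },
		{ 257, "foo" },	/* same name, different parent dir */
	};
	const struct extref *r = find_ref(refs, 2, 257, "foo");

	printf("found ref under parent %llu\n",
	       r ? (unsigned long long)r->parent_objectid : 0ULL);
	return 0;
}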

Other than that I went through and fixed up bugs in
tree-log.c.  __add_inode_ref had a bunch of cut and paste errors, and you
carefully preserved a huge use-after-free bug in the original
add_inode_ref.

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f2e7e6..50dcd0f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3231,7 +3231,8 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
 			  u64 inode_objectid, u64 ref_objectid, int ins_len,
 			  int cow);
 
-int btrfs_find_name_in_ext_backref(struct btrfs_path *path, const char *name,
+int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
+				   u64 ref_objectid, const char *name,
 				   int name_len,
 				   struct btrfs_inode_extref **extref_ret);
 
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index ad11b30..48b8fda 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -51,8 +51,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
 	return 0;
 }
 
-int btrfs_find_name_in_ext_backref(struct btrfs_path *path, const char *name,
-				   int name_len,
+int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
+				   const char *name, int name_len,
 				   struct btrfs_inode_extref **extref_ret)
 {
 	struct extent_buffer *leaf;
@@ -78,8 +78,9 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, const char *name,
 		name_ptr = (unsigned long)(&extref->name);
 		ref_name_len = btrfs_inode_extref_name_len(leaf, extref);
 
-		if (ref_name_len == name_len &&
-		    (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) {
+		if (ref_name_len == name_len &&
+		    btrfs_inode_extref_parent(leaf, extref) == ref_objectid &&
+		    (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) {
 			if (extref_ret)
 				*extref_ret = extref;
 			return 1;
@@ -138,7 +139,7 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
 		return ERR_PTR(ret);
 	if (ret > 0)
 		return NULL;
-	if (!btrfs_find_name_in_ext_backref(path, name, name_len, &extref))
+	if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
 		return NULL;
 	return extref;
 }
@@ -218,7 +219,8 @@ int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
 	 * This should always succeed so error here will make the FS
 	 * readonly.
 	 */
-	if (!btrfs_find_name_in_ext_backref(path, name, name_len, &extref)) {
+	if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
+					    name, name_len, &extref)) {
 		btrfs_std_error(root->fs_info, -ENOENT);
 		ret = -EROFS;
 		goto out;
@@ -355,7 +357,8 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
 	ret = btrfs_insert_empty_item(trans, root, path, &key,
 				      ins_len);
 	if (ret == -EEXIST) {
-		if (btrfs_find_name_in_ext_backref(path, name, name_len, NULL))
+		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
+						   name, name_len, NULL))
 			goto out;
 
 		btrfs_extend_item(trans, root, path, ins_len);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index baf8be2..1d7b348 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -745,6 +745,7 @@ out:
  */
 static noinline int backref_in_log(struct btrfs_root *log,
 				   struct btrfs_key *key,
+				   u64 ref_objectid,
 				   char *name, int namelen)
 {

Re: [PATCH 2/2 v3] Btrfs: snapshot-aware defrag

2012-09-25 Thread Liu Bo
On 09/26/2012 01:39 AM, Mitch Harder wrote:
 On Mon, Sep 17, 2012 at 4:58 AM, Liu Bo bo.li@oracle.com wrote:
 This comes from one of btrfs's project ideas,
 As we defragment files, we break any sharing from other snapshots.
 The balancing code will preserve the sharing, and defrag needs to grow this
 as well.

 Now we're able to fill the blank with this patch, in which we make full use 
 of
 backref walking stuff.

 Here is the basic idea,
 o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
 o  at endio, after we finish updating fs tree, we use backref walking to find
all parents of the ranges and re-link them with the new COWed file layout 
 by
adding corresponding backrefs.

 Originally patch by Li Zefan l...@cn.fujitsu.com
 Signed-off-by: Liu Bo bo.li@oracle.com
 
 I'm hitting the WARN_ON in record_extent_backrefs() indicating a
 problem with the return value from iterate_inodes_from_logical().
 
 [ 6865.184782] [ cut here ]
 [ 6865.184819] WARNING: at fs/btrfs/inode.c:2062
 record_extent_backrefs+0xe5/0xe7 [btrfs]()
 [ 6865.184823] Hardware name: OptiPlex 745
 [ 6865.184825] Modules linked in: lpc_ich mfd_core xts gf128mul cryptd
 aes_x86_64 sha256_generic btrfs libcrc32c
 [ 6865.184841] Pid: 4239, comm: btrfs-endio-wri Not tainted 3.5.4-git-local+ 
 #1
 [ 6865.184844] Call Trace:
 [ 6865.184856]  [81031d6a] warn_slowpath_common+0x74/0xa2
 [ 6865.184862]  [81031db2] warn_slowpath_null+0x1a/0x1c
 [ 6865.184884]  [a003356b] record_extent_backrefs+0xe5/0xe7 [btrfs]
 [ 6865.184908]  [a003cf3a] btrfs_finish_ordered_io+0x131/0xa4b 
 [btrfs]
 [ 6865.184930]  [a003d869] finish_ordered_fn+0x15/0x17 [btrfs]
 [ 6865.184951]  [a005882f] worker_loop+0x145/0x516 [btrfs]
 [ 6865.184959]  [81059727] ? __wake_up_common+0x54/0x84
 [ 6865.184983]  [a00586ea] ? btrfs_queue_worker+0x2d3/0x2d3 [btrfs]
 [ 6865.184989]  [810516bb] kthread+0x93/0x98
 [ 6865.184996]  [817d7934] kernel_thread_helper+0x4/0x10
 [ 6865.185001]  [81051628] ? kthread_freezable_should_stop+0x6a/0x6a
 [ 6865.185021]  [817d7930] ? gs_change+0xb/0xb
 [ 6865.185025] ---[ end trace 26cc0e186efc79d8 ]---
 
 
 I'm testing a 3.5.4 kernel merged with 3.6_rc patchset as well as the
 send_recv patches and most of the btrfs-next patches.
 
 I'm running into this issue when mounting with autodefrag, and running
 some snapshot tests.
 
 This may be related to a problem elsewhere, because I've been
 encountering other backref issues even before testing this patch.
 

Oh, will look into it, thanks for the report.

thanks,
liubo



Re: [RFC v2 02/10] vfs: add support for updating access frequency

2012-09-25 Thread Zhi Yong Wu
Thanks a lot for your review, Dave; I sincerely appreciate it. It is very helpful to me.

On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Add some utils helpers to update access frequencies
 for one file or its range.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/hot_tracking.c |  359 +
  fs/hot_tracking.h |   15 +++
  2 files changed, 374 insertions(+), 0 deletions(-)

 diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
 index 173054b..52ed926 100644
 --- a/fs/hot_tracking.c
 +++ b/fs/hot_tracking.c
 @@ -106,6 +106,365 @@ inode_err:
  }

  /*
 + * Drops the reference out on hot_inode_item by one and free the structure
 + * if the reference count hits zero
 + */
 +void hot_rb_free_hot_inode_item(struct hot_inode_item *he)

 hot_inode_item_put()

 +{
 + if (!he)
 + return;

 It's a bug to call a put function on a kref counted item with a null
 pointer. Let the kernel crash so it is noticed and fixed.
OK, will remove it.

 +
 + if (atomic_dec_and_test(&he->refs.refcount)) {
 + WARN_ON(he->in_tree);
 + kmem_cache_free(hot_inode_item_cache, he);
 + }

 Isn't this abusing krefs? i.e. this should be:
Sorry, thanks for your explanation as below:

 hot_inode_item_free()
 {
 WARN_ON(he->in_tree);
 kmem_cache_free(hot_inode_item_cache, he);
 }

 hot_inode_item_put()
 {
 kref_put(&he->refs, hot_inode_item_free)
 }
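
For reference, a kernel-side sketch of that idiom using the field names from
this patch (refs, in_tree, hot_inode_item_cache); container_of() recovers the
item from the embedded kref:

/* Release callback: only ever invoked by kref_put() when the
 * last reference is dropped. */
static void hot_inode_item_free(struct kref *kref)
{
	struct hot_inode_item *he =
		container_of(kref, struct hot_inode_item, refs);

	WARN_ON(he->in_tree);	/* must be off the rb-tree by now */
	kmem_cache_free(hot_inode_item_cache, he);
}

static void hot_inode_item_put(struct hot_inode_item *he)
{
	/* kref_put() calls hot_inode_item_free() when the count hits 0 */
	kref_put(&he->refs, hot_inode_item_free);
}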

 +/*
 + * Drops the reference out on hot_range_item by one and free the structure
 + * if the reference count hits zero
 + */
 +static void hot_rb_free_hot_range_item(struct hot_range_item *hr)

 same comments as above.
OK, thanks.
 
 +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root,
 + unsigned long inode_num,
 + struct rb_node *node)

 static struct rb_node *
 hot_inode_item_find(...)
OK, thanks.

 +{
 + struct rb_node **p = &root->rb_node;
 + struct rb_node *parent = NULL;
 + struct hot_inode_item *entry;
 +
 + /* walk tree to find insertion point */
 + while (*p) {
 + parent = *p;
 + entry = rb_entry(parent, struct hot_inode_item, rb_node);
 +
 + if (inode_num < entry->i_ino)
 + p = (*p)->rb_left;
 + else if (inode_num > entry->i_ino)
 + p = (*p)->rb_right;
 + else
 + return parent;
 + }
 +
 + entry = rb_entry(node, struct hot_inode_item, rb_node);
 + entry->in_tree = 1;
 + rb_link_node(node, parent, p);
 + rb_insert_color(node, root);

 So the hot inode item search key is the inode number? Why use an
Yes
 rb-tree then? Wouldn't a btree be a more memory efficient way to
 hold a sparse index that has fixed key and pointer sizes?
Yes, I know, but if we don't use a btree, what would be better? A radix tree?


 Also, the API seems quite strange. You pass in the rb_node and
 the inode number instead of passing in the hot inode item that
 already holds this information. You then convert the rb_node back to
 a hot inode item to set the in_tree variable. So why not just pass
 in the struct hot_inode_item in the first place?
Good catch, thanks for your reminder.

 +static u64 hot_rb_range_end(struct hot_range_item *hr)

 hot_range_item_end()
OK

 +{
 + if (hr->start + hr->len < hr->start)
 + return (u64)-1;
 +
 + return hr->start + hr->len - 1;
 +}
 +
 +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root,

 hot_range_item_find()
OK

 + u64 start,
 + struct rb_node *node)
 +{
 + struct rb_node **p = &root->rb_node;
 + struct rb_node *parent = NULL;
 + struct hot_range_item *entry;
 +
 + /* ensure start is on a range boundary */
 + start = start & RANGE_SIZE_MASK;
 + /* walk tree to find insertion point */
 + while (*p) {
 + parent = *p;
 + entry = rb_entry(parent, struct hot_range_item, rb_node);
 +
 + if (start < entry->start)
 + p = (*p)->rb_left;
 + else if (start >= hot_rb_range_end(entry))

 Shouldn't an aligned end always be one byte short of the start
 offset of the next aligned region? i.e. start ==
 hot_rb_range_end(entry) is an indication of an off-by one bug
 somewhere?
This is really a good catch, thanks.

 + p = (*p)-rb_right;
 + else
 + return parent;
 + }
 +
 + entry = rb_entry(node, struct hot_range_item, rb_node);
 + entry->in_tree = 1;
 + rb_link_node(node, parent, p);
 + rb_insert_color(node, root);

 Same comments as the hot_inode_range.
OK.

 Also, the start offset in the 
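
To spell out the boundary arithmetic Dave flags above: with an inclusive end of
start + len - 1, the next aligned range begins at end + 1, so a new aligned
start should never equal an existing range's end. A small self-contained check
(RANGE_SIZE is an assumed 1 MiB granularity, purely for illustration):

#include <assert.h>
#include <stdint.h>

#define RANGE_SIZE	(1024 * 1024)		/* assumed range granularity */
#define RANGE_MASK	(~((uint64_t)RANGE_SIZE - 1))

/* Inclusive end of a range, mirroring hot_rb_range_end() above. */
static uint64_t range_end(uint64_t start, uint64_t len)
{
	if (start + len < start)	/* overflow: clamp to max */
		return (uint64_t)-1;
	return start + len - 1;
}

int main(void)
{
	uint64_t a_start = 0, a_len = RANGE_SIZE;
	uint64_t b_start = (a_start + a_len) & RANGE_MASK;	/* next aligned range */

	/* b starts exactly one byte past a's inclusive end... */
	assert(b_start == range_end(a_start, a_len) + 1);
	/* ...so "start == end(existing)" would indicate an off-by-one. */
	assert(b_start != range_end(a_start, a_len));
	return 0;
}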

Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'

2012-09-25 Thread Zhi Yong Wu
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote:
 On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Introduce one new mount option '-o hottrack',
 and add its parsing support.
   Its usage looks like:
mount -o hottrack
mount -o nouser,hottrack
mount -o nouser,hottrack,loop
mount -o hottrack,nouser

 I think that this option parsing should be done by the filesystem,
 even though the tracking functionality is in the VFS. That way only
 the filesystems that can use the tracking information will turn it
 on, rather than being able to turn it on for everything regardless
 of whether it is useful or not.

 Along those lines, just using a normal superblock flag to indicate
 it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you
 don't need to allocate the sb-s_hot_info structure just to be able
 to check whether we are tracking hot inodes or not.

 This then means the hot inode tracking for the superblock can be
 initialised by the filesystem as part of its fill_super method,
 along with the filesystem-specific code that will use the hot
 tracking information the VFS gathers.
I can see what you mean, but I don't know if the other guys also agree with
this. If so, all FS-specific code that uses the hot tracking feature will have
to add the same chunk of code in its fill_super method. Is that good?
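
For what it's worth, the per-filesystem version Dave describes is only a few
lines; a sketch, where MS_HOT_INODE_TRACKING, the "hottrack" string and
hot_track_init() are all hypothetical placeholder names:

/* Sketch of per-fs opt-in inside a filesystem's own option parser,
 * as called from its fill_super. All names here are placeholders. */
static int examplefs_parse_options(struct super_block *sb, char *options)
{
	char *p;

	while ((p = strsep(&options, ",")) != NULL) {
		if (!*p)
			continue;
		if (!strcmp(p, "hottrack")) {
			sb->s_flags |= MS_HOT_INODE_TRACKING; /* placeholder flag */
			/* allocate sb->s_hot_info only when tracking is on */
			return hot_track_init(sb);
		}
		/* ... other filesystem-specific options ... */
	}
	return 0;
}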


 Cheers,

 Dave.
 --
 Dave Chinner
 da...@fromorbit.com



-- 
Regards,

Zhi Yong Wu


Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage

2012-09-25 Thread Zhi Yong Wu
On Wed, Sep 26, 2012 at 1:14 AM, Goffredo Baroncelli kreij...@libero.it wrote:
 On 09/25/2012 12:14 PM, David Sterba wrote:

 On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:

 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Some code paths forget to free memory on exit.


 Same as with the fd's, kernel will free all memory for us at exit().


 I strongly disagree with this approach. The callee often doesn't know what
 happens before and after the call. The same is true for the programmer,
 because the code is quite often updated by several people. Cleaning up
 explicitly before exit() is the right thing to do as a general rule. I don't
 see any valid reason (in the btrfs context) to do otherwise.

 Relying on exit() for proper clean-up increases the likelihood of bugs
 as the code evolves (see my patch   [RESPOST][BTRFS-PROGS][PATCH]
 btrfs_read_dev_super(): uninitialized variable for an example of what
 incorrect deallocation of a resource can lead to).


 If there's lots of memory allocated, it may be even faster to leave the
 unallocation process to kernel as it will do it in one go, while the
 application would unnecessarily free it chunk by chunk.


 Maybe I am wrong, but I don't think that the speed-up of the btrfs
 command is even measurable when relying on exit() instead of free()-ing
 each chunk of memory one at a time. The same should be true for
 open()/close().

I fully agree with you. Within one and the same function, I find that some
code paths free system resources while other code paths don't. Cleaning up
consistently is the better way.


 My 2¢

 BR
 G.Baroncelli


 david





-- 
Regards,

Zhi Yong Wu


Re: [RFC v2 01/10] vfs: introduce private rb structures

2012-09-25 Thread Zhi Yong Wu
On Tue, Sep 25, 2012 at 6:20 PM, Ram Pai linux...@us.ibm.com wrote:
 On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   One root structure hot_info is defined, is hooked
 up in super_block, and will be used to hold rb trees
 root, hash list root and some other information, etc.
   Adds hot_inode_tree struct to keep track of
 frequently accessed files, and be keyed by {inode, offset}.
 Trees contain hot_inode_items representing those files
 and ranges.
   Having these trees means that vfs can quickly determine the
 temperature of some data by doing some calculations on the
 hot_freq_data struct that hangs off of the tree item.
   Define two items hot_inode_item and hot_range_item,
 one of them represents one tracked file
 to keep track of its access frequency and the tree of
 ranges in this file, while the latter represents
 a file range of one inode.
   Each of the two structures contains a hot_freq_data
 struct with its frequency of access metrics (number of
 {reads, writes}, last {read,write} time, frequency of
 {reads,writes}).
   Also, each hot_inode_item contains one hot_range_tree
 struct which is keyed by {inode, offset, length}
 and used to keep track of all the ranges in this file.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
 +
 ..snip..

 +/* A tree that sits on the hot_info */
 +struct hot_inode_tree {
 + struct rb_root map;
 + rwlock_t lock;
 +};
 +
 +/* A tree of ranges for each inode in the hot_inode_tree */
 +struct hot_range_tree {
 + struct rb_root map;
 + rwlock_t lock;
 +};

 Might as well have one generic data structure called hot_tree instead
 of having two different data structures which are basically the same.
OK.

 +
 +/* A frequency data struct holds values that are used to
 + * determine temperature of files and file ranges. These structs
 + * are members of hot_inode_item and hot_range_item
 + */
 +struct hot_freq_data {
 + struct timespec last_read_time;
 + struct timespec last_write_time;
 + u32 nr_reads;
 + u32 nr_writes;
 + u64 avg_delta_reads;
 + u64 avg_delta_writes;
 + u8 flags;
 + u32 last_temperature;
 +};
 +
 +/* An item representing an inode and its access frequency */
 +struct hot_inode_item {
 + /* node for hot_inode_tree rb_tree */
 + struct rb_node rb_node;
 + /* tree of ranges in this inode */
 + struct hot_range_tree hot_range_tree;
 + /* frequency data for this inode */
 + struct hot_freq_data hot_freq_data;
 + /* inode number, copied from inode */
 + unsigned long i_ino;
 + /* used to check for errors in ref counting */
 + u8 in_tree;
 + /* protects hot_freq_data, i_no, in_tree */
 + spinlock_t lock;
 + /* prevents kfree */
 + struct kref refs;
 +};
 +
 +/*
 + * An item representing a range inside of an inode whose frequency
 + * is being tracked
 + */
 +struct hot_range_item {
 + /* node for hot_range_tree rb_tree */
 + struct rb_node rb_node;
 + /* frequency data for this range */
 + struct hot_freq_data hot_freq_data;
 + /* the hot_inode_item associated with this hot_range_item */
 + struct hot_inode_item *hot_inode;
 + /* starting offset of this range */
 + u64 start;
 + /* length of this range */
 + u64 len;
 + /* used to check for errors in ref counting */
 + u8 in_tree;
 + /* protects hot_freq_data, start, len, and in_tree */
 + spinlock_t lock;
 + /* prevents kfree */
 + struct kref refs;
 +};

 might as well have just one generic data structure called hot_item with
 all the common fields and then have

 struct hot_inode_item  {
 struct hot_item hot_inode;
 struct hot_tree hot_range_tree;
 unsigned long i_ino;
 }

 and

 struct hot_range_item {
 struct hot_item hot_range;
 u64 start;
 u64 len;/* length of this range */
 }

 This should help you eliminate some duplicate code as well.
OK, I will try to apply them. Thanks.
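
Put together, Ram's suggestion reduces to one embedded struct plus
container_of() accessors; roughly as follows, with hot_freq_data and
hot_range_tree as defined earlier in the patch:

/* Shared part of both item types, per Ram's suggestion. */
struct hot_item {
	struct rb_node rb_node;
	struct hot_freq_data hot_freq_data;
	u8 in_tree;
	spinlock_t lock;
	struct kref refs;
};

struct hot_inode_item {
	struct hot_item hot_inode;		/* common fields */
	struct hot_range_tree hot_range_tree;
	unsigned long i_ino;
};

struct hot_range_item {
	struct hot_item hot_range;		/* common fields */
	u64 start;
	u64 len;
};

/* One set of rb-tree helpers can then serve both types; the caller
 * recovers the outer structure with container_of(). */
static inline struct hot_inode_item *to_inode_item(struct rb_node *node)
{
	return container_of(node, struct hot_inode_item, hot_inode.rb_node);
}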


 RP




-- 
Regards,

Zhi Yong Wu


Re: [RFC v2 05/10] vfs: introduce one hash table

2012-09-25 Thread Zhi Yong Wu
On Tue, Sep 25, 2012 at 5:54 PM, Ram Pai linux...@us.ibm.com wrote:
 On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
 From: Zhi Yong Wu wu...@linux.vnet.ibm.com

   Adds a hash table structure which contains
 many hash lists and is used to efficiently
 look up the data temperature of a file or its
 ranges.
   In each hash list of the hash table, the hash node
 will keep track of temperature info.

 Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
 ---
  fs/hot_tracking.c            |   77 -
  include/linux/hot_tracking.h |   35 +++
  2 files changed, 110 insertions(+), 2 deletions(-)

 diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
 index fa89f70..5f96442 100644
 --- a/fs/hot_tracking.c
 +++ b/fs/hot_tracking.c
 @@ -16,6 +16,7 @@
  #include linux/module.h
  #include linux/spinlock.h
  #include linux/hardirq.h
 +#include linux/hash.h
  #include linux/fs.h
  #include linux/blkdev.h
  #include linux/types.h
 @@ -24,6 +25,9 @@

 ...snip...

 +/* Hash list heads for hot hash table */
 +struct hot_hash_head {
 + struct hlist_head hashhead;
 + rwlock_t rwlock;
 + u32 temperature;
 +};
 +
 +/* Nodes stored in each hash list of hash table */
 +struct hot_hash_node {
 + struct hlist_node hashnode;
 + struct list_head node;
 + struct hot_freq_data *hot_freq_data;
 + struct hot_hash_head *hlist;
 + spinlock_t lock; /* protects hlist */
 +
 + /*
 +  * number of references to this node
 +  * equals 1 (hashlist entry)
 +  */
 + struct kref refs;
 +};

 Don't see why you need yet another data structure to hash the inode_item
 and the range_item into a hash list.  You can just add another
If there's such a structure, we can rapidly know which hash bucket
an inode_item is currently linked into via its hlist field.
This is useful when the inode_item corresponding to a file needs to
move from one bucket to another as the file gets colder or hotter.
Does it make sense to you?

 hlist_node in the inode_item and range_item. This field can then be used
 to link into the corresponding hash list.

 You can use the container_of() get to the inode_item or the range_item
 using the hlist_node field.

 You can thus eliminate a lot of code.
As I said above, I don't think it can eliminate a lot of code.
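
For comparison, the embedding Ram describes would look roughly like this
(locking elided; note the 3.x-era hlist_for_each_entry() still takes an
explicit node cursor):

struct hot_inode_item {
	unsigned long i_ino;
	struct hlist_node heat_node;	/* links directly into a bucket */
	/* ... frequency data, kref, etc. as in patch 01 ... */
};

static struct hot_inode_item *bucket_lookup(struct hlist_head *bucket,
					    unsigned long ino)
{
	struct hot_inode_item *he;
	struct hlist_node *pos;

	hlist_for_each_entry(he, pos, bucket, heat_node)
		if (he->i_ino == ino)
			return he;	/* container_of() done by the macro */
	return NULL;
}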

 +
  /* An item representing an inode and its access frequency */
  struct hot_inode_item {
   /* node for hot_inode_tree rb_tree */
 @@ -68,6 +93,8 @@ struct hot_inode_item {
   spinlock_t lock;
   /* prevents kfree */
   struct kref refs;
 + /* hashlist node for this inode */
 + struct hot_hash_node *heat_node;

 this can be just
 struct hlist_node head_node; /* lookup hot_inode hash list */

 Use this field to link it into the corresponding hashlist.

  };

  /*
 @@ -91,6 +118,8 @@ struct hot_range_item {
   spinlock_t lock;
   /* prevents kfree */
   struct kref refs;
 + /* hashlist node for this range */
 + struct hot_hash_node *heat_node;

 this can be just
 struct hlist_node head_node; /* lookup hot_range hash list */


  };

  struct hot_info {
 @@ -98,6 +127,12 @@ struct hot_info {

   /* red-black tree that keeps track of fs-wide hot data */
   struct hot_inode_tree hot_inode_tree;
 +
 + /* hash map of inode temperature */
 + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE];
 +
 + /* hash map of range temperature */
 + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE];
  };

  #endif  /* _LINUX_HOTTRACK_H */




-- 
Regards,

Zhi Yong Wu