[PATCH] Btrfs: limit thread pool size when remounting
For some asynchronous threads, such as the submit workers and caching workers, we limit their thread pool size when mounting, so we need to do the same when remounting.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/super.c | 13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 83d6f9f..a58e834 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1158,17 +1158,20 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
 	printk(KERN_INFO "btrfs: resize thread pool %d -> %d\n",
 	       old_pool_size, new_pool_size);

-	btrfs_set_max_workers(fs_info->generic_worker, new_pool_size);
+	btrfs_set_max_workers(fs_info->generic_worker, min(1, new_pool_size));
 	btrfs_set_max_workers(fs_info->workers, new_pool_size);
 	btrfs_set_max_workers(fs_info->delalloc_workers, new_pool_size);
-	btrfs_set_max_workers(fs_info->submit_workers, new_pool_size);
-	btrfs_set_max_workers(fs_info->caching_workers, new_pool_size);
-	btrfs_set_max_workers(fs_info->fixup_workers, new_pool_size);
+	btrfs_set_max_workers(fs_info->submit_workers,
+			      min_t(u64, fs_info->fs_devices->num_devices,
+				    new_pool_size));
+	btrfs_set_max_workers(fs_info->caching_workers, min(2, new_pool_size));
+	btrfs_set_max_workers(fs_info->fixup_workers, min(1, new_pool_size));
 	btrfs_set_max_workers(fs_info->endio_workers, new_pool_size);
 	btrfs_set_max_workers(fs_info->endio_meta_workers, new_pool_size);
 	btrfs_set_max_workers(fs_info->endio_meta_write_workers, new_pool_size);
 	btrfs_set_max_workers(fs_info->endio_write_workers, new_pool_size);
-	btrfs_set_max_workers(fs_info->endio_freespace_worker, new_pool_size);
+	btrfs_set_max_workers(fs_info->endio_freespace_worker,
+			      min(1, new_pool_size));
 	btrfs_set_max_workers(fs_info->delayed_workers, new_pool_size);
 	btrfs_set_max_workers(fs_info->readahead_workers, new_pool_size);
 	btrfs_set_max_workers(fs_info->scrub_workers, new_pool_size);
--
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

One root structure hot_info is defined, is hooked up in super_block, and will be used to hold the rb tree roots, hash list root and some other information, etc. Adds a hot_inode_tree struct to keep track of frequently accessed files, keyed by {inode, offset}. Trees contain hot_inode_items representing those files and ranges. Having these trees means that the vfs can quickly determine the temperature of some data by doing some calculations on the hot_freq_data struct that hangs off of the tree item. Define two items, hot_inode_item and hot_range_item: the former represents one tracked file, keeping track of its access frequency and the tree of ranges in this file, while the latter represents a file range of one inode. Each of the two structures contains a hot_freq_data struct with its frequency-of-access metrics (number of {reads, writes}, last {read, write} time, frequency of {reads, writes}). Also, each hot_inode_item contains one hot_range_tree struct which is keyed by {inode, offset, length} and used to keep track of all the ranges in this file.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com

Just a couple of minor formatting things first up - I'll have more comments as I get deeper into the series.

+/*
+ * Initialize the inode tree. Should be called for each new inode
+ * access or other user of the hot_inode interface.
+ */
+static void hot_rb_inode_tree_init(struct hot_inode_tree *tree)

The names of these are a bit clunky. You probably don't need the _rb_ in the function name. i.e. hot_inode_tree_init() is sufficient, and if we ever want to change the tree type we don't have to rename every single function...

+/*
+ * Initialize a new hot_inode_item structure. The new structure is
+ * returned with a reference count of one and needs to be
+ * freed using free_inode_item()
+ */
+void hot_rb_inode_item_init(void *_item)
+{

The usual naming convention for slab initialiser functions is to use a suffix of _once to indicate it is only ever called once per slab object instantiation, not every time the object is allocated from the slab. See, for example, inode_init_once() and inode_init_always(). So that would make this function hot_inode_item_init_once().

+/* init hot_inode_item and hot_range_item kmem cache */
+static int __init hot_rb_item_cache_init(void)
+{
+	hot_inode_item_cache = kmem_cache_create("hot_inode_item",
+			sizeof(struct hot_inode_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			hot_rb_inode_item_init);
+	if (!hot_inode_item_cache)
+		goto inode_err;
+
+	hot_range_item_cache = kmem_cache_create("hot_range_item",
+			sizeof(struct hot_range_item), 0,
+			SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD,
+			hot_rb_range_item_init);
+	if (!hot_range_item_cache)
+		goto range_err;
+
+	return 0;
+
+range_err:
+	kmem_cache_destroy(hot_inode_item_cache);
+inode_err:
+	return -ENOMEM;
+}
+
+/*
+ * Initialize kmem cache for hot_inode_item
+ * and hot_range_item
+ */
+void __init hot_track_cache_init(void)
+{
+	if (hot_rb_item_cache_init())
+		return;

No real need to have a hot_rb_item_cache_init() function here - just open code it all in the hot_track_cache_init() function.

+}

diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h
new file mode 100644
index 000..269b67a
--- /dev/null
+++ b/fs/hot_tracking.h
@@ -0,0 +1,27 @@
+/*
+ * fs/hot_tracking.h
+ *
+ * Copyright (C) 2012 IBM Corp. All rights reserved.
+ * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com
+ *            Ben Chociej bchoc...@gmail.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ */
+
+#ifndef __HOT_TRACKING__
+#define __HOT_TRACKING__
+
+#include <linux/rbtree.h>
+#include <linux/hot_tracking.h>
+
+/* values for hot_freq_data flags */
+/* freq data struct is for an inode */
+#define FREQ_DATA_TYPE_INODE (1 << 0)
+/* freq data struct is for a range */
+#define FREQ_DATA_TYPE_RANGE (1 << 1)

The comments are redundant - the name of the object documents its use sufficiently. i.e.

/* values for hot_freq_data flags */
#define FREQ_DATA_TYPE_INODE (1 << 0)
#define FREQ_DATA_TYPE_RANGE (1 << 1)

is just fine by itself.

+/* A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */

/*
 * This is a
 * multiline comment. ;)
 */

+struct hot_freq_data {
+	struct timespec last_read_time;
+	struct timespec
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner da...@fromorbit.com wrote:

On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
From: Zhi Yong Wu wu...@linux.vnet.ibm.com
...snip...

Just a couple of minor formatting things first up - I'll have more comments as I get deeper into the series.

All comments are very reasonable, and will be applied. thanks for your review.

...snip...
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner da...@fromorbit.com wrote:

On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
From: Zhi Yong Wu wu...@linux.vnet.ibm.com
...snip...

Just a couple of minor formatting things first up - I'll have more comments as I get deeper into the series.

OK, very much looking forward to seeing more on the other patches, thanks again.

...snip...
Re: [RFC v2 02/10] vfs: add support for updating access frequency
On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Add some utility helpers to update access frequencies for one file or its range.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c |  359 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/hot_tracking.h |   15 +++
 2 files changed, 374 insertions(+), 0 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index 173054b..52ed926 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -106,6 +106,365 @@ inode_err:
 }

 /*
+ * Drops the reference out on hot_inode_item by one and free the structure
+ * if the reference count hits zero
+ */
+void hot_rb_free_hot_inode_item(struct hot_inode_item *he)

hot_inode_item_put()

+{
+	if (!he)
+		return;

It's a bug to call a put function on a kref counted item with a null pointer. Let the kernel crash so it is noticed and fixed.

+
+	if (atomic_dec_and_test(&he->refs.refcount)) {
+		WARN_ON(he->in_tree);
+		kmem_cache_free(hot_inode_item_cache, he);
+	}

Isn't this abusing krefs? i.e. this should be:

hot_inode_item_free()
{
	WARN_ON(he->in_tree);
	kmem_cache_free(hot_inode_item_cache, he);
}

hot_inode_item_put()
{
	kref_put(&he->refs, hot_inode_item_free)
}

+/*
+ * Drops the reference out on hot_range_item by one and free the structure
+ * if the reference count hits zero
+ */
+static void hot_rb_free_hot_range_item(struct hot_range_item *hr)

Same comments as above.

+static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root,
+				unsigned long inode_num,
+				struct rb_node *node)

static struct rb_node *
hot_inode_item_find(...)

+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct hot_inode_item *entry;
+
+	/* walk tree to find insertion point */
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_inode_item, rb_node);
+
+		if (inode_num < entry->i_ino)
+			p = &(*p)->rb_left;
+		else if (inode_num > entry->i_ino)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct hot_inode_item, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);

So the hot inode item search key is the inode number? Why use an rb-tree then? Wouldn't a btree be a more memory efficient way to hold a sparse index that has fixed key and pointer sizes?

Also, the API seems quite strange. You pass in the rb_node and the inode number instead of passing in the hot inode item that already holds this information. You then convert the rb_node back to a hot inode item to set the in_tree variable. So why not just pass in the struct hot_inode_item in the first place?

+static u64 hot_rb_range_end(struct hot_range_item *hr)

hot_range_item_end()

+{
+	if (hr->start + hr->len < hr->start)
+		return (u64)-1;
+
+	return hr->start + hr->len - 1;
+}
+
+static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root,

hot_range_item_find()

+				u64 start,
+				struct rb_node *node)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct hot_range_item *entry;
+
+	/* ensure start is on a range boundary */
+	start = start & RANGE_SIZE_MASK;
+	/* walk tree to find insertion point */
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct hot_range_item, rb_node);
+
+		if (start < entry->start)
+			p = &(*p)->rb_left;
+		else if (start >= hot_rb_range_end(entry))

Shouldn't an aligned end always be one byte short of the start offset of the next aligned region? i.e. start == hot_rb_range_end(entry) is an indication of an off-by-one bug somewhere?

+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	entry = rb_entry(node, struct hot_range_item, rb_node);
+	entry->in_tree = 1;
+	rb_link_node(node, parent, p);
+	rb_insert_color(node, root);

Same comments as the hot_inode_range. Also, the start offset in the hot_range_item is already aligned, so why do you need to align it again?

+
+	return NULL;
+}
+
+/*
+ * Add a hot_inode_item to a hot_inode_tree. If the tree already contains
+ * an item with the index given, return -EEXIST
+ */
+static int hot_rb_add_hot_inode_item(struct hot_inode_tree *tree,
+				struct hot_inode_item *he)
+{
+	int ret = 0;
+	struct rb_node *rb;
+
+	rb =
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like:

	mount -o hottrack
	mount -o nouser,hottrack
	mount -o nouser,hottrack,loop
	mount -o hottrack,nouser

I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way only the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not.

Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you don't need to allocate the sb->s_hot_info structure just to be able to check whether we are tracking hot inodes or not.

This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of its fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC v2 05/10] vfs: introduce one hash table
On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Adds a hash table structure which contains a number of hash lists and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of the hash table, the hash node will keep track of temperature info.

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---
 fs/hot_tracking.c            |   77 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/hot_tracking.h |   35 +++++++++++++++++++
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
index fa89f70..5f96442 100644
--- a/fs/hot_tracking.c
+++ b/fs/hot_tracking.c
@@ -16,6 +16,7 @@
 #include <linux/module.h>
 #include <linux/spinlock.h>
 #include <linux/hardirq.h>
+#include <linux/hash.h>
 #include <linux/fs.h>
 #include <linux/blkdev.h>
 #include <linux/types.h>
@@ -24,6 +25,9 @@

...snip...

+/* Hash list heads for hot hash table */
+struct hot_hash_head {
+	struct hlist_head hashhead;
+	rwlock_t rwlock;
+	u32 temperature;
+};
+
+/* Nodes stored in each hash list of hash table */
+struct hot_hash_node {
+	struct hlist_node hashnode;
+	struct list_head node;
+	struct hot_freq_data *hot_freq_data;
+	struct hot_hash_head *hlist;
+	spinlock_t lock; /* protects hlist */
+
+	/*
+	 * number of references to this node
+	 * equals 1 (hashlist entry)
+	 */
+	struct kref refs;
+};

Don't see why you need yet another data structure to hash the inode_item and the range_item into a hash list. You can just add another hlist_node in the inode_item and range_item. This field can then be used to link into the corresponding hash list. You can use container_of() to get to the inode_item or the range_item using the hlist_node field. You can thus eliminate a lot of code.

+/* An item representing an inode and its access frequency */
 struct hot_inode_item {
 	/* node for hot_inode_tree rb_tree */
@@ -68,6 +93,8 @@ struct hot_inode_item {
 	spinlock_t lock; /* prevents kfree */
 	struct kref refs;
+	/* hashlist node for this inode */
+	struct hot_hash_node *heat_node;

this can be just

	struct hlist_node heat_node; /* lookup hot_inode hash list */

Use this field to link it into the corresponding hashlist.

 };

@@ -91,6 +118,8 @@ struct hot_range_item {
 	spinlock_t lock; /* prevents kfree */
 	struct kref refs;
+	/* hashlist node for this range */
+	struct hot_hash_node *heat_node;

this can be just

	struct hlist_node heat_node; /* lookup hot_range hash list */

 };

 struct hot_info {
@@ -98,6 +127,12 @@ struct hot_info {
 	/* red-black tree that keeps track of fs-wide hot data */
 	struct hot_inode_tree hot_inode_tree;
+
+	/* hash map of inode temperature */
+	struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE];
+
+	/* hash map of range temperature */
+	struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE];
 };

 #endif /* _LINUX_HOTTRACK_H */
Re: [PATCH v2 1/2] btrfs-progs: Close file descriptor on exit
On Tue, Sep 25, 2012 at 10:02:15AM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Need to close fd on exit.

Strictly you don't need to, the kernel will do that at exit() time.

david
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

Some code paths forget to free memory on exit.

Same as with the fd's, the kernel will free all memory for us at exit(). If there's lots of memory allocated, it may be even faster to leave the deallocation to the kernel, as it will do it in one go, while the application would unnecessarily free it chunk by chunk.

david
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:

From: Zhi Yong Wu wu...@linux.vnet.ibm.com

...snip...

Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
---

..snip..

+/* A tree that sits on the hot_info */
+struct hot_inode_tree {
+	struct rb_root map;
+	rwlock_t lock;
+};
+
+/* A tree of ranges for each inode in the hot_inode_tree */
+struct hot_range_tree {
+	struct rb_root map;
+	rwlock_t lock;
+};

Can as well have a generic data structure called hot_tree instead of having two different data structures which basically are the same.

+
+/* A frequency data struct holds values that are used to
+ * determine temperature of files and file ranges. These structs
+ * are members of hot_inode_item and hot_range_item
+ */
+struct hot_freq_data {
+	struct timespec last_read_time;
+	struct timespec last_write_time;
+	u32 nr_reads;
+	u32 nr_writes;
+	u64 avg_delta_reads;
+	u64 avg_delta_writes;
+	u8 flags;
+	u32 last_temperature;
+};
+
+/* An item representing an inode and its access frequency */
+struct hot_inode_item {
+	/* node for hot_inode_tree rb_tree */
+	struct rb_node rb_node;
+	/* tree of ranges in this inode */
+	struct hot_range_tree hot_range_tree;
+	/* frequency data for this inode */
+	struct hot_freq_data hot_freq_data;
+	/* inode number, copied from inode */
+	unsigned long i_ino;
+	/* used to check for errors in ref counting */
+	u8 in_tree;
+	/* protects hot_freq_data, i_no, in_tree */
+	spinlock_t lock;
+	/* prevents kfree */
+	struct kref refs;
+};
+
+/*
+ * An item representing a range inside of an inode whose frequency
+ * is being tracked
+ */
+struct hot_range_item {
+	/* node for hot_range_tree rb_tree */
+	struct rb_node rb_node;
+	/* frequency data for this range */
+	struct hot_freq_data hot_freq_data;
+	/* the hot_inode_item associated with this hot_range_item */
+	struct hot_inode_item *hot_inode;
+	/* starting offset of this range */
+	u64 start;
+	/* length of this range */
+	u64 len;
+	/* used to check for errors in ref counting */
+	u8 in_tree;
+	/* protects hot_freq_data, start, len, and in_tree */
+	spinlock_t lock;
+	/* prevents kfree */
+	struct kref refs;
+};

Might as well have just one generic data structure called hot_item with all the common fields and then have

struct hot_inode_item {
	struct hot_item hot_inode;
	struct hot_tree hot_range_tree;
	unsigned long i_ino;
};

and

struct hot_range_item {
	struct hot_item hot_range;
	u64 start;
	u64 len;	/* length of this range */
};

This should help you eliminate some duplicate code as well.

RP
Re: [PATCH V3 1/2] Btrfs: cleanup duplicated division functions
On Mon, 24 Sep 2012 21:42:42 +0300, Ilya Dryomov wrote:

On Mon, Sep 24, 2012 at 06:47:42PM +0200, David Sterba wrote:

On Mon, Sep 24, 2012 at 10:05:02AM +0800, Miao Xie wrote:

Generally, we should check the value when it is input. If not, we might run our program with the wrong value, and it is possible to cause unknown problems. So I think the above chunk is necessary.

The difference is that giving an invalid value will exit early (your version), while Ilya's version will clamp the 'usage' to the allowed range during processing. From a usability POV, I'd prefer to let the command fail early with a verbose error, e.g. that the range is invalid.

I think that's the job for progs, and that's where this and a few other checks are currently implemented. There is no possibility for unknown problems; that's exactly how it's been implemented before the cleanup. The purpose of balance filters (and the usage filter in particular) is to lower the number of chunks we have to relocate. If someone decides to use raw ioctls and supplies us with an invalid value, we just cut it down to 100 and proceed. usage=100 is the equivalent of not using the usage filter at all, so the raw ioctl user will get what he deserves. This is very similar to what we currently do with other filters. For example, soft can only be used with convert, and this is checked by progs. But if somebody were to set a soft flag without setting convert, we would simply ignore that soft. Same goes for drange and devid, invalid ranges, invalid devids, etc. Pulling all these checks into the kernel is questionable at best, and, if enough people insist on it, should be discussed separately.

I think the usage filter is a special case that doesn't cause critical problems and is not used everywhere. But if there is a variable which is referenced in several places and the kernel would crash if it is invalid, then we would have to add the check everywhere, making the code more complex and ugly, if we don't check it when it is passed in. Besides, if we have done lots of work before the check, we must roll back when we find the variable is invalid, which wastes time. So I think doing the check and returning the error number as early as possible is a rule we should follow.

Thanks
Miao
Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c
On Mon, Sep 24, 2012 at 11:03:49PM +0200, David Sterba wrote:
> Could you please put the check into a separate helper

Please note that checksum will become a variable per-filesystem property, stored within the superblock, so the helper should be passed a fs_info pointer.

thanks, david
Re: [PATCH] Btrfs: limit thread pool size when remounting
On Tue, Sep 25, 2012 at 02:48:33PM +0800, Liu Bo wrote:
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1158,17 +1158,20 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
>  	printk(KERN_INFO "btrfs: resize thread pool %d -> %d\n",
>  	       old_pool_size, new_pool_size);
>
> -	btrfs_set_max_workers(fs_info->generic_worker, new_pool_size);
> +	btrfs_set_max_workers(fs_info->generic_worker, min(1, new_pool_size));

How could new_pool_size be < 1? There's a check in super.c to pick only values > 0.

>  	btrfs_set_max_workers(fs_info->workers, new_pool_size);
>  	btrfs_set_max_workers(fs_info->delalloc_workers, new_pool_size);
> -	btrfs_set_max_workers(fs_info->submit_workers, new_pool_size);
> -	btrfs_set_max_workers(fs_info->caching_workers, new_pool_size);
> -	btrfs_set_max_workers(fs_info->fixup_workers, new_pool_size);
> +	btrfs_set_max_workers(fs_info->submit_workers,
> +			      min_t(u64, fs_info->fs_devices->num_devices,
> +				    new_pool_size));

This asks for an update also when a new device is added/removed.

> +	btrfs_set_max_workers(fs_info->caching_workers, min(2, new_pool_size));
> +	btrfs_set_max_workers(fs_info->fixup_workers, min(1, new_pool_size));

Same as above, is it expected to be < 1?

>  	btrfs_set_max_workers(fs_info->endio_workers, new_pool_size);
>  	btrfs_set_max_workers(fs_info->endio_meta_workers, new_pool_size);
>  	btrfs_set_max_workers(fs_info->endio_meta_write_workers, new_pool_size);
>  	btrfs_set_max_workers(fs_info->endio_write_workers, new_pool_size);
> -	btrfs_set_max_workers(fs_info->endio_freespace_worker, new_pool_size);
> +	btrfs_set_max_workers(fs_info->endio_freespace_worker,
> +			      min(1, new_pool_size));

Not sure, do we actually need more than 1 free space worker?

>  	btrfs_set_max_workers(fs_info->delayed_workers, new_pool_size);
>  	btrfs_set_max_workers(fs_info->readahead_workers, new_pool_size);
>  	btrfs_set_max_workers(fs_info->scrub_workers, new_pool_size);
Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c
On 09/25/2012 06:51 PM, David Sterba wrote:
> On Mon, Sep 24, 2012 at 11:03:49PM +0200, David Sterba wrote:
>> Could you please put the check into a separate helper
> Please note that checksum will become a variable per-filesystem property, stored within the superblock, so the helper should be passed a fs_info pointer.

How about enhancing the thread_pool=<number> mount option instead?

thread_pool=<n>                  enable thread pool for compression and checksum, MAY improve bandwidth
thread_pool=0                    disable thread pool for compression and checksum, MIGHT reduce latency
thread_pool=-1 or not provided   automatically managed (current behaviour and default choice)

This should allow the user to trade off between latency and bandwidth; furthermore, you do not need to assume that btrfs may use the crc32c algorithm only forever.
Re: [PATCH] Btrfs: limit thread pool size when remounting
On 09/25/2012 07:39 PM, David Sterba wrote:
> On Tue, Sep 25, 2012 at 02:48:33PM +0800, Liu Bo wrote:
>> -	btrfs_set_max_workers(fs_info->generic_worker, new_pool_size);
>> +	btrfs_set_max_workers(fs_info->generic_worker, min(1, new_pool_size));
> How could new_pool_size be < 1? There's a check in super.c to pick only values > 0.

I think we just need only 1 generic_worker.

>>  	btrfs_set_max_workers(fs_info->workers, new_pool_size);
>>  	btrfs_set_max_workers(fs_info->delalloc_workers, new_pool_size);
>> -	btrfs_set_max_workers(fs_info->submit_workers, new_pool_size);
>> -	btrfs_set_max_workers(fs_info->caching_workers, new_pool_size);
>> -	btrfs_set_max_workers(fs_info->fixup_workers, new_pool_size);
>> +	btrfs_set_max_workers(fs_info->submit_workers,
>> +			      min_t(u64, fs_info->fs_devices->num_devices,
>> +				    new_pool_size));
> This asks for an update also when a new device is added/removed.

Oh, yes, but we should do it in another new patch instead.

>> +	btrfs_set_max_workers(fs_info->caching_workers, min(2, new_pool_size));
>> +	btrfs_set_max_workers(fs_info->fixup_workers, min(1, new_pool_size));
> Same as above, is it expected to be < 1?
>
>>  	btrfs_set_max_workers(fs_info->endio_workers, new_pool_size);
>>  	btrfs_set_max_workers(fs_info->endio_meta_workers, new_pool_size);
>>  	btrfs_set_max_workers(fs_info->endio_meta_write_workers, new_pool_size);
>>  	btrfs_set_max_workers(fs_info->endio_write_workers, new_pool_size);
>> -	btrfs_set_max_workers(fs_info->endio_freespace_worker, new_pool_size);
>> +	btrfs_set_max_workers(fs_info->endio_freespace_worker,
>> +			      min(1, new_pool_size));
> Not sure, do we actually need more than 1 free space worker?

Same as generic_worker and fixup_workers, I think only one is enough; that's why I made the minimum limitation, or we can set it to 1 directly.

thanks, liubo
Re: [PATCH] Btrfs: limit thread pool size when remounting
On Tue, Sep 25, 2012 at 08:07:00PM +0800, Liu Bo wrote:
> On 09/25/2012 07:39 PM, David Sterba wrote:
>> On Tue, Sep 25, 2012 at 02:48:33PM +0800, Liu Bo wrote:
>>> -	btrfs_set_max_workers(fs_info->generic_worker, new_pool_size);
>>> +	btrfs_set_max_workers(fs_info->generic_worker, min(1, new_pool_size));
>> How could new_pool_size be < 1? There's a check in super.c to pick only values > 0.
> I think we just need only 1 generic_worker

I got it reversed, sorry for the noise.

>>> -	btrfs_set_max_workers(fs_info->submit_workers, new_pool_size);
>>> +	btrfs_set_max_workers(fs_info->submit_workers,
>>> +			      min_t(u64, fs_info->fs_devices->num_devices,
>>> +				    new_pool_size));
>> This asks for an update also when a new device is added/removed.
> Oh, yes, but we should do it in another new patch instead.

Yes, this was just a hint to pair the remount/thread_pool capabilities.

david
Re: [PATCH] Btrfs: do not async metadata csums if we have hardware crc32c
On Tue, Sep 25, 2012 at 07:40:17PM +0800, ching wrote:
> How about enhancing the thread_pool=<number> mount option instead?
> thread_pool=<n>                  enable thread pool for compression and checksum, MAY improve bandwidth
> thread_pool=0                    disable thread pool for compression and checksum, MIGHT reduce latency
> thread_pool=-1 or not provided   automatically managed (current behaviour and default choice)

Sorry, I don't understand the syntax, can you please write it more clearly? Thanks.

> This should allow the user to trade off between latency and bandwidth; furthermore, you do not need to assume that btrfs may use the crc32c algorithm only forever.

Some sort of finer control over the threads makes sense. We should distinguish cpu-bound processing, where parallelism wins, from io-bound processing, where it is not helpful to add more and more threads that hammer a single device (namely in the case of HDD, so this should be tunable for devices with cheap seeks).

david
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Tue, Sep 25, 2012 at 6:14 PM, David Sterba d...@jikos.cz wrote:
> On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>> Some code paths forget to free memory on exit.
> Same as with the fd's, the kernel will free all memory for us at exit().

Hi, can you point me to the exit() you are referring to?

> If there's lots of memory allocated, it may be even faster to leave the deallocation to the kernel, as it will do it in one go, while the application would unnecessarily free it chunk by chunk.

Got it, thanks.

> david

--
Regards,
Zhi Yong Wu
Re: ENOSPC design issues
On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
> I'm going to look at fixing some of the performance issues that crop up because of our reservation system. Before I go and do a whole lot of work I want some feedback. I've done a brain dump here https://btrfs.wiki.kernel.org/index.php/ENOSPC

Thanks for writing it down, much appreciated. My first and probably naive approach is described in the page, quoting here:

Attempt to address "how to flush less" stated below. The over-reservation for a 4k block can go up to 96k in the worst-case calculation (see above). This accounts for splitting the full tree path from the 8th-level root down to the leaf, plus the node splits.

My question: how often do we need to go up to level N+1 from current level N? For levels 0 and 1 it may happen within one transaction, maybe not so often for level 2, and with exponentially decreasing frequency for the higher levels. Therefore, is it possible to check the tree level first and adapt the calculation accordingly? Let's say we can reduce the 4k reservation size from 96k to 32k on average (for a many-gigabyte filesystem), thus increasing the space available for reservations by some factor. The expected gain is less pressure on the flusher, because more reservations will succeed immediately. The idea behind this is to make the initial reservation more accurate to the current state than blindly overcommitting by some random factor (1/2).

Another hint to the tree root level may be the usage of the root node: e.g. if the root is less than half full, splitting will not happen unless there are K concurrent reservations running, where K is proportional to overwriting the whole subtree (same exponential decrease with increasing level), and this will not be possible within one transaction, or there will not be enough space to satisfy all reservations. (This attempts to fine-tune the currently hardcoded level 8 to the best value.)

The safe value for the level in the calculations would be like N+1, i.e. as if all the possible splits happen with respect to the current tree height.

Implemented as follows on top of next/master; in short:
* disable overcommit completely
* make the optimistically best guess for the metadata and reserve only up to the current tree height

--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2794,7 +2794,11 @@ static inline gfp_t btrfs_alloc_write_mask(struct address_space *mapping)
 static inline u64 btrfs_calc_trans_metadata_size(struct btrfs_root *root,
 						 unsigned num_items)
 {
-	return (root->leafsize + root->nodesize * (BTRFS_MAX_LEVEL - 1)) *
+	int level = btrfs_header_level(root->node);
+
+	level = min(level, BTRFS_MAX_LEVEL);
+
+	return (root->leafsize + root->nodesize * (level - 1)) *
 		3 * num_items;
 }
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index efb044e..c9fa7ed 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3649,6 +3649,8 @@ static int can_overcommit(struct btrfs_root *root,
 	u64 avail;
 	u64 used;
 
+	return 0;
+
 	used = space_info->bytes_used + space_info->bytes_reserved +
 		space_info->bytes_pinned + space_info->bytes_readonly +
 		space_info->bytes_may_use;
---

I must be doing something wrong, because I don't see any unexpected ENOSPC while performing some tests on 2G, 4G and 10G partitions with the following loads:

fs_mark -F -k -S 0 -D 20 -N 100                - dumb, no file contents
fs_mark -F -k -S 0 -D 2 -N 40 -s 2048 -t 8     - metadata intense, files and inline contents
fs_mark -F -k -S 0 -D 20 -N 100 -s 3900 -t 24  - ~dtto, lots of writers
fs_mark -F -k -S 0 -D 20 -N 100 -s 8192 -t 24  - lots of writers, no inlines
tar xf linux-3.2.tar.bz2 (1G in total)         - simple untar

The fs_mark loads do not do any kind of sync, as this should stress the amount of data in flight. After each load finishes with ENOSPC, the rest of the filesystem is filled with a file full of zeros. Then 'fi df' reports all the space is used, and no unexpectedly large files can be found (i.e. a few hundred KBs if it was data-intense, or using the whole remaining data space if it was meta-intense, e.g. 8MB).

mkfs was default, so are the mount options. I did not push it through xfstests, but at least verified md5sums of the kernel tree. Sample tree height was 2 for the extent tree and 3 for the fs tree, so I think it exercised the case where the tree height increased during the load (but maybe it was just lucky to get the +1 block from somewhere else, dunno). I'm running the tests on a 100G filesystem now.

david
Re: ENOSPC design issues
On Tue, Sep 25, 2012 at 10:43:36AM -0600, David Sterba wrote:
> On Thu, Sep 20, 2012 at 03:03:06PM -0400, Josef Bacik wrote:
>> I'm going to look at fixing some of the performance issues that crop up because of our reservation system. Before I go and do a whole lot of work I want some feedback. I've done a brain dump here https://btrfs.wiki.kernel.org/index.php/ENOSPC
> [...]
> The safe value for the level in the calculations would be like N+1, i.e. as if all the possible splits happen with respect to the current tree height.
>
> Implemented as follows on top of next/master; in short:
> * disable overcommit completely
> * make the optimistically best guess for the metadata and reserve only up to the current tree height

So I had tried to do this before. The problem is that when the height changes, our reserve changes. For things like delalloc we say we have X number of extents and we reserve that much space, but then when we run delalloc we re-calculate the metadata size for the X extents we've removed, and that number could come out differently since the height of the tree would have changed. One thing we could do is store the actual reservation with the extent in the io_tree, but I think we already use the private for something else, so we'd have to add it somewhere else. Thanks,

Josef
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On 09/25/2012 12:14 PM, David Sterba wrote:
> On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>> Some code paths forget to free memory on exit.
> Same as with the fd's, the kernel will free all memory for us at exit().

I strongly disagree with this approach. The callee often doesn't know what happens after and before the call. The same is true for the programmer, because the code is quite often updated by several people. A clean exit is the right thing to do as a general rule; I don't see any valid reason (in the btrfs context) to do otherwise. Relying on exit() for proper clean-up increases the likelihood of bugs when the code evolves (see my patch "[RESPOST][BTRFS-PROGS][PATCH] btrfs_read_dev_super(): uninitialized variable" for an example of what an incorrect deallocation of resources means).

> If there's lots of memory allocated, it may be even faster to leave the deallocation to the kernel, as it will do it in one go, while the application would unnecessarily free it chunk by chunk.

Maybe I am wrong, but I don't think the speed increase of the btrfs command is even measurable when relying on exit() instead of free()-ing each chunk of memory one at a time. The same should be true for open()/close().

My 2¢

BR
G.Baroncelli
Re: [PATCH 2/2 v3] Btrfs: snapshot-aware defrag
On Mon, Sep 17, 2012 at 4:58 AM, Liu Bo bo.li@oracle.com wrote: This comes from one of btrfs's project ideas, As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well. Now we're able to fill the blank with this patch, in which we make full use of backref walking stuff. Here is the basic idea, o set the writeback ranges started by defragment with flag EXTENT_DEFRAG o at endio, after we finish updating fs tree, we use backref walking to find all parents of the ranges and re-link them with the new COWed file layout by adding corresponding backrefs. Originally patch by Li Zefan l...@cn.fujitsu.com Signed-off-by: Liu Bo bo.li@oracle.com I'm hitting the WARN_ON in record_extent_backrefs() indicating a problem with the return value from iterate_inodes_from_logical(). [ 6865.184782] [ cut here ] [ 6865.184819] WARNING: at fs/btrfs/inode.c:2062 record_extent_backrefs+0xe5/0xe7 [btrfs]() [ 6865.184823] Hardware name: OptiPlex 745 [ 6865.184825] Modules linked in: lpc_ich mfd_core xts gf128mul cryptd aes_x86_64 sha256_generic btrfs libcrc32c [ 6865.184841] Pid: 4239, comm: btrfs-endio-wri Not tainted 3.5.4-git-local+ #1 [ 6865.184844] Call Trace: [ 6865.184856] [81031d6a] warn_slowpath_common+0x74/0xa2 [ 6865.184862] [81031db2] warn_slowpath_null+0x1a/0x1c [ 6865.184884] [a003356b] record_extent_backrefs+0xe5/0xe7 [btrfs] [ 6865.184908] [a003cf3a] btrfs_finish_ordered_io+0x131/0xa4b [btrfs] [ 6865.184930] [a003d869] finish_ordered_fn+0x15/0x17 [btrfs] [ 6865.184951] [a005882f] worker_loop+0x145/0x516 [btrfs] [ 6865.184959] [81059727] ? __wake_up_common+0x54/0x84 [ 6865.184983] [a00586ea] ? btrfs_queue_worker+0x2d3/0x2d3 [btrfs] [ 6865.184989] [810516bb] kthread+0x93/0x98 [ 6865.184996] [817d7934] kernel_thread_helper+0x4/0x10 [ 6865.185001] [81051628] ? kthread_freezable_should_stop+0x6a/0x6a [ 6865.185021] [817d7930] ? 
gs_change+0xb/0xb
[ 6865.185025] ---[ end trace 26cc0e186efc79d8 ]---

I'm testing a 3.5.4 kernel merged with the 3.6_rc patchset as well as the send/recv patches and most of the btrfs-next patches. I'm running into this issue when mounting with autodefrag and running some snapshot tests. This may be related to a problem elsewhere, because I've been encountering other backref issues even before testing this patch.
[PATCH] Btrfs: do not hold the file extent leaf locked when adding extent item
For some reason we unlock everything except the leaf we are on, set the path blocking, and then add the extent item for the extent we just finished writing. I can't for the life of me figure out why we would want to do this, and the history doesn't really indicate that there was a real reason for it, so just remove it. This will reduce our tree lock contention on heavy writes. Thanks,

Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/inode.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cad0c57..50a54b3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1823,10 +1823,8 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
 	btrfs_set_file_extent_encryption(leaf, fi, encryption);
 	btrfs_set_file_extent_other_encoding(leaf, fi, other_encoding);
 
-	btrfs_unlock_up_safe(path, 1);
-	btrfs_set_lock_blocking(leaf);
-
 	btrfs_mark_buffer_dirty(leaf);
+	btrfs_release_path(path);
 
 	inode_add_bytes(inode, num_bytes);
-- 
1.7.7.6
various processes blocked in btrfs_tree_read_lock for more than 120 secs
Hi, I'm hitting the following on a btrfs filesystem used as a Ceph OSD data store, under a heavy write load. My kernel is current Linus master (commit 56d27adcb536) merged with Josef Bacik's btrfs-next master (commit d5b04fb3bbb6). What can I do to help resolve this? [ 1558.754105] INFO: task btrfs-endio-wri:28012 blocked for more than 120 seconds. [ 1558.761569] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 1558.769575] btrfs-endio-wri D 8160b9c0 0 28012 2 0x [ 1558.777063] 8806230b39c0 0046 0002 8806230b3fd8 [ 1558.785725] 8806230b2010 8806230b2000 8806230b2000 8806230b2000 [ 1558.793776] 8806230b3fd8 8806230b2000 8806245a9740 880615825d00 [ 1558.802396] Call Trace: [ 1558.805018] [8148b51d] schedule+0x5d/0x60 [ 1558.810189] [a05c7423] btrfs_tree_read_lock+0xf3/0x140 [btrfs] [ 1558.817223] [810660a0] ? wake_up_bit+0x40/0x40 [ 1558.822851] [a056cf53] btrfs_read_lock_root_node+0x23/0x50 [btrfs] [ 1558.830150] [a0575000] btrfs_search_slot+0x1f0/0x730 [btrfs] [ 1558.837722] [a0588dbd] btrfs_lookup_file_extent+0x3d/0x40 [btrfs] [ 1558.845099] [a05a80e1] __btrfs_drop_extents+0x171/0xad0 [btrfs] [ 1558.852506] [8114b3eb] ? kmem_cache_alloc+0xcb/0x160 [ 1558.858602] [a05a9443] btrfs_drop_extents+0x73/0xa0 [btrfs] [ 1558.865457] [a059a1c0] insert_reserved_file_extent.clone.0+0x80/0x2a0 [btrfs] [ 1558.873999] [a059640b] ? start_transaction+0x3cb/0x450 [btrfs] [ 1558.881239] [a05a4bc9] btrfs_finish_ordered_io+0x339/0x4d0 [btrfs] [ 1558.888608] [a05a4d75] finish_ordered_fn+0x15/0x20 [btrfs] [ 1558.895352] [a05bfbd2] worker_loop+0x1a2/0x400 [btrfs] [ 1558.901753] [a05bfa30] ? check_pending_worker_creates+0xe0/0xe0 [btrfs] [ 1558.910128] [a05bfa30] ? check_pending_worker_creates+0xe0/0xe0 [btrfs] [ 1558.918228] [81065b66] kthread+0x96/0xa0 [ 1558.923339] [81495c24] kernel_thread_helper+0x4/0x10 [ 1558.929821] [8148ca46] ? retint_restore_args+0xe/0xe [ 1558.935920] [81065ad0] ? __init_kthread_worker+0x40/0x40 [ 1558.942694] [81495c20] ? 
gs_change+0xb/0xb [ 1558.948170] INFO: task flush-btrfs-21:31685 blocked for more than 120 seconds. [ 1558.955408] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 1558.963274] flush-btrfs-21 D 8160b9c0 0 31685 2 0x [ 1558.970602] 8806074d3470 0046 0002 8806074d3fd8 [ 1558.978333] 8806074d2010 8806074d2000 8806074d2000 8806074d2000 [ 1558.986052] 8806074d3fd8 8806074d2000 8806245aae80 8804cf048000 [ 1558.993744] Call Trace: [ 1558.996238] [8148b51d] schedule+0x5d/0x60 [ 1559.001243] [a05c7423] btrfs_tree_read_lock+0xf3/0x140 [btrfs] [ 1559.008057] [810660a0] ? wake_up_bit+0x40/0x40 [ 1559.013554] [a056cf53] btrfs_read_lock_root_node+0x23/0x50 [btrfs] [ 1559.020788] [a0575000] btrfs_search_slot+0x1f0/0x730 [btrfs] [ 1559.027451] [a05aa4c7] ? free_extent_map+0x87/0x90 [btrfs] [ 1559.033957] [a0588dbd] btrfs_lookup_file_extent+0x3d/0x40 [btrfs] [ 1559.041062] [a05a80e1] __btrfs_drop_extents+0x171/0xad0 [btrfs] [ 1559.047977] [8114b3eb] ? kmem_cache_alloc+0xcb/0x160 [ 1559.053948] [a05a9443] btrfs_drop_extents+0x73/0xa0 [btrfs] [ 1559.060698] [a059cc85] cow_file_range_inline+0xe5/0x1c0 [btrfs] [ 1559.067807] [a059ced2] cow_file_range+0x172/0x4a0 [btrfs] [ 1559.074296] [a059f27b] run_delalloc_range+0x7b/0xa0 [btrfs] [ 1559.080967] [a05b406d] __extent_writepage+0x22d/0x740 [btrfs] [ 1559.087868] [a0597757] ? btrfs_add_delayed_iput+0x77/0xe0 [btrfs] [ 1559.095034] [810fcad8] ? find_get_pages_tag+0x148/0x190 [ 1559.101329] [a05b4972] extent_write_cache_pages.clone.3+0x242/0x3d0 [btrfs] [ 1559.109465] [a05b4b47] extent_writepages+0x47/0x60 [btrfs] [ 1559.116008] [a05a2790] ? btrfs_update_time+0xb0/0xb0 [btrfs] [ 1559.122716] [a0599987] btrfs_writepages+0x27/0x30 [btrfs] [ 1559.129211] [81108a03] do_writepages+0x23/0x40 [ 1559.134701] [8117f94b] __writeback_single_inode+0x4b/0x180 [ 1559.141198] [81065f77] ? 
bit_waitqueue+0x17/0xc0 [ 1559.147201] [811820f6] writeback_sb_inodes+0x286/0x390 [ 1559.153365] [81182286] __writeback_inodes_wb+0x86/0xd0 [ 1559.159509] [811824fb] wb_writeback+0x18b/0x320 [ 1559.165082] [81182884] wb_do_writeback+0x1f4/0x290 [ 1559.170988] [814897b2] ? schedule_timeout+0x1d2/0x240 [ 1559.177249] [81182a03] bdi_writeback_thread+0xe3/0x2c0 [ 1559.183685] [81182920] ? wb_do_writeback+0x290/0x290 [ 1559.189740] [81182920] ? wb_do_writeback+0x290/0x290 [ 1559.195878]
Re: [PATCH v4 0/4] btrfs: extended inode refs
On Mon, Aug 20, 2012 at 02:29:17PM -0600, Mark Fasheh wrote: Testing wise, the basic namespace operations work well (link, unlink, etc). The rest has gotten less debugging (and I really don't have a great way of testing the code in tree-log.c) Attached to this e-mail are btrfs-progs patches which make testing of the changes possible. Hi Mark, I hit a few problems testing this, so I have the patch below that I plan on folding into your commits (to keep bisect from crashing in tree log). Just let me know if this is a problem, or if you see any bugs in there. I'm still doing a last round of checks on it, but I wanted to send along early for comments. The biggest change in here is to always check the ref_objectid when returning a backref. Hash collisions mean we may return a ref for a completely different parent id otherwise. I think I caught all the places missing that logic, but please double check me. Other than that I went through and fixed up bugs in tree-log.c. __add_inode_ref had a bunch of cut and paste errors, and you carefully preserved a huge use-after-free bug in the original add_inode_ref. 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f2e7e6..50dcd0f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3231,7 +3231,8 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
 			  u64 inode_objectid, u64 ref_objectid, int ins_len,
 			  int cow);
 
-int btrfs_find_name_in_ext_backref(struct btrfs_path *path, const char *name,
+int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
+				   u64 ref_objectid, const char *name,
 				   int name_len,
 				   struct btrfs_inode_extref **extref_ret);
 
diff --git a/fs/btrfs/inode-item.c b/fs/btrfs/inode-item.c
index ad11b30..48b8fda 100644
--- a/fs/btrfs/inode-item.c
+++ b/fs/btrfs/inode-item.c
@@ -51,8 +51,8 @@ static int find_name_in_backref(struct btrfs_path *path, const char *name,
 	return 0;
 }
 
-int btrfs_find_name_in_ext_backref(struct btrfs_path *path, const char *name,
-				   int name_len,
+int btrfs_find_name_in_ext_backref(struct btrfs_path *path, u64 ref_objectid,
+				   const char *name, int name_len,
 				   struct btrfs_inode_extref **extref_ret)
 {
 	struct extent_buffer *leaf;
@@ -78,8 +78,9 @@ int btrfs_find_name_in_ext_backref(struct btrfs_path *path, const char *name,
 		name_ptr = (unsigned long)(&extref->name);
 		ref_name_len = btrfs_inode_extref_name_len(leaf, extref);
 
-		if (ref_name_len == name_len &&
-		    (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) {
+		if (ref_name_len == name_len &&
+		    btrfs_inode_extref_parent(leaf, extref) == ref_objectid &&
+		    (memcmp_extent_buffer(leaf, name, name_ptr, name_len) == 0)) {
 			if (extref_ret)
 				*extref_ret = extref;
 			return 1;
@@ -138,7 +139,7 @@ btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
 		return ERR_PTR(ret);
 	if (ret > 0)
 		return NULL;
-	if (!btrfs_find_name_in_ext_backref(path, name, name_len, &extref))
+	if (!btrfs_find_name_in_ext_backref(path, ref_objectid, name, name_len, &extref))
 		return NULL;
 	return extref;
 }
@@ -218,7 +219,8 @@ int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
 	 * This should always succeed so error here will make the FS
 	 * readonly.
 	 */
-	if (!btrfs_find_name_in_ext_backref(path, name, name_len, &extref)) {
+	if (!btrfs_find_name_in_ext_backref(path, ref_objectid,
+					    name, name_len, &extref)) {
 		btrfs_std_error(root->fs_info, -ENOENT);
 		ret = -EROFS;
 		goto out;
@@ -355,7 +357,8 @@ static int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
 
 	ret = btrfs_insert_empty_item(trans, root, path, &key, ins_len);
 	if (ret == -EEXIST) {
-		if (btrfs_find_name_in_ext_backref(path, name, name_len, NULL))
+		if (btrfs_find_name_in_ext_backref(path, ref_objectid,
+						   name, name_len, NULL))
 			goto out;
 
 		btrfs_extend_item(trans, root, path, ins_len);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index baf8be2..1d7b348 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -745,6 +745,7 @@ out:
  */
 static noinline int backref_in_log(struct btrfs_root *log,
 				   struct btrfs_key *key,
+				   u64 ref_objectid,
 				   char *name, int namelen)
 {
Re: [PATCH 2/2 v3] Btrfs: snapshot-aware defrag
On 09/26/2012 01:39 AM, Mitch Harder wrote:
> On Mon, Sep 17, 2012 at 4:58 AM, Liu Bo bo.li@oracle.com wrote:
>> This comes from one of btrfs's project ideas. As we defragment files, we
>> break any sharing from other snapshots. The balancing code will preserve
>> the sharing, and defrag needs to grow this as well. Now we're able to fill
>> the blank with this patch, in which we make full use of the backref
>> walking stuff.
>>
>> Here is the basic idea:
>> o Set the writeback ranges started by defragment with flag EXTENT_DEFRAG.
>> o At endio, after we finish updating the fs tree, we use backref walking
>>   to find all parents of the ranges and re-link them with the new COWed
>>   file layout by adding corresponding backrefs.
>>
>> Original patch by Li Zefan l...@cn.fujitsu.com
>> Signed-off-by: Liu Bo bo.li@oracle.com
>
> I'm hitting the WARN_ON in record_extent_backrefs() indicating a problem
> with the return value from iterate_inodes_from_logical().
>
> [ 6865.184782] ------------[ cut here ]------------
> [ 6865.184819] WARNING: at fs/btrfs/inode.c:2062 record_extent_backrefs+0xe5/0xe7 [btrfs]()
> [ 6865.184823] Hardware name: OptiPlex 745
> [ 6865.184825] Modules linked in: lpc_ich mfd_core xts gf128mul cryptd aes_x86_64 sha256_generic btrfs libcrc32c
> [ 6865.184841] Pid: 4239, comm: btrfs-endio-wri Not tainted 3.5.4-git-local+ #1
> [ 6865.184844] Call Trace:
> [ 6865.184856] [81031d6a] warn_slowpath_common+0x74/0xa2
> [ 6865.184862] [81031db2] warn_slowpath_null+0x1a/0x1c
> [ 6865.184884] [a003356b] record_extent_backrefs+0xe5/0xe7 [btrfs]
> [ 6865.184908] [a003cf3a] btrfs_finish_ordered_io+0x131/0xa4b [btrfs]
> [ 6865.184930] [a003d869] finish_ordered_fn+0x15/0x17 [btrfs]
> [ 6865.184951] [a005882f] worker_loop+0x145/0x516 [btrfs]
> [ 6865.184959] [81059727] ? __wake_up_common+0x54/0x84
> [ 6865.184983] [a00586ea] ? btrfs_queue_worker+0x2d3/0x2d3 [btrfs]
> [ 6865.184989] [810516bb] kthread+0x93/0x98
> [ 6865.184996] [817d7934] kernel_thread_helper+0x4/0x10
> [ 6865.185001] [81051628] ? kthread_freezable_should_stop+0x6a/0x6a
> [ 6865.185021] [817d7930] ? gs_change+0xb/0xb
> [ 6865.185025] ---[ end trace 26cc0e186efc79d8 ]---
>
> I'm testing a 3.5.4 kernel merged with the 3.6_rc patchset as well as the
> send_recv patches and most of the btrfs-next patches. I'm running into this
> issue when mounting with autodefrag and running some snapshot tests. This
> may be related to a problem elsewhere, because I've been encountering other
> backref issues even before testing this patch.

Oh, will look into it, thanks for the report.

thanks,
liubo
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 02/10] vfs: add support for updating access frequency
Thanks a lot for your review, Dave; it is very helpful to me.

On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner da...@fromorbit.com wrote:
> On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>>
>> Add some util helpers to update access frequencies for one file or its
>> range.
>>
>> Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
>> ---
>>  fs/hot_tracking.c | 359 +
>>  fs/hot_tracking.h |  15 +++
>>  2 files changed, 374 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>> index 173054b..52ed926 100644
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -106,6 +106,365 @@ inode_err:
>>  }
>>
>>  /*
>> + * Drops the reference out on hot_inode_item by one and free the structure
>> + * if the reference count hits zero
>> + */
>> +void hot_rb_free_hot_inode_item(struct hot_inode_item *he)
>
> hot_inode_item_put()
>
>> +{
>> +    if (!he)
>> +        return;
>
> It's a bug to call a put function on a kref counted item with a null
> pointer. Let the kernel crash so it is noticed and fixed.

OK, will remove it.

>> +
>> +    if (atomic_dec_and_test(&he->refs.refcount)) {
>> +        WARN_ON(he->in_tree);
>> +        kmem_cache_free(hot_inode_item_cache, he);
>> +    }
>
> Isn't this abusing krefs? i.e. this should be:

Sorry, and thanks for your explanation, as below:

> hot_inode_item_free()
> {
>     WARN_ON(he->in_tree);
>     kmem_cache_free(hot_inode_item_cache, he);
> }
>
> hot_inode_item_put()
> {
>     kref_put(&he->refs, hot_inode_item_free);
> }

>> +/*
>> + * Drops the reference out on hot_range_item by one and free the structure
>> + * if the reference count hits zero
>> + */
>> +static void hot_rb_free_hot_range_item(struct hot_range_item *hr)
>
> same comments as above.

OK, thanks.

>> +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root,
>> +                        unsigned long inode_num,
>> +                        struct rb_node *node)
>
> static struct rb_node *
> hot_inode_item_find(....)

OK, thanks.

>> +{
>> +    struct rb_node **p = &root->rb_node;
>> +    struct rb_node *parent = NULL;
>> +    struct hot_inode_item *entry;
>> +
>> +    /* walk tree to find insertion point */
>> +    while (*p) {
>> +        parent = *p;
>> +        entry = rb_entry(parent, struct hot_inode_item, rb_node);
>> +
>> +        if (inode_num < entry->i_ino)
>> +            p = &(*p)->rb_left;
>> +        else if (inode_num > entry->i_ino)
>> +            p = &(*p)->rb_right;
>> +        else
>> +            return parent;
>> +    }
>> +
>> +    entry = rb_entry(node, struct hot_inode_item, rb_node);
>> +    entry->in_tree = 1;
>> +    rb_link_node(node, parent, p);
>> +    rb_insert_color(node, root);
>
> So the hot inode item search key is the inode number?

Yes.

> Why use an rb-tree then? Wouldn't a btree be a more memory efficient way to
> hold a sparse index that has fixed key and pointer sizes?

Yes, I know, but if we don't use a btree, what will be better? A radix tree?

> Also, the API seems quite strange. You pass in the rb_node and the inode
> number instead of passing in the hot inode item that already holds this
> information. You then convert the rb_node back to a hot inode item to set
> the in_tree variable. So why not just pass in the struct hot_inode_item in
> the first place?

Good catch, thanks for your reminder.

>> +static u64 hot_rb_range_end(struct hot_range_item *hr)
>
> hot_range_item_end()

OK.

>> +{
>> +    if (hr->start + hr->len < hr->start)
>> +        return (u64)-1;
>> +
>> +    return hr->start + hr->len - 1;
>> +}
>> +
>> +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root,
>
> hot_range_item_find()

OK.

>> +                        u64 start,
>> +                        struct rb_node *node)
>> +{
>> +    struct rb_node **p = &root->rb_node;
>> +    struct rb_node *parent = NULL;
>> +    struct hot_range_item *entry;
>> +
>> +    /* ensure start is on a range boundary */
>> +    start = start & RANGE_SIZE_MASK;
>> +    /* walk tree to find insertion point */
>> +    while (*p) {
>> +        parent = *p;
>> +        entry = rb_entry(parent, struct hot_range_item, rb_node);
>> +
>> +        if (start < entry->start)
>> +            p = &(*p)->rb_left;
>> +        else if (start >= hot_rb_range_end(entry))
>
> Shouldn't an aligned end always be one byte short of the start offset of
> the next aligned region? i.e. start == hot_rb_range_end(entry) is an
> indication of an off-by-one bug somewhere?

This is really one good catch, thanks.

>> +            p = &(*p)->rb_right;
>> +        else
>> +            return parent;
>> +    }
>> +
>> +    entry = rb_entry(node, struct hot_range_item, rb_node);
>> +    entry->in_tree = 1;
>> +    rb_link_node(node, parent, p);
>> +    rb_insert_color(node, root);
>
> Same comments as the hot_inode_range.

OK.

> Also, the start offset in the
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote:
> On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>>
>> Introduce one new mount option '-o hottrack', and add its parsing support.
>> Its usage looks like:
>>   mount -o hottrack
>>   mount -o nouser,hottrack
>>   mount -o nouser,hottrack,loop
>>   mount -o hottrack,nouser
>
> I think that this option parsing should be done by the filesystem, even
> though the tracking functionality is in the VFS. That way only the
> filesystems that can use the tracking information will turn it on, rather
> than being able to turn it on for everything regardless of whether it is
> useful or not.
>
> Along those lines, just using a normal superblock flag to indicate it is
> active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you don't need to
> allocate the sb->s_hot_info structure just to be able to check whether we
> are tracking hot inodes or not.
>
> This then means the hot inode tracking for the superblock can be
> initialised by the filesystem as part of its fill_super method, along with
> the filesystem specific code that will use the hot tracking information
> the VFS gathers.

I can see what you mean, but I don't know if other guys also agree with this. If so, all FS-specific code which uses the hot tracking feature will have to add the same chunk of code in its fill_super method. Is that good?

> Cheers,
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com

--
Regards,
Zhi Yong Wu
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Wed, Sep 26, 2012 at 1:14 AM, Goffredo Baroncelli kreij...@libero.it wrote:
> On 09/25/2012 12:14 PM, David Sterba wrote:
>> On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote:
>>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>>>
>>> Some code paths forget to free memory on exit.
>>
>> Same as with the fd's, the kernel will free all memory for us at exit().
>
> I strongly disagree with this approach. The callee often doesn't know what
> happens after and before the call. The same is true for the programmer,
> because the code is quite often updated by several people. A clean exit()
> is the right thing to do as a general rule. I don't see any valid reason
> (in the btrfs context) to do otherwise. Relying on exit() for a proper
> clean-up increases the likelihood of bugs when the code evolves (see my
> patch [RESPOST][BTRFS-PROGS][PATCH] btrfs_read_dev_super(): uninitialized
> variable for an example of what an incorrect deallocation of resources
> means).
>
>> If there's lots of memory allocated, it may be even faster to leave the
>> deallocation to the kernel as it will do it in one go, while the
>> application would unnecessarily free it chunk by chunk.
>>
>> david
>
> Maybe I am wrong, but I don't think the increase in speed of the btrfs
> command is even measurable relying on exit() instead of free()-ing each
> chunk of memory one at a time. The same should be true for open()/close().

I fully agree with you. In one same function, I find that some code paths free system resources while other code paths don't. This is one nice way.

> My 2¢
> BR
> G.Baroncelli

--
Regards,
Zhi Yong Wu
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 6:20 PM, Ram Pai linux...@us.ibm.com wrote:
> On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>>
>> One root structure hot_info is defined and hooked up in super_block; it
>> will be used to hold the rb tree roots, the hash list roots and some other
>> information, etc.
>>
>> Adds a hot_inode_tree struct to keep track of frequently accessed files,
>> keyed by {inode, offset}. Trees contain hot_inode_items representing those
>> files and ranges. Having these trees means that the vfs can quickly
>> determine the temperature of some data by doing some calculations on the
>> hot_freq_data struct that hangs off of the tree item.
>>
>> Define two items, hot_inode_item and hot_range_item: one of them
>> represents one tracked file to keep track of its access frequency and the
>> tree of ranges in this file, while the latter represents a file range of
>> one inode. Each of the two structures contains a hot_freq_data struct with
>> its frequency-of-access metrics (number of {reads, writes}, last
>> {read, write} time, frequency of {reads, writes}). Also, each
>> hot_inode_item contains one hot_range_tree struct which is keyed by
>> {inode, offset, length} and used to keep track of all the ranges in this
>> file.
>>
>> Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
>> ---
>> ..snip..
>>
>> +/* A tree that sits on the hot_info */
>> +struct hot_inode_tree {
>> +    struct rb_root map;
>> +    rwlock_t lock;
>> +};
>> +
>> +/* A tree of ranges for each inode in the hot_inode_tree */
>> +struct hot_range_tree {
>> +    struct rb_root map;
>> +    rwlock_t lock;
>> +};
>
> Can as well have a generic data structure called hot_tree instead of having
> two different data structures which basically are the same.

OK.

>> +
>> +/* A frequency data struct holds values that are used to
>> + * determine temperature of files and file ranges. These structs
>> + * are members of hot_inode_item and hot_range_item
>> + */
>> +struct hot_freq_data {
>> +    struct timespec last_read_time;
>> +    struct timespec last_write_time;
>> +    u32 nr_reads;
>> +    u32 nr_writes;
>> +    u64 avg_delta_reads;
>> +    u64 avg_delta_writes;
>> +    u8 flags;
>> +    u32 last_temperature;
>> +};
>> +
>> +/* An item representing an inode and its access frequency */
>> +struct hot_inode_item {
>> +    /* node for hot_inode_tree rb_tree */
>> +    struct rb_node rb_node;
>> +    /* tree of ranges in this inode */
>> +    struct hot_range_tree hot_range_tree;
>> +    /* frequency data for this inode */
>> +    struct hot_freq_data hot_freq_data;
>> +    /* inode number, copied from inode */
>> +    unsigned long i_ino;
>> +    /* used to check for errors in ref counting */
>> +    u8 in_tree;
>> +    /* protects hot_freq_data, i_no, in_tree */
>> +    spinlock_t lock;
>> +    /* prevents kfree */
>> +    struct kref refs;
>> +};
>> +
>> +/*
>> + * An item representing a range inside of an inode whose frequency
>> + * is being tracked
>> + */
>> +struct hot_range_item {
>> +    /* node for hot_range_tree rb_tree */
>> +    struct rb_node rb_node;
>> +    /* frequency data for this range */
>> +    struct hot_freq_data hot_freq_data;
>> +    /* the hot_inode_item associated with this hot_range_item */
>> +    struct hot_inode_item *hot_inode;
>> +    /* starting offset of this range */
>> +    u64 start;
>> +    /* length of this range */
>> +    u64 len;
>> +    /* used to check for errors in ref counting */
>> +    u8 in_tree;
>> +    /* protects hot_freq_data, start, len, and in_tree */
>> +    spinlock_t lock;
>> +    /* prevents kfree */
>> +    struct kref refs;
>> +};
>
> Might as well have just one generic data structure called hot_item with all
> the common fields, and then have
>
> struct hot_inode_item {
>     struct hot_item hot_inode;
>     struct hot_tree hot_range_tree;
>     unsigned long i_ino;
> }
>
> and
>
> struct hot_range_item {
>     struct hot_item hot_range;
>     u64 start;
>     u64 len;    /* length of this range */
> }
>
> This should help you eliminate some duplicate code as well.

OK, I will try to apply them. thanks.
> RP

--
Regards,
Zhi Yong Wu
Re: [RFC v2 05/10] vfs: introduce one hash table
On Tue, Sep 25, 2012 at 5:54 PM, Ram Pai linux...@us.ibm.com wrote:
> On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu wu...@linux.vnet.ibm.com
>>
>> Adds a hash table structure which contains many hash lists and is used to
>> efficiently look up the data temperature of a file or its ranges. In each
>> hash list of the hash table, the hash node will keep track of temperature
>> info.
>>
>> Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com
>> ---
>>  fs/hot_tracking.c            |  77 -
>>  include/linux/hot_tracking.h |  35 +++
>>  2 files changed, 110 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>> index fa89f70..5f96442 100644
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -16,6 +16,7 @@
>>  #include <linux/module.h>
>>  #include <linux/spinlock.h>
>>  #include <linux/hardirq.h>
>> +#include <linux/hash.h>
>>  #include <linux/fs.h>
>>  #include <linux/blkdev.h>
>>  #include <linux/types.h>
>> @@ -24,6 +25,9 @@
>>
>> ...snip...
>>
>> +/* Hash list heads for hot hash table */
>> +struct hot_hash_head {
>> +    struct hlist_head hashhead;
>> +    rwlock_t rwlock;
>> +    u32 temperature;
>> +};
>> +
>> +/* Nodes stored in each hash list of hash table */
>> +struct hot_hash_node {
>> +    struct hlist_node hashnode;
>> +    struct list_head node;
>> +    struct hot_freq_data *hot_freq_data;
>> +    struct hot_hash_head *hlist;
>> +    spinlock_t lock; /* protects hlist */
>> +
>> +    /*
>> +     * number of references to this node
>> +     * equals 1 (hashlist entry)
>> +     */
>> +    struct kref refs;
>> +};
>
> Don't see why you need yet another data structure to hash the inode_item
> and the range_item into a hash list. You can just add another

With such a structure, we can rapidly know which hash bucket one inode_item is currently linked to, via its hlist field. This is useful when one inode_item needs to move from one bucket to another as its file gets colder or hotter. Does it make sense to you?

> hlist_node in the inode_item and range_item. This field can then be used
> to link into the corresponding hash list. You can use container_of() to
> get to the inode_item or the range_item using the hlist_node field. You
> can thus eliminate a lot of code.

As I said above, I don't think it can eliminate a lot of code.

>> + /* An item representing an inode and its access frequency */
>>  struct hot_inode_item {
>>      /* node for hot_inode_tree rb_tree */
>> @@ -68,6 +93,8 @@ struct hot_inode_item {
>>      spinlock_t lock;
>>      /* prevents kfree */
>>      struct kref refs;
>> +    /* hashlist node for this inode */
>> +    struct hot_hash_node *heat_node;
>
> this can be just
>     struct hlist_node heat_node; /* lookup hot_inode hash list */
> Use this field to link it into the corresponding hashlist.
>
>>  };
>>
>> @@ -91,6 +118,8 @@ struct hot_range_item {
>>      spinlock_t lock;
>>      /* prevents kfree */
>>      struct kref refs;
>> +    /* hashlist node for this range */
>> +    struct hot_hash_node *heat_node;
>
> this can be just
>     struct hlist_node heat_node; /* lookup hot_range hash list */
>
>>  };
>>
>>  struct hot_info {
>> @@ -98,6 +127,12 @@ struct hot_info {
>>      /* red-black tree that keeps track of fs-wide hot data */
>>      struct hot_inode_tree hot_inode_tree;
>> +
>> +    /* hash map of inode temperature */
>> +    struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE];
>> +
>> +    /* hash map of range temperature */
>> +    struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE];
>>  };
>>
>>  #endif /* _LINUX_HOTTRACK_H */

--
Regards,
Zhi Yong Wu