Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
Hi,

On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>
> Add some util helpers to update access frequencies
> for one file or its range.
>
> Signed-off-by: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
> ---
>  fs/hot_tracking.c            |  179 ++
>  fs/hot_tracking.h            |    7 ++
>  include/linux/hot_tracking.h |    2 +
>  3 files changed, 188 insertions(+), 0 deletions(-)
>
> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> index 68591f0..0a7d9a3 100644
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root)
>  	}
>  }
>
> +struct hot_inode_item
> +*hot_inode_item_find(struct hot_info *root, u64 ino)
> +{
> +	struct hot_inode_item *he;
> +	int ret;
> +
> +again:
> +	spin_lock(&root->lock);
> +	he = radix_tree_lookup(&root->hot_inode_tree, ino);
> +	if (he) {
> +		kref_get(&he->hot_inode.refs);
> +		spin_unlock(&root->lock);
> +		return he;
> +	}
> +	spin_unlock(&root->lock);
> +
> +	he = kmem_cache_zalloc(hot_inode_item_cachep,
> +			GFP_KERNEL | GFP_NOFS);

This doesn't look quite right... which of these two did you mean? I
assume probably just GFP_NOFS

> +	if (!he)
> +		return ERR_PTR(-ENOMEM);
> +
> +	hot_inode_item_init(he, ino, &root->hot_inode_tree);
> +
> +	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
> +	if (ret) {
> +		kmem_cache_free(hot_inode_item_cachep, he);
> +		return ERR_PTR(ret);
> +	}
> +
> +	spin_lock(&root->lock);
> +	ret = radix_tree_insert(&root->hot_inode_tree, ino, he);
> +	if (ret == -EEXIST) {
> +		kmem_cache_free(hot_inode_item_cachep, he);
> +		spin_unlock(&root->lock);
> +		radix_tree_preload_end();
> +		goto again;
> +	}
> +	spin_unlock(&root->lock);
> +	radix_tree_preload_end();
> +
> +	kref_get(&he->hot_inode.refs);
> +	return he;
> +}
> +EXPORT_SYMBOL_GPL(hot_inode_item_find);
> +
> +static struct hot_range_item
> +*hot_range_item_find(struct hot_inode_item *he,
> +			u32 start)
> +{
> +	struct hot_range_item *hr;
> +	int ret;
> +
> +again:
> +	spin_lock(&he->lock);
> +	hr = radix_tree_lookup(&he->hot_range_tree, start);
> +	if (hr) {
> +		kref_get(&hr->hot_range.refs);
> +		spin_unlock(&he->lock);
> +		return hr;
> +	}
> +	spin_unlock(&he->lock);
> +
> +	hr = kmem_cache_zalloc(hot_range_item_cachep,
> +			GFP_KERNEL | GFP_NOFS);

Likewise, here too.

Steve.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
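The review comment above can be illustrated with a small userspace sketch. The bit values and the `DEMO_` names are hypothetical stand-ins, not the kernel's definitions; the only property assumed from kernels of that era is that GFP_KERNEL is GFP_NOFS plus the bit that allows calling back into the filesystem. Under that assumption, OR-ing the two is simply GFP_KERNEL again, so the NOFS restriction is silently lost.

```c
#include <stdio.h>

/* Hypothetical bit values, for illustration only (not the kernel's). */
#define DEMO_GFP_WAIT 0x1u	/* allocation may sleep */
#define DEMO_GFP_IO   0x2u	/* allocation may start block I/O */
#define DEMO_GFP_FS   0x4u	/* allocation may call into the filesystem */

#define DEMO_GFP_NOFS   (DEMO_GFP_WAIT | DEMO_GFP_IO)
#define DEMO_GFP_KERNEL (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)

/* Nonzero if an allocation with these flags could recurse into the fs. */
static int demo_may_enter_fs(unsigned int flags)
{
	return (flags & DEMO_GFP_FS) != 0;
}

/* Print one row of the comparison. */
static void demo_show(const char *name, unsigned int flags)
{
	printf("%-28s 0x%x may_enter_fs=%d\n",
	       name, flags, demo_may_enter_fs(flags));
}

/* Show that the OR of the two masks is no safer than GFP_KERNEL alone. */
static void demo_compare(void)
{
	demo_show("DEMO_GFP_NOFS", DEMO_GFP_NOFS);
	demo_show("DEMO_GFP_KERNEL", DEMO_GFP_KERNEL);
	demo_show("DEMO_GFP_KERNEL|DEMO_GFP_NOFS",
		  DEMO_GFP_KERNEL | DEMO_GFP_NOFS);
}
```

Calling `demo_compare()` from a `main()` prints the three masks side by side; the last two rows are identical, which is why picking just GFP_NOFS is the likely intent.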
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
Hi,

On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>
> Add a per-superblock workqueue and a delayed_work
> to run periodic work to update map info on each superblock.
>
> Signed-off-by: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
> ---
>  fs/hot_tracking.c            |   85 ++
>  fs/hot_tracking.h            |    3 +
>  include/linux/hot_tracking.h |    3 +
>  3 files changed, 91 insertions(+), 0 deletions(-)
>
> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
> index fff0038..0ef9cad 100644
> --- a/fs/hot_tracking.c
> +++ b/fs/hot_tracking.c
> @@ -15,9 +15,12 @@
>  #include <linux/module.h>
>  #include <linux/spinlock.h>
>  #include <linux/hardirq.h>
> +#include <linux/kthread.h>
> +#include <linux/freezer.h>
>  #include <linux/fs.h>
>  #include <linux/blkdev.h>
>  #include <linux/types.h>
> +#include <linux/list_sort.h>
>  #include <linux/limits.h>
>  #include "hot_tracking.h"
>
> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
>  	}
>  }
>
> +/* Temperature compare function*/
> +static int hot_temp_cmp(void *priv, struct list_head *a,
> +				struct list_head *b)
> +{
> +	struct hot_comm_item *ap =
> +			container_of(a, struct hot_comm_item, n_list);
> +	struct hot_comm_item *bp =
> +			container_of(b, struct hot_comm_item, n_list);
> +
> +	int diff = ap->hot_freq_data.last_temp
> +				- bp->hot_freq_data.last_temp;
> +	if (diff > 0)
> +		return -1;
> +	if (diff < 0)
> +		return 1;
> +	return 0;
> +}
> +
> +/*
> + * Every sync period we update temperatures for
> + * each hot inode item and hot range item for aging
> + * purposes.
> + */
> +static void hot_update_worker(struct work_struct *work)
> +{
> +	struct hot_info *root = container_of(to_delayed_work(work),
> +						struct hot_info, update_work);
> +	struct hot_inode_item *hi_nodes[8];
> +	u64 ino = 0;
> +	int i, n;
> +
> +	while (1) {
> +		n = radix_tree_gang_lookup(&root->hot_inode_tree,
> +					   (void **)hi_nodes, ino,
> +					   ARRAY_SIZE(hi_nodes));
> +		if (!n)
> +			break;
> +
> +		ino = hi_nodes[n - 1]->i_ino + 1;
> +		for (i = 0; i < n; i++) {
> +			kref_get(&hi_nodes[i]->hot_inode.refs);
> +			hot_map_array_update(
> +				&hi_nodes[i]->hot_inode.hot_freq_data, root);
> +			hot_range_update(hi_nodes[i], root);
> +			hot_inode_item_put(hi_nodes[i]);
> +		}
> +	}
> +
> +	/* Sort temperature map info */
> +	for (i = 0; i < HEAT_MAP_SIZE; i++) {
> +		list_sort(NULL, &root->heat_inode_map[i].node_list,
> +			hot_temp_cmp);
> +		list_sort(NULL, &root->heat_range_map[i].node_list,
> +			hot_temp_cmp);
> +	}
> +

If this list can potentially have one (or more) entries per inode, then
filesystems with a lot of inodes (millions) may potentially exceed the
max size of list which list_sort() can handle. If that happens it still
works, but you'll get a warning message and it won't be as efficient.

It is something that we've run into with list_sort() and GFS2, but it
only happens very rarely,

Steve.
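The "still works, but less efficiently" behaviour described above can be sketched in userspace with a bottom-up merge sort that keeps a fixed number of partial lists, which is the general shape of list_sort() in kernels of that period. Everything here is an assumption for illustration: the `demo_` names, the singly linked node type, and the tiny `DEMO_MAX_BITS` of 3 (the kernel's constant is much larger). Once every bin is occupied, further input keeps merging into the top bin, so the result stays sorted while the merges become lopsided.

```c
#include <stddef.h>

/* Hypothetical node; last_temp stands in for the tracked temperature. */
struct demo_node {
	int temp;
	struct demo_node *next;
};

#define DEMO_MAX_BITS 3		/* tiny on purpose; balanced up to 2^3 items */

static int demo_overflowed;	/* set where the kernel would print a warning */

/* Stable merge of two ascending lists. */
static struct demo_node *demo_merge(struct demo_node *a, struct demo_node *b)
{
	struct demo_node head, *tail = &head;

	while (a && b) {
		if (a->temp <= b->temp) {
			tail->next = a;
			a = a->next;
		} else {
			tail->next = b;
			b = b->next;
		}
		tail = tail->next;
	}
	tail->next = a ? a : b;
	return head.next;
}

/* Bottom-up merge sort; part[k] holds a sorted run of about 2^k nodes. */
static struct demo_node *demo_sort(struct demo_node *list)
{
	/* The last slot is never filled, so the inner loop always stops. */
	struct demo_node *part[DEMO_MAX_BITS + 1] = { NULL };
	struct demo_node *cur, *result = NULL;
	int lev, max_lev = 0;

	while (list) {
		cur = list;
		list = list->next;
		cur->next = NULL;

		/* Carry upward like binary addition, merging equal-sized runs. */
		for (lev = 0; part[lev]; lev++) {
			cur = demo_merge(part[lev], cur);
			part[lev] = NULL;
		}
		if (lev > max_lev) {
			if (lev >= DEMO_MAX_BITS) {
				/* Out of bins: keep piling onto the top one. */
				demo_overflowed = 1;
				lev--;
			}
			max_lev = lev;
		}
		part[lev] = cur;
	}

	/* Fold the remaining runs together, smallest first. */
	for (lev = 0; lev <= max_lev; lev++)
		if (part[lev])
			result = demo_merge(part[lev], result);
	return result;
}
```

Sorting 20 nodes with this sketch sets `demo_overflowed` but still returns a fully sorted list, matching the observation that exceeding the limit costs efficiency rather than correctness.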
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Mon, Nov 5, 2012 at 7:07 PM, Steven Whitehouse <swhit...@redhat.com> wrote:
> Hi,
>
> On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>
>> Add some util helpers to update access frequencies
>> for one file or its range.
>>
>> Signed-off-by: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>> ---
>>  fs/hot_tracking.c            |  179 ++
>>  fs/hot_tracking.h            |    7 ++
>>  include/linux/hot_tracking.h |    2 +
>>  3 files changed, 188 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>> index 68591f0..0a7d9a3 100644
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root)
>>  	}
>>  }
>>
>> +struct hot_inode_item
>> +*hot_inode_item_find(struct hot_info *root, u64 ino)
>> +{
>> +	struct hot_inode_item *he;
>> +	int ret;
>> +
>> +again:
>> +	spin_lock(&root->lock);
>> +	he = radix_tree_lookup(&root->hot_inode_tree, ino);
>> +	if (he) {
>> +		kref_get(&he->hot_inode.refs);
>> +		spin_unlock(&root->lock);
>> +		return he;
>> +	}
>> +	spin_unlock(&root->lock);
>> +
>> +	he = kmem_cache_zalloc(hot_inode_item_cachep,
>> +			GFP_KERNEL | GFP_NOFS);
> This doesn't look quite right... which of these two did you mean? I
> assume probably just GFP_NOFS
Yes, good catch, thanks.
>
>> +	if (!he)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	hot_inode_item_init(he, ino, &root->hot_inode_tree);
>> +
>> +	ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
>> +	if (ret) {
>> +		kmem_cache_free(hot_inode_item_cachep, he);
>> +		return ERR_PTR(ret);
>> +	}
>> +
>> +	spin_lock(&root->lock);
>> +	ret = radix_tree_insert(&root->hot_inode_tree, ino, he);
>> +	if (ret == -EEXIST) {
>> +		kmem_cache_free(hot_inode_item_cachep, he);
>> +		spin_unlock(&root->lock);
>> +		radix_tree_preload_end();
>> +		goto again;
>> +	}
>> +	spin_unlock(&root->lock);
>> +	radix_tree_preload_end();
>> +
>> +	kref_get(&he->hot_inode.refs);
>> +	return he;
>> +}
>> +EXPORT_SYMBOL_GPL(hot_inode_item_find);
>> +
>> +static struct hot_range_item
>> +*hot_range_item_find(struct hot_inode_item *he,
>> +			u32 start)
>> +{
>> +	struct hot_range_item *hr;
>> +	int ret;
>> +
>> +again:
>> +	spin_lock(&he->lock);
>> +	hr = radix_tree_lookup(&he->hot_range_tree, start);
>> +	if (hr) {
>> +		kref_get(&hr->hot_range.refs);
>> +		spin_unlock(&he->lock);
>> +		return hr;
>> +	}
>> +	spin_unlock(&he->lock);
>> +
>> +	hr = kmem_cache_zalloc(hot_range_item_cachep,
>> +			GFP_KERNEL | GFP_NOFS);
> Likewise, here too.
ditto
>
> Steve.
>

--
Regards,

Zhi Yong Wu
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse <swhit...@redhat.com> wrote:
> Hi,
>
> On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
>> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>
>> Add a per-superblock workqueue and a delayed_work
>> to run periodic work to update map info on each superblock.
>>
>> Signed-off-by: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>> ---
>>  fs/hot_tracking.c            |   85 ++
>>  fs/hot_tracking.h            |    3 +
>>  include/linux/hot_tracking.h |    3 +
>>  3 files changed, 91 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>> index fff0038..0ef9cad 100644
>> --- a/fs/hot_tracking.c
>> +++ b/fs/hot_tracking.c
>> @@ -15,9 +15,12 @@
>>  #include <linux/module.h>
>>  #include <linux/spinlock.h>
>>  #include <linux/hardirq.h>
>> +#include <linux/kthread.h>
>> +#include <linux/freezer.h>
>>  #include <linux/fs.h>
>>  #include <linux/blkdev.h>
>>  #include <linux/types.h>
>> +#include <linux/list_sort.h>
>>  #include <linux/limits.h>
>>  #include "hot_tracking.h"
>>
>> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
>>  	}
>>  }
>>
>> +/* Temperature compare function*/
>> +static int hot_temp_cmp(void *priv, struct list_head *a,
>> +				struct list_head *b)
>> +{
>> +	struct hot_comm_item *ap =
>> +			container_of(a, struct hot_comm_item, n_list);
>> +	struct hot_comm_item *bp =
>> +			container_of(b, struct hot_comm_item, n_list);
>> +
>> +	int diff = ap->hot_freq_data.last_temp
>> +				- bp->hot_freq_data.last_temp;
>> +	if (diff > 0)
>> +		return -1;
>> +	if (diff < 0)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Every sync period we update temperatures for
>> + * each hot inode item and hot range item for aging
>> + * purposes.
>> + */
>> +static void hot_update_worker(struct work_struct *work)
>> +{
>> +	struct hot_info *root = container_of(to_delayed_work(work),
>> +						struct hot_info, update_work);
>> +	struct hot_inode_item *hi_nodes[8];
>> +	u64 ino = 0;
>> +	int i, n;
>> +
>> +	while (1) {
>> +		n = radix_tree_gang_lookup(&root->hot_inode_tree,
>> +					   (void **)hi_nodes, ino,
>> +					   ARRAY_SIZE(hi_nodes));
>> +		if (!n)
>> +			break;
>> +
>> +		ino = hi_nodes[n - 1]->i_ino + 1;
>> +		for (i = 0; i < n; i++) {
>> +			kref_get(&hi_nodes[i]->hot_inode.refs);
>> +			hot_map_array_update(
>> +				&hi_nodes[i]->hot_inode.hot_freq_data, root);
>> +			hot_range_update(hi_nodes[i], root);
>> +			hot_inode_item_put(hi_nodes[i]);
>> +		}
>> +	}
>> +
>> +	/* Sort temperature map info */
>> +	for (i = 0; i < HEAT_MAP_SIZE; i++) {
>> +		list_sort(NULL, &root->heat_inode_map[i].node_list,
>> +			hot_temp_cmp);
>> +		list_sort(NULL, &root->heat_range_map[i].node_list,
>> +			hot_temp_cmp);
>> +	}
>> +
>
> If this list can potentially have one (or more) entries per inode, then
Only one hot_inode_item per inode, but possibly multiple
hot_range_items per inode.
> filesystems with a lot of inodes (millions) may potentially exceed the
> max size of list which list_sort() can handle. If that happens it still
> works, but you'll get a warning message and it won't be as efficient.
I haven't done such a large-scale test. To find that issue we would
need a large-scale performance test; before that, I want to make sure
the code change is correct.
To be honest, I share your concern about that issue. But list_sort()
performance looks good from the test results at the following URL:
https://lkml.org/lkml/2010/1/20/485
>
> It is something that we've run into with list_sort() and GFS2, but it
> only happens very rarely,
Besides list_sort(), do you have any other approach to share? How does
GFS2 resolve this concern?
>
> Steve.
>

--
Regards,

Zhi Yong Wu
Corruption at start of files
Here is what I see in my kern.log (see below).

For me this first happened when the filesystem was close to full (less than 1GB left), but someone on the irc channel mentioned a similar problem on suspend to ram.

The files that have checksum failures end up with their first 4k filled with 0x01 bytes. They were seeing a lot of writes; things like firefox session data and cookie data, plus files that disappeared before I could call inode-resolve on them.

I was running 3.6.3 when this happened; I've upgraded to -rcs since but I haven't tried to reproduce the bug deliberately. I didn't see relevant changes in the changelog.

Oct 31 17:06:31 moulinex kernel: [93539.008465] BTRFS warning (device dm-16): Aborting unused transaction.
Oct 31 17:06:31 moulinex kernel: [93539.011257] BTRFS warning (device dm-16): Aborting unused transaction.
Oct 31 17:06:31 moulinex kernel: [93539.017137] BTRFS warning (device dm-16): Aborting unused transaction.
Oct 31 17:06:46 moulinex kernel: [93554.728793] use_block_rsv: 16 callbacks suppressed
Oct 31 17:06:46 moulinex kernel: [93554.728795] btrfs: block rsv returned -28
Oct 31 17:06:46 moulinex kernel: [93554.728796] ------------[ cut here ]------------
Oct 31 17:06:46 moulinex kernel: [93554.728818] WARNING: at /home/apw/COD/linux/fs/btrfs/extent-tree.c:6323 use_block_rsv+0x19f/0x1b0 [btrfs]()
Oct 31 17:06:46 moulinex kernel: [93554.728819] Hardware name: System Product Name
Oct 31 17:06:46 moulinex kernel: [93554.728820] Modules linked in: snd_seq_dummy vhost_net macvtap macvlan xt_recent bnep rfcomm bluetooth snd_hrtimer nls_utf8 sch_fq_codel ebtable_nat ebtables xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat bridge stp llc ppdev lp parport deflate ctr twofish_generic twofish_x86_64_3way twofish_x86_64 twofish_common camellia_generic camellia_x86_64 serpent_sse2_x86_64 glue_helper lrw serpent_generic xts gf128mul blowfish_generic blowfish_x86_64 blowfish_common cast5 des_generic xcbc rmd160 sha512_generic crypto_null af_key xfrm_algo binfmt_misc dm_crypt snd_hda_codec_hdmi snd_hda_codec_realtek eeepc_wmi asus_wmi sparse_keymap coretemp kvm_intel kvm dm_multipath scsi_dh microcode arc4 joydev snd_hda_intel snd_hda_codec snd_hwdep snd_pcm rt61pci rt2x00pci rt2x00lib snd_seq_midi snd_rawmidi mac80211 snd_seq_midi_event snd_seq snd_timer snd_seq_device snd cfg80211 soundcore snd_page_alloc eeprom_93cx6 serio_raw lpc_ich mei mac_hid k8temp hwmon_vid i2c_nforce2 firewire_sbp2 firew
Oct 31 17:06:46 moulinex kernel: ire_core crc_itu_t psmouse ip6t_REJECT xt_hl ip6t_rt nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT xt_multiport xt_limit xt_tcpudp xt_addrtype xt_state ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack iptable_filter ip_tables x_tables btrfs zlib_deflate libcrc32c raid10 raid0 multipath linear raid456 async_pq async_xor xor async_memcpy async_raid6_recov hid_generic raid6_pq async_tx hid_cherry usbhid hid raid1 ghash_clmulni_intel sata_via wmi aesni_intel ablk_helper cryptd aes_x86_64 r8169 i915 drm_kms_helper drm i2c_algo_bit video [last unloaded: ipmi_msghandler]
Oct 31 17:06:46 moulinex kernel: [93554.728873] Pid: 2230, comm: btrfs-endio-wri Tainted: G        W    3.6.3-030603-generic #201210211349
Oct 31 17:06:46 moulinex kernel: [93554.728874] Call Trace:
Oct 31 17:06:46 moulinex kernel: [93554.728880] [81056f6f] warn_slowpath_common+0x7f/0xc0
Oct 31 17:06:46 moulinex kernel: [93554.728882] [81056fca] warn_slowpath_null+0x1a/0x20
Oct 31 17:06:46 moulinex kernel: [93554.728889] [a01feedf] use_block_rsv+0x19f/0x1b0 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728897] [a020260d] btrfs_alloc_free_block+0x3d/0x220 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728904] [a01ef38d] ? balance_level+0xcd/0x890 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728906] [81332e10] ? rb_insert_color+0x110/0x150
Oct 31 17:06:46 moulinex kernel: [93554.728916] [a022f16c] ? read_extent_buffer+0xbc/0x120 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728918] [81178ebd] ? kmem_cache_alloc_trace+0x12d/0x150
Oct 31 17:06:46 moulinex kernel: [93554.728925] [a01ee3b2] __btrfs_cow_block+0x122/0x4f0 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728927] [81136892] ? set_page_dirty+0x62/0x70
Oct 31 17:06:46 moulinex kernel: [93554.728930] [8169f37e] ? _raw_spin_lock+0xe/0x20
Oct 31 17:06:46 moulinex kernel: [93554.728936] [a01ee87c] btrfs_cow_block+0xfc/0x220 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728943] [a01f29f8] btrfs_search_slot+0x368/0x740 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728951] [a0206e84] btrfs_lookup_csum+0x74/0x190 [btrfs]
Oct 31 17:06:46 moulinex kernel: [93554.728953] [81179cfc] ? kmem_cache_alloc+0x11c/0x150
Oct 31 17:06:46 moulinex kernel: [93554.728960]
Production use with vanilla 3.6.6
Hello list,

Is btrfs ready for production use in 3.6.6? Or should I backport fixes from 3.7-rc? Is it planned to have a stable kernel which will get all btrfs fixes backported?

Greets,
Stefan
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
Hi,

On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote:
> On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse <swhit...@redhat.com> wrote:
>> Hi,
>>
>> On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
>>> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>>
>>> Add a per-superblock workqueue and a delayed_work
>>> to run periodic work to update map info on each superblock.
>>>
>>> Signed-off-by: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>> ---
>>>  fs/hot_tracking.c            |   85 ++
>>>  fs/hot_tracking.h            |    3 +
>>>  include/linux/hot_tracking.h |    3 +
>>>  3 files changed, 91 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>>> index fff0038..0ef9cad 100644
>>> --- a/fs/hot_tracking.c
>>> +++ b/fs/hot_tracking.c
>>> @@ -15,9 +15,12 @@
>>>  #include <linux/module.h>
>>>  #include <linux/spinlock.h>
>>>  #include <linux/hardirq.h>
>>> +#include <linux/kthread.h>
>>> +#include <linux/freezer.h>
>>>  #include <linux/fs.h>
>>>  #include <linux/blkdev.h>
>>>  #include <linux/types.h>
>>> +#include <linux/list_sort.h>
>>>  #include <linux/limits.h>
>>>  #include "hot_tracking.h"
>>>
>>> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
>>>  	}
>>>  }
>>>
>>> +/* Temperature compare function*/
>>> +static int hot_temp_cmp(void *priv, struct list_head *a,
>>> +				struct list_head *b)
>>> +{
>>> +	struct hot_comm_item *ap =
>>> +			container_of(a, struct hot_comm_item, n_list);
>>> +	struct hot_comm_item *bp =
>>> +			container_of(b, struct hot_comm_item, n_list);
>>> +
>>> +	int diff = ap->hot_freq_data.last_temp
>>> +				- bp->hot_freq_data.last_temp;
>>> +	if (diff > 0)
>>> +		return -1;
>>> +	if (diff < 0)
>>> +		return 1;
>>> +	return 0;
>>> +}
>>> +
>>> +/*
>>> + * Every sync period we update temperatures for
>>> + * each hot inode item and hot range item for aging
>>> + * purposes.
>>> + */
>>> +static void hot_update_worker(struct work_struct *work)
>>> +{
>>> +	struct hot_info *root = container_of(to_delayed_work(work),
>>> +						struct hot_info, update_work);
>>> +	struct hot_inode_item *hi_nodes[8];
>>> +	u64 ino = 0;
>>> +	int i, n;
>>> +
>>> +	while (1) {
>>> +		n = radix_tree_gang_lookup(&root->hot_inode_tree,
>>> +					   (void **)hi_nodes, ino,
>>> +					   ARRAY_SIZE(hi_nodes));
>>> +		if (!n)
>>> +			break;
>>> +
>>> +		ino = hi_nodes[n - 1]->i_ino + 1;
>>> +		for (i = 0; i < n; i++) {
>>> +			kref_get(&hi_nodes[i]->hot_inode.refs);
>>> +			hot_map_array_update(
>>> +				&hi_nodes[i]->hot_inode.hot_freq_data, root);
>>> +			hot_range_update(hi_nodes[i], root);
>>> +			hot_inode_item_put(hi_nodes[i]);
>>> +		}
>>> +	}
>>> +
>>> +	/* Sort temperature map info */
>>> +	for (i = 0; i < HEAT_MAP_SIZE; i++) {
>>> +		list_sort(NULL, &root->heat_inode_map[i].node_list,
>>> +			hot_temp_cmp);
>>> +		list_sort(NULL, &root->heat_range_map[i].node_list,
>>> +			hot_temp_cmp);
>>> +	}
>>> +
>>
>> If this list can potentially have one (or more) entries per inode, then
> Only one hot_inode_item per inode, but possibly multiple
> hot_range_items per inode.
>> filesystems with a lot of inodes (millions) may potentially exceed the
>> max size of list which list_sort() can handle. If that happens it still
>> works, but you'll get a warning message and it won't be as efficient.
> I haven't done such a large-scale test. To find that issue we would
> need a large-scale performance test; before that, I want to make sure
> the code change is correct.
> To be honest, I share your concern about that issue. But list_sort()
> performance looks good from the test results at the following URL:
> https://lkml.org/lkml/2010/1/20/485

Yes, I think it is good. Also, even when it says that its performance is
poor (via the warning message) it is still much better than the
alternative (of not sorting) in the GFS2 case. So currently our
workaround is to ignore the warning. Due to what we are using it for
(sorting the data blocks for ordered writeback) we only see it very
occasionally when there has been lots of data write activity with little
journal activity on a node with lots of RAM.

>>
>> It is something that we've run into with list_sort() and GFS2, but it
>> only happens very rarely,
> Besides list_sort(), do you have any other approach to share? How does
> GFS2 resolve this concern?

That is an ongoing investigation :-)

I've pondered various options... increase temp variable space in
list_sort(), not using list_sort() and insertion sorting the blocks
instead, flushing the ordered write data early if the list gets too
long, figuring out how to remove blocks written back by the VM from the
list before the sort, and various other
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
On Mon, Nov 5, 2012 at 8:07 PM, Steven Whitehouse <swhit...@redhat.com> wrote:
> Hi,
>
> On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote:
>> On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse <swhit...@redhat.com> wrote:
>>> Hi,
>>>
>>> On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote:
>>>> From: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>>>
>>>> Add a per-superblock workqueue and a delayed_work
>>>> to run periodic work to update map info on each superblock.
>>>>
>>>> Signed-off-by: Zhi Yong Wu <wu...@linux.vnet.ibm.com>
>>>> ---
>>>>  fs/hot_tracking.c            |   85 ++
>>>>  fs/hot_tracking.h            |    3 +
>>>>  include/linux/hot_tracking.h |    3 +
>>>>  3 files changed, 91 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c
>>>> index fff0038..0ef9cad 100644
>>>> --- a/fs/hot_tracking.c
>>>> +++ b/fs/hot_tracking.c
>>>> @@ -15,9 +15,12 @@
>>>>  #include <linux/module.h>
>>>>  #include <linux/spinlock.h>
>>>>  #include <linux/hardirq.h>
>>>> +#include <linux/kthread.h>
>>>> +#include <linux/freezer.h>
>>>>  #include <linux/fs.h>
>>>>  #include <linux/blkdev.h>
>>>>  #include <linux/types.h>
>>>> +#include <linux/list_sort.h>
>>>>  #include <linux/limits.h>
>>>>  #include "hot_tracking.h"
>>>>
>>>> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root)
>>>>  	}
>>>>  }
>>>>
>>>> +/* Temperature compare function*/
>>>> +static int hot_temp_cmp(void *priv, struct list_head *a,
>>>> +				struct list_head *b)
>>>> +{
>>>> +	struct hot_comm_item *ap =
>>>> +			container_of(a, struct hot_comm_item, n_list);
>>>> +	struct hot_comm_item *bp =
>>>> +			container_of(b, struct hot_comm_item, n_list);
>>>> +
>>>> +	int diff = ap->hot_freq_data.last_temp
>>>> +				- bp->hot_freq_data.last_temp;
>>>> +	if (diff > 0)
>>>> +		return -1;
>>>> +	if (diff < 0)
>>>> +		return 1;
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Every sync period we update temperatures for
>>>> + * each hot inode item and hot range item for aging
>>>> + * purposes.
>>>> + */
>>>> +static void hot_update_worker(struct work_struct *work)
>>>> +{
>>>> +	struct hot_info *root = container_of(to_delayed_work(work),
>>>> +						struct hot_info, update_work);
>>>> +	struct hot_inode_item *hi_nodes[8];
>>>> +	u64 ino = 0;
>>>> +	int i, n;
>>>> +
>>>> +	while (1) {
>>>> +		n = radix_tree_gang_lookup(&root->hot_inode_tree,
>>>> +					   (void **)hi_nodes, ino,
>>>> +					   ARRAY_SIZE(hi_nodes));
>>>> +		if (!n)
>>>> +			break;
>>>> +
>>>> +		ino = hi_nodes[n - 1]->i_ino + 1;
>>>> +		for (i = 0; i < n; i++) {
>>>> +			kref_get(&hi_nodes[i]->hot_inode.refs);
>>>> +			hot_map_array_update(
>>>> +				&hi_nodes[i]->hot_inode.hot_freq_data, root);
>>>> +			hot_range_update(hi_nodes[i], root);
>>>> +			hot_inode_item_put(hi_nodes[i]);
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Sort temperature map info */
>>>> +	for (i = 0; i < HEAT_MAP_SIZE; i++) {
>>>> +		list_sort(NULL, &root->heat_inode_map[i].node_list,
>>>> +			hot_temp_cmp);
>>>> +		list_sort(NULL, &root->heat_range_map[i].node_list,
>>>> +			hot_temp_cmp);
>>>> +	}
>>>> +
>>>
>>> If this list can potentially have one (or more) entries per inode, then
>> Only one hot_inode_item per inode, but possibly multiple
>> hot_range_items per inode.
>>> filesystems with a lot of inodes (millions) may potentially exceed the
>>> max size of list which list_sort() can handle. If that happens it still
>>> works, but you'll get a warning message and it won't be as efficient.
>> I haven't done such a large-scale test. To find that issue we would
>> need a large-scale performance test; before that, I want to make sure
>> the code change is correct.
>> To be honest, I share your concern about that issue. But list_sort()
>> performance looks good from the test results at the following URL:
>> https://lkml.org/lkml/2010/1/20/485
>
> Yes, I think it is good. Also, even when it says that its performance is
> poor (via the warning message) it is still much better than the
> alternative (of not sorting) in the GFS2 case. So currently our
> workaround is to ignore the warning. Due to what we are using it for
> (sorting the data blocks for ordered writeback) we only see it very
> occasionally when there has been lots of data write activity with little
> journal activity on a node with lots of RAM.
OK.
>
>>>
>>> It is something that we've run into with list_sort() and GFS2, but it
>>> only happens very rarely,
>> Besides list_sort(), do you have any other approach to share? How does
>> GFS2 resolve this concern?
>
> That is an ongoing investigation :-)
>
> I've pondered various options... increase temp variable space in
> list_sort(), not using list_sort() and insertion sorting the blocks
> instead, flushing the ordered write data early if the list gets too
> long, figuring
[PATCH 1/2] Btrfs: fix a deadlock in aborting transaction due to ENOSPC
When committing a transaction, we may bail out of running delayed refs
due to ENOSPC, and then abort the current transaction to flip into
readonly.

But we'll hit a deadlock on ref head's lock since we forget to release
its lock and other cleanup stuff.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/extent-tree.c |    7 +++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d3e2c1..e0c4809 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2314,6 +2314,9 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
 				kfree(extent_op);
 
 				if (ret) {
+					list_del_init(&locked_ref->cluster);
+					mutex_unlock(&locked_ref->mutex);
+
 					printk(KERN_DEBUG "btrfs: run_delayed_extent_op returned %d\n", ret);
 					spin_lock(&delayed_refs->lock);
 					return ret;
@@ -2356,6 +2359,10 @@ static noinline int run_clustered_refs(struct btrfs_trans_handle *trans,
 		count++;
 
 		if (ret) {
+			if (locked_ref) {
+				list_del_init(&locked_ref->cluster);
+				mutex_unlock(&locked_ref->mutex);
+			}
 			printk(KERN_DEBUG "btrfs: run_one_delayed_ref returned %d\n", ret);
 			spin_lock(&delayed_refs->lock);
 			return ret;
--
1.7.7.6
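The shape of this fix, releasing what was taken before an early return, can be sketched outside the kernel. This is a minimal illustration and not the btrfs code: the `demo_` types, the integer "lock" standing in for a mutex, and the failing worker are all hypothetical.

```c
#include <errno.h>
#include <stdio.h>

/* Toy stand-in for a mutex so the sketch runs in userspace. */
struct demo_lock {
	int held;
};

static void demo_lock_acquire(struct demo_lock *l) { l->held = 1; }
static void demo_lock_release(struct demo_lock *l) { l->held = 0; }

/* Hypothetical ref head: a lock plus a flag for "linked on the cluster". */
struct demo_ref_head {
	struct demo_lock mutex;
	int on_cluster;
};

/* Hypothetical worker that fails on request, as if space ran out. */
static int demo_run_one(int fail)
{
	return fail ? -ENOSPC : 0;
}

static int demo_run_refs(struct demo_ref_head *head, int fail)
{
	int ret;

	demo_lock_acquire(&head->mutex);
	head->on_cluster = 1;

	ret = demo_run_one(fail);
	if (ret) {
		/* Error path: undo what we took before bailing out. */
		head->on_cluster = 0;
		demo_lock_release(&head->mutex);
		fprintf(stderr, "demo: run_one returned %d\n", ret);
		return ret;
	}

	head->on_cluster = 0;
	demo_lock_release(&head->mutex);
	return 0;
}
```

If the two cleanup lines in the error branch are removed, the head stays "held" after a failed call, and the next caller that tries to take it would wait forever, which is the deadlock the patch describes.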
[PATCH 2/2] Btrfs: fix a double free on pending snapshots in error handling
When creating a snapshot, failing to commit a transaction can end up
aborting the transaction, followed by a cleanup for it, where we'll
free all snapshots pending to disk.

So we check for it and avoid a double free on pending snapshots.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/ioctl.c |    6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 8fcf9a5..4e1a1ce 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -571,8 +571,12 @@ static int create_snapshot(struct btrfs_root *root, struct dentry *dentry,
 		ret = btrfs_commit_transaction(trans,
 					       root->fs_info->extent_root);
 	}
-	if (ret)
+	if (ret) {
+		/* cleanup_transaction has freed this for us */
+		if (trans->aborted)
+			pending_snapshot = NULL;
 		goto fail;
+	}
 
 	ret = pending_snapshot->error;
 	if (ret)
--
1.7.7.6
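The idea of dropping the local pointer once another path has taken over the free can be sketched in userspace. Everything below is hypothetical (the `demo_` names, the free counter, the fake commit); it only assumes, as the patch comment says, that the abort path frees the pending object itself and that freeing a NULL pointer is harmless.

```c
#include <errno.h>
#include <stdlib.h>

struct demo_snapshot {
	int error;
};

static int demo_free_count;	/* counts real frees so a double free would show */

static void demo_free_snapshot(struct demo_snapshot *s)
{
	if (!s)
		return;		/* like free(NULL): nothing to do */
	demo_free_count++;
	free(s);
}

/* Hypothetical commit: when it aborts, its cleanup frees the snapshot. */
static int demo_commit(struct demo_snapshot *pending, int abort_it, int *aborted)
{
	*aborted = 0;
	if (abort_it) {
		demo_free_snapshot(pending);
		*aborted = 1;
		return -EIO;
	}
	return 0;
}

static int demo_create_snapshot(int abort_it)
{
	struct demo_snapshot *pending = calloc(1, sizeof(*pending));
	int aborted, ret;

	if (!pending)
		return -ENOMEM;

	ret = demo_commit(pending, abort_it, &aborted);
	if (ret) {
		/* The abort path already freed it; forget our pointer. */
		if (aborted)
			pending = NULL;
		goto fail;
	}

	ret = pending->error;
fail:
	demo_free_snapshot(pending);
	return ret;
}
```

Each call frees the snapshot exactly once, whichever path runs; without the `pending = NULL` line the aborting call would free it twice.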
[PATCH] Btrfs: Don't trust the superblock label and simply printk(%s) it
Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the
superblock and make Btrfs printk("%s") crash while holding the
uuid_mutex since nobody forces a limit on the string. Since the
uuid_mutex is significant, the system would be unusable
afterwards.

Signed-off-by: Stefan Behrens <sbehr...@giantdisaster.de>
---
 fs/btrfs/volumes.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eeed97d..a429cc6 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -764,10 +764,13 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 	devid = btrfs_stack_device_id(&disk_super->dev_item);
 	transid = btrfs_super_generation(disk_super);
 	total_devices = btrfs_super_num_devices(disk_super);
-	if (disk_super->label[0])
+	if (disk_super->label[0]) {
+		if (disk_super->label[BTRFS_LABEL_SIZE - 1])
+			disk_super->label[BTRFS_LABEL_SIZE - 1] = '\0';
 		printk(KERN_INFO "device label %s ", disk_super->label);
-	else
+	} else {
 		printk(KERN_INFO "device fsid %pU ", disk_super->fsid);
+	}
 	printk(KERN_CONT "devid %llu transid %llu %s\n",
 	       (unsigned long long)devid, (unsigned long long)transid, path);
 	ret = device_list_add(path, disk_super, devid, fs_devices_ret);
--
1.8.0
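The same defensive step can be shown in a userspace sketch: a fixed-size field read from untrusted storage is forced to end in NUL before it is handed to a "%s" conversion. The struct, the `demo_` helpers, and the tiny `DEMO_LABEL_SIZE` are hypothetical; they are not the btrfs definitions.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_LABEL_SIZE 8	/* small stand-in for the real label size */

/* Hypothetical on-disk structure with a fixed-size label field. */
struct demo_super {
	char label[DEMO_LABEL_SIZE];
};

/* Force the last byte to NUL so "%s" cannot run past the buffer. */
static void demo_sanitize_label(struct demo_super *sb)
{
	if (sb->label[DEMO_LABEL_SIZE - 1])
		sb->label[DEMO_LABEL_SIZE - 1] = '\0';
}

/* Format the label into out; returns the number of label bytes printed. */
static int demo_format_label(struct demo_super *sb, char *out, size_t outlen)
{
	demo_sanitize_label(sb);
	return snprintf(out, outlen, "%s", sb->label);
}
```

In plain C, an alternative that leaves the buffer untouched is a bounded conversion such as `"%.*s"` with the field size as the precision; the patch instead fixes the data in place, which also protects any later reader of the label.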
Re: [PATCH] Btrfs: Don't trust the superblock label and simply printk(%s) it
On Mon, Nov 05, 2012 at 02:10:49PM +0100, Stefan Behrens wrote:
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -764,10 +764,13 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
>  	devid = btrfs_stack_device_id(&disk_super->dev_item);
>  	transid = btrfs_super_generation(disk_super);
>  	total_devices = btrfs_super_num_devices(disk_super);
> -	if (disk_super->label[0])
> +	if (disk_super->label[0]) {
> +		if (disk_super->label[BTRFS_LABEL_SIZE - 1])
> +			disk_super->label[BTRFS_LABEL_SIZE - 1] = '\0';

The label set via 'btrfs fi label' will also set the last-1 byte to 0,
so this keeps it as expected, although it is silent.

thanks,
Reviewed-by: David Sterba <dste...@suse.cz>
Re: [PATCH 13/16] fs/btrfs: use WARN
On Sat, Nov 03, 2012 at 11:58:34AM +0100, Julia Lawall wrote:
> From: Julia Lawall <julia.law...@lip6.fr>
>
> Use WARN rather than printk followed by WARN_ON(1), for conciseness.
>
> A simplified version of the semantic patch that makes this transformation
> is as follows: (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@
> expression list es;
> @@
>
> -printk(
> +WARN(1,
>   es);
> -WARN_ON(1);
> // </smpl>
>
> Signed-off-by: Julia Lawall <julia.law...@lip6.fr>

Reviewed-by: David Sterba <dste...@suse.cz>
Re: [PATCH 1/8] fs/btrfs: drop if around WARN_ON
On Sat, Nov 03, 2012 at 09:30:18PM +0100, Julia Lawall wrote:
> From: Julia Lawall <julia.law...@lip6.fr>
>
> Just use WARN_ON rather than an if containing only WARN_ON(1).
>
> A simplified version of the semantic patch that makes this transformation
> is as follows: (http://coccinelle.lip6.fr/)
>
> // <smpl>
> @@
> expression e;
> @@
> - if (e) WARN_ON(1);
> + WARN_ON(e);
> // </smpl>
>
> Signed-off-by: Julia Lawall <julia.law...@lip6.fr>

Reviewed-by: David Sterba <dste...@suse.cz>
Re: merging printk and WARN
On Sun, Nov 04, 2012 at 09:25:53PM +0100, Julia Lawall wrote:
> It looks like these patches were not a good idea, because in each case the
> printk provides an error level, and WARN then provides another one.

I think this is not a problem within btrfs at the place where this has
changed.

david
Re: (late) REQUEST: Default mkfs.btrfs block size
On Wed, Oct 31, 2012 at 12:20:39PM +, Alex wrote:
> As one 'stuck' with 4k leaves on my main machine for the moment, can I request
> the btrfs-progs v0.20 defaults to more efficient decent block sizes before
> release. Most distro install programs for the moment don't give access to the
> options at install time and there seems to be a significant advantage to 16k
> or 32k

IMHO this should be fixed inside the installer; changing defaults for a
core utility will affect everybody. 4k is the most tested option and
thus can be considered safe for everybody. The installer may let you
enter a shell and create the filesystem by hand, then point the
installer at it for installation.

david
Re: (late) REQUEST: Default mkfs.btrfs block size
On Mon, Nov 5, 2012 at 10:06 AM, David Sterba d...@jikos.cz wrote:
> On Wed, Oct 31, 2012 at 12:20:39PM +, Alex wrote:
>> As one 'stuck' with 4k leaves on my main machine for the moment, can I request that btrfs-progs v0.20 default to more efficient block sizes before release? Most distro install programs for the moment don't give access to the options at install time, and there seems to be a significant advantage to 16k or 32k.
>
> IMHO this should be fixed inside the installer; changing defaults for a core utility will affect everybody. 4k is the most tested option and thus can be considered safe for everybody. The installer may let you enter a shell and create the filesystem by hand, then point it at that filesystem for installation.

If we know a better setting, we should default to it. Punting the decision to the distro just means I'll spend the next 3 years telling people "yeah, distro X doesn't set it to the recommended setting (which isn't the mkfs default), and there's no way to change it without wiping and reinstalling using manual partitioning, blah blah blah".
Re: How to find (out if) files sharing content?
On Wed, Oct 31, 2012 at 09:02:15PM +0800, Jeff Liu wrote:
> I propose this because OCFS2 reports shared space in this way, combined with du(1). An old patch set to make du(1) aware of reflinked files:
> https://oss.oracle.com/pipermail/ocfs2-devel/2010-September/007293.html

Patch looks ok, the shared size is requested by an option.

> Do you mean that the cost of a userland extent status check per file is very high?

The most expensive part is IMO not in userspace, it does in-memory lookups.

>> And without any possibility to turn this off, I'm afraid this will render FIEMAP unusable in practice.
>
> For OCFS2, the FIEMAP_EXTENT_SHARED flag will be set upon fiemap ioctl(2) if an extent is OCFS2_EXT_REFCOUNTED (i.e. reflinked or cloned), which means that FIEMAP_EXTENT_SHARED is not a persistent flag, but I have no idea how Btrfs would behave on this point. :(

After some research, I think this could work for btrfs without unwanted performance penalties. There's the fiemap::fm_flags field that can be extended to request the shared extent info from fiemap, so the information is not computed unconditionally (that was my concern before). The rest is only implementation details of how to speed up the file extent -> refcount info lookups.

david
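For readers unfamiliar with the interface under discussion, the following is a rough sketch of how a du-like tool could total the bytes that FIEMAP reports as shared. It uses only the existing FS_IOC_FIEMAP ioctl and the FIEMAP_EXTENT_SHARED extent flag; the extra request flag in fm_flags proposed in this thread is not shown, since it is only a proposal here. The function name count_shared_bytes and the batch size are assumptions made for the example, and FIEMAP_FLAG_SYNC appears only to mark where request flags are passed.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>

#define EXTENT_BATCH 64	/* arbitrary batch size chosen for this sketch */

/* Sum the mapped and the shared bytes of the file open on fd.
 * Returns 0 on success, -1 on allocation or ioctl failure. */
static int count_shared_bytes(int fd, uint64_t *mapped, uint64_t *shared)
{
	size_t sz = sizeof(struct fiemap) +
		    EXTENT_BATCH * sizeof(struct fiemap_extent);
	struct fiemap *fm = malloc(sz);
	uint64_t start = 0;
	int last = 0;

	if (!fm)
		return -1;
	*mapped = *shared = 0;

	while (!last) {
		memset(fm, 0, sz);
		fm->fm_start = start;
		fm->fm_length = FIEMAP_MAX_OFFSET - start;
		fm->fm_flags = FIEMAP_FLAG_SYNC;	/* request flags go here */
		fm->fm_extent_count = EXTENT_BATCH;

		if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
			free(fm);
			return -1;
		}
		if (fm->fm_mapped_extents == 0)
			break;	/* nothing (more) is mapped */

		for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *fe = &fm->fm_extents[i];

			*mapped += fe->fe_length;
			if (fe->fe_flags & FIEMAP_EXTENT_SHARED)
				*shared += fe->fe_length;
			if (fe->fe_flags & FIEMAP_EXTENT_LAST)
				last = 1;
			/* Continue after this extent on the next call. */
			start = fe->fe_logical + fe->fe_length;
		}
	}
	free(fm);
	return 0;
}
```

Whether a filesystem sets FIEMAP_EXTENT_SHARED at all, and at what cost, is exactly the open question in this thread; the sketch only shows where a caller would look for it.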
Re: btrfs defrag problem
On Thu, Nov 01, 2012 at 05:17:04AM +0800, ching wrote:
>>> 3. Is it possible to online defrag a btrfs partition without being hindered by mount points/polyinstantiated directories?
>>
>> Sorry, I do not understand the question.
>
> When a device is mounted under a directory, files in the directory are hidden, and files on the device are available, right?
>
> When a directory is polyinstantiated, files in the original directory are hidden, and files in the polyinstantiated directory are available.
>
> How to get past them and pass those hidden files to the defrag command?

I hope I get it right, so unless you have a reference to the directory with hidden files (using your term), there's no way to access them. And this is a more generic question, not related to btrfs itself. The hidden files may also belong to a different filesystem.

david
Re: What's the minimum size I can shrink my FS to?
On Sat, Nov 03, 2012 at 03:10:52PM +1030, Jordan Windsor wrote:
> [root@archpc ~]# btrfs fi df /home/jordan/Storage/
> Data: total=580.88GB, used=490.88GB

This is getting full, 84%, so there is not much chance of freeing a substantial number of 1G chunks through the 'usage=1' balance filter. Some of the space between 490G and 580G will be spent on slack space and fragmentation, the rest may be packed together by a higher usage= value (but that will be slower due to relocating more data).

> System, DUP: total=32.00MB, used=76.00KB
> System: total=4.00MB, used=0.00
> Metadata, DUP: total=13.01GB, used=1001.83MB

If you intend to shrink a filesystem, all space group types must be taken into account, so here you have at least

  580G + 2x32M + 4M + 2x13G = ~607G

> [root@archpc ~]# btrfs fi sh
> failed to read /dev/sr0
> Label: 'Storage'  uuid: 717d4a43-38b3-495f-841b-d223068584de
>         Total devices 1 FS bytes used 491.86GB
>         devid    1 size 612.04GB used 606.96GB path /dev/sda6
                                        ^^^^^^^^ confirmed :)

So basically you cannot go under this number when shrinking.

I think you can squeeze the metadata space down to 2G (or maybe to 1G, the usage is very close to 1G so it is hard to guess) by the -musage= filter AND using at least a 3.7 kernel (or 3.6 + chris' for-linus branch) with the fixed over-allocation bug (otherwise the size will stay pinned at 2% of the filesystem size).

david
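The lower bound worked out above can be reproduced with a few lines of C. This is only an illustration of the arithmetic: every allocated chunk group counts once, or twice for DUP profiles. The struct and function names are made up for this example, and the GB/MB figures printed by the tools are assumed to be binary units.

```c
#include <stddef.h>

/* One line of `btrfs fi df` output, reduced to what the sum needs. */
struct chunk_group {
	double total_mib;	/* the "total=" value, converted to MiB */
	int copies;		/* 2 for DUP profiles, 1 otherwise */
};

/* Lower bound for a shrink, in MiB: each allocated group times its copies. */
static double min_shrink_mib(const struct chunk_group *g, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += g[i].total_mib * g[i].copies;
	return sum;
}

/* The figures quoted in the message above. */
static const struct chunk_group storage_fs[] = {
	{ 580.88 * 1024, 1 },	/* Data */
	{ 32.00, 2 },		/* System, DUP */
	{ 4.00, 1 },		/* System */
	{ 13.01 * 1024, 2 },	/* Metadata, DUP */
};
```

Calling min_shrink_mib(storage_fs, 4) and dividing by 1024 gives a value just under 607G, in line with the "used 606.96GB" that btrfs fi show reports for the device.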
Re: [Request for review] [RFC] Add label support for snapshots and subvols
On Mon, Nov 05, 2012 at 03:24:48PM +0800, Anand Jain wrote:
> feature   xattr   btrfs-kernel-way
> [1]       No      Yes
> [2]       No      Yes
> [3]       Yes     No
>
> [1]. Ability to read subvol label without mount

It is possible to read it offline, one can traverse the data structures the same way as from kernel, i.e. root_tree -> subvolume fs_tree -> root directory item -> xattr item.

> [2]. Full-ability to log and track the property when it is modified

What is expected to happen when the label changes? I understand that somebody may change the xattr value silently, but let's say this is changed through kernel -- do you intend to prohibit any changes or issue some notification or whatever?

> [3]. risk-free patch ?

No patch is risk free :) but yes, xattrs use an established and tested infrastructure.

david
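As a rough illustration of the xattr option compared above, a label could be stored on and read back from a mounted subvolume's root directory with the generic xattr syscalls. The attribute name "user.subvol_label" and both helper names are hypothetical; the thread does not settle on a name or namespace, and this sketch covers only the mounted case, not the offline traversal described above.

```c
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name; the thread does not fix one. */
#define SUBVOL_LABEL_XATTR "user.subvol_label"

/* Store a label on the subvolume's root directory. Returns 0 or -1. */
static int set_subvol_label(const char *subvol_path, const char *label,
			    size_t len)
{
	return setxattr(subvol_path, SUBVOL_LABEL_XATTR, label, len, 0);
}

/* Read the label into buf and NUL-terminate it.
 * Returns the label length, or -1 on error (including a missing label). */
static ssize_t get_subvol_label(const char *subvol_path, char *buf,
				size_t bufsz)
{
	ssize_t n;

	if (bufsz == 0)
		return -1;
	n = getxattr(subvol_path, SUBVOL_LABEL_XATTR, buf, bufsz - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return n;
}
```

Anything with write access to the directory can change such an attribute without further notice, which is the tracking concern raised under [2].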
Re: btrfs defrag problem
On 11/06/2012 06:57 AM, David Sterba wrote:
> On Thu, Nov 01, 2012 at 05:17:04AM +0800, ching wrote:
>>>> 3. Is it possible to online defrag a btrfs partition without being hindered by mount points/polyinstantiated directories?
>>>
>>> Sorry, I do not understand the question.
>>
>> When a device is mounted under a directory, files in the directory are hidden, and files on the device are available, right?
>>
>> When a directory is polyinstantiated, files in the original directory are hidden, and files in the polyinstantiated directory are available.
>>
>> How to get past them and pass those hidden files to the defrag command?
>
> I hope I get it right, so unless you have a reference to the directory with hidden files (using your term), there's no way to access them. And this is a more generic question, not related to btrfs itself. The hidden files may also belong to a different filesystem.
>
> david

Thanks for your explanation.

ching
Re: How to find (out if) files sharing content?
On 11/06/2012 06:45 AM, David Sterba wrote:
> On Wed, Oct 31, 2012 at 09:02:15PM +0800, Jeff Liu wrote:
>> I propose this because OCFS2 reports shared space in this way, combined with du(1). An old patch set to make du(1) aware of reflinked files:
>> https://oss.oracle.com/pipermail/ocfs2-devel/2010-September/007293.html
>
> Patch looks ok, the shared size is requested by an option.
>
>> Do you mean that the cost of a userland extent status check per file is very high?
>
> The most expensive part is IMO not in userspace, it does in-memory lookups.
>
>>> And without any possibility to turn this off, I'm afraid this will render FIEMAP unusable in practice.
>>
>> For OCFS2, the FIEMAP_EXTENT_SHARED flag will be set upon fiemap ioctl(2) if an extent is OCFS2_EXT_REFCOUNTED (i.e. reflinked or cloned), which means that FIEMAP_EXTENT_SHARED is not a persistent flag, but I have no idea how Btrfs would behave on this point. :(
>
> After some research, I think this could work for btrfs without unwanted performance penalties. There's the fiemap::fm_flags field that can be extended to request the shared extent info from fiemap, so the information is not computed unconditionally (that was my concern before). The rest is only implementation details of how to speed up the file extent -> refcount info lookups.

Thanks for your confirmation.

-Jeff

> david
slow after btrfs balance
Hello,

Here is a little background on my setup: mdadm -> lvm -> dmcrypt -> btrfs. The filesystem was originally ext4 and I converted it to btrfs. Right after the conversion, everything was fine. Some time later I ran a btrfs balance, and since that day the write speed has been very slow. When I do things like unzip/unrar, the load average climbs to 15+. That makes the system almost unusable.

During the whole process I went from kernel 3.2 (wheezy) to 3.5. I went back to 3.2 to check the stability, and the results are the same.

I was asking myself if simply changing the chunk would solve my problem?

Thanks

William
Re: Production use with vanilla 3.6.6
On Mon, Nov 5, 2012 at 7:07 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:
> Hello list,
>
> is btrfs ready for production use in 3.6.6? Or should i backport fixes from 3.7-rc? Is it planned to have a stable kernel which will get all btrfs fixes backported?

I would say no to both, but you should check with the distros that support btrfs (Oracle Linux and SLES). In particular, ask whether they backport fixes, and what exactly "supported" status gives you when you buy support for that distro.

--
Fajar