Re: Abysmal Performance
On Mon, 20 Jun 2011 20:12:16 -0400, Josef Bacik wrote: On 06/20/2011 05:51 PM, Henning Rohlfs wrote: Hello, I've migrated my system to btrfs (raid1) a few months ago. Since then the performance has been pretty bad, but recently it's gotten unbearable: a simple sync called while the system is idle can take 20 up to 60 seconds. Creating or deleting files often has several seconds latency, too. One curious - but maybe unrelated - observation is that even though I'm using a raid1 btrfs setup, the hdds are often being written to sequentially. One hard-drive sees some write activity and after it subsides, the other drive sees some activity. (See attached sequential-writes.txt.) Can you do sysrq+w while this is happening so we can see who is doing the writing? Thanks, Josef When I call sync, it starts with several seconds of 100% (one core) cpu usage by sync itself. Afterwards btrfs-submit-0 and sync are blocked. sysrq+w output is attached.SysRq : Show Blocked State taskPC stack pid father btrfs-submit-0 D 88021c8bbf00 0 851 2 0x 880245b07b20 0046 0048 880245b06010 880246ce2b80 00011340 880245b07fd8 4000 880245b07fd8 00011340 88021f2d95c0 880246ce2b80 Call Trace: [81333f7a] ? put_device+0x12/0x14 [81343af6] ? scsi_request_fn+0x341/0x40d [812a4643] ? __blk_run_queue+0x16/0x18 [814fdc2f] io_schedule+0x51/0x66 [812a6bd9] get_request_wait+0xa1/0x12f [81050c85] ? wake_up_bit+0x25/0x25 [812a3585] ? elv_merge+0x99/0xa9 [812a6dee] __make_request+0x187/0x27f [812a540f] generic_make_request+0x229/0x2a4 [812a5553] submit_bio+0xc9/0xe8 [81267b24] run_scheduled_bios+0x296/0x415 [81044692] ? del_timer+0x83/0x83 [81267cb3] pending_bios_fn+0x10/0x12 [8126d591] worker_loop+0x189/0x4b7 [8126d408] ? btrfs_queue_worker+0x263/0x263 [8126d408] ? btrfs_queue_worker+0x263/0x263 [810508bc] kthread+0x7d/0x85 [81500c94] kernel_thread_helper+0x4/0x10 [8105083f] ? kthread_worker_fn+0x13a/0x13a [81500c90] ? gs_change+0xb/0xb syncD 000102884001 0 22091 19401 0x 88019f0edc98 0086 8801 88019f0ec010 880227808000 00011340 88019f0edfd8 4000 88019f0edfd8 00011340 8181f020 880227808000 Call Trace: [810b189b] ? add_partial+0x1b/0x64 [810b36d6] ? kmem_cache_free+0x8e/0x93 [812623ad] ? free_extent_state+0x43/0x47 [81086c2b] ? __lock_page+0x68/0x68 [814fdc2f] io_schedule+0x51/0x66 [81086c34] sleep_on_page+0x9/0xd [814fe32c] __wait_on_bit+0x43/0x76 [81086df3] wait_on_page_bit+0x6d/0x74 [81050cb9] ? autoremove_wake_function+0x34/0x34 [81086b0e] ? find_get_page+0x19/0x66 [81247d60] btrfs_wait_marked_extents+0xeb/0x124 [81247ee6] btrfs_write_and_wait_marked_extents+0x2a/0x3c [81247f3a] btrfs_write_and_wait_transaction+0x42/0x44 [81248676] btrfs_commit_transaction+0x53c/0x650 [81050c85] ? wake_up_bit+0x25/0x25 [810db775] ? __sync_filesystem+0x75/0x75 [8122a33a] btrfs_sync_fs+0x66/0x6b [810db761] __sync_filesystem+0x61/0x75 [810db786] sync_one_sb+0x11/0x13 [810bc89c] iterate_supers+0x67/0xbd [810db7c8] sys_sync+0x40/0x57 [814fff7b] system_call_fastpath+0x16/0x1b Sched Debug Version: v0.10, 2.6.39 #3 ktime : 141912370.788052 sched_clk : 141831001.710073 cpu_clk : 141912370.788875 jiffies : 4337451011 sched_clock_stable : 0 sysctl_sched .sysctl_sched_latency: 12.00 .sysctl_sched_min_granularity: 1.50 .sysctl_sched_wakeup_granularity : 2.00 .sysctl_sched_child_runs_first : 0 .sysctl_sched_features : 7279 .sysctl_sched_tunable_scaling: 1 (logaritmic) cpu#0, 2009.138 MHz .nr_running: 0 .load : 0 .nr_switches : 145331671 .nr_load_updates : 14210439 .nr_uninterruptible: 0 .next_balance : 4337.451012 .curr-pid : 0 .clock : 141912370.762325 .cpu_load[0] : 0 .cpu_load[1] : 0 .cpu_load[2] : 14 .cpu_load[3] : 87 .cpu_load[4] : 139 .yld_count : 20406719 .sched_switch : 0 .sched_count : 167876926 .sched_goidle : 51210495 .avg_idle
Re: Abysmal Performance
Henning Rohlfs wrote (ao): - space_cache was enabled, but it seemed to make the problem worse. It's no longer in the mount options. space_cache is a one time mount option which enabled space_cache. Not supplying it anymore as a mount option has no effect (dmesg | grep btrfs). Sander -- Humilis IT Services and Solutions http://www.humilis.net -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL v3] Btrfs: improve write ahead log with sub transaction
I've been working to try to improve the write-ahead log's performance, and I found that the bottleneck addresses in the checksum items, especially when we want to make a random write on a large file, e.g a 4G file. Then a idea for this suggested by Chris is to use sub transaction ids and just to log the part of inode that had changed since either the last log commit or the last transaction commit. And as we also push the sub transid into the btree blocks, we'll get much faster tree walks. As a result, we abandon the original brute force approach, which is to delete all items of the inode in log, to making sure we get the most uptodate copies of everything, and instead we manage to find and merge, i.e. finding extents in the log tree and merging in the new extents from the file. This patchset puts the above idea into code, and although the code is now more complex, it brings us a great deal of performance improvement: in my sysbench write + fsync test: 451.01Kb/sec - 4.3621Mb/sec In v2, thanks to Chris, we worked together to solve 2 bugs, and after that it works as expected. Since there are some vital changes in recent rc, like kill trans_mutex and use cur_trans, as David asked, I rebase the patchset to the latest for-linus branch. More tests are welcome! You can also get this patchset from: git://repo.or.cz/linux-btrfs-devel.git sub-trans Liu Bo (12): Btrfs: introduce sub transaction stuff Btrfs: update block generation if should_cow_block fails Btrfs: modify btrfs_drop_extents API Btrfs: introduce first sub trans Btrfs: still update inode trans stuff when size remains unchanged Btrfs: improve log with sub transaction Btrfs: add checksum check for log Btrfs: fix a bug of log check Btrfs: kick off useless code Btrfs: deal with EEXIST after iput Btrfs: use the right generation number to read log_root_tree Revert Btrfs: do not flush csum items of unchanged file data during treelog fs/btrfs/btrfs_inode.h | 12 ++- fs/btrfs/ctree.c | 69 +--- fs/btrfs/ctree.h |5 +- fs/btrfs/disk-io.c | 12 +- fs/btrfs/extent-tree.c | 10 +- fs/btrfs/file.c| 22 ++--- fs/btrfs/inode.c | 33 --- fs/btrfs/ioctl.c |6 +- fs/btrfs/relocation.c |6 +- fs/btrfs/transaction.c | 14 ++- fs/btrfs/transaction.h | 19 +++- fs/btrfs/tree-defrag.c |2 +- fs/btrfs/tree-log.c| 272 ++- 13 files changed, 331 insertions(+), 151 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/12 v3] Btrfs: update block generation if should_cow_block fails
Cause we've added sub transaction, if it do not want to cow a block, we also need to get new sub transid recorded. This is used for log code to find the most uptodate file extents. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/ctree.c | 34 +- 1 files changed, 33 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index f35b517..cfddb05 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -477,6 +477,33 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, return 0; } +static inline void update_block_generation(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct extent_buffer *buf, + struct extent_buffer *parent, + int slot) +{ + /* +* If it does not need to cow this block, we still need to +* update the block's generation, for transid may have been +* changed during fsync. + */ + if (btrfs_header_generation(buf) == trans-transid) + return; + + if (buf == root-node) { + btrfs_set_header_generation(buf, trans-transid); + btrfs_mark_buffer_dirty(buf); + add_root_to_dirty_list(root); + } else { + btrfs_set_node_ptr_generation(parent, slot, + trans-transid); + btrfs_set_header_generation(buf, trans-transid); + btrfs_mark_buffer_dirty(parent); + btrfs_mark_buffer_dirty(buf); + } +} + static inline int should_cow_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf) @@ -517,6 +544,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, } if (!should_cow_block(trans, root, buf)) { + update_block_generation(trans, root, buf, parent, parent_slot); *cow_ret = buf; return 0; } @@ -1655,8 +1683,12 @@ again: * then we don't want to set the path blocking, * so we test it here */ - if (!should_cow_block(trans, root, b)) + if (!should_cow_block(trans, root, b)) { + update_block_generation(trans, root, b, + p-nodes[level + 1], + p-slots[level + 1]); goto cow_done; + } btrfs_set_path_blocking(p); -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/12 v3] Btrfs: introduce first sub trans
In multi-thread situations, writeback of a file may span across several sub transactions, and we need to introduce first_sub_trans to get sub_transid of teh first sub transaction recorded, so that log code can skip file extents which have been logged or committed onto disk. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/btrfs_inode.h |9 + fs/btrfs/inode.c | 13 - fs/btrfs/transaction.h | 17 - 3 files changed, 37 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 52d7eca..8eca5de 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -80,6 +80,15 @@ struct btrfs_inode { /* sequence number for NFS changes */ u64 sequence; + /* used to avoid race of first_sub_trans */ + spinlock_t sub_trans_lock; + + /* +* sub transid of the trans that first modified this inode before +* a trans commit or a log sync +*/ + u64 first_sub_trans; + /* * transid of the trans_handle that last modified this inode */ diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ff9d4d1..93c9d22 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6468,7 +6468,16 @@ again: set_page_dirty(page); SetPageUptodate(page); - BTRFS_I(inode)-last_trans = root-fs_info-generation; + spin_lock(BTRFS_I(inode)-sub_trans_lock); + + if (BTRFS_I(inode)-first_sub_trans root-fs_info-sub_generation || + BTRFS_I(inode)-last_trans = BTRFS_I(inode)-logged_trans || + BTRFS_I(inode)-last_trans = root-fs_info-last_trans_committed) + BTRFS_I(inode)-first_sub_trans = root-fs_info-sub_generation; + + spin_unlock(BTRFS_I(inode)-sub_trans_lock); + + BTRFS_I(inode)-last_trans = root-fs_info-sub_generation; BTRFS_I(inode)-last_sub_trans = BTRFS_I(inode)-root-log_transid; unlock_extent_cached(io_tree, page_start, page_end, cached_state, GFP_NOFS); @@ -6714,6 +6723,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) ei-space_info = NULL; ei-generation = 0; ei-sequence = 0; + ei-first_sub_trans = 0; ei-last_trans = 0; ei-last_sub_trans = 0; ei-logged_trans = 0; @@ -6740,6 +6750,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) extent_io_tree_init(ei-io_tree, inode-i_data); extent_io_tree_init(ei-io_failure_tree, inode-i_data); mutex_init(ei-log_mutex); + spin_lock_init(ei-sub_trans_lock); btrfs_ordered_inode_tree_init(ei-ordered_tree); INIT_LIST_HEAD(ei-i_orphan); INIT_LIST_HEAD(ei-delalloc_inodes); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 45876b0..f5ca0fd 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -73,7 +73,22 @@ struct btrfs_pending_snapshot { static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans, struct inode *inode) { - BTRFS_I(inode)-last_trans = trans-transaction-transid; + spin_lock(BTRFS_I(inode)-sub_trans_lock); + + /* +* We have joined in a transaction, so btrfs_commit_transaction will +* definitely wait for us and it does not need to add a extra +* trans_mutex lock here. +*/ + if (BTRFS_I(inode)-first_sub_trans trans-transid || + BTRFS_I(inode)-last_trans = BTRFS_I(inode)-logged_trans || + BTRFS_I(inode)-last_trans = +BTRFS_I(inode)-root-fs_info-last_trans_committed) + BTRFS_I(inode)-first_sub_trans = trans-transid; + + spin_unlock(BTRFS_I(inode)-sub_trans_lock); + + BTRFS_I(inode)-last_trans = trans-transid; BTRFS_I(inode)-last_sub_trans = BTRFS_I(inode)-root-log_transid; } -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/12 v3] Btrfs: modify btrfs_drop_extents API
We want to use btrfs_drop_extent() in log code. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/ctree.h|3 ++- fs/btrfs/file.c |9 +++-- fs/btrfs/inode.c|6 +++--- fs/btrfs/ioctl.c|4 ++-- fs/btrfs/tree-log.c |2 +- 5 files changed, 15 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 54fcc62..073fb37 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2604,7 +2604,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, int skip_pinned); extern const struct file_operations btrfs_file_operations; int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode, - u64 start, u64 end, u64 *hint_byte, int drop_cache); + u64 start, u64 end, u64 *hint_byte, int drop_cache, + int log); int btrfs_mark_extent_written(struct btrfs_trans_handle *trans, struct inode *inode, u64 start, u64 end); int btrfs_release_file(struct inode *inode, struct file *file); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index fa4ef18..3c434bb 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -547,7 +547,8 @@ int btrfs_drop_extent_cache(struct inode *inode, u64 start, u64 end, * is deleted from the tree. */ int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode, - u64 start, u64 end, u64 *hint_byte, int drop_cache) + u64 start, u64 end, u64 *hint_byte, int drop_cache, + int log) { struct btrfs_root *root = BTRFS_I(inode)-root; struct extent_buffer *leaf; @@ -567,6 +568,10 @@ int btrfs_drop_extents(struct btrfs_trans_handle *trans, struct inode *inode, int recow; int ret; + /* drop the existed extents in log tree */ + if (log) + root = root-log_root; + if (drop_cache) btrfs_drop_extent_cache(inode, start, end - 1, 0); @@ -747,7 +752,7 @@ next_slot: extent_end - key.offset); extent_end = ALIGN(extent_end, root-sectorsize); - } else if (disk_bytenr 0) { + } else if (disk_bytenr 0 !log) { ret = btrfs_free_extent(trans, root, disk_bytenr, num_bytes, 0, root-root_key.objectid, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 6291445..ff9d4d1 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -246,7 +246,7 @@ static noinline int cow_file_range_inline(struct btrfs_trans_handle *trans, } ret = btrfs_drop_extents(trans, inode, start, aligned_end, -hint_byte, 1); +hint_byte, 1, 0); BUG_ON(ret); if (isize actual_end) @@ -1658,7 +1658,7 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans, * with the others. */ ret = btrfs_drop_extents(trans, inode, file_pos, file_pos + num_bytes, -hint, 0); +hint, 0, 0); BUG_ON(ret); ins.objectid = btrfs_ino(inode); @@ -3517,7 +3517,7 @@ int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size) err = btrfs_drop_extents(trans, inode, cur_offset, cur_offset + hole_size, -hint_byte, 1); +hint_byte, 1, 0); if (err) break; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 29026a7..72e1a98 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2339,7 +2339,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, ret = btrfs_drop_extents(trans, inode, new_key.offset, new_key.offset + datal, -hint_byte, 1); +hint_byte, 1, 0); BUG_ON(ret); ret = btrfs_insert_empty_item(trans, root, path, @@ -2394,7 +2394,7 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd, ret = btrfs_drop_extents(trans, inode, new_key.offset, new_key.offset + datal, -
[PATCH 07/12 v3] Btrfs: add checksum check for log
If a inode is a BTRFS_INODE_NODATASUM one, it need not to look for csum items any more. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/tree-log.c | 13 - 1 files changed, 8 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6193e27..1eca1d4 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2672,7 +2672,8 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, struct inode *inode, struct btrfs_path *dst_path, struct extent_buffer *src, - int start_slot, int nr, int inode_only) + int start_slot, int nr, int inode_only, + int csum) { unsigned long src_offset; unsigned long dst_offset; @@ -2739,7 +2740,8 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, * or deletes of this inode don't have to relog the inode * again */ - if (btrfs_key_type(ins_keys + i) == BTRFS_EXTENT_DATA_KEY) { + if (btrfs_key_type(ins_keys + i) == + BTRFS_EXTENT_DATA_KEY csum) { int found_type; extent = btrfs_item_ptr(src, start_slot + i, struct btrfs_file_extent_item); @@ -2857,6 +2859,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, int ins_nr; u64 ino = btrfs_ino(inode); u64 transid; + int csum = (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM) ? 0 : 1; /* * We use transid in btrfs_search_forward() as a filter, in order to @@ -2934,7 +2937,7 @@ filter: if (ins_nr) { ret = copy_items(trans, inode, dst_path, src, ins_start_slot, -ins_nr, inode_only); +ins_nr, inode_only, csum); if (ret) { err = ret; goto out_unlock; @@ -2953,7 +2956,7 @@ next_slot: if (ins_nr) { ret = copy_items(trans, inode, dst_path, src, ins_start_slot, -ins_nr, inode_only); +ins_nr, inode_only, csum); if (ret) { err = ret; goto out_unlock; @@ -2974,7 +2977,7 @@ next_slot: if (ins_nr) { ret = copy_items(trans, inode, dst_path, src, ins_start_slot, -ins_nr, inode_only); +ins_nr, inode_only, csum); if (ret) { err = ret; goto out_unlock; -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/12 v3] Btrfs: introduce sub transaction stuff
Introduce a new concept sub transaction, the relation between transaction and sub transaction is transaction A --- transid = x sub trans a(1) --- sub_transid = x+1 sub trans a(2) --- sub_transid = x+2 ... ... sub trans a(n-1) --- sub_transid = x+n-1 sub trans a(n) --- sub_transid = x+n transaction B --- transid = x+n+1 ... ... And the most important is a) a trans handler's transid now gets value from sub transid instead of transid. b) when a transaction commits, transid may not added by 1, but depend on the biggest sub_transaction of the last neighbour transaction, i.e. B-transid = a(n)-transid + 1, (B-transid - A-transid) = 1 c) we start a new sub transaction after a fsync. We also ship some 'trans-transid' to 'trans-transaction-transid' to ensure btrfs works well and to get rid of WARNings. These are used for the new log code. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/ctree.c | 35 ++- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |7 --- fs/btrfs/extent-tree.c | 10 ++ fs/btrfs/inode.c |4 ++-- fs/btrfs/ioctl.c |2 +- fs/btrfs/relocation.c |6 +++--- fs/btrfs/transaction.c | 14 +- fs/btrfs/transaction.h |1 + fs/btrfs/tree-defrag.c |2 +- fs/btrfs/tree-log.c| 16 ++-- 11 files changed, 60 insertions(+), 38 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 2e66786..f35b517 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -194,9 +194,9 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans, int level; struct btrfs_disk_key disk_key; - WARN_ON(root-ref_cows trans-transid != + WARN_ON(root-ref_cows trans-transaction-transid != root-fs_info-running_transaction-transid); - WARN_ON(root-ref_cows trans-transid != root-last_trans); + WARN_ON(root-ref_cows trans-transid root-last_trans); level = btrfs_header_level(buf); if (level == 0) @@ -391,9 +391,9 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, btrfs_assert_tree_locked(buf); - WARN_ON(root-ref_cows trans-transid != + WARN_ON(root-ref_cows trans-transaction-transid != root-fs_info-running_transaction-transid); - WARN_ON(root-ref_cows trans-transid != root-last_trans); + WARN_ON(root-ref_cows trans-transid root-last_trans); level = btrfs_header_level(buf); @@ -459,7 +459,8 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans, else parent_start = 0; - WARN_ON(trans-transid != btrfs_header_generation(parent)); + WARN_ON(btrfs_header_generation(parent) + trans-transaction-transid); btrfs_set_node_blockptr(parent, parent_slot, cow-start); btrfs_set_node_ptr_generation(parent, parent_slot, @@ -480,7 +481,7 @@ static inline int should_cow_block(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf) { - if (btrfs_header_generation(buf) == trans-transid + if (btrfs_header_generation(buf) = trans-transaction-transid !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) !(root-root_key.objectid != BTRFS_TREE_RELOC_OBJECTID btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) @@ -508,7 +509,7 @@ noinline int btrfs_cow_block(struct btrfs_trans_handle *trans, root-fs_info-running_transaction-transid); WARN_ON(1); } - if (trans-transid != root-fs_info-generation) { + if (trans-transaction-transid != root-fs_info-generation) { printk(KERN_CRIT trans %llu running %llu\n, (unsigned long long)trans-transid, (unsigned long long)root-fs_info-generation); @@ -611,7 +612,7 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, if (trans-transaction != root-fs_info-running_transaction) WARN_ON(1); - if (trans-transid != root-fs_info-generation) + if (trans-transaction-transid != root-fs_info-generation) WARN_ON(1); parent_nritems = btrfs_header_nritems(parent); @@ -891,7 +892,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans, mid = path-nodes[level]; WARN_ON(!path-locks[level]); - WARN_ON(btrfs_header_generation(mid) != trans-transid); + WARN_ON(btrfs_header_generation(mid) trans-transaction-transid); orig_ptr = btrfs_node_blockptr(mid, orig_slot); @@ -1098,7 +1099,7 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
[PATCH 05/12 v3] Btrfs: still update inode trans stuff when size remains unchanged
Due to DIO stuff, commit 1ef30be142d2cc60e2687ef267de864cf31be995 makes btrfs not call btrfs_update_inode when it does not update i_disk_size, but in buffer write case, we need to update btrfs internal inode's trans stuff, so that the log code can find the inode's changes. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/inode.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 93c9d22..47dcbc0 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1790,7 +1790,8 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end) if (!ret) { ret = btrfs_update_inode(trans, root, inode); BUG_ON(ret); - } + } else + btrfs_set_inode_last_trans(trans, inode); ret = 0; out: if (nolock) { -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/12 v3] Btrfs: deal with EEXIST after iput
There are two cases when BTRFS_I(inode)-logged_trans is zero: a) an inode is just allocated; b) iput an inode and reread it. However, in b) if btrfs is not committed yet, and this inode _may_ still remain in log tree. So we need to check the log tree to get logged_trans a right value in case it hits a EEXIST while logging. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/inode.c|9 +++-- fs/btrfs/tree-log.c | 43 +++ 2 files changed, 46 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 0007ae3..6eff806 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1786,12 +1786,9 @@ static int btrfs_finish_ordered_io(struct inode *inode, u64 start, u64 end) add_pending_csums(trans, inode, ordered_extent-file_offset, ordered_extent-list); - ret = btrfs_ordered_update_i_size(inode, 0, ordered_extent); - if (!ret) { - ret = btrfs_update_inode(trans, root, inode); - BUG_ON(ret); - } else - btrfs_set_inode_last_trans(trans, inode); + btrfs_ordered_update_i_size(inode, 0, ordered_extent); + ret = btrfs_update_inode(trans, root, inode); + BUG_ON(ret); ret = 0; out: if (nolock) { diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index f1e95bf..3cf7ed2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3063,6 +3063,37 @@ out: return ret; } +static int check_logged_trans(struct btrfs_trans_handle *trans, + struct btrfs_root *root, struct inode *inode) +{ + struct btrfs_inode_item *inode_item; + struct btrfs_path *path; + int ret; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + ret = btrfs_search_slot(trans, root, + BTRFS_I(inode)-location, path, 0, 0); + if (ret) { + if (ret 0) + ret = 0; + goto out; + } + + btrfs_unlock_up_safe(path, 1); + inode_item = btrfs_item_ptr(path-nodes[0], path-slots[0], + struct btrfs_inode_item); + + BTRFS_I(inode)-logged_trans = btrfs_inode_transid(path-nodes[0], + inode_item); +out: + btrfs_free_path(path); + return ret; +} + + static int inode_in_log(struct btrfs_trans_handle *trans, struct inode *inode) { @@ -3115,6 +3146,18 @@ int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, if (ret) goto end_no_trans; + /* +* After we iput a inode and reread it from disk, logged_trans is 0. +* However, this inode _may_ still remain in log tree and not be +* committed yet. +* So we need to check the log tree to get logged_trans a right value. +*/ + if (!BTRFS_I(inode)-logged_trans root-log_root) { + ret = check_logged_trans(trans, root-log_root, inode); + if (ret) + goto end_no_trans; + } + if (inode_in_log(trans, inode)) { ret = BTRFS_NO_LOG_SYNC; goto end_no_trans; -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/12 v3] Btrfs: fix a bug of log check
The current code uses struct root's last_log_commit to check if an inode has been logged, but the problem is that this root-last_log_commit is shared among files. Say we have N inodes to be logged, after the first inode, root-last_log_commit is updated and the N-1 remains will not be logged. As we've introduce sub transaction and filled inode's last_trans and logged_trans with sub_transid instead of transaction id, we can just compare last_trans with logged_trans to determine if the processing inode is logged. And the more important thing is these two values are inode-individual, so it will not interfere with others. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/btrfs_inode.h |5 - fs/btrfs/ctree.h |1 - fs/btrfs/disk-io.c |2 -- fs/btrfs/inode.c |2 -- fs/btrfs/transaction.h |1 - fs/btrfs/tree-log.c| 16 +++- 6 files changed, 3 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 8eca5de..a85161e 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -95,11 +95,6 @@ struct btrfs_inode { u64 last_trans; /* -* log transid when this inode was last modified -*/ - u64 last_sub_trans; - - /* * transid that last logged this inode */ u64 logged_trans; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 073fb37..da3c769 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1157,7 +1157,6 @@ struct btrfs_root { atomic_t log_writers; atomic_t log_commit[2]; unsigned long log_transid; - unsigned long last_log_commit; unsigned long log_batch; pid_t log_start_pid; bool log_multiple_pids; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b7e80c3..d6d71f2 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1065,7 +1065,6 @@ static int __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize, atomic_set(root-log_writers, 0); root-log_batch = 0; root-log_transid = 0; - root-last_log_commit = 0; extent_io_tree_init(root-dirty_log_pages, fs_info-btree_inode-i_mapping); @@ -1202,7 +1201,6 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans, WARN_ON(root-log_root); root-log_root = log_root; root-log_transid = 0; - root-last_log_commit = 0; return 0; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 47dcbc0..0007ae3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6479,7 +6479,6 @@ again: spin_unlock(BTRFS_I(inode)-sub_trans_lock); BTRFS_I(inode)-last_trans = root-fs_info-sub_generation; - BTRFS_I(inode)-last_sub_trans = BTRFS_I(inode)-root-log_transid; unlock_extent_cached(io_tree, page_start, page_end, cached_state, GFP_NOFS); @@ -6726,7 +6725,6 @@ struct inode *btrfs_alloc_inode(struct super_block *sb) ei-sequence = 0; ei-first_sub_trans = 0; ei-last_trans = 0; - ei-last_sub_trans = 0; ei-logged_trans = 0; ei-delalloc_bytes = 0; ei-reserved_bytes = 0; diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index f5ca0fd..fd2474a 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -89,7 +89,6 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans, spin_unlock(BTRFS_I(inode)-sub_trans_lock); BTRFS_I(inode)-last_trans = trans-transid; - BTRFS_I(inode)-last_sub_trans = BTRFS_I(inode)-root-log_transid; } int btrfs_end_transaction(struct btrfs_trans_handle *trans, diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 1eca1d4..f1e95bf 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -1983,7 +1983,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, int ret; struct btrfs_root *log = root-log_root; struct btrfs_root *log_root_tree = root-fs_info-log_root_tree; - unsigned long log_transid = 0; mutex_lock(root-log_mutex); index1 = root-log_transid % 2; @@ -2018,8 +2017,7 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, goto out; } - log_transid = root-log_transid; - if (log_transid % 2 == 0) + if (root-log_transid % 2 == 0) mark = EXTENT_DIRTY; else mark = EXTENT_NEW; @@ -2126,11 +2124,6 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, btrfs_scrub_continue_super(root); ret = 0; - mutex_lock(root-log_mutex); - if (root-last_log_commit log_transid) - root-last_log_commit = log_transid; - mutex_unlock(root-log_mutex); - out_wake_log_root: atomic_set(log_root_tree-log_commit[index2], 0); smp_mb(); @@ -3073,14 +3066,11 @@ out: static int inode_in_log(struct btrfs_trans_handle *trans, struct inode *inode) { -
[PATCH 09/12 v3] Btrfs: kick off useless code
fsync will wait for writeback till it finishes, and last_trans will get the real transid recorded in writeback, so it does not need an extra +1 to ensure fsync's process on the file. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/file.c | 13 - 1 files changed, 0 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 3c434bb..8f27007 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1404,19 +1404,6 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb, mutex_unlock(inode-i_mutex); - /* -* we want to make sure fsync finds this change -* but we haven't joined a transaction running right now. -* -* Later on, someone is sure to update the inode and get the -* real transid recorded. -* -* We set last_trans now to the fs_info generation + 1, -* this will either be one more than the running transaction -* or the generation used for the next transaction if there isn't -* one running right now. -*/ - BTRFS_I(inode)-last_trans = root-fs_info-generation + 1; if (num_written 0 || num_written == -EIOCBQUEUED) { err = generic_write_sync(file, pos, num_written); if (err 0 num_written 0) -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 12/12 v3] Revert Btrfs: do not flush csum items of unchanged file data during treelog
This reverts commit 8e531cdfeb75269c6c5aae33651cca39707848da. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/tree-log.c |3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 166fda5..d4ccf91 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2741,9 +2741,6 @@ static noinline int copy_items(struct btrfs_trans_handle *trans, extent = btrfs_item_ptr(src, start_slot + i, struct btrfs_file_extent_item); - if (btrfs_file_extent_generation(src, extent) trans-transid) - continue; - found_type = btrfs_file_extent_type(src, extent); if (found_type == BTRFS_FILE_EXTENT_REG || found_type == BTRFS_FILE_EXTENT_PREALLOC) { -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/12 v3] Btrfs: improve log with sub transaction
When logging an inode _A_, current btrfs will a) clear all items belonged to _A_ in log, b) copy all items belonged to _A_ from fs/file tree to log tree, and this just wastes a lot of time, especially when logging big files. So we want to use a smarter approach, i.e. find and merge. The amount of file extent items is the largest, so we focus on it. Thanks to sub transaction, now we can find those file extent items which are changed after last _transaction commit_ or last _log commit_, and then merge them with the existed ones in log tree. It will be great helpful on fsync performance, cause the common case is make changes on a _part_ of inode. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/tree-log.c | 179 --- 1 files changed, 127 insertions(+), 52 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index b91fe8b..6193e27 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2581,61 +2581,106 @@ again: } /* - * a helper function to drop items from the log before we relog an - * inode. max_key_type indicates the highest item type to remove. - * This cannot be run for file data extents because it does not - * free the extents they point to. + * a helper function to drop items from the log before we merge + * the uptodate items into the log tree. */ -static int drop_objectid_items(struct btrfs_trans_handle *trans, - struct btrfs_root *log, - struct btrfs_path *path, - u64 objectid, int max_key_type) +static int prepare_for_merge_items(struct btrfs_trans_handle *trans, + struct inode *inode, + struct extent_buffer *eb, + int slot, int nr) { - int ret; - struct btrfs_key key; + struct btrfs_root *log = BTRFS_I(inode)-root-log_root; + struct btrfs_path *path; struct btrfs_key found_key; + struct btrfs_key key; + int i; + int ret; - key.objectid = objectid; - key.type = max_key_type; - key.offset = (u64)-1; + /* There are no relative items of the inode in log. */ + if (BTRFS_I(inode)-logged_trans trans-transaction-transid) + return 0; - while (1) { + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + for (i = 0; i nr; i++) { + btrfs_item_key_to_cpu(eb, key, i + slot); + + if (btrfs_key_type(key) == BTRFS_EXTENT_DATA_KEY) { + struct btrfs_file_extent_item *fi; + int found_type; + u64 mask = BTRFS_I(inode)-root-sectorsize - 1; + u64 start = key.offset; + u64 extent_end; + u64 hint; + unsigned long size; + + fi = btrfs_item_ptr(eb, slot + i, + struct btrfs_file_extent_item); + found_type = btrfs_file_extent_type(eb, fi); + + if (found_type == BTRFS_FILE_EXTENT_REG || + found_type == BTRFS_FILE_EXTENT_PREALLOC) + extent_end = start + + btrfs_file_extent_num_bytes(eb, fi); + else if (found_type == BTRFS_FILE_EXTENT_INLINE) { + size = btrfs_file_extent_inline_len(eb, fi); + extent_end = (start + size + mask) ~mask; + } else + BUG_ON(1); + + /* drop any overlapping extents */ + ret = btrfs_drop_extents(trans, inode, start, +extent_end, hint, 0, 1); + BUG_ON(ret); + + continue; + } + + /* non file extent */ ret = btrfs_search_slot(trans, log, key, path, -1, 1); - BUG_ON(ret == 0); if (ret 0) break; - if (path-slots[0] == 0) + /* empty log! */ + if (ret 0 path-slots[0] == 0) break; - path-slots[0]--; + if (ret 0) { + btrfs_release_path(path); + continue; + } + btrfs_item_key_to_cpu(path-nodes[0], found_key, path-slots[0]); - if (found_key.objectid != objectid) - break; + if (btrfs_comp_cpu_keys(found_key, key)) + BUG_ON(1); ret = btrfs_del_item(trans, log, path); - if (ret) - break; +
[PATCH 11/12 v3] Btrfs: use the right generation number to read log_root_tree
Currently we use the generation number of the super to read in the log tree root after a crash. This doesn't always match the sub trans id and so it doesn't always match the transid stored in the btree blocks. We can use log_root_transid to record the log_root_tree's generation so that when we recover from crash, we can match log_root_tree's btree blocks. Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com --- fs/btrfs/disk-io.c |3 ++- fs/btrfs/tree-log.c |2 ++ 2 files changed, 4 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index d6d71f2..4568317 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2002,6 +2002,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, if (btrfs_super_log_root(disk_super) != 0 !(fs_info-fs_state BTRFS_SUPER_FLAG_ERROR)) { u64 bytenr = btrfs_super_log_root(disk_super); + u64 log_root_transid = btrfs_super_log_root_transid(disk_super); if (fs_devices-rw_devices == 0) { printk(KERN_WARNING Btrfs log replay required @@ -2024,7 +2025,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, log_tree_root-node = read_tree_block(tree_root, bytenr, blocksize, - generation + 1); + log_root_transid); ret = btrfs_recover_log_trees(log_tree_root); BUG_ON(ret); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 3cf7ed2..166fda5 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2105,6 +2105,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, log_root_tree-node-start); btrfs_set_super_log_root_level(root-fs_info-super_for_commit, btrfs_header_level(log_root_tree-node)); + btrfs_set_super_log_root_transid(root-fs_info-super_for_commit, +trans-transid); log_root_tree-log_batch = 0; log_root_tree-log_transid++; -- 1.6.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Abysmal Performance
On Tue, 21 Jun 2011 10:00:59 +0200, Sander wrote: Henning Rohlfs wrote (ao): - space_cache was enabled, but it seemed to make the problem worse. It's no longer in the mount options. space_cache is a one time mount option which enabled space_cache. Not supplying it anymore as a mount option has no effect (dmesg | grep btrfs). I'm sure that after the first reboot after removing the flag from the mount options, the system was faster for a while. That must have been a coincidence (or just an error on my part). Anyway, I rebooted with clear_cache as mount option and there was no improvement either. Thanks for pointing this out, Henning -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/8] fs: kill i_alloc_sem
On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner wrote: Modification of inode-i_state is not safe outside the inode-i_lock. We never actually set the new bit in i_state, we just use it as a key for the hashed lookups. Or rather we try to, as I misunderstood how wait_on_bit works, so currently we busywait for i_dio_count to reach zero. I'll respin a version that actually works as expected. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Btrfs: fix space leak when trimming free extents
On Tuesday, June 21, 2011 01:50:10 PM Li Zefan wrote: When the end of an extent exceeds the end of the specified range, the extent will be accidentally truncated. Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/free-space-cache.c |9 - 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 292c0d9..185cf8e 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2509,8 +2509,15 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group, bytes = min(entry-bytes, end - start); if (bytes minlen) goto next; + unlink_free_space(ctl, entry); - kmem_cache_free(btrfs_free_space_cachep, entry); + if (bytes entry-bytes) { + entry-offset = entry-offset + bytes; + entry-bytes = entry-bytes - bytes; + link_free_space(ctl, entry); yes, I forgot to link the rest extent to the free space cache. Thanks for the fix! + } else { + kmem_cache_free(btrfs_free_space_cachep, entry); + } } spin_unlock(ctl-tree_lock); -- 1.7.3.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Applications using fsync cause hangs for several seconds every few minutes
Hello, Nirbheek Chauhan nirbheek at gentoo.org writes: However, recently, perhaps with 2.6.39, or after I quickly started filling up my disk again, it has become impossible for me to work for long periods on my machine. Every few minutes, (I guess) when applications do fsync (firefox, xchat, vim, etc), all applications that use fsync() hang for several seconds, and applications that use general IO suffer extreme slowdowns. iotop shows various combinations of the processes listed below doing writes, and the total write as 2-3MB/s. [btrfs-dealloc-] [btrfs-submit-0] [btrfs-transacti] [btrfs-endio-wri] [flush-btrfs-1] I'm using btrfs under a 2.6.39-ARCH kernel and run into the same issue. In my case the [btrfs-submit-0] and [btrfs-transacti] shows up in iotop and produce 99% of IO at the time a application is frozen. For something like 10 to 30 seconds. After a fresh boot everything runs quiet acceptable and goes worse over the day. If I reboot the unmount process takes a lot of time something up to 8 minutes. After the reboot everything is back to normal. At least for a while. I use snapshots on a daily base and have 20 active snapshots all the time. Maybe this is the reason for the performance impact? What can I do to debug this issue? What other information should I supply? Could someone guide me on how to figure out why my machine is unusable now? I have no solution for this problem. In fact a reboot helps me even if only temporary. I also changed the IO scheduler, switched to laptop mode and remounted the device. Neither of these helped. Maybe someone else has any suggestion? Thanks. Jan Stilow -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[HELP!] parent transid verify failed on 600755752960 wanted 757102 found 756726
[HELP!] parent transid verify failed on 600755752960 wanted 757102 found 756726 Hi list, I've a broken btrfs filesystem to deal with can someone please help me in recover the data? The filesystem has been created a pair of years ago with 4 devices with the command at #create and is mounted with #fstab options. Recently I've added a pair of devices and made a `btrfs filesystem balance`, after it succeded I was doing a `btrfs device delete` on space02 (the currently broken one) in the middle of this the power cable has been axed. After replacing the cable cord 'space01' is mountable, 'space02' is not. tryed to use a backup copy of super with `btrfs-select-super` but it fail as reported in #btrfs-select-super please, pretty please have you suggestion on what try next? #current kernel (vanilla + linux-vserver) uname -a Linux dobbia 2.6.38.8-vs2.3.0.37-rc17 #5 SMP Mon Jun 20 15:04:39 CEST 2011 x86_64 Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz GenuineIntel GNU/Linux #create modprobe btrfs mkfs.btrfs -L space01 -m raid10 -d raid10 $DEVICES1 mkfs.btrfs -L space02 -m raid10 -d raid10 $DEVICES2 # fstab /dev/sda6 /mnt/space01 btrfs defaults,device=/dev/sda6,device=/dev/sdc6,device=/dev/sdd1,device=/dev/sde1,device=/dev/sdf6,device=/dev/sdg6 0 0 /dev/sda7 /mnt/space02 btrfs defaults,device=/dev/sda7,device=/dev/sdb7,device=/dev/sdc7,device=/dev/sdd2,device=/dev/sde2,device=/dev/sdf7,device=/dev/sdg7 0 0 # current layout btrfs filesystem show failed to read /dev/sr0 Label: 'space01' uuid: c77c6e87-fccd-4204-bd2c-d924fe06be31 Total devices 6 FS bytes used 164.81GB devid7 size 244.14GB used 56.59GB path /dev/sdf6 devid5 size 244.93GB used 56.59GB path /dev/sdd1 devid8 size 244.14GB used 56.59GB path /dev/sdg6 devid6 size 244.93GB used 56.59GB path /dev/sde1 devid4 size 244.14GB used 56.59GB path /dev/sda6 devid3 size 244.14GB used 56.59GB path /dev/sdc6 Label: 'space02' uuid: f752def1-1abc-48c7-8ebb-47ba37b8ffa6 Total devices 7 FS bytes used 172.94GB devid7 size 487.65GB used 0.00 path /dev/sdf7 devid6 size 488.94GB used 60.25GB path /dev/sde2 devid5 size 488.94GB used 58.75GB path /dev/sdd2 devid4 size 487.65GB used 60.26GB path /dev/sda7 devid7 size 487.65GB used 1.50GB path /dev/sdg7 devid2 size 487.65GB used 58.76GB path /dev/sdb7 devid3 size 487.65GB used 60.26GB path /dev/sdc7 Btrfs v0.19-35-g1b444cd-dirty # first error messages Jun 20 14:04:35 dobbia kernel: [ 806.587580] device label space02 devid 4 transid 757294 /dev/sda7 Jun 20 14:04:35 dobbia kernel: [ 806.629781] device label space02 devid 2 transid 756848 /dev/sdb7 Jun 20 14:04:35 dobbia kernel: [ 806.630107] device label space02 devid 3 transid 757294 /dev/sdc7 Jun 20 14:04:35 dobbia kernel: [ 806.652126] device label space02 devid 5 transid 756846 /dev/sdd2 Jun 20 14:04:37 dobbia kernel: [ 808.201719] device label space02 devid 6 transid 757294 /dev/sde2 Jun 20 14:04:37 dobbia kernel: [ 808.218108] device label space02 devid 7 transid 756846 /dev/sdf7 Jun 20 14:04:37 dobbia kernel: [ 808.218433] device label space02 devid 7 transid 757294 /dev/sdg7 Jun 20 14:04:37 dobbia kernel: [ 808.218715] device label space02 devid 4 transid 757294 /dev/sda7 Jun 20 14:04:37 dobbia kernel: [ 808.271797] btrfs: failed to read the system array on sdg7 Jun 20 14:04:37 dobbia kernel: [ 808.293776] btrfs: open_ctree failed Jun 20 14:04:56 dobbia kernel: [ 827.190208] device label space02 devid 4 transid 757294 /dev/sda7 Jun 20 14:04:56 dobbia kernel: [ 827.254517] btrfs: failed to read the system array on sdg7 Jun 20 14:04:56 dobbia kernel: [ 827.280152] btrfs: open_ctree failed Jun 20 14:05:01 dobbia kernel: [ 832.442454] device label space02 devid 4 transid 757294 /dev/sda7 Jun 20 14:05:01 dobbia kernel: [ 832.502017] btrfs: failed to read the system array on sdg7 Jun 20 14:05:01 dobbia kernel: [ 832.521492] btrfs: open_ctree failed Jun 20 14:05:20 dobbia kernel: [ 851.113237] device label space02 devid 4 transid 757294 /dev/sda7 Jun 20 14:05:20 dobbia kernel: [ 851.199478] btrfs: allowing degraded mounts Jun 20 14:05:20 dobbia kernel: [ 851.563583] parent transid verify failed on 600755752960 wanted 757102 found 756726 Jun 20 14:05:20 dobbia kernel: [ 851.564146] parent transid verify failed on 600755752960 wanted 757102 found 756726 Jun 20 14:05:20 dobbia kernel: [ 851.651006] btrfs bad tree block start 0 600859951104 Jun 20 14:05:20 dobbia kernel: [ 851.671362] parent transid verify failed on 600859955200 wanted 756926 found 756726 Jun 20 14:05:20 dobbia kernel: [ 851.671636] parent transid verify failed on 600859955200 wanted 756926 found 756726 Jun 20 14:05:20 dobbia kernel: [ 851.693515] btrfs bad tree block start 0 601053986816 Jun 20 14:05:20 dobbia kernel: [ 851.693559] btrfs bad tree block start 0 601054003200 Jun 20 14:05:20 dobbia kernel: [ 851.693566]
Re: [PATCH 06/12 v3] Btrfs: improve log with sub transaction
On 06/21/2011 04:49 AM, Liu Bo wrote: When logging an inode _A_, current btrfs will a) clear all items belonged to _A_ in log, b) copy all items belonged to _A_ from fs/file tree to log tree, and this just wastes a lot of time, especially when logging big files. So we want to use a smarter approach, i.e. find and merge. The amount of file extent items is the largest, so we focus on it. Thanks to sub transaction, now we can find those file extent items which are changed after last _transaction commit_ or last _log commit_, and then merge them with the existed ones in log tree. It will be great helpful on fsync performance, cause the common case is make changes on a _part_ of inode. Signed-off-by: Liu Boliubo2...@cn.fujitsu.com --- fs/btrfs/tree-log.c | 179 --- 1 files changed, 127 insertions(+), 52 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index b91fe8b..6193e27 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -2581,61 +2581,106 @@ again: } /* - * a helper function to drop items from the log before we relog an - * inode. max_key_type indicates the highest item type to remove. - * This cannot be run for file data extents because it does not - * free the extents they point to. + * a helper function to drop items from the log before we merge + * the uptodate items into the log tree. */ -static int drop_objectid_items(struct btrfs_trans_handle *trans, - struct btrfs_root *log, - struct btrfs_path *path, - u64 objectid, int max_key_type) +static int prepare_for_merge_items(struct btrfs_trans_handle *trans, + struct inode *inode, + struct extent_buffer *eb, + int slot, int nr) { - int ret; - struct btrfs_key key; + struct btrfs_root *log = BTRFS_I(inode)-root-log_root; + struct btrfs_path *path; struct btrfs_key found_key; + struct btrfs_key key; + int i; + int ret; - key.objectid = objectid; - key.type = max_key_type; - key.offset = (u64)-1; + /* There are no relative items of the inode in log. */ + if (BTRFS_I(inode)-logged_trans trans-transaction-transid) + return 0; - while (1) { + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + for (i = 0; i nr; i++) { + btrfs_item_key_to_cpu(eb,key, i + slot); + Just a nit, but it would be less of a pain to do for (i = slot; i slot+nr; i++) that way you don't have a bunch of getter(eb, slot + i) things scattered around. + if (btrfs_key_type(key) == BTRFS_EXTENT_DATA_KEY) { + struct btrfs_file_extent_item *fi; + int found_type; + u64 mask = BTRFS_I(inode)-root-sectorsize - 1; + u64 start = key.offset; + u64 extent_end; + u64 hint; + unsigned long size; + + fi = btrfs_item_ptr(eb, slot + i, + struct btrfs_file_extent_item); + found_type = btrfs_file_extent_type(eb, fi); + + if (found_type == BTRFS_FILE_EXTENT_REG || + found_type == BTRFS_FILE_EXTENT_PREALLOC) + extent_end = start + + btrfs_file_extent_num_bytes(eb, fi); + else if (found_type == BTRFS_FILE_EXTENT_INLINE) { + size = btrfs_file_extent_inline_len(eb, fi); + extent_end = (start + size + mask) ~mask; + } else + BUG_ON(1); + Style screw up, if one chunk of an if/else block has braces, they all need it. + /* drop any overlapping extents */ + ret = btrfs_drop_extents(trans, inode, start, +extent_end,hint, 0, 1); There are lot's of places where you've missed a space, like here where you have extent_end,hint + BUG_ON(ret); + + continue; + } + + /* non file extent */ ret = btrfs_search_slot(trans, log,key, path, -1, 1); - BUG_ON(ret == 0); if (ret 0) break; - if (path-slots[0] == 0) + /* empty log! */ + if (ret 0 path-slots[0] == 0) Here too. break; - path-slots[0]--; + if (ret 0) { Here. + btrfs_release_path(path); + continue; + } +
Re: [PATCH 10/12 v3] Btrfs: deal with EEXIST after iput
On 06/21/2011 04:49 AM, Liu Bo wrote: There are two cases when BTRFS_I(inode)-logged_trans is zero: a) an inode is just allocated; b) iput an inode and reread it. However, in b) if btrfs is not committed yet, and this inode _may_ still remain in log tree. So we need to check the log tree to get logged_trans a right value in case it hits a EEXIST while logging. Instead of doing this why not just check and see if the inode has been logged but the transaction has not yet been committed in btrfs_drop_inode? That way the inode doesn't get evicted from cache until after we know it's ok and that way we don't have to waste a tree lookup. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [HELP!] parent transid verify failed on 600755752960 wanted 757102 found 756726
Welcome to the club, I have a similar issue. We pretty much have to wait for the fsck tool to finish being developed. If possible unhook the drives and leave them be until the tool is done. I don't know when it will be done as I am not a developer, mearly a follower. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/7] added helper functions to iterate backrefs
On Sat, Jun 18, 2011 at 01:45:58PM +0200, Jan Schmidt wrote: +/* + * returns 0 if the path could be dumped (probably truncated) + * returns 0 in case of an error + */ +static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref, + struct extent_buffer *eb_ir, int slot, + void *ctx) +{ + struct inode_fs_paths *ipath = ctx; + struct extent_buffer *eb_i = ipath-eb; + u32 path_len; + char *fs_path; + + if (ipath-bytes_left 2) + return -EOVERFLOW; + + *ipath-dest++ = ' '; + --ipath-bytes_left; this actually prepends a space before the first entry. I've played a bit with this interface, it's the missing piece to implement the 'inode nubmer - directory names' search :) All the names are placed into one buffer, 4k sized, separated by a space. Not all files fit, so I suggest to print one filename at a time. Me-user wants to see all the filenames, even if the list is potentially long. I've also noticed that in iref_to_path() 130 len = btrfs_inode_ref_name_len(eb, iref); returns very large values, which do not fit into the 4k buffer and the function returns. When trying to print the data by read_extent_buffer to a temporary array, a pagefault occurs (and seems to come from reading the extent buffer). The numbers are like 19000 or higher, which are way above path or filename maximum size. I don't think this is the intentional behaviour, it relies on some side effect rather than clear logic. Example: ipath buffer and scratch are 32K, each, ie. the overly sized ref_name_len will fit there: [ 8766.928232] btrfs: ino2name: 266 p1/ [ 8767.440964] i2p: [4] namelen 10, left 32766 [ 8767.446417] i2p: [7] path: [ 8767.450445] i2p: [4] namelen 2, left 32755 [ 8767.455733] i2p: [7] path: / [ 8767.459834] i2p: [2] p1/ [ 8767.463593] i2p: [4] namelen 0, left 32752 [ 8767.468903] i2p: [7] path: [ 8767.472908] i2p: [4] namelen 2, left 32751 [ 8767.478188] i2p: [7] path: / [ 8767.482280] i2p: [2] p1/ [ 8767.486024] i2p: [4] namelen 0, left 32748 [ 8767.491293] i2p: [7] path: [ 8767.495283] i2p: [4] namelen 2, left 32747 [ 8767.500587] i2p: [7] path: / [ 8767.504680] i2p: [2] p1/ [ 8767.508430] i2p: [4] namelen 0, left 32744 [ 8767.513708] i2p: [7] path: [ 8767.517702] i2p: [4] namelen 2, left 32743 [ 8767.522969] i2p: [7] path: / [ 8767.527042] i2p: [2] p1/ [ 8767.530787] i2p: [4] namelen 0, left 32740 [ 8767.536049] i2p: [7] path: [ 8767.539991] i2p: [4] namelen 2, left 32739 [ 8767.545282] i2p: [7] path: / [ 8767.549374] i2p: [2] p1/ [ 8767.553109] i2p: [4] namelen 0, left 32736 [ 8767.558377] i2p: [7] path: [ 8767.562354] i2p: [4] namelen 2, left 32735 [ 8767.567615] i2p: [7] path: / [ 8767.571713] i2p: [2] p1/ [ 8767.575443] i2p: [4] namelen 0, left 32732 [ 8767.580701] i2p: [7] path: [ 8767.584678] i2p: [4] namelen 2, left 32731 [ 8767.589933] i2p: [7] path: / [ 8767.594007] i2p: [2] p1/ [ 8767.597713] i2p: [4] namelen 0, left 32728 [ 8767.602908] i2p: [7] path: [ 8767.606832] i2p: [4] namelen 2, left 32727 [ 8767.612030] i2p: [7] path: / [ 8767.616050] i2p: [2] p1/ [ 8767.619676] i2p: [4] namelen 0, left 32724 [ 8767.624873] i2p: [7] path: [ 8767.628782] i2p: [4] namelen 2, left 32723 [ 8767.633970] i2p: [7] path: / [ 8767.637962] i2p: [2] p1/ [ 8767.641639] i2p: [4] namelen 0, left 32720 [ 8767.646838] i2p: [7] path: [ 8767.650757] i2p: [4] namelen 2, left 32719 [ 8767.655952] i2p: [7] path: / [ 8767.659940] i2p: [2] p1/ [ 8767.663604] i2p: [4] namelen 0, left 32716 [ 8767.668790] i2p: [7] path: [ 8767.672696] i2p: [4] namelen 2, left 32715 [ 8767.677881] i2p: [7] path: / [ 8767.681878] i2p: [2] p1/ [ 8767.685549] i2p: [4] namelen 0, left 32712 [ 8767.690742] i2p: [7] path: [ 8767.694655] i2p: [4] namelen 2, left 32711 [ 8767.699843] i2p: [7] path: / [ 8767.703840] i2p: [2] p1/ [ 8767.707520] i2p: [4] namelen 19967, left 32708 [ 8767.713057] i2p: [7] path: [ 8767.716955] BUG: unable to handle kernel NULL pointer dereference at (null) [ 8767.720932] IP: [a005f1d2] read_extent_buffer+0xa2/0x220 [btrfs] [ 8767.720932] PGD 77bd0067 PUD 79d35067 PMD 0 [ 8767.720932] Oops: [#1] SMP [ 8767.720932] CPU 1 [ 8767.720932] Modules linked in: btrfs loop [ 8767.720932] [ 8767.720932] Pid: 10859, comm: btrfs-ino2path Not tainted 3.0.0-rc3-default+ #75 Intel Corporation Santa Rosa platform/Matanzas [ 8767.720932] RIP: 0010:[a005f1d2] [a005f1d2] read_extent_buffer+0xa2/0x220 [btrfs] [ 8767.720932] RSP: 0018:880074acbc58 EFLAGS: 00010202 [ 8767.720932] RAX: RBX: 3deb RCX: 1000 [ 8767.720932] RDX: 01e0 RSI: RDI: 0246 [ 8767.720932] RBP: 880074acbcb8 R08: R09: 0001 [ 8767.720932] R10: R11: 1c04 R12: 0002 [ 8767.720932] R13: R14: 880079bac1d9 R15: 880074acbfd8 [ 8767.720932] FS:
Re: [HELP!] parent transid verify failed on 600755752960 wanted 757102 found 756726
2011/6/21 Daniel Witzel dannyboy48...@gmail.com: Welcome to the club, I have a similar issue. We pretty much have to wait for the fsck tool to finish being developed. If possible unhook the drives and leave them be until the tool is done. I don't know when it will be done as I am not a developer, mearly a follower. There are tools to view the metadata stored as raid10? possibly in high level language? I see Chris Mason stopped git commits to btrfs-progs-unstable in 2010, there is someone working on it? -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/7] added helper functions to iterate backrefs
On 21.06.2011 16:37, David Sterba wrote: On Sat, Jun 18, 2011 at 01:45:58PM +0200, Jan Schmidt wrote: +/* + * returns 0 if the path could be dumped (probably truncated) + * returns 0 in case of an error + */ +static int inode_to_path(u64 inum, struct btrfs_inode_ref *iref, +struct extent_buffer *eb_ir, int slot, +void *ctx) +{ +struct inode_fs_paths *ipath = ctx; +struct extent_buffer *eb_i = ipath-eb; +u32 path_len; +char *fs_path; + +if (ipath-bytes_left 2) +return -EOVERFLOW; + +*ipath-dest++ = ' '; +--ipath-bytes_left; this actually prepends a space before the first entry. Which is fine because we print the buffer with (path:%s) in scrub.c:257, and a space is well appropriate there. That way we don't need to handle the first path differently. (Makes only sense as long as we stick to multiple files per line). I've played a bit with this interface, it's the missing piece to implement the 'inode nubmer - directory names' search :) All the names are placed into one buffer, 4k sized, separated by a space. Not all files fit, so I suggest to print one filename at a time. It does print all references to one *inode* to the same buffer, right. If there are more inodes refering to that extent, those paths are printed to a separate buffer. You can make some reflinks to try that out. Me-user wants to see all the filenames, even if the list is potentially long. Well, maybe. However, for me, the number of hardlinks created is either very low, or I know why a file has a whole bunch of hardlinks myself (e.g. some kind of backup strategy). In both cases, this approach would fit. I don't want to see some thousands of printk messages for that one file I know to have a lot of hardlinks - maybe even missing some error of a non-hardlinked file due to rate limiting. On the other hand, you could argue, when I deliberately create a bunch of reflinks, I will get the same problem. Another approach would be to put it all into one buffer, no matter which inode it comes from. Either way shouldn't be hard to accomplish. I've also noticed that in iref_to_path() 130 len = btrfs_inode_ref_name_len(eb, iref); returns very large values, which do not fit into the 4k buffer and the function returns. When trying to print the data by read_extent_buffer to a temporary array, a pagefault occurs (and seems to come from reading the extent buffer). The numbers are like 19000 or higher, which are way above path or filename maximum size. I don't think this is the intentional behaviour, it relies on some side effect rather than clear logic. Something is going wrong here: Example: ipath buffer and scratch are 32K, each, ie. the overly sized ref_name_len will fit there: [ 8766.928232] btrfs: ino2name: 266 p1/ [ 8767.440964] i2p: [4] namelen 10, left 32766 [ 8767.446417] i2p: [7] path: [ 8767.450445] i2p: [4] namelen 2, left 32755 [ 8767.455733] i2p: [7] path: / [ 8767.459834] i2p: [2] p1/ Do I get that right? This is inode 266, which should refer to ./p1/d6/d7/d1f/c41 (as you state below). Already on second iteration, we're at something named pl, which shouldn't happen as we construct the path bottom-up. And those namelen 0 in the following lines shouldn't happen, either. Furthermore, the path should change on every iteration (I guess, that might depend on the printk behind, of course). [ 8767.463593] i2p: [4] namelen 0, left 32752 [ 8767.468903] i2p: [7] path: [ 8767.472908] i2p: [4] namelen 2, left 32751 [ 8767.478188] i2p: [7] path: / [ 8767.482280] i2p: [2] p1/ [ 8767.486024] i2p: [4] namelen 0, left 32748 [ 8767.491293] i2p: [7] path: [ 8767.495283] i2p: [4] namelen 2, left 32747 [ 8767.500587] i2p: [7] path: / [ 8767.504680] i2p: [2] p1/ [ 8767.508430] i2p: [4] namelen 0, left 32744 [ 8767.513708] i2p: [7] path: [ 8767.517702] i2p: [4] namelen 2, left 32743 [ 8767.522969] i2p: [7] path: / [ 8767.527042] i2p: [2] p1/ [ 8767.530787] i2p: [4] namelen 0, left 32740 [ 8767.536049] i2p: [7] path: [ 8767.539991] i2p: [4] namelen 2, left 32739 [ 8767.545282] i2p: [7] path: / [ 8767.549374] i2p: [2] p1/ [ 8767.553109] i2p: [4] namelen 0, left 32736 [ 8767.558377] i2p: [7] path: [ 8767.562354] i2p: [4] namelen 2, left 32735 [ 8767.567615] i2p: [7] path: / [ 8767.571713] i2p: [2] p1/ [ 8767.575443] i2p: [4] namelen 0, left 32732 [ 8767.580701] i2p: [7] path: [ 8767.584678] i2p: [4] namelen 2, left 32731 [ 8767.589933] i2p: [7] path: / [ 8767.594007] i2p: [2] p1/ [ 8767.597713] i2p: [4] namelen 0, left 32728 [ 8767.602908] i2p: [7] path: [ 8767.606832] i2p: [4] namelen 2, left 32727 [ 8767.612030] i2p: [7] path: / [ 8767.616050] i2p: [2] p1/ [ 8767.619676] i2p: [4] namelen 0, left 32724 [ 8767.624873] i2p: [7] path: [ 8767.628782] i2p: [4] namelen 2, left 32723 [ 8767.633970] i2p: [7] path: / [ 8767.637962] i2p: [2] p1/ [ 8767.641639] i2p: [4]
Re: Abysmal Performance
On 06/21/2011 05:26 AM, Henning Rohlfs wrote: On Tue, 21 Jun 2011 10:00:59 +0200, Sander wrote: Henning Rohlfs wrote (ao): - space_cache was enabled, but it seemed to make the problem worse. It's no longer in the mount options. space_cache is a one time mount option which enabled space_cache. Not supplying it anymore as a mount option has no effect (dmesg | grep btrfs). I'm sure that after the first reboot after removing the flag from the mount options, the system was faster for a while. That must have been a coincidence (or just an error on my part). No, the space cache will make your system faster _after_ having been enabled once. The reason for this is because we have to build the cache the slow way at first, and then after that we can do it the fast way. What is probably happening is your box is slowing down trying to build this cache. Don't mount with clear_cache unless there is a bug in your cache. Let it do it's thing and stuff will get faster. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Abysmal Performance
On Mon, 2011-06-20 at 23:51 +0200, Henning Rohlfs wrote: Hello, I've migrated my system to btrfs (raid1) a few months ago. Since then the performance has been pretty bad, but recently it's gotten unbearable: a simple sync called while the system is idle can take 20 up to 60 seconds. Creating or deleting files often has several seconds latency, too. I think I’ve been seeing a fairly similar, or possibly the same? issue as well. It looks like it’s actually a regression introduced in 2.6.39 - if I switch back to a 2.6.38 kernel, my latency issues magically go away! (I'm curious: does using the older 2.6.38.x kernel help with anyone else that's seeing the issue?) Some hardware/configuration details: btrfs on a single disc (Seagate Momentus XT hybrid), lzo compression and space cache enabled. Some snapshots in use. I notice that in latencytop I'm seeing a lot of lines with (cropped) traces like sleep_on_page wait_on_page_bit read_extent_buffer_ 13.3 msec 0.5 % showing up that I didn't see with the 2.6.38 kernel. I occasionally see latencies as bad as 20-30 seconds on operations like fsync or synchronous writes. I think I can reproduce the issue well enough to bisect it, so I might give that a try. It'll be slow going, though. -- Calvin Walton calvin.wal...@kepstin.ca -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: save preloaded extent_state's in a percpu cache V2
When doing DIO tracing I noticed we were doing a ton of allocations, a lot of the time for extent_states. Some of the time we don't even use the prealloc'ed extent_state, it just get's free'd up. So instead create a per-cpu cache like the radix tree stuff. So we will check to see if our per-cpu cache has a prealloc'ed extent_state in it and if so we just continue, else we alloc a new one and fill the cache. Then if we need to use a prealloc'ed extent_state we can just take it out of our per-cpu cache. We will also refill the cache on free to try and limit the number of times we have to ask the allocator for caches. With this patch dbench 50 goes from ~210 mb/s to ~260 mb/s. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- V1-V2: Fix a uneven preempt_disable() when we use a prealloc but still have to search again. fs/btrfs/extent_io.c | 167 + 1 files changed, 126 insertions(+), 41 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 7055d11..9adc614 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -12,6 +12,7 @@ #include linux/pagevec.h #include linux/prefetch.h #include linux/cleancache.h +#include linux/cpu.h #include extent_io.h #include extent_map.h #include compat.h @@ -31,6 +32,8 @@ static DEFINE_SPINLOCK(leak_lock); #define BUFFER_LRU_MAX 64 +static DEFINE_PER_CPU(struct extent_state *, extent_state_preloads) = NULL; + struct tree_entry { u64 start; u64 end; @@ -71,10 +74,36 @@ free_state_cache: return -ENOMEM; } +static void __free_extent_state(struct extent_state *state) +{ + if (!state) + return; + if (atomic_dec_and_test(state-refs)) { +#if LEAK_DEBUG + unsigned long flags; +#endif + WARN_ON(state-tree); +#if LEAK_DEBUG + spin_lock_irqsave(leak_lock, flags); + list_del(state-leak_list); + spin_unlock_irqrestore(leak_lock, flags); +#endif + kmem_cache_free(extent_state_cache, state); + } +} void extent_io_exit(void) { struct extent_state *state; struct extent_buffer *eb; + int cpu; + + for_each_possible_cpu(cpu) { + state = per_cpu(extent_state_preloads, cpu); + if (!state) + continue; + per_cpu(extent_state_preloads, cpu) = NULL; + __free_extent_state(state); + } while (!list_empty(states)) { state = list_entry(states.next, struct extent_state, leak_list); @@ -114,16 +143,11 @@ void extent_io_tree_init(struct extent_io_tree *tree, tree-mapping = mapping; } -static struct extent_state *alloc_extent_state(gfp_t mask) +static void init_extent_state(struct extent_state *state) { - struct extent_state *state; #if LEAK_DEBUG unsigned long flags; #endif - - state = kmem_cache_alloc(extent_state_cache, mask); - if (!state) - return state; state-state = 0; state-private = 0; state-tree = NULL; @@ -134,6 +158,16 @@ static struct extent_state *alloc_extent_state(gfp_t mask) #endif atomic_set(state-refs, 1); init_waitqueue_head(state-wq); +} + +static struct extent_state *alloc_extent_state(gfp_t mask) +{ + struct extent_state *state; + + state = kmem_cache_alloc(extent_state_cache, mask); + if (!state) + return state; + init_extent_state(state); return state; } @@ -142,6 +176,7 @@ void free_extent_state(struct extent_state *state) if (!state) return; if (atomic_dec_and_test(state-refs)) { + struct extent_state *tmp; #if LEAK_DEBUG unsigned long flags; #endif @@ -151,10 +186,53 @@ void free_extent_state(struct extent_state *state) list_del(state-leak_list); spin_unlock_irqrestore(leak_lock, flags); #endif - kmem_cache_free(extent_state_cache, state); + preempt_disable(); + tmp = __get_cpu_var(extent_state_preloads); + if (!tmp) { + init_extent_state(state); + __get_cpu_var(extent_state_preloads) = state; + } else { + kmem_cache_free(extent_state_cache, state); + } + preempt_enable(); } } +/* + * Pre-load an extent state. + */ +static int extent_state_preload(gfp_t mask) +{ + struct extent_state *state; + + preempt_disable(); + state = __get_cpu_var(extent_state_preloads); + if (!state) { + struct extent_state *tmp; + preempt_enable(); + tmp = alloc_extent_state(mask); + if (!tmp) + return -ENOMEM; + preempt_disable(); + state = __get_cpu_var(extent_state_preloads); +
[PATCH] Btrfs: fix how we merge extent states and deal with cached states V2
First, we can sometimes free the state we're merging, which means anybody who calls merge_state() may have the state it passed in free'ed. This is problematic because we could end up caching the state, which makes caching useless as the state will no longer be part of the tree. So instead of free'ing the state we passed into merge_state(), set it's end to the other-end and free the other state. This way we are sure to cache the correct state. Also because we can merge states together, instead of only using the cache'd state if it's start == the start we are looking for, go ahead and use it if the start we are looking for is within the range of the cached state. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- V1-V2: -Make this patch apply properly ontop of the new percpu cache -Accidently removed a cache_state(), add that back fs/btrfs/extent_io.c | 23 +++ 1 files changed, 11 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 9adc614..9a43a96 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -359,11 +359,10 @@ static int merge_state(struct extent_io_tree *tree, if (other-start == state-end + 1 other-state == state-state) { merge_cb(tree, state, other); - other-start = state-start; - state-tree = NULL; - rb_erase(state-rb_node, tree-state); - free_extent_state(state); - state = NULL; + state-end = other-end; + other-tree = NULL; + rb_erase(other-rb_node, tree-state); + free_extent_state(other); } } @@ -429,7 +428,6 @@ static int insert_state(struct extent_io_tree *tree, %llu %llu\n, (unsigned long long)found-start, (unsigned long long)found-end, (unsigned long long)start, (unsigned long long)end); - free_extent_state(state); return -EEXIST; } state-tree = tree; @@ -589,7 +587,8 @@ again: cached_state = NULL; } - if (cached cached-tree cached-start == start) { + if (cached cached-tree cached-start = start + cached-end start) { if (clear) atomic_dec(cached-refs); state = cached; @@ -832,7 +831,8 @@ again: spin_lock(tree-lock); if (cached_state *cached_state) { state = *cached_state; - if (state-start == start state-tree) { + if (state-start = start state-end start + state-tree) { node = state-rb_node; goto hit_next; } @@ -872,13 +872,13 @@ hit_next: if (err) goto out; - next_node = rb_next(node); cache_state(state, cached_state); merge_state(tree, state); if (last_end == (u64)-1) goto out; start = last_end + 1; + next_node = rb_next(state-rb_node); if (next_node start end !need_resched()) { state = rb_entry(next_node, struct extent_state, rb_node); @@ -950,7 +950,6 @@ hit_next: * Avoid to free 'prealloc' if it can be merged with * the later extent. */ - atomic_inc(prealloc-refs); err = insert_state(tree, prealloc, start, this_end, bits); BUG_ON(err == -EEXIST); @@ -959,7 +958,6 @@ hit_next: goto out; } cache_state(prealloc, cached_state); - free_extent_state(prealloc); start = this_end + 1; goto search_again; } @@ -1649,7 +1647,8 @@ int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end, int bitset = 0; spin_lock(tree-lock); - if (cached cached-tree cached-start == start) + if (cached cached-tree cached-start = start + cached-end start) node = cached-rb_node; else node = tree_search(tree, start); -- 1.7.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] far: remove i_alloc_sem abuse
Christoph Hellwig h...@infradead.org writes: Add a new rw_semaphore to protect bmap against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. In FAT case, -i_mutex was better. But, last time I saw, shmfs was using -i_mutex to call -bmap. So, this was chosen instead. I'm not checking current version yet though, if shmfs had change, we can use -i_mutex. BTW, this patch looks good to me. Thanks. Signed-off-by: Christoph Hellwig h...@lst.de Index: linux-2.6/fs/fat/inode.c === --- linux-2.6.orig/fs/fat/inode.c 2011-06-20 21:28:19.707963855 +0200 +++ linux-2.6/fs/fat/inode.c 2011-06-20 21:29:25.031293882 +0200 @@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address sector_t blocknr; /* fat_get_cluster() assumes the requested blocknr isn't truncated. */ - down_read(mapping-host-i_alloc_sem); + down_read(MSDOS_I(mapping-host)-truncate_lock); blocknr = generic_block_bmap(mapping, block, fat_get_block); - up_read(mapping-host-i_alloc_sem); + up_read(MSDOS_I(mapping-host)-truncate_lock); return blocknr; } @@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS); if (!ei) return NULL; + + init_rwsem(ei-truncate_lock); return ei-vfs_inode; } Index: linux-2.6/fs/fat/fat.h === --- linux-2.6.orig/fs/fat/fat.h 2011-06-20 21:28:19.724630522 +0200 +++ linux-2.6/fs/fat/fat.h2011-06-20 21:29:25.034627215 +0200 @@ -109,6 +109,7 @@ struct msdos_inode_info { int i_attrs;/* unused attribute bits */ loff_t i_pos; /* on-disk position of directory entry or 0 */ struct hlist_node i_fat_hash; /* hash by i_location */ + struct rw_semaphore truncate_lock; /* protect bmap against truncate */ struct inode vfs_inode; }; Index: linux-2.6/fs/fat/file.c === --- linux-2.6.orig/fs/fat/file.c 2011-06-20 21:28:19.744630521 +0200 +++ linux-2.6/fs/fat/file.c 2011-06-20 21:29:54.501292390 +0200 @@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s } if (attr-ia_valid ATTR_SIZE) { + down_write(MSDOS_I(inode)-truncate_lock); truncate_setsize(inode, attr-ia_size); fat_truncate_blocks(inode, attr-ia_size); + up_write(MSDOS_I(inode)-truncate_lock); } setattr_copy(inode, attr); -- OGAWA Hirofumi hirof...@mail.parknet.co.jp -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] far: remove i_alloc_sem abuse
OGAWA Hirofumi hirof...@mail.parknet.co.jp writes: Christoph Hellwig h...@infradead.org writes: Add a new rw_semaphore to protect bmap against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. In FAT case, -i_mutex was better. But, last time I saw, shmfs was using -i_mutex to call -bmap. So, this was chosen instead. I'm not checking current version yet though, if shmfs had change, we can use -i_mutex. It was swapfile. And unfortunately it doesn't have change. BTW, this patch looks good to me. -- OGAWA Hirofumi hirof...@mail.parknet.co.jp -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] far: remove i_alloc_sem abuse
On Wed, Jun 22, 2011 at 12:57:43AM +0900, OGAWA Hirofumi wrote: Christoph Hellwig h...@infradead.org writes: Add a new rw_semaphore to protect bmap against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. In FAT case, -i_mutex was better. But, last time I saw, shmfs was using -i_mutex to call -bmap. So, this was chosen instead. I'm not checking current version yet though, if shmfs had change, we can use -i_mutex. I tried that first, but it gives an instant deadlock when using a swapfile on fat. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Abysmal Performance
On Tue, 21 Jun 2011 11:18:30 -0400, Josef Bacik wrote: On 06/21/2011 05:26 AM, Henning Rohlfs wrote: On Tue, 21 Jun 2011 10:00:59 +0200, Sander wrote: Henning Rohlfs wrote (ao): - space_cache was enabled, but it seemed to make the problem worse. It's no longer in the mount options. space_cache is a one time mount option which enabled space_cache. Not supplying it anymore as a mount option has no effect (dmesg | grep btrfs). I'm sure that after the first reboot after removing the flag from the mount options, the system was faster for a while. That must have been a coincidence (or just an error on my part). No, the space cache will make your system faster _after_ having been enabled once. The reason for this is because we have to build the cache the slow way at first, and then after that we can do it the fast way. What is probably happening is your box is slowing down trying to build this cache. Don't mount with clear_cache unless there is a bug in your cache. Let it do it's thing and stuff will get faster. I'm just reporting what I experienced. I had space_cache in the mount options while the problem developed and removed it when the system got too slow. After the next reboot the system was responsive for a short time (an hour maybe - which seems to have been unrelated to the mount option though from what you described). Now there's no difference whatsoever between no options, space_cache and clear_cache. To sum it up: I only played with the clear_cache option because the system got too slow in the first place. I don't see how the problem can be related to this option if changing it it makes no difference. Thanks, Henning -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse
On Mon, 20 Jun 2011, Christoph Hellwig wrote: Add a new rw_semaphore to protect page_mkwrite against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. Signed-off-by: Christoph Hellwig h...@lst.de Index: linux-2.6/fs/ext4/inode.c === --- linux-2.6.orig/fs/ext4/inode.c2011-06-20 14:25:33.779248128 +0200 +++ linux-2.6/fs/ext4/inode.c 2011-06-20 14:27:53.025907745 +0200 @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry, if (attr-ia_size sbi-s_bitmap_maxbytes) return -EFBIG; } + down_write((EXT4_I(inode)-truncate_lock)); } if (S_ISREG(inode-i_mode) @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry, ext4_truncate(inode); } + if (attr-ia_valid ATTR_SIZE) + up_write((EXT4_I(inode)-truncate_lock)); + Hi Christoph, this looks like a bug fix, so we should rather do it in a separate commit. Could you send it separately, or at least mention that in commit description ? Otherwise the patch looks good. Thanks! -Lukas if (!rc) { setattr_copy(inode, attr); mark_inode_dirty(inode); @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str struct address_space *mapping = inode-i_mapping; /* - * Get i_alloc_sem to stop truncates messing with the inode. We cannot - * get i_mutex because we are already holding mmap_sem. + * Get truncate_lock to stop truncates messing with the inode. We + * cannot get i_mutex because we are already holding mmap_sem. */ - down_read(inode-i_alloc_sem); + down_read((EXT4_I(inode)-truncate_lock)); size = i_size_read(inode); if (page-mapping != mapping || size = page_offset(page) || !PageUptodate(page)) { @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str lock_page(page); wait_on_page_writeback(page); if (PageMappedToDisk(page)) { - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return VM_FAULT_LOCKED; } @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return VM_FAULT_LOCKED; } } @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str */ lock_page(page); wait_on_page_writeback(page); - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return VM_FAULT_LOCKED; out_unlock: if (ret) ret = VM_FAULT_SIGBUS; - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return ret; } Index: linux-2.6/fs/ext4/super.c === --- linux-2.6.orig/fs/ext4/super.c2011-06-20 14:25:33.795914793 +0200 +++ linux-2.6/fs/ext4/super.c 2011-06-20 14:27:06.269243443 +0200 @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st ei-i_datasync_tid = 0; atomic_set(ei-i_ioend_count, 0); atomic_set(ei-i_aiodio_unwritten, 0); + init_rwsem(ei-truncate_lock); return ei-vfs_inode; } Index: linux-2.6/fs/ext4/ext4.h === --- linux-2.6.orig/fs/ext4/ext4.h 2011-06-20 14:25:33.752581461 +0200 +++ linux-2.6/fs/ext4/ext4.h 2011-06-20 14:27:06.272576777 +0200 @@ -844,6 +844,9 @@ struct ext4_inode_info { /* on-disk additional length */ __u16 i_extra_isize; + /* protect page_mkwrite against truncates */ + struct rw_semaphore truncate_lock; + #ifdef CONFIG_QUOTA /* quota space reservation, managed internally by quota code */ qsize_t i_reserved_quota; -- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse
On Tue, Jun 21, 2011 at 06:34:50PM +0200, Lukas Czerner wrote: this looks like a bug fix, so we should rather do it in a separate commit. Could you send it separately, or at least mention that in commit description ? What looks like a bugfix? -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse
On Tue, 21 Jun 2011, Lukas Czerner wrote: On Mon, 20 Jun 2011, Christoph Hellwig wrote: Add a new rw_semaphore to protect page_mkwrite against truncate. Previous i_alloc_sem was abused for this, but it's going away in this series. Signed-off-by: Christoph Hellwig h...@lst.de Index: linux-2.6/fs/ext4/inode.c === --- linux-2.6.orig/fs/ext4/inode.c 2011-06-20 14:25:33.779248128 +0200 +++ linux-2.6/fs/ext4/inode.c 2011-06-20 14:27:53.025907745 +0200 @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry, if (attr-ia_size sbi-s_bitmap_maxbytes) return -EFBIG; } + down_write((EXT4_I(inode)-truncate_lock)); } if (S_ISREG(inode-i_mode) @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry, ext4_truncate(inode); } + if (attr-ia_valid ATTR_SIZE) + up_write((EXT4_I(inode)-truncate_lock)); + Hi Christoph, this looks like a bug fix, so we should rather do it in a separate commit. Could you send it separately, or at least mention that in commit description ? That's stupid note, it is not a bugfix of course. Ignore it, sorry for the noise. Otherwise the patch looks good. Thanks! -Lukas if (!rc) { setattr_copy(inode, attr); mark_inode_dirty(inode); @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str struct address_space *mapping = inode-i_mapping; /* -* Get i_alloc_sem to stop truncates messing with the inode. We cannot -* get i_mutex because we are already holding mmap_sem. +* Get truncate_lock to stop truncates messing with the inode. We +* cannot get i_mutex because we are already holding mmap_sem. */ - down_read(inode-i_alloc_sem); + down_read((EXT4_I(inode)-truncate_lock)); size = i_size_read(inode); if (page-mapping != mapping || size = page_offset(page) || !PageUptodate(page)) { @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str lock_page(page); wait_on_page_writeback(page); if (PageMappedToDisk(page)) { - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return VM_FAULT_LOCKED; } @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str if (page_has_buffers(page)) { if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_unmapped)) { - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return VM_FAULT_LOCKED; } } @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str */ lock_page(page); wait_on_page_writeback(page); - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return VM_FAULT_LOCKED; out_unlock: if (ret) ret = VM_FAULT_SIGBUS; - up_read(inode-i_alloc_sem); + up_read((EXT4_I(inode)-truncate_lock)); return ret; } Index: linux-2.6/fs/ext4/super.c === --- linux-2.6.orig/fs/ext4/super.c 2011-06-20 14:25:33.795914793 +0200 +++ linux-2.6/fs/ext4/super.c 2011-06-20 14:27:06.269243443 +0200 @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st ei-i_datasync_tid = 0; atomic_set(ei-i_ioend_count, 0); atomic_set(ei-i_aiodio_unwritten, 0); + init_rwsem(ei-truncate_lock); return ei-vfs_inode; } Index: linux-2.6/fs/ext4/ext4.h === --- linux-2.6.orig/fs/ext4/ext4.h 2011-06-20 14:25:33.752581461 +0200 +++ linux-2.6/fs/ext4/ext4.h2011-06-20 14:27:06.272576777 +0200 @@ -844,6 +844,9 @@ struct ext4_inode_info { /* on-disk additional length */ __u16 i_extra_isize; + /* protect page_mkwrite against truncates */ + struct rw_semaphore truncate_lock; + #ifdef CONFIG_QUOTA /* quota space reservation, managed internally by quota code */ qsize_t i_reserved_quota; -- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: save preloaded extent_state's in a percpu cache V2
Josef Bacik jo...@redhat.com writes: When doing DIO tracing I noticed we were doing a ton of allocations, a lot of the time for extent_states. Some of the time we don't even use the prealloc'ed extent_state, it just get's free'd up. So instead create a per-cpu cache like the radix tree stuff. So we will check to see if our per-cpu cache has a prealloc'ed extent_state in it and if so we just continue, else we alloc a new one and fill the cache. Then if we need to use a prealloc'ed extent_state we can just take it out of our per-cpu cache. We will also refill the cache on free to try and limit the number of times we have to ask the allocator for caches. With this patch dbench 50 goes from ~210 mb/s to ~260 mb/s. Thanks, You're just reimplementing a poor man's custom slab cache -- all of this is already done in slab. If the difference is really that big better fix slab and have everyone benefit? Did you use slub or slab? Did you analyze where the cycles are spent? -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: save preloaded extent_state's in a percpu cache V2
On 06/21/2011 04:20 PM, Andi Kleen wrote: Josef Bacik jo...@redhat.com writes: When doing DIO tracing I noticed we were doing a ton of allocations, a lot of the time for extent_states. Some of the time we don't even use the prealloc'ed extent_state, it just get's free'd up. So instead create a per-cpu cache like the radix tree stuff. So we will check to see if our per-cpu cache has a prealloc'ed extent_state in it and if so we just continue, else we alloc a new one and fill the cache. Then if we need to use a prealloc'ed extent_state we can just take it out of our per-cpu cache. We will also refill the cache on free to try and limit the number of times we have to ask the allocator for caches. With this patch dbench 50 goes from ~210 mb/s to ~260 mb/s. Thanks, You're just reimplementing a poor man's custom slab cache -- all of this is already done in slab. If the difference is really that big better fix slab and have everyone benefit? Did you use slub or slab? Did you analyze where the cycles are spent? Ugh slub debugging bites me again. Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] remove i_alloc_sem
On Mon 20-06-11 16:15:33, Christoph Hellwig wrote: i_alloc_sem has always been a bit of an odd lock. It's the only remaining rw_semaphore that can be released by a different thread than the one that locked it, and it's use case in the core direct I/O code is more like a counter given that the writers already have external serialization. This series removes it in favour of a simpler counter scheme, thus getting rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the same time shrinking the size of struct inode by 160 bytes on 64-bit systems. The only nasty bit is that two filesystems (fat and ext4) have started abusing the lock for their own purposes. I've added a new rw_semaphore ext4 abuse should be gone when Ted merges my rewrite of ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend it? Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/12 v3] Btrfs: deal with EEXIST after iput
On 06/21/2011 10:00 PM, Josef Bacik wrote: On 06/21/2011 04:49 AM, Liu Bo wrote: There are two cases when BTRFS_I(inode)-logged_trans is zero: a) an inode is just allocated; b) iput an inode and reread it. However, in b) if btrfs is not committed yet, and this inode _may_ still remain in log tree. So we need to check the log tree to get logged_trans a right value in case it hits a EEXIST while logging. Instead of doing this why not just check and see if the inode has been logged but the transaction has not yet been committed in btrfs_drop_inode? That way the inode doesn't get evicted from cache until after we know it's ok and that way we don't have to waste a tree lookup. Thanks, Good idea, I'll follow it. thanks, liubo Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Christopher Whelan wants to chat
--- Christopher Whelan wants to stay in better touch using some of Google's coolest new products. If you already have Gmail or Google Talk, visit: http://mail.google.com/mail/b-ee9a324c7f-06c7f01d3e-rl8EXZ4nBadGESgcw_qYvUYj1Po You'll need to click this link to be able to chat with Christopher Whelan. To get Gmail - a free email account from Google with over 2,800 megabytes of storage - and chat with Christopher Whelan, visit: http://mail.google.com/mail/a-ee9a324c7f-06c7f01d3e-rl8EXZ4nBadGESgcw_qYvUYj1Po Gmail offers: - Instant messaging right inside Gmail - Powerful spam protection - Built-in search for finding your messages and a helpful way of organizing emails into conversations - No pop-up ads or untargeted banners - just text ads and related information that are relevant to the content of your messages All this, and its yours for free. But wait, there's more! By opening a Gmail account, you also get access to Google Talk, Google's instant messaging service: http://www.google.com/talk/ Google Talk offers: - Web-based chat that you can use anywhere, without a download - A contact list that's synchronized with your Gmail account - Free, high quality PC-to-PC voice calls when you download the Google Talk client We're working hard to add new features and make improvements, so we might also ask for your comments and suggestions periodically. We appreciate your help in making our products even better! Thanks, The Google Team To learn more about Gmail and Google Talk, visit: http://mail.google.com/mail/help/about.html http://www.google.com/talk/about.html (If clicking the URLs in this message does not work, copy and paste them into the address bar of your browser). -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at fs/btrfs/delayed-inode.c:1301!
On Tue, 21 Jun 2011 02:08:54 +0200, David Sterba wrote: On Mon, Jun 20, 2011 at 06:12:10PM +0800, Miao Xie wrote: From 457f39393b2e3d475fbba029b90b6a4e17b94d43 Mon Sep 17 00:00:00 2001 From: Miao Xie mi...@cn.fujitsu.com Date: Mon, 20 Jun 2011 17:21:51 +0800 Subject: [PATCH] btrfs: fix inconsonant inode information When iputting the inode, We may leave the delayed nodes if they have some delayed items that have not been dealt with. So when the inode is read again, we must look up the relative delayed node, and use the information in it to initialize the inode. Or we will get inconsonant inode information. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/delayed-inode.c | 104 +++--- fs/btrfs/delayed-inode.h |1 + fs/btrfs/inode.c | 12 - 3 files changed, 91 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index f1cbd02..280755e 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -82,19 +82,16 @@ static inline struct btrfs_delayed_root *btrfs_get_delayed_root( return root-fs_info-delayed_root; } -static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node( -struct inode *inode) +static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode) { -struct btrfs_delayed_node *node; struct btrfs_inode *btrfs_inode = BTRFS_I(inode); struct btrfs_root *root = btrfs_inode-root; u64 ino = btrfs_ino(inode); -int ret; +struct btrfs_delayed_node *node; -again: node = ACCESS_ONCE(btrfs_inode-delayed_node); do you still need the volatile access here, after the again: label has been removed? it does not break things if it's there, but it raises questions ... The rule of ACCESS_ONCE said by Linus is: if you access unlocked values, you use ACCESS_ONCE(). (See: http://yarchive.net/comp/linux/ACCESS_ONCE.html) So I think it is still needed. if (node) { -atomic_inc(node-refs);/* can be accessed */ +atomic_inc(node-refs); return node; } @@ -103,7 +100,9 @@ again: if (node) { if (btrfs_inode-delayed_node) { spin_unlock(root-inode_lock); -goto again; +BUG_ON(btrfs_inode-delayed_node != node); +atomic_inc(node-refs);/* can be accessed */ +return node; } btrfs_inode-delayed_node = node; atomic_inc(node-refs);/* can be accessed */ @@ -113,6 +112,23 @@ again: } spin_unlock(root-inode_lock); +return NULL; +} + +static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node( +struct inode *inode) +{ +struct btrfs_delayed_node *node; +struct btrfs_inode *btrfs_inode = BTRFS_I(inode); +struct btrfs_root *root = btrfs_inode-root; +u64 ino = btrfs_ino(inode); +int ret; + +again: +node = btrfs_get_delayed_node(inode); ... aha, it's been somehow moved here, which copies the original logic. Now reading inode-delayed_node is inside a function and I do not think that compiler could optimize reading value of btrfs_inode-delayed_node that it would require ACCESS_ONCE. The reason that I rewrote btrfs_get_delayed_node() is: The old btrfs_get_delayed_node() may ignore the old delayed node which holds the up-to-date information (such as: new directory indexes, up-to-date inode information), considers there is no relative delayed node. And then btrfs will use the max index number in the fs tree to initialize -index_cnt, but in fact, this number is not right, perhaps there are some directory indexes in the delayed node, the index number of them is greater than the one in the fs tree. In this way, the same index number may be allocated twice, and hit EEXIST error. And there is another ACCESS_ONCE in btrfs_remove_delayed_node. I wonder what's the reason for that. Sorry to abuse this thread, but I'd like to be sure about protection of the -delayed_node members inside btrfs_inode. Can you please comment on that? OK, I'll write some comment. If you recall one message from the original report: [ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17) (-17 == -EEXIST) a printk after return from __btrfs_add_delayed_item (which is able to return -EEXIST) in btrfs_insert_delayed_dir_index. I haven't looked farther, but it seems that the item is being inserted (at least) twice and I suspect missing locking or other type of protection. I don't think. The reason is above. Thanks Miao thanks, david +if (node) +return node; + node =
[PATCH] Btrfs: remove unneeded BUG_ON()
The following functions always return 0. - add_delayed_ref_head() - add_delayed_tree_ref() - add_delayed_data_ref() - add_excluded_extent() Therefore, check by BUG_ON() is unnecessary at the caller of these functions. Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com --- fs/btrfs/delayed-ref.c | 32 +--- fs/btrfs/extent-tree.c | 14 -- 2 files changed, 17 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 125cf76..ce0c728 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -588,7 +588,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_tree_ref *ref; struct btrfs_delayed_ref_head *head_ref; struct btrfs_delayed_ref_root *delayed_refs; - int ret; BUG_ON(extent_op extent_op-is_data); ref = kmalloc(sizeof(*ref), GFP_NOFS); @@ -610,13 +609,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans, * insert both the head node and the new ref without dropping * the spin lock */ - ret = add_delayed_ref_head(trans, head_ref-node, bytenr, num_bytes, - action, 0); - BUG_ON(ret); + add_delayed_ref_head(trans, head_ref-node, bytenr, num_bytes, +action, 0); + + add_delayed_tree_ref(trans, ref-node, bytenr, num_bytes, +parent, ref_root, level, action); - ret = add_delayed_tree_ref(trans, ref-node, bytenr, num_bytes, - parent, ref_root, level, action); - BUG_ON(ret); spin_unlock(delayed_refs-lock); return 0; } @@ -633,7 +631,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_data_ref *ref; struct btrfs_delayed_ref_head *head_ref; struct btrfs_delayed_ref_root *delayed_refs; - int ret; BUG_ON(extent_op !extent_op-is_data); ref = kmalloc(sizeof(*ref), GFP_NOFS); @@ -655,13 +652,12 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans, * insert both the head node and the new ref without dropping * the spin lock */ - ret = add_delayed_ref_head(trans, head_ref-node, bytenr, num_bytes, - action, 1); - BUG_ON(ret); + add_delayed_ref_head(trans, head_ref-node, bytenr, num_bytes, +action, 1); + + add_delayed_data_ref(trans, ref-node, bytenr, num_bytes, +parent, ref_root, owner, offset, action); - ret = add_delayed_data_ref(trans, ref-node, bytenr, num_bytes, - parent, ref_root, owner, offset, action); - BUG_ON(ret); spin_unlock(delayed_refs-lock); return 0; } @@ -672,7 +668,6 @@ int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans, { struct btrfs_delayed_ref_head *head_ref; struct btrfs_delayed_ref_root *delayed_refs; - int ret; head_ref = kmalloc(sizeof(*head_ref), GFP_NOFS); if (!head_ref) @@ -683,10 +678,9 @@ int btrfs_add_delayed_extent_op(struct btrfs_trans_handle *trans, delayed_refs = trans-transaction-delayed_refs; spin_lock(delayed_refs-lock); - ret = add_delayed_ref_head(trans, head_ref-node, bytenr, - num_bytes, BTRFS_UPDATE_DELAYED_HEAD, - extent_op-is_data); - BUG_ON(ret); + add_delayed_ref_head(trans, head_ref-node, bytenr, +num_bytes, BTRFS_UPDATE_DELAYED_HEAD, +extent_op-is_data); spin_unlock(delayed_refs-lock); return 0; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1f61bf5..15286d1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -225,9 +225,7 @@ static int exclude_super_stripes(struct btrfs_root *root, if (cache-key.objectid BTRFS_SUPER_INFO_OFFSET) { stripe_len = BTRFS_SUPER_INFO_OFFSET - cache-key.objectid; cache-bytes_super += stripe_len; - ret = add_excluded_extent(root, cache-key.objectid, - stripe_len); - BUG_ON(ret); + add_excluded_extent(root, cache-key.objectid, stripe_len); } for (i = 0; i BTRFS_SUPER_MIRROR_MAX; i++) { @@ -239,9 +237,7 @@ static int exclude_super_stripes(struct btrfs_root *root, while (nr--) { cache-bytes_super += stripe_len; - ret = add_excluded_extent(root, logical[nr], - stripe_len); - BUG_ON(ret); + add_excluded_extent(root, logical[nr], stripe_len); } kfree(logical);