Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
On 19/05/14 02:33, Chris Mason wrote: I had this one in the integration branch, but lockdep reported troubles. It looks like lockdep is correct. find_free_extent is nesting the cache rwsem inside the groups rwsem, but btrfs_write_out_cache is holding the new cache rwsem when it calls find_free_extent. Is there a new patch for this problem? We have another issue that makes us reboot the system quite often - I'm going to report that in a separate thread - and after each machine restart, btrfs reports broken free space cache. I know that's probably not that harmful, but it's disturbing and it's not that easy to remount the filesystem to fix the free space cache. Cheers, Alin. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
On 01/15/2014 07:00 AM, Miao Xie wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. I had this one in the integration branch, but lockdep reported troubles. It looks like lockdep is correct. find_free_extent is nesting the cache rwsem inside the groups rwsem, but btrfs_write_out_cache is holding the new cache rwsem when it calls find_free_extent. -chris [ 2857.610731] == [ 2857.623158] [ INFO: possible circular locking dependency detected ] [ 2857.635771] 3.15.0-rc5-mason+ #43 Not tainted [ 2857.644553] --- [ 2857.657139] btrfs-transacti/19476 is trying to acquire lock: [ 2857.668518] (found-groups_sem){..}, at: [a059dbc1] find_free_extent+0x931/0xe20 [btrfs] [ 2857.687771] [ 2857.687771] but task is already holding lock: [ 2857.699566] (cache-data_rwsem){..}, at: [a05f78bf] btrfs_write_out_cache+0x9f/0x170 [btrfs] [ 2857.719480] [ 2857.719480] which lock already depends on the new lock. [ 2857.719480] [ 2857.736021] [ 2857.736021] the existing dependency chain (in reverse order) is: [ 2857.751120] - #1 (cache-data_rwsem){..}: [ 2857.760823][810a14fe] lock_acquire+0x8e/0x110 [ 2857.772772][81649cf7] down_read+0x47/0x60 [ 2857.784028][a059db2c] find_free_extent+0x89c/0xe20 [btrfs] [ 2857.798253][a059e11b] btrfs_reserve_extent+0x6b/0x140 [btrfs] [ 2857.813041][a05b73ec] cow_file_range+0x13c/0x460 [btrfs] [ 2857.826892][a05bc097] run_delalloc_range+0x347/0x380 [btrfs] [ 2857.841510][a05d3f3d] __extent_writepage+0x70d/0x870 [btrfs] [ 2857.856129][a05d456a] extent_write_cache_pages.clone.6+0x30a/0x410 [btrfs] [ 2857.873185][a05d46c2] extent_writepages+0x52/0x70 [btrfs] [ 2857.887224][a05b3d57] btrfs_writepages+0x27/0x30 [btrfs] [ 2857.901078][81142543] do_writepages+0x23/0x40 [ 2857.913034][81135b99] __filemap_fdatawrite_range+0x59/0x60 [ 2857.927240][81135dec] filemap_flush+0x1c/0x20 [ 2857.939215][a05b2502] btrfs_run_delalloc_work+0x72/0xa0 [btrfs] [ 2857.954367][a05e05fe] normal_work_helper+0x6e/0x2d0 [btrfs] [ 2857.968749][8106b9e2] process_one_work+0x1d2/0x550 [ 2857.981561][8106cd8f] worker_thread+0x11f/0x3a0 [ 2857.993856][8107317e] kthread+0xde/0x100 [ 2858.004936][8165436c] ret_from_fork+0x7c/0xb0 [ 2858.016887] - #0 (found-groups_sem){..}: [ 2858.026590][810a12de] __lock_acquire+0x161e/0x17b0 [ 2858.039407][810a14fe] lock_acquire+0x8e/0x110 [ 2858.051370][81649cf7] down_read+0x47/0x60 [ 2858.062629][a059dbc1] find_free_extent+0x931/0xe20 [btrfs] [ 2858.076841][a059e11b] btrfs_reserve_extent+0x6b/0x140 [btrfs] [ 2858.091629][a059e307] btrfs_alloc_free_block+0x117/0x420 [btrfs] [ 2858.106940][a0589a5b] __btrfs_cow_block+0x11b/0x530 [btrfs] [ 2858.121331][a058a4a0] btrfs_cow_block+0x130/0x1e0 [btrfs] [ 2858.135375][a058c999] btrfs_search_slot+0x219/0x9c0 [btrfs] [ 2858.149760][a05f7595] __btrfs_write_out_cache+0x755/0x970 [btrfs] [ 2858.165245][a05f7958] btrfs_write_out_cache+0x138/0x170 [btrfs] [ 2858.180411][a059ccb0] btrfs_write_dirty_block_groups+0x480/0x600 [btrfs] [ 2858.197107][a05ae7af] commit_cowonly_roots+0x19f/0x250 [btrfs] [ 2858.212084][a05afbc0] btrfs_commit_transaction+0x450/0xa60 [btrfs] [ 2858.227738][a05aa8a6] transaction_kthread+0x216/0x290 [btrfs] [ 2858.242533][8107317e] kthread+0xde/0x100 [ 2858.253617][8165436c] ret_from_fork+0x7c/0xb0 [ 2858.265569] [ 2858.265569] other info that might
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
Hi Chris, Josef This patch is an important bug fix patch, could you merge it into your tree? Thanks Miao On wed, 15 Jan 2014 20:00:58 +0800, Miao Xie wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* + * It is used to record the extents that are allocated for + * the data, but don/t update its metadata. + */ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* + * It is just used for the delayed data space allocation + * because only the data space allocation can be done during + * we write out the free space cache. + */ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, - struct btrfs_key *ins, int is_data); + struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8072cfa..426b558 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb, fs_info-pinned_extents = fs_info-freed_extents[0]; fs_info-do_barriers = 1; - mutex_init(fs_info-ordered_operations_mutex); mutex_init(fs_info-ordered_extent_flush_mutex); mutex_init(fs_info-tree_log_mutex); @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb, init_rwsem(fs_info-extent_commit_sem); init_rwsem(fs_info-cleanup_work_sem); init_rwsem(fs_info-subvol_sem); + init_rwsem(fs_info-data_rwsem); sema_init(fs_info-uuid_tree_rescan_sem, 1); fs_info-dev_replace.lock_owner = 0; atomic_set(fs_info-dev_replace.nesting_level, 0); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3664cfb..7b07876 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6173,7 +6173,7 @@ enum btrfs_loop_type { static noinline int find_free_extent(struct btrfs_root *orig_root, u64 num_bytes, u64 empty_size, u64 hint_byte, struct btrfs_key *ins, - u64 flags) + u64 flags, bool need_pin) { int ret = 0; struct btrfs_root *root = orig_root-fs_info-extent_root; @@ -6502,6 +6502,16 @@ checks: ins-objectid = search_start; ins-offset = num_bytes; + if (need_pin) { + ASSERT(search_start = block_group-key.objectid +search_start block_group-key.objectid + +
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
Thanks, Miao, so the problem is that cow_file_range() joins transaction, allocates space through btrfs_reserve_extent(), then detaches from transaction. And then btrfs_finish_ordered_io() joins transaction again, adds a delayed ref and detaches from transaction. So if between these two, the transaction commits and we crash, then yes, the allocation is lost. Alex. On Tue, Mar 4, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote: On sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote: Hi Miao, On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. Can you please clarify more about the problem? So I understand that we allocate something from the free space cache. So after that, the free space cache does not account for this extent anymore. On the other hand the extent tree also does not account for it (yet). We need to add a delayed reference, which will be run at transaction commit and update the extent tree. But free-space cache is also written out during transaction commit. So how the issue happens? Can you perhaps post a flow of events? Task1 Task2 btrfs_writepages() alloc data space (The allocated space was removed from the free space cache) submit_bio() btrfs_commit_transaction() write out space cache ... commit transaction completed system crash (end_io() wasn't executed) The system crashed before end_io was executed, so no file references to the allocated space, and the extent tree also does not account for it. That space was lost. Thanks Miao Thanks. Alex. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* +* It is used to record the extents that are allocated for +* the data, but don/t update its metadata. +*/ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* +* It is just used for the delayed data space allocation +* because only the data space allocation can be done during +* we write out the free space cache. +*/ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, -struct btrfs_key *ins, int is_data); +struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 03/05/2014 02:02 AM, Miao Xie wrote: On Tue, 4 Mar 2014 10:19:20 -0500, Josef Bacik wrote: On 01/15/2014 07:00 AM, Miao Xie wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. I didn't like this idea at first but I've come around to it. The only thing is the data_rwsem thing, we don't need it as we are protected by the transaction being blocked when we do writeout, so nobody can data allocations during this time. Thanks, But this protection was removed by the patch Commit ID: 00361589d2eebd90fca022148c763e40d3e90871 Excellent point, then I'm good overall, I'll pull it in once I finish qgroups. Thanks, Josef -BEGIN PGP SIGNATURE- Version: GnuPG v1 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJTFzyJAAoJEANb+wAKly3BFHMP+wctKXDuKxP4kI27bfRlfJmV QB5i6qXr4WvQQw04FF3BtxCW9Z36l/1cFLFK0OaQ8Q54+4s0FUXNAynXFkf41TJz XWhkjTq0hJmzM95tB2+B2HwtEDI6m0l6fnOhsyiiSHF8V1g7V9VRwkeqpB9ZGu4/ ZryAUkSJGXFDvtFgTmRvOAclwD0R6oCXCw3f/AgtZQL1H9ucaogI2XKmcsFC jOIbn7L+gtFmTAJdvSFlRQAYaV2g69rf0Q5bVHZMAaLKN5rMIXBC2xFOCUxwg3si ZyOl72ojaGbCt7MI/s2X0uZ5d+xWYjaG+tF2K+XLjFoIUkny3RndFfU6pKk54gHP 9O/GXiilq2t3qZUn3zMuLXIG+cCaYTt3QsHnNyJisqOVLL95LbIvsINm0Xgu6bF/ cJ0acApJr6y2EdEbfVU/mrB06K81bxnZez8rOgyFXKD4yWoTMG23xttKZHG5LbdT xkwrJWPeJ77mI0+V/MPWePgomcH5cDs0IckSOXXwy8gZF4HJzVbXklcq22BfgIQA SLVgJYxqIlzIHYHZPisWJHUwFXe4C58YCDP2w4FI32g5LZuzhlus/wFI9Tg1543w sYf23ZYxlDUYAiD+zb+UipEA4CLtdZgZGonoG9lxK9JkgU2VHzXuTgty2+B0F9kt l3B5jpy1H2cP2TUskkbZ =E0en -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 01/15/2014 07:00 AM, Miao Xie wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. I didn't like this idea at first but I've come around to it. The only thing is the data_rwsem thing, we don't need it as we are protected by the transaction being blocked when we do writeout, so nobody can data allocations during this time. Thanks, Josef -BEGIN PGP SIGNATURE- Version: GnuPG v1 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJTFe73AAoJEANb+wAKly3BMMMP/1yY3WBGzaz5oTyNRRSgnVvl sjk7WSN2A0mhrl6lQOBlTvhgFo3gDNAtIwJaJIfWxWXZfM5i1hXSViFIc5NND3lF Jm10oaCPiTaYM0S5jqj8oeSXFyb1ny+/D4C1OfC/eRVpu9juO3bFUZYcM1XMFUcI mJxXg9Mqi1C6TOocCIdQI7ijXgm8xEPauaQy71EJZmgSjjVAXFG9BHay26L2a3Yu XIlnjyHMcgIFZXGVHQ+45S2pwWVgVZBIHLcKFSDFy4aupq/+EAN15oxdeTLBG8hJ RFJh15wLQguawlirc7boPzEugSieixbbUjn6CZqikVZke1g1fjGvrTAjiacpTDxv jskrCPYaXYWKMqGjugsVSI8GSonuWmmVRBsg4k+52U9AYM8wapjL+RHzBXU1cBqu zfyqaJhBj7EOQIu2oQDT2ZF9E+XA8dLcrysasMS+CYmcmR1Bs1fzpP2W3DoOg68F YoYuXnaAvIkvUQSsPUNGPjCn+iGCq65ZwiwnwF93RN7zlFIQt12DSn7l/NKnT/p9 1jvLkQ2MFmb6QD264mAOL5TULWyyB0AXsitIBJOMs0XWXZheLqqpv1clcaZacBFH P1yN51+d8DJmF0x0j1HacWi62WnGCh8GzR5ef5Pqi11/11xknbPugrzfFI9v1vR7 STfhGs/0sg7oQWYgarOj =Y2df -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
On sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote: Hi Miao, On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. Can you please clarify more about the problem? So I understand that we allocate something from the free space cache. So after that, the free space cache does not account for this extent anymore. On the other hand the extent tree also does not account for it (yet). We need to add a delayed reference, which will be run at transaction commit and update the extent tree. But free-space cache is also written out during transaction commit. So how the issue happens? Can you perhaps post a flow of events? Task1 Task2 btrfs_writepages() alloc data space (The allocated space was removed from the free space cache) submit_bio() btrfs_commit_transaction() write out space cache ... commit transaction completed system crash (end_io() wasn't executed) The system crashed before end_io was executed, so no file references to the allocated space, and the extent tree also does not account for it. That space was lost. Thanks Miao Thanks. Alex. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* +* It is used to record the extents that are allocated for +* the data, but don/t update its metadata. +*/ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* +* It is just used for the delayed data space allocation +* because only the data space allocation can be done during +* we write out the free space cache. +*/ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, -struct btrfs_key *ins, int is_data); +struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8072cfa..426b558 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb, fs_info-pinned_extents = fs_info-freed_extents[0]; fs_info-do_barriers = 1; - mutex_init(fs_info-ordered_operations_mutex); mutex_init(fs_info-ordered_extent_flush_mutex); mutex_init(fs_info-tree_log_mutex); @@ -2287,6 +2286,7 @@ int
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
Hi Miao, On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. Can you please clarify more about the problem? So I understand that we allocate something from the free space cache. So after that, the free space cache does not account for this extent anymore. On the other hand the extent tree also does not account for it (yet). We need to add a delayed reference, which will be run at transaction commit and update the extent tree. But free-space cache is also written out during transaction commit. So how the issue happens? Can you perhaps post a flow of events? Thanks. Alex. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* +* It is used to record the extents that are allocated for +* the data, but don/t update its metadata. +*/ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* +* It is just used for the delayed data space allocation +* because only the data space allocation can be done during +* we write out the free space cache. +*/ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, -struct btrfs_key *ins, int is_data); +struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8072cfa..426b558 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb, fs_info-pinned_extents = fs_info-freed_extents[0]; fs_info-do_barriers = 1; - mutex_init(fs_info-ordered_operations_mutex); mutex_init(fs_info-ordered_extent_flush_mutex); mutex_init(fs_info-tree_log_mutex); @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb, init_rwsem(fs_info-extent_commit_sem); init_rwsem(fs_info-cleanup_work_sem); init_rwsem(fs_info-subvol_sem); + init_rwsem(fs_info-data_rwsem); sema_init(fs_info-uuid_tree_rescan_sem, 1); fs_info-dev_replace.lock_owner = 0; atomic_set(fs_info-dev_replace.nesting_level, 0); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3664cfb..7b07876 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6173,7 +6173,7 @@ enum btrfs_loop_type { static noinline int find_free_extent(struct btrfs_root *orig_root, u64 num_bytes, u64 empty_size, u64 hint_byte, struct btrfs_key *ins, -
[RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* +* It is used to record the extents that are allocated for +* the data, but don/t update its metadata. +*/ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* +* It is just used for the delayed data space allocation +* because only the data space allocation can be done during +* we write out the free space cache. +*/ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, -struct btrfs_key *ins, int is_data); +struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8072cfa..426b558 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb, fs_info-pinned_extents = fs_info-freed_extents[0]; fs_info-do_barriers = 1; - mutex_init(fs_info-ordered_operations_mutex); mutex_init(fs_info-ordered_extent_flush_mutex); mutex_init(fs_info-tree_log_mutex); @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb, init_rwsem(fs_info-extent_commit_sem); init_rwsem(fs_info-cleanup_work_sem); init_rwsem(fs_info-subvol_sem); + init_rwsem(fs_info-data_rwsem); sema_init(fs_info-uuid_tree_rescan_sem, 1); fs_info-dev_replace.lock_owner = 0; atomic_set(fs_info-dev_replace.nesting_level, 0); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3664cfb..7b07876 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6173,7 +6173,7 @@ enum btrfs_loop_type { static noinline int find_free_extent(struct btrfs_root *orig_root, u64 num_bytes, u64 empty_size, u64 hint_byte, struct btrfs_key *ins, -u64 flags) +u64 flags, bool need_pin) { int ret = 0; struct btrfs_root *root = orig_root-fs_info-extent_root; @@ -6502,6 +6502,16 @@ checks: ins-objectid = search_start; ins-offset = num_bytes; + if (need_pin) { + ASSERT(search_start = block_group-key.objectid + search_start block_group-key.objectid + + block_group-key.offset); + set_extent_dirty(block_group-pinned_extents, +search_start, +
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
On Wed, Jan 15, 2014 at 08:00:58PM +0800, Miao Xie wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 The code itself is fine. But I'm not sure if it's worth doing so, I mean those memory allocation and lock acquire and release in our core endio path. To fallback to rebuild space cache is fairly acceptable to me in the case of crash. -liubo It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* + * It is used to record the extents that are allocated for + * the data, but don/t update its metadata. + */ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* + * It is just used for the delayed data space allocation + * because only the data space allocation can be done during + * we write out the free space cache. + */ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, - struct btrfs_key *ins, int is_data); + struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8072cfa..426b558 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb, fs_info-pinned_extents = fs_info-freed_extents[0]; fs_info-do_barriers = 1; - mutex_init(fs_info-ordered_operations_mutex); mutex_init(fs_info-ordered_extent_flush_mutex); mutex_init(fs_info-tree_log_mutex); @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb, init_rwsem(fs_info-extent_commit_sem); init_rwsem(fs_info-cleanup_work_sem); init_rwsem(fs_info-subvol_sem); + init_rwsem(fs_info-data_rwsem); sema_init(fs_info-uuid_tree_rescan_sem, 1); fs_info-dev_replace.lock_owner = 0; atomic_set(fs_info-dev_replace.nesting_level, 0); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3664cfb..7b07876 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6173,7 +6173,7 @@ enum btrfs_loop_type { static noinline int find_free_extent(struct btrfs_root *orig_root, u64 num_bytes, u64 empty_size, u64 hint_byte, struct btrfs_key *ins, - u64 flags) + u64 flags, bool need_pin) { int ret = 0; struct btrfs_root *root = orig_root-fs_info-extent_root; @@ -6502,6 +6502,16 @@ checks: ins-objectid = search_start; ins-offset = num_bytes; + if (need_pin) { + ASSERT(search_start =