Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-06-10 Thread Alin Dobre
On 19/05/14 02:33, Chris Mason wrote:
 I had this one in the integration branch, but lockdep reported troubles.  It
 looks like lockdep is correct.  find_free_extent is nesting the cache rwsem
 inside the groups rwsem, but btrfs_write_out_cache is holding the new cache
 rwsem when it calls find_free_extent.

Is there a new patch for this problem? We have another issue that makes
us reboot the system quite often - I'm going to report that in a
separate thread - and after each machine restart, btrfs reports broken
free space cache. I know that's probably not that harmful, but it's
disturbing and it's not that easy to remount the filesystem to fix the
free space cache.

Cheers,
Alin.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-05-18 Thread Chris Mason
On 01/15/2014 07:00 AM, Miao Xie wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816
 
 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
 
 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.
 
 Only data block groups had this problem, so the above change is just
 for data space allocation.

I had this one in the integration branch, but lockdep reported troubles.  It
looks like lockdep is correct.  find_free_extent is nesting the cache rwsem
inside the groups rwsem, but btrfs_write_out_cache is holding the new cache
rwsem when it calls find_free_extent.

-chris

[ 2857.610731] ==
[ 2857.623158] [ INFO: possible circular locking dependency detected ]
[ 2857.635771] 3.15.0-rc5-mason+ #43 Not tainted
[ 2857.644553] ---
[ 2857.657139] btrfs-transacti/19476 is trying to acquire lock:
[ 2857.668518]  (found-groups_sem){..}, at: [a059dbc1] 
find_free_extent+0x931/0xe20 [btrfs]
[ 2857.687771] 
[ 2857.687771] but task is already holding lock:
[ 2857.699566]  (cache-data_rwsem){..}, at: [a05f78bf] 
btrfs_write_out_cache+0x9f/0x170 [btrfs]
[ 2857.719480] 
[ 2857.719480] which lock already depends on the new lock.
[ 2857.719480] 
[ 2857.736021] 
[ 2857.736021] the existing dependency chain (in reverse order) is:
[ 2857.751120] 
- #1 (cache-data_rwsem){..}:
[ 2857.760823][810a14fe] lock_acquire+0x8e/0x110
[ 2857.772772][81649cf7] down_read+0x47/0x60
[ 2857.784028][a059db2c] find_free_extent+0x89c/0xe20 [btrfs]
[ 2857.798253][a059e11b] btrfs_reserve_extent+0x6b/0x140 
[btrfs]
[ 2857.813041][a05b73ec] cow_file_range+0x13c/0x460 [btrfs]
[ 2857.826892][a05bc097] run_delalloc_range+0x347/0x380 
[btrfs]
[ 2857.841510][a05d3f3d] __extent_writepage+0x70d/0x870 
[btrfs]
[ 2857.856129][a05d456a] 
extent_write_cache_pages.clone.6+0x30a/0x410 [btrfs]
[ 2857.873185][a05d46c2] extent_writepages+0x52/0x70 [btrfs]
[ 2857.887224][a05b3d57] btrfs_writepages+0x27/0x30 [btrfs]
[ 2857.901078][81142543] do_writepages+0x23/0x40
[ 2857.913034][81135b99] __filemap_fdatawrite_range+0x59/0x60
[ 2857.927240][81135dec] filemap_flush+0x1c/0x20
[ 2857.939215][a05b2502] btrfs_run_delalloc_work+0x72/0xa0 
[btrfs]
[ 2857.954367][a05e05fe] normal_work_helper+0x6e/0x2d0 [btrfs]
[ 2857.968749][8106b9e2] process_one_work+0x1d2/0x550
[ 2857.981561][8106cd8f] worker_thread+0x11f/0x3a0
[ 2857.993856][8107317e] kthread+0xde/0x100
[ 2858.004936][8165436c] ret_from_fork+0x7c/0xb0
[ 2858.016887] 
- #0 (found-groups_sem){..}:
[ 2858.026590][810a12de] __lock_acquire+0x161e/0x17b0
[ 2858.039407][810a14fe] lock_acquire+0x8e/0x110
[ 2858.051370][81649cf7] down_read+0x47/0x60
[ 2858.062629][a059dbc1] find_free_extent+0x931/0xe20 [btrfs]
[ 2858.076841][a059e11b] btrfs_reserve_extent+0x6b/0x140 
[btrfs]
[ 2858.091629][a059e307] btrfs_alloc_free_block+0x117/0x420 
[btrfs]
[ 2858.106940][a0589a5b] __btrfs_cow_block+0x11b/0x530 [btrfs]
[ 2858.121331][a058a4a0] btrfs_cow_block+0x130/0x1e0 [btrfs]
[ 2858.135375][a058c999] btrfs_search_slot+0x219/0x9c0 [btrfs]
[ 2858.149760][a05f7595] __btrfs_write_out_cache+0x755/0x970 
[btrfs]
[ 2858.165245][a05f7958] btrfs_write_out_cache+0x138/0x170 
[btrfs]
[ 2858.180411][a059ccb0] 
btrfs_write_dirty_block_groups+0x480/0x600 [btrfs]
[ 2858.197107][a05ae7af] commit_cowonly_roots+0x19f/0x250 
[btrfs]
[ 2858.212084][a05afbc0] btrfs_commit_transaction+0x450/0xa60 
[btrfs]
[ 2858.227738][a05aa8a6] transaction_kthread+0x216/0x290 
[btrfs]
[ 2858.242533][8107317e] kthread+0xde/0x100
[ 2858.253617][8165436c] ret_from_fork+0x7c/0xb0
[ 2858.265569] 
[ 2858.265569] other info that might 

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-26 Thread mi...@cn.fujitsu.com
Hi Chris, Josef

This patch is an important bug fix patch, could you merge it into your tree?

Thanks
Miao

On  wed, 15 Jan 2014 20:00:58 +0800, Miao Xie wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816
 
 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
 
 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.
 
 Only data block groups had this problem, so the above change is just
 for data space allocation.
 
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)
 
 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
   /* free space cache stuff */
   struct btrfs_free_space_ctl *free_space_ctl;
  
 + /*
 +  * It is used to record the extents that are allocated for
 +  * the data, but don/t update its metadata.
 +  */
 + struct extent_io_tree pinned_extents;
 +
   /* block group cache stuff */
   struct rb_node cache_node;
  
 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
*/
   struct list_head space_info;
  
 + /*
 +  * It is just used for the delayed data space allocation
 +  * because only the data space allocation can be done during
 +  * we write out the free space cache.
 +  */
 + struct rw_semaphore data_rwsem;
 +
   struct btrfs_space_info *data_sinfo;
  
   struct reloc_control *reloc_ctl;
 @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct 
 btrfs_trans_handle *trans,
  struct btrfs_key *ins);
  int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 -  struct btrfs_key *ins, int is_data);
 +  struct btrfs_key *ins, int is_data, bool need_pin);
  int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 struct extent_buffer *buf, int full_backref, int for_cow);
  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 8072cfa..426b558 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
   fs_info-pinned_extents = fs_info-freed_extents[0];
   fs_info-do_barriers = 1;
  
 -
   mutex_init(fs_info-ordered_operations_mutex);
   mutex_init(fs_info-ordered_extent_flush_mutex);
   mutex_init(fs_info-tree_log_mutex);
 @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
   init_rwsem(fs_info-extent_commit_sem);
   init_rwsem(fs_info-cleanup_work_sem);
   init_rwsem(fs_info-subvol_sem);
 + init_rwsem(fs_info-data_rwsem);
   sema_init(fs_info-uuid_tree_rescan_sem, 1);
   fs_info-dev_replace.lock_owner = 0;
   atomic_set(fs_info-dev_replace.nesting_level, 0);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 3664cfb..7b07876 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
  static noinline int find_free_extent(struct btrfs_root *orig_root,
u64 num_bytes, u64 empty_size,
u64 hint_byte, struct btrfs_key *ins,
 -  u64 flags)
 +  u64 flags, bool need_pin)
  {
   int ret = 0;
   struct btrfs_root *root = orig_root-fs_info-extent_root;
 @@ -6502,6 +6502,16 @@ checks:
   ins-objectid = search_start;
   ins-offset = num_bytes;
  
 + if (need_pin) {
 + ASSERT(search_start = block_group-key.objectid 
 +search_start  block_group-key.objectid +
 +   

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-08 Thread Alex Lyakas
Thanks, Miao,
so the problem is that cow_file_range() joins transaction, allocates
space through btrfs_reserve_extent(), then detaches from transaction.
And then btrfs_finish_ordered_io() joins transaction again, adds a
delayed ref and detaches from transaction. So if between these two,
the transaction commits and we crash, then yes, the allocation is
lost.

Alex.


On Tue, Mar 4, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On  sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote:
 Hi Miao,

 On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816

 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
 Can you please clarify more about the problem?
 So I understand that we allocate something from the free space cache.
 So after that, the free space cache does not account for this extent
 anymore. On the other hand the extent tree also does not account for
 it (yet). We need to add a delayed reference, which will be run at
 transaction commit and update the extent tree. But free-space cache is
 also written out during transaction commit. So how the issue happens?
 Can you perhaps post a flow of events?

 Task1   Task2
 btrfs_writepages()
   alloc data space
 (The allocated space was
  removed from the free
  space cache)
   submit_bio()
 btrfs_commit_transaction()
   write out space cache
   ...
   commit transaction completed
 system crash
  (end_io() wasn't executed)

 The system crashed before end_io was executed, so no file references to the
 allocated space, and the extent tree also does not account for it. That space
 was lost.

 Thanks
 Miao

 Thanks.
 Alex.



 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.

 Only data block groups had this problem, so the above change is just
 for data space allocation.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
 /* free space cache stuff */
 struct btrfs_free_space_ctl *free_space_ctl;

 +   /*
 +* It is used to record the extents that are allocated for
 +* the data, but don/t update its metadata.
 +*/
 +   struct extent_io_tree pinned_extents;
 +
 /* block group cache stuff */
 struct rb_node cache_node;

 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
  */
 struct list_head space_info;

 +   /*
 +* It is just used for the delayed data space allocation
 +* because only the data space allocation can be done during
 +* we write out the free space cache.
 +*/
 +   struct rw_semaphore data_rwsem;
 +
 struct btrfs_space_info *data_sinfo;

 struct reloc_control *reloc_ctl;
 @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct 
 btrfs_trans_handle *trans,
struct btrfs_key *ins);
  int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
  u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 -struct btrfs_key *ins, int is_data);
 +struct btrfs_key *ins, int is_data, bool need_pin);
  int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root 
 *root,
   struct extent_buffer *buf, int full_backref, int for_cow);
  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root 
 *root,
 diff --git 

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-05 Thread Josef Bacik
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 03/05/2014 02:02 AM, Miao Xie wrote:
 On Tue, 4 Mar 2014 10:19:20 -0500, Josef Bacik wrote: On 01/15/2014
 07:00 AM, Miao Xie wrote:
 When we mounted the filesystem after the crash, we got the 
 following message: BTRFS error (device xxx): block group
 4315938816 has wrong amount of free space BTRFS error (device
 xxx): failed to load free space cache for block group
 4315938816
 
 It is because we didn't update the metadata of the allocated
 space until the file data was written into the disk. During
 this time, there was no information about the allocated
 spaces in either the extent tree nor the free space cache.
 when we wrote out the free space cache at this time, those
 spaces were lost.
 
 In ordered to fix this problem, I use a state tree for every
 block group to record those allocated spaces. We record the
 information when they are allocated, and clean up the
 information after the metadata update. Besides that, we also
 introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.
 
 Only data block groups had this problem, so the above change
 is just for data space allocation.
 
 
 I didn't like this idea at first but I've come around to it.  The
 only thing is the data_rwsem thing, we don't need it as we are
 protected by the transaction being blocked when we do writeout, so
 nobody can data allocations during this time.  Thanks,
 
 But this protection was removed by the patch
 
 Commit ID: 00361589d2eebd90fca022148c763e40d3e90871
 

Excellent point, then I'm good overall, I'll pull it in once I finish
qgroups.  Thanks,

Josef
-BEGIN PGP SIGNATURE-
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJTFzyJAAoJEANb+wAKly3BFHMP+wctKXDuKxP4kI27bfRlfJmV
QB5i6qXr4WvQQw04FF3BtxCW9Z36l/1cFLFK0OaQ8Q54+4s0FUXNAynXFkf41TJz
XWhkjTq0hJmzM95tB2+B2HwtEDI6m0l6fnOhsyiiSHF8V1g7V9VRwkeqpB9ZGu4/
ZryAUkSJGXFDvtFgTmRvOAclwD0R6oCXCw3f/AgtZQL1H9ucaogI2XKmcsFC
jOIbn7L+gtFmTAJdvSFlRQAYaV2g69rf0Q5bVHZMAaLKN5rMIXBC2xFOCUxwg3si
ZyOl72ojaGbCt7MI/s2X0uZ5d+xWYjaG+tF2K+XLjFoIUkny3RndFfU6pKk54gHP
9O/GXiilq2t3qZUn3zMuLXIG+cCaYTt3QsHnNyJisqOVLL95LbIvsINm0Xgu6bF/
cJ0acApJr6y2EdEbfVU/mrB06K81bxnZez8rOgyFXKD4yWoTMG23xttKZHG5LbdT
xkwrJWPeJ77mI0+V/MPWePgomcH5cDs0IckSOXXwy8gZF4HJzVbXklcq22BfgIQA
SLVgJYxqIlzIHYHZPisWJHUwFXe4C58YCDP2w4FI32g5LZuzhlus/wFI9Tg1543w
sYf23ZYxlDUYAiD+zb+UipEA4CLtdZgZGonoG9lxK9JkgU2VHzXuTgty2+B0F9kt
l3B5jpy1H2cP2TUskkbZ
=E0en
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-04 Thread Josef Bacik
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 01/15/2014 07:00 AM, Miao Xie wrote:
 When we mounted the filesystem after the crash, we got the
 following message: BTRFS error (device xxx): block group 4315938816
 has wrong amount of free space BTRFS error (device xxx): failed to
 load free space cache for block group 4315938816
 
 It is because we didn't update the metadata of the allocated space
 until the file data was written into the disk. During this time,
 there was no information about the allocated spaces in either the
 extent tree nor the free space cache. when we wrote out the free
 space cache at this time, those spaces were lost.
 
 In ordered to fix this problem, I use a state tree for every block
 group to record those allocated spaces. We record the information
 when they are allocated, and clean up the information after the
 metadata update. Besides that, we also introduce a read-write
 semaphore to avoid the race between the allocation and the free
 space cache write out.
 
 Only data block groups had this problem, so the above change is
 just for data space allocation.
 

I didn't like this idea at first but I've come around to it.  The only
thing is the data_rwsem thing, we don't need it as we are protected by
the transaction being blocked when we do writeout, so nobody can data
allocations during this time.  Thanks,

Josef

-BEGIN PGP SIGNATURE-
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJTFe73AAoJEANb+wAKly3BMMMP/1yY3WBGzaz5oTyNRRSgnVvl
sjk7WSN2A0mhrl6lQOBlTvhgFo3gDNAtIwJaJIfWxWXZfM5i1hXSViFIc5NND3lF
Jm10oaCPiTaYM0S5jqj8oeSXFyb1ny+/D4C1OfC/eRVpu9juO3bFUZYcM1XMFUcI
mJxXg9Mqi1C6TOocCIdQI7ijXgm8xEPauaQy71EJZmgSjjVAXFG9BHay26L2a3Yu
XIlnjyHMcgIFZXGVHQ+45S2pwWVgVZBIHLcKFSDFy4aupq/+EAN15oxdeTLBG8hJ
RFJh15wLQguawlirc7boPzEugSieixbbUjn6CZqikVZke1g1fjGvrTAjiacpTDxv
jskrCPYaXYWKMqGjugsVSI8GSonuWmmVRBsg4k+52U9AYM8wapjL+RHzBXU1cBqu
zfyqaJhBj7EOQIu2oQDT2ZF9E+XA8dLcrysasMS+CYmcmR1Bs1fzpP2W3DoOg68F
YoYuXnaAvIkvUQSsPUNGPjCn+iGCq65ZwiwnwF93RN7zlFIQt12DSn7l/NKnT/p9
1jvLkQ2MFmb6QD264mAOL5TULWyyB0AXsitIBJOMs0XWXZheLqqpv1clcaZacBFH
P1yN51+d8DJmF0x0j1HacWi62WnGCh8GzR5ef5Pqi11/11xknbPugrzfFI9v1vR7
STfhGs/0sg7oQWYgarOj
=Y2df
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-03 Thread Miao Xie
On  sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote:
 Hi Miao,
 
 On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816

 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
 Can you please clarify more about the problem?
 So I understand that we allocate something from the free space cache.
 So after that, the free space cache does not account for this extent
 anymore. On the other hand the extent tree also does not account for
 it (yet). We need to add a delayed reference, which will be run at
 transaction commit and update the extent tree. But free-space cache is
 also written out during transaction commit. So how the issue happens?
 Can you perhaps post a flow of events?

Task1   Task2
btrfs_writepages()
  alloc data space
(The allocated space was
 removed from the free
 space cache)
  submit_bio()
btrfs_commit_transaction()
  write out space cache
  ...
  commit transaction completed
system crash
 (end_io() wasn't executed)

The system crashed before end_io was executed, so no file references to the
allocated space, and the extent tree also does not account for it. That space
was lost.

Thanks
Miao
 
 Thanks.
 Alex.
 
 

 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.

 Only data block groups had this problem, so the above change is just
 for data space allocation.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
 /* free space cache stuff */
 struct btrfs_free_space_ctl *free_space_ctl;

 +   /*
 +* It is used to record the extents that are allocated for
 +* the data, but don/t update its metadata.
 +*/
 +   struct extent_io_tree pinned_extents;
 +
 /* block group cache stuff */
 struct rb_node cache_node;

 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
  */
 struct list_head space_info;

 +   /*
 +* It is just used for the delayed data space allocation
 +* because only the data space allocation can be done during
 +* we write out the free space cache.
 +*/
 +   struct rw_semaphore data_rwsem;
 +
 struct btrfs_space_info *data_sinfo;

 struct reloc_control *reloc_ctl;
 @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct 
 btrfs_trans_handle *trans,
struct btrfs_key *ins);
  int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
  u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 -struct btrfs_key *ins, int is_data);
 +struct btrfs_key *ins, int is_data, bool need_pin);
  int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
   struct extent_buffer *buf, int full_backref, int for_cow);
  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 8072cfa..426b558 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
 fs_info-pinned_extents = fs_info-freed_extents[0];
 fs_info-do_barriers = 1;

 -
 mutex_init(fs_info-ordered_operations_mutex);
 mutex_init(fs_info-ordered_extent_flush_mutex);
 mutex_init(fs_info-tree_log_mutex);
 @@ -2287,6 +2286,7 @@ int 

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-01 Thread Alex Lyakas
Hi Miao,

On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816

 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
Can you please clarify more about the problem?
So I understand that we allocate something from the free space cache.
So after that, the free space cache does not account for this extent
anymore. On the other hand the extent tree also does not account for
it (yet). We need to add a delayed reference, which will be run at
transaction commit and update the extent tree. But free-space cache is
also written out during transaction commit. So how the issue happens?
Can you perhaps post a flow of events?

Thanks.
Alex.



 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.

 Only data block groups had this problem, so the above change is just
 for data space allocation.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
 /* free space cache stuff */
 struct btrfs_free_space_ctl *free_space_ctl;

 +   /*
 +* It is used to record the extents that are allocated for
 +* the data, but don/t update its metadata.
 +*/
 +   struct extent_io_tree pinned_extents;
 +
 /* block group cache stuff */
 struct rb_node cache_node;

 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
  */
 struct list_head space_info;

 +   /*
 +* It is just used for the delayed data space allocation
 +* because only the data space allocation can be done during
 +* we write out the free space cache.
 +*/
 +   struct rw_semaphore data_rwsem;
 +
 struct btrfs_space_info *data_sinfo;

 struct reloc_control *reloc_ctl;
 @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct 
 btrfs_trans_handle *trans,
struct btrfs_key *ins);
  int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
  u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 -struct btrfs_key *ins, int is_data);
 +struct btrfs_key *ins, int is_data, bool need_pin);
  int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
   struct extent_buffer *buf, int full_backref, int for_cow);
  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 8072cfa..426b558 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
 fs_info-pinned_extents = fs_info-freed_extents[0];
 fs_info-do_barriers = 1;

 -
 mutex_init(fs_info-ordered_operations_mutex);
 mutex_init(fs_info-ordered_extent_flush_mutex);
 mutex_init(fs_info-tree_log_mutex);
 @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
 init_rwsem(fs_info-extent_commit_sem);
 init_rwsem(fs_info-cleanup_work_sem);
 init_rwsem(fs_info-subvol_sem);
 +   init_rwsem(fs_info-data_rwsem);
 sema_init(fs_info-uuid_tree_rescan_sem, 1);
 fs_info-dev_replace.lock_owner = 0;
 atomic_set(fs_info-dev_replace.nesting_level, 0);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 3664cfb..7b07876 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
  static noinline int find_free_extent(struct btrfs_root *orig_root,
  u64 num_bytes, u64 empty_size,
  u64 hint_byte, struct btrfs_key *ins,
 -

[RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-01-15 Thread Miao Xie
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
space
  BTRFS error (device xxx): failed to load free space cache for block group 
4315938816

It is because we didn't update the metadata of the allocated space until
the file data was written into the disk. During this time, there was no
information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time, those
spaces were lost.

In ordered to fix this problem, I use a state tree for every block group
to record those allocated spaces. We record the information when they are
allocated, and clean up the information after the metadata update. Besides
that, we also introduce a read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Only data block groups had this problem, so the above change is just
for data space allocation.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/ctree.h| 15 ++-
 fs/btrfs/disk-io.c  |  2 +-
 fs/btrfs/extent-tree.c  | 24 
 fs/btrfs/free-space-cache.c | 42 ++
 fs/btrfs/inode.c| 42 +++---
 5 files changed, 108 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1667c9a..f58e1f7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
/* free space cache stuff */
struct btrfs_free_space_ctl *free_space_ctl;
 
+   /*
+* It is used to record the extents that are allocated for
+* the data, but don/t update its metadata.
+*/
+   struct extent_io_tree pinned_extents;
+
/* block group cache stuff */
struct rb_node cache_node;
 
@@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
 */
struct list_head space_info;
 
+   /*
+* It is just used for the delayed data space allocation
+* because only the data space allocation can be done during
+* we write out the free space cache.
+*/
+   struct rw_semaphore data_rwsem;
+
struct btrfs_space_info *data_sinfo;
 
struct reloc_control *reloc_ctl;
@@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct 
btrfs_trans_handle *trans,
   struct btrfs_key *ins);
 int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
-struct btrfs_key *ins, int is_data);
+struct btrfs_key *ins, int is_data, bool need_pin);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
  struct extent_buffer *buf, int full_backref, int for_cow);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8072cfa..426b558 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
fs_info-pinned_extents = fs_info-freed_extents[0];
fs_info-do_barriers = 1;
 
-
mutex_init(fs_info-ordered_operations_mutex);
mutex_init(fs_info-ordered_extent_flush_mutex);
mutex_init(fs_info-tree_log_mutex);
@@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
init_rwsem(fs_info-extent_commit_sem);
init_rwsem(fs_info-cleanup_work_sem);
init_rwsem(fs_info-subvol_sem);
+   init_rwsem(fs_info-data_rwsem);
sema_init(fs_info-uuid_tree_rescan_sem, 1);
fs_info-dev_replace.lock_owner = 0;
atomic_set(fs_info-dev_replace.nesting_level, 0);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3664cfb..7b07876 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
 static noinline int find_free_extent(struct btrfs_root *orig_root,
 u64 num_bytes, u64 empty_size,
 u64 hint_byte, struct btrfs_key *ins,
-u64 flags)
+u64 flags, bool need_pin)
 {
int ret = 0;
struct btrfs_root *root = orig_root-fs_info-extent_root;
@@ -6502,6 +6502,16 @@ checks:
ins-objectid = search_start;
ins-offset = num_bytes;
 
+   if (need_pin) {
+   ASSERT(search_start = block_group-key.objectid 
+  search_start  block_group-key.objectid +
+ block_group-key.offset);
+   set_extent_dirty(block_group-pinned_extents,
+search_start,
+

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-01-15 Thread Liu Bo
On Wed, Jan 15, 2014 at 08:00:58PM +0800, Miao Xie wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816

The code itself is fine.
But I'm not sure if it's worth doing so, I mean those memory allocation and lock
acquire and release in our core endio path.

To fallback to rebuild space cache is fairly acceptable to me in the case of
crash.

-liubo

 
 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
 
 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.
 
 Only data block groups had this problem, so the above change is just
 for data space allocation.
 
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)
 
 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
   /* free space cache stuff */
   struct btrfs_free_space_ctl *free_space_ctl;
  
 + /*
 +  * It is used to record the extents that are allocated for
 +  * the data, but don/t update its metadata.
 +  */
 + struct extent_io_tree pinned_extents;
 +
   /* block group cache stuff */
   struct rb_node cache_node;
  
 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
*/
   struct list_head space_info;
  
 + /*
 +  * It is just used for the delayed data space allocation
 +  * because only the data space allocation can be done during
 +  * we write out the free space cache.
 +  */
 + struct rw_semaphore data_rwsem;
 +
   struct btrfs_space_info *data_sinfo;
  
   struct reloc_control *reloc_ctl;
 @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct 
 btrfs_trans_handle *trans,
  struct btrfs_key *ins);
  int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
u64 min_alloc_size, u64 empty_size, u64 hint_byte,
 -  struct btrfs_key *ins, int is_data);
 +  struct btrfs_key *ins, int is_data, bool need_pin);
  int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 struct extent_buffer *buf, int full_backref, int for_cow);
  int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 8072cfa..426b558 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
   fs_info-pinned_extents = fs_info-freed_extents[0];
   fs_info-do_barriers = 1;
  
 -
   mutex_init(fs_info-ordered_operations_mutex);
   mutex_init(fs_info-ordered_extent_flush_mutex);
   mutex_init(fs_info-tree_log_mutex);
 @@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
   init_rwsem(fs_info-extent_commit_sem);
   init_rwsem(fs_info-cleanup_work_sem);
   init_rwsem(fs_info-subvol_sem);
 + init_rwsem(fs_info-data_rwsem);
   sema_init(fs_info-uuid_tree_rescan_sem, 1);
   fs_info-dev_replace.lock_owner = 0;
   atomic_set(fs_info-dev_replace.nesting_level, 0);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 3664cfb..7b07876 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
  static noinline int find_free_extent(struct btrfs_root *orig_root,
u64 num_bytes, u64 empty_size,
u64 hint_byte, struct btrfs_key *ins,
 -  u64 flags)
 +  u64 flags, bool need_pin)
  {
   int ret = 0;
   struct btrfs_root *root = orig_root-fs_info-extent_root;
 @@ -6502,6 +6502,16 @@ checks:
   ins-objectid = search_start;
   ins-offset = num_bytes;
  
 + if (need_pin) {
 + ASSERT(search_start =