Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option
On 2017-09-30 00:57, Goffredo Baroncelli wrote: On 09/28/2017 02:00 AM, Qu Wenruo wrote: On 2017-09-28 00:20, David Sterba wrote: On Mon, Sep 25, 2017 at 07:15:30AM -0400, Austin S. Hemmelgarn wrote: On 2017-09-24 10:08, Goffredo Baroncelli wrote: On 09/24/2017 12:10 PM, Anand Jain wrote:

A lot of points in this thread, let me address them here. I don't want to remove the --rootdir functionality; the fix being proposed removes the minimalization, a feature that was not well known, but I understand it's useful (and already used). I'd like to fix that in another way, e.g. as an option to mkfs or a separate tool. I'm not worried about adding more code or code complexity. If we do it right, it's not a problem in the long run. But people for some reason like to delete other people's code or functionality.

Regarding guessing the number of users, this is always hard. So if there's a user vocal enough to let us know about the use case, it's IMHO a good time to explore it, code-wise and documentation-wise, and fix it or improve it.

So, what next. I'd like to get rid of the custom chunk allocator, namely because of the known 1MB area misuse and duplication. Implementing the minimalization might need some preparatory work, and I don't have a realistic ETA now.

Well, if using the over-reserve-then-shrink method, it could be done without much hassle.

What about doing it on a file instead of a device? As a sparse file, it would be less expensive to enlarge and then shrink. I think that whoever needs to build a filesystem with "shrink" doesn't need to create it on a real block device.

For a device, nothing different; the v3 patchset will handle it. For a file, a sparse file of course. Thanks, Qu

However, the ETA may be delayed to mid-October, as I'm going to enjoy my holiday from 1st Oct to 7th Oct. Thanks, Qu

Ideally we'd fix both problems in one version, as I'd rather avoid a "temporary" solution that drops the minimalization with a promise to put it back later.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] Btrfs: remove nr_async_bios
On Wed, Sep 27, 2017 at 01:30:13PM +0200, David Sterba wrote:
> On Thu, Sep 07, 2017 at 11:22:20AM -0600, Liu Bo wrote:
> > This was intended to congest higher layers to not send bios, but as
> >
> > 1) the congested bit has been taken by writeback
>
> Can you please be more specific here?

Sure. Async bios come from buffered writes and DIO writes. For DIO writes, we want to submit them ASAP, while for buffered writes, writeback uses balance_dirty_pages() to throttle how many dirty pages we can have.

> > 2) and no one is waiting for %nr_async_bios to go down to zero,
> >
> > we can safely remove this now.
>
> From the original commit it looks like a mechanism to avoid some write
> patterns (streaming not becoming random), but the commit is from 2008;
> a lot of things have changed.

I did check the history, but have to check it again before typing the following... IIUC, it was introduced along with the changes that spread the checksumming workload across different CPUs. And at that time it seems pdflush was used instead of per-bdi flush; perhaps pdflush didn't have the necessary information for writeback to do throttling. Chris should answer this better.

> I think we should at least document what congestion behaviour we rely
> on nowadays, so that's for 1). Otherwise the patch looks ok.

Sounds good.
> > > > Signed-off-by: Liu Bo > > --- > > fs/btrfs/ctree.h | 1 - > > fs/btrfs/disk-io.c | 1 - > > fs/btrfs/volumes.c | 14 -- > > 3 files changed, 16 deletions(-) > > > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > > index 3f3eb7b..27cd882 100644 > > --- a/fs/btrfs/ctree.h > > +++ b/fs/btrfs/ctree.h > > @@ -881,7 +881,6 @@ struct btrfs_fs_info { > > > > atomic_t nr_async_submits; > > atomic_t async_submit_draining; > > - atomic_t nr_async_bios; > > atomic_t async_delalloc_pages; > > atomic_t open_ioctl_trans; > > > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > > index f45b61f..95583e2 100644 > > --- a/fs/btrfs/disk-io.c > > +++ b/fs/btrfs/disk-io.c > > @@ -2657,7 +2657,6 @@ int open_ctree(struct super_block *sb, > > atomic_set(&fs_info->nr_async_submits, 0); > > atomic_set(&fs_info->async_delalloc_pages, 0); > > atomic_set(&fs_info->async_submit_draining, 0); > > - atomic_set(&fs_info->nr_async_bios, 0); > > atomic_set(&fs_info->defrag_running, 0); > > atomic_set(&fs_info->qgroup_op_seq, 0); > > atomic_set(&fs_info->reada_works_cnt, 0); > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > > index bd679bc..6e9df4d 100644 > > --- a/fs/btrfs/volumes.c > > +++ b/fs/btrfs/volumes.c > > @@ -450,13 +450,6 @@ static noinline void run_scheduled_bios(struct > > btrfs_device *device) > > pending = pending->bi_next; > > cur->bi_next = NULL; > > > > - /* > > -* atomic_dec_return implies a barrier for waitqueue_active > > -*/ > > - if (atomic_dec_return(&fs_info->nr_async_bios) < limit && > > And after that the variable 'limit' becomes unused, please remove it as > well. > OK, thanks for the comments. 
thanks, -liubo > > - waitqueue_active(&fs_info->async_submit_wait)) > > - wake_up(&fs_info->async_submit_wait); > > - > > BUG_ON(atomic_read(&cur->__bi_cnt) == 0); > > > > /* > > @@ -6132,13 +6125,6 @@ static noinline void btrfs_schedule_bio(struct > > btrfs_device *device, > > return; > > } > > > > - /* > > -* nr_async_bios allows us to reliably return congestion to the > > -* higher layers. Otherwise, the async bio makes it appear we have > > -* made progress against dirty pages when we've really just put it > > -* on a queue for later > > -*/ > > - atomic_inc(&fs_info->nr_async_bios); > > WARN_ON(bio->bi_next); > > bio->bi_next = NULL; > > > > -- > > 2.9.4 > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > the body of a message to majord...@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can't remove device -> I/O error
>> Did you remove the disk before mounting (physically, or by doing echo 1 >/sys/block/xxx/device/delete)? Which steps did you perform?

- removed the drive physically
- mounted in degraded mode
- btrfs dev del -> same I/O error
Re: Can't remove device -> I/O error
On 09/29/2017 11:09 PM, DocMAX wrote:
> Thanks for the reply.
>
> I don't want to replace the drive. I want to remove it.
>
> Also tried in degraded mode. I get the exact same error.

Did you remove the disk before mounting (physically, or by doing echo 1 >/sys/block/xxx/device/delete)? Which steps did you perform?

> I'm not sure, but I think I formatted the drive on kernel 4.11.

This shouldn't matter.

> I am on kernel 4.13 now.

OK, that is quite recent.

> I have the bad feeling that I will never get rid of that small drive unless I re-format.

No, it should not be necessary.

> Am 29.09.2017 um 23:04 schrieb Goffredo Baroncelli:
>> On 09/29/2017 10:00 PM, Dirk Diggler wrote:
>>> Hi,
>>>
>>> is there any chance to get my device removed?
>>
>> I simulated a device removal in KVM with
>>
>> echo 1 >/sys/block/sdj/device/delete
>>
>> then
>>
>> btrfs dev del 6 /mnt/
>>
>> And I got success. But I am not sure if this is the right thing to do.
>>
>> You can use "btrfs replace start -r ". But you need another device.
>>
>> Otherwise, you can shut down the filesystem, remove (physically) the disk,
>> then remount with "mount -o degraded" followed by "btrfs dev del missing /..."
>> Before doing so, please tell us which kernel you are using.
>>
>> RAID5/6 until a few months ago had a lot of bugs, so if you have an old kernel
>> it is very difficult to remove a device with success.

-- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Can't remove device -> I/O error
Thanks for the reply.

I don't want to replace the drive. I want to remove it.

Also tried in degraded mode. I get the exact same error.

I'm not sure, but I think I formatted the drive on kernel 4.11. I am on kernel 4.13 now.

I have the bad feeling that I will never get rid of that small drive unless I re-format.

On 29.09.2017 at 23:04, Goffredo Baroncelli wrote:

On 09/29/2017 10:00 PM, Dirk Diggler wrote: Hi, is there any chance to get my device removed?

I simulated a device removal in KVM with

echo 1 >/sys/block/sdj/device/delete

then

btrfs dev del 6 /mnt/

And I got success. But I am not sure if this is the right thing to do.

You can use "btrfs replace start -r ". But you need another device.

Otherwise, you can shut down the filesystem, remove (physically) the disk, then remount with "mount -o degraded" followed by "btrfs dev del missing /...". Before doing so, please tell us which kernel you are using.

RAID5/6 until a few months ago had a lot of bugs, so if you have an old kernel it is very difficult to remove a device with success.
Re: Can't remove device -> I/O error
On 09/29/2017 10:00 PM, Dirk Diggler wrote:
> Hi,
>
> is there any chance to get my device removed?

I simulated a device removal in KVM with

echo 1 >/sys/block/sdj/device/delete

then

btrfs dev del 6 /mnt/

And I got success. But I am not sure if this is the right thing to do.

You can use "btrfs replace start -r ". But you need another device.

Otherwise, you can shut down the filesystem, remove (physically) the disk, then remount with "mount -o degraded" followed by "btrfs dev del missing /...". Before doing so, please tell us which kernel you are using.

RAID5/6 until a few months ago had a lot of bugs, so if you have an old kernel it is very difficult to remove a device with success.

> Scrub literally takes months to complete (SATA 2/3 mix, about 1 minute
> per gigabyte) and I'm not sure if that helps.
> I guess the same goes for balance. Maybe there is a quicker way. I can do
> without some data if it's corrupted. I have a backup, but I want to
> avoid copying all the data from scratch!
>
> Whenever I try to remove dev 6, I get:
>
> console:
> ERROR: error removing device '/dev/sdj': Input/output error
>
> dmesg (I/O error right after this):
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x98f94189 expected csum 0x585e5744 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616 csum 0x98f94189 expected csum 0xcefd2ae0 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x98f94189 expected csum 0x585e5744 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616 csum 0x98f94189 expected csum 0xcefd2ae0 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x4023cac1 expected csum 0x585e5744 mirror 2
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616 csum 0xea91b663 expected csum 0xcefd2ae0 mirror 2
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x98f94189 expected csum 0x585e5744 mirror 1
> BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x4023cac1 expected csum 0x585e5744 mirror 2
>
> My setup:
> /dev/sdf, ID: 3
>    Device size:    2.73TiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     5.00GiB
>    Unallocated:    2.40TiB
>
> /dev/sdg, ID: 2
>    Device size:    1.82TiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     955.00GiB
>    Data,RAID5:     5.57GiB
>    Metadata,RAID1: 3.00GiB
>    Unallocated:    566.44GiB
>
> /dev/sdh, ID: 4
>    Device size:    1.82TiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     955.00GiB
>    Data,RAID5:     5.57GiB
>    Metadata,RAID1: 2.00GiB
>    System,RAID1:   32.00MiB
>    Unallocated:    567.41GiB
>
> /dev/sdi, ID: 7
>    Device size:    2.73TiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     955.00GiB
>    Data,RAID5:     5.57GiB
>    Metadata,RAID1: 11.00GiB
>    System,RAID1:   32.00MiB
>    Unallocated:    1.45TiB
>
> /dev/sdj, ID: 6
>    Device size:    465.76GiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     587.38MiB
>    Unallocated:    132.19GiB
>
> /dev/sdk, ID: 1
>    Device size:    1.82TiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     955.00GiB
>    Data,RAID5:     5.57GiB
>    Metadata,RAID1: 3.00GiB
>    Unallocated:    566.44GiB
>
> /dev/sdl, ID: 5
>    Device size:    1.82TiB
>    Device slack:   0.00B
>    Data,RAID5:     333.00GiB
>    Data,RAID5:     955.00GiB
>    Data,RAID5:     5.57GiB
>    Metadata,RAID1: 3.00GiB
>    Unallocated:    566.44GiB
>
> Thanks,
> DocMAX

-- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: Can't remove device -> I/O error
Kernel: Linux game 4.13.3-1-ARCH #1 SMP PREEMPT Thu Sep 21 20:33:16 CEST 2017 x86_64 GNU/Linux
Can't remove device -> I/O error
Hi,

is there any chance to get my device removed? Scrub literally takes months to complete (SATA 2/3 mix, about 1 minute per gigabyte) and I'm not sure if that helps. I guess the same goes for balance. Maybe there is a quicker way. I can do without some data if it's corrupted. I have a backup, but I want to avoid copying all the data from scratch!

Whenever I try to remove dev 6, I get:

console:
ERROR: error removing device '/dev/sdj': Input/output error

dmesg (I/O error right after this):
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x98f94189 expected csum 0x585e5744 mirror 1
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616 csum 0x98f94189 expected csum 0xcefd2ae0 mirror 1
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x98f94189 expected csum 0x585e5744 mirror 1
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616 csum 0x98f94189 expected csum 0xcefd2ae0 mirror 1
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x4023cac1 expected csum 0x585e5744 mirror 2
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675119616 csum 0xea91b663 expected csum 0xcefd2ae0 mirror 2
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x98f94189 expected csum 0x585e5744 mirror 1
BTRFS warning (device sdl): csum failed root -9 ino 365 off 3675115520 csum 0x4023cac1 expected csum 0x585e5744 mirror 2

My setup:
/dev/sdf, ID: 3
   Device size:    2.73TiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     5.00GiB
   Unallocated:    2.40TiB

/dev/sdg, ID: 2
   Device size:    1.82TiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     955.00GiB
   Data,RAID5:     5.57GiB
   Metadata,RAID1: 3.00GiB
   Unallocated:    566.44GiB

/dev/sdh, ID: 4
   Device size:    1.82TiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     955.00GiB
   Data,RAID5:     5.57GiB
   Metadata,RAID1: 2.00GiB
   System,RAID1:   32.00MiB
   Unallocated:    567.41GiB

/dev/sdi, ID: 7
   Device size:    2.73TiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     955.00GiB
   Data,RAID5:     5.57GiB
   Metadata,RAID1: 11.00GiB
   System,RAID1:   32.00MiB
   Unallocated:    1.45TiB

/dev/sdj, ID: 6
   Device size:    465.76GiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     587.38MiB
   Unallocated:    132.19GiB

/dev/sdk, ID: 1
   Device size:    1.82TiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     955.00GiB
   Data,RAID5:     5.57GiB
   Metadata,RAID1: 3.00GiB
   Unallocated:    566.44GiB

/dev/sdl, ID: 5
   Device size:    1.82TiB
   Device slack:   0.00B
   Data,RAID5:     333.00GiB
   Data,RAID5:     955.00GiB
   Data,RAID5:     5.57GiB
   Metadata,RAID1: 3.00GiB
   Unallocated:    566.44GiB

Thanks,
DocMAX
[PATCH 05/21] btrfs: pass root to various extent ref mod functions
We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.c | 2 +- fs/btrfs/ctree.h | 7 --- fs/btrfs/extent-tree.c | 24 +--- fs/btrfs/file.c| 10 +- fs/btrfs/inode.c | 9 + fs/btrfs/ioctl.c | 2 +- fs/btrfs/relocation.c | 14 +++--- fs/btrfs/tree-log.c| 2 +- 8 files changed, 37 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 19b9c5131745..531e0a8645b0 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -192,7 +192,7 @@ struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root) * tree until you end up with a lock on the root. A locked buffer * is returned, with a reference held. */ -static struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root) +struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root) { struct extent_buffer *eb; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index f1bd12f5f2d5..a7484a744ef0 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2636,7 +2636,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, struct extent_buffer *buf, u64 parent, int last_ref); int btrfs_alloc_reserved_file_extent(struct btrfs_trans_handle *trans, -u64 root_objectid, u64 owner, +struct btrfs_root *root, u64 owner, u64 offset, u64 ram_bytes, struct btrfs_key *ins); int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, @@ -2655,7 +2655,7 @@ int btrfs_set_disk_extent_flags(struct btrfs_trans_handle *trans, u64 bytenr, u64 num_bytes, u64 flags, int level, int is_data); int btrfs_free_extent(struct btrfs_trans_handle *trans, - struct btrfs_fs_info *fs_info, + struct btrfs_root *root, u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid, u64 owner, u64 offset); @@ -2667,7 +2667,7 @@ void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info); int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, 
struct btrfs_fs_info *fs_info); int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info, +struct btrfs_root *root, u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid, u64 owner, u64 offset); @@ -2811,6 +2811,7 @@ void btrfs_set_item_key_safe(struct btrfs_fs_info *fs_info, const struct btrfs_key *new_key); struct extent_buffer *btrfs_root_node(struct btrfs_root *root); struct extent_buffer *btrfs_lock_root_node(struct btrfs_root *root); +struct extent_buffer *btrfs_read_lock_root_node(struct btrfs_root *root); int btrfs_find_next_key(struct btrfs_root *root, struct btrfs_path *path, struct btrfs_key *key, int lowest_level, u64 min_trans); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e32ad9fc93a8..2df22cae45b1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2178,10 +2178,11 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, /* Can return -ENOMEM */ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info, +struct btrfs_root *root, u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid, u64 owner, u64 offset) { + struct btrfs_fs_info *fs_info = root->fs_info; int old_ref_mod, new_ref_mod; int ret; @@ -3340,7 +3341,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, int level; int ret = 0; int (*process_func)(struct btrfs_trans_handle *, - struct btrfs_fs_info *, + struct btrfs_root *, u64, u64, u64, u64, u64, u64); @@ -3380,7 +3381,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans, num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi); key.offset -= btrfs_file_extent_offset(buf, fi); - ret = process_func(trans, fs_info, bytenr, num_bytes, + ret = process_func(trans, root, bytenr, num_bytes, parent, ref_root, key.objectid, key.offset); if (ret) @@ -3388,7 +3389,7 @@ static int
[PATCH 10/21] btrfs: breakout empty head cleanup to a helper
Move this code out to a helper function to further simplify __btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------
 1 file changed, 45 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f4048b23c7be..bae2eac11db7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2607,6 +2607,43 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
 	return ret ? ret : 1;
 }
 
+static int cleanup_ref_head(struct btrfs_trans_handle *trans,
+			    struct btrfs_fs_info *fs_info,
+			    struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_delayed_ref_root *delayed_refs;
+	int ret;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	ret = cleanup_extent_op(trans, fs_info, head);
+	if (ret < 0) {
+		unselect_delayed_ref_head(delayed_refs, head);
+		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
+		return ret;
+	} else if (ret) {
+		return ret;
+	}
+
+	/*
+	 * Need to drop our head ref lock and re-acquire the delayed ref lock
+	 * and then re-check to make sure nobody got added.
+	 */
+	spin_unlock(&head->lock);
+	spin_lock(&delayed_refs->lock);
+	spin_lock(&head->lock);
+	if (!list_empty(&head->ref_list) || head->extent_op) {
+		spin_unlock(&head->lock);
+		spin_unlock(&delayed_refs->lock);
+		return 1;
+	}
+	head->node.in_tree = 0;
+	delayed_refs->num_heads--;
+	rb_erase(&head->href_node, &delayed_refs->href_root);
+	spin_unlock(&delayed_refs->lock);
+	return 0;
+}
+
 /*
  * Returns 0 on success or if called with an already aborted transaction.
  * Returns -ENOMEM or -EIO on failure and will abort the transaction.
@@ -2688,47 +2725,20 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		}
 
 		if (!ref) {
-
-
-			/* All delayed refs have been processed, Go ahead
-			 * and send the head node to run_one_delayed_ref,
-			 * so that any accounting fixes can happen
-			 */
-			ref = &locked_ref->node;
-
-			ret = cleanup_extent_op(trans, fs_info, locked_ref);
-			if (ret < 0) {
-				unselect_delayed_ref_head(delayed_refs,
-							  locked_ref);
-				btrfs_debug(fs_info,
					    "run_delayed_extent_op returned %d",
-					    ret);
-				return ret;
-			} else if (ret > 0) {
+			ret = cleanup_ref_head(trans, fs_info, locked_ref);
+			if (ret > 0 ) {
 				/* We dropped our lock, we need to loop. */
 				ret = 0;
 				continue;
+			} else if (ret) {
+				return ret;
 			}
 
-			/*
-			 * Need to drop our head ref lock and re-acquire the
-			 * delayed ref lock and then re-check to make sure
-			 * nobody got added.
+			/* All delayed refs have been processed, Go ahead
+			 * and send the head node to run_one_delayed_ref,
+			 * so that any accounting fixes can happen
 			 */
-			spin_unlock(&locked_ref->lock);
-			spin_lock(&delayed_refs->lock);
-			spin_lock(&locked_ref->lock);
-			if (!list_empty(&locked_ref->ref_list) ||
-			    locked_ref->extent_op) {
-				spin_unlock(&locked_ref->lock);
-				spin_unlock(&delayed_refs->lock);
-				continue;
-			}
-			ref->in_tree = 0;
-			delayed_refs->num_heads--;
-			rb_erase(&locked_ref->href_node,
-				 &delayed_refs->href_root);
-			spin_unlock(&delayed_refs->lock);
+			ref = &locked_ref->node;
 		} else {
 			actual_count++;
 			ref->in_tree = 0;
-- 
2.7.4
[PATCH 04/21] btrfs: add ref-verify mount option
This adds the infrastructure for turning ref verify on and off for a mount, to be used by a later patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 1 + fs/btrfs/super.c | 15 +++ 2 files changed, 16 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 93e767e16a43..f1bd12f5f2d5 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1334,6 +1334,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info) #define BTRFS_MOUNT_FRAGMENT_METADATA (1 << 25) #define BTRFS_MOUNT_FREE_SPACE_TREE(1 << 26) #define BTRFS_MOUNT_NOLOGREPLAY(1 << 27) +#define BTRFS_MOUNT_REF_VERIFY (1 << 28) #define BTRFS_DEFAULT_COMMIT_INTERVAL (30) #define BTRFS_DEFAULT_MAX_INLINE (2048) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index f88ac4ebe01a..8e74f7029e12 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -326,6 +326,9 @@ enum { #ifdef CONFIG_BTRFS_DEBUG Opt_fragment_data, Opt_fragment_metadata, Opt_fragment_all, #endif +#ifdef CONFIG_BTRFS_FS_REF_VERIFY + Opt_ref_verify, +#endif Opt_err, }; @@ -387,6 +390,9 @@ static const match_table_t tokens = { {Opt_fragment_metadata, "fragment=metadata"}, {Opt_fragment_all, "fragment=all"}, #endif +#ifdef CONFIG_BTRFS_FS_REF_VERIFY + {Opt_ref_verify, "ref_verify"}, +#endif {Opt_err, NULL}, }; @@ -827,6 +833,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, btrfs_set_opt(info->mount_opt, FRAGMENT_DATA); break; #endif +#ifdef CONFIG_BTRFS_FS_REF_VERIFY + case Opt_ref_verify: + btrfs_info(info, "doing ref verification"); + btrfs_set_opt(info->mount_opt, + REF_VERIFY); + break; +#endif case Opt_err: btrfs_info(info, "unrecognized mount option '%s'", p); ret = -EINVAL; @@ -1309,6 +1322,8 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry) if (btrfs_test_opt(info, FRAGMENT_METADATA)) seq_puts(seq, ",fragment=metadata"); #endif + if (btrfs_test_opt(info, REF_VERIFY)) + seq_puts(seq, ",ref_verify"); seq_printf(seq, ",subvolid=%llu", 
BTRFS_I(d_inode(dentry))->root->root_key.objectid); seq_puts(seq, ",subvol="); -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 16/21] btrfs: add a comp_refs() helper
Instead of open-coding the delayed ref comparisons, add a helper to do the comparisons generically and use that everywhere. We compare sequence numbers last for following patches.

Signed-off-by: Josef Bacik
---
 fs/btrfs/delayed-ref.c | 54 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index bc940bb374cf..c4cfadb9768c 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -85,6 +85,34 @@ static int comp_data_refs(struct btrfs_delayed_data_ref *ref1,
 	return 0;
 }
 
+static int comp_refs(struct btrfs_delayed_ref_node *ref1,
+		     struct btrfs_delayed_ref_node *ref2,
+		     bool check_seq)
+{
+	int ret = 0;
+	if (ref1->type < ref2->type)
+		return -1;
+	if (ref1->type > ref2->type)
+		return 1;
+	if (ref1->type == BTRFS_TREE_BLOCK_REF_KEY ||
+	    ref1->type == BTRFS_SHARED_BLOCK_REF_KEY)
+		ret = comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref1),
+				     btrfs_delayed_node_to_tree_ref(ref2));
+	else
+		ret = comp_data_refs(btrfs_delayed_node_to_data_ref(ref1),
+				     btrfs_delayed_node_to_data_ref(ref2));
+	if (ret)
+		return ret;
+	if (check_seq) {
+		if (ref1->seq < ref2->seq)
+			return -1;
+		if (ref1->seq > ref2->seq)
+			return 1;
+	}
+	return 0;
+}
+
+
 /* insert a new ref to head ref rbtree */
 static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root,
 						   struct rb_node *node)
@@ -217,18 +245,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans,
 		if (seq && next->seq >= seq)
 			goto next;
 
-		if (next->type != ref->type)
-			goto next;
-
-		if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
-		     ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
-		    comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
-				   btrfs_delayed_node_to_tree_ref(next)))
-			goto next;
-		if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
-		     ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
-		    comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
-				   btrfs_delayed_node_to_data_ref(next)))
+		if (comp_refs(ref, next, false))
 			goto next;
 
 		if (ref->action == next->action) {
@@ -402,18 +419,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans,
 	exist = list_entry(href->ref_list.prev, struct btrfs_delayed_ref_node,
 			   list);
 	/* No need to compare bytenr nor is_head */
-	if (exist->type != ref->type || exist->seq != ref->seq)
-		goto add_tail;
-
-	if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY ||
-	     exist->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
-	    comp_tree_refs(btrfs_delayed_node_to_tree_ref(exist),
-			   btrfs_delayed_node_to_tree_ref(ref)))
-		goto add_tail;
-	if ((exist->type == BTRFS_EXTENT_DATA_REF_KEY ||
-	     exist->type == BTRFS_SHARED_DATA_REF_KEY) &&
-	    comp_data_refs(btrfs_delayed_node_to_data_ref(exist),
-			   btrfs_delayed_node_to_data_ref(ref)))
+	if (comp_refs(exist, ref, true))
 		goto add_tail;
 
 	/* Now we are sure we can merge */
-- 
2.7.4
[PATCH 07/21] Btrfs: only check delayed ref usage in should_end_transaction
We were only doing btrfs_check_space_for_delayed_refs() if the metadata space was full, i.e. we couldn't allocate chunks. This assumes we'll be able to allocate chunks during transaction commit, but since nothing does a LIMIT flush during the transaction commit this won't actually happen unless we happen to run shy of actual space. We already take a full fs into account in btrfs_check_space_for_delayed_refs(), so just kill this extra check to make sure we're ending the transaction when we need to.

Signed-off-by: Josef Bacik
---
 fs/btrfs/transaction.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9c5f126064bd..68c3e1c04bca 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -797,8 +797,7 @@ static int should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 
-	if (fs_info->global_block_rsv.space_info->full &&
-	    btrfs_check_space_for_delayed_refs(trans, fs_info))
+	if (btrfs_check_space_for_delayed_refs(trans, fs_info))
 		return 1;
 
 	return !!btrfs_block_rsv_check(&fs_info->global_block_rsv, 5);
-- 
2.7.4
[PATCH 06/21] Btrfs: add an extent ref verify tool
We were having corruption issues that were tied back to problems with the extent tree. In order to track them down I built this tool to try and find the culprit, which was pretty successful. If you compile with this tool on it will live verify every ref update that the fs makes and make sure it is consistent and valid. I've run this through with xfstests and haven't gotten any false positives. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/Kconfig | 11 + fs/btrfs/Makefile |1 + fs/btrfs/ctree.h |5 + fs/btrfs/disk-io.c |6 + fs/btrfs/extent-tree.c | 22 ++ fs/btrfs/ref-verify.c | 1012 fs/btrfs/ref-verify.h | 59 +++ 7 files changed, 1116 insertions(+) create mode 100644 fs/btrfs/ref-verify.c create mode 100644 fs/btrfs/ref-verify.h diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig index a26c63b4ad68..2e558227931a 100644 --- a/fs/btrfs/Kconfig +++ b/fs/btrfs/Kconfig @@ -91,3 +91,14 @@ config BTRFS_ASSERT any of the assertions trip. This is meant for btrfs developers only. If unsure, say N. + +config BTRFS_FS_REF_VERIFY + bool "Btrfs with the ref verify tool compiled in" + depends on BTRFS_FS + default n + help + Enable run-time extent reference verification instrumentation. This + is meant to be used by btrfs developers for tracking down extent + reference problems or verifying they didn't break something. + + If unsure, say N. 
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index 962a95aefb81..72c60f54f962 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -13,6 +13,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o +btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \ tests/extent-buffer-tests.o tests/btrfs-tests.o \ diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a7484a744ef0..4ffbe9f07cf7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1096,6 +1096,11 @@ struct btrfs_fs_info { u32 nodesize; u32 sectorsize; u32 stripesize; + +#ifdef CONFIG_BTRFS_FS_REF_VERIFY + spinlock_t ref_verify_lock; + struct rb_root block_tree; +#endif }; static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 1307907e19d8..778dc7682966 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -50,6 +50,7 @@ #include "sysfs.h" #include "qgroup.h" #include "compression.h" +#include "ref-verify.h" #ifdef CONFIG_X86 #include @@ -2776,6 +2777,7 @@ int open_ctree(struct super_block *sb, /* readahead state */ INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM); spin_lock_init(&fs_info->reada_lock); + btrfs_init_ref_verify(fs_info); fs_info->thread_pool_size = min_t(unsigned long, num_online_cpus() + 2, 8); @@ -3195,6 +3197,9 @@ int open_ctree(struct super_block *sb, if (ret) goto fail_trans_kthread; + if (btrfs_build_ref_tree(fs_info)) + btrfs_err(fs_info, "BTRFS: couldn't build ref tree\n"); + /* do not make disk changes in broken FS or nologreplay is given */ if (btrfs_super_log_root(disk_super) != 0 && !btrfs_test_opt(fs_info, NOLOGREPLAY)) { @@ -4060,6 +4065,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) cleanup_srcu_struct(&fs_info->subvol_srcu); btrfs_free_stripe_hash_table(fs_info); 
+ btrfs_free_ref_cache(fs_info); __btrfs_free_block_rsv(root->orphan_block_rsv); root->orphan_block_rsv = NULL; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2df22cae45b1..00d86c8afaef 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -39,6 +39,7 @@ #include "math.h" #include "sysfs.h" #include "qgroup.h" +#include "ref-verify.h" #undef SCRAMBLE_DELAYED_REFS @@ -2189,6 +2190,9 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID && root_objectid == BTRFS_TREE_LOG_OBJECTID); + btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid, + owner, offset, BTRFS_ADD_DELAYED_REF); + if (owner < BTRFS_FIRST_FREE_OBJECTID) { ret = btrfs_add_delayed_tree_ref(fs_info, trans, bytenr, num_bytes, parent, @@ -7223,6 +7227,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { int old_ref_mod, new_ref_mod; + btrfs_ref_tree_mod(root, buf->start, buf->len, parent, +
[PATCH 03/21] btrfs: make the delalloc block rsv per inode
The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand. Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying. Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 27 ++-- fs/btrfs/ctree.h | 5 +- fs/btrfs/delayed-inode.c | 46 +-- fs/btrfs/disk-io.c | 18 ++- fs/btrfs/extent-tree.c | 320 --- fs/btrfs/inode.c | 18 +-- 6 files changed, 141 insertions(+), 293 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 22daa79e77b8..f9c6887a8b6c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -36,14 +36,13 @@ #define BTRFS_INODE_ORPHAN_META_RESERVED 1 #define BTRFS_INODE_DUMMY 2 #define BTRFS_INODE_IN_DEFRAG 3 -#define BTRFS_INODE_DELALLOC_META_RESERVED 4 -#define BTRFS_INODE_HAS_ORPHAN_ITEM5 -#define BTRFS_INODE_HAS_ASYNC_EXTENT 6 -#define BTRFS_INODE_NEEDS_FULL_SYNC7 -#define BTRFS_INODE_COPY_EVERYTHING8 -#define BTRFS_INODE_IN_DELALLOC_LIST 9 -#define BTRFS_INODE_READDIO_NEED_LOCK 10 -#define BTRFS_INODE_HAS_PROPS 11 +#define BTRFS_INODE_HAS_ORPHAN_ITEM4 +#define BTRFS_INODE_HAS_ASYNC_EXTENT 5 +#define BTRFS_INODE_NEEDS_FULL_SYNC6 +#define BTRFS_INODE_COPY_EVERYTHING7 +#define BTRFS_INODE_IN_DELALLOC_LIST 8 +#define BTRFS_INODE_READDIO_NEED_LOCK 9 +#define BTRFS_INODE_HAS_PROPS 10 /* in memory btrfs inode */ struct btrfs_inode { @@ -176,7 +175,8 @@ struct btrfs_inode { * of extent items we've reserved metadata for. 
*/ unsigned outstanding_extents; - unsigned reserved_extents; + + struct btrfs_block_rsv block_rsv; /* * Cached values of inode properties @@ -278,15 +278,6 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode, mod); } -static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode, - int mod) -{ - ASSERT(spin_is_locked(&inode->lock)); - inode->reserved_extents += mod; - if (btrfs_is_free_space_inode(inode)) - return; -} - static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation) { int ret = 0; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1262612fbf78..93e767e16a43 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -763,8 +763,6 @@ struct btrfs_fs_info { * delayed dir index item */ struct btrfs_block_rsv global_block_rsv; - /* block reservation for delay allocation */ - struct btrfs_block_rsv delalloc_block_rsv; /* block reservation for metadata operations */ struct btrfs_block_rsv trans_block_rsv; /* block reservation for chunk tree */ @@ -2751,6 +2749,9 @@ int btrfs_delalloc_reserve_space(struct inode *inode, void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_fs_info *fs_info, unsigned short type); +void btrfs_init_metadata_block_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *rsv, + unsigned short type); void btrfs_free_block_rsv(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); void __btrfs_free_block_rsv(struct btrfs_block_rsv *rsv); diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 19e4ad2f3f2e..5d73f79ded8b 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -581,7 +581,6 @@ static int btrfs_delayed_inode_reserve_metadata( struct btrfs_block_rsv *dst_rsv; u64 num_bytes; int ret; - bool release = false; src_rsv = trans->block_rsv; dst_rsv = &fs_info->delayed_block_rsv; @@ -589,36 +588,13 @@ static int btrfs_delayed_inode_reserve_metadata( 
num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1); /* -* If our block_rsv is the delalloc block reserve then check and see if -* we have our extra reservation for updating the inode. If not fall -* through and try to reserve space quickly. -* -* We used to try and steal from the delalloc
[PATCH 08/21] btrfs: add a helper to return a head ref
Simplify the error handling in __btrfs_run_delayed_refs by breaking out the code used to return a head back to the delayed_refs tree for processing into a helper function. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 30 +++--- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 00d86c8afaef..f356b4a66186 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2576,6 +2576,17 @@ select_delayed_ref(struct btrfs_delayed_ref_head *head) return ref; } +static void +unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + spin_lock(&delayed_refs->lock); + head->processing = 0; + delayed_refs->num_heads_ready++; + spin_unlock(&delayed_refs->lock); + btrfs_delayed_ref_unlock(head); +} + /* * Returns 0 on success or if called with an already aborted transaction. * Returns -ENOMEM or -EIO on failure and will abort the transaction. @@ -2649,11 +2660,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, if (ref && ref->seq && btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) { spin_unlock(&locked_ref->lock); - spin_lock(&delayed_refs->lock); - locked_ref->processing = 0; - delayed_refs->num_heads_ready++; - spin_unlock(&delayed_refs->lock); - btrfs_delayed_ref_unlock(locked_ref); + unselect_delayed_ref_head(delayed_refs, locked_ref); locked_ref = NULL; cond_resched(); count++; @@ -2699,14 +2706,11 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, */ if (must_insert_reserved) locked_ref->must_insert_reserved = 1; - spin_lock(&delayed_refs->lock); - locked_ref->processing = 0; - delayed_refs->num_heads_ready++; - spin_unlock(&delayed_refs->lock); + unselect_delayed_ref_head(delayed_refs, + locked_ref); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); - btrfs_delayed_ref_unlock(locked_ref); return ret; } continue; @@ -2764,11 +2768,7 @@ static noinline 
int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, btrfs_free_delayed_extent_op(extent_op); if (ret) { - spin_lock(&delayed_refs->lock); - locked_ref->processing = 0; - delayed_refs->num_heads_ready++; - spin_unlock(&delayed_refs->lock); - btrfs_delayed_ref_unlock(locked_ref); + unselect_delayed_ref_head(delayed_refs, locked_ref); btrfs_put_delayed_ref(ref); btrfs_debug(fs_info, "run_one_delayed_ref returned %d", ret); -- 2.7.4
[PATCH 12/21] btrfs: move all ref head cleanup to the helper function
We do a couple different cleanup operations on the ref head. We adjust counters, we'll free any reserved space if we didn't end up using the ref, and we clear the pending csum bytes. Move all these disparate things into cleanup_ref_head and clean up the logic in __btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner, as well as making run_one_delayed_ref() only deal with real refs and not the ref head. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 144 ++--- 1 file changed, 64 insertions(+), 80 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ac1196af6b7e..260414bd86e7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2501,44 +2501,6 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, return 0; } - if (btrfs_delayed_ref_is_head(node)) { - struct btrfs_delayed_ref_head *head; - /* -* we've hit the end of the chain and we were supposed -* to insert this extent into the tree. But, it got -* deleted before we ever needed to insert it, so all -* we have to do is clean up the accounting -*/ - BUG_ON(extent_op); - head = btrfs_delayed_node_to_head(node); - trace_run_delayed_ref_head(fs_info, node, head, node->action); - - if (head->total_ref_mod < 0) { - struct btrfs_block_group_cache *cache; - - cache = btrfs_lookup_block_group(fs_info, node->bytenr); - ASSERT(cache); - percpu_counter_add(&cache->space_info->total_bytes_pinned, - -node->num_bytes); - btrfs_put_block_group(cache); - } - - if (insert_reserved) { - btrfs_pin_extent(fs_info, node->bytenr, -node->num_bytes, 1); - if (head->is_data) { - ret = btrfs_del_csums(trans, fs_info, - node->bytenr, - node->num_bytes); - } - } - - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); - return ret; - } - if (node->type == BTRFS_TREE_BLOCK_REF_KEY || node->type == BTRFS_SHARED_BLOCK_REF_KEY) ret = run_delayed_tree_ref(trans, fs_info, node, extent_op, @@ 
-2641,6 +2603,43 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs->num_heads--; rb_erase(&head->href_node, &delayed_refs->href_root); spin_unlock(&delayed_refs->lock); + spin_unlock(&head->lock); + atomic_dec(&delayed_refs->num_entries); + + trace_run_delayed_ref_head(fs_info, &head->node, head, + head->node.action); + + if (head->total_ref_mod < 0) { + struct btrfs_block_group_cache *cache; + + cache = btrfs_lookup_block_group(fs_info, head->node.bytenr); + ASSERT(cache); + percpu_counter_add(&cache->space_info->total_bytes_pinned, + -head->node.num_bytes); + btrfs_put_block_group(cache); + + if (head->is_data) { + spin_lock(&delayed_refs->lock); + delayed_refs->pending_csums -= head->node.num_bytes; + spin_unlock(&delayed_refs->lock); + } + } + + if (head->must_insert_reserved) { + btrfs_pin_extent(fs_info, head->node.bytenr, +head->node.num_bytes, 1); + if (head->is_data) { + ret = btrfs_del_csums(trans, fs_info, + head->node.bytenr, + head->node.num_bytes); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); + btrfs_delayed_ref_unlock(head); + btrfs_put_delayed_ref(&head->node); return 0; } @@ -2724,6 +2723,10 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, continue; } + /* +* We're done processing refs in this ref_head, clean everything +* up and move on to the next ref_head. +*/ if (!ref) { ret = cleanup_ref_he
[PATCH 13/21] btrfs: remove delayed_ref_node from ref_head
This is just excessive information in the ref_head, and makes the code complicated. It is a relic from when we had the heads and the refs in the same tree, which is no longer the case. With this removal I've cleaned up a bunch of the cruft around this old assumption as well. Signed-off-by: Josef Bacik --- fs/btrfs/backref.c | 4 +- fs/btrfs/delayed-ref.c | 126 +++ fs/btrfs/delayed-ref.h | 49 ++--- fs/btrfs/disk-io.c | 12 ++--- fs/btrfs/extent-tree.c | 90 --- include/trace/events/btrfs.h | 15 +++--- 6 files changed, 120 insertions(+), 176 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index b517ef1477ea..33cba1abf8b6 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -1178,7 +1178,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans, head = btrfs_find_delayed_ref_head(delayed_refs, bytenr); if (head) { if (!mutex_trylock(&head->mutex)) { - refcount_inc(&head->node.refs); + refcount_inc(&head->refs); spin_unlock(&delayed_refs->lock); btrfs_release_path(path); @@ -1189,7 +1189,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans, */ mutex_lock(&head->mutex); mutex_unlock(&head->mutex); - btrfs_put_delayed_ref(&head->node); + btrfs_put_delayed_ref_head(head); goto again; } spin_unlock(&delayed_refs->lock); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 93ffa898df6d..b9b41c838da4 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -96,15 +96,15 @@ static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root, u64 bytenr; ins = rb_entry(node, struct btrfs_delayed_ref_head, href_node); - bytenr = ins->node.bytenr; + bytenr = ins->bytenr; while (*p) { parent_node = *p; entry = rb_entry(parent_node, struct btrfs_delayed_ref_head, href_node); - if (bytenr < entry->node.bytenr) + if (bytenr < entry->bytenr) p = &(*p)->rb_left; - else if (bytenr > entry->node.bytenr) + else if (bytenr > entry->bytenr) p = &(*p)->rb_right; else return entry; @@ -133,15 +133,15 @@ 
find_ref_head(struct rb_root *root, u64 bytenr, while (n) { entry = rb_entry(n, struct btrfs_delayed_ref_head, href_node); - if (bytenr < entry->node.bytenr) + if (bytenr < entry->bytenr) n = n->rb_left; - else if (bytenr > entry->node.bytenr) + else if (bytenr > entry->bytenr) n = n->rb_right; else return entry; } if (entry && return_bigger) { - if (bytenr > entry->node.bytenr) { + if (bytenr > entry->bytenr) { n = rb_next(&entry->href_node); if (!n) n = rb_first(root); @@ -164,17 +164,17 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans, if (mutex_trylock(&head->mutex)) return 0; - refcount_inc(&head->node.refs); + refcount_inc(&head->refs); spin_unlock(&delayed_refs->lock); mutex_lock(&head->mutex); spin_lock(&delayed_refs->lock); - if (!head->node.in_tree) { + if (RB_EMPTY_NODE(&head->href_node)) { mutex_unlock(&head->mutex); - btrfs_put_delayed_ref(&head->node); + btrfs_put_delayed_ref_head(head); return -EAGAIN; } - btrfs_put_delayed_ref(&head->node); + btrfs_put_delayed_ref_head(head); return 0; } @@ -183,15 +183,10 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head, struct btrfs_delayed_ref_node *ref) { - if (btrfs_delayed_ref_is_head(ref)) { - head = btrfs_delayed_node_to_head(ref); - rb_erase(&head->href_node, &delayed_refs->href_root); - } else { - assert_spin_locked(&head->lock); - list_del(&ref->list); - if (!list_empty(&ref->add_list)) - list_del(&ref->add_list); - } + assert_spin_locked(&head->lock); + list_del(&ref->list); + if (!lis
[PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
We're holding the sb_start_intwrite lock at this point, and doing async filemap_flush of the inodes will result in a deadlock if we freeze the fs during this operation. This is because we could do a btrfs_join_transaction() in the thread we are waiting on which would block at sb_start_intwrite, and thus deadlock. Using writeback_inodes_sb() side steps the problem by not introducing all of these extra locking dependencies. Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 68c3e1c04bca..9fed8c67b6e8 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1917,7 +1917,7 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, static inline int btrfs_start_delalloc_flush(struct btrfs_fs_info *fs_info) { if (btrfs_test_opt(fs_info, FLUSHONCOMMIT)) - return btrfs_start_delalloc_roots(fs_info, 1, -1); + writeback_inodes_sb(fs_info->sb, WB_REASON_SYNC); return 0; } -- 2.7.4
[PATCH 17/21] btrfs: track refs in a rb_tree instead of a list
If we get a significant amount of delayed refs for a single block (think modifying multiple snapshots) we can end up spending an ungodly amount of time looping through all of the entries trying to see if they can be merged. This is because we only add them to a list, so we have O(2n) for every ref head. This doesn't make any sense as we likely have refs for different roots, and so they cannot be merged. Tracking in a tree will allow us to break as soon as we hit an entry that doesn't match, making our worst case O(n). With this we can also merge entries more easily. Before we had to hope that matching refs were on the ends of our list, but with the tree we can search down to exact matches and merge them at insert time. Signed-off-by: Josef Bacik --- fs/btrfs/backref.c | 5 ++- fs/btrfs/delayed-ref.c | 107 + fs/btrfs/delayed-ref.h | 5 +-- fs/btrfs/disk-io.c | 10 +++-- fs/btrfs/extent-tree.c | 21 ++ 5 files changed, 81 insertions(+), 67 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 33cba1abf8b6..9b627b895806 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -769,6 +769,7 @@ static int add_delayed_refs(const struct btrfs_fs_info *fs_info, struct btrfs_key key; struct btrfs_key tmp_op_key; struct btrfs_key *op_key = NULL; + struct rb_node *n; int count; int ret = 0; @@ -778,7 +779,9 @@ static int add_delayed_refs(const struct btrfs_fs_info *fs_info, } spin_lock(&head->lock); - list_for_each_entry(node, &head->ref_list, list) { + for (n = rb_first(&head->ref_tree); n; n = rb_next(n)) { + node = rb_entry(n, struct btrfs_delayed_ref_node, + ref_node); if (node->seq > seq) continue; diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index c4cfadb9768c..48a9b23774e6 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -143,6 +143,33 @@ static struct btrfs_delayed_ref_head *htree_insert(struct rb_root *root, return NULL; } +static struct btrfs_delayed_ref_node * +tree_insert(struct rb_root *root, struct 
btrfs_delayed_ref_node *ins) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *node = &ins->ref_node; + struct rb_node *parent_node = NULL; + struct btrfs_delayed_ref_node *entry; + + while (*p) { + int comp; + parent_node = *p; + entry = rb_entry(parent_node, struct btrfs_delayed_ref_node, +ref_node); + comp = comp_refs(ins, entry, true); + if (comp < 0) + p = &(*p)->rb_left; + else if (comp > 0) + p = &(*p)->rb_right; + else + return entry; + } + + rb_link_node(node, parent_node, p); + rb_insert_color(node, root); + return NULL; +} + /* * find an head entry based on bytenr. This returns the delayed ref * head if it was able to find one, or NULL if nothing was in that spot. @@ -212,7 +239,8 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_node *ref) { assert_spin_locked(&head->lock); - list_del(&ref->list); + rb_erase(&ref->ref_node, &head->ref_tree); + RB_CLEAR_NODE(&ref->ref_node); if (!list_empty(&ref->add_list)) list_del(&ref->add_list); ref->in_tree = 0; @@ -229,24 +257,18 @@ static bool merge_ref(struct btrfs_trans_handle *trans, u64 seq) { struct btrfs_delayed_ref_node *next; + struct rb_node *node = rb_next(&ref->ref_node); bool done = false; - next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node, - list); - while (!done && &next->list != &head->ref_list) { + while (!done && node) { int mod; - struct btrfs_delayed_ref_node *next2; - - next2 = list_next_entry(next, list); - - if (next == ref) - goto next; + next = rb_entry(node, struct btrfs_delayed_ref_node, ref_node); + node = rb_next(node); if (seq && next->seq >= seq) - goto next; - + break; if (comp_refs(ref, next, false)) - goto next; + break; if (ref->action == next->action) { mod = next->ref_mod; @@ -270,8 +292,6 @@ static bool merge_ref(struct btrfs_trans_handle *trans, WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY || ref->type == BTRFS_SHARED_BLOCK_REF_KEY); } -next: -
[PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel
We pass in a pointer in our send arg struct, this means the struct size doesn't match with 32bit user space and 64bit kernel space. Fix this by adding a compat mode and doing the appropriate conversion. Signed-off-by: Josef Bacik --- fs/btrfs/ioctl.c | 53 - fs/btrfs/send.c | 12 ++-- fs/btrfs/send.h | 2 +- 3 files changed, 55 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 398495f79c83..6a07d4e12fd2 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -5464,6 +5464,53 @@ static int btrfs_ioctl_set_features(struct file *file, void __user *arg) return ret; } +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT) +struct btrfs_ioctl_send_args_32 { + __s64 send_fd; /* in */ + __u64 clone_sources_count; /* in */ + compat_uptr_t clone_sources;/* in */ + __u64 parent_root; /* in */ + __u64 flags;/* in */ + __u64 reserved[4]; /* in */ +} __attribute__ ((__packed__)); +#define BTRFS_IOC_SEND_32 _IOW(BTRFS_IOCTL_MAGIC, 38, \ + struct btrfs_ioctl_send_args_32) +#endif + +static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat) +{ + struct btrfs_ioctl_send_args *arg; + int ret; + + if (compat) { +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT) + struct btrfs_ioctl_send_args_32 args32; + ret = copy_from_user(&args32, argp, sizeof(args32)); + if (ret) + return -EFAULT; + arg = kzalloc(sizeof(*arg), GFP_KERNEL); + if (!arg) + return -ENOMEM; + arg->send_fd = args32.send_fd; + arg->clone_sources_count = args32.clone_sources_count; + arg->clone_sources = compat_ptr(args32.clone_sources); + arg->parent_root = args32.parent_root; + arg->flags = args32.flags; + memcpy(arg->reserved, args32.reserved, + sizeof(args32.reserved)); +#else + return -ENOTTY; +#endif + } else { + arg = memdup_user(argp, sizeof(*arg)); + if (IS_ERR(arg)) + return PTR_ERR(arg); + } + ret = btrfs_ioctl_send(file, arg); + kfree(arg); + return ret; +} + long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -5569,7 +5616,11 @@ 
long btrfs_ioctl(struct file *file, unsigned int return btrfs_ioctl_set_received_subvol_32(file, argp); #endif case BTRFS_IOC_SEND: - return btrfs_ioctl_send(file, argp); + return _btrfs_ioctl_send(file, argp, false); +#if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT) + case BTRFS_IOC_SEND_32: + return _btrfs_ioctl_send(file, argp, true); +#endif case BTRFS_IOC_GET_DEV_STATS: return btrfs_ioctl_get_dev_stats(fs_info, argp); case BTRFS_IOC_QUOTA_CTL: diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 0746eda7231d..d9ddcdbdd2e7 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "send.h" #include "backref.h" @@ -6371,13 +6372,12 @@ static void btrfs_root_dec_send_in_progress(struct btrfs_root* root) spin_unlock(&root->root_item_lock); } -long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_) +long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg) { int ret = 0; struct btrfs_root *send_root = BTRFS_I(file_inode(mnt_file))->root; struct btrfs_fs_info *fs_info = send_root->fs_info; struct btrfs_root *clone_root; - struct btrfs_ioctl_send_args *arg = NULL; struct btrfs_key key; struct send_ctx *sctx = NULL; u32 i; @@ -6413,13 +6413,6 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_) goto out; } - arg = memdup_user(arg_, sizeof(*arg)); - if (IS_ERR(arg)) { - ret = PTR_ERR(arg); - arg = NULL; - goto out; - } - /* * Check that we don't overflow at later allocations, we request * clone_sources_count + 1 items, and compare to unsigned long inside @@ -6660,7 +6653,6 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_) if (sctx && !IS_ERR_OR_NULL(sctx->parent_root)) btrfs_root_dec_send_in_progress(sctx->parent_root); - kfree(arg); kvfree(clone_sources_tmp); if (sctx) { diff --git a/fs/btrfs/send.h b/fs/btrfs/send.h index 02e00166c4da..3aa4bc55754f 100644 --- a/fs/btrfs/send.h +++ b/fs/btrfs/send.h @@ -130,5 +130,5 @@ enum { #define BTRFS_SEND_A_MAX 
(__BTRFS_SEND_A_MAX - 1) #ifdef __KERNEL__ -long btrfs_ioctl_send(struct
[PATCH 21/21] btrfs: add assertions for releasing trans handle reservations
These are useful for debugging problems where we mess with trans->block_rsv to make sure we're not screwing something up. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index dc966978ca7b..0bdc10b453b9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5897,12 +5897,15 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info) void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { - if (!trans->block_rsv) + if (!trans->block_rsv) { + ASSERT(!trans->bytes_reserved); return; + } if (!trans->bytes_reserved) return; + ASSERT(trans->block_rsv == &fs_info->trans_block_rsv); trace_btrfs_space_reservation(fs_info, "transaction", trans->transid, trans->bytes_reserved, 0); btrfs_block_rsv_release(fs_info, trans->block_rsv, -- 2.7.4
[PATCH 14/21] btrfs: remove type argument from comp_tree_refs
We can get this from the ref we've passed in. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 10 -- 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index b9b41c838da4..a2973340a94f 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -41,9 +41,9 @@ struct kmem_cache *btrfs_delayed_extent_op_cachep; * compare two delayed tree backrefs with same bytenr and type */ static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref2, - struct btrfs_delayed_tree_ref *ref1, int type) + struct btrfs_delayed_tree_ref *ref1) { - if (type == BTRFS_TREE_BLOCK_REF_KEY) { + if (ref1->node.type == BTRFS_TREE_BLOCK_REF_KEY) { if (ref1->root < ref2->root) return -1; if (ref1->root > ref2->root) @@ -223,8 +223,7 @@ static bool merge_ref(struct btrfs_trans_handle *trans, if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY || ref->type == BTRFS_SHARED_BLOCK_REF_KEY) && comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref), - btrfs_delayed_node_to_tree_ref(next), - ref->type)) + btrfs_delayed_node_to_tree_ref(next))) goto next; if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY || ref->type == BTRFS_SHARED_DATA_REF_KEY) && @@ -409,8 +408,7 @@ add_delayed_ref_tail_merge(struct btrfs_trans_handle *trans, if ((exist->type == BTRFS_TREE_BLOCK_REF_KEY || exist->type == BTRFS_SHARED_BLOCK_REF_KEY) && comp_tree_refs(btrfs_delayed_node_to_tree_ref(exist), - btrfs_delayed_node_to_tree_ref(ref), - ref->type)) + btrfs_delayed_node_to_tree_ref(ref))) goto add_tail; if ((exist->type == BTRFS_EXTENT_DATA_REF_KEY || exist->type == BTRFS_SHARED_DATA_REF_KEY) && -- 2.7.4
[PATCH 01/21] Btrfs: rework outstanding_extents
Right now we jump through a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents.

Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this

btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
 btrfs_set_delalloc		- outstanding_extents = 2
btrfs_delalloc_release_extents	- outstanding_extents = 1

for an initial file write. Now take the append write where we extend an existing delalloc range but are still under the maximum extent size

btrfs_delalloc_reserve_metadata	- outstanding_extents = 2
 btrfs_set_delalloc
  btrfs_set_bit_hook		- outstanding_extents = 3
  btrfs_merge_bit_hook		- outstanding_extents = 2
btrfs_delalloc_release_extents	- outstanding_extents = 1

In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with

btrfs_add_ordered_extent	- outstanding_extents = 2
clear_extent_bit		- outstanding_extents = 1
btrfs_remove_ordered_extent	- outstanding_extents = 0

This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_delalloc_release_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter. The drawback to this is that we are now much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write.
This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat. I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues. Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 18 ++ fs/btrfs/ctree.h | 2 + fs/btrfs/extent-tree.c | 139 --- fs/btrfs/file.c | 22 +++ fs/btrfs/inode-map.c | 3 +- fs/btrfs/inode.c | 114 +++ fs/btrfs/ioctl.c | 2 + fs/btrfs/ordered-data.c | 21 ++- fs/btrfs/relocation.c| 3 + fs/btrfs/tests/inode-tests.c | 18 ++ 10 files changed, 186 insertions(+), 156 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index eccadb5f62a5..a6a22ef41f91 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -267,6 +267,24 @@ static inline bool btrfs_is_free_space_inode(struct btrfs_inode *inode) return false; } +static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode, +int mod) +{ + ASSERT(spin_is_locked(&inode->lock)); + inode->outstanding_extents += mod; + if (btrfs_is_free_space_inode(inode)) + return; +} + +static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode, + int mod) +{ + ASSERT(spin_is_locked(&inode->lock)); + inode->reserved_extents += mod; + if (btrfs_is_free_space_inode(inode)) + return; +} + static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation) { int ret = 0; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a7f68c304b4c..1262612fbf78 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2742,6 +2742,8 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root, u64 *qgroup_reserved, bool use_global_rsv); void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *rsv); +void btrfs_delalloc_release_extents(struct btrfs_inode *inode, u64 num_bytes); + int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 
num_bytes); void btrfs_delalloc_release_metadata(struct btrfs_inode *inode, u64 num_bytes); int btrfs_delalloc_reserve_space(struct inode *inode, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1a6aced00a19..aa0f5c8953b0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5971,42 +5971,31 @@ void btrfs_subvolume_release_metadata(struct btrfs_fs_info *fs_info, } /** - * drop_outstanding_extent - drop an outstanding extent + * drop_over_reserved
[PATCH 00/21] My current btrfs patch queue
This is my current set of outstanding patches. A lot of these had reviews and I've incorporated the feedback. They have been pretty thoroughly tested and are pretty solid.

[PATCH 01/21] Btrfs: rework outstanding_extents
[PATCH 02/21] btrfs: add tracepoints for outstanding extents mods
[PATCH 03/21] btrfs: make the delalloc block rsv per inode

These are simplification patches. The way we handled metadata reservation for multiple extents for an inode was really complicated and fragile; these patches go a long way towards simplifying that code. We only have one outstanding_extents counter, and the rules for changing that counter are more straightforward and logical. There are some uglier cases still, but this is as clean as I could think to do it at the moment.

[PATCH 04/21] btrfs: add ref-verify mount option
[PATCH 05/21] btrfs: pass root to various extent ref mod functions
[PATCH 06/21] Btrfs: add a extent ref verify tool

This is the ref verification set. I've broken the patches up a little more since I originally posted them, but it's a big feature so it's just going to be a bit large unfortunately. I also fixed all of the printk's so they are btrfs_err() now. I folded in all of the various fixes that came from Marc Merlin's usage of the patches.

[PATCH 07/21] Btrfs: only check delayed ref usage in
[PATCH 08/21] btrfs: add a helper to return a head ref
[PATCH 09/21] btrfs: move extent_op cleanup to a helper
[PATCH 10/21] btrfs: breakout empty head cleanup to a helper
[PATCH 11/21] btrfs: move ref_mod modification into the if (ref)
[PATCH 12/21] btrfs: move all ref head cleanup to the helper function
[PATCH 13/21] btrfs: remove delayed_ref_node from ref_head

Delayed refs cleanup. run_delayed_refs was kind of complicated and we did two things in two different places. Either we ran the actual delayed ref, or we cleaned up the delayed_ref_head, and we did this in different places depending on how things went.
These are mostly moving code around, adding helpers to make it clear what's happening, and making __btrfs_run_delayed_refs smaller and clearer. I also made btrfs_delayed_ref_head its own thing and took out the old delayed_ref_node that was embedded in it. That existed because heads and refs used to be on the same list, but since that isn't the case anymore we don't need it to be that way.

[PATCH 14/21] btrfs: remove type argument from comp_tree_refs
[PATCH 15/21] btrfs: switch args for comp_*_refs
[PATCH 16/21] btrfs: add a comp_refs() helper
[PATCH 17/21] btrfs: track refs in a rb_tree instead of a list

With lots of snapshots we can end up in merge_delayed_refs for _ever_ because we're constantly going over the list to find matches. I converted the ref list in the ref_head to a rb_tree so we can go straight to our refs, which makes searching and on-the-fly merging immediate, and makes post-processing of the tree significantly faster. Now I'm no longer seeing multi-minute commit times with lots of snapshots.

[PATCH 18/21] btrfs: fix send ioctl on 32bit with 64bit kernel
[PATCH 19/21] btrfs: don't call btrfs_start_delalloc_roots in
[PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle
[PATCH 21/21] btrfs: add assertions for releasing trans handle

Fixes for various problems I found while running xfstests. One is for 32bit ioctls on a 64bit kernel, two are lockdep fixes, and the last one is just so I can catch myself doing stupid stuff in the future.

Dave, this is based on for-next-20170929, and is in my btrfs-next tree in the master branch if you want to just pull it.

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/21] btrfs: move ref_mod modification into the if (ref) logic
We only use this logic if our ref isn't a ref_head, so move it up into the if (ref) case since we know that this is a normal ref and not a delayed ref head. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index bae2eac11db7..ac1196af6b7e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2745,10 +2745,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, list_del(&ref->list); if (!list_empty(&ref->add_list)) list_del(&ref->add_list); - } - atomic_dec(&delayed_refs->num_entries); - - if (!btrfs_delayed_ref_is_head(ref)) { /* * when we play the delayed ref, also correct the * ref_mod on head @@ -2765,6 +2761,8 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, WARN_ON(1); } } + atomic_dec(&delayed_refs->num_entries); + /* * record the must insert reserved flag before we * drop the spin lock. -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 20/21] btrfs: move btrfs_truncate_block out of trans handle
Since we do a delalloc reserve in btrfs_truncate_block we can deadlock with freeze. If somebody else is trying to allocate metadata for this inode and it gets stuck in start_delalloc_inodes because of freeze we will deadlock. Be safe and move this outside of a trans handle. This also has a side-effect of making sure that we're not leaving stale data behind in the other_encoding or encryption case. Not an issue now since nobody uses it, but it would be a problem in the future. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 119 --- 1 file changed, 44 insertions(+), 75 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3cbddfc181dc..46b5632a7c6d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4364,47 +4364,11 @@ static int truncate_space_check(struct btrfs_trans_handle *trans, } -static int truncate_inline_extent(struct inode *inode, - struct btrfs_path *path, - struct btrfs_key *found_key, - const u64 item_end, - const u64 new_size) -{ - struct extent_buffer *leaf = path->nodes[0]; - int slot = path->slots[0]; - struct btrfs_file_extent_item *fi; - u32 size = (u32)(new_size - found_key->offset); - struct btrfs_root *root = BTRFS_I(inode)->root; - - fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item); - - if (btrfs_file_extent_compression(leaf, fi) != BTRFS_COMPRESS_NONE) { - loff_t offset = new_size; - loff_t page_end = ALIGN(offset, PAGE_SIZE); - - /* -* Zero out the remaining of the last page of our inline extent, -* instead of directly truncating our inline extent here - that -* would be much more complex (decompressing all the data, then -* compressing the truncated data, which might be bigger than -* the size of the inline extent, resize the extent, etc). -* We release the path because to get the page we might need to -* read the extent item from disk (data not in the page cache). 
-*/ - btrfs_release_path(path); - return btrfs_truncate_block(inode, offset, page_end - offset, - 0); - } - - btrfs_set_file_extent_ram_bytes(leaf, fi, size); - size = btrfs_file_extent_calc_inline_size(size); - btrfs_truncate_item(root->fs_info, path, size, 1); - - if (test_bit(BTRFS_ROOT_REF_COWS, &root->state)) - inode_sub_bytes(inode, item_end + 1 - new_size); - - return 0; -} +/* + * Return this if we need to call truncate_block for the last bit of the + * truncate. + */ +#define NEED_TRUNCATE_BLOCK 1 /* * this can truncate away extent items, csum items and directory items. @@ -4565,11 +4529,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (found_type != BTRFS_EXTENT_DATA_KEY) goto delete; - if (del_item) - last_size = found_key.offset; - else - last_size = new_size; - if (extent_type != BTRFS_FILE_EXTENT_INLINE) { u64 num_dec; extent_start = btrfs_file_extent_disk_bytenr(leaf, fi); @@ -4611,40 +4570,29 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, */ if (!del_item && btrfs_file_extent_encryption(leaf, fi) == 0 && - btrfs_file_extent_other_encoding(leaf, fi) == 0) { - + btrfs_file_extent_other_encoding(leaf, fi) == 0 && + btrfs_file_extent_compression(leaf, fi) == 0) { + u32 size = (u32)(new_size - found_key.offset); + btrfs_set_file_extent_ram_bytes(leaf, fi, size); + size = btrfs_file_extent_calc_inline_size(size); + btrfs_truncate_item(root->fs_info, path, size, 1); + } else if (!del_item) { /* -* Need to release path in order to truncate a -* compressed extent. So delete any accumulated -* extent items so far. +* We have to bail so the last_size is set to +* just before this extent. */ - if (btrfs_file_extent_compression(leaf, fi) != - BTRFS_COMPRESS_NONE && pending_del_nr) { - err = btrfs_del_items(trans, root, path, -
[PATCH 15/21] btrfs: switch args for comp_*_refs
Because seriously? ref2 and then ref1? Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index a2973340a94f..bc940bb374cf 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -40,8 +40,8 @@ struct kmem_cache *btrfs_delayed_extent_op_cachep; /* * compare two delayed tree backrefs with same bytenr and type */ -static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref2, - struct btrfs_delayed_tree_ref *ref1) +static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref1, + struct btrfs_delayed_tree_ref *ref2) { if (ref1->node.type == BTRFS_TREE_BLOCK_REF_KEY) { if (ref1->root < ref2->root) @@ -60,8 +60,8 @@ static int comp_tree_refs(struct btrfs_delayed_tree_ref *ref2, /* * compare two delayed data backrefs with same bytenr and type */ -static int comp_data_refs(struct btrfs_delayed_data_ref *ref2, - struct btrfs_delayed_data_ref *ref1) +static int comp_data_refs(struct btrfs_delayed_data_ref *ref1, + struct btrfs_delayed_data_ref *ref2) { if (ref1->node.type == BTRFS_EXTENT_DATA_REF_KEY) { if (ref1->root < ref2->root) -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/21] btrfs: add tracepoints for outstanding extents mods
This is handy for tracing problems with modifying the outstanding extents counters. Signed-off-by: Josef Bacik --- fs/btrfs/btrfs_inode.h | 2 ++ include/trace/events/btrfs.h | 21 + 2 files changed, 23 insertions(+) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index a6a22ef41f91..22daa79e77b8 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -274,6 +274,8 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode, inode->outstanding_extents += mod; if (btrfs_is_free_space_inode(inode)) return; + trace_btrfs_inode_mod_outstanding_extents(inode->root, btrfs_ino(inode), + mod); } static inline void btrfs_mod_reserved_extents(struct btrfs_inode *inode, diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 77437f545c63..f0a374a720b0 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1698,6 +1698,27 @@ DEFINE_EVENT(btrfs__prelim_ref, btrfs_prelim_ref_insert, TP_ARGS(fs_info, oldref, newref, tree_size) ); +TRACE_EVENT(btrfs_inode_mod_outstanding_extents, + TP_PROTO(struct btrfs_root *root, u64 ino, int mod), + + TP_ARGS(root, ino, mod), + + TP_STRUCT__entry_btrfs( + __field(u64, root_objectid ) + __field(u64, ino) + __field(int, mod) + ), + + TP_fast_assign_btrfs(root->fs_info, + __entry->root_objectid = root->objectid; + __entry->ino= ino; + __entry->mod= mod; + ), + + TP_printk_btrfs("root = %llu(%s) ino = %llu mod = %d", + show_root_type(__entry->root_objectid), + (unsigned long long)__entry->ino, __entry->mod) +); #endif /* _TRACE_BTRFS_H */ /* This part must be outside protection */ -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/21] btrfs: move extent_op cleanup to a helper
Move the extent_op cleanup for an empty head ref to a helper function to help simplify __btrfs_run_delayed_refs. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 77 ++ 1 file changed, 40 insertions(+), 37 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f356b4a66186..f4048b23c7be 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2587,6 +2587,26 @@ unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs, btrfs_delayed_ref_unlock(head); } +static int cleanup_extent_op(struct btrfs_trans_handle *trans, +struct btrfs_fs_info *fs_info, +struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_extent_op *extent_op = head->extent_op; + int ret; + + if (!extent_op) + return 0; + head->extent_op = NULL; + if (head->must_insert_reserved) { + btrfs_free_delayed_extent_op(extent_op); + return 0; + } + spin_unlock(&head->lock); + ret = run_delayed_extent_op(trans, fs_info, &head->node, extent_op); + btrfs_free_delayed_extent_op(extent_op); + return ret ? ret : 1; +} + /* * Returns 0 on success or if called with an already aborted transaction. * Returns -ENOMEM or -EIO on failure and will abort the transaction. @@ -2667,16 +2687,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, continue; } - /* -* record the must insert reserved flag before we -* drop the spin lock. 
-*/ - must_insert_reserved = locked_ref->must_insert_reserved; - locked_ref->must_insert_reserved = 0; - - extent_op = locked_ref->extent_op; - locked_ref->extent_op = NULL; - if (!ref) { @@ -2686,33 +2696,17 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, */ ref = &locked_ref->node; - if (extent_op && must_insert_reserved) { - btrfs_free_delayed_extent_op(extent_op); - extent_op = NULL; - } - - if (extent_op) { - spin_unlock(&locked_ref->lock); - ret = run_delayed_extent_op(trans, fs_info, - ref, extent_op); - btrfs_free_delayed_extent_op(extent_op); - - if (ret) { - /* -* Need to reset must_insert_reserved if -* there was an error so the abort stuff -* can cleanup the reserved space -* properly. -*/ - if (must_insert_reserved) - locked_ref->must_insert_reserved = 1; - unselect_delayed_ref_head(delayed_refs, - locked_ref); - btrfs_debug(fs_info, - "run_delayed_extent_op returned %d", - ret); - return ret; - } + ret = cleanup_extent_op(trans, fs_info, locked_ref); + if (ret < 0) { + unselect_delayed_ref_head(delayed_refs, + locked_ref); + btrfs_debug(fs_info, + "run_delayed_extent_op returned %d", + ret); + return ret; + } else if (ret > 0) { + /* We dropped our lock, we need to loop. */ + ret = 0; continue; } @@ -2761,6 +2755,15 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, WARN_ON(1); } } + /* +* record the must insert reserved flag before we +* drop the spin lock. +*/ + must_insert_reserved = locked_ref->must_insert_reserved; + locked_ref->must_insert_reserved = 0; + + extent_op = locked_ref->extent_op; + locked_re
Re: [PATCH v3 2/2] btrfs: Remove received_uuid during received snapshot ro->rw switch
On 29.09.2017 20:56, David Sterba wrote: > On Thu, Sep 28, 2017 at 10:53:18AM +0300, Nikolay Borisov wrote: >> Currently when a read-only snapshot is received and subsequently its ro >> property >> is set to false i.e. switched to rw-mode the received_uuid of that subvol >> remains >> intact. However, once the received volume is switched to RW mode we cannot >> guarantee that it contains the same data, so it makes sense to remove the >> received uuid. The presence of the received_uuid can also cause problems when >> the volume is being sent. >> >> Signed-off-by: Nikolay Borisov >> --- >> >> v3: >> * Rework the patch considering latest feedback from David Sterba i.e. >> explicitly use btrfs_end_transaction >> >> fs/btrfs/ioctl.c | 36 +--- >> 1 file changed, 29 insertions(+), 7 deletions(-) >> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index ee4ee7cbba72..c0374125cec2 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -1811,6 +1811,17 @@ static noinline int >> btrfs_ioctl_subvol_setflags(struct file *file, >> goto out_drop_sem; >> >> root_flags = btrfs_root_flags(&root->root_item); >> + >> +/* >> + * 1 - root item >> + * 1 - uuid item >> + */ >> +trans = btrfs_start_transaction(root, 2); >> +if (IS_ERR(trans)) { >> +ret = PTR_ERR(trans); >> +goto out_drop_sem; >> +} >> + >> if (flags & BTRFS_SUBVOL_RDONLY) { >> btrfs_set_root_flags(&root->root_item, >> root_flags | BTRFS_ROOT_SUBVOL_RDONLY); >> @@ -1824,22 +1835,33 @@ static noinline int >> btrfs_ioctl_subvol_setflags(struct file *file, >> btrfs_set_root_flags(&root->root_item, >> root_flags & ~BTRFS_ROOT_SUBVOL_RDONLY); >> spin_unlock(&root->root_item_lock); >> +if >> (!btrfs_is_empty_uuid(root->root_item.received_uuid)) { >> +ret = btrfs_uuid_tree_rem(trans, fs_info, >> + root->root_item.received_uuid, >> + BTRFS_UUID_KEY_RECEIVED_SUBVOL, >> + root->root_key.objectid); >> + >> +if (ret && ret != -ENOENT) { >> +btrfs_abort_transaction(trans, ret); >> +btrfs_end_transaction(trans); >> +goto
out_reset; >> + } >> + >> +memset(root->root_item.received_uuid, 0, >> + BTRFS_UUID_SIZE); >> +} >> } else { >> spin_unlock(&root->root_item_lock); >> btrfs_warn(fs_info, >> "Attempt to set subvolume %llu read-write >> during send", >> root->root_key.objectid); >> ret = -EPERM; >> -goto out_drop_sem; >> +btrfs_abort_transaction(trans, ret); >> +btrfs_end_transaction(trans); >> +goto out_reset; > > Adding the transaction before the "if (flags & BTRFS_SUBVOL_RDONLY)" > condition makes it much worse. The "is subvolume in send" test is > supposed to be lightweight and should not shoot down the whole > filesystem. The usecase is explained in 2c68653787f91c62f8. > > Also the received_uuid must be changed under the root_item_lock. > > I think it should be fine to keep the transaction start where it is, > change the received_uuid eventually and let it commit. You can set the > transaction units to 2 unconditionally. So what you are suggesting is to not move the transaction start before the if check? But then how would you structure the code to remove the uuid only if we are switching RO->RW and not in send, without duplicating the checks right before btrfs_update_root? > >> } >> } >> >> -trans = btrfs_start_transaction(root, 1); >> -if (IS_ERR(trans)) { >> -ret = PTR_ERR(trans); >> -goto out_reset; >> -} >> - >> ret = btrfs_update_root(trans, fs_info->tree_root, >> &root->root_key, &root->root_item); >> if (ret < 0) { -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: delayed inodes implementation questions
On Fri, Sep 29, 2017 at 04:00:29PM +0800, Qu Wenruo wrote: > On 2017年09月29日 15:15, Nikolay Borisov wrote: > > I've been studying the implementation of the delayed inodes feature and > > have a couple of questions. > > Sorry I can't answer your question, but as you may already know from the > rejection mail, Miao Xie has left Fujitsu and has been gone for several years. > > And I'm not sure if he still checks the btrfs mailing list, so you may be on > your own. > > So to avoid such situations, it would be better to maintain a wiki page > for such less-maintained features, covering their design principles and > possible corner cases. What is a less maintained feature? If it works as expected we don't need to touch the code, besides cleanups. Anyway, I did the review of delayed inodes back then and can possibly answer Nikolay's question. > This will make later developers quite happy. > (personally speaking, this should be done before coding) > > Thanks, > Qu > > > > 1. Why are there 2 lists when each and every node is on both lists at > > the same time? So if we happen to be running the async worker we are > > processing nodes based on the prepared_list; at the same time, if there > > is a concurrent transaction commit happening, then the exact same nodes > > are going to be processed from the node_list - it's not a functional > > problem, but the presence of 2 lists with each node being on both lists > > at the same time just seems a bit redundant. > > > > 2. in btrfs_next_delayed_node can this code really trigger: > > > > if (!test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags)) { > > /* not in the list */ > > if (list_empty(&delayed_root->node_list)) > > goto out; > > p = delayed_root->node_list.next; > > > > In order for !test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags) to be > > true, this node must have been btrfs_dequeue_delayed_node'd underneath us? > > Can this really happen, e.g. can we have multiple concurrent > > __btrfs_run_delayed_items executions?
Re: [PATCH] btrfs: Use bd_dev to generate index when dev_state_hashtable add items.
On Fri, Sep 29, 2017 at 04:16:35PM +0800, Gu Jinxiang wrote: > From: Gu JinXiang > > Fix bug of > (). Please explain what bug you are fixing. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/2] btrfs: Remove received_uuid during received snapshot ro->rw switch
On Thu, Sep 28, 2017 at 10:53:18AM +0300, Nikolay Borisov wrote: > Currently when a read-only snapshot is received and subsequently its ro > property > is set to false i.e. switched to rw-mode the received_uuid of that subvol > remains > intact. However, once the received volume is switched to RW mode we cannot > guarantee that it contains the same data, so it makes sense to remove the > received uuid. The presence of the received_uuid can also cause problems when > the volume is being sent. > > Signed-off-by: Nikolay Borisov > --- > > v3: > * Rework the patch considering latest feedback from David Sterba i.e. > explicitly use btrfs_end_transaction > > fs/btrfs/ioctl.c | 36 +--- > 1 file changed, 29 insertions(+), 7 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index ee4ee7cbba72..c0374125cec2 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -1811,6 +1811,17 @@ static noinline int btrfs_ioctl_subvol_setflags(struct > file *file, > goto out_drop_sem; > > root_flags = btrfs_root_flags(&root->root_item); > + > + /* > + * 1 - root item > + * 1 - uuid item > + */ > + trans = btrfs_start_transaction(root, 2); > + if (IS_ERR(trans)) { > + ret = PTR_ERR(trans); > + goto out_drop_sem; > + } > + > if (flags & BTRFS_SUBVOL_RDONLY) { > btrfs_set_root_flags(&root->root_item, >root_flags | BTRFS_ROOT_SUBVOL_RDONLY); > @@ -1824,22 +1835,33 @@ static noinline int > btrfs_ioctl_subvol_setflags(struct file *file, > btrfs_set_root_flags(&root->root_item, >root_flags & ~BTRFS_ROOT_SUBVOL_RDONLY); > spin_unlock(&root->root_item_lock); > + if > (!btrfs_is_empty_uuid(root->root_item.received_uuid)) { > + ret = btrfs_uuid_tree_rem(trans, fs_info, > + root->root_item.received_uuid, > + BTRFS_UUID_KEY_RECEIVED_SUBVOL, > + root->root_key.objectid); > + > + if (ret && ret != -ENOENT) { > + btrfs_abort_transaction(trans, ret); > + btrfs_end_transaction(trans); > + goto out_reset; > + } > + > + memset(root->root_item.received_uuid, 0, > +BTRFS_UUID_SIZE); > +
} > } else { > spin_unlock(&root->root_item_lock); > btrfs_warn(fs_info, > "Attempt to set subvolume %llu read-write > during send", > root->root_key.objectid); > ret = -EPERM; > - goto out_drop_sem; > + btrfs_abort_transaction(trans, ret); > + btrfs_end_transaction(trans); > + goto out_reset; Adding the transaction before the "if (flags & BTRFS_SUBVOL_RDONLY)" condition makes it much worse. The "is subvolume in send" test is supposed to be lightweight and should not shoot down the whole filesystem. The usecase is explained in 2c68653787f91c62f8. Also the received_uuid must be changed under the root_item_lock. I think it should be fine to keep the transaction start where it is, change the received_uuid eventually and let it commit. You can set the transaction units to 2 unconditionally. > } > } > > - trans = btrfs_start_transaction(root, 1); > - if (IS_ERR(trans)) { > - ret = PTR_ERR(trans); > - goto out_reset; > - } > - > ret = btrfs_update_root(trans, fs_info->tree_root, > &root->root_key, &root->root_item); > if (ret < 0) { -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs: Explicitly handle btrfs_update_root failure
On Thu, Sep 28, 2017 at 10:53:17AM +0300, Nikolay Borisov wrote: > btrfs_update_root can fail and it aborts the transaction; the correct way to > handle an aborted transaction is to explicitly end it with btrfs_end_transaction. > Even now the code is correct, since btrfs_commit_transaction would handle an > aborted transaction, but this is more of an implementation detail. So let's be > explicit in handling failure in btrfs_update_root. > > Furthermore btrfs_commit_transaction can also fail, and by ignoring its return > value we could have left the in-memory copy of the root item in an > inconsistent > state. So capture the error value, which allows us to correctly revert the > RO/RW > flags in case of commit failure. > > Signed-off-by: Nikolay Borisov Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/3] btrfs: undo writable when sprouting fails
On Thu, Sep 28, 2017 at 02:51:09PM +0800, Anand Jain wrote: > When a new device is being added to a seed FS, the seed FS is marked writable, > but when we fail to bring in the new device, we neglect to undo the > writable status. This patch fixes it. > > Signed-off-by: Anand Jain > Reviewed-by: Nikolay Borisov Thanks, 1-3 added to the queue. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: take the error path out if btrfs_attach_transaction() fails
On Thu, Sep 28, 2017 at 03:13:06PM +0800, Anand Jain wrote: > > I'd suggest to first get rid of the in-place returns and add necessary > > labels or separate exit sequences and then address the new error > > handling. > > As we go deeper there are quite a number of things which aren't > undone on the failure return path .. one more to add to the list > is the sb->super_copy updates. With the current design of this function > it's rather difficult to undo everything on an error return, as > btrfs_init_new_device() is shared between normal device add and > sprout device add. I am mulling over moving the seed part > completely outside of btrfs_init_new_device(), such as: > prepare sprout. > ret = btrfs_init_new_device() which is without the seed part > if (ret) undo_prepare_sprout > else finish sprouting. Yeah, this would make the code better structured. > Also with this I think we would find a few duplicate code sections > between btrfs_init_new_device() and replace device, which would be > a nice cleanup as a whole. This is a long-term plan; for now > I think v4 is good. Ok, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] btrfs-progs: check: supplement extent backref list with rbtree
On Tue, Jul 25, 2017 at 04:51:32PM -0400, je...@suse.com wrote: > From: Jeff Mahoney > > For the pathological case, like xfstests generic/297 that creates a > large file consisting of one repeating reflinked extent, fsck can > take hours. The root cause is that calling find_data_backref while > iterating the extent records is an O(n^2) algorithm. For my > example test run, n was 2*2^20 and fsck was at 8 hours and counting. > > This patch supplements the list with an rbtree and drops the runtime > of that testcase to about 20 seconds. > > A previous version of this patch introduced a regression that would > have corrupted file systems during repair. It was traced to the > compare algorithm honoring ->bytes regardless of whether the > reference had been found, and a failure to reinsert nodes after > the target reference was found. > > Signed-off-by: Jeff Mahoney Josef has fixed the crash so I've added the patchset to devel. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option
On 09/28/2017 02:00 AM, Qu Wenruo wrote: > > > On 2017年09月28日 00:20, David Sterba wrote: >> On Mon, Sep 25, 2017 at 07:15:30AM -0400, Austin S. Hemmelgarn wrote: >>> On 2017-09-24 10:08, Goffredo Baroncelli wrote: On 09/24/2017 12:10 PM, Anand Jain wrote: >> >> A lot of points in this thread, let me address them here. >> >> I don't want to remove the --rootdir functionality; the fix that's being >> proposed removes the minimalization -- a feature that was not well known, >> but I understand it's useful (and already used). >> >> I'd like to fix that in another way, e.g. as an option to mkfs or a >> separate tool. >> >> I'm not worried about adding more code or code complexity. If we do it >> right it's not a problem in the long run. But people for some reason >> like to delete other people's code or functionality. >> >> Regarding guessing the number of users, this is always hard. So if there's >> one vocal enough to let us know about the usecase, it's IMHO a good time >> to explore it, code-wise and documentation-wise, and fix it or >> improve it. >> >> So, what next. I'd like to get rid of the custom chunk allocator, namely >> because of the known 1MB area misuse and duplication. >> >> Implementing the minimalization might need some preparatory work and I >> don't have a realistic ETA now. > > Well, using the over-reserve-then-shrink method it could be done without > much hassle. What about doing it on a file instead of a device? As a sparse file, it would be less expensive to enlarge and then shrink. I think that whoever needs to build a filesystem with "shrink" doesn't need to create it on a real block device. > > However the ETA may be delayed to the middle of Oct, as I'm going to enjoy my holiday > during 1st Oct to 7th Oct. > > Thanks, > Qu > >> Ideally we'd fix both problems in one >> version, as I'd rather avoid a "temporary" solution to drop the >> minimalization with the promise to put it back later.
>> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: fix invalid assert in backref.c
On Thu, Sep 28, 2017 at 04:55:00PM -0400, Josef Bacik wrote: > This should be verify'ing that we have an empty key, not that we have a > filled out key. > > Signed-off-by: Josef Bacik > --- > Dave this is on top of your ext/jeffm/extent-cache branch and fixes the > segfault you reported. Great, thanks for the fix. I'll fold it into the original patch so the branches remain bisectable.
Re: [PATCH v8 0/6] Btrfs: populate heuristic with code
On Thu, Sep 28, 2017 at 05:33:35PM +0300, Timofey Titovets wrote: > Compile tested, hand tested on live system > > Changes v7 -> v8: > - All code moved to compression.c (again) > - Heuristic workspaces implemented another way, i.e. they only share logic with compression workspaces > - Some style fixes suggested by David > - Moved the sampling function out of the heuristic code (I'm afraid of big functions) > - Many more comments and explanations Thanks for the update, I went through the patches and they looked good enough to be put into for-next. I may have more comments about a few things, but nothing serious that would hinder testing.
[GIT PULL] Btrfs fixes for 4.14-rc3
Hi, we've collected a bunch of isolated fixes, for crashes, user-visible behaviour or missing bits from other subsystem cleanups from the past. The overall number is not small but I was not able to make it significantly smaller. Most of the patches are supposed to go to stable. There are no merge conflicts. Please pull, thanks.

The following changes since commit db95c876c568cef951fbbd4c0118cb5386e4bb99:

  btrfs: submit superblock io with REQ_META and REQ_PRIO (2017-08-22 13:22:05 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.14-rc3

for you to fetch changes up to 8c6c592831a09a28428448e68fb08c6bbb8b9b8b:

  btrfs: log csums for all modified extents (2017-09-26 14:54:16 +0200)

Josef Bacik (1):
      btrfs: log csums for all modified extents

Liu Bo (7):
      Btrfs: use the new helper wbc_to_write_flags
      Btrfs: do not reset bio->bi_ops while writing bio
      Btrfs: do not backup tree roots when fsync
      Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
      Btrfs: fix kernel oops while reading compressed data
      Btrfs: skip checksum when reading compressed data if some IO have failed
      Btrfs: fix unexpected result when dio reading corrupted blocks

Misono, Tomohiro (1):
      btrfs: remove BTRFS_FS_QUOTA_DISABLING flag

Naohiro Aota (4):
      btrfs: clear ordered flag on cleaning up ordered extents
      btrfs: finish ordered extent cleaning if no progress is found
      btrfs: fix NULL pointer dereference from free_reloc_roots()
      btrfs: propagate error to btrfs_cmp_data_prepare caller

Omar Sandoval (1):
      Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO

Sargun Dhillon (1):
      btrfs: Report error on removing qgroup if del_qgroup_item fails

Tsutomu Itoh (1):
      Btrfs: send: fix error number for unknown inode types

satoru takeuchi (1):
      btrfs: prevent to set invalid default subvolid

 fs/btrfs/compression.c | 18 +-
 fs/btrfs/ctree.h       |  1 -
 fs/btrfs/disk-io.c     |  9 -
 fs/btrfs/extent_io.c   |  8 ++--
 fs/btrfs/inode.c       | 27 ++-
 fs/btrfs/ioctl.c       | 12
 fs/btrfs/qgroup.c      |  6 ++
 fs/btrfs/relocation.c  |  2 +-
 fs/btrfs/send.c        |  2 +-
 fs/btrfs/tree-log.c    | 12 ++--
 fs/btrfs/volumes.c     |  2 +-
 11 files changed, 72 insertions(+), 27 deletions(-)
Re: [PATCH 0/4 v2] add subvolume support to mkfs
On Wed, Aug 30, 2017 at 10:24:46AM -0700, Yingyi Luo wrote: > From: yingyil > > Hi all, > > Thanks for all the comments you gave me. I really appreciate it. I've > updated the patch and split it into small ones so that it may be easier > for you to see the changes and review. > > One use case of this feature is that it allows easier operations on subvolumes, > such as migrating subvolumes across images. With the current btrfs tool, we have > to do this manually by creating an empty subvolume and then moving the files over. > This feature, which creates a subvolume for a given directory when making btrfs > images, simplifies the process and makes the logic cleaner. > > Goffredo: Thanks again for your comments and for pointing out the bug. It happened > to be that I skipped the error code checking when I tested the functionality. > The error came from a duplicate call of btrfs_make_root_dir() from > make_root_dir() in mkfs/main.c and create_subvol() in convert/main.c. > > strdup() is used to get a returnable copy of the literal string. > FYI, parse_label() used strdup() to parse the input string. I think using > strdup() makes it consistent with the current code. > > Extending the feature would be a good idea. I'll try to work on it later. > > Best regards, > Yingyi > > yingyil (4): > btrfs-progs: convert: move link_subvol out of main > btrfs-progs: add a parameter to btrfs_link_subvol > btrfs-progs: mkfs: refactor create_data_reloc_tree > btrfs-progs: mkfs: add subvolume support to mkfs The updated patches have been added to devel. Before this is released, the proposed semantics of subvol/rootdir will have to be implemented. At the moment, overlaying files from rootdir over any previously created subvolume does not work; there are probably missing checks for existing dirs, so this ends up with a duplicate entry.
Re: [PATCH v3 07/14] btrfs-progs: Doc/mkfs: Add extra condition for rootdir option
On 2017-09-27 20:00, Qu Wenruo wrote: On 2017年09月28日 00:20, David Sterba wrote: On Mon, Sep 25, 2017 at 07:15:30AM -0400, Austin S. Hemmelgarn wrote: On 2017-09-24 10:08, Goffredo Baroncelli wrote: On 09/24/2017 12:10 PM, Anand Jain wrote: A lot of points in this thread, let me address them here. I don't want to remove --rootdir functionality, the fix that's being proposed removes the minimalization -- feature that was not well known, but I understand it's useful (and already used). I'd like to fix that in another way, eg. as an option to mkfs or a separate tool. I'm not worried about adding more code or code complexity. If we do it right it's not a problem in the long run. But people for some reason like to delete other people's code or functionality. Regarding guessing number of users, this is always hard. So if there's one vocal enough to let us know about the usecase, it's IMHO good time to explore it, code-wise and documentation-wise, and fix it or improve. So, what next. I'd like to get rid of the custom chunk allocator, namely because of the known 1MB area misuse and duplication. Implementing the minimalization might need some preparatory work and I don't have a realistic ETA now. Well, if using over-reserve-then-shrink method, it could be done, without much hassle.

It _should_ be possible to compute the exact size required though. As outlined in one of my other replies, an absolute minimal filesystem should in theory consist of the following space usage:

1. 1MB (for the reserved space at the beginning).
2. However many superblocks it should have given the size.
3. The total amount of file data and extended attribute data to be included, rounding up for block size.
4. The exact amount of metadata space needed to represent the tree from 3, also rounding up for block size.
5. The exact amount of system chunk space needed to handle 3 and 4, plus enough room to allocate at least one more chunk of each type (to ultimately allow for resizing the filesystem if desired).
6. Exactly enough reserved metadata space to resize the FS (alternatively, the standard space for the GlobalReserve, which should be enough and then some).

Computing this is non-trivial for a human, and even with a computer it requires knowledge of how the chunk allocator is implemented. mkfs should have that knowledge though, so in theory it is reasonably well positioned to just compute this, allocate that much space, and then generate the filesystem. The complaints I have about trying to do this manually in two passes are that:

1. Computing the above by hand is not easy.
2. Computing it by hand and then running mkfs on the resultant file doesn't work most of the time.

As a result, I would think this could easily be handled in the following steps:

1. Scan the contents of the --rootdir option to determine the total size needed.
2. Compute the required allocation and allocate it.
3. Generate the filesystem image.

This has an inherent TOCTOU race, but it's one that a lot of archiving tools (tar being a prime example) already have, and it's not likely that people will hit it since you're not likely to be using a changing directory for --rootdir.

However ETA maybe delayed to middle of Oct, as I'm going to enjoy my holiday during 1st Oct to 7th Oct.

Entirely understandable, enjoy your time off!
[PATCH] btrfs: Use bd_dev to generate index when dev_state_hashtable add items.
From: Gu JinXiang

Fix bug of ().

Signed-off-by: Gu JinXiang
---
 fs/btrfs/check-integrity.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 9d3854839038..86d79bc4cfb3 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -613,7 +613,7 @@ static void btrfsic_dev_state_hashtable_add(
 		struct btrfsic_dev_state_hashtable *h)
 {
 	const unsigned int hashval =
-	    (((unsigned int)((uintptr_t)ds->bdev)) &
+	    (((unsigned int)((uintptr_t)ds->bdev->bd_dev)) &
 	     (BTRFSIC_DEV2STATE_HASHTABLE_SIZE - 1));

 	list_add(&ds->collision_resolving_node, h->table + hashval);
--
2.13.5
Re: delayed inodes implementation questions
On 29.09.2017 11:00, Qu Wenruo wrote: > > > On 2017年09月29日 15:15, Nikolay Borisov wrote: >> Hello, >> >> >> I've been studying the implementation of the delayed inodes feature and >> have a couple of questions. > > Sorry I can't answer your question, but as you may already know from the > rejection mail, Miao Xie has left Fujitsu, and has been gone for several years. > > And I'm not sure if he still checks the btrfs mailing list, so you may be on > your own. > > So to avoid such situations, it would be better to maintain a wiki page > for such less-maintained features, about their design principles and > possible corner cases. > > This will make later developers quite happy. > (Personally speaking, this should be done before coding.) I'm asking the question exactly because I'm in the process of creating said documentation. > > Thanks, > Qu >> >> 1. Why are there 2 lists when each and every node is on both lists at >> the same time? So if we happen to be running the async worker we are >> processing nodes based on the prepared_list; at the same time, if there >> is a concurrent transaction commit happening, then the exact same nodes >> are going to be processed from the node_list. It's not a functional >> problem, but the presence of 2 lists and each node being on both lists >> at the same time just seems a bit redundant. >> >> 2. In btrfs_next_delayed_node, can this code really trigger: >> >> if (!test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags)) { >> /* not in the list */ >> >> if (list_empty(&delayed_root->node_list)) >> >> goto out; >> >> p = delayed_root->node_list.next; >> >> In order for !test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags) to be >> true, this node should have been btrfs_dequeue_delayed_node'd >> underneath us? Can this really happen, e.g. are there multiple concurrent >> __btrfs_run_delayed_items executions?
Re: delayed inodes implementation questions
On 2017年09月29日 15:15, Nikolay Borisov wrote: Hello, I've been studying the implementation of the delayed inodes feature and have a couple of questions. Sorry I can't answer your question, but as you may already know from the rejection mail, Miao Xie has left Fujitsu, and has been gone for several years. And I'm not sure if he still checks the btrfs mailing list, so you may be on your own. So to avoid such situations, it would be better to maintain a wiki page for such less-maintained features, about their design principles and possible corner cases. This will make later developers quite happy. (Personally speaking, this should be done before coding.) Thanks, Qu 1. Why are there 2 lists when each and every node is on both lists at the same time? So if we happen to be running the async worker we are processing nodes based on the prepared_list; at the same time, if there is a concurrent transaction commit happening, then the exact same nodes are going to be processed from the node_list. It's not a functional problem, but the presence of 2 lists and each node being on both lists at the same time just seems a bit redundant. 2. In btrfs_next_delayed_node, can this code really trigger:

if (!test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags)) {
	/* not in the list */
	if (list_empty(&delayed_root->node_list))
		goto out;
	p = delayed_root->node_list.next;

In order for !test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags) to be true, this node should have been btrfs_dequeue_delayed_node'd underneath us? Can this really happen, e.g. are there multiple concurrent __btrfs_run_delayed_items executions?
[PATCH] btrfs: fix max chunk size on dup
Balancing a fresh METADATA=dup btrfs filesystem (with size < 50G) generates a 128MB sized block group. While we set max_stripe_size = max_chunk_size = 256MB, we get this half-sized block group:

$ btrfs ins dump-t -t CHUNK_TREE btrfs.img | grep length
	length 8388608 owner 2 stripe_len 65536 type DATA
	length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
	length 134217728 owner 2 stripe_len 65536 type METADATA|DUP

Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed a 256MB METADATA|DUP block group. But now we use "stripe_size * data_stripes > max_chunk_size". Since data_stripes = 1 on DUP, it disallows the block group to be larger than 128MB. What is missing here is "dev_stripes". The proper logical space used by the block group is "stripe_size * data_stripes / dev_stripes". Tweak the equations to use the right value.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/volumes.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c305df..d1ac3e1fb753 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4751,10 +4751,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	 * is really going to be in terms of logical address space,
 	 * and compare that answer with the max chunk size
 	 */
-	if (stripe_size * data_stripes > max_chunk_size) {
+	if (stripe_size * data_stripes > max_chunk_size * dev_stripes) {
 		u64 mask = (1ULL << 24) - 1;

-		stripe_size = div_u64(max_chunk_size, data_stripes);
+		stripe_size = div_u64(max_chunk_size * dev_stripes,
+				      data_stripes);

 		/* bump the answer up to a 16MB boundary */
 		stripe_size = (stripe_size + mask) & ~mask;
delayed inodes implementation questions
Hello,

I've been studying the implementation of the delayed inodes feature and have a couple of questions.

1. Why are there 2 lists when each and every node is on both lists at the same time? So if we happen to be running the async worker we are processing nodes based on the prepared_list; at the same time, if there is a concurrent transaction commit happening, then the exact same nodes are going to be processed from the node_list. It's not a functional problem, but the presence of 2 lists and each node being on both lists at the same time just seems a bit redundant.

2. In btrfs_next_delayed_node, can this code really trigger:

if (!test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags)) {
	/* not in the list */
	if (list_empty(&delayed_root->node_list))
		goto out;
	p = delayed_root->node_list.next;

In order for !test_bit(BTRFS_DELAYED_NODE_IN_LIST, &node->flags) to be true, this node should have been btrfs_dequeue_delayed_node'd underneath us? Can this really happen, e.g. are there multiple concurrent __btrfs_run_delayed_items executions?