Re: [PATCH v3 00/12] FITRIM improvements
[CC'ing Filipe as he should know better]

On 25.03.19 at 20:44, Darrick J. Wong wrote:
> On Mon, Mar 25, 2019 at 02:31:20PM +0200, Nikolay Borisov wrote:
>> Here is v3 of the fitrim patches. Change since v2 [0]:
>>
>> * Replaced BUG_ON with WARN_ON in patch 2
>>
>> * Added RB to patches 04/05/06/09
>>
>> * Squashed "btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free in close_ctree"
>> into patch 07. It was only sent to the mailing list as a followup.
>>
>> * Rebased all patches on latest misc-next.
>>
>> This has undergone multiple xfstest runs and I think is ready to be merged.
>>
>> [0] https://lore.kernel.org/linux-btrfs/20190211083510.27591-1-nbori...@suse.com/
>>
>> Jeff Mahoney (1):
>>   btrfs: replace pending/pinned chunks lists with io tree
>>
>> Nikolay Borisov (11):
>>   btrfs: Honour FITRIM range constraints during free space trim
>
> This is vaguely off-topic, but I noticed that you can FITRIM a btrfs
> filesystem mounted nologreplay. Assuming the fitrim code uses the free
> space information to drive the discard calls, is it safe to do that with
> unreplayed metadata?

Pertinent question, indeed. But I'd defer to Filipe since he knows the
log tree code. Filipe, FITRIM uses the freespace_ctl struct from the block
group to trim the free space inside block groups, as well as the free
device space to trim unallocated space. If we have a dirty log tree, are
those coherent with the dirty data, i.e. is it reflected in the BG's
free space cache that the data in the log tree is actually allocated? If
the answer is 'no' then it will be prudent to disallow trim in this case.

>
> (And no, I don't really know what nologreplay does, so please excuse my
> ignorance...)

Log replay means the content of the log tree (which is something like a
WAL) must be copied back into the main btree.

>
> --D
>
>>   btrfs: combine device update operations during transaction commit
>>   btrfs: Handle pending/pinned chunks before blockgroup relocation
>>     during device shrink
>>   btrfs: Rename and export clear_btree_io_tree
>>   btrfs: Populate ->orig_block_len during read_one_chunk
>>   btrfs: Introduce new bits for device allocation tree
>>   btrfs: Remove 'trans' argument from find_free_dev_extent(_start)
>>   btrfs: Factor out in_range macro
>>   btrfs: Optimize unallocated chunks discard
>>   btrfs: Implement find_first_clear_extent_bit
>>   btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit
>>
>>  fs/btrfs/ctree.h            | 8 +-
>>  fs/btrfs/dev-replace.c      | 2 +-
>>  fs/btrfs/disk-io.c          | 20 ++-
>>  fs/btrfs/extent-tree.c      | 102 +
>>  fs/btrfs/extent_io.c        | 103 +-
>>  fs/btrfs/extent_io.h        | 19 ++-
>>  fs/btrfs/extent_map.c       | 38 +
>>  fs/btrfs/extent_map.h       | 1 -
>>  fs/btrfs/free-space-cache.c | 4 -
>>  fs/btrfs/transaction.c      | 51 +--
>>  fs/btrfs/transaction.h      | 2 +-
>>  fs/btrfs/volumes.c          | 277 ++--
>>  fs/btrfs/volumes.h          | 23 ++-
>>  13 files changed, 332 insertions(+), 318 deletions(-)
>>
>> --
>> 2.17.1
>>
>
Re: WARNING at fs/btrfs/delayed-ref.c:296 btrfs_merge_delayed_refs+0x3dc/0x410 (new on 5.0.4, not in 5.0.3)
On 26.03.19 г. 6:30 ч., Zygo Blaxell wrote: > On Mon, Mar 25, 2019 at 10:50:28PM -0400, Zygo Blaxell wrote: >> Running balance, rsync, and dedupe, I get kernel warnings every few >> minutes on 5.0.4. No warnings on 5.0.3 under similar conditions. >> >> Mount options are: flushoncommit,space_cache=v2,compress=zstd. >> >> There are two different stacks on the warnings. This one comes from >> btrfs balance: > > [snip] > > Possibly unrelated, but I'm also repeatably getting this in 5.0.4 and > not 5.0.3, after about 5 hours of uptime. Different processes, same > kernel stack: > > [Mon Mar 25 23:35:17 2019] kworker/u8:4: page allocation failure: > order:0, mode:0x404000(GFP_NOWAIT|__GFP_COMP), > nodemask=(null),cpuset=/,mems_allowed=0 > [Mon Mar 25 23:35:17 2019] CPU: 2 PID: 29518 Comm: kworker/u8:4 > Tainted: GW 5.0.4-zb64-303ce93b05c9+ #1 What commits does this kernel include because it doesn't seem to be a pristine upstream 5.0.4 ? Also what you are seeing below is definitely a bug in MM. The question is whether it's due to your doing faulty backports in the kernel or it's due to something that got automatically backported to 5.0.4 > [Mon Mar 25 23:35:17 2019] Hardware name: QEMU Standard PC (i440FX + > PIIX, 1996), BIOS 1.10.2-1 04/01/2014 > [Mon Mar 25 23:35:17 2019] Workqueue: btrfs-submit btrfs_submit_helper > [Mon Mar 25 23:35:17 2019] Call Trace: > [Mon Mar 25 23:35:17 2019] dump_stack+0x7d/0xbb > [Mon Mar 25 23:35:17 2019] warn_alloc+0x108/0x190 > [Mon Mar 25 23:35:17 2019] __alloc_pages_nodemask+0x12c4/0x13f0 > [Mon Mar 25 23:35:17 2019] ? rcu_read_lock_sched_held+0x68/0x70 > [Mon Mar 25 23:35:17 2019] ? __update_load_avg_se+0x208/0x280 > [Mon Mar 25 23:35:17 2019] cache_grow_begin+0x79/0x730 > [Mon Mar 25 23:35:17 2019] ? cache_grow_begin+0x79/0x730 > [Mon Mar 25 23:35:17 2019] ? 
cache_alloc_node+0x165/0x1e0 > [Mon Mar 25 23:35:17 2019] fallback_alloc+0x1e4/0x280 > [Mon Mar 25 23:35:17 2019] kmem_cache_alloc+0x2e9/0x310 > [Mon Mar 25 23:35:17 2019] btracker_queue+0x47/0x170 [dm_cache] > [Mon Mar 25 23:35:17 2019] __lookup+0x51a/0x600 [dm_cache_smq] > [Mon Mar 25 23:35:17 2019] ? smq_lookup+0x37/0x7b [dm_cache_smq] > [Mon Mar 25 23:35:17 2019] smq_lookup+0x5d/0x7b [dm_cache_smq] > [Mon Mar 25 23:35:18 2019] map_bio.part.40+0x14d/0x5d0 [dm_cache] > [Mon Mar 25 23:35:18 2019] ? bio_detain_shared+0xb3/0x120 [dm_cache] > [Mon Mar 25 23:35:18 2019] cache_map+0x120/0x170 [dm_cache] > [Mon Mar 25 23:35:18 2019] __map_bio+0x42/0x1f0 [dm_mod] > [Mon Mar 25 23:35:18 2019] __split_and_process_non_flush+0x152/0x1e0 > [dm_mod] > [Mon Mar 25 23:35:18 2019] __split_and_process_bio+0xd4/0x400 [dm_mod] > [Mon Mar 25 23:35:18 2019] ? lock_acquire+0xbc/0x1c0 > [Mon Mar 25 23:35:18 2019] ? dm_get_live_table+0x5/0xc0 [dm_mod] > [Mon Mar 25 23:35:18 2019] dm_make_request+0x4d/0x100 [dm_mod] > [Mon Mar 25 23:35:18 2019] generic_make_request+0x297/0x470 > [Mon Mar 25 23:35:18 2019] ? kvm_sched_clock_read+0x14/0x30 > [Mon Mar 25 23:35:18 2019] ? submit_bio+0x6c/0x140 > [Mon Mar 25 23:35:18 2019] submit_bio+0x6c/0x140 > [Mon Mar 25 23:35:18 2019] run_scheduled_bios+0x1e6/0x500 > [Mon Mar 25 23:35:18 2019] ? normal_work_helper+0x95/0x530 > [Mon Mar 25 23:35:18 2019] normal_work_helper+0x95/0x530 > [Mon Mar 25 23:35:18 2019] process_one_work+0x223/0x5d0 > [Mon Mar 25 23:35:18 2019] worker_thread+0x4f/0x3b0 > [Mon Mar 25 23:35:18 2019] kthread+0x106/0x140 > [Mon Mar 25 23:35:18 2019] ? process_one_work+0x5d0/0x5d0 > [Mon Mar 25 23:35:18 2019] ? 
kthread_park+0x90/0x90 > [Mon Mar 25 23:35:18 2019] ret_from_fork+0x3a/0x50 > [Mon Mar 25 23:35:18 2019] Mem-Info: > [Mon Mar 25 23:35:18 2019] active_anon:195872 inactive_anon:15658 > isolated_anon:0 > active_file:629653 inactive_file:308914 > isolated_file:0 > unevictable:65536 dirty:14449 > writeback:27580 unstable:0 > slab_reclaimable:492522 > slab_unreclaimable:94393 > mapped:10915 shmem:18846 pagetables:2178 > bounce:0 > free:66082 free_pcp:1963 free_cma:24 > [Mon Mar 25 23:35:18 2019] Node 0 active_anon:783488kB > inactive_anon:62632kB active_file:2516656kB inactive_file:1235604kB > unevictable:262144kB isolated(anon):0kB isolated(file):0kB mapped:43660kB > dirty:57796kB writeback:110320kB shmem:75384kB shmem_thp: 0kB > shmem_pmdmapped: 0kB anon_thp: 137216kB writeback_tmp:0kB unstable:0kB > all_unreclaimable? no > [Mon Mar 25 23:35:18 2019] Node 0 DMA free:15844kB min:132kB low:164kB > high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB > inactive_file:0kB
Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount
Thank you both for your input.
See below.

> > You sda and sdb are at gen 60233 while sdd and sde are at gen 60234.
> > It's possible to allow kernel to manually assemble its device list using
> > "device=" mount option.
> > Since you're using RAID6, it's possible to recover using 2 devices only,
> > but in that case you need "degraded" mount option.
>
> He has btrfs raid0 profile on top of hardware RAID6 devices.

Correct, my FS is a "raid0" across four hardware-RAID-based raid6 devices.
The underlying devices of the raid controller are fine, as are the volumes
themselves.
The only corruption seems to be on the btrfs side.

Does your tip regarding mounting by explicitly specifying the devices still
make sense?
Will this figure out automatically which generation to use?

I am at the moment in the process of using "btrfs restore" to pull more
data from the filesystem without making any further changes.
After that I am happy to continue testing, and will happily test your
mentioned "skip_bg" patch. But if you think that there is some other way
to mount (just for recovery purposes, read-only is fine!) while having
different gens on the devices, I would highly appreciate it.

Thanks Qu and Andrei!
Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount
Mount messages below. Thanks for your input, Qu! ## [42763.884134] BTRFS info (device sdd): disabling free space tree [42763.884138] BTRFS info (device sdd): force clearing of disk cache [42763.884140] BTRFS info (device sdd): has skinny extents [42763.885207] BTRFS error (device sdd): parent transid verify failed on 1048576 wanted 60234 found 60230 [42763.885263] BTRFS error (device sdd): failed to read chunk root [42763.900922] BTRFS error (device sdd): open_ctree failed ## Sent with ProtonMail Secure Email. ‐‐‐ Original Message ‐‐‐ On Tuesday, 26. March 2019 10:21, Qu Wenruo wrote: > On 2019/3/26 下午4:52, berodual_xyz wrote: > > > Thank you both for your input. > > see below. > > > > > > You sda and sdb are at gen 60233 while sdd and sde are at gen 60234. > > > > It's possible to allow kernel to manually assemble its device list using > > > > "device=" mount option. > > > > Since you're using RAID6, it's possible to recover using 2 devices only, > > > > but in that case you need "degraded" mount option. > > > > > > He has btrfs raid0 profile on top of hardware RAID6 devices. > > > > Correct, my FS is a "raid0" across four hardware-raid based raid6 devices. > > The underlying devices of the raid controller are fine, same as the volumes > > themselves. > > Then there is not much we can do. > > The super blocks shows all your 4 devices are in 2 different states. > (older generation with dirt log, newer generation without log). > > This means some writes didn't reach all devices. > > > Only corruption seems to be on the btrfs side. > > Please provide the kernel message when trying to mount the fs. > > > Does your tip regarding mounting by explicitly specifying the devices still > > make sense? > > Not really. For RAID0 case, it doesn't make much sense. > > > Will this figure out automatically which generation to use? > > You could try, as all the mount option is making btrfs completely RO (no > log replay), so it should be pretty safe. 
> > > I am at the moment in the process of using "btrfs restore" to pull more > > data from the filesystem without making any further changes. > > After that I am happy to continue testing, and will happily test your > > mentioned "skip_bg" patch - but if you think that there is some other way > > to mount (just for recovery purpose - read only is fine!) while having > > different gens on the devices, I highly appreciate it. > > With mounting failure dmesg, it should be pretty easy to determine > whether my skip_bg will work. > > Thanks, > Qu > > > Thanks Qu and Andrei!
[PATCH] Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
From: Filipe Manana

When a filesystem is mounted with the nologreplay mount option, which
requires it to be mounted in RO mode as well, we can not allow discard on
free space inside block groups, because log trees refer to extents that
are not pinned in a block group's free space cache (pinning the extents is
precisely the first phase of replaying a log tree).

So do not allow the fitrim ioctl to do anything when the filesystem is
mounted with the nologreplay option, because later it can be mounted RW
without that option, which causes log replay to happen and result in
either a failure to replay the log trees (leading to a mount failure), a
crash or some silent corruption.

Reported-by: Darrick J. Wong
Signed-off-by: Filipe Manana
---
 fs/btrfs/ioctl.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 494f0f10d70e..01808934d21f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -501,6 +501,16 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+	/*
+	 * If the fs is mounted with nologreplay, which requires it to be
+	 * mounted in RO mode as well, we can not allow discard on free space
+	 * inside block groups, because log trees refer to extents that are not
+	 * pinned in a block group's free space cache (pinning the extents is
+	 * precisely the first phase of replaying a log tree).
+	 */
+	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
+		return -EROFS;
+
 	rcu_read_lock();
 	list_for_each_entry_rcu(device, &fs_info->fs_devices->devices,
 				dev_list) {
--
2.11.0
Re: [PATCH v3 00/12] FITRIM improvements
On Tue, Mar 26, 2019 at 8:10 AM Nikolay Borisov wrote: > > [CC'ing Filipe as he should now better ] > > On 25.03.19 г. 20:44 ч., Darrick J. Wong wrote: > > On Mon, Mar 25, 2019 at 02:31:20PM +0200, Nikolay Borisov wrote: > >> Here is v3 of the fitrim patches. Change since v2 [0]: > >> > >> * Replaced BUG_ON with WARN_ON in patch 2 > >> > >> * Added RB to patches 04/05/06/09 > >> > >> * Squashed "btrfs: Transpose btrfs_close_devices/btrfs_mapping_tree_free > >> in close_ctree" > >> into patch 07. It was only sent to the mailing list as a followup. > >> > >> * Rebased all patches on latest misc-next. > >> > >> This has undergone multiple xfstest runs and I think is ready to be > >> merged. > >> > >> [0] > >> https://lore.kernel.org/linux-btrfs/20190211083510.27591-1-nbori...@suse.com/ > >> > >> > >> Jeff Mahoney (1): > >> btrfs: replace pending/pinned chunks lists with io tree > >> > >> Nikolay Borisov (11): > >> btrfs: Honour FITRIM range constraints during free space trim > > > > This is vaguely off-topic, but I noticed that you can FITRIM a btrfs > > filesystem mounted nologreplay. Assuming the fitrim code uses the free > > space information to drive the discard calls, is it safe to do that with > > unreplayed metadata? Nop, not safe, neither with this patchset nor without it. I've just sent a patch to do the same you did in your fixes for xfs and ext4: https://patchwork.kernel.org/patch/10870871/ Thanks for reporting it! > > Pertinent question, indeed. But I'd defer to Filipe since he knows the > log tree code. Filipe, FITRIM uses the freespace_ctl struct from block > group to trim the freespace inside block groups, as well as the free > device space to trim unallocated space. If we have a dirty log tree are > those coherent with the dirty data i.e is it reflected in the BG's > freespace cache that the data in the logs tree is actually allocated? If > the answer is 'no' then it will be prudent to disallow trim in this case. 
> > > > > > (And no, I don't really know what nologreplay does, so please excuse my > > ignorance...) > > Log replay means the content of the log tree (which is something like a > WAL) must be copied back into the main btree. > > > > > --D > > > >> btrfs: combine device update operations during transaction commit > >> btrfs: Handle pending/pinned chunks before blockgroup relocation > >> during device shrink > >> btrfs: Rename and export clear_btree_io_tree > >> btrfs: Populate ->orig_block_len during read_one_chunk > >> btrfs: Introduce new bits for device allocation tree > >> btrfs: Remove 'trans' argument from find_free_dev_extent(_start) > >> btrfs: Factor out in_range macro > >> btrfs: Optimize unallocated chunks discard > >> btrfs: Implement find_first_clear_extent_bit > >> btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit > >> > >> fs/btrfs/ctree.h| 8 +- > >> fs/btrfs/dev-replace.c | 2 +- > >> fs/btrfs/disk-io.c | 20 ++- > >> fs/btrfs/extent-tree.c | 102 + > >> fs/btrfs/extent_io.c| 103 +- > >> fs/btrfs/extent_io.h| 19 ++- > >> fs/btrfs/extent_map.c | 38 + > >> fs/btrfs/extent_map.h | 1 - > >> fs/btrfs/free-space-cache.c | 4 - > >> fs/btrfs/transaction.c | 51 +-- > >> fs/btrfs/transaction.h | 2 +- > >> fs/btrfs/volumes.c | 277 ++-- > >> fs/btrfs/volumes.h | 23 ++- > >> 13 files changed, 332 insertions(+), 318 deletions(-) > >> > >> -- > >> 2.17.1 > >> > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”
Re: [PATCH v2] Btrfs: fix data bytes_may_use underflow with fallocate due to failed quota reserve
On Tue, Mar 26, 2019 at 3:57 AM robbieko wrote:
>
> From: Robbie Ko
>
> When doing fallocate, we first add the range to the reserve_list
> and then reserve the quota.
> If quota reservation fails, we'll release all reserved parts of
> reserve_list.
> However, cur_offset is not updated to indicate that this
> range is already been inserted into the list.
> Therefore, the same range is freed twice.
> Once at list_for_each_entry loop, and once at the end of the
> function.
> This will result in WARN_ON on bytes_may_use when we free the
> remaining space.
>
> At the end, under the 'out' label we have a call to:
>    btrfs_free_reserved_data_space(inode, data_reserved, alloc_start, alloc_end - cur_offset);
> The start offset, third argument, should be cur_offset.
> Everything from alloc_start to cur_offset was freed by the
> list_for_each_entry_safe_loop.
>
> Fixes: 18513091af94 ("btrfs: update btrfs_space_info's bytes_may_use timely")
> Signed-off-by: Robbie Ko

Reviewed-by: Filipe Manana

Now it looks good, thanks.

> ---
>  fs/btrfs/file.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 34fe8a5..0832449 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3132,6 +3132,7 @@ static long btrfs_fallocate(struct file *file, int mode,
>  		ret = btrfs_qgroup_reserve_data(inode, &data_reserved,
>  				cur_offset, last_byte - cur_offset);
>  		if (ret < 0) {
> +			cur_offset = last_byte;
>  			free_extent_map(em);
>  			break;
>  		}
> @@ -3181,7 +3182,7 @@ static long btrfs_fallocate(struct file *file, int mode,
>  	/* Let go of our reservation. */
>  	if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
>  		btrfs_free_reserved_data_space(inode, data_reserved,
> -				alloc_start, alloc_end - cur_offset);
> +				cur_offset, alloc_end - cur_offset);
>  	extent_changeset_free(data_reserved);
>  	return ret;
>  }
> --
> 1.9.1
>

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
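The double free described in the commit message can be illustrated with a small userspace model. This is only a sketch under assumptions, not kernel code: `struct falloc_range` and `bytes_freed_on_error()` are hypothetical names, and the model counts bytes only, ignoring the start offsets. It mirrors the two release steps the commit message names: the walk over reserve_list, followed by the release of `alloc_end - cur_offset` bytes under the 'out' label.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry queued on reserve_list. */
struct falloc_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Model of the error path: every queued range is released by the list
 * walk, then (alloc_end - cur_offset) bytes are released once more at
 * the end of the function. Returns the total number of bytes released.
 */
static uint64_t bytes_freed_on_error(const struct falloc_range *list,
				     size_t nr, uint64_t cur_offset,
				     uint64_t alloc_end)
{
	uint64_t freed = 0;

	for (size_t i = 0; i < nr; i++)
		freed += list[i].len;

	return freed + (alloc_end - cur_offset);
}
```

As an illustration with made-up numbers: reserve 100 bytes for [0, 100), queue [0, 40) and [40, 70), and let the quota reservation fail on the second range. If cur_offset stays at 40, the model releases 70 + 60 = 130 bytes against a 100-byte reservation, which is the kind of underflow the WARN_ON reports. With `cur_offset = last_byte` (70), it releases 70 + 30 = 100 bytes.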
Re: [PATCH] Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
On 26.03.19 г. 12:49 ч., fdman...@kernel.org wrote: > From: Filipe Manana > > Whan a filesystem is mounted with the nologreplay mount option, which > requires it to be mounted in RO mode as well, we can not allow discard on > free space inside block groups, because log trees refer to extents that > are not pinned in a block group's free space cache (pinning the extents is > precisely the first phase of replaying a log tree). > > So do not allow the fitrim ioctl to do anything when the filesystem is > mounted with the nologreplay option, because later it can be mounted RW > without that option, which causes log replay to happen and result in > either a failure to replay the log trees (leading to a mount failure), a > crash or some silent corruption. > > Reported-by: Darrick J. Wong > Signed-off-by: Filipe Manana Does it make sense to make the check a bit more specific and only return EROFS when NOLOGREPLAY and the log tree has non-null generation? In any case: Reviewed-by: Nikolay Borisov > --- > fs/btrfs/ioctl.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index 494f0f10d70e..01808934d21f 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -501,6 +501,16 @@ static noinline int btrfs_ioctl_fitrim(struct file > *file, void __user *arg) > if (!capable(CAP_SYS_ADMIN)) > return -EPERM; > > + /* > + * If the fs is mounted with nologreplay, which requires it to be > + * mounted in RO mode as well, we can not allow discard on free space > + * inside block groups, because log trees refer to extents that are not > + * pinned in a block group's free space cache (pinning the extents is > + * precisely the first phase of replaying a log tree). > + */ > + if (btrfs_test_opt(fs_info, NOLOGREPLAY)) > + return -EROFS; > + > rcu_read_lock(); > list_for_each_entry_rcu(device, &fs_info->fs_devices->devices, > dev_list) { >
Re: [PATCH] Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
On Tue, Mar 26, 2019 at 12:17 PM Nikolay Borisov wrote:
>
>
>
> On 26.03.19 at 12:49, fdman...@kernel.org wrote:
> > From: Filipe Manana
> >
> > Whan a filesystem is mounted with the nologreplay mount option, which
> > requires it to be mounted in RO mode as well, we can not allow discard on
> > free space inside block groups, because log trees refer to extents that
> > are not pinned in a block group's free space cache (pinning the extents is
> > precisely the first phase of replaying a log tree).
> >
> > So do not allow the fitrim ioctl to do anything when the filesystem is
> > mounted with the nologreplay option, because later it can be mounted RW
> > without that option, which causes log replay to happen and result in
> > either a failure to replay the log trees (leading to a mount failure), a
> > crash or some silent corruption.
> >
> > Reported-by: Darrick J. Wong
> > Signed-off-by: Filipe Manana
>
> Does it make sense to make the check a bit more specific and only return
> EROFS when NOLOGREPLAY and the log tree has non-null generation?

Checking whether there's actually a log tree would make sense as well.
Neither the xfs nor the ext4 fix (the latter is already in Linus' tree)
does an equivalent check, nor does the proposed fstests test case make
sure a journal/log exists.

I'm not against it, but this isn't a common use case either.

>
> In any case:
>
> Reviewed-by: Nikolay Borisov
>
> > ---
> >  fs/btrfs/ioctl.c | 10 ++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 494f0f10d70e..01808934d21f 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -501,6 +501,16 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg)
> >  	if (!capable(CAP_SYS_ADMIN))
> >  		return -EPERM;
> >
> > +	/*
> > +	 * If the fs is mounted with nologreplay, which requires it to be
> > +	 * mounted in RO mode as well, we can not allow discard on free space
> > +	 * inside block groups, because log trees refer to extents that are not
> > +	 * pinned in a block group's free space cache (pinning the extents is
> > +	 * precisely the first phase of replaying a log tree).
> > +	 */
> > +	if (btrfs_test_opt(fs_info, NOLOGREPLAY))
> > +		return -EROFS;
> > +
> >  	rcu_read_lock();
> >  	list_for_each_entry_rcu(device, &fs_info->fs_devices->devices,
> >  				dev_list) {
> >
Re: [PATCH] Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
On 26.03.19 г. 14:35 ч., Filipe Manana wrote: > On Tue, Mar 26, 2019 at 12:17 PM Nikolay Borisov wrote: >> >> >> >> On 26.03.19 г. 12:49 ч., fdman...@kernel.org wrote: >>> From: Filipe Manana >>> >>> Whan a filesystem is mounted with the nologreplay mount option, which >>> requires it to be mounted in RO mode as well, we can not allow discard on >>> free space inside block groups, because log trees refer to extents that >>> are not pinned in a block group's free space cache (pinning the extents is >>> precisely the first phase of replaying a log tree). >>> >>> So do not allow the fitrim ioctl to do anything when the filesystem is >>> mounted with the nologreplay option, because later it can be mounted RW >>> without that option, which causes log replay to happen and result in >>> either a failure to replay the log trees (leading to a mount failure), a >>> crash or some silent corruption. >>> >>> Reported-by: Darrick J. Wong >>> Signed-off-by: Filipe Manana >> >> Does it make sense to make the check a bit more specific and only return >> EROFS when NOLOGREPLAY and the log tree has non-null generation? > > It would make sense checking if there's actually a log tree as well. > Neither the xfs nor ext4 (which is already in Linus' tree) do such > equivalent checks, nor the proposed fstests test case makes sure a > journal/log exists. > > Not against it, but this isn't a common use case either. I think of this as sorts of "optimisation" where if we don't have a tree then we can allow trim. Though this is much simpler so I'm fine with it as well. 
> >> >> In any case: >> >> Reviewed-by: Nikolay Borisov >> >>> --- >>> fs/btrfs/ioctl.c | 10 ++ >>> 1 file changed, 10 insertions(+) >>> >>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >>> index 494f0f10d70e..01808934d21f 100644 >>> --- a/fs/btrfs/ioctl.c >>> +++ b/fs/btrfs/ioctl.c >>> @@ -501,6 +501,16 @@ static noinline int btrfs_ioctl_fitrim(struct file >>> *file, void __user *arg) >>> if (!capable(CAP_SYS_ADMIN)) >>> return -EPERM; >>> >>> + /* >>> + * If the fs is mounted with nologreplay, which requires it to be >>> + * mounted in RO mode as well, we can not allow discard on free space >>> + * inside block groups, because log trees refer to extents that are >>> not >>> + * pinned in a block group's free space cache (pinning the extents is >>> + * precisely the first phase of replaying a log tree). >>> + */ >>> + if (btrfs_test_opt(fs_info, NOLOGREPLAY)) >>> + return -EROFS; >>> + >>> rcu_read_lock(); >>> list_for_each_entry_rcu(device, &fs_info->fs_devices->devices, >>> dev_list) { >>> >
Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount
On 2019/3/26 下午6:24, berodual_xyz wrote: > Mount messages below. > > Thanks for your input, Qu! > > ## > [42763.884134] BTRFS info (device sdd): disabling free space tree > [42763.884138] BTRFS info (device sdd): force clearing of disk cache > [42763.884140] BTRFS info (device sdd): has skinny extents > [42763.885207] BTRFS error (device sdd): parent transid verify failed on > 1048576 wanted 60234 found 60230 So btrfs is using the latest superblock while the good one should be the old superblock. Btrfs-progs is able to just ignore the transid mismatch, but kernel doesn't and shouldn't. In fact we should allow btrfs rescue super to use super blocks from other device to replace the old one. So my patch won't help at all, the failure happens at the very beginning of the devices list initialization. BTW, if btrfs restore can't recover certain files, I don't believe any rescue kernel mount option can do more. Thanks, Qu > [42763.885263] BTRFS error (device sdd): failed to read chunk root > [42763.900922] BTRFS error (device sdd): open_ctree failed > ## > > > > > Sent with ProtonMail Secure Email. > > ‐‐‐ Original Message ‐‐‐ > On Tuesday, 26. March 2019 10:21, Qu Wenruo wrote: > >> On 2019/3/26 下午4:52, berodual_xyz wrote: >> >>> Thank you both for your input. >>> see below. >>> > You sda and sdb are at gen 60233 while sdd and sde are at gen 60234. > It's possible to allow kernel to manually assemble its device list using > "device=" mount option. > Since you're using RAID6, it's possible to recover using 2 devices only, > but in that case you need "degraded" mount option. He has btrfs raid0 profile on top of hardware RAID6 devices. >>> >>> Correct, my FS is a "raid0" across four hardware-raid based raid6 devices. >>> The underlying devices of the raid controller are fine, same as the volumes >>> themselves. >> >> Then there is not much we can do. >> >> The super blocks shows all your 4 devices are in 2 different states. 
>> (older generation with dirt log, newer generation without log). >> >> This means some writes didn't reach all devices. >> >>> Only corruption seems to be on the btrfs side. >> >> Please provide the kernel message when trying to mount the fs. >> >>> Does your tip regarding mounting by explicitly specifying the devices still >>> make sense? >> >> Not really. For RAID0 case, it doesn't make much sense. >> >>> Will this figure out automatically which generation to use? >> >> You could try, as all the mount option is making btrfs completely RO (no >> log replay), so it should be pretty safe. >> >>> I am at the moment in the process of using "btrfs restore" to pull more >>> data from the filesystem without making any further changes. >>> After that I am happy to continue testing, and will happily test your >>> mentioned "skip_bg" patch - but if you think that there is some other way >>> to mount (just for recovery purpose - read only is fine!) while having >>> different gens on the devices, I highly appreciate it. >> >> With mounting failure dmesg, it should be pretty easy to determine >> whether my skip_bg will work. >> >> Thanks, >> Qu >> >>> Thanks Qu and Andrei! > signature.asc Description: OpenPGP digital signature
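The two symptoms discussed in this thread (devices sitting at different superblock generations, and "parent transid verify failed ... wanted 60234 found 60230") both come down to comparing generation numbers. The following is a minimal sketch of those comparisons; the helper names are hypothetical and the -EIO return value is an assumption chosen for illustration, not a statement about the kernel's internal return codes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: a tree block is only trusted if the generation
 * stored in it matches the generation its parent pointer expects.
 */
static int check_parent_transid(uint64_t wanted, uint64_t found)
{
	return wanted == found ? 0 : -EIO;
}

/*
 * Hypothetical check: returns 1 if every device's superblock reports
 * the same generation, 0 otherwise.
 */
static int generations_agree(const uint64_t *gens, size_t nr)
{
	for (size_t i = 1; i < nr; i++)
		if (gens[i] != gens[0])
			return 0;
	return 1;
}
```

With the values reported above, `check_parent_transid(60234, 60230)` fails, and the four devices at generations 60233, 60233, 60234 and 60234 do not agree.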
Re: [PATCH] Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
On 2019/3/26 下午8:17, Nikolay Borisov wrote: > > > On 26.03.19 г. 12:49 ч., fdman...@kernel.org wrote: >> From: Filipe Manana >> >> Whan a filesystem is mounted with the nologreplay mount option, which >> requires it to be mounted in RO mode as well, we can not allow discard on >> free space inside block groups, because log trees refer to extents that >> are not pinned in a block group's free space cache (pinning the extents is >> precisely the first phase of replaying a log tree). >> >> So do not allow the fitrim ioctl to do anything when the filesystem is >> mounted with the nologreplay option, because later it can be mounted RW >> without that option, which causes log replay to happen and result in >> either a failure to replay the log trees (leading to a mount failure), a >> crash or some silent corruption. >> >> Reported-by: Darrick J. Wong >> Signed-off-by: Filipe Manana > > Does it make sense to make the check a bit more specific and only return > EROFS when NOLOGREPLAY and the log tree has non-null generation? To me fstrim is a WRITE operation, why it is allowed even in RO mount? Thanks, Qu > > In any case: > > Reviewed-by: Nikolay Borisov > >> --- >> fs/btrfs/ioctl.c | 10 ++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index 494f0f10d70e..01808934d21f 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -501,6 +501,16 @@ static noinline int btrfs_ioctl_fitrim(struct file >> *file, void __user *arg) >> if (!capable(CAP_SYS_ADMIN)) >> return -EPERM; >> >> +/* >> + * If the fs is mounted with nologreplay, which requires it to be >> + * mounted in RO mode as well, we can not allow discard on free space >> + * inside block groups, because log trees refer to extents that are not >> + * pinned in a block group's free space cache (pinning the extents is >> + * precisely the first phase of replaying a log tree). 
>> + */ >> +if (btrfs_test_opt(fs_info, NOLOGREPLAY)) >> +return -EROFS; >> + >> rcu_read_lock(); >> list_for_each_entry_rcu(device, &fs_info->fs_devices->devices, >> dev_list) { >>
Re: [PATCH] Btrfs: do not allow trimming when a fs is mounted with the nologreplay option
On Tue, Mar 26, 2019 at 09:40:08PM +0800, Qu Wenruo wrote:
>
>
> On 2019/3/26 8:17 PM, Nikolay Borisov wrote:
> >
> >
> > On 26.03.19 at 12:49, fdman...@kernel.org wrote:
> >> From: Filipe Manana
> >>
> >> Whan a filesystem is mounted with the nologreplay mount option, which
> >> requires it to be mounted in RO mode as well, we can not allow discard on
> >> free space inside block groups, because log trees refer to extents that
> >> are not pinned in a block group's free space cache (pinning the extents is
> >> precisely the first phase of replaying a log tree).
> >>
> >> So do not allow the fitrim ioctl to do anything when the filesystem is
> >> mounted with the nologreplay option, because later it can be mounted RW
> >> without that option, which causes log replay to happen and result in
> >> either a failure to replay the log trees (leading to a mount failure), a
> >> crash or some silent corruption.
> >>
> >> Reported-by: Darrick J. Wong
> >> Signed-off-by: Filipe Manana
> >
> > Does it make sense to make the check a bit more specific and only return
> > EROFS when NOLOGREPLAY and the log tree has non-null generation?
>
> To me fstrim is a WRITE operation, why it is allowed even in RO mount?

It's a write to the block device, not to the filesystem.
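For readers who have not used the interface under discussion, here is a minimal userspace sketch of issuing FITRIM on a mount point. `trim_whole_fs()` is a hypothetical helper written for this illustration; `FITRIM` and `struct fstrim_range` come from `<linux/fs.h>`. With the patch in this thread applied, a btrfs filesystem mounted with nologreplay would be expected to make the ioctl fail with EROFS, which this helper would report as -EROFS.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FITRIM, struct fstrim_range */

/*
 * Hypothetical helper: ask the filesystem mounted at 'mountpoint' to
 * discard all of its free space. Returns 0 and stores the number of
 * bytes the kernel reports as trimmed, or a negative errno value.
 */
static int trim_whole_fs(const char *mountpoint, uint64_t *trimmed)
{
	struct fstrim_range range = {
		.start = 0,
		.len = UINT64_MAX,	/* cover the whole filesystem */
		.minlen = 0,
	};
	int ret = 0;
	int fd = open(mountpoint, O_RDONLY);

	if (fd < 0)
		return -errno;

	if (ioctl(fd, FITRIM, &range) < 0)
		ret = -errno;	/* e.g. -EROFS, -EPERM, -EOPNOTSUPP */
	else
		*trimmed = range.len;	/* kernel writes back bytes trimmed */

	close(fd);
	return ret;
}
```

A caller would typically treat -EOPNOTSUPP (device without discard support) differently from -EROFS or -EPERM, which indicate that the request was refused rather than unsupported.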
Re: [PATCH v2] fstests: Verify that removed device has its superblocks deleted
On 25.03.19 г. 23:55 ч., Anand Jain wrote: > > > On 3/25/19 10:07 PM, Nikolay Borisov wrote: >> When a device is removed from a btrfs filesystem its superblock copies >> must be deleted. > > AFAIK this bug was fixed a long time back in the kernel. Is there any > newer fix in the kernel? No there isn't but we currently don't have a test that covers that functionality and I have some changes in progress that touch superblock writing code so I'd rather be safe than sorry. > >> This test ensures this is indeed the case. > >> Signed-off-by: Nikolay Borisov > > Looks good. > > Reviewed-by: Anand Jain > >> --- >> >> Changes since v1: >> * Use _scratch_dev_pool_(get|put) to ensure the test uses exactly 2 >> devices. >> * Explicitly use -draid0 -mraid0 mkfs options for the scratch >> devices to >> ensure at least one of the device could be removed. Also add a >> comment about >> that. >> >> tests/btrfs/184 | 63 >> + >> tests/btrfs/184.out | 2 ++ >> tests/btrfs/group | 1 + >> 3 files changed, 66 insertions(+) >> create mode 100755 tests/btrfs/184 >> create mode 100644 tests/btrfs/184.out >> >> diff --git a/tests/btrfs/184 b/tests/btrfs/184 >> new file mode 100755 >> index ..49fe5c9c27bb >> --- /dev/null >> +++ b/tests/btrfs/184 >> @@ -0,0 +1,63 @@ >> +#! /bin/bash >> +# SPDX-License-Identifier: GPL-2.0 >> +# Copyright (c) 2019 SUSE LLC. All Rights Reserved. >> +# >> +# FS QA Test 184 >> +# >> +# Verify that when a device is removed from a multi-device >> +# filesystem its superblock copies are correctly deleted >> +# >> +seq=`basename $0` >> +seqres=$RESULT_DIR/$seq >> +echo "QA output created by $seq" >> + >> +here=`pwd` >> +tmp=/tmp/$$ >> +status=1 # failure is the default! >> +trap "_cleanup; exit \$status" 0 1 2 3 15 >> + >> +_cleanup() >> +{ >> + cd / >> + rm -f $tmp.* >> +} >> + >> +# get standard environment, filters and checks >> +. ./common/rc >> +. 
./common/filter >> + >> +rm -f $seqres.full >> + >> +# real QA test starts here >> +_supported_fs btrfs >> +_supported_os Linux >> +_require_scratch >> +_require_scratch_dev_pool 2 >> +_require_btrfs_command inspect-internal dump-super >> + >> +_scratch_dev_pool_get 2 >> + >> +# Explicitly use raid0 mode to ensure at least one of the devices can be >> +# removed. >> +_scratch_pool_mkfs "-d raid0 -m raid0" >> $seqres.full 2>&1 || _fail >> "mkfs failed" >> +_scratch_mount >> + >> +# pick last dev in the list >> +dev_del=`echo ${SCRATCH_DEV_POOL} | awk '{print $NF}'` >> +$BTRFS_UTIL_PROG device delete $dev_del $SCRATCH_MNT || _fail "btrfs >> device delete failed" >> +for i in {0..2}; do >> + output=$($BTRFS_UTIL_PROG inspect-internal dump-super -s $i >> $dev_del 2>&1) >> + $BTRFS_UTIL_PROG inspect-internal dump-super -s $i $dev_del 2>&1 >> | grep -q "bad magic" >> + ret=$? >> + if [[ "$output" != "" && $ret -eq 1 ]]; then >> + _fail "Deleted dev superblocks not scratched" >> + fi >> +done >> +_scratch_unmount >> + >> +_scratch_dev_pool_put >> + >> +# success, all done >> +echo "Silence is golden" >> +status=0 >> +exit >> diff --git a/tests/btrfs/184.out b/tests/btrfs/184.out >> new file mode 100644 >> index ..b4ce96cfc047 >> --- /dev/null >> +++ b/tests/btrfs/184.out >> @@ -0,0 +1,2 @@ >> +QA output created by 184 >> +Silence is golden >> diff --git a/tests/btrfs/group b/tests/btrfs/group >> index f3227c1708d9..c1d215bf5ff8 100644 >> --- a/tests/btrfs/group >> +++ b/tests/btrfs/group >> @@ -186,3 +186,4 @@ >> 181 auto quick balance >> 182 auto quick balance >> 183 auto quick clone compress punch >> +184 auto quick volume >> >
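The loop over `-s 0..2` in the test above visits the three superblock copies of the removed device and expects "bad magic" on each. As a rough illustration of what is being probed, the sketch below computes where those copies live and checks a buffer for the magic string. The macro and function names are made up for this example; the offsets (64 KiB, 64 MiB, 256 GiB) and the magic position reflect my understanding of the btrfs on-disk format and should be checked against the btrfs headers before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative names; values as understood from the btrfs on-disk format. */
#define SB_PRIMARY_OFFSET (64ULL * 1024) /* copy 0 lives at 64 KiB */
#define SB_MIRROR_BASE    (16ULL * 1024)
#define SB_MIRROR_SHIFT   12
#define SB_MAGIC_OFFSET   64             /* csum[32] + fsid[16] + bytenr + flags */
#define SB_MAGIC          "_BHRfS_M"
#define SB_MAGIC_LEN      8

/* Byte offset of superblock copy 0, 1 or 2 on a device. */
static uint64_t sb_copy_offset(int copy)
{
	if (copy == 0)
		return SB_PRIMARY_OFFSET;
	return SB_MIRROR_BASE << (SB_MIRROR_SHIFT * copy);
}

/* True if the buffer, read from a superblock offset, carries the magic. */
static bool sb_has_magic(const unsigned char *sb, size_t len)
{
	if (len < SB_MAGIC_OFFSET + SB_MAGIC_LEN)
		return false;
	return memcmp(sb + SB_MAGIC_OFFSET, SB_MAGIC, SB_MAGIC_LEN) == 0;
}
```

A device whose superblocks were wiped on removal would make `sb_has_magic()` return false for every copy that fits on the device, which is the condition the fstests case asserts through `dump-super`.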
Re: [PATCh v2 1/9] btrfs: Move btrfs_check_chunk_valid() to tree-check.[ch] and export it
On Tue, Mar 26, 2019 at 07:02:20AM +0800, Qu Wenruo wrote:
> On 2019/3/26 1:06 AM, David Sterba wrote:
> > On Wed, Mar 20, 2019 at 02:37:09PM +0800, Qu Wenruo wrote:
> >> By function, chunk item verification is more suitable to be done inside
> >> tree-checker.
> >>
> >> So move btrfs_check_chunk_valid() to tree-checker.c and export it.
> >>
> >> And since it's now moved to tree-checker, also add a better comment for
> >> what this function is doing.
> >>
> >> Signed-off-by: Qu Wenruo
> >> ---
> >>  fs/btrfs/tree-checker.c | 99 +
> >>  fs/btrfs/tree-checker.h |  3 ++
> >>  fs/btrfs/volumes.c      | 94 +-
> >>  3 files changed, 103 insertions(+), 93 deletions(-)
> >>
> >> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> >> index b8cdaf472031..4e44323ae758 100644
> >> --- a/fs/btrfs/tree-checker.c
> >> +++ b/fs/btrfs/tree-checker.c
> >> @@ -448,6 +448,105 @@ static int check_block_group_item(struct btrfs_fs_info *fs_info,
> >>  	return 0;
> >>  }
> >>
> >> +/*
> >> + * The common chunk check which could also work on super block sys chunk array.
> >> + *
> >> + * Return -EUCLEAN if anything is corrupted.
> >
> > Well, that's still confusing if you say EUCLEAN in the comment and use
> > EIO in the code.
>
> Oh, that EIO to EUCLEAN change is in later patch (3/9).

Yes, but this patch is confusing when viewed on its own. The EIO->EUCLEAN
change in the comment belongs to 3/9 too.

> Do I need to resend the patchset?

No, such small fixups I do myself, but I need to point that out so we reach
a common understanding of what's expected.
Re: [PATCH] btrfs: Enable btrfs/003
On 19.03.19 at 12:58, Nikolay Borisov wrote:
> For a long time this test has been failing on all kinds of VM configuration,
> which are using virtio_blk devices. This is due to the fact that scsi
> devices are deletable and virtio_blk are not. However, this only prevents
> device replace case to run and has no negative effect on the other
> useful test cases.
>
> Re-enable btrfs/003 to run by making _require_deletable_scratch_dev_pool
> private to the test case and modifying it to return success (0) or
> failure (1) if devices are not deletable. Further modify the replace
> test case to check the return value of this function and skip it if
> devices are not deletable.
>
> Signed-off-by: Nikolay Borisov

Ping

> ---
>  common/rc       | 12 
>  tests/btrfs/003 | 19 ++-
>  2 files changed, 18 insertions(+), 13 deletions(-)
>
> diff --git a/common/rc b/common/rc
> index 1c42515ff0ea..5693ba3cad18 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -2961,18 +2961,6 @@ _require_scratch_dev_pool_equal_size()
>  	done
>  }
>
> -# We will check if the device is deletable
> -_require_deletable_scratch_dev_pool()
> -{
> -	local i
> -	local x
> -	for i in $SCRATCH_DEV_POOL; do
> -		x=`echo $i | cut -d"/" -f 3`
> -		if [ ! -f /sys/class/block/${x}/device/delete ]; then
> -			_notrun "$i is a device which is not deletable"
> -		fi
> -	done
> -}
>
>  # Check that fio is present, and it is able to execute given jobfile
>  _require_fio()
> diff --git a/tests/btrfs/003 b/tests/btrfs/003
> index 938030ef4c65..2aeb9fe6325a 100755
> --- a/tests/btrfs/003
> +++ b/tests/btrfs/003
> @@ -17,6 +17,21 @@ dev_removed=0
>  removed_dev_htl=""
>  trap "_cleanup; exit \$status" 0 1 2 3 15
>
> +# Check if all scratch dev pools are deletable
> +_require_deletable_scratch_dev_pool()
> +{
> +	local i
> +	local x
> +	for i in $SCRATCH_DEV_POOL; do
> +		x=`echo $i | cut -d"/" -f 3`
> +		if [ ! -f /sys/class/block/${x}/device/delete ]; then
> +			return 1
> +		fi
> +	done
> +
> +	return 0
> +}
> +
>  _cleanup()
>  {
>  	cd /
> @@ -35,7 +50,6 @@ _supported_fs btrfs
>  _supported_os Linux
>  _require_scratch
>  _require_scratch_dev_pool 4
> -_require_deletable_scratch_dev_pool
>  _require_command "$WIPEFS_PROG" wipefs
>
>  rm -f $seqres.full
> @@ -111,6 +125,9 @@ _test_replace()
>  	local ds
>  	local d
>
> +	# If scratch devs are not deletable skip this test
> +	if ! _require_deletable_scratch_dev_pool; then return 0; fi
> +
>  	# exclude the first and the last disk in the disk pool
>  	n=$(($n-1))
>  	ds=${devs[@]:1:$(($n-1))}
>
Re: [PATCh v2 2/9] btrfs: tree-checker: Make chunk item checker more readable
On Wed, Mar 20, 2019 at 11:41:44AM +0100, Johannes Thumshirn wrote:
> Looks good,
> Reviewed-by: Johannes Thumshirn
>
> Although I think it would've been worth explicitly mentioning that you
> increased the severity level from error to critical.

Agreed, changelog updated.
Re: WARNING at fs/btrfs/delayed-ref.c:296 btrfs_merge_delayed_refs+0x3dc/0x410 (new on 5.0.4, not in 5.0.3)
On Tue, Mar 26, 2019 at 10:42:31AM +0200, Nikolay Borisov wrote:
>
>
> On 26.03.19 at 6:30, Zygo Blaxell wrote:
> > On Mon, Mar 25, 2019 at 10:50:28PM -0400, Zygo Blaxell wrote:
> >> Running balance, rsync, and dedupe, I get kernel warnings every few
> >> minutes on 5.0.4. No warnings on 5.0.3 under similar conditions.
> >>
> >> Mount options are: flushoncommit,space_cache=v2,compress=zstd.
> >>
> >> There are two different stacks on the warnings. This one comes from
> >> btrfs balance:
> >
> > [snip]
> >
> > Possibly unrelated, but I'm also repeatably getting this in 5.0.4 and
> > not 5.0.3, after about 5 hours of uptime. Different processes, same
> > kernel stack:
> >
> > [Mon Mar 25 23:35:17 2019] kworker/u8:4: page allocation failure: order:0, mode:0x404000(GFP_NOWAIT|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
> > [Mon Mar 25 23:35:17 2019] CPU: 2 PID: 29518 Comm: kworker/u8:4 Tainted: GW 5.0.4-zb64-303ce93b05c9+ #1
>
> What commits does this kernel include because it doesn't seem to be a
> pristine upstream 5.0.4 ? Also what you are seeing below is definitely a
> bug in MM. The question is whether it's due to your doing faulty
> backports in the kernel or it's due to something that got automatically
> backported to 5.0.4

That was the first thing I thought of, so I reverted to vanilla 5.0.4,
repeated the test, and obtained the same result.

You may have a point about non-btrfs patches in 5.0.4, though.
I previously tested 5.0.3 with most of the 5.0.4 fs/btrfs commits
already included by cherry-pick:

	1098803b8cb7 Btrfs: fix deadlock between clone/dedupe and rename
	3486142a68e3 Btrfs: fix corruption reading shared and compressed extents after hole punching
	fb9c36acfab1 btrfs: scrub: fix circular locking dependency warning
	9d7b327affb8 Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
	80dcd07c27df Btrfs: setup a nofs context for memory allocation at btrfs_create_tree()

The commits that are in 5.0.4 but not in my last 5.0.3 test run are:

	ebbb48419e8a btrfs: init csum_list before possible free
	88e610ae4c3a btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
	9c58f2ada4fa btrfs: drop the lock on error in btrfs_dev_replace_cancel

and I don't see how those commits could lead to the observed changes
in behavior. I didn't include them for 5.0.3 because my test scenario
doesn't execute the code they touch. So the problem might be outside
of btrfs completely.
Re: WARNING at fs/btrfs/delayed-ref.c:296 btrfs_merge_delayed_refs+0x3dc/0x410 (new on 5.0.4, not in 5.0.3)
On 26.03.19 г. 17:09 ч., Zygo Blaxell wrote: > On Tue, Mar 26, 2019 at 10:42:31AM +0200, Nikolay Borisov wrote: >> >> >> On 26.03.19 г. 6:30 ч., Zygo Blaxell wrote: >>> On Mon, Mar 25, 2019 at 10:50:28PM -0400, Zygo Blaxell wrote: Running balance, rsync, and dedupe, I get kernel warnings every few minutes on 5.0.4. No warnings on 5.0.3 under similar conditions. Mount options are: flushoncommit,space_cache=v2,compress=zstd. There are two different stacks on the warnings. This one comes from btrfs balance: >>> >>> [snip] >>> >>> Possibly unrelated, but I'm also repeatably getting this in 5.0.4 and >>> not 5.0.3, after about 5 hours of uptime. Different processes, same >>> kernel stack: >>> >>> [Mon Mar 25 23:35:17 2019] kworker/u8:4: page allocation failure: >>> order:0, mode:0x404000(GFP_NOWAIT|__GFP_COMP), >>> nodemask=(null),cpuset=/,mems_allowed=0 >>> [Mon Mar 25 23:35:17 2019] CPU: 2 PID: 29518 Comm: kworker/u8:4 >>> Tainted: GW 5.0.4-zb64-303ce93b05c9+ #1 >> >> What commits does this kernel include because it doesn't seem to be a >> pristine upstream 5.0.4 ? Also what you are seeing below is definitely a >> bug in MM. The question is whether it's due to your doing faulty >> backports in the kernel or it's due to something that got automatically >> backported to 5.0.4 > > That was the first thing I thought of, so I reverted to vanilla 5.0.4, > repeated the test, and obtained the same result. > > You may have a point about non-btrfs patches in 5.0.4, though. 
> I previously tested 5.0.3 with most of the 5.0.4 fs/btrfs commits > already included by cherry-pick: > > 1098803b8cb7 Btrfs: fix deadlock between clone/dedupe and rename > 3486142a68e3 Btrfs: fix corruption reading shared and compressed > extents after hole punching > fb9c36acfab1 btrfs: scrub: fix circular locking dependency warning > 9d7b327affb8 Btrfs: setup a nofs context for memory allocation at > __btrfs_set_acl > 80dcd07c27df Btrfs: setup a nofs context for memory allocation at > btrfs_create_tree() > > The commits that are in 5.0.4 but not in my last 5.0.3 test run are: > > ebbb48419e8a btrfs: init csum_list before possible free > 88e610ae4c3a btrfs: ensure that a DUP or RAID1 block group has exactly > two stripes > 9c58f2ada4fa btrfs: drop the lock on error in btrfs_dev_replace_cancel > > and I don't see how those commits could lead to the observed changes > in behavior. I didn't include them for 5.0.3 because my test scenario > doesn't execute the code they touch. So the problem might be outside > of btrfs completely. I think it might very well be outside of btrfs because you are seeing an order 0 failure when you have plenty of order 0 free pages. That's definitely something you might want to report to mm. >
Re: [PATCh v2 8/9] btrfs: tree-checker: Verify inode item
On Wed, Mar 20, 2019 at 02:37:16PM +0800, Qu Wenruo wrote: > @@ -1539,6 +1539,8 @@ do { >\ > #define BTRFS_INODE_COMPRESS (1 << 11) > > #define BTRFS_INODE_ROOT_ITEM_INIT (1 << 31) > +#define BTRFS_INODE_FLAG_MASK(((1 << 12) - 1) |\ > + BTRFS_INODE_ROOT_ITEM_INIT) That's fragile, the mask constant should enumerate all bits it's supposed to cover, like +#define BTRFS_INODE_FLAG_MASK \ + (BTRFS_INODE_NODATASUM |\ +BTRFS_INODE_NODATACOW |\ +BTRFS_INODE_READONLY | \ +BTRFS_INODE_NOCOMPRESS | \ +BTRFS_INODE_PREALLOC | \ +BTRFS_INODE_SYNC | \ +BTRFS_INODE_IMMUTABLE |\ +BTRFS_INODE_APPEND | \ +BTRFS_INODE_NODUMP | \ +BTRFS_INODE_NOATIME | \ +BTRFS_INODE_DIRSYNC | \ +BTRFS_INODE_COMPRESS | \ +BTRFS_INODE_ROOT_ITEM_INIT) + > + u64 super_gen = btrfs_super_generation(fs_info->super_copy); > + u32 valid_mask = (S_IFMT | S_ISUID | S_ISGID | S_ISVTX | 0777); > + u32 mode; > + > + if ((key->objectid < BTRFS_FIRST_FREE_OBJECTID || > + key->objectid > BTRFS_LAST_FREE_OBJECTID) && > + key->objectid != BTRFS_ROOT_TREE_DIR_OBJECTID && > + key->objectid != BTRFS_FREE_INO_OBJECTID) { > + generic_err(fs_info, leaf, slot, > + "invalid key objectid: has %llu expect %llu or [%llu, %llu] or %llu", > + key->objectid, BTRFS_ROOT_TREE_DIR_OBJECTID, > + BTRFS_FIRST_FREE_OBJECTID, > + BTRFS_LAST_FREE_OBJECTID, > + BTRFS_FREE_INO_OBJECTID); > + goto error; > + } > + if (key->offset != 0) { > + inode_item_err(fs_info, leaf, slot, > + "invalid key offset: has %llu expect 0", > + key->offset); > + goto error; > + } > + iitem = btrfs_item_ptr(leaf, slot, struct btrfs_inode_item); > + > + /* Here we use super block generation + 1 to handle log tree */ > + if (btrfs_inode_generation(leaf, iitem) > super_gen + 1) { > + inode_item_err(fs_info, leaf, slot, > + "invalid inode generation: has %llu expect (0, %llu]", > +btrfs_inode_generation(leaf, iitem), > +super_gen + 1); > + goto error; > + } > + /* Note for ROOT_TREE_DIR_ITEM, mkfs could make its transid as 0 */ > + if (btrfs_inode_transid(leaf, iitem) > 
super_gen + 1) { > + inode_item_err(fs_info, leaf, slot, > + "invalid inode generation: has %llu expect [0, %llu]", > +btrfs_inode_transid(leaf, iitem), > +super_gen + 1); > + goto error; > + } > + > + /* > + * For size and nbytes it's better not to be too strict, as for dir > + * item its size/nbytes can easily get wrong, but doesn't affect > + * any thing of the fs. So here we skip the check. > + */ > + > + mode = btrfs_inode_mode(leaf, iitem); > + if (mode & ~valid_mask) { > + inode_item_err(fs_info, leaf, slot, > +"unknown mode bit detected: 0x%x", > +mode & ~valid_mask); > + goto error; > + } > + > + /* > + * S_IFMT is not bit mapped so we can't completely rely is_power_of_2(), > + * but is_power_of_2() can save us from checking FIFO/CHR/DIR/REG. > + * Only needs to check BLK, LNK and SOCKS > + */ > + if (!is_power_of_2(mode & S_IFMT)) { > + if (!S_ISLNK(mode) && ! S_ISBLK(mode) && !S_ISSOCK(mode)) { > + inode_item_err(fs_info, leaf, slot, > + "invalid mode: has 0%o expect valid S_IF* bit(s)", > +mode & S_IFMT); > + goto error; > + } > + } > + if (S_ISDIR(mode) && btrfs_inode_nlink(leaf, iitem) > 1) { > + inode_item_err(fs_info, leaf, slot, > +"invalid nlink: has %u expect no more than 1 for > dir", > + btrfs_inode_nlink(leaf, iitem)); > + goto error; > + } > + if (btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK) { > + inode_item_err(fs_info, leaf, slot, > +
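The mode check quoted above leans on the fact that four of the seven file types occupy a single bit inside `S_IFMT`, while the other three occupy two. A userspace sketch of the same logic, using the standard `<sys/stat.h>` macros, is shown below. `is_pow2()` and `inode_mode_valid()` are illustrative names, not the kernel helpers, and the function returns a boolean instead of logging through `inode_item_err()`.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose S_IFMT and friends under strict -std= modes */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/stat.h>

/* Userspace stand-in for the kernel's is_power_of_2(); false for 0. */
static bool is_pow2(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Illustrative model of the mode validation in the quoted patch. */
static bool inode_mode_valid(uint32_t mode)
{
	const uint32_t valid_mask = S_IFMT | S_ISUID | S_ISGID | S_ISVTX | 0777;

	/* Any bit outside type, setuid/setgid/sticky and rwx is unknown. */
	if (mode & ~valid_mask)
		return false;

	/*
	 * FIFO, CHR, DIR and REG are single-bit types and pass the
	 * power-of-two test. BLK, LNK and SOCK use two bits each and need
	 * an explicit check. A type field of 0 fails both tests.
	 */
	if (!is_pow2(mode & S_IFMT)) {
		if (!S_ISLNK(mode) && !S_ISBLK(mode) && !S_ISSOCK(mode))
			return false;
	}
	return true;
}
```

Note that a mode of 0 is rejected by this logic, because the type field is neither a power of two nor one of the three two-bit types.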
BUG: workqueue lockup - pool cpus=0-3 flags=0x4 nice=0 stuck for 20915s! (on 4.20+)
On Tue, Mar 26, 2019 at 12:30:07AM -0400, Zygo Blaxell wrote: > On Mon, Mar 25, 2019 at 10:50:28PM -0400, Zygo Blaxell wrote: > > Running balance, rsync, and dedupe, [...] > > > > Mount options are: flushoncommit,space_cache=v2,compress=zstd. [snip] The 5.0.4 test machine locked up overnight. Userspace has been unresponsive for 6 continuous hours, and the kernel is spewing these: [57659.907080] BUG: workqueue lockup - pool cpus=0-3 flags=0x4 nice=0 stuck for 20915s! [57659.908242] Showing busy workqueues and worker pools: [57659.908978] workqueue events: flags=0x0 [57659.909549] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [57659.910418] pending: cache_reap [57659.910934] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [57659.911805] pending: cache_reap [57659.912321] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [57659.913189] pending: psi_update_work, cache_reap [57659.913919] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=7/256 [57659.914786] in-flight: 5391:cgroup1_release_agent, 5390:cgroup1_release_agent, 5387:cgroup1_release_agent, 5442:cgroup1_release_agent, 5388:cgroup1_release_agent, 23068:cgroup1_release_agent, 11703:cgroup1_release_agent [57659.917587] workqueue events_unbound: flags=0x2 [57659.918254] pwq 8: cpus=0-3 flags=0x4 nice=0 active=12/512 [57659.919074] pending: flush_to_ldisc, flush_to_ldisc, flush_to_ldisc, flush_to_ldisc, call_usermodehelper_exec_work, call_usermodehelper_exec_work, call_usermodehelper_exec_work, call_usermodehelper_exec_work, call_usermodehelper_exec_work, call_usermodehelper_exec_work, call_usermodehelper_exec_work, flush_to_ldisc [57659.923001] workqueue events_power_efficient: flags=0x80 [57659.923760] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [57659.924608] pending: sync_hw_clock [57659.925153] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [57659.926006] pending: neigh_periodic_work, check_lifetime [57659.926814] workqueue events_freezable_power_: flags=0x84 [57659.927579] pwq 6: cpus=3 node=0 
flags=0x0 nice=0 active=1/256 [57659.928433] pending: disk_events_workfn [57659.929040] workqueue mm_percpu_wq: flags=0x8 [57659.929662] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [57659.930516] pending: vmstat_update [57659.931059] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [57659.931908] pending: vmstat_update [57659.932448] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [57659.933301] pending: vmstat_update [57659.933845] workqueue writeback: flags=0x4e [57659.93] pwq 8: cpus=0-3 flags=0x4 nice=0 active=3/256 [57659.935239] in-flight: 22897:wb_workfn [57659.935827] pending: wb_workfn, wb_workfn [57659.936453] workqueue kblockd: flags=0x18 [57659.937030] pwq 7: cpus=3 node=0 flags=0x0 nice=-20 active=2/256 [57659.937903] pending: blk_mq_timeout_work, blk_mq_timeout_work [57659.938770] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256 [57659.939639] pending: blk_mq_timeout_work, blk_mq_timeout_work [57659.940508] workqueue ipv6_addrconf: flags=0x40008 [57659.941196] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 [57659.942021] pending: addrconf_verify_work [ipv6] [57659.942738] workqueue dm_bufio_cache: flags=0x8 [57659.943389] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [57659.944240] pending: work_fn [dm_bufio] [57659.944845] workqueue dm-cache: flags=0x8 [57659.945421] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [57659.946271] pending: do_waker [dm_cache] [57659.946887] workqueue kcopyd: flags=0x8 [57659.947440] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [57659.948291] pending: do_work [dm_mod] [57659.948875] workqueue btrfs-delalloc: flags=0xe [57659.949521] pwq 8: cpus=0-3 flags=0x4 nice=0 active=6/6 [57659.950289] pending: btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper [57659.952299] delayed: btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, 
btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper, btrfs_delalloc_helper,
Re: WARNING at fs/btrfs/delayed-ref.c:296 btrfs_merge_delayed_refs+0x3dc/0x410 (new on 5.0.4, not in 5.0.3)
On Tue, Mar 26, 2019 at 05:13:53PM +0200, Nikolay Borisov wrote: > > > On 26.03.19 г. 17:09 ч., Zygo Blaxell wrote: > > On Tue, Mar 26, 2019 at 10:42:31AM +0200, Nikolay Borisov wrote: > >> > >> > >> On 26.03.19 г. 6:30 ч., Zygo Blaxell wrote: > >>> On Mon, Mar 25, 2019 at 10:50:28PM -0400, Zygo Blaxell wrote: > Running balance, rsync, and dedupe, I get kernel warnings every few > minutes on 5.0.4. No warnings on 5.0.3 under similar conditions. > > Mount options are: flushoncommit,space_cache=v2,compress=zstd. > > There are two different stacks on the warnings. This one comes from > btrfs balance: > >>> > >>> [snip] > >>> > >>> Possibly unrelated, but I'm also repeatably getting this in 5.0.4 and > >>> not 5.0.3, after about 5 hours of uptime. Different processes, same > >>> kernel stack: > >>> > >>> [Mon Mar 25 23:35:17 2019] kworker/u8:4: page allocation failure: > >>> order:0, mode:0x404000(GFP_NOWAIT|__GFP_COMP), > >>> nodemask=(null),cpuset=/,mems_allowed=0 > >>> [Mon Mar 25 23:35:17 2019] CPU: 2 PID: 29518 Comm: kworker/u8:4 > >>> Tainted: GW 5.0.4-zb64-303ce93b05c9+ #1 > >> > >> What commits does this kernel include because it doesn't seem to be a > >> pristine upstream 5.0.4 ? Also what you are seeing below is definitely a > >> bug in MM. The question is whether it's due to your doing faulty > >> backports in the kernel or it's due to something that got automatically > >> backported to 5.0.4 > > > > That was the first thing I thought of, so I reverted to vanilla 5.0.4, > > repeated the test, and obtained the same result. > > > > You may have a point about non-btrfs patches in 5.0.4, though. 
> > I previously tested 5.0.3 with most of the 5.0.4 fs/btrfs commits > > already included by cherry-pick: > > > > 1098803b8cb7 Btrfs: fix deadlock between clone/dedupe and rename > > 3486142a68e3 Btrfs: fix corruption reading shared and compressed > > extents after hole punching > > fb9c36acfab1 btrfs: scrub: fix circular locking dependency warning > > 9d7b327affb8 Btrfs: setup a nofs context for memory allocation at > > __btrfs_set_acl > > 80dcd07c27df Btrfs: setup a nofs context for memory allocation at > > btrfs_create_tree() > > > > The commits that are in 5.0.4 but not in my last 5.0.3 test run are: > > > > ebbb48419e8a btrfs: init csum_list before possible free > > 88e610ae4c3a btrfs: ensure that a DUP or RAID1 block group has exactly > > two stripes > > 9c58f2ada4fa btrfs: drop the lock on error in btrfs_dev_replace_cancel > > > > and I don't see how those commits could lead to the observed changes > > in behavior. I didn't include them for 5.0.3 because my test scenario > > doesn't execute the code they touch. So the problem might be outside > > of btrfs completely. > > I think it might very well be outside of btrfs because you are seeing an > order 0 failure when you have plenty of order 0 free pages. That's > definitely something you might want to report to mm. I found a similar incident in logs from older (and slightly different) test runs, this one on 4.20.13: [112241.575678] kworker/u8:17: page allocation failure: order:0, mode:0x404000(GFP_NOWAIT|__GFP_COMP), nodemask=(null) [112241.587462] kworker/u8:17 cpuset=/ mems_allowed=0 [112241.588442] CPU: 1 PID: 22891 Comm: kworker/u8:17 Not tainted 4.20.13-zb64+ #1 [112241.589550] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [112241.590814] Workqueue: btrfs-submit btrfs_submit_helper [112241.591611] Call Trace: [112241.592034] dump_stack+0x7d/0xbb [112241.592564] warn_alloc+0xfc/0x180 [112241.593121] __alloc_pages_nodemask+0x1297/0x13e0 [112241.593860] ? 
rcu_read_lock_sched_held+0x68/0x70 [112241.594607] cache_grow_begin+0x79/0x730 [112241.595226] ? cache_grow_begin+0x79/0x730 [112241.595875] ? cache_alloc_node+0x165/0x1e0 [112241.596573] fallback_alloc+0x1e4/0x280 [112241.597162] kmem_cache_alloc+0x2e9/0x310 [112241.597763] btracker_queue+0x47/0x170 [dm_cache] [112241.598465] __lookup+0x474/0x600 [dm_cache_smq] [112241.599151] ? smq_lookup+0x37/0x7b [dm_cache_smq] [112241.599867] smq_lookup+0x5d/0x7b [dm_cache_smq] [112241.600554] map_bio.part.40+0x14d/0x5d0 [dm_cache] [112241.601307] ? bio_detain_shared+0xb3/0x120 [dm_cache] [112241.602075] cache_map+0x120/0x1a0 [dm_cache] [112241.602813] __map_bio+0x42/0x1f0 [dm_mod] [112241.603506] __split_and_process_non_flush+0x10e/0x1e0 [dm_mod] [112241.604418] __split_and_process_bio+0xb2/0x1a0 [dm_mod] [112241.605371] ? __process_bio+0x170/0x170 [dm_mod] [112241.606099] __dm_make_request.isra.20+0x4c/0x100 [dm_mod] [112241.606936] generic_make_request+0x29d/0x470 [112241.607611] ? kvm_sched_clock_read+0x14/0x30
Re: [PATCh v2 8/9] btrfs: tree-checker: Verify inode item
On Mon, Mar 25, 2019 at 12:27:24PM +0800, Qu Wenruo wrote:
>
>
> On 2019/3/20 2:37 PM, Qu Wenruo wrote:
> > There is a report in kernel bugzilla about mismatch file type in dir
> > item and inode item.
> >
> > This inspires us to check inode mode in inode item.
> >
> > This patch will check the following members:
> > - inode key objectid
> >   Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.
> >
> > - inode key offset
> >   Should be 0
> >
> > - inode item generation
> > - inode item transid
> >   No newer than sb generation + 1.
> >   The +1 is for log tree.
> >
> > - inode item mode
> >   No unknown bits.
> >   No invalid S_IF* bit.
> >   NOTE: S_IFMT check is not enough, need to check every known type.
> >
> > - inode item nlink
> >   Dir should have no more link than 1.
> >
> > - inode item flags
> >
> > Signed-off-by: Qu Wenruo
> > Reviewed-by: Nikolay Borisov
>
> There is some bug report of kernel producing free space cache inode with
> mode 0, which is invalid and can be detected by this patch.
>
> Although the patch itself is good, I'm afraid we need to address the
> invalid inode mode created by old kernel in btrfs-progs at least before
> merging this patch into upstream.

Can this be addressed on the kernel side? Like detecting the invalid
mode, printing a warning and fixing it on the next write. The progs can
detect and fix that too of course.

So I'll keep the patch working as-is, we can relax the error to a
warning if we're out of time or find out that it needs to be that way
due to backward compatibility reasons.
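The objectid and generation rules listed in the quoted changelog can be restated as two small predicates. This is a sketch only: the macro names drop the `BTRFS_` prefix to make clear they are local to the example, the range `[256, (u64)-256]` comes from the changelog, and the values 6 and `(u64)-12` for the root tree dir and free-ino objectids are my reading of the btrfs headers and should be treated as assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; the last two values are assumptions, see the lead-in. */
#define FIRST_FREE_OBJECTID    256ULL
#define LAST_FREE_OBJECTID     ((uint64_t)-256)
#define ROOT_TREE_DIR_OBJECTID 6ULL
#define FREE_INO_OBJECTID      ((uint64_t)-12)

/* Inode key objectid: regular range, or one of the two special inodes. */
static bool inode_objectid_valid(uint64_t objectid)
{
	if (objectid >= FIRST_FREE_OBJECTID && objectid <= LAST_FREE_OBJECTID)
		return true;
	return objectid == ROOT_TREE_DIR_OBJECTID ||
	       objectid == FREE_INO_OBJECTID;
}

/* Generation/transid: no newer than super generation + 1 (log tree). */
static bool inode_generation_valid(uint64_t gen, uint64_t super_gen)
{
	return gen <= super_gen + 1;
}
```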
Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount
On Tue, Mar 26, 2019 at 12:44 AM Andrei Borzenkov wrote:
>
>
> He has btrfs raid0 profile on top of hardware RAID6 devices.

sys_chunk_array[2048]:
    item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 1048576)
        length 4194304 owner 2 stripe_len 65536 type SYSTEM
        io_align 4096 io_width 4096 sector_size 4096
        num_stripes 1

Pretty sure the metadata profile is "single". From the super, I can't
tell what profile the data block groups use.

--
Chris Murphy
Re: backup uuid_tree generation not consistent across multi device (raid0) btrfs - won´t mount
On Tue, Mar 26, 2019 at 11:38 AM Chris Murphy wrote:
>
> On Tue, Mar 26, 2019 at 12:44 AM Andrei Borzenkov wrote:
> >
> >
> > He has btrfs raid0 profile on top of hardware RAID6 devices.
>
> sys_chunk_array[2048]:
>     item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 1048576)
>         length 4194304 owner 2 stripe_len 65536 type SYSTEM
>         io_align 4096 io_width 4096 sector_size 4096
>         num_stripes 1
>
> Pretty sure the metadata profile is "single". From the super, I can't
> tell what profile the data block groups use.

system chunk is on two devices:

num_stripes 1 sub_stripes 0
num_stripes 1 sub_stripes 1

Maybe it is raid0, but I thought dump super explicitly shows the
profile if it's not single, e.g. SYSTEM|DUP or SYSTEM|RAID1.

Only my single profile file systems lack a profile designation in the
super. But I admit I have no raid0 file systems.

--
Chris Murphy
[PATCH 05/15] btrfs: return whether extent is nocow or not
From: Goldwyn Rodrigues We require this to set the IOMAP_F_COW flag in iomap structure, in the later patches. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 2 +- fs/btrfs/inode.c | 9 +++-- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b7bbe5130a3b..2c49d3c46170 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3278,7 +3278,7 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location, struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, struct btrfs_root *root, int *was_new); int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh, - struct inode *inode, u64 start, u64 len); + struct inode *inode, u64 start, u64 len, int *nocow); struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, struct page *page, size_t pg_offset, u64 start, u64 end, int create); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 80184d0c3b52..c8702e0b5e66 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7499,12 +7499,15 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em, int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh, struct inode *inode, - u64 start, u64 len) + u64 start, u64 len, int *nocow) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct extent_map *em = *map; int ret = 0; + if (nocow) + *nocow = 0; + /* * We don't allocate a new extent in the following cases * @@ -7553,6 +7556,8 @@ int btrfs_get_extent_map_write(struct extent_map **map, */ btrfs_free_reserved_data_space_noquota(inode, start, len); + if (nocow) + *nocow = 1; /* skip COW */ goto out; } @@ -7579,7 +7584,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, struct extent_map *em; ret = btrfs_get_extent_map_write(map, bh_result, inode, - start, len); + start, len, NULL); if (ret < 0) return ret; em = *map; -- 2.16.4
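The patch above adds an optional out-parameter: callers that care whether the write can skip COW pass a pointer, and the existing direct I/O caller passes NULL. The pattern in isolation looks roughly like the sketch below. `get_extent_for_write()` and its `can_write_in_place` argument are hypothetical stand-ins for illustration; the real function decides this from the extent map, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a function with an optional "nocow" result.
 * Returns 0 on success. If nocow is non-NULL it is always written:
 * 1 when the existing extent can be overwritten in place, 0 otherwise.
 */
static int get_extent_for_write(bool can_write_in_place, int *nocow)
{
	/* Initialise the result first so every return path leaves it defined. */
	if (nocow)
		*nocow = 0;

	if (can_write_in_place) {
		if (nocow)
			*nocow = 1; /* skip COW */
		return 0;
	}

	/* In this model the COW path does nothing further and succeeds. */
	return 0;
}
```

Setting `*nocow = 0` at the top, as the patch does, means a caller never reads an uninitialised value even when the function fails early.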
[PATCH 06/15] btrfs: Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()
From: Goldwyn Rodrigues Since we will be using it in another part of the code, use a better name to declare it non-static Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 7 +-- fs/btrfs/inode.c | 14 +- 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2c49d3c46170..a3543a4a063d 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3280,8 +3280,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh, struct inode *inode, u64 start, u64 len, int *nocow); struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, - struct page *page, size_t pg_offset, - u64 start, u64 end, int create); + struct page *page, size_t pg_offset, + u64 start, u64 end, int create); +void btrfs_update_ordered_extent(struct inode *inode, + const u64 offset, const u64 bytes, + const bool uptodate); int btrfs_update_inode(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct inode *inode); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index c8702e0b5e66..f721fc1e3f7f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -98,10 +98,6 @@ static struct extent_map *create_io_em(struct inode *inode, u64 start, u64 len, u64 ram_bytes, int compress_type, int type); -static void __endio_write_update_ordered(struct inode *inode, -const u64 offset, const u64 bytes, -const bool uptodate); - /* * Cleanup all submitted ordered extents in specified range to handle errors * from the btrfs_run_delalloc_range() callback. 
@@ -142,7 +138,7 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode, bytes -= PAGE_SIZE; } - return __endio_write_update_ordered(inode, offset, bytes, false); + return btrfs_update_ordered_extent(inode, offset, bytes, false); } static int btrfs_dirty_inode(struct inode *inode); @@ -8085,7 +8081,7 @@ static void btrfs_endio_direct_read(struct bio *bio) bio_put(bio); } -static void __endio_write_update_ordered(struct inode *inode, +void btrfs_update_ordered_extent(struct inode *inode, const u64 offset, const u64 bytes, const bool uptodate) { @@ -8138,7 +8134,7 @@ static void btrfs_endio_direct_write(struct bio *bio) struct btrfs_dio_private *dip = bio->bi_private; struct bio *dio_bio = dip->dio_bio; - __endio_write_update_ordered(dip->inode, dip->logical_offset, + btrfs_update_ordered_extent(dip->inode, dip->logical_offset, dip->bytes, !bio->bi_status); kfree(dip); @@ -8457,7 +8453,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, bio = NULL; } else { if (write) - __endio_write_update_ordered(inode, + btrfs_update_ordered_extent(inode, file_offset, dio_bio->bi_iter.bi_size, false); @@ -8597,7 +8593,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter) */ if (dio_data.unsubmitted_oe_range_start < dio_data.unsubmitted_oe_range_end) - __endio_write_update_ordered(inode, + btrfs_update_ordered_extent(inode, dio_data.unsubmitted_oe_range_start, dio_data.unsubmitted_oe_range_end - dio_data.unsubmitted_oe_range_start, -- 2.16.4
[PATCH 02/15] btrfs: Carve out btrfs_get_extent_map_write() out of btrfs_get_blocks_write()
From: Goldwyn Rodrigues This makes btrfs_get_extent_map_write() independent of Direct I/O code. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/inode.c | 40 +++- 2 files changed, 29 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8ca1c0d120f4..9512f49262dd 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3277,6 +3277,8 @@ struct inode *btrfs_iget_path(struct super_block *s, struct btrfs_key *location, struct btrfs_path *path); struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location, struct btrfs_root *root, int *was_new); +int btrfs_get_extent_map_write(struct extent_map **map, struct buffer_head *bh, + struct inode *inode, u64 start, u64 len); struct extent_map *btrfs_get_extent(struct btrfs_inode *inode, struct page *page, size_t pg_offset, u64 start, u64 end, int create); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 82fdda8ff5ab..80184d0c3b52 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7496,11 +7496,10 @@ static int btrfs_get_blocks_direct_read(struct extent_map *em, return 0; } -static int btrfs_get_blocks_direct_write(struct extent_map **map, -struct buffer_head *bh_result, -struct inode *inode, -struct btrfs_dio_data *dio_data, -u64 start, u64 len) +int btrfs_get_extent_map_write(struct extent_map **map, + struct buffer_head *bh, + struct inode *inode, + u64 start, u64 len) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct extent_map *em = *map; @@ -7554,22 +7553,38 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, */ btrfs_free_reserved_data_space_noquota(inode, start, len); - goto skip_cow; + /* skip COW */ + goto out; } } /* this will cow the extent */ - len = bh_result->b_size; + if (bh) + len = bh->b_size; free_extent_map(em); *map = em = btrfs_new_extent_direct(inode, start, len); - if (IS_ERR(em)) { - ret = PTR_ERR(em); - goto out; - } + if (IS_ERR(em)) + return PTR_ERR(em); +out: + return ret; +} +static int 
btrfs_get_blocks_direct_write(struct extent_map **map, +struct buffer_head *bh_result, +struct inode *inode, +struct btrfs_dio_data *dio_data, +u64 start, u64 len) +{ + int ret = 0; + struct extent_map *em; + + ret = btrfs_get_extent_map_write(map, bh_result, inode, + start, len); + if (ret < 0) + return ret; + em = *map; len = min(len, em->len - (start - em->start)); -skip_cow: bh_result->b_blocknr = (em->block_start + (start - em->start)) >> inode->i_blkbits; bh_result->b_size = len; @@ -7590,7 +7605,6 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, dio_data->reserve -= len; dio_data->unsubmitted_oe_range_end = start + len; current->journal_info = dio_data; -out: return ret; } -- 2.16.4
[PATCH 03/15] btrfs: basic dax read
From: Goldwyn Rodrigues Perform a basic read using iomap support. The btrfs_iomap_begin() finds the extent at the position and fills the iomap data structure with the values. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/Makefile | 1 + fs/btrfs/ctree.h | 5 + fs/btrfs/dax.c| 49 + fs/btrfs/file.c | 12 +++- 4 files changed, 66 insertions(+), 1 deletion(-) create mode 100644 fs/btrfs/dax.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ca693dd554e9..1fa77b875ae9 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ uuid-tree.o props.o free-space-tree.o tree-checker.o +btrfs-$(CONFIG_FS_DAX) += dax.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 9512f49262dd..b7bbe5130a3b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle); void btrfs_reada_detach(void *handle); int btree_readahead_hook(struct extent_buffer *eb, int err); +#ifdef CONFIG_FS_DAX +/* dax.c */ +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); +#endif /* CONFIG_FS_DAX */ + static inline int is_fstree(u64 rootid) { if (rootid == BTRFS_FS_TREE_OBJECTID || diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c new file mode 100644 index ..bf3d46b0acb6 --- /dev/null +++ b/fs/btrfs/dax.c @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * DAX support for BTRFS + * + * Copyright (c) 2019 SUSE Linux + * Author: Goldwyn Rodrigues + */ + +#ifdef CONFIG_FS_DAX +#include +#include +#include "ctree.h" +#include "btrfs_inode.h" + +static int btrfs_iomap_begin(struct inode *inode, loff_t pos, + loff_t length, unsigned flags, struct iomap *iomap) +{ + struct extent_map *em; + struct btrfs_fs_info *fs_info = 
btrfs_sb(inode->i_sb); + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0); + if (em->block_start == EXTENT_MAP_HOLE) { + iomap->type = IOMAP_HOLE; + return 0; + } + iomap->type = IOMAP_MAPPED; + iomap->bdev = em->bdev; + iomap->dax_dev = fs_info->dax_dev; + iomap->offset = em->start; + iomap->length = em->len; + iomap->addr = em->block_start; + return 0; +} + +static const struct iomap_ops btrfs_iomap_ops = { + .iomap_begin= btrfs_iomap_begin, +}; + +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to) +{ + ssize_t ret; + struct inode *inode = file_inode(iocb->ki_filp); + + inode_lock_shared(inode); + ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops); + inode_unlock_shared(inode); + + return ret; +} +#endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 34fe8a58b0e9..b620f4e718b2 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3288,9 +3288,19 @@ static int btrfs_file_open(struct inode *inode, struct file *filp) return generic_file_open(inode, filp); } +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to) +{ +#ifdef CONFIG_FS_DAX + struct inode *inode = file_inode(iocb->ki_filp); + if (IS_DAX(file_inode(iocb->ki_filp))) + return btrfs_file_dax_read(iocb, to); +#endif + return generic_file_read_iter(iocb, to); +} + const struct file_operations btrfs_file_operations = { .llseek = btrfs_file_llseek, - .read_iter = generic_file_read_iter, + .read_iter = btrfs_file_read_iter, .splice_read= generic_file_splice_read, .write_iter = btrfs_file_write_iter, .mmap = btrfs_file_mmap, -- 2.16.4
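Editorial sketch: btrfs_iomap_begin() in the patch above translates an extent map into an iomap, reporting either a hole or a mapped range. The userspace model below illustrates that translation and how a file position is then turned into a device byte address. All `model_*` names and the `MODEL_HOLE` sentinel are hypothetical stand-ins, not kernel definitions. As a choice of this sketch (not something the patch does), the range is filled in for holes as well, so a caller can tell how far the hole extends.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for EXTENT_MAP_HOLE. */
#define MODEL_HOLE UINT64_MAX

enum model_type { MODEL_IOMAP_HOLE, MODEL_IOMAP_MAPPED };

/* Simplified stand-ins for struct extent_map and struct iomap. */
struct model_extent {
	uint64_t start;       /* file offset where the extent begins */
	uint64_t len;         /* length of the extent in bytes */
	uint64_t block_start; /* device byte address, or MODEL_HOLE */
};

struct model_iomap {
	enum model_type type;
	uint64_t offset;      /* file offset of the mapping */
	uint64_t length;      /* length of the mapping in bytes */
	uint64_t addr;        /* device byte address (mapped only) */
};

/* Fill the iomap from the extent, distinguishing holes from mapped data. */
static void model_iomap_begin(const struct model_extent *em,
			      struct model_iomap *iomap)
{
	iomap->offset = em->start;
	iomap->length = em->len;

	if (em->block_start == MODEL_HOLE) {
		iomap->type = MODEL_IOMAP_HOLE;
		iomap->addr = 0;
		return;
	}

	iomap->type = MODEL_IOMAP_MAPPED;
	iomap->addr = em->block_start;
}

/* Device byte address backing file position pos inside a mapped iomap. */
static uint64_t model_addr_of(const struct model_iomap *iomap, uint64_t pos)
{
	return iomap->addr + (pos - iomap->offset);
}
```

The real function would also need to handle an error pointer returned by the extent lookup; that is left out of the model.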
[no subject]
Subject: [PATCH v2 00/15] btrfs dax support

This patch set adds support for dax on the BTRFS filesystem. To support CoW on btrfs, changes had to be made to the dax handling. The most important one is copying blocks into the same dax device before using them.

I have some open questions, which I have put in the commit messages of the individual patches.

Git: https://github.com/goldwynr/linux/tree/btrfs-dax

Changes since V1:
- use iomap instead of redoing everything in btrfs
- support for write-protecting mmap pages on snapshotting
[PATCH 04/15] dax: Introduce IOMAP_F_COW for copy-on-write
From: Goldwyn Rodrigues The IOMAP_F_COW is a flag to notify dax that it needs to copy the data from iomap->cow_addr to iomap->addr, if the start/end of I/O are not page aligned. This also introduces dax_to_dax_copy() which performs a copy from one part of the device to another, to a maximum of one page. Question: Using iomap.cow_addr == 0 means the CoW is to be copied (or memset) from a hole. Would this be better handled through a flag? Signed-off-by: Goldwyn Rodrigues --- fs/dax.c | 36 include/linux/iomap.h | 3 +++ 2 files changed, 39 insertions(+) diff --git a/fs/dax.c b/fs/dax.c index ca0671d55aa6..e254535dd830 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1051,6 +1051,28 @@ static bool dax_range_is_aligned(struct block_device *bdev, return true; } +static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr, + size_t len) +{ + loff_t blk_start, blk_pg; + void *saddr; + ssize_t map_len; + + /* A zero address is a hole. */ + if (iomap->cow_addr == 0) { + memset(daddr, 0, len); + return; + } + + blk_start = iomap->cow_addr + pos - iomap->cow_pos; + blk_pg = round_down(blk_start, PAGE_SIZE); + + map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE, + &saddr, NULL); + saddr += blk_start - blk_pg; + memcpy(daddr, saddr, len); +} + int __dax_zero_page_range(struct block_device *bdev, struct dax_device *dax_dev, sector_t sector, unsigned int offset, unsigned int size) @@ -1143,6 +1165,20 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, break; } + if (iomap->flags & IOMAP_F_COW) { + loff_t pg_end = round_up(end, PAGE_SIZE); + /* +* Copy the first part of the page +* Note: we pass offset as length +*/ + if (offset) + dax_to_dax_copy(iomap, pos - offset, kaddr, offset); + + /* Copy the last part of the range */ + if (end < pg_end) + dax_to_dax_copy(iomap, end, kaddr + offset + length, pg_end - end); + } + map_len = PFN_PHYS(map_len); kaddr += offset; map_len -= offset; diff --git a/include/linux/iomap.h 
b/include/linux/iomap.h index 0fefb5455bda..391785de1428 100644 --- a/include/linux/iomap.h +++ b/include/linux/iomap.h @@ -35,6 +35,7 @@ struct vm_fault; #define IOMAP_F_NEW0x01/* blocks have been newly allocated */ #define IOMAP_F_DIRTY 0x02/* uncommitted metadata */ #define IOMAP_F_BUFFER_HEAD0x04/* file system requires buffer heads */ +#define IOMAP_F_COW0x08/* cow before write */ /* * Flags that only need to be reported for IOMAP_REPORT requests: @@ -59,6 +60,8 @@ struct iomap { u64 length; /* length of mapping, bytes */ u16 type; /* type of mapping */ u16 flags; /* flags for mapping */ + u64 cow_addr; /* read address to perform CoW */ + loff_t cow_pos; /* file offset of cow_addr */ struct block_device *bdev; /* block device for I/O */ struct dax_device *dax_dev; /* dax_dev for dax operations */ void*inline_data; -- 2.16.4
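Editorial sketch: the IOMAP_F_COW handling above fills the parts of a destination page that the write does not cover, copying them from the CoW source or zeroing them when the source is a hole. A minimal userspace model of that head/tail logic, assuming a tiny 16-byte "page" and a hypothetical `model_cow_edges()` helper (a NULL source pointer plays the role of `cow_addr == 0`), could be:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Deliberately tiny page size so the model is easy to inspect. */
#define MODEL_PAGE_SIZE 16u

/*
 * Fill the bytes of the destination page that lie outside the written
 * range [offset, offset + len). They come from the source page, or are
 * zeroed when src is NULL (the source is a hole).
 * Precondition: offset + len <= MODEL_PAGE_SIZE.
 */
static void model_cow_edges(unsigned char *dst, const unsigned char *src,
			    size_t offset, size_t len)
{
	size_t end = offset + len;

	/* Head: everything before the start of the write. */
	if (offset) {
		if (src)
			memcpy(dst, src, offset);
		else
			memset(dst, 0, offset);
	}

	/* Tail: everything after the end of the write. */
	if (end < MODEL_PAGE_SIZE) {
		if (src)
			memcpy(dst + end, src + end, MODEL_PAGE_SIZE - end);
		else
			memset(dst + end, 0, MODEL_PAGE_SIZE - end);
	}
}
```

A page-aligned write that covers the whole page touches neither edge, which matches the commit message's remark that the copy only matters when the start or end of the I/O is not page aligned.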
[PATCH 01/15] btrfs: create a mount option for dax
From: Goldwyn Rodrigues This sets S_DAX in inode->i_flags, which can be tested with IS_DAX(). The dax option is restricted to single-device mounts. dax interacts with the device directly instead of using bios, so none of the bio hooks which we use for multi-device can be performed here. While regular reads/writes could be manipulated with RAID0/1, mmap() is still an issue. Auto-set the free space tree, because dealing with the free space inode (specifically readpages) is a nightmare. Auto-set nodatasum, because we don't get a callback for writing checksums after mmap()s. Store the dax_device in fs_info, which will be used in the iomap code. Question: Since we have only one dax device, I thought fs_info is the best place. However, should it be moved to btrfs_device? Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/disk-io.c | 4 fs/btrfs/ioctl.c | 5 - fs/btrfs/super.c | 26 ++ 4 files changed, 36 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b3642367a595..8ca1c0d120f4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1067,6 +1067,7 @@ struct btrfs_fs_info { u32 metadata_ratio; void *bdev_holder; + struct dax_device *dax_dev; /* private scrub information */ struct mutex scrub_lock; @@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info) #define BTRFS_MOUNT_FREE_SPACE_TREE (1 << 26) #define BTRFS_MOUNT_NOLOGREPLAY (1 << 27) #define BTRFS_MOUNT_REF_VERIFY (1 << 28) +#define BTRFS_MOUNT_DAX (1 << 29) #define BTRFS_DEFAULT_COMMIT_INTERVAL (30) #define BTRFS_DEFAULT_MAX_INLINE (2048) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 6fe9197f6ee4..2bbb63b2fcff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb, goto fail_alloc; } + fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev); + /* * We want to check superblock 
checksum, the type is stored inside. * Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k). @@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) #endif btrfs_close_devices(fs_info->fs_devices); + fs_put_dax(fs_info->dax_dev); btrfs_mapping_tree_free(&fs_info->mapping_tree); percpu_counter_destroy(&fs_info->dirty_metadata_bytes); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index ec2d8919e7fb..e66426e7692d 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode) if (binode->flags & BTRFS_INODE_DIRSYNC) new_fl |= S_DIRSYNC; + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode)) + new_fl |= S_DAX; + set_mask_bits(&inode->i_flags, - S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC, + S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX, new_fl); } diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 120e4340792a..2d448b9d6004 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -326,6 +326,7 @@ enum { Opt_treelog, Opt_notreelog, Opt_usebackuproot, Opt_user_subvol_rm_allowed, + Opt_dax, /* Deprecated options */ Opt_alloc_start, @@ -393,6 +394,7 @@ static const match_table_t tokens = { {Opt_notreelog, "notreelog"}, {Opt_usebackuproot, "usebackuproot"}, {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"}, + {Opt_dax, "dax"}, /* Deprecated options */ {Opt_alloc_start, "alloc_start=%s"}, @@ -745,6 +747,28 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, case Opt_user_subvol_rm_allowed: btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED); break; + case Opt_dax: +#ifdef CONFIG_FS_DAX + if (btrfs_super_num_devices(info->super_copy) > 1) { + btrfs_info(info, + "dax not supported for multi-device btrfs partition\n"); + ret = -EOPNOTSUPP; + goto out; + } + btrfs_set_opt(info->mount_opt, DAX); + btrfs_warn(info, "DAX enabled. 
Warning: EXPERIMENTAL, use at your own risk\n"); + btrfs_set_and_info(info, NODATASUM, + "auto-setting nodatasum (dax)"); + btrfs_clear_opt(info->mount_opt, SPACE_CACHE); + btrfs_set_and_info(in
[PATCH 14/15] btrfs: Disable dax-based defrag and send
From: Goldwyn Rodrigues This is temporary, and a TODO. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ioctl.c | 13 + fs/btrfs/send.c | 4 2 files changed, 17 insertions(+) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 2e5137b01561..f532a8df2026 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2980,6 +2980,12 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp) goto out; } + if (IS_DAX(inode)) { + btrfs_warn(root->fs_info, "File defrag is not supported with DAX"); + ret = -EOPNOTSUPP; + goto out; + } + if (argp) { if (copy_from_user(range, argp, sizeof(*range))) { @@ -4647,6 +4653,10 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg) if (!capable(CAP_SYS_ADMIN)) return -EPERM; + /* send can be on a directory, so check super block instead */ + if (btrfs_test_opt(fs_info, DAX)) + return -EOPNOTSUPP; + ret = mnt_want_write_file(file); if (ret) return ret; @@ -5499,6 +5509,9 @@ static int _btrfs_ioctl_send(struct file *file, void __user *argp, bool compat) struct btrfs_ioctl_send_args *arg; int ret; + if (IS_DAX(file_inode(file))) + return -EOPNOTSUPP; + if (compat) { #if defined(CONFIG_64BIT) && defined(CONFIG_COMPAT) struct btrfs_ioctl_send_args_32 args32; diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 7ea2d6b1f170..9679fd54db86 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -6609,6 +6609,10 @@ long btrfs_ioctl_send(struct file *mnt_file, struct btrfs_ioctl_send_args *arg) int sort_clone_roots = 0; int index; + /* send can be on a directory, so check super block instead */ + if (btrfs_test_opt(fs_info, DAX)) + return -EOPNOTSUPP; + if (!capable(CAP_SYS_ADMIN)) return -EPERM; -- 2.16.4
[PATCH 09/15] btrfs: add dax mmap support
From: Goldwyn Rodrigues Add a new vm_operations_struct, btrfs_dax_vm_ops, specifically for dax files. Since we will be removing (nulling) readpages/writepages for dax, return ENOEXEC only for non-dax files. dax_insert_entry() looks ugly. Do you think we should break it into dax_insert_cow_entry() and dax_insert_entry()? Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 1 + fs/btrfs/dax.c | 11 +++ fs/btrfs/file.c | 18 -- fs/dax.c | 17 ++--- 4 files changed, 38 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 3bcd2a4959c1..0e5060933bde 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3802,6 +3802,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); /* dax.c */ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); +vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); #else static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) { diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index 49619fe3f94f..927f962d1e88 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -157,4 +157,15 @@ ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *iter) } return ret; } + +vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) +{ + vm_fault_t ret; + pfn_t pfn; + ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops); + if (ret & VM_FAULT_NEEDDSYNC) + ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn); + + return ret; +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 3b320d0ab495..196c8f37ff9d 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2214,15 +2214,29 @@ static const struct vm_operations_struct btrfs_file_vm_ops = { .page_mkwrite = btrfs_page_mkwrite, }; +#ifdef CONFIG_FS_DAX +static const struct vm_operations_struct btrfs_dax_vm_ops = { + .fault = btrfs_dax_fault, + .page_mkwrite = btrfs_dax_fault, + .pfn_mkwrite= btrfs_dax_fault, +}; +#else +#define 
btrfs_dax_vm_ops btrfs_file_vm_ops +#endif + static int btrfs_file_mmap(struct file *filp, struct vm_area_struct *vma) { struct address_space *mapping = filp->f_mapping; + struct inode *inode = file_inode(filp); - if (!mapping->a_ops->readpage) + if (!IS_DAX(inode) && !mapping->a_ops->readpage) return -ENOEXEC; file_accessed(filp); - vma->vm_ops = &btrfs_file_vm_ops; + if (IS_DAX(inode)) + vma->vm_ops = &btrfs_dax_vm_ops; + else + vma->vm_ops = &btrfs_file_vm_ops; return 0; } diff --git a/fs/dax.c b/fs/dax.c index 21ee3df6f02c..41061da42771 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -708,14 +708,15 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev, */ static void *dax_insert_entry(struct xa_state *xas, struct address_space *mapping, struct vm_fault *vmf, - void *entry, pfn_t pfn, unsigned long flags, bool dirty) + void *entry, pfn_t pfn, unsigned long flags, bool dirty, + bool cow) { void *new_entry = dax_make_entry(pfn, flags); if (dirty) __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) { + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) { unsigned long index = xas->xa_index; /* we are replacing a zero page with block mapping */ if (dax_is_pmd_entry(entry)) @@ -732,7 +733,7 @@ static void *dax_insert_entry(struct xa_state *xas, dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address); } - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) { /* * Only swap our new entry into the page cache if the current * entry is a zero page or an empty entry. 
If a normal PTE or @@ -1031,7 +1032,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, vm_fault_t ret; *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, - DAX_ZERO_PAGE, false); + DAX_ZERO_PAGE, false, false); ret = vmf_insert_mixed(vmf->vma, vaddr, pfn); trace_dax_load_hole(inode, vmf, ret); @@ -1408,7 +1409,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, goto error_finish_iomap; entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn, -0, write && !sync); +0, write && !sync, + (i
[PATCH 08/15] dax: add dax_iomap_cow to copy a mmap page before writing
From: Goldwyn Rodrigues dax_iomap_cow copies a page before presenting for mmap. Signed-off-by: Goldwyn Rodrigues --- fs/dax.c | 33 - 1 file changed, 32 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index e254535dd830..21ee3df6f02c 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1269,6 +1269,33 @@ static bool dax_fault_is_synchronous(unsigned long flags, && (iomap->flags & IOMAP_F_DIRTY); } +static int dax_iomap_cow(struct iomap *iomap, loff_t pos, pfn_t *pfn) +{ + void *daddr; + pgoff_t pgoff; + long rc; + int id; + sector_t sector; + + pos = round_down(pos, PAGE_SIZE); + + sector = round_down(iomap->addr + iomap->offset - pos, PAGE_SIZE) >> 9; + rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff); + if (rc) + return rc; + + id = dax_read_lock(); + rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &daddr, pfn); + if (rc < 0) + goto out; + + dax_to_dax_copy(iomap, pos, daddr, PAGE_SIZE); + +out: + dax_read_unlock(id); + return rc; +} + static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, int *iomap_errp, const struct iomap_ops *ops) { @@ -1372,7 +1399,11 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp, count_memcg_event_mm(vma->vm_mm, PGMAJFAULT); major = VM_FAULT_MAJOR; } - error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn); + + if (iomap.flags & IOMAP_F_COW) + error = dax_iomap_cow(&iomap, pos, &pfn); + else + error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn); if (error < 0) goto error_finish_iomap; -- 2.16.4
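Editorial sketch: dax_iomap_cow() above has to turn a file position into a 512-byte sector on the device before calling bdev_dax_pgoff(). The userspace model below shows one way to do that arithmetic for the page containing `pos`, using the form "device address plus page-aligned position minus mapping offset". This ordering is an assumption of the sketch; the patch as posted writes the expression as `iomap->addr + iomap->offset - pos`, and it is not clear from the posting whether that difference is intentional. The `model_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ull

/* Round v down to a multiple of align. */
static uint64_t model_round_down(uint64_t v, uint64_t align)
{
	return v - (v % align);
}

/*
 * 512-byte sector on the device that backs the page containing pos.
 * addr:   device byte address where the mapping starts
 * offset: file offset where the mapping starts
 * pos:    file position being faulted (must be >= offset)
 */
static uint64_t model_sector(uint64_t addr, uint64_t offset, uint64_t pos)
{
	uint64_t page_pos = model_round_down(pos, MODEL_PAGE_SIZE);

	return (addr + page_pos - offset) >> 9;
}
```

For a mapping that starts at file offset 8192 and device address 1 MiB, a fault at position 12300 lands in the second page of the mapping, 4096 bytes (8 sectors) past the mapping's first sector.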
[PATCH 15/15] btrfs: Writeprotect mmap pages on snapshot
From: Goldwyn Rodrigues In order to make sure mmap'd files don't change after a snapshot, write-protect the mmap pages on snapshot. This is done by performing a data writeback on the pages (which simply marks the pages as write-protected). This way, if the user process tries to write to the memory, we get another fault and can perform a CoW. To accomplish this, we tag all CoW pages as PAGECACHE_TAG_TOWRITE and add the mmap'd inode to delalloc_inodes. During snapshot, writeback is started on all delalloc'd inodes, and that is where we perform the data writeback. We don't want to keep the inodes in delalloc_inodes until umount (WARN_ON), so we remove them during inode eviction. This looks hackish. Other alternatives could be to create another list for mmap'd files or to rename delalloc_inodes to writeback_inodes. Suggestions? Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/dax.c | 7 +++ fs/btrfs/inode.c | 13 - fs/dax.c | 3 +++ 4 files changed, 24 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 21068dc4a95a..68a63d93556a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3252,7 +3252,8 @@ int btrfs_create_subvol_root(struct btrfs_trans_handle *trans, struct btrfs_root *new_root, struct btrfs_root *parent_root, u64 new_dirid); - void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state, +void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode); +void btrfs_set_delalloc_extent(struct inode *inode, struct extent_state *state, unsigned *bits); void btrfs_clear_delalloc_extent(struct inode *inode, struct extent_state *state, unsigned *bits); diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index d73945d50b88..bcb961242c74 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -166,10 +166,17 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) { vm_fault_t ret; pfn_t pfn; + struct inode *inode = file_inode(vmf->vma->vm_file); + struct btrfs_inode *binode = BTRFS_I(inode); ret = 
dax_iomap_fault(vmf, PE_SIZE_PTE, &pfn, NULL, &btrfs_iomap_ops); if (ret & VM_FAULT_NEEDDSYNC) ret = dax_finish_sync_fault(vmf, PE_SIZE_PTE, pfn); + /* Insert into delalloc so we get writeback calls on snapshots */ + if (vmf->flags & FAULT_FLAG_WRITE && + !test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags)) + btrfs_add_delalloc_inodes(binode->root, inode); + return ret; } diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 5350e5f23728..3b72c1c96b34 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1713,7 +1713,7 @@ void btrfs_merge_delalloc_extent(struct inode *inode, struct extent_state *new, spin_unlock(&BTRFS_I(inode)->lock); } -static void btrfs_add_delalloc_inodes(struct btrfs_root *root, +void btrfs_add_delalloc_inodes(struct btrfs_root *root, struct inode *inode) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); @@ -5358,12 +5358,17 @@ void btrfs_evict_inode(struct inode *inode) { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_trans_handle *trans; + struct btrfs_inode *binode = BTRFS_I(inode); struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_block_rsv *rsv; int ret; trace_btrfs_inode_evict(inode); + if (IS_DAX(inode) + && test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags)) + btrfs_del_delalloc_inode(root, binode); + if (!root) { clear_inode(inode); return; @@ -8683,6 +8688,10 @@ static int btrfs_dax_writepages(struct address_space *mapping, { struct inode *inode = mapping->host; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_inode *binode = BTRFS_I(inode); + if ((wbc->sync_mode == WB_SYNC_ALL) && + test_bit(BTRFS_INODE_IN_DELALLOC_LIST, &binode->runtime_flags)) + btrfs_del_delalloc_inode(binode->root, binode); return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev, wbc); } @@ -9981,6 +9990,8 @@ static void btrfs_run_delalloc_work(struct btrfs_work *work) delalloc_work = container_of(work, struct btrfs_delalloc_work, work); inode = 
delalloc_work->inode; + if (IS_DAX(inode)) + filemap_fdatawrite(inode->i_mapping); filemap_flush(inode->i_mapping); if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &BTRFS_I(inode)->runtime_flags)) diff --git a/fs/dax.c b/fs/dax.c index 93146142bb00..c42e9cb486ef 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -7
[PATCH 10/15] btrfs: Add dax specific address_space_operations
From: Goldwyn Rodrigues Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/inode.c | 34 +++--- 1 file changed, 31 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index f721fc1e3f7f..21780ea14e5a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -28,6 +28,7 @@ #include #include #include +#include #include #include "ctree.h" #include "disk-io.h" @@ -65,6 +66,7 @@ static const struct inode_operations btrfs_dir_ro_inode_operations; static const struct inode_operations btrfs_special_inode_operations; static const struct inode_operations btrfs_file_inode_operations; static const struct address_space_operations btrfs_aops; +static const struct address_space_operations btrfs_dax_aops; static const struct file_operations btrfs_dir_file_operations; static const struct extent_io_ops btrfs_extent_io_ops; @@ -3757,7 +3759,10 @@ static int btrfs_read_locked_inode(struct inode *inode, switch (inode->i_mode & S_IFMT) { case S_IFREG: - inode->i_mapping->a_ops = &btrfs_aops; + if (btrfs_test_opt(fs_info, DAX)) + inode->i_mapping->a_ops = &btrfs_dax_aops; + else + inode->i_mapping->a_ops = &btrfs_aops; BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops; inode->i_fop = &btrfs_file_operations; inode->i_op = &btrfs_file_inode_operations; @@ -3778,6 +3783,7 @@ static int btrfs_read_locked_inode(struct inode *inode, } btrfs_sync_inode_flags_to_i_flags(inode); + return 0; } @@ -6538,7 +6544,10 @@ static int btrfs_create(struct inode *dir, struct dentry *dentry, */ inode->i_fop = &btrfs_file_operations; inode->i_op = &btrfs_file_inode_operations; - inode->i_mapping->a_ops = &btrfs_aops; + if (IS_DAX(inode) && S_ISREG(mode)) + inode->i_mapping->a_ops = &btrfs_dax_aops; + else + inode->i_mapping->a_ops = &btrfs_aops; err = btrfs_init_inode_security(trans, inode, dir, &dentry->d_name); if (err) @@ -8665,6 +8674,15 @@ static int btrfs_writepages(struct address_space *mapping, return extent_writepages(mapping, wbc); } +static int btrfs_dax_writepages(struct 
address_space *mapping, + struct writeback_control *wbc) +{ + struct inode *inode = mapping->host; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + return dax_writeback_mapping_range(mapping, fs_info->fs_devices->latest_bdev, + wbc); +} + static int btrfs_readpages(struct file *file, struct address_space *mapping, struct list_head *pages, unsigned nr_pages) @@ -10436,7 +10454,10 @@ static int btrfs_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) inode->i_fop = &btrfs_file_operations; inode->i_op = &btrfs_file_inode_operations; - inode->i_mapping->a_ops = &btrfs_aops; + if (IS_DAX(inode)) + inode->i_mapping->a_ops = &btrfs_dax_aops; + else + inode->i_mapping->a_ops = &btrfs_aops; BTRFS_I(inode)->io_tree.ops = &btrfs_extent_io_ops; ret = btrfs_init_inode_security(trans, inode, dir, NULL); @@ -10892,6 +10913,13 @@ static const struct address_space_operations btrfs_aops = { .swap_deactivate = btrfs_swap_deactivate, }; +static const struct address_space_operations btrfs_dax_aops = { + .writepages = btrfs_dax_writepages, + .direct_IO = noop_direct_IO, + .set_page_dirty = noop_set_page_dirty, + .invalidatepage = noop_invalidatepage, +}; + static const struct inode_operations btrfs_file_inode_operations = { .getattr= btrfs_getattr, .setattr= btrfs_setattr, -- 2.16.4
[PATCH 12/15] btrfs: trace functions for btrfs_iomap_begin/end
From: Goldwyn Rodrigues This is for debug purposes only and can be skipped. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/dax.c | 3 +++ include/trace/events/btrfs.h | 56 2 files changed, 59 insertions(+) diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index 9488cae0f8b4..7900b5773829 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -27,6 +27,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos, struct extent_map *em; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + trace_btrfs_iomap_begin(inode, pos, length, flags); + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0); if (flags & IOMAP_WRITE) { @@ -103,6 +105,7 @@ static int btrfs_iomap_end(struct inode *inode, loff_t pos, { struct btrfs_iomap *bi = iomap->private; u64 wend; + trace_btrfs_iomap_end(inode, pos, length, written, flags); if (!bi) return 0; diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index ab1cc33adbac..8779e5789a7c 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1850,6 +1850,62 @@ DEFINE_EVENT(btrfs__block_group, btrfs_skip_unused_block_group, TP_ARGS(bg_cache) ); +TRACE_EVENT(btrfs_iomap_begin, + + TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, int flags), + + TP_ARGS(inode, pos, length, flags), + + TP_STRUCT__entry_btrfs( + __field(u64,ino ) + __field(u64,pos ) + __field(u64,length ) + __field(int,flags ) + ), + + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), + __entry->ino= btrfs_ino(BTRFS_I(inode)); + __entry->pos= pos; + __entry->length = length; + __entry->flags = flags; + ), + + TP_printk_btrfs("ino=%llu pos=%llu len=%llu flags=0x%x", + __entry->ino, + __entry->pos, + __entry->length, + __entry->flags) +); + +TRACE_EVENT(btrfs_iomap_end, + + TP_PROTO(const struct inode *inode, loff_t pos, loff_t length, loff_t written, int flags), + + TP_ARGS(inode, pos, length, written, flags), + + TP_STRUCT__entry_btrfs( + __field(u64,ino ) + __field(u64,pos ) + __field(u64,length ) + 
__field(u64,written ) + __field(int,flags ) + ), + + TP_fast_assign_btrfs(btrfs_sb(inode->i_sb), + __entry->ino= btrfs_ino(BTRFS_I(inode)); + __entry->pos= pos; + __entry->length = length; + __entry->written= written; + __entry->flags = flags; + ), + + TP_printk_btrfs("ino=%llu pos=%llu len=%llu written=%llu flags=0x%x", + __entry->ino, + __entry->pos, + __entry->length, + __entry->written, + __entry->flags) +); #endif /* _TRACE_BTRFS_H */ /* This part must be outside protection */ -- 2.16.4
[PATCH 13/15] btrfs: handle dax page zeroing
From: Goldwyn Rodrigues btrfs_dax_zero_block() zeros part of the page, either from the front or the regular rest of the block. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 1 + fs/btrfs/dax.c| 29 +++-- fs/btrfs/inode.c | 4 fs/dax.c | 17 - fs/iomap.c| 9 + include/linux/dax.h | 11 +-- include/linux/iomap.h | 6 ++ 7 files changed, 56 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 750f9c70fabe..21068dc4a95a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3806,6 +3806,7 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, loff_t destoff, loff_t len, bool *is_same); +int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front); #else static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) { diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index 7900b5773829..d73945d50b88 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -31,7 +31,7 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos, em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0); - if (flags & IOMAP_WRITE) { + if (flags & (IOMAP_WRITE | IOMAP_ZERO)) { int ret = 0, nocow; struct extent_map *map = em; struct btrfs_iomap *bi; @@ -89,7 +89,8 @@ static int btrfs_iomap_begin(struct inode *inode, loff_t pos, iomap->bdev = em->bdev; iomap->dax_dev = fs_info->dax_dev; - if (em->block_start == EXTENT_MAP_HOLE) { + if (em->block_start == EXTENT_MAP_HOLE || + em->flags == EXTENT_FLAG_FILLING) { iomap->type = IOMAP_HOLE; return 0; } @@ -178,4 +179,28 @@ int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, { return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops); } + +/* + * zero a part of the page only. 
This should CoW (via iomap_begin) if required + */ +int btrfs_dax_zero_block(struct inode *inode, loff_t from, loff_t len, bool front) +{ + loff_t start = round_down(from, PAGE_SIZE); + loff_t end = round_up(from, PAGE_SIZE); + loff_t offset = from; + int ret = 0; + + if (front) { + len = from - start; + offset = start; + } else { + if (!len) + len = end - from; + } + + if (len) + ret = iomap_zero_range(inode, offset, len, NULL, &btrfs_iomap_ops); + + return (ret < 0) ? ret : 0; +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 21780ea14e5a..5350e5f23728 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4833,6 +4833,10 @@ int btrfs_truncate_block(struct inode *inode, loff_t from, loff_t len, (!len || IS_ALIGNED(len, blocksize))) goto out; +#ifdef CONFIG_FS_DAX + if (IS_DAX(inode)) + return btrfs_dax_zero_block(inode, from, len, front); +#endif block_start = round_down(from, blocksize); block_end = block_start + blocksize - 1; diff --git a/fs/dax.c b/fs/dax.c index 18998c5ee27a..93146142bb00 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1068,17 +1068,21 @@ static void dax_to_dax_copy(struct iomap *iomap, loff_t pos, void *daddr, blk_start = iomap->cow_addr + pos - iomap->cow_pos; blk_pg = round_down(blk_start, PAGE_SIZE); - map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), PAGE_SIZE, + map_len = dax_direct_access(iomap->dax_dev, PHYS_PFN(blk_pg), 1, &saddr, NULL); saddr += blk_start - blk_pg; memcpy(daddr, saddr, len); } -int __dax_zero_page_range(struct block_device *bdev, - struct dax_device *dax_dev, sector_t sector, - unsigned int offset, unsigned int size) +int __dax_zero_page_range(struct iomap *iomap, loff_t pos, + unsigned int offset, unsigned int size) { - if (dax_range_is_aligned(bdev, offset, size)) { + sector_t sector = iomap_sector(iomap, pos & PAGE_MASK); + struct block_device *bdev = iomap->bdev; + struct dax_device *dax_dev = iomap->dax_dev; + + if (!(iomap->flags & IOMAP_F_COW) && + 
dax_range_is_aligned(bdev, offset, size)) { sector_t start_sector = sector + (offset >> 9); return blkdev_issue_zeroout(bdev, start_sector, @@ -1098,6 +1102,9 @@ int __dax_zero_page_range(struct block_device *bdev, dax_read_unlock(id); return rc; } + if (iomap->flags & IOMAP_F_COW) + dax_to_dax_copy(iomap, pos & PAGE_MASK, +
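Editorial sketch, not part of the patch: the range selection in btrfs_dax_zero_block() above can be modelled in user space as follows. The names sketch_zero_range, struct zero_range and SKETCH_PAGE_SIZE are hypothetical, a 4096-byte page and non-negative offsets are assumed, and the call to iomap_zero_range() is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel code uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096LL

/* Hypothetical result type: the byte range that would be zeroed. */
struct zero_range {
	int64_t offset;
	int64_t len;
};

/* Round x down to a multiple of align (x >= 0, align > 0). */
static int64_t sketch_round_down(int64_t x, int64_t align)
{
	return x - (x % align);
}

/* Round x up to a multiple of align (x >= 0, align > 0). */
static int64_t sketch_round_up(int64_t x, int64_t align)
{
	return sketch_round_down(x + align - 1, align);
}

/*
 * Mirrors the branch structure quoted in the patch:
 *  - front:  zero from the start of the page up to 'from'
 *  - !front: zero 'len' bytes from 'from', or to the end of the page
 *            when len == 0
 * A resulting len of 0 means there is nothing to zero.
 */
static struct zero_range sketch_zero_range(int64_t from, int64_t len, bool front)
{
	int64_t start = sketch_round_down(from, SKETCH_PAGE_SIZE);
	int64_t end = sketch_round_up(from, SKETCH_PAGE_SIZE);
	struct zero_range r = { from, len };

	if (front) {
		r.len = from - start;
		r.offset = start;
	} else if (!r.len) {
		r.len = end - from;
	}
	return r;
}
```

With from = 5000 and len = 0, the tail case yields the range starting at 5000 with length 3192, and the front case yields the range starting at 4096 with length 904. A page-aligned from with len = 0 yields a zero-length range in both cases.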
[PATCH 07/15] btrfs: add dax write support
From: Goldwyn Rodrigues IOMAP_F_COW tells the dax code to first copy the existing data when a write is not page-aligned, and only then perform the write. A new struct btrfs_iomap, which contains all the accounting and locking information for CoW-based writes, is passed from iomap_begin() to iomap_end(). For writing to a hole, iomap->cow_addr is set to zero. Would this be better handled by a flag, or can a valid filesystem block be at offset zero of the device? Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 6 +++ fs/btrfs/dax.c | 119 +-- fs/btrfs/file.c | 4 +- 3 files changed, 124 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index a3543a4a063d..3bcd2a4959c1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); #ifdef CONFIG_FS_DAX /* dax.c */ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); +#else +static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) +{ + return 0; +} #endif /* CONFIG_FS_DAX */ static inline int is_fstree(u64 rootid) diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index bf3d46b0acb6..49619fe3f94f 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -9,30 +9,124 @@ #ifdef CONFIG_FS_DAX #include #include +#include #include "ctree.h" #include "btrfs_inode.h" +struct btrfs_iomap { + u64 start; + u64 end; + int nocow; + struct extent_changeset *data_reserved; + struct extent_state *cached_state; +}; + static int btrfs_iomap_begin(struct inode *inode, loff_t pos, loff_t length, unsigned flags, struct iomap *iomap) { struct extent_map *em; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0); + + if (flags & IOMAP_WRITE) { + int ret = 0, nocow; + struct extent_map *map = em; + struct btrfs_iomap *bi; + + bi = kzalloc(sizeof(struct btrfs_iomap),
GFP_NOFS); + if (!bi) + return -ENOMEM; + + bi->start = round_down(pos, PAGE_SIZE); + bi->end = round_up(pos + length, PAGE_SIZE); + + iomap->private = bi; + + /* Wait for existing ordered extents in range to finish */ + btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start); + + lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, &bi->cached_state); + + ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved, + bi->start, bi->end - bi->start); + if (ret) { + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end, + &bi->cached_state); + kfree(bi); + return ret; + } + + refcount_inc(&map->refs); + ret = btrfs_get_extent_map_write(&em, NULL, + inode, bi->start, bi->end - bi->start, &nocow); + if (ret) { + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, bi->end, + &bi->cached_state); + btrfs_delalloc_release_space(inode, + bi->data_reserved, bi->start, + bi->end - bi->start, true); + extent_changeset_free(bi->data_reserved); + kfree(bi); + return ret; + } + if (!nocow) { + iomap->flags |= IOMAP_F_COW; + if (map->block_start != EXTENT_MAP_HOLE) { + iomap->cow_addr = map->block_start; + iomap->cow_pos = map->start; + } + } else { + bi->nocow = 1; + } + free_extent_map(map); + } + + iomap->offset = em->start; + iomap->length = em->len; + iomap->bdev = em->bdev; + iomap->dax_dev = fs_info->dax_dev; + if (em->block_start == EXTENT_MAP_HOLE) { iomap->type = IOMAP_HOLE; return 0; } + iomap->type = IOMAP_MAPPED; - iomap->bdev = em->bdev; - iomap->dax_dev = fs_info->dax_dev; - iomap->offset = em->start; - iomap->length = em->len; iomap->addr = em->block_start; return 0; } +static int btrfs_iomap_end(struct inode *inode, loff_t pos, + loff_t length, ssize_t written, unsigned flags, + struct iomap *iomap) +{ + struct btrfs_iomap *bi = iomap-
[PATCH 11/15] fs: dedup file range to use a compare function
From: Goldwyn Rodrigues With dax we cannot deal with readpage() etc. So, we create a function callback to perform the file data comparison and pass it to generic_remap_file_range_prep() so it can use iomap-based functions. This may not be the best way to solve this. Suggestions welcome. Signed-off-by: Goldwyn Rodrigues --- fs/btrfs/ctree.h | 3 +++ fs/btrfs/dax.c | 7 +++ fs/btrfs/ioctl.c | 13 +++- fs/dax.c | 58 fs/ocfs2/file.c | 2 +- fs/read_write.c | 9 +--- fs/xfs/xfs_reflink.c | 2 +- include/linux/dax.h | 2 ++ include/linux/fs.h | 4 +++- 9 files changed, 93 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0e5060933bde..750f9c70fabe 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3803,6 +3803,9 @@ int btree_readahead_hook(struct extent_buffer *eb, int err); ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to); ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from); vm_fault_t btrfs_dax_fault(struct vm_fault *vmf); +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, loff_t len, + bool *is_same); #else static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from) { diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c index 927f962d1e88..9488cae0f8b4 100644 --- a/fs/btrfs/dax.c +++ b/fs/btrfs/dax.c @@ -168,4 +168,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf) return ret; } + +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff, + struct inode *dest, loff_t destoff, loff_t len, + bool *is_same) +{ + return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, &btrfs_iomap_ops); +} #endif /* CONFIG_FS_DAX */ diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index e66426e7692d..2e5137b01561 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -3990,8 +3990,19 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in, if (ret < 0) goto out_unlock; +#ifdef CONFIG_FS_DAX +
if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out))) + ret = generic_remap_file_range_prep(file_in, pos_in, file_out, + pos_out, len, remap_flags, + btrfs_dax_file_range_compare); + else + ret = generic_remap_file_range_prep(file_in, pos_in, file_out, + pos_out, len, remap_flags, NULL); +#else ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out, - len, remap_flags); + len, remap_flags, NULL); +#endif + if (ret < 0 || *len == 0) goto out_unlock; diff --git a/fs/dax.c b/fs/dax.c index 41061da42771..18998c5ee27a 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -1775,3 +1775,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf, return dax_insert_pfn_mkwrite(vmf, pfn, order); } EXPORT_SYMBOL_GPL(dax_finish_sync_fault); + + +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode *dest, + loff_t destoff, loff_t len, bool *is_same, const struct iomap_ops *ops) +{ + void *saddr, *daddr; + struct iomap s_iomap = {0}; + struct iomap d_iomap = {0}; + loff_t dstart, sstart; + bool same = true; + loff_t cmp_len, l; + int id, ret = 0; + + id = dax_read_lock(); + while (len) { + ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap); + if (ret < 0) { + if (ops->iomap_end) + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap); + return ret; + } + cmp_len = len; + if (cmp_len > s_iomap.offset + s_iomap.length - srcoff) + cmp_len = s_iomap.offset + s_iomap.length - srcoff; + + ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap); + if (ret < 0) { + if (ops->iomap_end) { + ops->iomap_end(src, srcoff, len, ret, 0, &s_iomap); + ops->iomap_end(dest, destoff, len, ret, 0, &d_iomap); + } + return ret; + } + if (cmp_len > d_iomap.offset + d_iomap.length - destoff) + cmp_len = d_iomap.offset + d_iomap.length - destoff; + + + sstart = (get_start_sect(s_iomap.bdev) << 9) + s_iomap.addr + (srcoff - s_iomap.offset); + l = dax_direct_access(s_iomap.dax_dev, PHYS_PFN(sstart), PHYS_PFN(cmp_len), &saddr, NULL); + dstart = 
(get_start_sect(d_iomap.bdev) <<
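Editorial sketch, not part of the patch: in each loop iteration, dax_file_range_compare() above limits the number of bytes it compares to what is left in both the source and the destination mapping. A user-space model of that clamp, using the hypothetical names struct sketch_extent and sketch_clamp_cmp_len and plain integers in place of struct iomap, might look like this:

```c
#include <stdint.h>

/* Hypothetical stand-in for the offset/length pair of a struct iomap. */
struct sketch_extent {
	int64_t offset;	/* file offset where the mapping starts */
	int64_t length;	/* number of bytes covered by the mapping */
};

/*
 * Return how many bytes can be compared in one step: at most 'len',
 * and never past the end of either mapping. 'srcoff' and 'destoff'
 * are assumed to lie inside their respective mappings.
 */
static int64_t sketch_clamp_cmp_len(int64_t len,
				    const struct sketch_extent *src, int64_t srcoff,
				    const struct sketch_extent *dst, int64_t destoff)
{
	int64_t cmp_len = len;
	int64_t src_left = src->offset + src->length - srcoff;
	int64_t dst_left = dst->offset + dst->length - destoff;

	if (cmp_len > src_left)
		cmp_len = src_left;
	if (cmp_len > dst_left)
		cmp_len = dst_left;
	return cmp_len;
}
```

For a source mapping covering [0, 8192) read at offset 4096 and a destination mapping covering [4096, 6144) read at offset 5000, a request for 10000 bytes is clamped to 1144, the space left in the destination mapping.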
[PATCH v2 00/15] btrfs dax support
Sorry, messed up the subject the first time.

This patch set adds support for dax on the BTRFS filesystem. In order to support CoW for btrfs, changes had to be made to the dax handling. The important one is copying blocks into the same dax device before using them. I have some doubts, which I have put in the patch headers of the individual patches.

Git: https://github.com/goldwynr/linux/tree/btrfs-dax

Changes since V1:
- use iomap instead of redoing everything in btrfs
- support for mmap write-protection on snapshotting

fs/btrfs/Makefile | 1
fs/btrfs/ctree.h | 32 +-
fs/btrfs/dax.c | 225 +--
fs/btrfs/disk-io.c | 4
fs/btrfs/file.c | 34 +-
fs/btrfs/inode.c | 114 -
fs/btrfs/ioctl.c | 31 +
fs/btrfs/send.c | 4
fs/btrfs/super.c | 26
fs/dax.c | 164 ---
fs/iomap.c | 9 -
fs/ocfs2/file.c | 2
fs/read_write.c | 9 +
fs/xfs/xfs_reflink.c | 2
include/linux/dax.h | 13 +-
include/linux/fs.h | 4
include/linux/iomap.h | 9 +
include/trace/events/btrfs.h | 56 ++
18 files changed, 662 insertions(+), 77 deletions(-)

--
Goldwyn
Re: [PATCH 01/15] btrfs: create a mount option for dax
On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
>
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
>
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.

Congratulations on getting the bear to dance.

But why? To me, the point of btrfs is all the cool stuff it does with built-in checksumming and snapshots and RAID and so on. DAX doesn't let you do any of that, so why would somebody want to use btrfs to manage DAX?
Re: [PATCH 07/10] dax: export functions for use with btrfs
On Wed, Dec 5, 2018 at 4:29 AM Goldwyn Rodrigues wrote:
>
> From: Goldwyn Rodrigues
>
> These functions are required for btrfs dax support.
>
> Signed-off-by: Goldwyn Rodrigues
> ---
> fs/dax.c| 35 ---
> include/linux/dax.h | 16
> 2 files changed, 40 insertions(+), 11 deletions(-)

Per MAINTAINERS, please copy the filesystem-dax developers on dax patches.
[PATCH v6 07/07] btrfs: use common file type conversion
Deduplicate the btrfs file type conversion implementation - file systems that use the same file types as defined by POSIX do not need to define their own versions and can use the common helper functions declared in fs_types.h and implemented in fs_types.c. Common implementation can be found via commit: bbe7449e2599 "fs: common implementation of file type" Acked-by: David Sterba Signed-off-by: Amir Goldstein Signed-off-by: Phillip Potter Reviewed-by: Jan Kara --- fs/btrfs/btrfs_inode.h | 2 -- fs/btrfs/delayed-inode.c| 2 +- fs/btrfs/inode.c| 32 +++- include/uapi/linux/btrfs_tree.h | 2 ++ 4 files changed, 18 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 6f5d07415dab..b16c13d51be0 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -203,8 +203,6 @@ struct btrfs_inode { struct inode vfs_inode; }; -extern unsigned char btrfs_filetype_table[]; - static inline struct btrfs_inode *BTRFS_I(const struct inode *inode) { return container_of(inode, struct btrfs_inode, vfs_inode); diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index c669f250d4a0..e61947f5eb76 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1692,7 +1692,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx, name = (char *)(di + 1); name_len = btrfs_stack_dir_name_len(di); - d_type = btrfs_filetype_table[di->type]; + d_type = fs_ftype_to_dtype(di->type); btrfs_disk_key_to_cpu(&location, &di->location); over = !dir_emit(ctx, name, name_len, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 82fdda8ff5ab..047609c27913 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -73,17 +73,6 @@ struct kmem_cache *btrfs_trans_handle_cachep; struct kmem_cache *btrfs_path_cachep; struct kmem_cache *btrfs_free_space_cachep; -#define S_SHIFT 12 -static const unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = { - [S_IFREG >> S_SHIFT]= BTRFS_FT_REG_FILE, - [S_IFDIR >> S_SHIFT]= BTRFS_FT_DIR, - [S_IFCHR >>
S_SHIFT]= BTRFS_FT_CHRDEV, - [S_IFBLK >> S_SHIFT]= BTRFS_FT_BLKDEV, - [S_IFIFO >> S_SHIFT]= BTRFS_FT_FIFO, - [S_IFSOCK >> S_SHIFT] = BTRFS_FT_SOCK, - [S_IFLNK >> S_SHIFT]= BTRFS_FT_SYMLINK, -}; - static int btrfs_setsize(struct inode *inode, struct iattr *attr); static int btrfs_truncate(struct inode *inode, bool skip_writeback); static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent); @@ -5797,10 +5786,6 @@ static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry, return d_splice_alias(inode, dentry); } -unsigned char btrfs_filetype_table[] = { - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK -}; - /* * All this infrastructure exists because dir_emit can fault, and we are holding * the tree lock when doing readdir. For now just allocate a buffer and copy @@ -5939,7 +5924,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) name_ptr = (char *)(entry + 1); read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1), name_len); - put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)], + put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)), &entry->type); btrfs_dir_item_key_to_cpu(leaf, di, &location); put_unaligned(location.objectid, &entry->ino); @@ -6344,7 +6329,20 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, static inline u8 btrfs_inode_type(struct inode *inode) { - return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT]; + /* +* compile-time asserts that generic FT_* types still match +* BTRFS_FT_* types +*/ + BUILD_BUG_ON(BTRFS_FT_UNKNOWN != FT_UNKNOWN); + BUILD_BUG_ON(BTRFS_FT_REG_FILE != FT_REG_FILE); + BUILD_BUG_ON(BTRFS_FT_DIR != FT_DIR); + BUILD_BUG_ON(BTRFS_FT_CHRDEV != FT_CHRDEV); + BUILD_BUG_ON(BTRFS_FT_BLKDEV != FT_BLKDEV); + BUILD_BUG_ON(BTRFS_FT_FIFO != FT_FIFO); + BUILD_BUG_ON(BTRFS_FT_SOCK != FT_SOCK); + BUILD_BUG_ON(BTRFS_FT_SYMLINK != FT_SYMLINK); + + return fs_umode_to_ftype(inode->i_mode); } /* diff --git 
a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h index e974f4bb5378..421239b98db2 100644 --- a/include/uapi/linux/btrfs_tree.h +++ b/include/uapi/linux/btrfs_tree.h @@ -307,6 +307,8 @@ * * Used by: * struct btrfs_dir_item.type + * + * Values 0..7 must match common file type values in fs_types.h. */ #define BTRFS_FT_UNKNOWN 0 #define BTRFS_FT_REG_FILE 1 -- 2.20.1
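Editorial sketch, not part of the patch: the two table lookups that the patch replaces (mode to on-disk file type, and on-disk file type to readdir d_type) can be modelled in user space as below. All SK_* and sk_* names are hypothetical. The numeric values follow the BTRFS_FT_* definitions quoted above and the usual POSIX S_IF* and DT_* encodings, which this sketch assumes rather than includes. It is not the kernel's fs_types.c.

```c
/* File type codes 0..7, in the same order as BTRFS_FT_* in the patch. */
enum {
	SK_FT_UNKNOWN, SK_FT_REG_FILE, SK_FT_DIR, SK_FT_CHRDEV,
	SK_FT_BLKDEV, SK_FT_FIFO, SK_FT_SOCK, SK_FT_SYMLINK,
	SK_FT_MAX
};

/* Assumed S_IF* encodings (octal), spelled out to stay self-contained. */
#define SK_IFMT   0170000u
#define SK_IFSOCK 0140000u
#define SK_IFLNK  0120000u
#define SK_IFREG  0100000u
#define SK_IFBLK  0060000u
#define SK_IFDIR  0040000u
#define SK_IFCHR  0020000u
#define SK_IFIFO  0010000u
#define SK_SHIFT  12

/* Assumed DT_* encodings as used by readdir(). */
enum {
	SK_DT_UNKNOWN = 0, SK_DT_FIFO = 1, SK_DT_CHR = 2, SK_DT_DIR = 4,
	SK_DT_BLK = 6, SK_DT_REG = 8, SK_DT_LNK = 10, SK_DT_SOCK = 12
};

/* Mode type bits -> file type; unlisted slots default to SK_FT_UNKNOWN. */
static const unsigned char sk_type_by_mode[(SK_IFMT >> SK_SHIFT) + 1] = {
	[SK_IFREG >> SK_SHIFT]  = SK_FT_REG_FILE,
	[SK_IFDIR >> SK_SHIFT]  = SK_FT_DIR,
	[SK_IFCHR >> SK_SHIFT]  = SK_FT_CHRDEV,
	[SK_IFBLK >> SK_SHIFT]  = SK_FT_BLKDEV,
	[SK_IFIFO >> SK_SHIFT]  = SK_FT_FIFO,
	[SK_IFSOCK >> SK_SHIFT] = SK_FT_SOCK,
	[SK_IFLNK >> SK_SHIFT]  = SK_FT_SYMLINK,
};

/* File type -> d_type, same order as the removed btrfs_filetype_table. */
static const unsigned char sk_dtype_by_ftype[SK_FT_MAX] = {
	[SK_FT_UNKNOWN]  = SK_DT_UNKNOWN,
	[SK_FT_REG_FILE] = SK_DT_REG,
	[SK_FT_DIR]      = SK_DT_DIR,
	[SK_FT_CHRDEV]   = SK_DT_CHR,
	[SK_FT_BLKDEV]   = SK_DT_BLK,
	[SK_FT_FIFO]     = SK_DT_FIFO,
	[SK_FT_SOCK]     = SK_DT_SOCK,
	[SK_FT_SYMLINK]  = SK_DT_LNK,
};

/* Convert an inode mode to a file type code. */
static unsigned char sk_umode_to_ftype(unsigned int mode)
{
	return sk_type_by_mode[(mode & SK_IFMT) >> SK_SHIFT];
}

/* Convert a file type code to a d_type; out-of-range input is unknown. */
static unsigned char sk_ftype_to_dtype(unsigned int ftype)
{
	if (ftype >= SK_FT_MAX)
		return SK_DT_UNKNOWN;
	return sk_dtype_by_ftype[ftype];
}
```

The bounds check in sk_ftype_to_dtype() is a design choice of this sketch: unlike a bare array index on an on-disk value, a corrupted type byte maps to "unknown" instead of reading past the table.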
Re: Kernels 4.15..5.0.3: "WARNING: CPU: 2 PID: 4150 at fs/fs-writeback.c:2363 __writeback_inodes_sb_nr+0xa9/0xc0"
On Fri, Mar 22, 2019 at 05:26:52PM +, Filipe Manana wrote: > On Fri, Mar 22, 2019 at 3:59 PM David Sterba wrote: > > > > On Fri, Mar 22, 2019 at 09:32:37AM +0200, Nikolay Borisov wrote: > > > On 22.03.19 г. 6:17 ч., Zygo Blaxell wrote: > > > > When filesystems are mounted flushoncommit, I get this warning roughly > > > > every 30 seconds: > > > > > > > > [ 4575.142805] WARNING: CPU: 3 PID: 4150 at fs/fs-writeback.c:2363 > > > > __writeback_inodes_sb_nr+0xa9/0xc0 > > > > [ 4575.145567] Modules linked in: crct10dif_pclmul crc32_pclmul > > > > dm_cache_smq crc32c_intel dm_cache snd_pcm ghash_clmulni_intel > > > > aesni_intel sr_mod dm_persistent_data ppdev joydev dm_bio_prison > > > > aes_x86_64 crypto_simd snd_timer dm_bufio cryptd cdrom snd glue_helper > > > > dm_mod parport_pc soundcore sg floppy parport pcspkr psmouse bochs_drm > > > > rtc_cmos ide_pci_generic piix input_leds i2c_piix4 ide_core serio_raw > > > > evbug qemu_fw_cfg evdev ip_tables x_tables ipv6 crc_ccitt autofs4 > > > > [ 4575.160021] CPU: 3 PID: 4150 Comm: btrfs-transacti Tainted: G > > > > W 5.0.3-zb64+ #1 > > > > [ 4575.162484] Hardware name: QEMU Standard PC (i440FX + PIIX, > > > > 1996), BIOS 1.10.2-1 04/01/2014 > > > > [ 4575.164505] RIP: 0010:__writeback_inodes_sb_nr+0xa9/0xc0 > > > > [ 4575.165809] Code: 0f b6 d2 e8 b9 f8 ff ff 48 89 ee 48 89 df e8 > > > > 0e f8 ff ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4 50 > > > > 5b 5d c3 <0f> 0b eb cb e8 4e e9 d6 ff 0f 1f 40 00 66 2e 0f 1f 84 00 00 > > > > 00 00 > > > > [ 4575.171927] RSP: 0018:a9cac0eabde8 EFLAGS: 00010246 > > > > [ 4575.173045] RAX: RBX: 9353e23af000 RCX: > > > > > > > > [ 4575.175639] RDX: 0002 RSI: 00030c67 RDI: > > > > a9cac0eabe30 > > > > [ 4575.177619] RBP: a9cac0eabdec R08: a9cac0eabdf0 R09: > > > > 9353f12da000 > > > > [ 4575.179736] R10: R11: 0001 R12: > > > > 9353e198 > > > > [ 4575.181661] R13: 9353e1981430 R14: 9353f27e4260 R15: > > > > 9353e1981518 > > > > [ 4575.183871] FS: () > > > > GS:9353f680() 
knlGS: > > > > [ 4575.185940] CS: 0010 DS: ES: CR0: 80050033 > > > > [ 4575.188072] CR2: 7fb81841fa20 CR3: 0002218c0006 CR4: > > > > 001606e0 > > > > [ 4575.190094] Call Trace: > > > > [ 4575.190828] btrfs_commit_transaction+0x7a6/0x9e0 > > > > [ 4575.192115] ? start_transaction+0x91/0x4d0 > > > > [ 4575.193197] transaction_kthread+0x146/0x180 > > > > [ 4575.194415] kthread+0x106/0x140 > > > > [ 4575.195403] ? btrfs_cleanup_transaction+0x620/0x620 > > > > [ 4575.196903] ? kthread_park+0x90/0x90 > > > > [ 4575.198412] ret_from_fork+0x3a/0x50 > > > > [ 4575.199374] irq event stamp: 54922780 > > > > [ 4575.200218] hardirqs last enabled at (54922779): > > > > [] _raw_spin_unlock_irqrestore+0x32/0x60 > > > > [ 4575.202753] hardirqs last disabled at (54922780): > > > > [] trace_hardirqs_off_thunk+0x1a/0x1c > > > > [ 4575.205921] softirqs last enabled at (54922378): > > > > [] __do_softirq+0x3a4/0x45f > > > > [ 4575.208350] softirqs last disabled at (54922361): > > > > [] irq_exit+0xe4/0xf0 > > > > [ 4575.210616] ---[ end trace 5309dcf3a1920eca ]--- > > > > > > > > For my own kernel builds I just comment out the line in fs-writeback.c, > > > > but that's not a real solution. > > > > > > > > > > This is a longstanding and known issue for which no good solution exists > > > ATM. > > > > The s_umount mutex is taken around the writeback_inodes_sb_nr call in > > btrfs_writeback_inodes_sb_nr: > > > > 4689 static void btrfs_writeback_inodes_sb_nr(struct btrfs_fs_info > > *fs_info, > > 4690 unsigned long nr_pages, int > > nr_items) > > 4691 { > > 4692 struct super_block *sb = fs_info->sb; > > 4693 > > 4694 if (down_read_trylock(&sb->s_umount)) { > > 4695 writeback_inodes_sb_nr(sb, nr_pages, > > WB_REASON_FS_FREE_SPACE); > > 4696 up_read(&sb->s_umount); > > 4697 } else { > > > > but __writeback_inodes_sb_nr still complains. > > Yes, because btrfs_writeback_inodes_sb_nr() is not what is called > during transaction commit if the fs is mounted with -o flushoncommit. 
> What is called is writeback_inodes_sb() (which gets > __writeback_inodes_sb_nr()). > > Calling btrfs_writeback_inodes_sb_nr() would just fallback to > btrfs_start_delalloc_roots() during a concurrent freeze (which does a > down_write() > on that semaphore), bringing back the problem Josef tried to fix at [1]. > > This problem of the warning, and the original problem [1], could > possibly be solved by making sure no regular transaction joins happen > after we > start the transaction commit, something along the lines of: > > h
Re: Kernels 4.15..5.0.3: "WARNING: CPU: 2 PID: 4150 at fs/fs-writeback.c:2363 __writeback_inodes_sb_nr+0xa9/0xc0"
On Tue, Mar 26, 2019 at 11:13 PM Zygo Blaxell wrote: > > On Fri, Mar 22, 2019 at 05:26:52PM +, Filipe Manana wrote: > > On Fri, Mar 22, 2019 at 3:59 PM David Sterba wrote: > > > > > > On Fri, Mar 22, 2019 at 09:32:37AM +0200, Nikolay Borisov wrote: > > > > On 22.03.19 г. 6:17 ч., Zygo Blaxell wrote: > > > > > When filesystems are mounted flushoncommit, I get this warning roughly > > > > > every 30 seconds: > > > > > > > > > > [ 4575.142805] WARNING: CPU: 3 PID: 4150 at > > > > > fs/fs-writeback.c:2363 __writeback_inodes_sb_nr+0xa9/0xc0 > > > > > [ 4575.145567] Modules linked in: crct10dif_pclmul crc32_pclmul > > > > > dm_cache_smq crc32c_intel dm_cache snd_pcm ghash_clmulni_intel > > > > > aesni_intel sr_mod dm_persistent_data ppdev joydev dm_bio_prison > > > > > aes_x86_64 crypto_simd snd_timer dm_bufio cryptd cdrom snd > > > > > glue_helper dm_mod parport_pc soundcore sg floppy parport pcspkr > > > > > psmouse bochs_drm rtc_cmos ide_pci_generic piix input_leds i2c_piix4 > > > > > ide_core serio_raw evbug qemu_fw_cfg evdev ip_tables x_tables ipv6 > > > > > crc_ccitt autofs4 > > > > > [ 4575.160021] CPU: 3 PID: 4150 Comm: btrfs-transacti Tainted: G > > > > > W 5.0.3-zb64+ #1 > > > > > [ 4575.162484] Hardware name: QEMU Standard PC (i440FX + PIIX, > > > > > 1996), BIOS 1.10.2-1 04/01/2014 > > > > > [ 4575.164505] RIP: 0010:__writeback_inodes_sb_nr+0xa9/0xc0 > > > > > [ 4575.165809] Code: 0f b6 d2 e8 b9 f8 ff ff 48 89 ee 48 89 df e8 > > > > > 0e f8 ff ff 48 8b 44 24 48 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4 > > > > > 50 5b 5d c3 <0f> 0b eb cb e8 4e e9 d6 ff 0f 1f 40 00 66 2e 0f 1f 84 > > > > > 00 00 00 00 > > > > > [ 4575.171927] RSP: 0018:a9cac0eabde8 EFLAGS: 00010246 > > > > > [ 4575.173045] RAX: RBX: 9353e23af000 RCX: > > > > > > > > > > [ 4575.175639] RDX: 0002 RSI: 00030c67 RDI: > > > > > a9cac0eabe30 > > > > > [ 4575.177619] RBP: a9cac0eabdec R08: a9cac0eabdf0 R09: > > > > > 9353f12da000 > > > > > [ 4575.179736] R10: R11: 0001 R12: > > > > > 9353e198 > > > > 
> [ 4575.181661] R13: 9353e1981430 R14: 9353f27e4260 R15: > > > > > 9353e1981518 > > > > > [ 4575.183871] FS: () > > > > > GS:9353f680() knlGS: > > > > > [ 4575.185940] CS: 0010 DS: ES: CR0: 80050033 > > > > > [ 4575.188072] CR2: 7fb81841fa20 CR3: 0002218c0006 CR4: > > > > > 001606e0 > > > > > [ 4575.190094] Call Trace: > > > > > [ 4575.190828] btrfs_commit_transaction+0x7a6/0x9e0 > > > > > [ 4575.192115] ? start_transaction+0x91/0x4d0 > > > > > [ 4575.193197] transaction_kthread+0x146/0x180 > > > > > [ 4575.194415] kthread+0x106/0x140 > > > > > [ 4575.195403] ? btrfs_cleanup_transaction+0x620/0x620 > > > > > [ 4575.196903] ? kthread_park+0x90/0x90 > > > > > [ 4575.198412] ret_from_fork+0x3a/0x50 > > > > > [ 4575.199374] irq event stamp: 54922780 > > > > > [ 4575.200218] hardirqs last enabled at (54922779): > > > > > [] _raw_spin_unlock_irqrestore+0x32/0x60 > > > > > [ 4575.202753] hardirqs last disabled at (54922780): > > > > > [] trace_hardirqs_off_thunk+0x1a/0x1c > > > > > [ 4575.205921] softirqs last enabled at (54922378): > > > > > [] __do_softirq+0x3a4/0x45f > > > > > [ 4575.208350] softirqs last disabled at (54922361): > > > > > [] irq_exit+0xe4/0xf0 > > > > > [ 4575.210616] ---[ end trace 5309dcf3a1920eca ]--- > > > > > > > > > > For my own kernel builds I just comment out the line in > > > > > fs-writeback.c, > > > > > but that's not a real solution. > > > > > > > > > > > > > This is a longstanding and known issue for which no good solution exists > > > > ATM. 
> > > > > > The s_umount mutex is taken around the writeback_inodes_sb_nr call in > > > btrfs_writeback_inodes_sb_nr: > > > > > > 4689 static void btrfs_writeback_inodes_sb_nr(struct btrfs_fs_info > > > *fs_info, > > > 4690 unsigned long nr_pages, > > > int nr_items) > > > 4691 { > > > 4692 struct super_block *sb = fs_info->sb; > > > 4693 > > > 4694 if (down_read_trylock(&sb->s_umount)) { > > > 4695 writeback_inodes_sb_nr(sb, nr_pages, > > > WB_REASON_FS_FREE_SPACE); > > > 4696 up_read(&sb->s_umount); > > > 4697 } else { > > > > > > but __writeback_inodes_sb_nr still complains. > > > > Yes, because btrfs_writeback_inodes_sb_nr() is not what is called > > during transaction commit if the fs is mounted with -o flushoncommit. > > What is called is writeback_inodes_sb() (which gets > > __writeback_inodes_sb_nr()). > > > > Calling btrfs_writeback_inodes_sb_nr() would just fallback to > > btrfs_start_delalloc_roots() during a concurrent freeze (which does a > > down_write() > > on that semaphore), bringing
Re: [PATCH v6 07/07] btrfs: use common file type conversion
On Tue, Mar 26, 2019 at 09:39:34PM +, Phillip Potter wrote:
> Deduplicate the btrfs file type conversion implementation - file systems
> that use the same file types as defined by POSIX do not need to define
> their own versions and can use the common helper functions declared in
> fs_types.h and implemented in fs_types.c
>
> Common implementation can be found via commit:
> bbe7449e2599 "fs: common implementation of file type"

Now that the prerequisite patch is in master, I can pick the btrfs specific part. Thanks.
Re: [PATCH v2 8/9] btrfs: tree-checker: Verify inode item
On 2019/3/27 12:02 AM, David Sterba wrote:
> On Mon, Mar 25, 2019 at 12:27:24PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/3/20 2:37 PM, Qu Wenruo wrote:
>>> There is a report in kernel bugzilla about a mismatched file type in dir
>>> item and inode item.
>>>
>>> This inspires us to check inode mode in inode item.
>>>
>>> This patch will check the following members:
>>> - inode key objectid
>>>   Should be ROOT_DIR_DIR or [256, (u64)-256] or FREE_INO.
>>>
>>> - inode key offset
>>>   Should be 0
>>>
>>> - inode item generation
>>> - inode item transid
>>>   No newer than sb generation + 1.
>>>   The +1 is for log tree.
>>>
>>> - inode item mode
>>>   No unknown bits.
>>>   No invalid S_IF* bit.
>>>   NOTE: S_IFMT check is not enough, need to check every known type.
>>>
>>> - inode item nlink
>>>   Dir should have no more than 1 link.
>>>
>>> - inode item flags
>>>
>>> Signed-off-by: Qu Wenruo
>>> Reviewed-by: Nikolay Borisov
>>
>> There is a bug report of the kernel producing a free space cache inode with
>> mode 0, which is invalid and can be detected by this patch.
>>
>> Although the patch itself is good, I'm afraid we need to address the
>> invalid inode mode created by old kernels in btrfs-progs at least before
>> merging this patch into upstream.
>
> Can this be addressed on the kernel side? Like detecting the invalid
> mode, print a warning and the fix on the next write. The progs can
> detect and fix that too of course.

So far, even on older fs images (like those in the btrfs-progs fsck tests), I have noticed no such invalid free space inode at all.

And from the history of that code, the mode has been fixed to 100600 since 2010. So I currently believe that case is uncommon.

Furthermore, the btrfs-progs fix for such a case has already been submitted; as long as we have a minor release that includes the fix, it should be OK.

>
> So I'll keep the patch working as-is, we can relax the error to a
> warning if we're out of time or find out that it needs to be that way
> due to backward compatibility reasons.
>
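Editorial sketch, not the tree-checker code from the patch: the "inode item mode" rules listed in the quoted description (no unknown bits, and a file type that is one of the known S_IF* values rather than merely something inside S_IFMT) could be expressed in user space as below. The function name sketch_inode_mode_valid and the SKM_* constants are hypothetical, and the S_IF* encodings are the usual POSIX octal values, assumed here rather than included.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed S_IF* encodings (octal), spelled out to stay self-contained. */
#define SKM_IFMT   0170000u
#define SKM_IFSOCK 0140000u
#define SKM_IFLNK  0120000u
#define SKM_IFREG  0100000u
#define SKM_IFBLK  0060000u
#define SKM_IFDIR  0040000u
#define SKM_IFCHR  0020000u
#define SKM_IFIFO  0010000u

/* Permission, setuid/setgid and sticky bits. */
#define SKM_PERM_MASK 07777u

/*
 * Return true if 'mode' has no bits outside the type and permission
 * masks and its type field is exactly one of the seven known types.
 */
static bool sketch_inode_mode_valid(uint32_t mode)
{
	/* No unknown bits. */
	if (mode & ~(SKM_IFMT | SKM_PERM_MASK))
		return false;

	/* Masking with S_IFMT is not enough: check every known type. */
	switch (mode & SKM_IFMT) {
	case SKM_IFREG:
	case SKM_IFDIR:
	case SKM_IFLNK:
	case SKM_IFCHR:
	case SKM_IFBLK:
	case SKM_IFIFO:
	case SKM_IFSOCK:
		return true;
	default:
		return false;
	}
}
```

Under these assumptions, the mode 100600 mentioned in the reply passes, while mode 0 (the value from the bug report) is rejected because its type field matches no known type.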
Re: [PATCH v6 07/07] btrfs: use common file type conversion
On Wed, Mar 27, 2019 at 12:55:39AM +0100, David Sterba wrote:
> On Tue, Mar 26, 2019 at 09:39:34PM +, Phillip Potter wrote:
> > Deduplicate the btrfs file type conversion implementation - file systems
> > that use the same file types as defined by POSIX do not need to define
> > their own versions and can use the common helper functions declared in
> > fs_types.h and implemented in fs_types.c
> >
> > Common implementation can be found via commit:
> > bbe7449e2599 "fs: common implementation of file type"
>
> Now that the prerequisite patch is in master, I can pick the btrfs
> specific part. Thanks.

Thank you for this.

Regards,
Phil
Re: A collection of btrfs lockup stack traces (4.14.106, no __sb_start_write)
On Tue, Mar 19, 2019 at 11:39:59PM -0400, Zygo Blaxell wrote:
> I haven't been able to easily reproduce these in a test environment;
> however, they have been happening several times a year on servers in
> production.
>
> Kernel: most recent observation on 4.14.105 + cherry-picked deadlock
> and misc hang fixes:
>
>       btrfs: wakeup cleaner thread when adding delayed iput
>       Btrfs: fix deadlock when allocating tree block during leaf/node split
>       Btrfs: use nofs context when initializing security xattrs to avoid
>       deadlock
>       Btrfs: fix deadlock with memory reclaim during scrub
>       Btrfs: fix deadlock between clone/dedupe and rename
>
> Also observed on 4.20.13, and 4.14.0..4.14.105 (4.14.106 is currently
> running, but hasn't locked up yet).
>
> Filesystem mount flags: compress=zstd,ssd,flushoncommit,space_cache=v2.
> Configuration is either -draid1/-mraid1 or -dsingle/-mraid1. I've
> also reproduced a lockup without flushoncommit.
>
> The machines that are locking up all run the same workload:
>
>       rsync receiving data continuously (gigabytes aren't enough,
>       I can barely reproduce this once a month with 2TB of rsync
>       traffic from 10 simulated clients)
>
>       bees doing continuous dedupe
>
>       snapshots daily and after each rsync
>
>       snapshot deletes as required to maintain free space
>
>       scrubs twice monthly plus after each crash
>
>       watchdog does a 'mkdir foo; rmdir foo' every few seconds.
>       If this takes longer than 50 minutes, collect a stack trace;
>       longer than 60 minutes, reboot the machine.
> [...]
> Here's a recent lockup example with lockdep output, from 4.14.105 with
> the cherry-picked patches above:

The above had __sb_start_write in the call stack, which occurs in 7 out
of 15 call traces on one machine:

> [Fri Mar 15 21:53:36 2019] crawl_294 D 0 20349 31220 0x
> [Fri Mar 15 21:53:36 2019] Call Trace:
> [Fri Mar 15 21:53:36 2019] ? __schedule+0x429/0xbb0
> [Fri Mar 15 21:53:36 2019] ? rwsem_down_write_failed+0x134/0x2b0
> [Fri Mar 15 21:53:36 2019] schedule+0x39/0x90
> [Fri Mar 15 21:53:36 2019] rwsem_down_write_failed+0x139/0x2b0
> [Fri Mar 15 21:53:36 2019] ? call_rwsem_down_write_failed+0x13/0x20
> [Fri Mar 15 21:53:36 2019] call_rwsem_down_write_failed+0x13/0x20
> [Fri Mar 15 21:53:36 2019] down_write_nested+0x87/0xb0
> [Fri Mar 15 21:53:36 2019] btrfs_dedupe_file_range+0x8e/0x660
> [Fri Mar 15 21:53:36 2019] ? rcu_read_lock_sched_held+0x68/0x70
> [Fri Mar 15 21:53:36 2019] ? rcu_sync_lockdep_assert+0x30/0x60
> [Fri Mar 15 21:53:36 2019] ? __sb_start_write+0xcc/0x1b0
> [Fri Mar 15 21:53:36 2019] ? mnt_want_write_file+0x3b/0xb0
> [Fri Mar 15 21:53:36 2019] vfs_dedupe_file_range+0x22e/0x280
> [Fri Mar 15 21:53:36 2019] do_vfs_ioctl+0x24d/0x6b0
> [Fri Mar 15 21:53:36 2019] ? __fget+0x11f/0x210
> [Fri Mar 15 21:53:36 2019] SyS_ioctl+0x74/0x80
> [Fri Mar 15 21:53:36 2019] do_syscall_64+0x76/0x180
> [Fri Mar 15 21:53:36 2019] entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [Fri Mar 15 21:53:36 2019] RIP: 0033:0x7fd690048dd7
> [Fri Mar 15 21:53:36 2019] RSP: 002b:7fd687ffc0a8 EFLAGS: 0246
> ORIG_RAX: 0010
> [Fri Mar 15 21:53:36 2019] RAX: ffda RBX: 7fd568332780 RCX:
> 7fd690048dd7
> [Fri Mar 15 21:53:36 2019] RDX: 7fd568332780 RSI: c0189436 RDI:
> 0067
> [Fri Mar 15 21:53:36 2019] RBP: 7fd687ffc420 R08: 7fd56804c940 R09:
>
> [Fri Mar 15 21:53:36 2019] R10: 7fd568022ed0 R11: 0246 R12:
> 0020
> [Fri Mar 15 21:53:36 2019] R13: 0018 R14: 7fd56804c958 R15:
> 7fd687ffc428

Here's an example of a deadlock on 4.14.106 that doesn't call
__sb_start_write:

[Sat Mar 23 06:13:07 2019] sysrq: SysRq : Show Blocked State
[Sat Mar 23 06:13:07 2019] task PC stack pid father
[Sat Mar 23 06:13:07 2019] btrfs-transacti D 0 28857 2 0x8000
[Sat Mar 23 06:13:07 2019] Call Trace:
[Sat Mar 23 06:13:07 2019] ? __schedule+0x429/0xbb0
[Sat Mar 23 06:13:07 2019] schedule+0x39/0x90
[Sat Mar 23 06:13:07 2019] schedule_timeout+0x20f/0x590
[Sat Mar 23 06:13:07 2019] ? lock_acquire+0xbc/0x200
[Sat Mar 23 06:13:07 2019] ? wait_for_common+0x3c/0x1f0
[Sat Mar 23 06:13:07 2019] ? wait_for_common+0x131/0x1f0
[Sat Mar 23 06:13:07 2019] wait_for_common+0x131/0x1f0
[Sat Mar 23 06:13:07 2019] ? wake_up_q+0x70/0x70
[Sat Mar 23 06:13:07 2019] __start_delalloc_inodes+0x22f/0x320
[Sat Mar 23 06:13:07 2019] btrfs_start_delalloc_roots+0x1cd/0x380
[Sat Mar 23 06:13:07 2019] btrfs_commit_transaction+0x807/0xa20
[Sat Mar 23 06:13:07 2019] ? start_transaction+0x89/0x4d0
[Sat Mar 23 06:13:07 2019] transaction_kthread+0x194/0x1d0
[Sat Mar 23 06:13:07 2019] kthread+0x10d/0x140
[Sat
[PATCH -next] btrfs: remove set but not used variable 'fs_devices'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/btrfs/volumes.c: In function 'btrfs_grow_device':
fs/btrfs/volumes.c:2824:27: warning:
 variable 'fs_devices' set but not used [-Wunused-but-set-variable]

It's not used after 6f32a50a232b ("btrfs: combine device update
operations during transaction commit")

Signed-off-by: YueHaibing
---
 fs/btrfs/volumes.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index afafc92e70e9..605230482009 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2821,7 +2821,6 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *fs_info = device->fs_info;
 	struct btrfs_super_block *super_copy = fs_info->super_copy;
-	struct btrfs_fs_devices *fs_devices;
 	u64 old_total;
 	u64 diff;
 
@@ -2840,8 +2839,6 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 		return -EINVAL;
 	}
 
-	fs_devices = fs_info->fs_devices;
-
 	btrfs_set_super_total_bytes(super_copy,
 			round_down(old_total + diff, fs_info->sectorsize));
 	device->fs_devices->total_rw_bytes += diff;