Re: BUG: KASAN: use-after-free in bt_for_each+0x1ea/0x29f
On 4/4/18 5:28 PM, Ming Lei wrote:
> Hi,
>
> The following warning is observed once when running dbench on NVMe with
> the linus tree (top commit is 642e7fd23353).
>
> [ 1446.886884] BUG: KASAN: use-after-free in bt_for_each+0x1ea/0x29f
> [ 1446.888045] Read of size 8 at addr 880055a60a00 by task dbench/13443
>
> [...]
Re: 4.15.14 crash with iscsi target and dvd
Bart Van Assche wrote:
> On Sun, 2018-04-01 at 14:27 -0400, Wakko Warner wrote:
> > Wakko Warner wrote:
> > > Wakko Warner wrote:
> > > > I tested 4.14.32 last night with the same oops. 4.9.91 works fine.
> > > > From the initiator, if I do cat /dev/sr1 > /dev/null it works. If I
> > > > mount /dev/sr1 and then do find -type f | xargs cat > /dev/null the
> > > > target crashes. I'm using the builtin iscsi target with pscsi. I can
> > > > burn from the initiator without problems. I'll test other kernels
> > > > between 4.9 and 4.14.
> > >
> > > So I've tested 4.x.y where x is one of 10 11 12 14 15 and y is the
> > > latest patch (except for 4.15, which was 1 behind).
> > > Each of these kernels crashes within seconds of, or immediately after,
> > > doing find -type f | xargs cat > /dev/null from the initiator.
> >
> > I tried 4.10.0. It doesn't completely lock up the system, but the device
> > that was used hangs. So from the initiator it's /dev/sr1 and from the
> > target it's /dev/sr0. Attempting to read /dev/sr0 after the oops causes
> > the process to hang in D state.
>
> Hello Wakko,
>
> Thank you for having narrowed this down further. I think that you
> encountered a regression either in the block layer core or in the SCSI
> core. Unfortunately the number of changes between kernel versions v4.9 and
> v4.10 in these two subsystems is huge. I see two possible ways forward:
> - Either you perform a bisect to identify the patch that introduced this
>   regression. However, I'm not sure whether you are familiar with the
>   bisect process.
> - Or you identify the command that triggers this crash such that others
>   can reproduce this issue without needing access to your setup.
>
> How about reproducing this crash with the below patch applied on top of
> kernel v4.15.x? The additional output sent by this patch to the system log
> should allow us to reproduce this issue by submitting the same SCSI
> command with sg_raw.
Sorry for not getting back in touch. My internet was down. I haven't tried the patch yet. I'll try to get to that tomorrow. The system with the issue is busy and I can't reboot it right now.
BUG: KASAN: use-after-free in bt_for_each+0x1ea/0x29f
Hi,

The following warning is observed once when running dbench on NVMe with
the linus tree (top commit is 642e7fd23353).

[ 1446.882043] ==
[ 1446.886884] BUG: KASAN: use-after-free in bt_for_each+0x1ea/0x29f
[ 1446.888045] Read of size 8 at addr 880055a60a00 by task dbench/13443
[ 1446.889660]
[ 1446.889892] CPU: 1 PID: 13443 Comm: dbench Not tainted 4.16.0_642e7fd23353_master+ #1
[ 1446.891007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[ 1446.892290] Call Trace:
[ 1446.892641] <IRQ>
[ 1446.892937]  dump_stack+0xf0/0x191
[ 1446.893600]  ? dma_direct_map_page+0x6f/0x6f
[ 1446.894425]  ? show_regs_print_info+0xa/0xa
[ 1446.895247]  ? ext4_writepages+0x196d/0x1e6d
[ 1446.896063]  ? do_writepages+0x57/0xa3
[ 1446.896810]  print_address_description+0x6e/0x23b
[ 1446.897882]  ? bt_for_each+0x1ea/0x29f
[ 1446.898693]  kasan_report+0x247/0x285
[ 1446.899484]  bt_for_each+0x1ea/0x29f
[ 1446.900233]  ? blk_mq_tagset_busy_iter+0xa3/0xa3
[ 1446.901190]  ? generic_file_buffered_read+0x14b1/0x14b1
[ 1446.903097]  ? blk_mq_hctx_mark_pending.isra.0+0x5c/0x5c
[ 1446.904418]  ? bio_free+0x64/0xaa
[ 1446.905113]  ? debug_lockdep_rcu_enabled+0x26/0x52
[ 1446.906332]  ? bio_put+0x7a/0x10e
[ 1446.906811]  ? debug_lockdep_rcu_enabled+0x26/0x52
[ 1446.907527]  ? blk_mq_hctx_mark_pending.isra.0+0x5c/0x5c
[ 1446.908334]  blk_mq_queue_tag_busy_iter+0xd0/0xde
[ 1446.909023]  blk_mq_in_flight+0xb4/0xdb
[ 1446.909619]  ? blk_mq_exit_hctx+0x190/0x190
[ 1446.910281]  ? ext4_end_bio+0x25d/0x2a1
[ 1446.911713]  part_in_flight+0xc0/0x2ac
[ 1446.912470]  ? ext4_put_io_end_defer+0x277/0x277
[ 1446.913465]  ? part_dec_in_flight+0x8f/0x8f
[ 1446.914375]  ? __lock_acquire+0x38/0x8e5
[ 1446.915182]  ? bio_endio+0x3d9/0x41c
[ 1446.915936]  ? __rcu_read_unlock+0x134/0x180
[ 1446.916796]  ? lock_acquire+0x2ba/0x32d
[ 1446.917570]  ? blk_account_io_done+0xea/0x572
[ 1446.918424]  part_round_stats+0x167/0x1a3
[ 1446.919188]  ? part_round_stats_single.isra.1+0xc7/0xc7
[ 1446.920187]  blk_account_io_done+0x34d/0x572
[ 1446.921056]  ? blk_update_bidi_request+0x8f/0x8f
[ 1446.921923]  ? blk_mq_run_hw_queue+0x13d/0x187
[ 1446.922803]  blk_mq_end_request+0x3f/0xbf
[ 1446.923631]  nvme_complete_rq+0x305/0x348 [nvme_core]
[ 1446.924612]  ? nvme_delete_ctrl_sync+0x5c/0x5c [nvme_core]
[ 1446.925696]  ? nvme_pci_complete_rq+0x1f6/0x20c [nvme]
[ 1446.926673]  ? kfree+0x21c/0x2ab
[ 1446.927317]  ? nvme_pci_complete_rq+0x1f6/0x20c [nvme]
[ 1446.928239]  __blk_mq_complete_request+0x391/0x3ee
[ 1446.928938]  ? blk_mq_free_request+0x479/0x479
[ 1446.929588]  ? rcu_read_lock_bh_held+0x3a/0x3a
[ 1446.930321]  ? enqueue_hrtimer+0x252/0x29a
[ 1446.930938]  ? do_raw_spin_lock+0xd8/0xd8
[ 1446.931532]  ? debug_lockdep_rcu_enabled+0x26/0x52
[ 1446.932425]  blk_mq_complete_request+0x10e/0x159
[ 1446.933341]  ? hctx_lock+0xe8/0xe8
[ 1446.933985]  ? lock_contended+0x680/0x680
[ 1446.934707]  ? lock_downgrade+0x338/0x338
[ 1446.935463]  nvme_process_cq+0x26a/0x34d [nvme]
[ 1446.936297]  ? nvme_init_hctx+0xa6/0xa6 [nvme]
[ 1446.937150]  nvme_irq+0x23/0x51 [nvme]
[ 1446.937864]  ? nvme_process_cq+0x34d/0x34d [nvme]
[ 1446.938713]  __handle_irq_event_percpu+0x29d/0x568
[ 1446.939516]  ? __irq_wake_thread+0x99/0x99
[ 1446.940241]  ? rcu_user_enter+0x72/0x72
[ 1446.940978]  ? do_timer+0x25/0x25
[ 1446.941650]  ? do_raw_spin_unlock+0x146/0x179
[ 1446.942514]  ? __lock_acquire+0x38/0x8e5
[ 1446.943305]  ? debug_lockdep_rcu_enabled+0x26/0x52
[ 1446.944242]  ? lock_acquire+0x32d/0x32d
[ 1446.944995]  ? lock_contended+0x680/0x680
[ 1446.945718]  handle_irq_event_percpu+0x7c/0xf7
[ 1446.946438]  ? __handle_irq_event_percpu+0x568/0x568
[ 1446.947124]  ? rcu_user_exit+0xa/0xa
[ 1446.947781]  handle_irq_event+0x53/0x83
[ 1446.948553]  handle_edge_irq+0x1f2/0x279
[ 1446.949397]  handle_irq+0x1d8/0x1e9
[ 1446.950094]  do_IRQ+0x90/0x12d
[ 1446.950750]  common_interrupt+0xf/0xf
[ 1446.951507] </IRQ>
[ 1446.951953] RIP: 0010:__blk_mq_get_tag+0x201/0x22d
[ 1446.952894] RSP: 0018:880055b467a0 EFLAGS: 0246 ORIG_RAX: ffdc
[ 1446.954295] RAX:  RBX: 88005952f648 RCX: 
[ 1446.955641] RDX: 0259 RSI:  RDI: ed000ab68d06
[ 1446.956972] RBP: ed000ab68cf6 R08: 0007 R09: 
[ 1446.958356] R10: ed000a0ec0f2 R11: ed000a0ec0f1 R12: 88007f113978
[ 1446.959737] R13: 880055b46ce8 R14: dc00 R15: 880058bf60c0
[ 1446.961184]  ? modules_open+0x5e/0x5e
[ 1446.961922]  ? blk_mq_unique_tag+0xc5/0xc5
[ 1446.962748]  ? lock_acquire+0x32d/0x32d
[ 1446.963534]  ? __rcu_read_unlock+0x134/0x180
[ 1446.964393]  ? rcu_read_lock_bh_held+0x3a/0x3a
[ 1446.965282]  blk_mq_get_tag+0x1ad/0x67a
[ 1446.966079]  ? __blk_mq_tag_idle+0x44/0x44
[ 1446.966891]  ? wait_woken+0x13c/0x13c
[ 1446.967638]  ? debug_lockdep_rcu_enabled+0x26/0x52
[ 1446.968566]  ? lock_acquire+0x32d/0x32d
Re: [PATCH] blk-mq: order getting budget and driver tag
On 4/4/18 10:35 AM, Ming Lei wrote:
> This patch orders getting the budget and the driver tag by making sure to
> acquire the driver tag after the budget is got; this helps to avoid the
> following race:
>
> 1) before dispatching a request from the scheduler queue, get one budget
>    first, then dequeue a request, call it request A.
>
> 2) in another IO path, for dispatching request B, which is from
>    hctx->dispatch, the driver tag is got, then we try to get the budget in
>    blk_mq_dispatch_rq_list(); unfortunately the budget is held by
>    request A.
>
> 3) meantime blk_mq_dispatch_rq_list() is called for dispatching request A,
>    and it tries to get the driver tag first; unfortunately no driver tag
>    is available because the driver tag is held by request B.
>
> 4) neither IO path can move on, and an IO stall is caused.
>
> This issue can be observed when running dbench on USB storage.

Good catch, this can trigger on anything potentially, but of course it is
more likely with limited budget and/or tag space. Classic ABBA deadlock.

-- 
Jens Axboe
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote:
> > In the example above:
> >
> > > > > irq 39, cpu list 0,4
> > > > > irq 40, cpu list 1,6
> > > > > irq 41, cpu list 2,5
> > > > > irq 42, cpu list 3,7
> >
> > and assumed that at driver init time only CPU 0-3 are online then the
> > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7.
>
> Indeed, and I just tested this case, and found that no interrupts are
> delivered to CPU 4-7.
>
> In theory, the affinity has been assigned to these irq vectors, and
> programmed to the interrupt controller; I understand it should work.
>
> Could you explain a bit why interrupts aren't delivered to CPU 4-7?

As I explained before:

 "If the device is already in use when the offline CPUs get hot plugged, then
  the interrupts still stay on cpu 0-3 because the effective affinity of
  interrupts on X86 (and other architectures) is always a single CPU."

IOW, if you set the affinity mask so it contains more than one CPU, then the
kernel selects a single CPU as target. The selected CPU must be online, and
if there is more than one online CPU in the mask, then the kernel picks the
one which has the least number of interrupts targeted at it. This selected
CPU target is programmed into the corresponding interrupt chip
(IOAPIC/MSI/MSIX) and it stays that way until the selected target CPU goes
offline or the affinity mask changes.

The reasons why we use single target delivery on X86 are:

 1) Not all X86 systems support multi target delivery.

 2) If a system supports multi target delivery, then the interrupt is
    preferably delivered to the CPU with the lowest APIC ID (which usually
    corresponds to the lowest CPU number) due to hardware magic, and only a
    very small percentage of interrupts are delivered to the other CPUs in
    the multi target set. So the benefit is rather dubious, and extensive
    performance testing did not show any significant difference.

 3) The management of multi targets on the software side is painful, as the
    same low level vector number has to be allocated on all possible target
    CPUs. That makes a lot of things, including hotplug, more complex for
    very little - if at all - benefit.

So at some point we ripped out the multi target support on X86 and moved
everything to single target delivery mode.

Other architectures never supported multi target delivery, either due to
hardware restrictions or for reasons similar to why X86 dropped it. There
might be a few architectures which support it, but I have no overview at the
moment.

The information is in procfs:

 # cat /proc/irq/9/smp_affinity_list
 0-3
 # cat /proc/irq/9/effective_affinity_list
 1
 # cat /proc/irq/10/smp_affinity_list
 0-3
 # cat /proc/irq/10/effective_affinity_list
 2

smp_affinity[_list] is the affinity which is set either by the kernel or by
writing to /proc/irq/$N/smp_affinity[_list].

effective_affinity[_list] is the affinity which is effective, i.e. the single
target CPU to which the interrupt is affine at this point.

As you can see in the above examples, the target CPU is selected from the
given possible target set, and the internal spreading of the low level x86
vector allocation code picks a CPU which has the lowest number of interrupts
targeted at it.

Let's assume for the example below

 # cat /proc/irq/10/smp_affinity_list
 0-3
 # cat /proc/irq/10/effective_affinity_list
 2

that CPU 3 was offline when the device was initialized. So there was no way
to select it, and when CPU 3 comes online there is no reason to change the
affinity of that interrupt, at least not from the kernel POV. Actually we
don't even have a mechanism to do so automagically.

If I offline CPU 2 after onlining CPU 3, then the kernel has to move the
interrupt away from CPU 2, so it selects CPU 3 as it's the one with the
lowest number of interrupts targeted at it.

Now this is a bit different if you use affinity managed interrupts like NVME
and other devices do.

Many of these devices create one queue per possible CPU, so the spreading is
simple: one interrupt per possible cpu. Pretty boring.

When the device has fewer queues than possible CPUs, then stuff gets more
interesting. The queues, and therefore the interrupts, must be targeted at
multiple CPUs. There is some logic which spreads them over the NUMA nodes
and takes siblings into account when Hyperthreading is enabled.

In both cases the managed interrupts are handled over CPU soft
hotplug/unplug:

 1) If a CPU is soft unplugged and an interrupt is targeted at the CPU, then
    the interrupt is either moved to a still online CPU in the affinity mask
    or, if the outgoing CPU is the last one in the affinity mask, it is shut
    down.

 2) If a CPU is soft plugged, then the interrupts are scanned and the ones
    which are managed and shut down are checked whether the affinity mask
    contains the upcoming CPU. If that's the
[RFC PATCH 04/79] pipe: add inode field to struct pipe_inode_info
From: Jérôme Glisse

Pipes are associated with a file and thus an inode. Store a pointer back to
the inode in struct pipe_inode_info; this will be used when testing that
pages haven't been truncated.

Signed-off-by: Jérôme Glisse
Cc: Eric Biggers
Cc: Kees Cook
Cc: Joe Lawrence
Cc: Willy Tarreau
Cc: Andrew Morton
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 fs/pipe.c                 | 2 ++
 fs/splice.c               | 1 +
 include/linux/pipe_fs_i.h | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/fs/pipe.c b/fs/pipe.c
index 7b1954caf388..41e115b0bde7 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -715,6 +715,7 @@ static struct inode * get_pipe_inode(void)
 	inode->i_pipe = pipe;
 	pipe->files = 2;
+	pipe->inode = inode;
 	pipe->readers = pipe->writers = 1;
 	inode->i_fop = &pipefifo_fops;
@@ -903,6 +904,7 @@ static int fifo_open(struct inode *inode, struct file *filp)
 		pipe = alloc_pipe_info();
 		if (!pipe)
 			return -ENOMEM;
+		pipe->inode = inode;
 		pipe->files = 1;
 		spin_lock(&inode->i_lock);
 		if (unlikely(inode->i_pipe)) {
diff --git a/fs/splice.c b/fs/splice.c
index 39e2dc01ac12..acab52a7fe56 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -927,6 +927,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		 * PIPE_READERS appropriately.
 		 */
 		pipe->readers = 1;
+		pipe->inode = file_inode(in);
 		current->splice_pipe = pipe;
 	}
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 5a3bb3b7c9ad..171aa78ebbf0 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -44,6 +44,7 @@ struct pipe_buffer {
  * @fasync_writers: writer side fasync
  * @bufs: the circular array of pipe buffers
  * @user: the user who created this pipe
+ * @inode: inode this pipe is associated to
 **/
 struct pipe_inode_info {
 	struct mutex mutex;
@@ -60,6 +61,7 @@ struct pipe_inode_info {
 	struct fasync_struct *fasync_writers;
 	struct pipe_buffer *bufs;
 	struct user_struct *user;
+	struct inode *inode;
 };
 
 /*
-- 
2.14.3
[RFC PATCH 06/79] mm/page: add helpers to dereference struct page index field
From: Jérôme Glisse

Regroup all helpers that dereference the struct page.index field into one
place, and require the address_space (mapping) against which the caller is
looking up the index (offset, pgoff, ...).

Signed-off-by: Jérôme Glisse
Cc: linux...@kvack.org
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 include/linux/mm-page.h | 136 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h      |   5 ++
 2 files changed, 141 insertions(+)
 create mode 100644 include/linux/mm-page.h

diff --git a/include/linux/mm-page.h b/include/linux/mm-page.h
new file mode 100644
index 000000000000..2981db45eeef
--- /dev/null
+++ b/include/linux/mm-page.h
@@ -0,0 +1,136 @@
+/*
+ * Copyright 2018 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse
+ */
+/*
+ * This header file regroups everything that deals with struct page and has no
+ * outside dependency except basic types header files.
+ */
+/* Protected against rogue include ... do not include this file directly */
+#ifdef DOT_NOT_INCLUDE___INSIDE_MM
+#ifndef MM_PAGE_H
+#define MM_PAGE_H
+
+/* External struct dependencies: */
+struct address_space;
+
+/* External function dependencies: */
+extern pgoff_t __page_file_index(struct page *page);
+
+/*
+ * _page_index() - return page index value (with special case for swap)
+ * @page: page struct pointer for which we want the index value
+ * @mapping: mapping against which we want the page index
+ * Returns: index value for the page in the given mapping
+ *
+ * The index value of a page is against a given mapping, and pages which
+ * belong to the swap cache need special handling. For a swap cache page what
+ * we want is the swap offset, which is stored encoded with other fields in
+ * page->private.
+ */
+static inline unsigned long _page_index(struct page *page,
+					struct address_space *mapping)
+{
+	if (unlikely(PageSwapCache(page)))
+		return __page_file_index(page);
+	return page->index;
+}
+
+/*
+ * _page_set_index() - set page index value against a given mapping
+ * @page: page struct pointer for which we want to set the index value
+ * @mapping: mapping against which we want to set the page index
+ * @index: index value to set
+ */
+static inline void _page_set_index(struct page *page,
+				   struct address_space *mapping,
+				   unsigned long index)
+{
+	page->index = index;
+}
+
+/*
+ * _page_to_index() - page index value against a given mapping
+ * @page: page struct pointer for which we want the index value
+ * @mapping: mapping against which we want the page index
+ * Returns: index value for the page in the given mapping
+ *
+ * The index value of a page is against a given mapping. THP pages need
+ * special handling, as the index is only set in the head page; the final
+ * index value is the head page index plus the distance in pages from the
+ * head page to the current page.
+ */
+static inline unsigned long _page_to_index(struct page *page,
+					   struct address_space *mapping)
+{
+	unsigned long pgoff;
+
+	if (likely(!PageTransTail(page)))
+		return page->index;
+
+	/*
+	 * We don't initialize ->index for tail pages: calculate based on
+	 * head page
+	 */
+	pgoff = compound_head(page)->index;
+	pgoff += page - compound_head(page);
+	return pgoff;
+}
+
+/*
+ * _page_to_pgoff() - page pgoff value against a given mapping
+ * @page: page struct pointer for which we want the pgoff value
+ * @mapping: mapping against which we want the page pgoff
+ * Returns: pgoff value for the page in the given mapping
+ *
+ * The pgoff value of a page is against a given mapping. Hugetlb pages need
+ * special handling, as they have page->index in units of the huge page size
+ * (PMD_SIZE or PUD_SIZE), not in PAGE_SIZE as other types of pages.
+ *
+ * FIXME convert hugetlb to multi-order entries.
+ */
+static inline unsigned long _page_to_pgoff(struct page *page,
+					   struct address_space *mapping)
+{
+	if (unlikely(PageHeadHuge(page)))
+		return page->index << compound_order(page);
+
+	return _page_to_index(page, mapping);
+}
+
+/*
+ * _page_offset() - page offset (in bytes) against a given mapping
+ * @page: page
[RFC PATCH 05/79] mm/swap: add a helper to get the address_space from a swp_entry_t
From: Jérôme Glisse

Each swap entry is associated with a file and thus an address_space. That
address_space is used for reading/writing to swap storage. This patch adds a
helper to get the address_space from a swp_entry_t.

Signed-off-by: Jérôme Glisse
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Andrew Morton
---
 include/linux/swap.h | 1 +
 mm/swapfile.c        | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a1a3f4ed94ce..e2155df84d77 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -475,6 +475,7 @@ extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
+struct address_space *swap_entry_to_address_space(swp_entry_t swap);
 extern bool reuse_swap_page(struct page *, int *);
 extern int try_to_free_swap(struct page *);
 struct backing_dev_info;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c7a33717d079..a913d4b45866 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3467,6 +3467,13 @@ struct swap_info_struct *swp_swap_info(swp_entry_t entry)
 	return swap_info[swp_type(entry)];
 }
 
+struct address_space *swap_entry_to_address_space(swp_entry_t swap)
+{
+	struct swap_info_struct *sis = swp_swap_info(swap);
+
+	return sis->swap_file->f_mapping;
+}
+
 struct swap_info_struct *page_swap_info(struct page *page)
 {
 	swp_entry_t entry = { .val = page_private(page) };
-- 
2.14.3
[RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue
From: Jérôme Glisse

https://cgit.freedesktop.org/~glisse/linux/log/?h=generic-write-protection-rfc

This is an RFC for LSF/MM discussions. It impacts the file subsystem, the
block subsystem and the mm subsystem. Hence it would benefit from a cross
sub-system discussion.

The patchset is not fully baked, so take it with a grain of salt. I use it to
illustrate the fact that it is doable, and now that I did it once I believe I
have a better and cleaner plan in my head on how to do this. I intend to
share and discuss it at LSF/MM (I still need to write it down). That plan
leads to quite different individual steps than this patchset takes, and it is
also easier to split up into more manageable pieces.

I also want to apologize for the size and number of patches (and I am not
even sending them all).

--
The Why ?

I have two objectives: duplicate memory read only across nodes and/or
devices, and work around PCIE atomic limitations. More on each of those
objectives below.

I also want to put forward that it can solve the page wait list issue, ie
having each page with its own wait list and thus avoiding the long wait list
traversal latency recently reported [1].

It also allows KSM for file backed pages (truly generic KSM, even between
anonymous and file backed pages). I am not sure how useful this can be; this
was not an objective I pursued, it is just a free feature (see below).

[1] https://groups.google.com/forum/#!topic/linux.kernel/Iit1P5BNyX8

--
Per page wait list, so long page_waitqueue() !

Not implemented in this RFC, but below is the logic, with pseudo code at the
bottom of this email.

When there is contention on a struct page lock bit, the caller which is
trying to lock the page will add itself to a waitqueue. The issue here is
that multiple pages share the same wait queue, and on a large system with a
lot of ram we can quickly get a long list of waiters for different pages (or
for the same page) on the same list [1].
The present patchset virtually kills all places that need to access the
page->mapping field; only a handful are left, namely for testing page
truncation and for vmscan. The former can be removed if we reuse the
PG_waiters flag for a new PG_truncate flag set on truncation; then we can
virtually kill all dereferences of page->mapping (this patchset proves it is
doable).

NOTE THIS DOES NOT MEAN THAT MAPPING IS FREE TO BE USED BY ANYONE TO STORE
WHATEVER IN STRUCT PAGE. SORRY NO !

What this means is that whenever a thread wants to spin on a page until it
can lock it, it can carefully replace the page->mapping pointer with a
pointer to a waiter struct for a wait list. Thus each page under contention
will have its own wait list.

The fact that there are not many places left that dereference page->mapping
is important, because any such dereference must now be done with preemption
disabled (inside an rcu read section) so that a waiter can free the waiter
struct without fear of a hazard (the struct is on the stack, like today).
Pseudo code is at the end of this mail.

The devil is in the details, but after long meditation and pondering on this
I believe this is a doable solution. Note it does not rely on the write
protection, nor does it technically need to kill all struct page mapping
dereferences. But the latter can really hurt performance if they have to be
done under rcu read lock, with the corresponding grace period needed before
freeing the waiter struct.

--
KSM for everyone !

With generic write protection you can do KSM for file backed pages too (even
if they have different offset, mapping or buffer_head). While I believe page
sharing for containers is already solved with overlayfs, this might still be
an interesting feature for some.

Oh, and crazy on top of crazy, you can merge private anonymous pages and
file backed pages together ... Probably totally useless, but cool, like
crazy.
--
KDM (Kernel Duplicate Memory)

Most kernel development, especially in the mm sub-system, is about how to
save resources and how to share as much of them as possible, so that we
maximize their availability for all processes.

The objective here is slightly different. Some users favor performance and
already have properly sized systems (ie they have enough resources for the
task at hand). For performance it is sometimes better to use more resources
to improve other parameters of the performance equation. This is especially
true for big systems that either use several devices or spread across
several nodes, or both. For those, sharing memory means peer to peer
traffic. This can become a bottleneck and saturate the interconnect between
those peers.

If some data set under consideration is accessed read only, then we can
duplicate the memory backing it on multiple nodes/devices. Access is then
local to
[RFC PATCH 07/79] mm/page: add helpers to find mapping given a page and buffer head
From: Jérôme Glisse

For now this simply uses the existing page_mapping() inline. Later it will
use the buffer head pointer as a key to look up the mapping for a write
protected page.

Signed-off-by: Jérôme Glisse
Cc: linux...@kvack.org
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 include/linux/mm-page.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/include/linux/mm-page.h b/include/linux/mm-page.h
index 2981db45eeef..647a8a8cf9ba 100644
--- a/include/linux/mm-page.h
+++ b/include/linux/mm-page.h
@@ -132,5 +132,17 @@ static inline unsigned long _page_file_offset(struct page *page,
 	return page->index << PAGE_SHIFT;
 }
 
+/*
+ * fs_page_mapping_get_with_bh() - page mapping knowing buffer_head
+ * @page: page struct pointer for which we want the mapping
+ * @bh: buffer_head associated with the page for the mapping
+ * Returns: page mapping for the given buffer head
+ */
+static inline struct address_space *fs_page_mapping_get_with_bh(
+		struct page *page, struct buffer_head *bh)
+{
+	return page_mapping(page);
+}
+
 #endif /* MM_PAGE_H */
 #endif /* DOT_NOT_INCLUDE___INSIDE_MM */
-- 
2.14.3
[RFC PATCH 09/79] fs: add struct address_space to read_cache_page() callback argument
From: Jérôme Glisse

Add struct address_space to the callback arguments of read_cache_page()
and read_cache_pages(). Note this patch only adds the argument and
modifies the callback function signatures; it does not make use of the
new argument, so it should be regression-free.

One step toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 drivers/staging/lustre/lustre/mdc/mdc_request.c | 3 ++-
 fs/9p/vfs_addr.c                                | 13 -
 fs/afs/file.c                                   | 7 ---
 fs/afs/internal.h                               | 2 +-
 fs/exofs/inode.c                                | 5 +++--
 fs/fuse/file.c                                  | 3 ++-
 fs/gfs2/aops.c                                  | 5 +++--
 fs/jffs2/file.c                                 | 6 --
 fs/jffs2/fs.c                                   | 2 +-
 fs/jffs2/os-linux.h                             | 3 ++-
 fs/nfs/dir.c                                    | 3 ++-
 fs/nfs/read.c                                   | 3 ++-
 fs/nfs/symlink.c                                | 6 --
 include/linux/pagemap.h                         | 8 ++--
 mm/filemap.c                                    | 14 +++---
 mm/readahead.c                                  | 4 ++--
 16 files changed, 61 insertions(+), 26 deletions(-)

diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 03e55bca4ada..4814ef083824 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -1122,7 +1122,8 @@ struct readpage_param {
  * in PAGE_SIZE (if PAGE_SIZE greater than LU_PAGE_SIZE), and the
  * lu_dirpage for this integrated page will be adjusted.
  **/
-static int mdc_read_page_remote(void *data, struct page *page0)
+static int mdc_read_page_remote(void *data, struct address_space *mapping,
+                                struct page *page0)
 {
         struct readpage_param *rp = data;
         struct page **page_pool;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index e1cbdfdb7c68..61f70e63a525 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -99,6 +99,17 @@ static int v9fs_vfs_readpage(struct file *filp, struct page *page)
         return v9fs_fid_readpage(filp->private_data, page);
 }
 
+/*
+ * This wrapper is needed to avoid forcing callback cast on read_cache_pages()
+ * and defeating compiler figuring out we are doing something wrong.
+ */
+static int v9fs_vfs_readpage_filler(void *data, struct address_space *mapping,
+                                    struct page *page)
+{
+        return v9fs_vfs_readpage(data, page);
+}
+
+
 /**
  * v9fs_vfs_readpages - read a set of pages from 9P
  *
@@ -122,7 +133,7 @@ static int v9fs_vfs_readpages(struct file *filp, struct address_space *mapping,
         if (ret == 0)
                 return ret;
 
-        ret = read_cache_pages(mapping, pages, (void *)v9fs_vfs_readpage, filp);
+        ret = read_cache_pages(mapping, pages, v9fs_vfs_readpage_filler, filp);
         p9_debug(P9_DEBUG_VFS, " = %d\n", ret);
         return ret;
 }
diff --git a/fs/afs/file.c b/fs/afs/file.c
index a39192ced99e..f457b0144946 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -247,7 +247,8 @@ int afs_fetch_data(struct afs_vnode *vnode, struct key *key, struct afs_read *de
 /*
  * read page from file, directory or symlink, given a key to use
  */
-int afs_page_filler(void *data, struct page *page)
+int afs_page_filler(void *data, struct address_space *mapping,
+                    struct page *page)
 {
         struct inode *inode = page->mapping->host;
         struct afs_vnode *vnode = AFS_FS_I(inode);
@@ -373,14 +374,14 @@ static int afs_readpage(struct file *file, struct page *page)
         if (file) {
                 key = afs_file_key(file);
                 ASSERT(key != NULL);
-                ret = afs_page_filler(key, page);
+                ret = afs_page_filler(key, page->mapping, page);
         } else {
                 struct inode *inode = page->mapping->host;
                 key = afs_request_key(AFS_FS_S(inode->i_sb)->cell);
                 if (IS_ERR(key)) {
                         ret = PTR_ERR(key);
                 } else {
-                        ret = afs_page_filler(key, page);
+                        ret = afs_page_filler(key, page->mapping, page);
                         key_put(key);
                 }
         }
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index f38d6a561a84..4c449145f668 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -656,7 +656,7 @@ extern void afs_put_wb_key(struct afs_wb_key *);
 extern int
[RFC PATCH 08/79] mm/page: add helpers to find page mapping and private given a bio
From: Jérôme Glisse

When a page undergoes I/O it is associated with a unique bio, and thus
we can use that bio to look up other page fields which are relevant
only for the bio under consideration. Note this only applies when the
page is special, i.e. page->mapping is pointing to some special
structure which is not a valid struct address_space.

Signed-off-by: Jérôme Glisse
Cc: linux...@kvack.org
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 include/linux/mm-page.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/mm-page.h b/include/linux/mm-page.h
index 647a8a8cf9ba..6ec3ba19b1a4 100644
--- a/include/linux/mm-page.h
+++ b/include/linux/mm-page.h
@@ -24,6 +24,7 @@
 /* External struct dependencies: */
 struct address_space;
+struct bio;
 
 /* External function dependencies: */
 extern pgoff_t __page_file_index(struct page *page);
@@ -144,5 +145,13 @@ static inline struct address_space *fs_page_mapping_get_with_bh(
         return page_mapping(page);
 }
 
+static inline void bio_page_mapping_and_private(struct page *page,
+                struct bio *bio, struct address_space **mappingp,
+                unsigned long *privatep)
+{
+        *mappingp = page->mapping;
+        *privatep = page_private(page);
+}
+
 #endif /* MM_PAGE_H */
 #endif /* DOT_NOT_INCLUDE___INSIDE_MM */
-- 
2.14.3
[RFC PATCH 22/79] fs: add struct inode to block_read_full_page() arguments
From: Jérôme Glisse

Add struct inode to block_read_full_page(). Note this patch only adds
the argument and modifies call sites conservatively using
page->mapping, so the end result is as before this patch.

One step toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 fs/adfs/inode.c             | 2 +-
 fs/affs/file.c              | 2 +-
 fs/befs/linuxvfs.c          | 3 ++-
 fs/bfs/file.c               | 2 +-
 fs/block_dev.c              | 2 +-
 fs/buffer.c                 | 4 ++--
 fs/efs/inode.c              | 2 +-
 fs/ext4/readpage.c          | 3 ++-
 fs/freevxfs/vxfs_subr.c     | 2 +-
 fs/hfs/inode.c              | 2 +-
 fs/hfsplus/inode.c          | 3 ++-
 fs/minix/inode.c            | 2 +-
 fs/mpage.c                  | 2 +-
 fs/ocfs2/aops.c             | 3 ++-
 fs/ocfs2/refcounttree.c     | 3 ++-
 fs/omfs/file.c              | 2 +-
 fs/qnx4/inode.c             | 2 +-
 fs/reiserfs/inode.c         | 3 ++-
 fs/sysv/itree.c             | 2 +-
 fs/ufs/inode.c              | 3 ++-
 include/linux/buffer_head.h | 2 +-
 21 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/fs/adfs/inode.c b/fs/adfs/inode.c
index 1100d5da84d0..2270ab3d5392 100644
--- a/fs/adfs/inode.c
+++ b/fs/adfs/inode.c
@@ -45,7 +45,7 @@ static int adfs_writepage(struct address_space *mapping, struct page *page,
 static int adfs_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return block_read_full_page(page, adfs_get_block);
+        return block_read_full_page(page->mapping->host, page, adfs_get_block);
 }
 
 static void adfs_write_failed(struct address_space *mapping, loff_t to)
diff --git a/fs/affs/file.c b/fs/affs/file.c
index 55ab72c1b228..136cb90f332f 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -379,7 +379,7 @@ static int affs_writepage(struct address_space *mapping, struct page *page,
 static int affs_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return block_read_full_page(page, affs_get_block);
+        return block_read_full_page(page->mapping->host, page, affs_get_block);
 }
 
 static void affs_write_failed(struct address_space *mapping, loff_t to)
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index f6844b4ae77f..4436123674d3 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -112,7 +112,8 @@
 static int
 befs_readpage(struct file *file, struct address_space *mapping,
               struct page *page)
 {
-        return block_read_full_page(page, befs_get_block);
+        return block_read_full_page(page->mapping->host, page,
+                                    befs_get_block);
 }
 
 static sector_t
diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index 1c4593429f7d..b1255ee4cd75 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -160,7 +160,7 @@ static int bfs_writepage(struct address_space *mapping, struct page *page,
 static int bfs_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return block_read_full_page(page, bfs_get_block);
+        return block_read_full_page(page->mapping->host, page, bfs_get_block);
 }
 
 static void bfs_write_failed(struct address_space *mapping, loff_t to)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2bf1b17aeff3..9ac6bf760272 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -571,7 +571,7 @@ static int blkdev_writepage(struct address_space *mapping, struct page *page,
 static int blkdev_readpage(struct file * file, struct address_space *mapping,
                 struct page * page)
 {
-        return block_read_full_page(page, blkdev_get_block);
+        return block_read_full_page(page->mapping->host,page,blkdev_get_block);
 }
 
 static int blkdev_readpages(struct file *file, struct address_space *mapping,
diff --git a/fs/buffer.c b/fs/buffer.c
index 99818e876ad8..aa7d9be68581 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2231,9 +2231,9 @@ EXPORT_SYMBOL(block_is_partially_uptodate);
  * set/clear_buffer_uptodate() functions propagate buffer state into the
  * page struct once IO has completed.
  */
-int block_read_full_page(struct page *page, get_block_t *get_block)
+int block_read_full_page(struct inode *inode, struct page *page,
+                         get_block_t *get_block)
 {
-        struct inode *inode = page->mapping->host;
         sector_t iblock, lblock;
         struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
         unsigned int blocksize, bbits;
diff --git a/fs/efs/inode.c b/fs/efs/inode.c
index 05aab4a5e8a1..a2f47227124e 100644
--- a/fs/efs/inode.c
+++ b/fs/efs/inode.c
@@ -16,7 +16,7 @@ static int efs_readpage(struct file
[RFC PATCH 24/79] fs: add struct inode to nobh_writepage() arguments
From: Jérôme Glisse

Add struct inode to nobh_writepage(). Note this patch only adds the
argument and modifies call sites conservatively using page->mapping,
so the end result is as before this patch.

One step toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 fs/buffer.c                 | 5 ++---
 fs/ext2/inode.c             | 2 +-
 fs/gfs2/aops.c              | 3 ++-
 include/linux/buffer_head.h | 4 ++--
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index aa7d9be68581..31298f4f0300 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2730,10 +2730,9 @@ EXPORT_SYMBOL(nobh_write_end);
  * that it tries to operate without attaching bufferheads to
  * the page.
  */
-int nobh_writepage(struct page *page, get_block_t *get_block,
-                   struct writeback_control *wbc)
+int nobh_writepage(struct inode *inode, struct page *page,
+                   get_block_t *get_block, struct writeback_control *wbc)
 {
-        struct inode * const inode = page->mapping->host;
         loff_t i_size = i_size_read(inode);
         const pgoff_t end_index = i_size >> PAGE_SHIFT;
         unsigned offset;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 37439d1e544c..11b3c3e7ea65 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -926,7 +926,7 @@ static int ext2_nobh_writepage(struct address_space *mapping,
                 struct page *page, struct writeback_control *wbc)
 {
-        return nobh_writepage(page, ext2_get_block, wbc);
+        return nobh_writepage(page->mapping->host, page, ext2_get_block, wbc);
 }
 
 static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 8cfd4c7d884c..ff02313b86e6 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -142,7 +142,8 @@ static int gfs2_writepage(struct address_space *mapping, struct page *page,
         if (ret <= 0)
                 return ret;
 
-        return nobh_writepage(page, gfs2_get_block_noalloc, wbc);
+        return nobh_writepage(page->mapping->host, page,
+                              gfs2_get_block_noalloc, wbc);
 }
 
 /* This is the same as calling block_write_full_page, but it also
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index cab143668834..fb68a3358330 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -265,8 +265,8 @@ int nobh_write_end(struct file *, struct address_space *,
                                 loff_t, unsigned, unsigned,
                                 struct page *, void *);
 int nobh_truncate_page(struct address_space *, loff_t, get_block_t *);
-int nobh_writepage(struct page *page, get_block_t *get_block,
-                   struct writeback_control *wbc);
+int nobh_writepage(struct inode *inode, struct page *page,
+                   get_block_t *get_block, struct writeback_control *wbc);
 
 void buffer_init(void);
-- 
2.14.3
[RFC PATCH 27/79] fs: add struct address_space to fscache_read*() callback arguments
From: Jérôme Glisse

Add struct address_space to the fscache_read*() callback arguments.
Note this patch only adds the argument and modifies call sites
conservatively using page->mapping, so the end result is as before
this patch.

One step toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: David Howells
Cc: linux-cach...@redhat.com
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 fs/9p/cache.c                 | 4 +++-
 fs/afs/file.c                 | 4 +++-
 fs/ceph/cache.c               | 10 ++
 fs/cifs/fscache.c             | 6 --
 fs/fscache/page.c             | 1 +
 fs/nfs/fscache.c              | 4 +++-
 include/linux/fscache-cache.h | 2 +-
 include/linux/fscache.h       | 9 ++---
 8 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/fs/9p/cache.c b/fs/9p/cache.c
index 8185bfe4492f..3f122d35c54d 100644
--- a/fs/9p/cache.c
+++ b/fs/9p/cache.c
@@ -273,7 +273,8 @@ void __v9fs_fscache_invalidate_page(struct address_space *mapping,
         }
 }
 
-static void v9fs_vfs_readpage_complete(struct page *page, void *data,
+static void v9fs_vfs_readpage_complete(struct address_space *mapping,
+                                       struct page *page, void *data,
                                        int error)
 {
         if (!error)
@@ -299,6 +300,7 @@ int __v9fs_readpage_from_fscache(struct inode *inode, struct page *page)
                 return -ENOBUFS;
 
         ret = fscache_read_or_alloc_page(v9inode->fscache,
+                                         page->mapping,
                                          page,
                                          v9fs_vfs_readpage_complete,
                                          NULL,
diff --git a/fs/afs/file.c b/fs/afs/file.c
index f87e997b9df9..23ff51343dd3 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -203,7 +203,8 @@ void afs_put_read(struct afs_read *req)
 /*
  * deal with notification that a page was read from the cache
  */
-static void afs_file_readpage_read_complete(struct page *page,
+static void afs_file_readpage_read_complete(struct address_space *mapping,
+                                            struct page *page,
                                             void *data,
                                             int error)
 {
@@ -271,6 +272,7 @@ int afs_page_filler(void *data, struct address_space *mapping,
         /* is it cached? */
 #ifdef CONFIG_AFS_FSCACHE
         ret = fscache_read_or_alloc_page(vnode->cache,
+                                         page->mapping,
                                          page,
                                          afs_file_readpage_read_complete,
                                          NULL,
diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c
index a3ab265d3215..14438f1ed7e0 100644
--- a/fs/ceph/cache.c
+++ b/fs/ceph/cache.c
@@ -266,7 +266,9 @@ void ceph_fscache_file_set_cookie(struct inode *inode, struct file *filp)
         }
 }
 
-static void ceph_readpage_from_fscache_complete(struct page *page, void *data, int error)
+static void ceph_readpage_from_fscache_complete(struct address_space *mapping,
+                                                struct page *page, void *data,
+                                                int error)
 {
         if (!error)
                 SetPageUptodate(page);
@@ -293,9 +295,9 @@ int ceph_readpage_from_fscache(struct inode *inode, struct page *page)
         if (!cache_valid(ci))
                 return -ENOBUFS;
 
-        ret = fscache_read_or_alloc_page(ci->fscache, page,
-                                         ceph_readpage_from_fscache_complete, NULL,
-                                         GFP_KERNEL);
+        ret = fscache_read_or_alloc_page(ci->fscache, page->mapping, page,
+                                         ceph_readpage_from_fscache_complete,
+                                         NULL, GFP_KERNEL);
 
         switch (ret) {
         case 0: /* Page found */
diff --git a/fs/cifs/fscache.c b/fs/cifs/fscache.c
index 8d4b7bc8ae91..25f259a83fe0 100644
--- a/fs/cifs/fscache.c
+++ b/fs/cifs/fscache.c
@@ -140,7 +140,8 @@ int cifs_fscache_release_page(struct page *page, gfp_t gfp)
         return 1;
 }
 
-static void cifs_readpage_from_fscache_complete(struct page *page, void *ctx,
+static void cifs_readpage_from_fscache_complete(struct address_space *mapping,
+                                                struct page *page, void *ctx,
                                                 int error)
 {
         cifs_dbg(FYI, "%s: (0x%p/%d)\n", __func__, page, error);
@@ -158,7 +159,8 @@ int __cifs_readpage_from_fscache(struct inode *inode, struct page *page)
         cifs_dbg(FYI, "%s: (fsc:%p, p:%p, i:0x%p\n",
[RFC PATCH 20/79] fs: add struct address_space to write_cache_pages() callback argument
From: Jérôme Glisse

Add struct address_space to the callback arguments of
write_cache_pages(). Note this patch only adds the argument and
modifies all callback function signatures; it does not make use of the
new argument, so it should be regression-free.

One step toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 fs/exofs/inode.c          | 2 +-
 fs/ext4/inode.c           | 7 +++
 fs/fuse/file.c            | 1 +
 fs/mpage.c                | 6 +++---
 fs/nfs/write.c            | 4 +++-
 fs/xfs/xfs_aops.c         | 3 ++-
 include/linux/writeback.h | 4 ++--
 mm/page-writeback.c       | 9 -
 8 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 41f6b04cbfca..54d6b7dbd4e7 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -691,7 +691,7 @@ static int write_exec(struct page_collect *pcol)
  * previous segment and will start a new collection.
  * Eventually caller must submit the last segment if present.
  */
-static int writepage_strip(struct page *page,
+static int writepage_strip(struct page *page, struct address_space *mapping,
                 struct writeback_control *wbc_unused, void *data)
 {
         struct page_collect *pcol = data;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 96dcae1937c8..63bf0160c579 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2697,10 +2697,9 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
         return err;
 }
 
-static int __writepage(struct page *page, struct writeback_control *wbc,
-                       void *data)
+static int __writepage(struct page *page, struct address_space *mapping,
+                       struct writeback_control *wbc, void *data)
 {
-        struct address_space *mapping = data;
         int ret = ext4_writepage(mapping, page, wbc);
         mapping_set_error(mapping, ret);
         return ret;
@@ -2746,7 +2745,7 @@ static int ext4_writepages(struct address_space *mapping,
                 struct blk_plug plug;
 
                 blk_start_plug(&plug);
-                ret = write_cache_pages(mapping, wbc, __writepage, mapping);
+                ret = write_cache_pages(mapping, wbc, __writepage, NULL);
                 blk_finish_plug(&plug);
                 goto out_writepages;
         }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3c602632b33a..e0562d04d84f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1794,6 +1794,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 }
 
 static int fuse_writepages_fill(struct page *page,
+                struct address_space *mapping,
                 struct writeback_control *wbc, void *_data)
 {
         struct fuse_fill_wb_data *data = _data;
diff --git a/fs/mpage.c b/fs/mpage.c
index b03a82d5b908..d25f08f46090 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -479,8 +479,8 @@ void clean_page_buffers(struct page *page)
         clean_buffers(page, ~0U);
 }
 
-static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
-                             void *data)
+static int __mpage_writepage(struct page *page, struct address_space *_mapping,
+                             struct writeback_control *wbc, void *data)
 {
         struct mpage_data *mpd = data;
         struct bio *bio = mpd->bio;
@@ -734,7 +734,7 @@ int mpage_writepage(struct page *page, get_block_t get_block,
                 .get_block = get_block,
                 .use_writepage = 0,
         };
-        int ret = __mpage_writepage(page, wbc, &mpd);
+        int ret = __mpage_writepage(page, page->mapping, wbc, &mpd);
         if (mpd.bio) {
                 int op_flags = (wbc->sync_mode == WB_SYNC_ALL ?
                                 REQ_SYNC : 0);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1f7723eff542..ffab026b9632 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -693,7 +693,9 @@ int nfs_writepage(struct address_space *mapping, struct page *page,
         return ret;
 }
 
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+                                   struct address_space *mapping,
+                                   struct writeback_control *wbc, void *data)
 {
         int ret;
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 981a2a4e00e5..00922a82ede6 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1060,6 +1060,7 @@ xfs_writepage_map(
 STATIC int
 xfs_do_writepage(
         struct page             *page,
+        struct address_space    *mapping,
         struct writeback_control *wbc,
         void                    *data)
 {
@@ -1179,7 +1180,7 @@ xfs_vm_writepage(
         };
         int
[RFC PATCH 26/79] fs: add struct address_space to mpage_readpage() arguments
From: Jérôme Glisse

Add struct address_space to mpage_readpage(). Note this patch only
adds the argument and modifies call sites conservatively using
page->mapping, so the end result is as before this patch.

One step toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton
---
 fs/ext2/inode.c       | 2 +-
 fs/fat/inode.c        | 2 +-
 fs/gfs2/aops.c        | 2 +-
 fs/hpfs/file.c        | 2 +-
 fs/isofs/inode.c      | 2 +-
 fs/jfs/inode.c        | 2 +-
 fs/mpage.c            | 14 --
 fs/nilfs2/inode.c     | 2 +-
 fs/qnx6/inode.c       | 2 +-
 fs/udf/inode.c        | 2 +-
 fs/xfs/xfs_aops.c     | 2 +-
 include/linux/mpage.h | 3 ++-
 12 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 11b3c3e7ea65..33873c0a4c14 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -872,7 +872,7 @@ static int ext2_writepage(struct address_space *mapping, struct page *page,
 static int ext2_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return mpage_readpage(page, ext2_get_block);
+        return mpage_readpage(page->mapping, page, ext2_get_block);
 }
 
 static int
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 4b70dcbcd192..9e6bc6364468 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -197,7 +197,7 @@ static int fat_writepages(struct address_space *mapping,
 static int fat_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return mpage_readpage(page, fat_get_block);
+        return mpage_readpage(page->mapping, page, fat_get_block);
 }
 
 static int fat_readpages(struct file *file, struct address_space *mapping,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ff02313b86e6..b42775bba6a1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -524,7 +524,7 @@ static int __gfs2_readpage(void *file, struct address_space *mapping,
                 error = stuffed_readpage(ip, page);
                 unlock_page(page);
         } else {
-                error = mpage_readpage(page, gfs2_block_map);
+                error = mpage_readpage(page->mapping, page, gfs2_block_map);
         }
 
         if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c
index 3f2cc3fcee80..620dd9709a2c 100644
--- a/fs/hpfs/file.c
+++ b/fs/hpfs/file.c
@@ -118,7 +118,7 @@ static int hpfs_get_block(struct inode *inode, sector_t iblock, struct buffer_he
 static int hpfs_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return mpage_readpage(page, hpfs_get_block);
+        return mpage_readpage(page->mapping, page, hpfs_get_block);
 }
 
 static int hpfs_writepage(struct address_space *mapping, struct page *page,
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 541d89e0621a..7d73b1036321 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -1171,7 +1171,7 @@ struct buffer_head *isofs_bread(struct inode *inode, sector_t block)
 static int isofs_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return mpage_readpage(page, isofs_get_block);
+        return mpage_readpage(page->mapping, page, isofs_get_block);
 }
 
 static int isofs_readpages(struct file *file, struct address_space *mapping,
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index be71214f4937..be6da161bc81 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -297,7 +297,7 @@ static int jfs_writepages(struct address_space *mapping,
 static int jfs_readpage(struct file *file, struct address_space *mapping,
                 struct page *page)
 {
-        return mpage_readpage(page, jfs_get_block);
+        return mpage_readpage(page->mapping, page, jfs_get_block);
 }
 
 static int jfs_readpages(struct file *file, struct address_space *mapping,
diff --git a/fs/mpage.c b/fs/mpage.c
index 8800bcde5f4e..52a6028e2066 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -143,12 +143,13 @@ map_buffer_to_page(struct inode *inode, struct page *page,
  * get_block() call.
  */
 static struct bio *
-do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
+do_mpage_readpage(struct bio *bio, struct address_space *mapping,
+                struct page *page, unsigned nr_pages,
                 sector_t *last_block_in_bio, struct buffer_head *map_bh,
                 unsigned long *first_logical_block, get_block_t get_block,
                 gfp_t gfp)
 {
-        struct inode *inode = page->mapping->host;
+        struct inode *inode = mapping->host;
         const unsigned blkbits = inode->i_blkbits;
         const unsigned
[RFC PATCH 30/79] fs/block: add struct address_space to __block_write_begin() arguments
From: Jérôme Glisse

Add struct address_space to __block_write_begin() arguments. One step
toward dropping reliance on page->mapping.

--
@exists@
identifier M;
expression E1, E2, E3, E4;
@@
struct address_space *M;
...
-__block_write_begin(E1, E2, E3, E4)
+__block_write_begin(M, E1, E2, E3, E4)

@exists@
identifier M, F;
expression E1, E2, E3, E4;
@@
F(..., struct address_space *M, ...) {
...
-__block_write_begin(E1, E2, E3, E4)
+__block_write_begin(M, E1, E2, E3, E4)
...
}

@exists@
identifier I;
expression E1, E2, E3, E4, E5;
@@
struct inode *I;
...
-__block_write_begin(E1, E2, E3, E4)
+__block_write_begin(I->i_mapping, E1, E2, E3, E4)

@exists@
identifier I, F;
expression E1, E2, E3, E4;
@@
F(..., struct inode *I, ...) {
...
-__block_write_begin(E1, E2, E3, E4)
+__block_write_begin(I->i_mapping, E1, E2, E3, E4)
...
}

@exists@
identifier P;
expression E1, E2, E3, E4, E5;
@@
struct page *P;
...
-__block_write_begin(E1, E2, E3, E4)
+__block_write_begin(P->mapping, E1, E2, E3, E4)

@exists@
identifier P, F;
expression E1, E2, E3, E4;
@@
F(..., struct page *P, ...) {
...
-__block_write_begin(E1, E2, E3, E4)
+__block_write_begin(P->mapping, E1, E2, E3, E4)
...
}
--

Signed-off-by: Jérôme Glisse
Cc: Jens Axboe
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/buffer.c                 | 10 +-
 fs/ext2/dir.c               | 3 ++-
 fs/ext4/inline.c            | 7 ---
 fs/ext4/inode.c             | 8 +---
 fs/gfs2/aops.c              | 2 +-
 fs/minix/inode.c            | 3 ++-
 fs/nilfs2/dir.c             | 3 ++-
 fs/ocfs2/file.c             | 2 +-
 fs/reiserfs/inode.c         | 8 +---
 fs/sysv/itree.c             | 2 +-
 fs/ufs/inode.c              | 3 ++-
 include/linux/buffer_head.h | 4 ++--
 12 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 8b2eb3dfb539..de16588d7f7f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2028,8 +2028,8 @@
         return err;
 }
 
-int __block_write_begin(struct page *page, loff_t pos, unsigned len,
-                get_block_t *get_block)
+int __block_write_begin(struct address_space *mapping, struct page *page,
+                loff_t pos, unsigned len, get_block_t *get_block)
 {
         return __block_write_begin_int(page, pos, len, get_block, NULL);
 }
@@ -2090,7 +2090,7 @@ int block_write_begin(struct address_space *mapping, loff_t pos, unsigned len,
         if (!page)
                 return -ENOMEM;
 
-        status = __block_write_begin(page, pos, len, get_block);
+        status = __block_write_begin(mapping, page, pos, len, get_block);
         if (unlikely(status)) {
                 unlock_page(page);
                 put_page(page);
@@ -2495,7 +2495,7 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
         else
                 end = PAGE_SIZE;
 
-        ret = __block_write_begin(page, 0, end, get_block);
+        ret = __block_write_begin(inode->i_mapping, page, 0, end, get_block);
         if (!ret)
                 ret = block_commit_write(page, 0, end);
@@ -2579,7 +2579,7 @@ int nobh_write_begin(struct address_space *mapping,
         *fsdata = NULL;
 
         if (page_has_buffers(page)) {
-                ret = __block_write_begin(page, pos, len, get_block);
+                ret = __block_write_begin(mapping, page, pos, len, get_block);
                 if (unlikely(ret))
                         goto out_release;
                 return ret;
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 3b8114def693..0d116d4e923c 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -453,7 +453,8 @@ ino_t ext2_inode_by_name(struct inode *dir, const struct qstr *child)
 static int ext2_prepare_chunk(struct page *page, loff_t pos, unsigned len)
 {
-        return __block_write_begin(page, pos, len, ext2_get_block);
+        return __block_write_begin(page->mapping, page, pos, len,
+                        ext2_get_block);
 }
 
 /* Releases the page */
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 70cf4c7b268a..ffdbd443c67a 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -580,10 +580,11 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping,
                 goto out;
 
         if (ext4_should_dioread_nolock(inode)) {
-                ret = __block_write_begin(page, from, to,
+                ret = __block_write_begin(mapping, page, from, to,
                                           ext4_get_block_unwritten);
         } else
-                ret = __block_write_begin(page,
[RFC PATCH 28/79] fs: introduce page_is_truncated() helper
From: Jérôme Glisse

Simple helper to unify all truncation tests under one logic. This also
unifies logic that was a bit different in various places. Conversion
done using the following coccinelle spatch on the fs and mm
directories:

--
@@
struct page * ppage;
@@
-!ppage->mapping
+page_is_truncated(ppage, mapping)

@@
struct page * ppage;
@@
-ppage->mapping != mapping
+page_is_truncated(ppage, mapping)

@@
struct page * ppage;
@@
-ppage->mapping != inode->i_mapping
+page_is_truncated(ppage, inode->i_mapping)
--

Followed by:
git checkout mm/migrate.c mm/huge_memory.c mm/memory-failure.c
git checkout mm/memcontrol.c fs/ext4/page-io.c fs/reiserfs/journal.c

Hand editing:
mm/memory.c do_page_mkwrite()
fs/splice.c splice_to_pipe()
fs/nfs/dir.c cache_page_release()
fs/xfs/xfs_aops.c xfs_check_page_type()
fs/xfs/xfs_aops.c xfs_vm_set_page_dirty()
fs/buffer.c mark_buffer_write_io_error()
fs/buffer.c page_cache_seek_hole_data()

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
Cc: Jeff Layton

fixup! fs: introduce page_is_truncated() helper
---
 drivers/staging/lustre/lustre/llite/llite_mmap.c | 7 +--
 fs/9p/vfs_file.c                                 | 2 +-
 fs/afs/write.c                                   | 2 +-
 fs/btrfs/extent_io.c                             | 4 ++--
 fs/btrfs/file.c                                  | 2 +-
 fs/btrfs/inode.c                                 | 7 ---
 fs/btrfs/ioctl.c                                 | 6 +++---
 fs/btrfs/scrub.c                                 | 2 +-
 fs/buffer.c                                      | 8
 fs/ceph/addr.c                                   | 6 +++---
 fs/cifs/file.c                                   | 2 +-
 fs/ext4/inode.c                                  | 10 +-
 fs/ext4/mballoc.c                                | 8
 fs/f2fs/checkpoint.c                             | 4 ++--
 fs/f2fs/data.c                                   | 8
 fs/f2fs/file.c                                   | 2 +-
 fs/f2fs/super.c                                  | 2 +-
 fs/fuse/file.c                                   | 2 +-
 fs/gfs2/aops.c                                   | 2 +-
 fs/gfs2/file.c                                   | 4 ++--
 fs/iomap.c                                       | 2 +-
 fs/nfs/dir.c                                     | 2 +-
 fs/nilfs2/file.c                                 | 2 +-
 fs/ocfs2/aops.c                                  | 2 +-
 fs/ocfs2/mmap.c                                  | 2 +-
 fs/splice.c                                      | 2 +-
 fs/ubifs/file.c                                  | 2 +-
 fs/xfs/xfs_aops.c                                | 8 +---
 include/linux/pagemap.h                          | 16
 mm/filemap.c                                     | 12 ++--
 mm/memory.c                                      | 5 -
 mm/page-writeback.c                              | 2 +-
 mm/truncate.c                                    | 12 ++--
 33 files changed, 92 insertions(+), 67 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/llite_mmap.c b/drivers/staging/lustre/lustre/llite/llite_mmap.c
index c0533bd6f352..6a9d310a7bfd 100644
--- a/drivers/staging/lustre/lustre/llite/llite_mmap.c
+++ b/drivers/staging/lustre/lustre/llite/llite_mmap.c
@@ -191,7 +191,7 @@ static int ll_page_mkwrite0(struct vm_area_struct *vma, struct page *vmpage,
         struct ll_inode_info *lli = ll_i2info(inode);
 
         lock_page(vmpage);
-        if (!vmpage->mapping) {
+        if (page_is_truncated(vmpage, inode->i_mapping)) {
                 unlock_page(vmpage);
 
                 /* page was truncated and lock was cancelled, return
@@ -341,10 +341,13 @@ static int ll_fault(struct vm_fault *vmf)
         LASSERT(!(result & VM_FAULT_LOCKED));
         if (result == 0) {
                 struct page *vmpage = vmf->page;
+                struct address_space *mapping;
+
+                mapping = vmf->vma->vm_file ?
+                        vmf->vma->vm_file->f_mapping : 0;
 
                 /* check if this page has been truncated */
                 lock_page(vmpage);
-                if (unlikely(!vmpage->mapping)) { /* unlucky */
+                if (unlikely(page_is_truncated(vmpage, mapping))) { /* unlucky */
                         unlock_page(vmpage);
                         put_page(vmpage);
                         vmf->page = NULL;
diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 03c9e325bfbc..bf71ea1d7ff6 100644
---
[RFC PATCH 32/79] fs/block: do not rely on page->mapping get it from the context
From: Jérôme Glisse

This patch removes most dereferences of page->mapping, getting the
mapping from the call context instead (either already available in the
function or added to the function arguments).

Signed-off-by: Jérôme Glisse
Cc: Jens Axboe
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/block_dev.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 502b6643bc74..dd9da97615e3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -564,14 +564,14 @@ EXPORT_SYMBOL(thaw_bdev);
 static int blkdev_writepage(struct address_space *mapping, struct page *page,
                 struct writeback_control *wbc)
 {
-        return block_write_full_page(page->mapping->host, page,
+        return block_write_full_page(mapping->host, page,
                         blkdev_get_block, wbc);
 }
 
 static int blkdev_readpage(struct file * file, struct address_space *mapping,
                 struct page * page)
 {
-        return block_read_full_page(page->mapping->host,page,blkdev_get_block);
+        return block_read_full_page(mapping->host,page,blkdev_get_block);
 }
 
 static int blkdev_readpages(struct file *file, struct address_space *mapping,
@@ -1941,7 +1941,7 @@ EXPORT_SYMBOL_GPL(blkdev_read_iter);
 static int blkdev_releasepage(struct address_space *mapping, struct page *page,
                 gfp_t wait)
 {
-        struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
+        struct super_block *super = BDEV_I(mapping->host)->bdev.bd_super;
 
         if (super && super->s_op->bdev_try_to_free_page)
                 return super->s_op->bdev_try_to_free_page(super, page, wait);
-- 
2.14.3
[RFC PATCH 31/79] fs/block: add struct address_space to __block_write_begin_int() args
From: Jérôme Glisse

Add struct address_space to __block_write_begin_int() arguments. One
step toward dropping reliance on page->mapping.

--
@exists@
identifier M;
expression E1, E2, E3, E4, E5;
@@
struct address_space *M;
...
-__block_write_begin_int(E1, E2, E3, E4, E5)
+__block_write_begin_int(M, E1, E2, E3, E4, E5)

@exists@
identifier M, F;
expression E1, E2, E3, E4, E5;
@@
F(..., struct address_space *M, ...) {
...
-__block_write_begin_int(E1, E2, E3, E4, E5)
+__block_write_begin_int(M, E1, E2, E3, E4, E5)
...
}
--

Signed-off-by: Jérôme Glisse
Cc: Jens Axboe
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/buffer.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index de16588d7f7f..c83878d0a4c0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1943,8 +1943,9 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 	}
 }
 
-int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
-		get_block_t *get_block, struct iomap *iomap)
+int __block_write_begin_int(struct address_space *mapping, struct page *page,
+		loff_t pos, unsigned len, get_block_t *get_block,
+		struct iomap *iomap)
 {
 	unsigned from = pos & (PAGE_SIZE - 1);
 	unsigned to = from + len;
@@ -2031,7 +2032,8 @@ int __block_write_begin_int(struct page *page, loff_t pos, unsigned len,
 int __block_write_begin(struct address_space *mapping, struct page *page,
 		loff_t pos, unsigned len, get_block_t *get_block)
 {
-	return __block_write_begin_int(page, pos, len, get_block, NULL);
+	return __block_write_begin_int(mapping, page, pos, len, get_block,
+				       NULL);
 }
 EXPORT_SYMBOL(__block_write_begin);
-- 
2.14.3
[RFC PATCH 33/79] fs/journal: add struct super_block to jbd2_journal_forget() arguments.
From: Jérôme Glisse

For the holy crusade to stop relying on the struct page mapping field,
add struct super_block to jbd2_journal_forget() arguments.

spatch --sp-file zemantic-010a.spatch --in-place --dir fs/

--
@exists@
expression E1, E2;
identifier I;
@@
struct super_block *I;
...
-jbd2_journal_forget(E1, E2)
+jbd2_journal_forget(E1, I, E2)

@exists@
expression E1, E2;
identifier F, I;
@@
F(..., struct super_block *I, ...) {
...
-jbd2_journal_forget(E1, E2)
+jbd2_journal_forget(E1, I, E2)
...
}

@exists@
expression E1, E2;
identifier I;
@@
struct block_device *I;
...
-jbd2_journal_forget(E1, E2)
+jbd2_journal_forget(E1, I->bd_super, E2)

@exists@
expression E1, E2;
identifier F, I;
@@
F(..., struct block_device *I, ...) {
...
-jbd2_journal_forget(E1, E2)
+jbd2_journal_forget(E1, I->bd_super, E2)
...
}

@exists@
expression E1, E2;
identifier I;
@@
struct inode *I;
...
-jbd2_journal_forget(E1, E2)
+jbd2_journal_forget(E1, I->i_sb, E2)

@exists@
expression E1, E2;
identifier F, I;
@@
F(..., struct inode *I, ...) {
...
-jbd2_journal_forget(E1, E2)
+jbd2_journal_forget(E1, I->i_sb, E2)
...
}
--

Signed-off-by: Jérôme Glisse
Cc: "Theodore Ts'o"
Cc: Jan Kara
Cc: linux-e...@vger.kernel.org
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
---
 fs/ext4/ext4_jbd2.c   | 2 +-
 fs/jbd2/revoke.c      | 2 +-
 fs/jbd2/transaction.c | 3 ++-
 include/linux/jbd2.h  | 3 ++-
 4 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 2d593201cf7a..0804d564b529 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -224,7 +224,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
 	    (!is_metadata && !ext4_should_journal_data(inode))) {
 		if (bh) {
 			BUFFER_TRACE(bh, "call jbd2_journal_forget");
-			err = jbd2_journal_forget(handle, bh);
+			err = jbd2_journal_forget(handle, inode->i_sb, bh);
 			if (err)
 				ext4_journal_abort_handle(where, line, __func__,
 							  bh, handle, err);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 696ef15ec942..b6e2fd52acd6 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -381,7 +381,7 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 	set_buffer_revokevalid(bh);
 	if (bh_in) {
 		BUFFER_TRACE(bh_in, "call jbd2_journal_forget");
-		jbd2_journal_forget(handle, bh_in);
+		jbd2_journal_forget(handle, bdev->bd_super, bh_in);
 	} else {
 		BUFFER_TRACE(bh, "call brelse");
 		__brelse(bh);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ac311037d7a5..e8c50bb5822c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1482,7 +1482,8 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
  * Allow this call even if the handle has aborted --- it may be part of
  * the caller's cleanup after an abort.
  */
-int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
+int jbd2_journal_forget (handle_t *handle, struct super_block *sb,
+			 struct buffer_head *bh)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index b708e5169d1d..d89749a179eb 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1358,7 +1358,8 @@ extern int jbd2_journal_get_undo_access(handle_t *, struct buffer_head *);
 void	 jbd2_journal_set_triggers(struct buffer_head *,
 				   struct jbd2_buffer_trigger_type *type);
 extern int jbd2_journal_dirty_metadata (handle_t *, struct buffer_head *);
-extern int jbd2_journal_forget (handle_t *, struct buffer_head *);
+extern int jbd2_journal_forget (handle_t *, struct super_block *sb,
+				struct buffer_head *);
 extern void journal_sync_buffer (struct buffer_head *);
 extern int jbd2_journal_invalidatepage(journal_t *, struct page *,
 				       unsigned int, unsigned int);
-- 
2.14.3
[RFC PATCH 29/79] fs/block: add struct address_space to bdev_write_page() arguments
From: Jérôme Glisse

Add struct address_space to bdev_write_page() arguments. One step
toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Jens Axboe
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/block_dev.c         | 4 +++-
 fs/mpage.c             | 2 +-
 include/linux/blkdev.h | 5 +++--
 mm/page_io.c           | 7 ++++---
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 9ac6bf760272..502b6643bc74 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -678,6 +678,7 @@ EXPORT_SYMBOL_GPL(bdev_read_page);
  * bdev_write_page() - Start writing a page to a block device
  * @bdev:	The device to write the page to
  * @sector:	The offset on the device to write the page to (need not be aligned)
+ * @mapping:	The address space the page belongs to
  * @page:	The page to write
  * @wbc:	The writeback_control for the write
  *
@@ -694,7 +695,8 @@ EXPORT_SYMBOL_GPL(bdev_read_page);
  * Return: negative errno if an error occurs, 0 if submission was successful.
  */
 int bdev_write_page(struct block_device *bdev, sector_t sector,
-			struct page *page, struct writeback_control *wbc)
+			struct address_space *mapping, struct page *page,
+			struct writeback_control *wbc)
 {
 	int result;
 	const struct block_device_operations *ops = bdev->bd_disk->fops;
diff --git a/fs/mpage.c b/fs/mpage.c
index 52a6028e2066..a75cea232f1a 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -619,7 +619,7 @@ static int __mpage_writepage(struct page *page, struct address_space *_mapping,
 	if (bio == NULL) {
 		if (first_unmapped == blocks_per_page) {
 			if (!bdev_write_page(bdev, blocks[0] << (blkbits - 9),
-					     page, wbc))
+					     mapping, page, wbc))
 				goto out;
 		}
 		bio = mpage_alloc(bdev, blocks[0] << (blkbits - 9),
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ed63f3b69c12..0cf66b6993f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -2053,8 +2053,9 @@ struct block_device_operations {
 extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
 				 unsigned long);
 extern int bdev_read_page(struct block_device *, sector_t, struct page *);
-extern int bdev_write_page(struct block_device *, sector_t, struct page *,
-						struct writeback_control *);
+extern int bdev_write_page(struct block_device *bdev, sector_t sector,
+			struct address_space *mapping, struct page *page,
+			struct writeback_control *wbc);
 
 #ifdef CONFIG_BLK_DEV_ZONED
 bool blk_req_needs_zone_write_lock(struct request *rq);
diff --git a/mm/page_io.c b/mm/page_io.c
index 402231dd1286..6e548b588490 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -282,12 +282,12 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	struct bio *bio;
 	int ret;
 	struct swap_info_struct *sis = page_swap_info(page);
+	struct file *swap_file = sis->swap_file;
+	struct address_space *mapping = swap_file->f_mapping;
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
 	if (sis->flags & SWP_FILE) {
 		struct kiocb kiocb;
-		struct file *swap_file = sis->swap_file;
-		struct address_space *mapping = swap_file->f_mapping;
 		struct bio_vec bv = {
 			.bv_page = page,
 			.bv_len  = PAGE_SIZE,
@@ -325,7 +325,8 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 		return ret;
 	}
 
-	ret = bdev_write_page(sis->bdev, swap_page_sector(page), page, wbc);
+	ret = bdev_write_page(sis->bdev, swap_page_sector(page),
+			      mapping, page, wbc);
 	if (!ret) {
 		count_swpout_vm_event(page);
 		return 0;
-- 
2.14.3
[RFC PATCH 34/79] fs/journal: add struct super_block to jbd2_journal_revoke() arguments.
From: Jérôme Glisse

For the holy crusade to stop relying on the struct page mapping field,
add struct super_block to jbd2_journal_revoke() arguments.

spatch --sp-file zemantic-011a.spatch --in-place --dir fs/

--
@exists@
expression E1, E2, E3;
identifier I;
@@
struct super_block *I;
...
-jbd2_journal_revoke(E1, E2, E3)
+jbd2_journal_revoke(E1, E2, I, E3)

@exists@
expression E1, E2, E3;
identifier F, I;
@@
F(..., struct super_block *I, ...) {
...
-jbd2_journal_revoke(E1, E2, E3)
+jbd2_journal_revoke(E1, E2, I, E3)
...
}

@exists@
expression E1, E2, E3;
identifier I;
@@
struct inode *I;
...
-jbd2_journal_revoke(E1, E2, E3)
+jbd2_journal_revoke(E1, E2, I->i_sb, E3)

@exists@
expression E1, E2, E3;
identifier F, I;
@@
F(..., struct inode *I, ...) {
...
-jbd2_journal_revoke(E1, E2, E3)
+jbd2_journal_revoke(E1, E2, I->i_sb, E3)
...
}
--

Signed-off-by: Jérôme Glisse
Cc: "Theodore Ts'o"
Cc: Jan Kara
Cc: linux-e...@vger.kernel.org
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Signed-off-by: Jérôme Glisse
---
 fs/ext4/ext4_jbd2.c  | 2 +-
 fs/jbd2/revoke.c     | 2 +-
 include/linux/jbd2.h | 3 ++-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 0804d564b529..5529badca994 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -237,7 +237,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
 	 * data!=journal && (is_metadata || should_journal_data(inode))
 	 */
 	BUFFER_TRACE(bh, "call jbd2_journal_revoke");
-	err = jbd2_journal_revoke(handle, blocknr, bh);
+	err = jbd2_journal_revoke(handle, blocknr, inode->i_sb, bh);
 	if (err) {
 		ext4_journal_abort_handle(where, line, __func__,
 					  bh, handle, err);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index b6e2fd52acd6..71e690ad9d44 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -320,7 +320,7 @@ void jbd2_journal_destroy_revoke(journal_t *journal)
  */
 int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
-			struct buffer_head *bh_in)
+			struct super_block *sb, struct buffer_head *bh_in)
 {
 	struct buffer_head *bh = NULL;
 	journal_t *journal;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index d89749a179eb..c5133df80fd4 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1450,7 +1450,8 @@ extern void jbd2_journal_destroy_revoke_caches(void);
 extern int  jbd2_journal_init_revoke_caches(void);
 
 extern void jbd2_journal_destroy_revoke(journal_t *);
-extern int  jbd2_journal_revoke (handle_t *, unsigned long long, struct buffer_head *);
+extern int  jbd2_journal_revoke (handle_t *, unsigned long long,
+				 struct super_block *, struct buffer_head *);
 extern int  jbd2_journal_cancel_revoke(handle_t *, struct journal_head *);
 extern void jbd2_journal_write_revoke_records(transaction_t *transaction,
 					      struct list_head *log_bufs);
-- 
2.14.3
[RFC PATCH 38/79] fs/buffer: add first buffer flag for first buffer_head in a page
From: Jérôme Glisse

A common pattern in the code is that we have a buffer_head and want the
first buffer_head in the buffer_head list for a page. Before this patch
that was simply done with page_buffers(bh->b_page).

This patch introduces a helper, bh_first_for_page(struct buffer_head *),
which can use a new flag (also introduced in this patch) to find the
first buffer_head struct for a given page. The helper uses
page_buffers(bh->b_page) for now, but a later patch can update it to
handle special pages differently and instead scan the buffer_head list
until a buffer_head with the first_for_page flag set is found.

Signed-off-by: Jérôme Glisse
Cc: Jens Axboe
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/buffer.c                 |  4 ++--
 include/linux/buffer_head.h | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 422204701a3b..44beba15c38d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -276,7 +276,7 @@ static void end_buffer_async_read(struct address_space *mapping,
 	 * two buffer heads end IO at almost the same time and both
 	 * decide that the page is now completely done.
 	 */
-	first = page_buffers(page);
+	first = bh_first_for_page(bh);
 	local_irq_save(flags);
 	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
 	clear_buffer_async_read(bh);
@@ -332,7 +332,7 @@ void end_buffer_async_write(struct address_space *mapping, struct page *page,
 		SetPageError(page);
 	}
 
-	first = page_buffers(page);
+	first = bh_first_for_page(bh);
 	local_irq_save(flags);
 	bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 7ae60f59f27e..22e79307c055 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -39,6 +39,12 @@ enum bh_state_bits {
 	BH_Prio,	/* Buffer should be submitted with REQ_PRIO */
 	BH_Defer_Completion, /* Defer AIO completion to workqueue */
 
+	/*
+	 * First buffer_head for a page ie page->private is pointing to this
+	 * buffer_head struct.
+	 */
+	BH_FirstForPage,
+
 	BH_PrivateStart,/* not a state bit, but the first bit available
 			 * for private allocation by other entities
 			 */
@@ -135,6 +141,7 @@ BUFFER_FNS(Unwritten, unwritten)
 BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
+BUFFER_FNS(FirstForPage, first_for_page)
 
 #define bh_offset(bh)	((unsigned long)(bh)->b_data & ~PAGE_MASK)
 
@@ -278,11 +285,22 @@ void buffer_init(void);
  * inline definitions
  */
 
+/*
+ * bh_first_for_page - return first buffer_head for a page
+ * @bh: buffer_head for which we want the first buffer_head for same page
+ * Returns: first buffer_head within the same page as given buffer_head
+ */
+static inline struct buffer_head *bh_first_for_page(struct buffer_head *bh)
+{
+	return page_buffers(bh->b_page);
+}
+
 static inline void attach_page_buffers(struct page *page,
 		struct buffer_head *head)
 {
 	get_page(page);
 	SetPagePrivate(page);
+	set_buffer_first_for_page(head);
 	set_page_private(page, (unsigned long)head);
 }
-- 
2.14.3
[RFC PATCH 35/79] fs/buffer: add struct address_space and struct page to end_io callback
From: Jérôme Glisse

For the holy crusade to stop relying on the struct page mapping field,
add struct address_space and struct page to the end_io callback of the
buffer head. Callers of this callback then have more context to find
the matching page and mapping.

Signed-off-by: Jérôme Glisse
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 drivers/md/md-bitmap.c      |  3 ++-
 fs/btrfs/disk-io.c          |  3 ++-
 fs/buffer.c                 | 26 ++++++++++++++++----------
 fs/ext4/ext4.h              |  3 ++-
 fs/ext4/ialloc.c            |  3 ++-
 fs/gfs2/meta_io.c           |  2 +-
 fs/jbd2/commit.c            |  3 ++-
 fs/ntfs/aops.c              |  9 ++++++---
 fs/reiserfs/journal.c       |  6 ++++--
 include/linux/buffer_head.h | 12 ++++++++----
 10 files changed, 46 insertions(+), 24 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 239c7bb3929b..717e99eabce9 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -313,7 +313,8 @@ static void write_page(struct bitmap *bitmap, struct page *page, int wait)
 		bitmap_file_kick(bitmap);
 }
 
-static void end_bitmap_write(struct buffer_head *bh, int uptodate)
+static void end_bitmap_write(struct address_space *mapping, struct page *page,
+			     struct buffer_head *bh, int uptodate)
 {
 	struct bitmap *bitmap = bh->b_private;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a976ccc6036b..df789cfdebd7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3123,7 +3123,8 @@ int open_ctree(struct super_block *sb,
 }
 ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
 
-static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
+static void btrfs_end_buffer_write_sync(struct address_space *mapping,
+		struct page *page, struct buffer_head *bh, int uptodate)
 {
 	if (uptodate) {
 		set_buffer_uptodate(bh);
diff --git a/fs/buffer.c b/fs/buffer.c
index c83878d0a4c0..9f2c5e90b64d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -159,14 +159,16 @@ static void __end_buffer_read_notouch(struct buffer_head *bh, int uptodate)
 * Default synchronous end-of-IO handler..  Just mark it up-to-date and
 * unlock the buffer. This is what ll_rw_block uses too.
 */
-void end_buffer_read_sync(struct buffer_head *bh, int uptodate)
+void end_buffer_read_sync(struct address_space *mapping, struct page *page,
+			  struct buffer_head *bh, int uptodate)
 {
 	__end_buffer_read_notouch(bh, uptodate);
 	put_bh(bh);
 }
 EXPORT_SYMBOL(end_buffer_read_sync);
 
-void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
+void end_buffer_write_sync(struct address_space *mapping, struct page *page,
+			   struct buffer_head *bh, int uptodate)
 {
 	if (uptodate) {
 		set_buffer_uptodate(bh);
@@ -250,12 +252,12 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 * I/O completion handler for block_read_full_page() - pages
 * which come unlocked at the end of I/O.
 */
-static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
+static void end_buffer_async_read(struct address_space *mapping,
+		struct page *page, struct buffer_head *bh, int uptodate)
 {
 	unsigned long flags;
 	struct buffer_head *first;
 	struct buffer_head *tmp;
-	struct page *page;
 	int page_uptodate = 1;
 
 	BUG_ON(!buffer_async_read(bh));
@@ -311,12 +313,12 @@ static void end_buffer_async_read(struct buffer_head *bh, int uptodate)
 * Completion handler for block_write_full_page() - pages which are unlocked
 * during I/O, and which have PageWriteback cleared upon I/O completion.
 */
-void end_buffer_async_write(struct buffer_head *bh, int uptodate)
+void end_buffer_async_write(struct address_space *mapping, struct page *page,
+			    struct buffer_head *bh, int uptodate)
 {
 	unsigned long flags;
 	struct buffer_head *first;
 	struct buffer_head *tmp;
-	struct page *page;
 
 	BUG_ON(!buffer_async_write(bh));
 
@@ -2311,7 +2313,7 @@ int block_read_full_page(struct inode *inode, struct page *page,
 	for (i = 0; i < nr; i++) {
 		bh = arr[i];
 		if (buffer_uptodate(bh))
-			end_buffer_async_read(bh, 1);
+			end_buffer_async_read(inode->i_mapping, page, bh, 1);
 		else
 			submit_bh(REQ_OP_READ, 0, bh);
 	}
@@ -2517,7 +2519,8 @@ EXPORT_SYMBOL(block_page_mkwrite);
 * immediately, while under the page lock.  So it needs a special end_io
 * handler which does not touch the bh after
[RFC PATCH 36/79] fs/buffer: add struct super_block to bforget() arguments
From: Jérôme Glisse

For the holy crusade to stop relying on the struct page mapping field,
add struct super_block to bforget() arguments.

spatch --sp-file zemantic-012a.spatch --in-place --dir fs/

--
@exists@
expression E1;
identifier I;
@@
struct super_block *I;
...
-bforget(E1)
+bforget(I, E1)

@exists@
expression E1;
identifier F, I;
@@
F(..., struct super_block *I, ...) {
...
-bforget(E1)
+bforget(I, E1)
...
}

@exists@
expression E1;
identifier I;
@@
struct inode *I;
...
-bforget(E1)
+bforget(I->i_sb, E1)

@exists@
expression E1;
identifier F, I;
@@
F(..., struct inode *I, ...) {
...
-bforget(E1)
+bforget(I->i_sb, E1)
...
}
--

Signed-off-by: Jérôme Glisse
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/bfs/file.c               | 2 +-
 fs/ext2/inode.c             | 4 ++--
 fs/ext2/xattr.c             | 4 ++--
 fs/ext4/ext4_jbd2.c         | 2 +-
 fs/fat/dir.c                | 4 ++--
 fs/jfs/resize.c             | 2 +-
 fs/minix/itree_common.c     | 6 +++---
 fs/reiserfs/journal.c       | 2 +-
 fs/reiserfs/resize.c        | 2 +-
 fs/sysv/itree.c             | 6 +++---
 fs/ufs/util.c               | 2 +-
 include/linux/buffer_head.h | 2 +-
 12 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/bfs/file.c b/fs/bfs/file.c
index b1255ee4cd75..6d66cc137bc3 100644
--- a/fs/bfs/file.c
+++ b/fs/bfs/file.c
@@ -41,7 +41,7 @@ static int bfs_move_block(unsigned long from, unsigned long to,
 	new = sb_getblk(sb, to);
 	memcpy(new->b_data, bh->b_data, bh->b_size);
 	mark_buffer_dirty(new);
-	bforget(bh);
+	bforget(sb, bh);
 	brelse(new);
 	return 0;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 33873c0a4c14..83ea6ad2cefa 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -536,7 +536,7 @@ static int ext2_alloc_branch(struct inode *inode,
 
 failed:
 	for (i = 1; i < n; i++)
-		bforget(branch[i].bh);
+		bforget(inode->i_sb, branch[i].bh);
 	for (i = 0; i < indirect_blks; i++)
 		ext2_free_blocks(inode, new_blocks[i], 1);
 	ext2_free_blocks(inode, new_blocks[i], num);
@@ -1167,7 +1167,7 @@ static void ext2_free_branches(struct inode *inode, __le32 *p, __le32 *q, int de
 					   (__le32*)bh->b_data,
 					   (__le32*)bh->b_data + addr_per_block,
 					   depth);
-			bforget(bh);
+			bforget(inode->i_sb, bh);
 			ext2_free_blocks(inode, nr, 1);
 			mark_inode_dirty(inode);
 		}
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 62d9a659a8ff..c77edf9afbce 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -733,7 +733,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
 			/* We let our caller release old_bh, so we
 			 * need to duplicate the buffer before. */
 			get_bh(old_bh);
-			bforget(old_bh);
+			bforget(sb, old_bh);
 		} else {
 			/* Decrement the refcount only. */
 			le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
@@ -802,7 +802,7 @@ ext2_xattr_delete_inode(struct inode *inode)
 			   bh->b_blocknr);
 		ext2_free_blocks(inode, EXT2_I(inode)->i_file_acl, 1);
 		get_bh(bh);
-		bforget(bh);
+		bforget(inode->i_sb, bh);
 		unlock_buffer(bh);
 	} else {
 		le32_add_cpu(&HDR(bh)->h_refcount, -1);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 5529badca994..60fbf5336059 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -211,7 +211,7 @@ int __ext4_forget(const char *where, unsigned int line, handle_t *handle,
 
 	/* In the no journal case, we can just do a bforget and return */
 	if (!ext4_handle_valid(handle)) {
-		bforget(bh);
+		bforget(inode->i_sb, bh);
 		return 0;
 	}
 
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index 8e100c3bf72c..b801f3d0220b 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -1126,7 +1126,7 @@ static int fat_zeroed_cluster(struct inode *dir, sector_t blknr, int nr_used,
 
 error:
 	for (i = 0; i < n; i++)
-		bforget(bhs[i]);
+		bforget(sb, bhs[i]);
 	return err;
 }
 
@@ -1266,7 +1266,7 @@ static int fat_add_new_entries(struct inode *dir, void *slots, int
[RFC PATCH 37/79] fs/buffer: add struct super_block to __bforget() arguments
From: Jérôme Glisse

For the holy crusade to stop relying on the struct page mapping field,
add struct super_block to __bforget() arguments.

spatch --sp-file zemantic-013a.spatch --in-place --dir fs/
spatch --sp-file zemantic-013a.spatch --in-place --dir include/ --include-headers

--
@exists@
expression E1;
identifier I;
@@
struct super_block *I;
...
-__bforget(E1)
+__bforget(I, E1)

@exists@
expression E1;
identifier F, I;
@@
F(..., struct super_block *I, ...) {
...
-__bforget(E1)
+__bforget(I, E1)
...
}

@exists@
expression E1;
identifier I;
@@
struct inode *I;
...
-__bforget(E1)
+__bforget(I->i_sb, E1)

@exists@
expression E1;
identifier F, I;
@@
F(..., struct inode *I, ...) {
...
-__bforget(E1)
+__bforget(I->i_sb, E1)
...
}
--

Signed-off-by: Jérôme Glisse
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/buffer.c                 | 2 +-
 fs/jbd2/transaction.c       | 2 +-
 include/linux/buffer_head.h | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9f2c5e90b64d..422204701a3b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1168,7 +1168,7 @@ EXPORT_SYMBOL(__brelse);
 * bforget() is like brelse(), except it discards any
 * potentially dirty data.
 */
-void __bforget(struct buffer_head *bh)
+void __bforget(struct super_block *sb, struct buffer_head *bh)
 {
 	clear_buffer_dirty(bh);
 	if (bh->b_assoc_map) {
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e8c50bb5822c..177616eb793c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1560,7 +1560,7 @@ int jbd2_journal_forget (handle_t *handle, struct super_block *sb,
 		if (!buffer_jbd(bh)) {
 			spin_unlock(&journal->j_list_lock);
 			jbd_unlock_bh_state(bh);
-			__bforget(bh);
+			__bforget(sb, bh);
 			goto drop;
 		}
 	}
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 82faae102ba2..7ae60f59f27e 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -192,7 +192,7 @@ struct buffer_head *__find_get_block(struct block_device *bdev, sector_t block,
 struct buffer_head *__getblk_gfp(struct block_device *bdev, sector_t block,
 				 unsigned size, gfp_t gfp);
 void __brelse(struct buffer_head *);
-void __bforget(struct buffer_head *);
+void __bforget(struct super_block *, struct buffer_head *);
 void __breadahead(struct block_device *, sector_t block, unsigned int size);
 struct buffer_head *__bread_gfp(struct block_device *, sector_t block,
 				unsigned size, gfp_t gfp);
@@ -306,7 +306,7 @@ static inline void brelse(struct buffer_head *bh)
 static inline void bforget(struct super_block *sb, struct buffer_head *bh)
 {
 	if (bh)
-		__bforget(bh);
+		__bforget(sb, bh);
 }
 
 static inline struct buffer_head *
-- 
2.14.3
[RFC PATCH 50/79] fs: stop relying on mapping field of struct page, get it from context
From: Jérôme Glisse

Holy grail: remove all usage of the mapping field of struct page inside
common fs code.

spatch --sp-file zemantic-015a.spatch --in-place fs/*.c

--
@exists@
struct page * P;
identifier I;
@@
struct address_space *I;
...
-P->mapping
+I

@exists@
identifier F, I;
struct page * P;
@@
F(..., struct address_space *I, ...) {
...
-P->mapping
+I
...
}

@@
@@
-mapping = mapping;

@@
@@
-struct address_space *mapping = _mapping;
--

Hand edit: fs/mpage.c __mpage_writepage(), the coccinelle semantic patch
is too hard ...

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/buffer.c | 11 +++++------
 fs/libfs.c  |  2 +-
 fs/mpage.c  |  9 ++++-----
 3 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b968ac0b65e8..39d8c7315b55 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -327,7 +327,7 @@ void end_buffer_async_write(struct address_space *mapping, struct page *page,
 		set_buffer_uptodate(bh);
 	} else {
 		buffer_io_error(bh, ", lost async page write");
-		mark_buffer_write_io_error(page->mapping, page, bh);
+		mark_buffer_write_io_error(mapping, page, bh);
 		clear_buffer_uptodate(bh);
 		SetPageError(page);
 	}
@@ -597,11 +597,10 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
 *
 * The caller must hold lock_page_memcg().
 */
-static void __set_page_dirty(struct page *page, struct address_space *_mapping,
+static void __set_page_dirty(struct page *page, struct address_space *mapping,
 			     int warn)
 {
 	unsigned long flags;
-	struct address_space *mapping = page_mapping(page);
 
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	if (page_is_truncated(page, mapping)) { /* Race with truncate? */
@@ -1954,7 +1953,7 @@ int __block_write_begin_int(struct address_space *mapping, struct page *page,
 {
 	unsigned from = pos & (PAGE_SIZE - 1);
 	unsigned to = from + len;
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	unsigned block_start, block_end;
 	sector_t block;
 	int err = 0;
@@ -2456,7 +2455,7 @@ EXPORT_SYMBOL(cont_write_begin);
 int block_commit_write(struct address_space *mapping, struct page *page,
 		       unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	__block_commit_write(inode,page,from,to);
 	return 0;
 }
@@ -2705,7 +2704,7 @@ int nobh_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	struct buffer_head *head = fsdata;
 	struct buffer_head *bh;
 	BUG_ON(fsdata != NULL && page_has_buffers(page));
diff --git a/fs/libfs.c b/fs/libfs.c
index ac76b269bbb7..585ef1f37d54 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -475,7 +475,7 @@ int simple_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
-	struct inode *inode = page->mapping->host;
+	struct inode *inode = mapping->host;
 	loff_t last_pos = pos + copied;
 
 	/* zero the stale part of the page if we did a short copy */
diff --git a/fs/mpage.c b/fs/mpage.c
index 1eec9d0df23e..ecdef63f464e 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -231,7 +231,7 @@ do_mpage_readpage(struct bio *bio, struct address_space *mapping,
 	 * so readpage doesn't have to repeat the get_block call
 	 */
 	if (buffer_uptodate(map_bh)) {
-		map_buffer_to_page(page->mapping->host, page,
+		map_buffer_to_page(mapping->host, page,
 				   map_bh, page_block);
 		goto confused;
 	}
@@ -312,7 +312,7 @@ do_mpage_readpage(struct bio *bio, struct address_space *mapping,
 	if (bio)
 		bio = mpage_bio_submit(REQ_OP_READ, 0, bio);
 	if (!PageUptodate(page))
-		block_read_full_page(page->mapping->host, page, get_block);
+		block_read_full_page(mapping->host, page, get_block);
 	else
 		unlock_page(page);
 	goto out;
@@ -484,13 +484,12 @@ void clean_page_buffers(struct address_space *mapping, struct page
[RFC PATCH 39/79] fs/buffer: add struct address_space to clean_page_buffers() arguments
From: Jérôme Glisse

Add struct address_space to clean_page_buffers() arguments. One step
toward dropping reliance on page->mapping.

Signed-off-by: Jérôme Glisse
Cc: Jens Axboe
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/block_dev.c              | 2 +-
 fs/mpage.c                  | 9 +++++----
 include/linux/buffer_head.h | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index dd9da97615e3..b653cd8fd1e3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -712,7 +712,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 	if (result) {
 		end_page_writeback(page);
 	} else {
-		clean_page_buffers(page);
+		clean_page_buffers(mapping, page);
 		unlock_page(page);
 	}
 	blk_queue_exit(bdev->bd_queue);
diff --git a/fs/mpage.c b/fs/mpage.c
index a75cea232f1a..624995c333e0 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -447,7 +447,8 @@ struct mpage_data {
 * We have our BIO, so we can now mark the buffers clean.  Make
 * sure to only clean buffers which we know we'll be writing.
 */
-static void clean_buffers(struct page *page, unsigned first_unmapped)
+static void clean_buffers(struct address_space *mapping, struct page *page,
+			  unsigned first_unmapped)
 {
 	unsigned buffer_counter = 0;
 	struct buffer_head *bh, *head;
@@ -477,9 +478,9 @@ static void clean_buffers(struct page *page, unsigned first_unmapped)
 * We don't need to calculate how many buffers are attached to the page,
 * we just need to specify a number larger than the maximum number of buffers.
 */
-void clean_page_buffers(struct page *page)
+void clean_page_buffers(struct address_space *mapping, struct page *page)
 {
-	clean_buffers(page, ~0U);
+	clean_buffers(mapping, page, ~0U);
 }
 
 static int __mpage_writepage(struct page *page, struct address_space *_mapping,
@@ -643,7 +644,7 @@ static int __mpage_writepage(struct page *page, struct address_space *_mapping,
 		goto alloc_new;
 	}
 
-	clean_buffers(page, first_unmapped);
+	clean_buffers(mapping, page, first_unmapped);
 
 	BUG_ON(PageWriteback(page));
 	set_page_writeback(page);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 22e79307c055..f3baf88a251b 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -248,7 +248,7 @@ int generic_write_end(struct file *, struct address_space *,
 				loff_t, unsigned, unsigned,
 				struct page *, void *);
 void page_zero_new_buffers(struct page *page, unsigned from, unsigned to);
-void clean_page_buffers(struct page *page);
+void clean_page_buffers(struct address_space *mapping, struct page *page);
 int cont_write_begin(struct file *, struct address_space *, loff_t,
 			unsigned, unsigned, struct page **, void **,
 			get_block_t *, loff_t *);
-- 
2.14.3
[RFC PATCH 63/79] mm/page: convert page's index lookup to be against specific mapping
From: Jérôme Glisse

Switch mm to look up the page index or offset value against a specific
mapping. A page's index value only has meaning relative to a mapping.

Using coccinelle:
----------------------------------------------------------------------
@@
struct page *P;
expression E;
@@
-P->index = E
+page_set_index(P, E)

@@
struct page *P;
@@
-P->index
+page_index(P)

@@
struct page *P;
@@
-page_index(P) << PAGE_SHIFT
+page_offset(P)

@@
expression E;
@@
-page_index(E)
+_page_index(E, mapping)

@@
expression E1, E2;
@@
-page_set_index(E1, E2)
+_page_set_index(E1, mapping, E2)

@@
expression E;
@@
-page_to_index(E)
+_page_to_index(E, mapping)

@@
expression E;
@@
-page_to_pgoff(E)
+_page_to_pgoff(E, mapping)

@@
expression E;
@@
-page_offset(E)
+_page_offset(E, mapping)

@@
expression E;
@@
-page_file_offset(E)
+_page_file_offset(E, mapping)
----------------------------------------------------------------------

Signed-off-by: Jérôme Glisse
Cc: Andrew Morton
Cc: Mel Gorman
Cc: linux...@kvack.org
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
---
 mm/filemap.c        | 26 ++
 mm/page-writeback.c | 16 +---
 mm/shmem.c          | 11 +++
 mm/truncate.c       | 11 ++-
 4 files changed, 36 insertions(+), 28 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 012a53964215..a41c7cfb6351 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -118,7 +118,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
 	void **slot;
 	int error;
 
-	error = __radix_tree_create(&mapping->page_tree, page->index, 0,
+	error = __radix_tree_create(&mapping->page_tree,
+				    _page_index(page, mapping), 0,
 				    &node, &slot);
 	if (error)
 		return error;
@@ -155,7 +156,8 @@ static void page_cache_tree_delete(struct address_space *mapping,
 		struct radix_tree_node *node;
 		void **slot;
 
-		__radix_tree_lookup(&mapping->page_tree, page->index + i,
+		__radix_tree_lookup(&mapping->page_tree,
+				    _page_index(page, mapping) + i,
 				    &node, &slot);
 
 		VM_BUG_ON_PAGE(!node && nr != 1, page);
@@ -791,12 +793,12 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		void (*freepage)(struct page *);
 		unsigned long flags;
 
-		pgoff_t offset = old->index;
+		pgoff_t offset = _page_index(old, mapping);
 		freepage = mapping->a_ops->freepage;
 
 		get_page(new);
 		new->mapping = mapping;
-		new->index = offset;
+		_page_set_index(new, mapping, offset);
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		__delete_from_page_cache(old, NULL);
@@ -850,7 +852,7 @@ static int __add_to_page_cache_locked(struct page *page,
 
 	get_page(page);
 	page->mapping = mapping;
-	page->index = offset;
+	_page_set_index(page, mapping, offset);
 
 	spin_lock_irq(&mapping->tree_lock);
 	error = page_cache_tree_insert(mapping, page, shadowp);
@@ -1500,7 +1502,7 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+		VM_BUG_ON_PAGE(_page_to_pgoff(page, mapping) != offset, page);
 	}
 	return page;
 }
@@ -1559,7 +1561,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page->index != offset, page);
+		VM_BUG_ON_PAGE(_page_index(page, mapping) != offset, page);
 	}
 
 	if (page && (fgp_flags & FGP_ACCESSED))
@@ -1751,7 +1753,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
 
 		pages[ret] = page;
 		if (++ret == nr_pages) {
-			*start = pages[ret - 1]->index + 1;
+			*start = _page_index(pages[ret - 1], mapping) + 1;
 			goto out;
 		}
 	}
@@ -1837,7 +1839,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
 		 * otherwise we can get both false positives and false
 		 * negatives, which is just confusing to the caller.
 		 */
-		if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
+		if (page->mapping == NULL || _page_to_pgoff(page, mapping) != iter.index) {
 			put_page(page);
 			break;
 		}
[RFC PATCH 64/79] mm/buffer: use _page_has_buffers() instead of page_has_buffers()
From: Jérôme Glisse

The former needs the address_space against which the buffer_head is
being looked up.

--
@exists@
identifier M;
expression E;
@@
struct address_space *M;
...
-page_buffers(E)
+_page_buffers(E, M)

@exists@
identifier M, F;
expression E;
@@
F(..., struct address_space *M, ...)
{...
-page_buffers(E)
+_page_buffers(E, M)
...}

@exists@
identifier M;
expression E;
@@
struct address_space *M;
...
-page_has_buffers(E)
+_page_has_buffers(E, M)

@exists@
identifier M, F;
expression E;
@@
F(..., struct address_space *M, ...)
{...
-page_has_buffers(E)
+_page_has_buffers(E, M)
...}

@exists@
identifier I;
expression E;
@@
struct inode *I;
...
-page_buffers(E)
+_page_buffers(E, I->i_mapping)

@exists@
identifier I, F;
expression E;
@@
F(..., struct inode *I, ...)
{...
-page_buffers(E)
+_page_buffers(E, I->i_mapping)
...}

@exists@
identifier I;
expression E;
@@
struct inode *I;
...
-page_has_buffers(E)
+_page_has_buffers(E, I->i_mapping)

@exists@
identifier I, F;
expression E;
@@
F(..., struct inode *I, ...)
{...
-page_has_buffers(E)
+_page_has_buffers(E, I->i_mapping)
...}
--

Signed-off-by: Jérôme Glisse
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 mm/migrate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c2a613283fa2..e4b20ac6cf36 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -768,10 +768,10 @@ int buffer_migrate_page(struct address_space *mapping,
 	struct buffer_head *bh, *head;
 	int rc;
 
-	if (!page_has_buffers(page))
+	if (!_page_has_buffers(page, mapping))
 		return migrate_page(mapping, newpage, page, mode);
 
-	head = page_buffers(page);
+	head = _page_buffers(page, mapping);
 
 	rc = migrate_page_move_mapping(mapping, newpage, page, head, mode, 0);
-- 
2.14.3
[RFC PATCH 52/79] fs/buffer: use _page_has_buffers() instead of page_has_buffers()
From: Jérôme GlisseThe former need the address_space for which the buffer_head is being lookup. -- @exists@ identifier M; expression E; @@ struct address_space *M; ... -page_buffers(E) +_page_buffers(E, M) @exists@ identifier M, F; expression E; @@ F(..., struct address_space *M, ...) {... -page_buffers(E) +_page_buffers(E, M) ...} @exists@ identifier M; expression E; @@ struct address_space *M; ... -page_has_buffers(E) +_page_has_buffers(E, M) @exists@ identifier M, F; expression E; @@ F(..., struct address_space *M, ...) {... -page_has_buffers(E) +_page_has_buffers(E, M) ...} @exists@ identifier I; expression E; @@ struct inode *I; ... -page_buffers(E) +_page_buffers(E, I->i_mapping) @exists@ identifier I, F; expression E; @@ F(..., struct inode *I, ...) {... -page_buffers(E) +_page_buffers(E, I->i_mapping) ...} @exists@ identifier I; expression E; @@ struct inode *I; ... -page_has_buffers(E) +_page_has_buffers(E, I->i_mapping) @exists@ identifier I, F; expression E; @@ F(..., struct inode *I, ...) {... 
-page_has_buffers(E) +_page_has_buffers(E, I->i_mapping) ...} -- Signed-off-by: Jérôme Glisse CC: Andrew Morton Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: Jens Axboe Cc: Tejun Heo Cc: Jan Kara Cc: Josef Bacik Cc: Mel Gorman --- fs/buffer.c | 60 ++-- fs/mpage.c | 14 +++--- 2 files changed, 37 insertions(+), 37 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 3c424b7af5af..27b19c629308 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -89,13 +89,13 @@ void buffer_check_dirty_writeback(struct page *page, BUG_ON(!PageLocked(page)); - if (!page_has_buffers(page)) + if (!_page_has_buffers(page, mapping)) return; if (PageWriteback(page)) *writeback = true; - head = page_buffers(page); + head = _page_buffers(page, mapping); bh = head; do { if (buffer_locked(bh)) @@ -211,9 +211,9 @@ __find_get_block_slow(struct block_device *bdev, sector_t block) goto out; spin_lock(_mapping->private_lock); - if (!page_has_buffers(page)) + if (!_page_has_buffers(page, bd_mapping)) goto out_unlock; - head = page_buffers(page); + head = _page_buffers(page, bd_mapping); bh = head; do { if (!buffer_mapped(bh)) @@ -648,8 +648,8 @@ int __set_page_dirty_buffers(struct address_space *mapping, return !TestSetPageDirty(page); spin_lock(>private_lock); - if (page_has_buffers(page)) { - struct buffer_head *head = page_buffers(page); + if (_page_has_buffers(page, mapping)) { + struct buffer_head *head = _page_buffers(page, mapping); struct buffer_head *bh = head; do { @@ -913,7 +913,7 @@ static sector_t init_page_buffers(struct address_space *buffer, struct page *page, struct block_device *bdev, sector_t block, int size) { - struct buffer_head *head = page_buffers(page); + struct buffer_head *head = _page_buffers(page, buffer); struct buffer_head *bh = head; int uptodate = PageUptodate(page); sector_t end_block = blkdev_max_block(I_BDEV(bdev->bd_inode), size); @@ -969,8 +969,8 @@ grow_dev_page(struct block_device *bdev, sector_t block, BUG_ON(!PageLocked(page)); - if 
(page_has_buffers(page)) { - bh = page_buffers(page); + if (_page_has_buffers(page, inode->i_mapping)) { + bh = _page_buffers(page, inode->i_mapping); if (bh->b_size == size) { end_block = init_page_buffers(inode->i_mapping, page, bdev, (sector_t)index << sizebits, @@ -1490,7 +1490,7 @@ void block_invalidatepage(struct address_space *mapping, struct page *page, unsigned int stop = length + offset; BUG_ON(!PageLocked(page)); - if (!page_has_buffers(page)) + if (!_page_has_buffers(page, mapping)) goto out; /* @@ -1498,7 +1498,7 @@ void block_invalidatepage(struct address_space *mapping, struct page *page, */ BUG_ON(stop > PAGE_SIZE || stop < length); - head = page_buffers(page); + head = _page_buffers(page, mapping); bh = head; do { unsigned int next_off = curr_off + bh->b_size; @@ -1605,7 +1605,7 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len) for (i = 0; i < count; i++) {
[RFC PATCH 65/79] mm/swap: add struct swap_info_struct to swap_readpage() arguments
From: Jérôme Glisse

Add struct swap_info_struct to swap_readpage() arguments. One step
toward dropping reliance on page->private during swap read back.

Signed-off-by: Jérôme Glisse
CC: Andrew Morton
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 include/linux/swap.h |  6 --
 mm/memory.c          |  2 +-
 mm/page_io.c         |  4 ++--
 mm/swap_state.c      | 12 
 4 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2f6abe9652f6..90c26ec2997c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -383,7 +383,8 @@ extern void kswapd_stop(int nid);
 #include <linux/blk_types.h> /* for bio_end_io_t */
 
 /* linux/mm/page_io.c */
-extern int swap_readpage(struct page *page, bool do_poll);
+extern int swap_readpage(struct swap_info_struct *sis, struct page *page,
+			 bool do_poll);
 extern int swap_writepage(struct address_space *mapping, struct page *page,
 			  struct writeback_control *wbc);
 extern void end_swap_bio_write(struct bio *bio);
@@ -486,7 +487,8 @@ extern void exit_swap_address_space(unsigned int type);
 
 #else /* CONFIG_SWAP */
 
-static inline int swap_readpage(struct page *page, bool do_poll)
+static inline int swap_readpage(struct swap_info_struct *sis, struct page *page,
+				bool do_poll)
 {
 	return 0;
 }

diff --git a/mm/memory.c b/mm/memory.c
index 1311599a164b..6ffd76528e7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2949,7 +2949,7 @@ int do_swap_page(struct vm_fault *vmf)
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
 				lru_cache_add_anon(page);
-				swap_readpage(page, true);
+				swap_readpage(si, page, true);
 			}
 		} else {
 			if (vma_readahead)

diff --git a/mm/page_io.c b/mm/page_io.c
index 6e548b588490..f4e05c90c87e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -349,11 +349,11 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return ret;
 }
 
-int swap_readpage(struct page *page, bool synchronous)
+int swap_readpage(struct swap_info_struct *sis, struct page *page,
+		  bool synchronous)
 {
 	struct bio *bio;
 	int ret = 0;
-	struct swap_info_struct *sis = page_swap_info(page);
 	blk_qc_t qc;
 	struct gendisk *disk;

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..40a2437e3c34 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -466,8 +466,10 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	struct page *retpage = __read_swap_cache_async(entry, gfp_mask,
 			vma, addr, &page_was_allocated);
 
-	if (page_was_allocated)
-		swap_readpage(retpage, do_poll);
+	if (page_was_allocated) {
+		struct swap_info_struct *sis = swp_swap_info(entry);
+		swap_readpage(sis, retpage, do_poll);
+	}
 
 	return retpage;
 }
@@ -585,7 +587,8 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
+			struct swap_info_struct *sis = swp_swap_info(entry);
+			swap_readpage(sis, page, false);
 			if (offset != entry_offset &&
 			    likely(!PageTransCompound(page))) {
 				SetPageReadahead(page);
@@ -748,7 +751,8 @@ struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
+			struct swap_info_struct *sis = swp_swap_info(entry);
+			swap_readpage(sis, page, false);
 			if (i != swap_ra->offset &&
 			    likely(!PageTransCompound(page))) {
 				SetPageReadahead(page);
-- 
2.14.3
[RFC PATCH 69/79] fs/journal: add struct address_space to jbd2_journal_try_to_free_buffers() arguments
From: Jérôme Glisse

For the holy crusade to stop relying on struct page mapping field, add
struct address_space to jbd2_journal_try_to_free_buffers() arguments.

<-
@@
type T1, T2, T3;
@@
int
-jbd2_journal_try_to_free_buffers(T1 journal, T2 page, T3 gfp_mask)
+jbd2_journal_try_to_free_buffers(T1 journal, struct address_space *mapping, T2 page, T3 gfp_mask)
{...}

@@
type T1, T2, T3;
@@
int
-jbd2_journal_try_to_free_buffers(T1, T2, T3)
+jbd2_journal_try_to_free_buffers(T1, struct address_space *, T2, T3)
;

@@
expression E1, E2, E3;
@@
-jbd2_journal_try_to_free_buffers(E1, E2, E3)
+jbd2_journal_try_to_free_buffers(E1, NULL, E2, E3)
->

Signed-off-by: Jérôme Glisse
Cc: "Theodore Ts'o"
Cc: Jan Kara
Cc: linux-e...@vger.kernel.org
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
---
 fs/ext4/inode.c       | 3 ++-
 fs/ext4/super.c       | 4 ++--
 fs/jbd2/transaction.c | 3 ++-
 include/linux/jbd2.h  | 4 +++-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1a44d9acde53..ef53a57d9768 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3413,7 +3413,8 @@ static int ext4_releasepage(struct address_space *mapping,
 	if (PageChecked(page))
 		return 0;
 	if (journal)
-		return jbd2_journal_try_to_free_buffers(journal, page, wait);
+		return jbd2_journal_try_to_free_buffers(journal, NULL, page,
+							wait);
 	else
 		return try_to_free_buffers(mapping, page);
 }

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8f98bc886569..cf2b74137fb2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1138,8 +1138,8 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 	if (!_page_has_buffers(page, mapping))
 		return 0;
 	if (journal)
-		return jbd2_journal_try_to_free_buffers(journal, page,
-						wait & ~__GFP_DIRECT_RECLAIM);
+		return jbd2_journal_try_to_free_buffers(journal, NULL, page,
+						wait & ~__GFP_DIRECT_RECLAIM);
 	return try_to_free_buffers(mapping, page);
 }

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index bf673b33d436..6899e7b4036d 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1984,7 +1984,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
  * Return 0 on failure, 1 on success
  */
 int jbd2_journal_try_to_free_buffers(journal_t *journal,
-				struct page *page, gfp_t gfp_mask)
+				struct address_space *mapping,
+				struct page *page, gfp_t gfp_mask)
 {
 	struct buffer_head *head;
 	struct buffer_head *bh;

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index c5133df80fd4..658a0d2f758f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1363,7 +1363,9 @@ extern int jbd2_journal_forget (handle_t *, struct super_block *sb,
 extern void	 journal_sync_buffer (struct buffer_head *);
 extern int	 jbd2_journal_invalidatepage(journal_t *,
 				struct page *, unsigned int, unsigned int);
-extern int	 jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
+extern int	 jbd2_journal_try_to_free_buffers(journal_t *,
+				struct address_space *,
+				struct page *, gfp_t);
 extern int	 jbd2_journal_stop(handle_t *);
 extern int	 jbd2_journal_flush (journal_t *);
 extern void	 jbd2_journal_lock_updates (journal_t *);
-- 
2.14.3
[RFC PATCH 71/79] mm: add struct address_space to set_page_dirty()
From: Jérôme GlisseFor the holy crusade to stop relying on struct page mapping field, add struct address_space to set_page_dirty() arguments. <- @@ identifier I1; type T1; @@ int -set_page_dirty(T1 I1) +set_page_dirty(struct address_space *_mapping, T1 I1) {...} @@ type T1; @@ int -set_page_dirty(T1) +set_page_dirty(struct address_space *, T1) ; @@ identifier I1; type T1; @@ int -set_page_dirty(T1 I1) +set_page_dirty(struct address_space *, T1) ; @@ expression E1; @@ -set_page_dirty(E1) +set_page_dirty(NULL, E1) -> Signed-off-by: Jérôme Glisse CC: Andrew Morton Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: Tejun Heo Cc: Jan Kara Cc: Josef Bacik Cc: Mel Gorman --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c| 2 +- drivers/gpu/drm/drm_gem.c | 2 +- drivers/gpu/drm/i915/i915_gem.c| 6 ++--- drivers/gpu/drm/i915/i915_gem_fence_reg.c | 2 +- drivers/gpu/drm/i915/i915_gem_userptr.c| 2 +- drivers/gpu/drm/radeon/radeon_ttm.c| 2 +- drivers/gpu/drm/ttm/ttm_tt.c | 2 +- drivers/infiniband/core/umem_odp.c | 2 +- drivers/misc/vmw_vmci/vmci_queue_pair.c| 2 +- drivers/mtd/devices/block2mtd.c| 4 +-- drivers/platform/goldfish/goldfish_pipe.c | 2 +- drivers/sbus/char/oradax.c | 2 +- drivers/staging/lustre/lustre/llite/rw26.c | 2 +- drivers/staging/lustre/lustre/llite/vvp_io.c | 4 +-- .../interface/vchiq_arm/vchiq_2835_arm.c | 2 +- fs/9p/vfs_addr.c | 2 +- fs/afs/write.c | 2 +- fs/btrfs/extent_io.c | 2 +- fs/btrfs/file.c| 2 +- fs/btrfs/inode.c | 6 ++--- fs/btrfs/ioctl.c | 2 +- fs/btrfs/relocation.c | 2 +- fs/buffer.c| 6 ++--- fs/ceph/addr.c | 4 +-- fs/cifs/file.c | 4 +-- fs/exofs/dir.c | 2 +- fs/exofs/inode.c | 4 +-- fs/f2fs/checkpoint.c | 4 +-- fs/f2fs/data.c | 6 ++--- fs/f2fs/dir.c | 10 fs/f2fs/file.c | 10 fs/f2fs/gc.c | 6 ++--- fs/f2fs/inline.c | 18 ++--- fs/f2fs/inode.c| 6 ++--- fs/f2fs/node.c | 20 +++ fs/f2fs/node.h | 2 +- fs/f2fs/recovery.c | 2 +- fs/f2fs/segment.c | 12 - fs/f2fs/xattr.c| 6 ++--- fs/fuse/file.c | 2 +- fs/gfs2/file.c | 2 +- fs/hfs/bnode.c | 12 - fs/hfs/btree.c 
| 6 ++--- fs/hfsplus/bitmap.c| 8 +++--- fs/hfsplus/bnode.c | 30 +++--- fs/hfsplus/btree.c | 6 ++--- fs/hfsplus/xattr.c | 2 +- fs/iomap.c | 2 +- fs/jfs/jfs_metapage.c | 4 +-- fs/libfs.c | 2 +- fs/nfs/direct.c| 2 +- fs/ntfs/attrib.c | 8 +++--- fs/ntfs/bitmap.c | 4 +-- fs/ntfs/file.c | 2 +- fs/ntfs/lcnalloc.c | 4 +-- fs/ntfs/mft.c | 4 +-- fs/ntfs/usnjrnl.c | 2 +- fs/udf/file.c | 2 +- fs/ufs/inode.c | 2 +- include/linux/mm.h | 2 +- mm/filemap.c | 2 +- mm/gup.c | 2 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 2 +- mm/khugepaged.c
[RFC PATCH 70/79] mm: add struct address_space to mark_buffer_dirty()
From: Jérôme GlisseFor the holy crusade to stop relying on struct page mapping field, add struct address_space to mark_buffer_dirty() arguments. <- @@ identifier I1; type T1; @@ void -mark_buffer_dirty(T1 I1) +mark_buffer_dirty(struct address_space *_mapping, T1 I1) {...} @@ type T1; @@ void -mark_buffer_dirty(T1) +mark_buffer_dirty(struct address_space *, T1) ; @@ identifier I1; type T1; @@ void -mark_buffer_dirty(T1 I1) +mark_buffer_dirty(struct address_space *, T1) ; @@ expression E1; @@ -mark_buffer_dirty(E1) +mark_buffer_dirty(NULL, E1) -> Signed-off-by: Jérôme Glisse CC: Andrew Morton Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: Tejun Heo Cc: Jan Kara Cc: Josef Bacik Cc: Mel Gorman --- fs/adfs/dir_f.c | 2 +- fs/affs/bitmap.c| 6 +++--- fs/affs/super.c | 2 +- fs/bfs/file.c | 2 +- fs/bfs/inode.c | 4 ++-- fs/buffer.c | 12 ++-- fs/ext2/balloc.c| 6 +++--- fs/ext2/ialloc.c| 8 fs/ext2/inode.c | 2 +- fs/ext2/super.c | 4 ++-- fs/ext2/xattr.c | 8 fs/ext4/ext4_jbd2.c | 4 ++-- fs/ext4/inode.c | 4 ++-- fs/ext4/mmp.c | 2 +- fs/ext4/resize.c| 2 +- fs/ext4/super.c | 2 +- fs/fat/inode.c | 4 ++-- fs/fat/misc.c | 2 +- fs/gfs2/bmap.c | 4 ++-- fs/gfs2/lops.c | 6 +++--- fs/hfs/mdb.c| 10 +- fs/hpfs/anode.c | 34 +- fs/hpfs/buffer.c| 8 fs/hpfs/dnode.c | 4 ++-- fs/hpfs/ea.c| 4 ++-- fs/hpfs/inode.c | 2 +- fs/hpfs/namei.c | 10 +- fs/hpfs/super.c | 6 +++--- fs/jbd2/recovery.c | 2 +- fs/jbd2/transaction.c | 2 +- fs/jfs/jfs_imap.c | 2 +- fs/jfs/jfs_mount.c | 2 +- fs/jfs/resize.c | 6 +++--- fs/jfs/super.c | 2 +- fs/minix/bitmap.c | 10 +- fs/minix/inode.c| 12 ++-- fs/nilfs2/alloc.c | 12 ++-- fs/nilfs2/btnode.c | 4 ++-- fs/nilfs2/btree.c | 38 +++--- fs/nilfs2/cpfile.c | 24 fs/nilfs2/dat.c | 4 ++-- fs/nilfs2/gcinode.c | 2 +- fs/nilfs2/ifile.c | 4 ++-- fs/nilfs2/inode.c | 2 +- fs/nilfs2/ioctl.c | 2 +- fs/nilfs2/mdt.c | 2 +- fs/nilfs2/segment.c | 4 ++-- fs/nilfs2/sufile.c | 26 +- fs/ntfs/file.c | 8 fs/ntfs/super.c | 2 +- fs/ocfs2/alloc.c| 2 +- fs/ocfs2/aops.c | 4 ++-- 
fs/ocfs2/inode.c| 2 +- fs/omfs/bitmap.c| 6 +++--- fs/omfs/dir.c | 8 fs/omfs/file.c | 4 ++-- fs/omfs/inode.c | 4 ++-- fs/reiserfs/file.c | 2 +- fs/reiserfs/inode.c | 4 ++-- fs/reiserfs/journal.c | 10 +- fs/reiserfs/resize.c| 2 +- fs/sysv/balloc.c| 2 +- fs/sysv/ialloc.c| 2 +- fs/sysv/inode.c | 8 fs/sysv/sysv.h | 4 ++-- fs/udf/balloc.c | 6 +++--- fs/udf/inode.c | 2 +- fs/udf/partition.c | 4 ++-- fs/udf/super.c | 8 fs/ufs/balloc.c | 4 ++-- fs/ufs/ialloc.c | 4 ++-- fs/ufs/inode.c | 8 fs/ufs/util.c | 2 +- include/linux/buffer_head.h | 2 +- 74 files changed, 220 insertions(+), 220 deletions(-) diff --git a/fs/adfs/dir_f.c b/fs/adfs/dir_f.c index 0fbfd0b04ae0..3d92f8d187bc 100644 --- a/fs/adfs/dir_f.c +++ b/fs/adfs/dir_f.c @@ -434,7 +434,7 @@ adfs_f_update(struct adfs_dir *dir, struct object_info *obj) } #endif for (i = dir->nr_buffers - 1; i >= 0; i--) - mark_buffer_dirty(dir->bh[i]); + mark_buffer_dirty(NULL, dir->bh[i]); ret = 0; out: diff --git a/fs/affs/bitmap.c b/fs/affs/bitmap.c index 5ba9ef2742f6..59b352075505 100644 --- a/fs/affs/bitmap.c +++ b/fs/affs/bitmap.c @@ -79,7 +79,7 @@ affs_free_block(struct super_block *sb, u32 block) tmp = be32_to_cpu(*(__be32 *)bh->b_data); *(__be32 *)bh->b_data = cpu_to_be32(tmp - mask); - mark_buffer_dirty(bh); + mark_buffer_dirty(NULL, bh); affs_mark_sb_dirty(sb); bm->bm_free++; @@ -223,7 +223,7 @@
[RFC PATCH 51/79] fs: stop relying on mapping field of struct page, get it from context
From: Jérôme Glisse

Holy grail: remove all usage of the mapping field of struct page inside
common fs code. This is the manual conversion patch (so much can be done
with coccinelle).

Signed-off-by: Jérôme Glisse
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
Cc: Jens Axboe
Cc: Tejun Heo
Cc: Jan Kara
Cc: Josef Bacik
Cc: Mel Gorman
---
 fs/buffer.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 39d8c7315b55..3c424b7af5af 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -570,7 +570,9 @@ void write_boundary_block(struct block_device *bdev,
 void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
 {
 	struct address_space *mapping = inode->i_mapping;
-	struct address_space *buffer_mapping = bh->b_page->mapping;
+	struct address_space *buffer_mapping;
+
+	buffer_mapping = fs_page_mapping_get_with_bh(bh->b_page, bh);
 
 	mark_buffer_dirty(bh);
 	if (!mapping->private_data) {
@@ -1138,10 +1140,13 @@ EXPORT_SYMBOL(mark_buffer_dirty);
 void mark_buffer_write_io_error(struct address_space *mapping,
 				struct page *page, struct buffer_head *bh)
 {
+	BUG_ON(page != bh->b_page);
+	BUG_ON(mapping != bh->b_page->mapping);
+
 	set_buffer_write_io_error(bh);
 	/* FIXME: do we need to set this in both places? */
-	if (bh->b_page && !page_is_truncated(bh->b_page, bh->b_page->mapping))
-		mapping_set_error(bh->b_page->mapping, -EIO);
+	if (bh->b_page && !page_is_truncated(page, mapping))
+		mapping_set_error(mapping, -EIO);
 	if (bh->b_assoc_map)
 		mapping_set_error(bh->b_assoc_map, -EIO);
 }
@@ -1172,7 +1177,10 @@ void __bforget(struct super_block *sb, struct buffer_head *bh)
 {
 	clear_buffer_dirty(bh);
 	if (bh->b_assoc_map) {
-		struct address_space *buffer_mapping = bh->b_page->mapping;
+		struct address_space *buffer_mapping;
+
+		buffer_mapping = sb->s_bdev->bd_inode->i_mapping;
+		BUG_ON(buffer_mapping != bh->b_page->mapping);
 
 		spin_lock(&buffer_mapping->private_lock);
 		list_del_init(&bh->b_assoc_buffers);
@@ -1543,7 +1551,7 @@ void create_empty_buffers(struct address_space *mapping, struct page *page,
 	} while (bh);
 	tail->b_this_page = head;
 
-	spin_lock(&page->mapping->private_lock);
+	spin_lock(&mapping->private_lock);
 	if (PageUptodate(page) || PageDirty(page)) {
 		bh = head;
 		do {
@@ -1555,7 +1563,7 @@ void create_empty_buffers(struct address_space *mapping, struct page *page,
 		} while (bh != head);
 	}
 	attach_page_buffers(page, head);
-	spin_unlock(&page->mapping->private_lock);
+	spin_unlock(&mapping->private_lock);
 }
 EXPORT_SYMBOL(create_empty_buffers);

@@ -1833,7 +1841,7 @@ int __block_write_full_page(struct inode *inode, struct page *page,
 	} while ((bh = bh->b_this_page) != head);
 	SetPageError(page);
 	BUG_ON(PageWriteback(page));
-	mapping_set_error(page->mapping, err);
+	mapping_set_error(inode->i_mapping, err);
 	set_page_writeback(page);
 	do {
 		struct buffer_head *next = bh->b_this_page;
@@ -2541,7 +2549,7 @@ static void attach_nobh_buffers(struct address_space *mapping,
 
 	BUG_ON(!PageLocked(page));
 
-	spin_lock(&page->mapping->private_lock);
+	spin_lock(&mapping->private_lock);
 	bh = head;
 	do {
 		if (PageDirty(page))
@@ -2551,7 +2559,7 @@ static void attach_nobh_buffers(struct address_space *mapping,
 		bh = bh->b_this_page;
 	} while (bh != head);
 	attach_page_buffers(page, head);
-	spin_unlock(&page->mapping->private_lock);
+	spin_unlock(&mapping->private_lock);
 }

/*
-- 
2.14.3
[RFC PATCH 68/79] mm/vma_address: convert page's index lookup to be against specific mapping
From: Jérôme Glisse

Pass down the mapping ...

Signed-off-by: Jérôme Glisse
Cc: Andrew Morton
Cc: Mel Gorman
Cc: linux...@kvack.org
Cc: Alexander Viro
Cc: linux-fsde...@vger.kernel.org
---
 mm/internal.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..43e9ed27362f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -336,7 +336,9 @@ extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 static inline unsigned long
 __vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page_to_pgoff(page);
+	struct address_space *mapping = vma->vm_file ?
+		vma->vm_file->f_mapping : NULL;
+	pgoff_t pgoff = _page_to_pgoff(page, mapping);
 
 	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 }
-- 
2.14.3
[RFC PATCH 72/79] mm: add struct address_space to set_page_dirty_lock()
From: Jérôme GlisseFor the holy crusade to stop relying on struct page mapping field, add struct address_space to set_page_dirty_lock() arguments. <- @@ identifier I1; type T1; @@ int -set_page_dirty_lock(T1 I1) +set_page_dirty_lock(struct address_space *_mapping, T1 I1) {...} @@ type T1; @@ int -set_page_dirty_lock(T1) +set_page_dirty_lock(struct address_space *, T1) ; @@ identifier I1; type T1; @@ int -set_page_dirty_lock(T1 I1) +set_page_dirty_lock(struct address_space *, T1) ; @@ expression E1; @@ -set_page_dirty_lock(E1) +set_page_dirty_lock(NULL, E1) -> Signed-off-by: Jérôme Glisse CC: Andrew Morton Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: Tejun Heo Cc: Jan Kara Cc: Josef Bacik Cc: Mel Gorman --- arch/cris/arch-v32/drivers/cryptocop.c| 2 +- arch/powerpc/kvm/book3s_64_mmu_radix.c| 2 +- arch/powerpc/kvm/e500_mmu.c | 3 ++- arch/s390/kvm/interrupt.c | 4 ++-- arch/x86/kvm/svm.c| 2 +- block/bio.c | 4 ++-- drivers/gpu/drm/exynos/exynos_drm_g2d.c | 2 +- drivers/infiniband/core/umem.c| 2 +- drivers/infiniband/hw/hfi1/user_pages.c | 2 +- drivers/infiniband/hw/qib/qib_user_pages.c| 2 +- drivers/infiniband/hw/usnic/usnic_uiom.c | 2 +- drivers/media/common/videobuf2/videobuf2-dma-contig.c | 2 +- drivers/media/common/videobuf2/videobuf2-dma-sg.c | 2 +- drivers/media/common/videobuf2/videobuf2-vmalloc.c| 2 +- drivers/misc/genwqe/card_utils.c | 2 +- drivers/staging/lustre/lustre/llite/rw26.c| 2 +- drivers/vhost/vhost.c | 2 +- fs/block_dev.c| 2 +- fs/direct-io.c| 2 +- fs/fuse/dev.c | 2 +- fs/fuse/file.c| 2 +- include/linux/mm.h| 2 +- mm/memory.c | 2 +- mm/page-writeback.c | 2 +- mm/process_vm_access.c| 2 +- net/ceph/pagevec.c| 2 +- 26 files changed, 29 insertions(+), 28 deletions(-) diff --git a/arch/cris/arch-v32/drivers/cryptocop.c b/arch/cris/arch-v32/drivers/cryptocop.c index a3c353472a8c..5cb42555c90b 100644 --- a/arch/cris/arch-v32/drivers/cryptocop.c +++ b/arch/cris/arch-v32/drivers/cryptocop.c @@ -2930,7 +2930,7 @@ static int 
cryptocop_ioctl_process(struct inode *inode, struct file *filp, unsig for (i = 0; i < nooutpages; i++){ int spdl_err; /* Mark output pages dirty. */ - spdl_err = set_page_dirty_lock(outpages[i]); + spdl_err = set_page_dirty_lock(NULL, outpages[i]); DEBUG(if (spdl_err < 0)printk("cryptocop_ioctl_process: set_page_dirty_lock returned %d\n", spdl_err)); } for (i = 0; i < nooutpages; i++){ diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 5cb4e4687107..8daefabe650e 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -482,7 +482,7 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu, if (page) { if (!ret && (pgflags & _PAGE_WRITE)) - set_page_dirty_lock(page); + set_page_dirty_lock(NULL, page); put_page(page); } diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c index ddbf8f0284c0..364ee7a5b268 100644 --- a/arch/powerpc/kvm/e500_mmu.c +++ b/arch/powerpc/kvm/e500_mmu.c @@ -556,7 +556,8 @@ static void free_gtlb(struct kvmppc_vcpu_e500 *vcpu_e500) PAGE_SIZE))); for (i = 0; i < vcpu_e500->num_shared_tlb_pages; i++) { - set_page_dirty_lock(vcpu_e500->shared_tlb_pages[i]); + set_page_dirty_lock(NULL, + vcpu_e500->shared_tlb_pages[i]); put_page(vcpu_e500->shared_tlb_pages[i]); } diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c index b04616b57a94..6db8d4f5c74f 100644 --- a/arch/s390/kvm/interrupt.c +++ b/arch/s390/kvm/interrupt.c @@ -2616,7 +2616,7 @@ static int adapter_indicators_set(struct kvm *kvm, set_bit(bit, map); idx =
[RFC PATCH 79/79] mm/ksm: set page->mapping to page_ronly struct instead of stable_node
From: Jérôme Glisse

Set page->mapping to the page_ronly struct instead of the stable_node
struct. There is no functional change as page_ronly is just a field of
stable_node.

Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
---
 mm/ksm.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 6085068fb8b3..52b0ae291d23 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -39,6 +39,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "internal.h"
 
@@ -126,6 +127,7 @@ struct ksm_scan {
 /**
  * struct stable_node - node of the stable rbtree
+ * @ronly: Page read only struct wrapper (see include/linux/page_ronly.h).
  * @node: rb node of this ksm page in the stable tree
  * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
  * @hlist_dup: linked into the stable_node->hlist with a stable_node chain
@@ -137,6 +139,7 @@ struct ksm_scan {
  * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
  */
 struct stable_node {
+	struct page_ronly ronly;
 	union {
 		struct rb_node node;	/* when node of stable tree */
 		struct {		/* when listed for migration */
@@ -318,13 +321,15 @@ static void __init ksm_slab_free(void)
 
 static inline struct stable_node *page_stable_node(struct page *page)
 {
-	return PageReadOnly(page) ? page_rmapping(page) : NULL;
+	struct page_ronly *ronly = page_ronly(page);
+
+	return ronly ? container_of(ronly, struct stable_node, ronly) : NULL;
 }
 
 static inline void
 set_page_stable_node(struct page *page, struct stable_node *stable_node)
 {
-	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_RONLY);
+	page_ronly_set(page, stable_node ? &stable_node->ronly : NULL);
 }
 
 static __always_inline bool is_stable_node_chain(struct stable_node *chain)
-- 
2.14.3
[RFC PATCH 75/79] mm/page_ronly: add page read only core structure and helpers.
From: Jérôme GlissePage read only is a generic framework for page write protection. It reuses the same mechanism as KSM by using the lower bit of the page->mapping fields, and KSM is converted to use this generic framework. Signed-off-by: Jérôme Glisse Cc: Andrea Arcangeli --- include/linux/page_ronly.h | 169 + 1 file changed, 169 insertions(+) create mode 100644 include/linux/page_ronly.h diff --git a/include/linux/page_ronly.h b/include/linux/page_ronly.h new file mode 100644 index ..6312d4f015ea --- /dev/null +++ b/include/linux/page_ronly.h @@ -0,0 +1,169 @@ +/* + * Copyright 2015 Red Hat Inc. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 of + * the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Authors: Jérôme Glisse + */ +/* + * Page read only generic wrapper. This is common struct use to write protect + * page by means of forbidding anyone from inserting a pte (page table entry) + * with write flag set. It reuse the ksm mecanism (which use lower bit of the + * mapping field of struct page). + */ +#ifndef LINUX_PAGE_RONLY_H +#define LINUX_PAGE_RONLY_H +#ifdef CONFIG_PAGE_RONLY + +#include +#include +#include +#include + + +/* enum page_ronly_event - Event that trigger a call to unprotec(). + * + * @PAGE_RONLY_SWAPIN: Page fault on at an address with a swap entry pte. + * @PAGE_RONLY_WFAULT: Write page fault. + * @PAGE_RONLY_GUP: Get user page. + */ +enum page_ronly_event { + PAGE_RONLY_SWAPIN, + PAGE_RONLY_WFAULT, + PAGE_RONLY_GUP, +}; + +/* struct page_ronly_ops - Page read only operations. 
+ * + * @unprotect: Callback to unprotect a page (mandatory). + * @rmap_walk: Callback to walk reverse mapping of a page (mandatory). + * + * Kernel user that want to use the page write protection mechanism have to + * provide a number of callback. + */ +struct page_ronly_ops { + struct page *(*unprotect)(struct page *page, + unsigned long addr, + struct vm_area_struct *vma, + enum page_ronly_event event); + int (*rmap_walk)(struct page *page, struct rmap_walk_control *rwc); +}; + +/* struct page_ronly - Replace page->mapping when a page is write protected. + * + * @ops: Pointer to page read only operations. + * + * Page that are write protect have their page->mapping field pointing to this + * wrapper structure. It must be allocated by page read only user and must be + * free (if needed) inside unprotect() callback. + */ +struct page_ronly { + const struct page_ronly_ops *ops; +}; + + +/* page_ronly() - Return page_ronly struct if any or NULL. + * + * @page: The page for which to replace the page->mapping field. + */ +static inline struct page_ronly *page_ronly(struct page *page) +{ + return PageReadOnly(page) ? page_rmapping(page) : NULL; +} + +/* page_ronly_set() - Replace page->mapping with ptr to page_ronly struct. + * + * @page: The page for which to replace the page->mapping field. + * @ronly: The page_ronly structure to set. + * + * Page must be locked. + */ +static inline void page_ronly_set(struct page *page, struct page_ronly *ronly) +{ + VM_BUG_ON_PAGE(!PageLocked(page), page); + + page->mapping = (void *)ronly + (PAGE_MAPPING_ANON|PAGE_MAPPING_RONLY); +} + +/* page_ronly_unprotect() - Unprotect a read only protected page. + * + * @page: The page to unprotect. + * @addr: Fault address that trigger the unprotect. + * @vma: The vma of the fault address. + * @event: Event which triggered the unprotect. + * + * Page must be locked and must be a read only page. 
+ */ +static inline struct page *page_ronly_unprotect(struct page *page, + unsigned long addr, + struct vm_area_struct *vma, + enum page_ronly_event event) +{ + struct page_ronly *pageronly; + + VM_BUG_ON_PAGE(!PageLocked(page), page); + /* +* Rely on the page lock to protect against concurrent modifications +* to that page's node of the stable tree. +*/ + VM_BUG_ON_PAGE(!PageReadOnly(page), page); + pageronly = page_ronly(page); + if (pageronly) + return pageronly->ops->unprotect(page, addr, vma, event); + /* Safest fallback. */ + return page; +} + +/* page_ronly_rmap_walk() - Walk all CPU page table mapping of a
[RFC PATCH 76/79] mm/ksm: have ksm select PAGE_RONLY config.
From: Jérôme Glisse

Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
---
 mm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index aeffb6e8dd21..6994a1fdf847 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -308,6 +308,7 @@ config MMU_NOTIFIER
 config KSM
 	bool "Enable KSM for page merging"
 	depends on MMU
+	select PAGE_RONLY
 	help
 	  Enable Kernel Samepage Merging: KSM periodically scans those areas
 	  of an application's address space that an app has advised may be
-- 
2.14.3
[RFC PATCH 73/79] mm: pass down struct address_space to set_page_dirty()
From: Jérôme GlissePass down struct address_space to set_page_dirty() everywhere it is already available. <- @exists@ expression E; identifier F, M; @@ F(..., struct address_space * M, ...) { ... -set_page_dirty(NULL, E) +set_page_dirty(M, E) ... } @exists@ expression E; identifier M; @@ struct address_space * M; ... -set_page_dirty(NULL, E) +set_page_dirty(M, E) @exists@ expression E; identifier F, I; @@ F(..., struct inode * I, ...) { ... -set_page_dirty(NULL, E) +set_page_dirty(I->i_mapping, E) ... } @exists@ expression E; identifier I; @@ struct inode * I; ... -set_page_dirty(NULL, E) +set_page_dirty(I->i_mapping, E) -> Signed-off-by: Jérôme Glisse CC: Andrew Morton Cc: Alexander Viro Cc: linux-fsde...@vger.kernel.org Cc: Tejun Heo Cc: Jan Kara Cc: Josef Bacik Cc: Mel Gorman --- mm/filemap.c| 2 +- mm/khugepaged.c | 2 +- mm/memory.c | 2 +- mm/page-writeback.c | 4 ++-- mm/page_io.c| 4 ++-- mm/shmem.c | 18 +- mm/truncate.c | 2 +- 7 files changed, 17 insertions(+), 17 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index c1ee7431bc4d..a15c29350a6a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2717,7 +2717,7 @@ int filemap_page_mkwrite(struct vm_fault *vmf) * progress, we are guaranteed that writeback during freezing will * see the dirty page and writeprotect it again. 
*/ - set_page_dirty(NULL, page); + set_page_dirty(inode->i_mapping, page); wait_for_stable_page(page); out: sb_end_pagefault(inode->i_sb); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ccd5da4e855f..b9a968172fb9 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1513,7 +1513,7 @@ static void collapse_shmem(struct mm_struct *mm, retract_page_tables(mapping, start); /* Everything is ready, let's unfreeze the new_page */ - set_page_dirty(NULL, new_page); + set_page_dirty(mapping, new_page); SetPageUptodate(new_page); page_ref_unfreeze(new_page, HPAGE_PMD_NR); mem_cgroup_commit_charge(new_page, memcg, false, true); diff --git a/mm/memory.c b/mm/memory.c index 20443ebf9c42..fbd80bb7a50a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2400,7 +2400,7 @@ static void fault_dirty_shared_page(struct vm_area_struct *vma, bool dirtied; bool page_mkwrite = vma->vm_ops && vma->vm_ops->page_mkwrite; - dirtied = set_page_dirty(NULL, page); + dirtied = set_page_dirty(mapping, page); VM_BUG_ON_PAGE(PageAnon(page), page); /* * Take a local copy of the address_space - page.mapping may be zeroed diff --git a/mm/page-writeback.c b/mm/page-writeback.c index eaa6c23ba752..59dc9a12efc7 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2599,7 +2599,7 @@ int set_page_dirty_lock(struct address_space *_mapping, struct page *page) int ret; lock_page(page); - ret = set_page_dirty(NULL, page); + ret = set_page_dirty(_mapping, page); unlock_page(page); return ret; } @@ -2693,7 +2693,7 @@ int clear_page_dirty_for_io(struct page *page) * threads doing their things. 
*/ if (page_mkclean(page)) - set_page_dirty(NULL, page); + set_page_dirty(mapping, page); /* * We carefully synchronise fault handlers against * installing a dirty pte and marking the page dirty diff --git a/mm/page_io.c b/mm/page_io.c index 5afc8b8a6b97..fd3133cd50d4 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -329,7 +329,7 @@ int __swap_writepage(struct address_space *mapping, struct page *page, * the normal direct-to-bio case as it could * be temporary. */ - set_page_dirty(NULL, page); + set_page_dirty(mapping, page); ClearPageReclaim(page); pr_err_ratelimited("Write error on dio swapfile (%llu)\n", page_file_offset(page)); @@ -348,7 +348,7 @@ int __swap_writepage(struct address_space *mapping, struct page *page, ret = 0; bio = get_swap_bio(GFP_NOIO, page, end_write_func); if (bio == NULL) { - set_page_dirty(NULL, page); + set_page_dirty(mapping, page); unlock_page(page); ret = -ENOMEM; goto out; diff --git a/mm/shmem.c b/mm/shmem.c index cb09fea4a9ce..eae03f684869 100644 ---
[RFC PATCH 78/79] mm/ksm: rename PAGE_MAPPING_KSM to PAGE_MAPPING_RONLY
From: Jérôme GlisseThis just rename all KSM specific helper to generic page read only name. No functional change. Signed-off-by: Jérôme Glisse Cc: Andrea Arcangeli --- fs/proc/page.c | 2 +- include/linux/page-flags.h | 30 +- mm/ksm.c | 12 ++-- mm/memory-failure.c| 2 +- mm/memory.c| 2 +- mm/migrate.c | 6 +++--- mm/mprotect.c | 2 +- mm/page_idle.c | 2 +- mm/rmap.c | 10 +- mm/swapfile.c | 2 +- 10 files changed, 37 insertions(+), 33 deletions(-) diff --git a/fs/proc/page.c b/fs/proc/page.c index 1491918a33c3..00cc037758ef 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -110,7 +110,7 @@ u64 stable_page_flags(struct page *page) u |= 1 << KPF_MMAP; if (PageAnon(page)) u |= 1 << KPF_ANON; - if (PageKsm(page)) + if (PageReadOnly(page)) u |= 1 << KPF_KSM; /* diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 50c2b8786831..0338fb5dde8d 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -374,12 +374,12 @@ PAGEFLAG(Idle, idle, PF_ANY) * page->mapping points to its anon_vma, not to a struct address_space; * with the PAGE_MAPPING_ANON bit set to distinguish it. See rmap.h. * - * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled, + * On an anonymous page in a VM_MERGEABLE area, if CONFIG_RONLY is enabled, * the PAGE_MAPPING_MOVABLE bit may be set along with the PAGE_MAPPING_ANON * bit; and then page->mapping points, not to an anon_vma, but to a private - * structure which KSM associates with that merged page. See ksm.h. + * structure which RONLY associates with that merged page. See page-ronly.h. * - * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable + * PAGE_MAPPING_RONLY without PAGE_MAPPING_ANON is used for non-lru movable * page and then page->mapping points a struct address_space. 
* * Please note that, confusingly, "page_mapping" refers to the inode @@ -388,7 +388,7 @@ PAGEFLAG(Idle, idle, PF_ANY) */ #define PAGE_MAPPING_ANON 0x1 #define PAGE_MAPPING_MOVABLE 0x2 -#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) +#define PAGE_MAPPING_RONLY (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) #define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE) static __always_inline int PageMappingFlags(struct page *page) @@ -408,21 +408,25 @@ static __always_inline int __PageMovable(struct page *page) PAGE_MAPPING_MOVABLE; } -#ifdef CONFIG_KSM -/* - * A KSM page is one of those write-protected "shared pages" or "merged pages" - * which KSM maps into multiple mms, wherever identical anonymous page content - * is found in VM_MERGEABLE vmas. It's a PageAnon page, pointing not to any - * anon_vma, but to that page's node of the stable tree. +#ifdef CONFIG_PAGE_RONLY +/* PageReadOnly() - Returns true if page is read only, false otherwise. + * + * @page: Page under test. + * + * A read only page is one of those write-protected. Currently only KSM does + * write protect a page as "shared pages" or "merged pages" which KSM maps + * into multiple mms, wherever identical anonymous page content is found in + * VM_MERGEABLE vmas. It's a PageAnon page, pointing not to any anon_vma, + * but to that page's node of the stable tree. */ -static __always_inline int PageKsm(struct page *page) +static __always_inline int PageReadOnly(struct page *page) { page = compound_head(page); return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == - PAGE_MAPPING_KSM; + PAGE_MAPPING_RONLY; } #else -TESTPAGEFLAG_FALSE(Ksm) +TESTPAGEFLAG_FALSE(ReadOnly) #endif u64 stable_page_flags(struct page *page); diff --git a/mm/ksm.c b/mm/ksm.c index f9bd1251c288..6085068fb8b3 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -318,13 +318,13 @@ static void __init ksm_slab_free(void) static inline struct stable_node *page_stable_node(struct page *page) { - return PageKsm(page) ? 
page_rmapping(page) : NULL; + return PageReadOnly(page) ? page_rmapping(page) : NULL; } static inline void set_page_stable_node(struct page *page, struct stable_node *stable_node) { - page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM); + page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_RONLY); } static __always_inline bool is_stable_node_chain(struct stable_node *chain) @@ -470,7 +470,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE); if (IS_ERR_OR_NULL(page))
[RFC PATCH 77/79] mm/ksm: hide set_page_stable_node() and page_stable_node()
From: Jérôme Glisse

Hide these two functions as a preparatory step for generalizing KSM write protection to other users. Moreover, these two helpers cannot be used meaningfully outside ksm.c, as the struct they deal with is defined inside ksm.c.

Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
---
 include/linux/ksm.h | 12 ------------
 mm/ksm.c            | 11 +++++++++++
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 44368b19b27e..83c664080798 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -15,7 +15,6 @@
 #include
 #include

-struct stable_node;
 struct mem_cgroup;

 #ifdef CONFIG_KSM
@@ -37,17 +36,6 @@ static inline void ksm_exit(struct mm_struct *mm)
 	__ksm_exit(mm);
 }

-static inline struct stable_node *page_stable_node(struct page *page)
-{
-	return PageKsm(page) ? page_rmapping(page) : NULL;
-}
-
-static inline void set_page_stable_node(struct page *page,
-					struct stable_node *stable_node)
-{
-	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
-}
-
 /*
  * When do_swap_page() first faults in from swap what used to be a KSM page,
  * no problem, it will be assigned to this vma's anon_vma; but thereafter,
diff --git a/mm/ksm.c b/mm/ksm.c
index 1c16a4309c1d..f9bd1251c288 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -316,6 +316,17 @@ static void __init ksm_slab_free(void)
 	mm_slot_cache = NULL;
 }

+static inline struct stable_node *page_stable_node(struct page *page)
+{
+	return PageKsm(page) ? page_rmapping(page) : NULL;
+}
+
+static inline void set_page_stable_node(struct page *page,
+					struct stable_node *stable_node)
+{
+	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
+}
+
 static __always_inline bool is_stable_node_chain(struct stable_node *chain)
 {
 	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
-- 
2.14.3
[RFC PATCH 74/79] mm/page_ronly: add config option for generic read only page framework.
From: Jérôme Glisse

It's really just a config option patch.

Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
---
 mm/Kconfig | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8fb7235..aeffb6e8dd21 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -149,6 +149,9 @@ config NO_BOOTMEM
 config MEMORY_ISOLATION
 	bool

+config PAGE_RONLY
+	bool
+
 #
 # Only be set on architectures that have completely implemented memory hotplug
 # feature. If you are not sure, don't touch it.
-- 
2.14.3
[PATCH] blk-mq: order getting budget and driver tag
This patch orders getting the budget and the driver tag by making sure the driver tag is acquired only after the budget has been obtained; this helps avoid the following race:

1) before dispatching a request from the scheduler queue, one budget is obtained first, then a request is dequeued, call it request A.

2) in another IO path, dispatching request B from hctx->dispatch, the driver tag is obtained, then blk_mq_dispatch_rq_list() tries to get the budget, but unfortunately the budget is held by request A.

3) meantime blk_mq_dispatch_rq_list() is called for dispatching request A, and tries to get the driver tag first, but unfortunately no driver tag is available because it is held by request B.

4) neither IO path can move on, and an IO stall is caused.

This issue can be observed when running dbench on USB storage.

This patch fixes the issue by always getting the budget before getting the driver tag.

Cc: sta...@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig
Cc: Bart Van Assche
Cc: Omar Sandoval
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 16e83e6df404..90838e998f66 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1188,7 +1188,12 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		struct blk_mq_queue_data bd;

 		rq = list_first_entry(list, struct request, queuelist);
-		if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
+
+		hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu);
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+			break;
+
+		if (!blk_mq_get_driver_tag(rq, NULL, false)) {
 			/*
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed. The
@@ -1197,8 +1202,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * we'll re-run it below.
 			 */
 			if (!blk_mq_mark_tag_wait(&hctx, rq)) {
-				if (got_budget)
-					blk_mq_put_dispatch_budget(hctx);
+				blk_mq_put_dispatch_budget(hctx);
 				/*
 				 * For non-shared tags, the RESTART check
 				 * will suffice.
 				 */
 				break;
 			}
 		}

-		if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
-			blk_mq_put_driver_tag(rq);
-			break;
-		}
-
 		list_del_init(&rq->queuelist);

 		bd.rq = rq;
@@ -1812,11 +1811,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (q->elevator && !bypass_insert)
 		goto insert;

-	if (!blk_mq_get_driver_tag(rq, NULL, false))
+	if (!blk_mq_get_dispatch_budget(hctx))
 		goto insert;

-	if (!blk_mq_get_dispatch_budget(hctx)) {
-		blk_mq_put_driver_tag(rq);
+	if (!blk_mq_get_driver_tag(rq, NULL, false)) {
+		blk_mq_put_dispatch_budget(hctx);
 		goto insert;
 	}
-- 
2.9.5
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, Apr 04, 2018 at 02:45:18PM +0200, Thomas Gleixner wrote: > On Wed, 4 Apr 2018, Thomas Gleixner wrote: > > I'm aware how that hw-queue stuff works. But that only works if the > > spreading algorithm makes the interrupts affine to offline/not-present CPUs > > when the block device is initialized. > > > > In the example above: > > > > > > > irq 39, cpu list 0,4 > > > > > irq 40, cpu list 1,6 > > > > > irq 41, cpu list 2,5 > > > > > irq 42, cpu list 3,7 > > > > and assumed that at driver init time only CPU 0-3 are online then the > > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7. > > > > So the extra assignment to CPU 4-7 in the affinity mask has no effect > > whatsoever and even if the spreading result is 'perfect' it just looks > > perfect as it is not making any difference versus the original result: > > > > > > > irq 39, cpu list 0 > > > > > irq 40, cpu list 1 > > > > > irq 41, cpu list 2 > > > > > irq 42, cpu list 3 > > And looking deeper into the changes, I think that the first spreading step > has to use cpu_present_mask and not cpu_online_mask. > > Assume the following scenario: > > Machine with 8 present CPUs is booted, the 4 last CPUs are > unplugged. Device with 4 queues is initialized. > > The resulting spread is going to be exactly your example: > > irq 39, cpu list 0,4 > irq 40, cpu list 1,6 > irq 41, cpu list 2,5 > irq 42, cpu list 3,7 > > Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an > interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is > unplugged. Using cpu_present_mask the spread would be: > > irq 39, cpu list 0,1 > irq 40, cpu list 2,3 > irq 41, cpu list 4,5 > irq 42, cpu list 6,7 Given physical CPU hotplug isn't common, this way will make only irq 39 and irq 40 active most of times, so performance regression is caused just as Kashyap reported. 
> while on a machine where CPU 4-7 are NOT present, but advertised as
> possible the spread would be:
>
> irq 39, cpu list 0,4
> irq 40, cpu list 1,6
> irq 41, cpu list 2,5
> irq 42, cpu list 3,7

I think this way is still better, since the performance regression can be avoided and there is at least one CPU covering each irq vector, which in practice is often enough.

As I mentioned in another email, I still don't understand why interrupts can't be delivered to CPUs 4-7 after these CPUs become present & online. In theory, interrupts should be delivered to these CPUs, since the affinity info has already been programmed into the interrupt controller. Or do we still need a CPU hotplug handler in the device driver to tell the device about the hotplug change, so that interrupts are delivered to the newly added CPUs?

Thanks,
Ming
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, Apr 04, 2018 at 10:25:16AM +0200, Thomas Gleixner wrote: > On Wed, 4 Apr 2018, Ming Lei wrote: > > On Tue, Apr 03, 2018 at 03:32:21PM +0200, Thomas Gleixner wrote: > > > On Thu, 8 Mar 2018, Ming Lei wrote: > > > > 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible > > > > CPUs") > > > > irq 39, cpu list 0 > > > > irq 40, cpu list 1 > > > > irq 41, cpu list 2 > > > > irq 42, cpu list 3 > > > > > > > > 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible > > > > CPUs") > > > > irq 39, cpu list 0-2 > > > > irq 40, cpu list 3-4,6 > > > > irq 41, cpu list 5 > > > > irq 42, cpu list 7 > > > > > > > > 3) after applying this patch against V4.15+: > > > > irq 39, cpu list 0,4 > > > > irq 40, cpu list 1,6 > > > > irq 41, cpu list 2,5 > > > > irq 42, cpu list 3,7 > > > > > > That's more or less window dressing. If the device is already in use when > > > the offline CPUs get hot plugged, then the interrupts still stay on cpu > > > 0-3 > > > because the effective affinity of interrupts on X86 (and other > > > architectures) is always a single CPU. > > > > > > So this only might move interrupts to the hotplugged CPUs when the device > > > is initialized after CPU hotplug and the actual vector allocation moves an > > > interrupt out to the higher numbered CPUs if they have less vectors > > > allocated than the lower numbered ones. > > > > It works for blk-mq devices, such as NVMe. > > > > Now NVMe driver creates num_possible_cpus() hw queues, and each > > hw queue is assigned one msix irq vector. > > > > Storage is Client/Server model, that means the interrupt is only > > delivered to CPU after one IO request is submitted to hw queue and > > it is completed by this hw queue. > > > > When CPUs is hotplugged, and there will be IO submitted from these > > CPUs, then finally IOs complete and irq events are generated from > > hw queues, and notify these submission CPU by IRQ finally. > > I'm aware how that hw-queue stuff works. 
But that only works if the > spreading algorithm makes the interrupts affine to offline/not-present CPUs > when the block device is initialized. > > In the example above: > > > > > irq 39, cpu list 0,4 > > > > irq 40, cpu list 1,6 > > > > irq 41, cpu list 2,5 > > > > irq 42, cpu list 3,7 > > and assumed that at driver init time only CPU 0-3 are online then the > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7. Indeed, and I just tested this case, and found that no interrupts are delivered to CPU 4-7. In theory, the affinity has been assigned to these irq vectors, and programmed to interrupt controller, I understand it should work. Could you explain it a bit why interrupts aren't delivered to CPU 4-7? Thanks, Ming
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
On 03/30/2018 12:32 PM, Yi Zhang wrote:

Hello

I got this kernel BUG on 4.16.0-rc7 during my NVMeoF RDMA testing, here is the reproducer and log, let me know if you need more info, thanks.

Reproducer:
1. setup target
#nvmetcli restore /etc/rdma.json
2. connect target on host
#nvme connect-all -t rdma -a $IP -s 4420
3. do fio background on host
#fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -bs_unaligned -runtime=180 -size=-group_reporting -name=mytest -numjobs=60 &
4. offline cpu on host
#echo 0 > /sys/devices/system/cpu/cpu1/online
#echo 0 > /sys/devices/system/cpu/cpu2/online
#echo 0 > /sys/devices/system/cpu/cpu3/online
5. clear target
#nvmetcli clear
6. restore target
#nvmetcli restore /etc/rdma.json
7. check console log on host

Hi Yi,

Does this happen with this applied?

--
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..b89da55e8aaa 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,8 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
 	const struct cpumask *mask;
 	unsigned int queue, cpu;

+	goto fallback;
+
 	for (queue = 0; queue < set->nr_hw_queues; queue++) {
 		mask = ib_get_vector_affinity(dev, first_vec + queue);
 		if (!mask)
--
Re: [PATCH] [v2] rbd: avoid Wreturn-type warnings
On Wed, Apr 4, 2018 at 2:53 PM, Arnd Bergmannwrote: > A new set of warnings appeared in next-20180403 in some configurations > when gcc cannot see that rbd_assert(0) leads to an unreachable code > path: > > drivers/block/rbd.c: In function 'rbd_img_is_write': > drivers/block/rbd.c:1397:1: error: control reaches end of non-void function > [-Werror=return-type] > drivers/block/rbd.c: In function '__rbd_obj_handle_request': > drivers/block/rbd.c:2499:1: error: control reaches end of non-void function > [-Werror=return-type] > drivers/block/rbd.c: In function 'rbd_obj_handle_write': > drivers/block/rbd.c:2471:1: error: control reaches end of non-void function > [-Werror=return-type] > > As the rbd_assert() here shows has no extra information beyond the verbose > BUG(), we can simply use BUG() directly in its place. This is reliably > detected as not returning on any architecture, since it doesn't depend > on the unlikely() comparison that confused gcc. > > Fixes: 3da691bf4366 ("rbd: new request handling code") > Signed-off-by: Arnd Bergmann > --- > drivers/block/rbd.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c > index 07dc5419bd63..5f7f4d4b78a8 100644 > --- a/drivers/block/rbd.c > +++ b/drivers/block/rbd.c > @@ -1392,7 +1392,7 @@ static bool rbd_img_is_write(struct rbd_img_request > *img_req) > case OBJ_OP_DISCARD: > return true; > default: > - rbd_assert(0); > + BUG(); > } > } > > @@ -2466,7 +2466,7 @@ static bool rbd_obj_handle_write(struct rbd_obj_request > *obj_req) > } > return false; > default: > - rbd_assert(0); > + BUG(); > } > } > > @@ -2494,7 +2494,7 @@ static bool __rbd_obj_handle_request(struct > rbd_obj_request *obj_req) > } > return false; > default: > - rbd_assert(0); > + BUG(); > } > } > Applied. Thanks, Ilya
Re: [PATCH] rbd: add missing return statements
On Wed, Apr 4, 2018 at 1:04 PM, Ilya Dryomovwrote: > On Wed, Apr 4, 2018 at 11:49 AM, Arnd Bergmann wrote: >> A new set of warnings appeared in next-20180403 in some configurations >> when gcc cannot see that rbd_assert(0) leads to an unreachable code >> path: >> >> drivers/block/rbd.c: In function 'rbd_img_is_write': >> drivers/block/rbd.c:1397:1: error: control reaches end of non-void function >> [-Werror=return-type] >> drivers/block/rbd.c: In function '__rbd_obj_handle_request': >> drivers/block/rbd.c:2499:1: error: control reaches end of non-void function >> [-Werror=return-type] >> drivers/block/rbd.c: In function 'rbd_obj_handle_write': >> drivers/block/rbd.c:2471:1: error: control reaches end of non-void function >> [-Werror=return-type] >> >> To work around this, we can add a return statement to each of these >> cases. An alternative would be to remove the unlikely() annotation >> in rbd_assert(), or to just use BUG()/BUG_ON() directly. This adds the >> return statements, guessing what the most reasonable behavior >> would be. > > Hi Arnd, > > I don't like these bogus return statements. Let's go with explicit > BUG/BUG_ON() instead. Sounds good. Sent a v2 now. Arnd
[PATCH] [v2] rbd: avoid Wreturn-type warnings
A new set of warnings appeared in next-20180403 in some configurations when gcc cannot see that rbd_assert(0) leads to an unreachable code path: drivers/block/rbd.c: In function 'rbd_img_is_write': drivers/block/rbd.c:1397:1: error: control reaches end of non-void function [-Werror=return-type] drivers/block/rbd.c: In function '__rbd_obj_handle_request': drivers/block/rbd.c:2499:1: error: control reaches end of non-void function [-Werror=return-type] drivers/block/rbd.c: In function 'rbd_obj_handle_write': drivers/block/rbd.c:2471:1: error: control reaches end of non-void function [-Werror=return-type] As the rbd_assert() here shows has no extra information beyond the verbose BUG(), we can simply use BUG() directly in its place. This is reliably detected as not returning on any architecture, since it doesn't depend on the unlikely() comparison that confused gcc. Fixes: 3da691bf4366 ("rbd: new request handling code") Signed-off-by: Arnd Bergmann--- drivers/block/rbd.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 07dc5419bd63..5f7f4d4b78a8 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1392,7 +1392,7 @@ static bool rbd_img_is_write(struct rbd_img_request *img_req) case OBJ_OP_DISCARD: return true; default: - rbd_assert(0); + BUG(); } } @@ -2466,7 +2466,7 @@ static bool rbd_obj_handle_write(struct rbd_obj_request *obj_req) } return false; default: - rbd_assert(0); + BUG(); } } @@ -2494,7 +2494,7 @@ static bool __rbd_obj_handle_request(struct rbd_obj_request *obj_req) } return false; default: - rbd_assert(0); + BUG(); } } -- 2.9.0
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, 4 Apr 2018, Thomas Gleixner wrote: > I'm aware how that hw-queue stuff works. But that only works if the > spreading algorithm makes the interrupts affine to offline/not-present CPUs > when the block device is initialized. > > In the example above: > > > > > irq 39, cpu list 0,4 > > > > irq 40, cpu list 1,6 > > > > irq 41, cpu list 2,5 > > > > irq 42, cpu list 3,7 > > and assumed that at driver init time only CPU 0-3 are online then the > hotplug of CPU 4-7 will not result in any interrupt delivered to CPU 4-7. > > So the extra assignment to CPU 4-7 in the affinity mask has no effect > whatsoever and even if the spreading result is 'perfect' it just looks > perfect as it is not making any difference versus the original result: > > > > > irq 39, cpu list 0 > > > > irq 40, cpu list 1 > > > > irq 41, cpu list 2 > > > > irq 42, cpu list 3 And looking deeper into the changes, I think that the first spreading step has to use cpu_present_mask and not cpu_online_mask. Assume the following scenario: Machine with 8 present CPUs is booted, the 4 last CPUs are unplugged. Device with 4 queues is initialized. The resulting spread is going to be exactly your example: irq 39, cpu list 0,4 irq 40, cpu list 1,6 irq 41, cpu list 2,5 irq 42, cpu list 3,7 Now the 4 offline CPUs are plugged in again. These CPUs won't ever get an interrupt as all interrupts stay on CPU 0-3 unless one of these CPUs is unplugged. Using cpu_present_mask the spread would be: irq 39, cpu list 0,1 irq 40, cpu list 2,3 irq 41, cpu list 4,5 irq 42, cpu list 6,7 while on a machine where CPU 4-7 are NOT present, but advertised as possible the spread would be: irq 39, cpu list 0,4 irq 40, cpu list 1,6 irq 41, cpu list 2,5 irq 42, cpu list 3,7 Hmm? Thanks, tglx
Re: [PATCH] rbd: add missing return statements
On Wed, Apr 4, 2018 at 11:49 AM, Arnd Bergmann wrote:
> A new set of warnings appeared in next-20180403 in some configurations
> when gcc cannot see that rbd_assert(0) leads to an unreachable code
> path:
>
> drivers/block/rbd.c: In function 'rbd_img_is_write':
> drivers/block/rbd.c:1397:1: error: control reaches end of non-void function [-Werror=return-type]
> drivers/block/rbd.c: In function '__rbd_obj_handle_request':
> drivers/block/rbd.c:2499:1: error: control reaches end of non-void function [-Werror=return-type]
> drivers/block/rbd.c: In function 'rbd_obj_handle_write':
> drivers/block/rbd.c:2471:1: error: control reaches end of non-void function [-Werror=return-type]
>
> To work around this, we can add a return statement to each of these
> cases. An alternative would be to remove the unlikely() annotation
> in rbd_assert(), or to just use BUG()/BUG_ON() directly. This adds the
> return statements, guessing what the most reasonable behavior
> would be.

Hi Arnd,

I don't like these bogus return statements. Let's go with explicit
BUG()/BUG_ON() instead.

Thanks,

		Ilya
[PATCH] rbd: add missing return statements
A new set of warnings appeared in next-20180403 in some configurations
when gcc cannot see that rbd_assert(0) leads to an unreachable code
path:

drivers/block/rbd.c: In function 'rbd_img_is_write':
drivers/block/rbd.c:1397:1: error: control reaches end of non-void function [-Werror=return-type]
drivers/block/rbd.c: In function '__rbd_obj_handle_request':
drivers/block/rbd.c:2499:1: error: control reaches end of non-void function [-Werror=return-type]
drivers/block/rbd.c: In function 'rbd_obj_handle_write':
drivers/block/rbd.c:2471:1: error: control reaches end of non-void function [-Werror=return-type]

To work around this, we can add a return statement to each of these
cases. An alternative would be to remove the unlikely() annotation in
rbd_assert(), or to just use BUG()/BUG_ON() directly. This adds the
return statements, guessing what the most reasonable behavior would be.

Fixes: 3da691bf4366 ("rbd: new request handling code")
Signed-off-by: Arnd Bergmann
---
 drivers/block/rbd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 07dc5419bd63..9445a71a9cd6 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1394,6 +1394,7 @@ static bool rbd_img_is_write(struct rbd_img_request *img_req)
 	default:
 		rbd_assert(0);
 	}
+	return false;
 }
 
 static void rbd_obj_handle_request(struct rbd_obj_request *obj_req);
@@ -2468,6 +2469,7 @@ static bool rbd_obj_handle_write(struct rbd_obj_request *obj_req)
 	default:
 		rbd_assert(0);
 	}
+	return true;
 }
 
 /*
@@ -2496,6 +2498,7 @@ static bool __rbd_obj_handle_request(struct rbd_obj_request *obj_req)
 	default:
 		rbd_assert(0);
 	}
+	return true;
 }
 
 static void rbd_obj_end_request(struct rbd_obj_request *obj_req)
-- 
2.9.0
Re: [PATCH V3 4/4] genirq/affinity: irq vector spread among online CPUs as far as possible
On Wed, 4 Apr 2018, Ming Lei wrote:
> On Tue, Apr 03, 2018 at 03:32:21PM +0200, Thomas Gleixner wrote:
> > On Thu, 8 Mar 2018, Ming Lei wrote:
> > > 1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > 	irq 39, cpu list 0
> > > 	irq 40, cpu list 1
> > > 	irq 41, cpu list 2
> > > 	irq 42, cpu list 3
> > >
> > > 2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > > 	irq 39, cpu list 0-2
> > > 	irq 40, cpu list 3-4,6
> > > 	irq 41, cpu list 5
> > > 	irq 42, cpu list 7
> > >
> > > 3) after applying this patch against V4.15+:
> > > 	irq 39, cpu list 0,4
> > > 	irq 40, cpu list 1,6
> > > 	irq 41, cpu list 2,5
> > > 	irq 42, cpu list 3,7
> >
> > That's more or less window dressing. If the device is already in use when
> > the offline CPUs get hot plugged, then the interrupts still stay on cpu 0-3
> > because the effective affinity of interrupts on X86 (and other
> > architectures) is always a single CPU.
> >
> > So this only might move interrupts to the hotplugged CPUs when the device
> > is initialized after CPU hotplug and the actual vector allocation moves an
> > interrupt out to the higher numbered CPUs if they have less vectors
> > allocated than the lower numbered ones.
>
> It works for blk-mq devices, such as NVMe.
>
> Now NVMe driver creates num_possible_cpus() hw queues, and each
> hw queue is assigned one msix irq vector.
>
> Storage is Client/Server model, that means the interrupt is only
> delivered to CPU after one IO request is submitted to hw queue and
> it is completed by this hw queue.
>
> When CPUs is hotplugged, and there will be IO submitted from these
> CPUs, then finally IOs complete and irq events are generated from
> hw queues, and notify these submission CPU by IRQ finally.

I'm aware how that hw-queue stuff works. But that only works if the
spreading algorithm makes the interrupts affine to offline/not-present
CPUs when the block device is initialized.
In the example above:

> > > 	irq 39, cpu list 0,4
> > > 	irq 40, cpu list 1,6
> > > 	irq 41, cpu list 2,5
> > > 	irq 42, cpu list 3,7

and assumed that at driver init time only CPU 0-3 are online then the
hotplug of CPU 4-7 will not result in any interrupt delivered to CPU
4-7.

So the extra assignment to CPU 4-7 in the affinity mask has no effect
whatsoever and even if the spreading result is 'perfect' it just looks
perfect as it is not making any difference versus the original result:

> > > 	irq 39, cpu list 0
> > > 	irq 40, cpu list 1
> > > 	irq 41, cpu list 2
> > > 	irq 42, cpu list 3

Thanks,

	tglx
Re: [PATCH] blk-mq: only run mapped hw queues in blk_mq_run_hw_queues()
On 03/30/2018 04:53 AM, Ming Lei wrote:
> On Thu, Mar 29, 2018 at 01:49:29PM +0200, Christian Borntraeger wrote:
>>
>> On 03/29/2018 01:43 PM, Ming Lei wrote:
>>> On Thu, Mar 29, 2018 at 12:49:55PM +0200, Christian Borntraeger wrote:
On 03/29/2018 12:48 PM, Ming Lei wrote:
> On Thu, Mar 29, 2018 at 12:10:11PM +0200, Christian Borntraeger wrote:
>>
>> On 03/29/2018 11:40 AM, Ming Lei wrote:
>>> On Thu, Mar 29, 2018 at 11:09:08AM +0200, Christian Borntraeger wrote:
On 03/29/2018 09:23 AM, Christian Borntraeger wrote:
>
> On 03/29/2018 04:00 AM, Ming Lei wrote:
>> On Wed, Mar 28, 2018 at 05:36:53PM +0200, Christian Borntraeger wrote:
>>>
>>> On 03/28/2018 05:26 PM, Ming Lei wrote:
Hi Christian,

On Wed, Mar 28, 2018 at 09:45:10AM +0200, Christian Borntraeger wrote:
> FWIW, this patch does not fix the issue for me:
>
> ostname=? addr=? terminal=? res=success'
> [ 21.454961] WARNING: CPU: 3 PID: 1882 at block/blk-mq.c:1410 __blk_mq_delay_run_hw_queue+0xbe/0xd8
> [ 21.454968] Modules linked in: scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log dm_multipath dm_mod autofs4
> [ 21.454984] CPU: 3 PID: 1882 Comm: dasdconf.sh Not tainted 4.16.0-rc7+ #26
> [ 21.454987] Hardware name: IBM 2964 NC9 704 (LPAR)
> [ 21.454990] Krnl PSW : c0131ea3 3ea2f7bf (__blk_mq_delay_run_hw_queue+0xbe/0xd8)
> [ 21.454996]            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [ 21.455005] Krnl GPRS: 013abb69a000 013a 013ac6c0dc00 0001
> [ 21.455008]            013abb69a710 013a 0001b691fd98
> [ 21.455011]            0001b691fd98 013ace4775c8 0001
> [ 21.455014]            013ac6c0dc00 00b47238 0001b691fc08 0001b691fbd0
> [ 21.455032] Krnl Code: 0069c596: ebaff0a4	lmg	%r10,%r15,160(%r15)
>                         0069c59c: c0f47a5e	brcl	15,68ba58
>                        #0069c5a2: a7f40001	brc	15,69c5a4
>                        >0069c5a6: e340f0c4	lg	%r4,192(%r15)
>                         0069c5ac: ebaff0a4	lmg	%r10,%r15,160(%r15)
>                         0069c5b2: 07f4	bcr	15,%r4
>                         0069c5b4: c0e5feea	brasl	%r14,69c388
>                         0069c5ba: a7f4fff6	brc	15,69c5a6
> [ 21.455067]
Call Trace:
> [ 21.455072] ([<0001b691fd98>] 0x1b691fd98)
> [ 21.455079]  [<0069c692>] blk_mq_run_hw_queue+0xba/0x100
> [ 21.455083]  [<0069c740>] blk_mq_run_hw_queues+0x68/0x88
> [ 21.455089]  [<0069b956>] __blk_mq_complete_request+0x11e/0x1d8
> [ 21.455091]  [<0069ba9c>] blk_mq_complete_request+0x8c/0xc8
> [ 21.455103]  [<008aa250>] dasd_block_tasklet+0x158/0x490
> [ 21.455110]  [<0014c742>] tasklet_hi_action+0x92/0x120
> [ 21.455118]  [<00a7cfc0>] __do_softirq+0x120/0x348
> [ 21.455122]  [<0014c212>] irq_exit+0xba/0xd0
> [ 21.455130]  [<0010bf92>] do_IRQ+0x8a/0xb8
> [ 21.455133]  [<00a7c298>] io_int_handler+0x130/0x298
> [ 21.455136] Last Breaking-Event-Address:
> [ 21.455138]  [<0069c5a2>] __blk_mq_delay_run_hw_queue+0xba/0xd8
> [ 21.455140] ---[ end trace be43f99a5d1e553e ]---
> [ 21.510046] dasdconf.sh Warning: 0.0.241e is already online, not configuring

Thinking about this issue further, I can't understand the root cause
for this issue.

FWIW, Limiting CONFIG_NR_CPUS to 64 seems to make the problem go away.

>>> I think the following patch is needed, and this way aligns to the
>>> mapping created via managed IRQ at least.
>>>
>>> diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
>>> index