Re: [Cluster-devel] [syzbot] [gfs2?] KASAN: use-after-free Read in qd_unlock (2)
On Wed, Jul 26, 2023 at 06:45:55PM +0300, Dmitry Baryshkov wrote: > > > bisection log: > > > https://syzkaller.appspot.com/x/bisect.txt?x=17b48111a8 ... > > > dashboard link: > > > https://syzkaller.appspot.com/bug?extid=3f6a670108ce43356017 > I highly suspect that the bisect was wrong here. The only thing that > was changed by the mentioned commit is the device tree for the pretty > obscure platform, which is not 'Google Compute Engine'. Yeah, it's not even close. If you take a look at the bisection log (which is *always* a good idea before you put any faith in the syzbot bisection), you'd see the following: testing commit e1c04510f521e853019afeca2a5991a5ef8d6a5b gcc compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 kernel signature: f262f513a4ba5708b69a5fdd8c218746223996a8b2134a22f2916d16f23d01e8 run #0: crashed: unregister_netdevice: waiting for DEV to become free run #1: crashed: unregister_netdevice: waiting for DEV to become free run #2: crashed: unregister_netdevice: waiting for DEV to become free run #3: crashed: unregister_netdevice: waiting for DEV to become free run #4: crashed: unregister_netdevice: waiting for DEV to become free run #5: crashed: unregister_netdevice: waiting for DEV to become free run #6: crashed: unregister_netdevice: waiting for DEV to become free run #7: crashed: unregister_netdevice: waiting for DEV to become free run #8: crashed: unregister_netdevice: waiting for DEV to become free This is *nothing* like the problem reported on the dashboard, which is: BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:72 [inline] BUG: KASAN: use-after-free in _test_bit include/asm-generic/bitops/instrumented-non-atomic.h:141 [inline] BUG: KASAN: use-after-free in qd_unlock+0x30/0x2d0 fs/gfs2/quota.c:490 Read of size 8 at addr 888073997090 by task syz-executor221/5069 where the dereference had a stack trace which looked like this: _test_bit include/asm-generic/bitops/instrumented-non-atomic.h:141 [inline] qd_unlock+0x30/0x2d0 fs/gfs2/quota.c:490 gfs2_quota_sync+0x768/0x8b0 fs/gfs2/quota.c:1325 gfs2_sync_fs+0x49/0xb0 fs/gfs2/super.c:650 sync_filesystem+0xe8/0x220 fs/sync.c:56 generic_shutdown_super+0x6b/0x310 fs/super.c:474 kill_block_super+0x79/0xd0 fs/super.c:1386 deactivate_locked_super+0xa7/0xf0 fs/super.c:332 cleanup_mnt+0x494/0x520 fs/namespace.c:1291 task_work_run+0x243/0x300 kernel/task_work.c:179 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x644/0x2150 kernel/exit.c:867 and the memory was allocated via this stack trace: kmem_cache_alloc+0x1b3/0x350 mm/slub.c:3476 kmem_cache_zalloc include/linux/slab.h:710 [inline] qd_alloc+0x51/0x250 fs/gfs2/quota.c:216 gfs2_quota_init+0x7c4/0x10e0 fs/gfs2/quota.c:1415 gfs2_make_fs_rw+0x48e/0x590 fs/gfs2/super.c:153 gfs2_fill_super+0x2357/0x2700 fs/gfs2/ops_fstype.c:1274 get_tree_bdev+0x400/0x620 fs/super.c:1282 gfs2_get_tree+0x50/0x210 fs/gfs2/ops_fstype.c:1330 vfs_get_tree+0x88/0x270 fs/super.c:1489 do_new_mount+0x289/0xad0 fs/namespace.c:3145 do_mount fs/namespace.c:3488 [inline] __do_sys_mount fs/namespace.c:3697 [inline] __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 (And the memory was freed from an RCU path) - Ted
Re: [Cluster-devel] [PATCH 01/12] backing_dev: remove current->backing_dev_info
On Wed, May 31, 2023 at 09:50:15AM +0200, Christoph Hellwig wrote: > The last user of current->backing_dev_info disappeared in commit > b9b1335e6403 ("remove bdi_congested() and wb_congested() and related > functions"). Remove the field and all assignments to it. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Hannes Reinecke > Reviewed-by: Darrick J. Wong Acked-by: Theodore Ts'o
Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write
On Wed, May 31, 2023 at 09:50:17AM +0200, Christoph Hellwig wrote: > All callers of generic_perform_write need to updated ki_pos, move it into > common code. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Xiubo Li > Reviewed-by: Damien Le Moal > Reviewed-by: Hannes Reinecke > Acked-by: Darrick J. Wong Acked-by: Theodore Ts'o
Re: [Cluster-devel] [GIT PULL] gfs2 writepage fix
On Mon, Jan 23, 2023 at 11:05:56AM +0100, Jan Kara wrote: > Thanks for the heads up. So we had kind of a similar issue for ext4 but it > got caught by Ted during his regression runs so we've basically done what > GFS2 is doing for the merge window and now there's patchset pending to > convert the data=journal path as well because as Andreas states in his > changelog of the revert that's a bit more tricky. But at least for ext4 > the conversion of data=journal path resulted in much cleaner and shorter > code. https://thunk.org/tytso/images/firestarter-fs-development-without-testing.jpg :-) - Ted
Re: [Cluster-devel] [PATCH v2 06/14] jbd2: replace ll_rw_block()
On Thu, Sep 01, 2022 at 09:34:57PM +0800, Zhang Yi wrote: > ll_rw_block() is not safe for the sync read path because it cannot > guarantee that submitting read IO if the buffer has been locked. We > could get false positive EIO after wait_on_buffer() if the buffer has > been locked by others. So stop using ll_rw_block() in > journal_get_superblock(). We also switch to new bh_readahead_batch() > for the buffer array readahead path. > > Signed-off-by: Zhang Yi Thanks, looks good. Reviewed-by: Theodore Ts'o - Ted
Re: [Cluster-devel] [PATCH -v5] ext4: don't BUG if someone dirty pages without asking ext4 first
[un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. A related race was noted by Jan Kara in 2018[1]; however, more recently instead of it being a very hard-to-hit race, it could be reliably triggered by process_vm_writev(2) which was discovered by Syzbot[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that if some other kernel subsystem dirty pages without properly notifying the file system using page_mkwrite(), ext4 will BUG, while other file systems will not BUG (although data will still be lost). So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz [2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com Reported-by: Lee Jones Signed-off-by: Theodore Ts'o --- -v5 Argh, sent the wrong version of this patch for V4. (Which, if you tried testing it, would cause generic/013 on ext4/4k to die horribly. :-) This is the correct final version. fs/ext4/inode.c | 25 + 1 file changed, 25 insertions(+) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 01c9e4f743ba..531a94f48637 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, else len = PAGE_SIZE; + /* Should never happen but for bugs in other kernel subsystems */ + if (!page_has_buffers(page)) { + ext4_warning_inode(inode, + "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + return 0; + } + page_bufs = page_buffers(page); /* * We cannot do block allocation or other extent handling in this @@ -2594,6 +2603,22 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); + /* +* Should never happen but for buggy code in +* other subsystems that call +* set_page_dirty() without properly warning +* the file system first. See [1] for more +* information. +* +* [1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz +*/ + if (!page_has_buffers(page)) { + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + continue; + } + if (mpd->map.m_len == 0) mpd->first_page = page->index; mpd->next_page = page->index + 1; -- 2.31.0
Re: [Cluster-devel] [PATCH -v4] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
[un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. A related race was noted by Jan Kara in 2018[1]; however, more recently instead of it being a very hard-to-hit race, it could be reliably triggered by process_vm_writev(2) which was discovered by Syzbot[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that if some other kernel subsystem dirty pages without properly notifying the file system using page_mkwrite(), ext4 will BUG, while other file systems will not BUG (although data will still be lost). So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz [2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com Reported-by: Lee Jones Signed-off-by: Theodore Ts'o Cc: sta...@kernel.org --- v4 - only changes to the commit description to eliminate some inaccuracies and clarify the text. fs/ext4/inode.c | 27 ++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 01c9e4f743ba..008fe8750109 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, else len = PAGE_SIZE; + /* Should never happen but for bugs in other kernel subsystems */ + if (!page_has_buffers(page)) { + ext4_warning_inode(inode, + "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + return 0; + } + page_bufs = page_buffers(page); /* * We cannot do block allocation or other extent handling in this @@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) (mpd->wbc->sync_mode == WB_SYNC_NONE)) || unlikely(page->mapping != mapping)) { unlock_page(page); - continue; + goto out; } wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); + /* +* Should never happen but for buggy code in +* other subsystems that call +* set_page_dirty() without properly warning +* the file system first. See [1] for more +* information. +* +* [1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz +*/ + if (!page_has_buffers(page)) { + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + continue; + } + if (mpd->map.m_len == 0) mpd->first_page = page->index; mpd->next_page = page->index + 1; -- 2.31.0
Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
On Fri, Feb 25, 2022 at 08:40:36PM -0500, Theodore Ts'o wrote: > Well, that makes it process_vm_writev()'s is that it needs to know > when to call pin_user_file_pages(). Sorry, typed too fast. What I was trying to say is this make it process_vm_writev()'s problem to figure out when it should call pin_user_file_pages() versus some other pin_user_pages function. > I suspect that for many use cases > --- for example, if this is being used by a debugger to modify a > variable on a stack, or an anonymous page in the program's data > segment, process_vm_writev() *isn't* actually pinning a file. So they > want some kind of interface that automatically DTRT regardless of > whether the user pages being edited are file-backed or not > file-backed. - Ted
Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
On Fri, Feb 25, 2022 at 04:41:14PM -0800, John Hubbard wrote: > > > f2fs and btrfs's compressed file write support, by making things work > > much like the write(2) system call. Imagine if we had a > > "pin_user_pages_local()" which calls write_begin(), and a > > "unpin_user_pages_local()" which calls write_end(), and the > > Right, that would supply the missing connection to the filesystems. > > In fact, maybe these names about right: > > pin_user_file_pages() > unpin_user_file_pages() > > ...and then put them in a filesystem header file, because these are now > tightly coupled to filesystems, what with the need to call > .write_begin() and .write_end(). Well, that makes it process_vm_writev()'s is that it needs to know when to call pin_user_file_pages(). I suspect that for many use cases --- for example, if this is being used by a debugger to modify a variable on a stack, or an anonymous page in the program's data segment, process_vm_writev() *isn't* actually pinning a file. So they want some kind of interface that automatically DTRT regardless of whether the user pages being edited are file-backed or not file-backed. So some kind of [un]pin_user_pages_local() which will call write_{begin,end}() if necessary would be the most convenient for users such as process_vm_writev(). And perhaps would it make sense for pin_user_pages to optionally (or by default?) check for file-backed pages, and if it finds any, return an error or stop pinning pages at that point, so the system call can return EOPNOSUPP to the user, instead of silently causing user data to be lost or corrupted as is currently the case with xfs and btrfs (and ext4 once I patch it so it doesn't BUG). I'll note that at least one caller of pin_user_pages, in fs/io_uring.c takes it upon itself to check for file-backed pages, and returns EOPNOTSUPP if there are any found. Many that should be lifted to pin_user_pages()? For that matter, maybe pin_user_pages() and friends should take some new FOLL_ flags to indicate whether file-backed pages should be rejected, or perhaps they can promise they will only be holding the pin for a very short amount of time (FOLL_SHORTERM?), and then pin_user_pages() and unpin_user_pages() can automagically call write_begin() and write_end() if necessary? I dunno - Ted
Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote: > On 2/25/22 13:23, Theodore Ts'o wrote: > > [un]pin_user_pages_remote is dirtying pages without properly warning > > the file system in advance. This was noted by Jan Kara in 2018[1] and > > In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported > was actually that dio_bio_complete() was calling set_page_dirty_lock() > on pages that were not (any longer) set up for that. Fair enough, there are two problems that are getting conflated here, and that's my bad. The problem which Jan pointed out is one where the Direct I/O read path triggered a page fault, so page_mkwrite() was actually called. So in this case, the file system was actually notified, and the page was marked dirty after the file system was notified. But then the DIO read was racing with the page cleaner, which would call writepage(), and then clear the page, and then remove the buffer_heads. Then dio_bio_complete() would call set_page_dirty() a second time, and that's what would trigger the BUG. But in the syzbot reproducer, it's a different problem. In this case, process_vm_writev() calling [un]pin_user_pages_remote(), and page_mkwrite() is never getting called. So there is no need to race with the page cleaner, and so the BUG triggers much more reliably. > > more recently has resulted in bug reports by Syzbot in various Android > > kernels[2]. > > > > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in > > Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock() > call into mm/gup.c, but that merely refactored things. The callers are > all over the kernel, and those callers are what need changing in order > to fix this. >From my perspective, the bug is calling set_page_dirty() without first calling the file system's page_mkwrite(). This is necessary since the file system needs to allocate file system data blocks in preparation for a future writeback. Now, calling page_mkwrite() by itself is not enough, since the moment you make the page dirty, the page cleaner could go ahead and call writepage() behind your back and clean it. In actual practice, with a Direct I/O read request racing with writeback, this is race was quite hard to hit, because the that would imply that the background writepage() call would have to complete ahead of the synchronous read request, and the block layer generally prioritizes synchronous reads ahead of background write requests. So in practice, this race was ***very*** hard to hit. Jan may have reported it in 2018, but I don't think I've ever seen it happen myself. For process_vm_writev() this is a case where user pages are pinned and then released in short order, so I suspect that race with the page cleaner would also be very hard to hit. But we could completely remove the potential for the race, and also make things kinder for f2fs and btrfs's compressed file write support, by making things work much like the write(2) system call. Imagine if we had a "pin_user_pages_local()" which calls write_begin(), and a "unpin_user_pages_local()" which calls write_end(), and the presumption with the "[un]pin_user_pages_local" API is that you don't hold the pinned pages for very long --- say, not across a system call boundary, and then it would work the same way the write(2) system call works does except that in the case of process_vm_writev(2) the pages are identified by another process's address space where they happen to be mapped. This obviously doesn't work when pinning pages for remote DMA, because in that case the time between pin_user_pages_remote() and unpin_user_pages_remote() could be a long, long time, so that means we can't use using write_begin/write_end; we'd need to call page_mkwrite() when the pages are first pinned and then somehow prevent the page cleaner from touching a dirty page which is pinned for use by the remote DMA. Does that make sense? - Ted
[Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
[un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance. This was noted by Jan Kara in 2018[1] and more recently has resulted in bug reports by Syzbot in various Android kernels[2]. This is technically a bug in mm/gup.c, but arguably ext4 is fragile in that a buggy kernel subsystem which dirty pages without properly notifying the file system causes ext4 to BUG, while other file systems are not (although user data likely will be lost). I suspect in real life it is rare that people are using RDMA into file-backed memory, which is why no one has complained to ext4 developers except fuzzing programs. So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz [2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com Reported-by: Lee Jones Signed-off-by: Theodore Ts'o --- fs/ext4/inode.c | 27 ++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 01c9e4f743ba..008fe8750109 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, else len = PAGE_SIZE; + /* Should never happen but for bugs in other kernel subsystems */ + if (!page_has_buffers(page)) { + ext4_warning_inode(inode, + "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + return 0; + } + page_bufs = page_buffers(page); /* * We cannot do block allocation or other extent handling in this @@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) (mpd->wbc->sync_mode == WB_SYNC_NONE)) || unlikely(page->mapping != mapping)) { unlock_page(page); - continue; + goto out; } wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); + /* +* Should never happen but for buggy code in +* other subsystems that call +* set_page_dirty() without properly warning +* the file system first. See [1] for more +* information. +* +* [1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz +*/ + if (!page_has_buffers(page)) { + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + continue; + } + if (mpd->map.m_len == 0) mpd->first_page = page->index; mpd->next_page = page->index + 1; -- 2.31.0
Re: [Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
On Fri, Feb 25, 2022 at 12:51:28PM -0800, Eric Biggers wrote: > > > > [1] https://www.spinics.net/lists/linux-mm/msg142700.html > > Can you use a lore link > (https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz/T/#u)? Sure, thanks for finding the lore link! I did try searching for it last night, but for some reason I wasn't able to surface it. > > + /* > > +* Should never happen but for buggy code in > > +* other subsystemsa that call > > subsystemsa => subsystems Already fixed in my local version of the patch. > > +* [1] > > https://www.spinics.net/lists/linux-mm/msg142700.html > > Likewise, lore link here. Ack. - Ted
[Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first
[un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance (or faulting in the file data if the page is not yet in the page cache). This was noted by Jan Kara in 2018[1] and more recently has resulted in bug reports by Syzbot in various Android kernels[2]. This is technically a bug in the mm/gup.c codepath, but arguably ext4 is fragile in that a buggy get_user_pages() implementation causes ext4 to crash, where as other file systems are not crashing (although in some cases the user data will be lost since gup code is not properly informing the file system to potentially allocate blocks or reserve space when writing into a sparse portion of file). I suspect in real life it is rare that people are using RDMA into file-backed memory, which is why no one has complained to ext4 developers except fuzzing programs. So instead of crashing with a BUG, issue a warning (since there may be potential data loss) and just mark the page as clean to avoid unprivileged denial of service attacks until the problem can be properly fixed. More discussion and background can be found in the thread starting at [2]. [1] https://www.spinics.net/lists/linux-mm/msg142700.html [2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com Reported-by: Lee Jones Signed-off-by: Theodore Ts'o --- fs/ext4/inode.c | 27 ++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 01c9e4f743ba..f8fefbf67306 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, else len = PAGE_SIZE; + /* Should never happen but for buggy gup code */ + if (!page_has_buffers(page)) { + ext4_warning_inode(inode, + "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + return 0; + } + page_bufs = page_buffers(page); /* * We cannot do block allocation or other extent handling in this @@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) (mpd->wbc->sync_mode == WB_SYNC_NONE)) || unlikely(page->mapping != mapping)) { unlock_page(page); - continue; + goto out; } wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); + /* +* Should never happen but for buggy code in +* other subsystemsa that call +* set_page_dirty() without properly warning +* the file system first. See [1] for more +* information. +* +* [1] https://www.spinics.net/lists/linux-mm/msg142700.html +*/ + if (!page_has_buffers(page)) { + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + continue; + } + if (mpd->map.m_len == 0) mpd->first_page = page->index; mpd->next_page = page->index + 1; -- 2.31.0
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Wed, Feb 23, 2022 at 04:44:07PM -0800, John Hubbard wrote: > > Actually...I can confirm that real customers really are doing *exactly* > that! Despite the kernel crashes--because the crashes don't always > happen unless you have a large (supercomputer-sized) installation. And > even then it is not always root-caused properly. Interesting. The syzbot reproducer triggers *reliably* on ext4 using a 2 CPU qemu kernel running on a laptop, and it doesn't require root, so it's reasonable that Lee is pushing for a fix --- even if for the Android O or newer, Seccomp can probably prohibit trap process_vm_writev(2), but it seems unfortunate if say, someone running a Docker container could take down the entire host OS. - Ted
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote: > > Fair enough; on the other hand, we could also view this as making ext4 > > more robust against buggy code in other subsystems, and while other > > file systems may be losing user data if they are actually trying to do > > remote memory access to file-backed memory, apparently other file > > systems aren't noticing and so they're not crashing. > > Oh, we've noticed them, no question about that. We've got bug > reports going back years for systems being crashed, triggering BUGs > and/or corrupting data on both XFS and ext4 filesystems due to users > trying to run RDMA applications with file backed pages. Is this issue causing XFS to crash? I didn't know that. I tried the Syzbot reproducer with XFS mounted, and it didn't trigger any crashes. I'm sure data was getting corrupted, but I figured I should bring ext4 to the XFS level of "at least we're not reliably killing the kernel". On ext4, an unprivileged process can use process_vm_writev(2) to crash the system. I don't know how quickly we can get a fix into mm/gup.c, but if some other kernel path tries calling set_page_dirty() on a file-backed page without first asking permission from the file system, it seems to be nice if the file system doesn't BUG() --- as near as I can tell, xfs isn't crashing in this case, but ext4 is. - Ted
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote: > > The challenge is that fixing this "the right away" is probably not > > something we can backport into an LTS kernel, whether it's 5.15 or > > 5.10... or 4.19. > > Don't worry about stable backports to start with. Do it the "right way" > first and then we can consider if it needs to be backported or not. Fair enough; on the other hand, we could also view this as making ext4 more robust against buggy code in other subsystems, and while other file systems may be losing user data if they are actually trying to do remote memory access to file-backed memory, apparently other file systems aren't noticing and so they're not crashing. Issuing a warning and then not crashing is arguably a better way for ext4 to react, especially if there are other parts of the kernel that are randomly calling set_page_dirty() on file-backed memory without properly first informing the file system in a context where it can block and potentially do I/O to do things like allocate blocks. - Ted
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote: > > Just a small thing I'll say once, to get it out of my system. No action > required here, I just want it understood: > > Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via > pin_user_pages_remote()"), you would have written that like this: > > "process_vm_writev() is dirtying pages without properly warning the file > system in advance..." > > Because, there were many callers that were doing this: > > get_user_pages*() > ...use the pages... > > for_each_page() { > set_page_dirty*() > put_page() > } Sure, but that's not sufficient when modifying file-backed pages. Previously, there was only two ways of modifying file-backed pages in the page cache --- either using the write(2) system call, or when a mmaped page is modified by the userspace. In the case of write(2), the file system gets notified before the page cache is modified by a call to the address operation's write_begin() call, and after the page cache is modified, the address operation's write_end() call lets the file system know that the modification is done. After the write is done, the 30 second writeback timer is triggered, and in the case of ext4's data=journalling mode, we close the ext4 micro-transation (and therefore the time between write_begin and write_end calls needs to be minimal); otherwise this can block ext4 transactions. In the case of a user page fault, the address operation's page_mkwrite() is called, and at that point we will allocate any blocks needed to back memory if necessary (in the case of delayed allocation, file system space has to get reserved). The problem here for remote access is that page_mkwrite() can block, and it can potentially fail (e.g., with ENOSPC or ENOMEM). This is also why we can't just add the page buffers and do the file system allocation in set_page_dirty(), since set_page_dirty() isn't allowed to block. One approach might be to make all of the pages writeable when pin_user_pages_remote() is called. That that would mean that in the case of remote access via process_vm_writev or RDMA, all of the blocks will be allocated early. But that's probably better since at that point the userspace code is in a position to receive the error when setting up the RDMA memory, and we don't want to be asking the file system to do block allocation when incoming data is coming in from Infiniband or iWARP. > I see that ext4_warning_inode() has rate limiting, but it doesn't look > like it's really intended for a per-page rate. It looks like it is > per-superblock (yes?), so I suspect one instance of this problem, with > many pages involved, could hit the limit. > > Often, WARN_ON_ONCE() is used with per-page assertions. That's not great > either, but it might be better here. OTOH, I have minimal experience > with ext4_warning_inode() and maybe it really is just fine with per-page > failure rates. With the syzbot reproducer, we're not actually triggering the rate limiter, since the ext4 warning is only getting hit a bit more than once every 4 seconds. And I'm guessing that in the real world, people aren't actually trying to do remote direct access to file-backed memory, at least not using ext4, since that's an invitation to a kernel crash, and we would have gotten user complaints. If some user actually tries to use process_vm_writev for realsies, as opposed to a random fuzzer or from a malicious program , we do want to warn them about the potential data loss, so I'd prefer to warn once for each inode --- but I'm not convinced that it's worth the effort. Cheers, - Ted
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Fri, Feb 18, 2022 at 04:24:20AM +, Matthew Wilcox wrote: > On Thu, Feb 17, 2022 at 09:54:30PM -0500, Theodore Ts'o wrote: > > process_vm_writev() uses [un]pin_user_pages_remote() which is the same > > interface uses for RDMA. But it's not clear this is ever supposed to > > work for memory which is mmap'ed region backed by a file. > > pin_user_pages_remote() appears to assume that it is an anonymous > > region, since the get_user_pages functions in mm/gup.c don't call > > read_page() to read data into any pages that might not be mmaped in. > > ... it doesn't end up calling handle_mm_fault() in faultin_page()? Ah yes, sorry, I missed that. This is what happens when a syzbot bug is thrown to a file system developer, who then has to wade theough mm code for which he is not understand - Ted
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote: > Yes. And looking at the pair of backtraces below, this looks very much > like another aspect of the "get_user_pages problem" [1], originally > described in Jan Kara's 2018 email [2]. Hmm... I just posted my analysis, which tracks with yours; but I had forgotten about Jan's 2018 e-mail on the matter. > I'm getting close to posting an RFC for the direct IO conversion to > FOLL_PIN, but even after that, various parts of the kernel (reclaim, > filesystems/block layer) still need to be changed so as to use > page_maybe_dma_pinned() to help avoid this problem. There's a bit > more than that, actually. The challenge is that fixing this "the right away" is probably not something we can backport into an LTS kernel, whether it's 5.15 or 5.10... or 4.19. The only thing which can probably survive getting backported is something like this. It won't make the right thing happen if someone is trying to RDMA or call process_vm_writev() into a file-backed memory region --- but I'm not sure I care. Certainly if the goal is to make Android kernels, I'm pretty sure they are't either using RDMA, and I suspect they are probably masking out the process_vm_writev(2) system call (at least, for Android O and newer). So if the goal is to just to close some Syzbot bugs, what do folks think about this? - Ted commit 7711b1fda6f7f04274fa1cba6f092410262b0296 Author: Theodore Ts'o Date: Thu Feb 17 22:54:03 2022 -0500 ext4: work around bugs in mm/gup.c that can cause ext4 to BUG() [un]pin_user_pages_remote is dirtying pages without properly warning the file system in advance (or faulting in the file data if the page is not yet in the page cache). This was noted by Jan Kara in 2018[1] and more recently has resulted in bug reports by Syzbot in various Android kernels[2]. Fixing this for real is non-trivial, and will never be backportable into stable kernels. So this is a simple workaround that stops the kernel from BUG()'ing. The changed pages will not be properly written back, but given that the current gup code is missing the "read" in "read-modify-write", the dirty page in the page cache might contain corrupted data anyway. [1] https://www.spinics.net/lists/linux-mm/msg142700.html [2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com Reported-by: syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com Reported-by: Lee Jones Signed-off-by: Theodore Ts'o diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 01c9e4f743ba..3b2f336a90d1 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page, else len = PAGE_SIZE; + /* Should never happen but for buggy gup code */ + if (!page_has_buffers(page)) { + ext4_warning_inode(inode, + "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + return 0; + } + page_bufs = page_buffers(page); /* * We cannot do block allocation or other extent handling in this @@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd) wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); + /* Should never happen but for buggy gup code */ + if (!page_has_buffers(page)) { + ext4_warning_inode(mpd->inode, "page %lu does not have buffers attached", page->index); + ClearPageDirty(page); + unlock_page(page); + continue; + } + if (mpd->map.m_len == 0) mpd->first_page = page->index; mpd->next_page = page->index + 1;
Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()
On Wed, Feb 16, 2022 at 04:31:36PM +, Lee Jones wrote: > > After recently receiving a bug report from Syzbot [0] which was raised > specifically against the Android v5.10 kernel, I spent some time > trying to get to the crux. Firstly I reproduced the issue on the > reported kernel, then did the same using the latest release kernel > v5.16. > > The full kernel trace can be seen below at [1]. > Lee, thanks for your work in trimming down the syzkaller reproducer. The moral equivalent of what it is doing (except each system call is done in a separate thread, with synchronization so each gets executed in order, but perhaps on a different CPU) is: int dest_fd, src_fd, truncate_fd, mmap_fd; pid_t tid; struct iovec local_iov, remote_iov; dest_fd = open("./bus", O_RDWR|O_CREAT|O_NONBLOCK|O_SYNC| O_DIRECT|O_LARGEFILE|O_NOATIME, 0); src_fd = openat(AT_FDCWD, "/bin/bash", O_RDONLY); truncate_fd = syscall(__NR_open, "./bus", O_RDWR|O_CREAT|O_SYNC|O_NOATIME|O_ASYNC); ftruncate(truncate_fd, 0x2008002ul); mmap((void *) 0x2000ul /* addr */, 0x60ul /* length */, PROT_WRITE|PROT_EXEC|PROT_SEM||0x70, MAP_FIXED|MAP_SHARED, mmap_fd, 0); sendfile(dest_fd, src_fd, 0 /* offset */, 0x8005ul /* count */); local_iov.iov_base = (void *) 0x2034afa4; local_iov.iov_len = 0x1f80; remote_iov.iov_base = (void *) 0x2080; remote_iov.iov_len = 0x2034afa5; process_vm_writev(gettid(), &local_iov, 1, &remote_iov, 1, 0); sendfile(dest_fd, src_fd, 0 /* offset */, 0x1f05ul /* count */); Which is then executed repeataedly over and over again. (With the file descriptors closed at each loop, so the file is reopened each time.) So basically, we have a scratch file which is initialized using an sendfile using O_DIRECT. The scratch file is also mmap'ed into the process's address space, and we then *modify* that an mmap'ed reagion using process_vm_writev() --- and this is where the problem starts. process_vm_writev() uses [un]pin_user_pages_remote() which is the same interface uses for RDMA. But it's not clear this is ever supposed to work for memory which is mmap'ed region backed by a file. pin_user_pages_remote() appears to assume that it is an anonymous region, since the get_user_pages functions in mm/gup.c don't call read_page() to read data into any pages that might not be mmaped in. They also don't follow the mm / file system protocol of calling the file system's write_begin() or page_mkwrite() before modifying a page, and so when when process_vm_writev() calls unpin_user_pages_remote(), this tries to dirty the page, but since page_mkwrite() or write_begin() hasn't been called, buffers haven't been attached to the page, and so that triggers the following ext4 WARN_ON: [ 1451.095939] WARNING: CPU: 1 PID: 449 at fs/ext4/inode.c:3565 ext4_set_page_dirty+0x48/0x50 ... [ 1451.139877] Call Trace: [ 1451.140833] [ 1451.141889] folio_mark_dirty+0x2f/0x60 [ 1451.143408] set_page_dirty_lock+0x3e/0x60 [ 1451.145130] unpin_user_pages_dirty_lock+0x108/0x130 [ 1451.147405] process_vm_rw_single_vec.constprop.0+0x1b9/0x260 [ 1451.150135] process_vm_rw_core.constprop.0+0x156/0x280 [ 1451.159576] process_vm_rw+0xc4/0x110 Then when ext4_writepages() gets called, we trigger the BUG because buffers haven't been attached to the page. We can certainly avoid the BUG by checking to see if buffers are attached first, and if not, skip writing the page trigger a WARN_ON, and then forcibly clear the page dirty flag so don't permanently leak memory and allow the file system to be unmounted. (Note: we can't fix the problem by attaching the buffers in set_page_dirty(), since is occasionally called under spinlocks and without the page being locked, so we can't do any kind of allocation, so ix-nay on calling create_empty_buffers() which calls alloc_buffer_head().) BUT, that is really papering over the problem, since it's not clear it's valid to try to use get_user_pages() and friends (including [un]pin_user_pages_remote() on file-backed memory. And if it is supposed to be valid, then mm/gup.c needs to first call readpage() if the page is not present, so that if process_vm_writev() is only modifying a few bytes in the mmap'ed region, we need to fault in the page first, and then mm/gup.c needs to inform the file system to make sure that if pinned memory region is not yet allocated, than whatever needs to happen to allocate space, via page_mkwrite() has taken place. (And by the way, that means that pin_user_pages_remote() may need to return ENOSPC if there is not free space in the file system, and hence ENOSPC may need to reflected all the way back to process_vm_writev(). Alternatively, if we don't expect process_vm_writev() to work on file-backed memory, perhaps it and pin_user_pages_remote() should return some kind of error?
Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
On Mon, Oct 25, 2021 at 08:24:26PM +0200, Andreas Gruenbacher wrote: > > For generic_perform_write() Dave Hansen attempted to move the fault-in > > after the uaccess in commit 998ef75ddb57 ("fs: do not prefault > > sys_write() user buffer pages"). This was reverted as it was exposing an > > ext4 bug. I don't [know] whether it was fixed but re-applying Dave's commit > > avoids the performance drop. > > Interesting. The revert of commit 998ef75ddb57 is in commit > 00a3d660cbac. Maybe Dave and Ted can tell us more about what went > wrong in ext4 and whether it's still an issue. The context for the revert can be found here[1]. [1] https://lore.kernel.org/lkml/20151005152236.ga8...@thunk.org/ And "what went wrong in ext4" was fixed here[2]. [2] https://lore.kernel.org/lkml/20151005152236.ga8...@thunk.org/ which landed upstream as commit b90197b65518 ("ext4: use private version of page_zero_new_buffers() for data=journal mode"). So it looks like the original issue which triggered the revert in 2015 should be addressed, and we can easily test it by using generic/208 with data=journal mode. There also seems to be a related discussion about whether we should unrevert 998ef75ddb57 here[3]. Hmm. there is a mention on that thread in [3], "Side note: search for "iov_iter_fault_in_writeable()" on lkml for a gfs2 patch-series that is buggy, exactly because it does *not* use the atomic user space accesses, and just tries to do the fault-in to hide the real bug." I assume that's related to the discussion on this thread? [3] https://lore.kernel.org/all/3221175.1624375...@warthog.procyon.org.uk/T/#u - Ted
Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote: > If this ever turn out to be a problem and with the vmapped stacks we > have good chances to get a proper stack traces on a potential overflow > we can add the scope API around the problematic code path with the > explanation why it is needed. Yeah, or maybe we can automate it? Can the reclaim code check how much stack space is left and do the right thing automatically? The reason why I'm nervous is that nojournal mode is not a common configuration, and "wait until production systems start failing" is not a strategy that I or many SRE-types find comforting. So if we can assure ourselves that the right thing will happen automatically, or that lockdep will detect a required GFP_NOFS when running tests, the happier I'll be. - Ted
Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote: > > > I'm convinced the current series is OK, only real life will tell us > > > whether > > > we missed something or not ;) > > > > I would like to extend the changelog of "jbd2: mark the transaction > > context with the scope GFP_NOFS context". > > > > " > > Please note that setups without journal do not suffer from potential > > recursion problems and so they do not need the scope protection because > > neither ->releasepage nor ->evict_inode (which are the only fs entry > > points from the direct reclaim) can reenter a locked context which is > > doing the allocation currently. > > " > > Could you comment on this Ted, please? I guess so there still is one way this could screw us, and it's this reason for GFP_NOFS: - to prevent from stack overflows during the reclaim because the allocation is performed from a deep context already The writepages call stack can be pretty deep. (Especially if we're using ext4 in no journal mode over, say, iSCSI.) How much stack space can get consumed by a reclaim? - Ted
Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote: > > OK, so I've been staring into the code and AFAIU current->journal_info > can contain my stored information. I could either hijack part of the > word as the ref counting is only consuming low 12b. But that looks too > ugly to live. Or I can allocate some placeholder. Yeah, I was looking at something similar. Can you guarantee that the context will only take one or two bits? (Looks like it only needs one bit ATM, even though at the moment you're storing the whole GFP mask, correct?) > But before going to play with that I am really wondering whether we need > all this with no journal at all. AFAIU what Jack told me it is the > journal lock(s) which is the biggest problem from the reclaim recursion > point of view. What would cause a deadlock in no journal mode? We still have the original problem for why we need GFP_NOFS even in ext2. If we are in a writeback path, and we need to allocate memory, we don't want to recurse back into the file system's writeback path. Certainly not for the same inode, and while we could make it work if the mm was writing back another inode, or another superblock, there are also stack depth considerations that would make this be a bad idea. So we do need to be able to assert GFP_NOFS even in no journal mode, and for any file system including ext2, for that matter. Because of the fact that we're going to have to play games with current->journal_info, maybe this is something that I should take responsibility for, and to go through the the ext4 tree after the main patch series go through? Maybe you could use xfs and ext2 as sample (simple) implementations? My only ask is that the memalloc nofs context be a well defined N bits, where N < 16, and I'll find some place to put them (probably journal_info). Thanks, - Ted
Re: [Cluster-devel] [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote: > From: Michal Hocko > > This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because > sb_getblk_gfp is not really needed as > sb_getblk > __getblk_gfp > __getblk_slow > grow_buffers > grow_dev_page > gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp > > so __GFP_FS is cleared unconditionally and therefore the above commit > didn't have any real effect in fact. > > This patch should not introduce any functional change. The main point > of this change is to reduce explicit GFP_NOFS usage inside ext4 code to > make the review of the remaining usage easier. > > Signed-off-by: Michal Hocko > Reviewed-by: Jan Kara If I'm not mistaken, this patch is not dependent on any of the other patches in this series (and the other patches are not dependent on this one). Hence, I could take this patch via the ext4 tree, correct? - Ted
Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote: > From: Michal Hocko > > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that > the transaction context uses memalloc_nofs_save and all allocations > within the this context inherit GFP_NOFS automatically, there is no > reason to mark specific allocations explicitly. > > This patch should not introduce any functional change. The main point > of this change is to reduce explicit GFP_NOFS usage inside ext4 code > to make the review of the remaining usage easier. > > Signed-off-by: Michal Hocko > Reviewed-by: Jan Kara Changes in the jbd2 layer aren't going to guarantee that memalloc_nofs_save() will be executed if we are running ext4 without a journal (aka in no journal mode). And this is a *very* common configuration; it's how ext4 is used inside Google in our production servers. So that means the earlier patches will probably need to be changed so the nOFS scope is done in the ext4_journal_{start,stop} functions in fs/ext4/ext4_jbd2.c. - Ted
Re: [Cluster-devel] [PATCH 08/12] ext4: Convert to private i_dquot field
On Tue, Nov 04, 2014 at 12:19:49PM +0100, Jan Kara wrote: > CC: linux-e...@vger.kernel.org > CC: "Theodore Ts'o" > Signed-off-by: Jan Kara Acked-by: Theodore Ts'o - Ted
Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
On Thu, Mar 13, 2014 at 07:33:02PM -0500, Steve French wrote: > On Thu, Mar 13, 2014 at 9:20 AM, Theodore Ts'o wrote: > > Previously, the no-op "mount -o mount /dev/xxx" operation when the > > file system is already mounted read-write causes an implied, > > unconditional syncfs(). This seems pretty stupid, and it's certainly > > documented or guaraunteed to do this, nor is it particularly useful, > > except in the case where the file system was mounted rw and is getting > > remounted read-only. > > Is there a case where a file system, not mounted read-only, > would want to skip the syncfs on remount? I don't know > of any particular reason to do a syncfs on remount unless > caching behavior is changing (or moving to read-only mount), > but if as you say it is documented and guaranteed... If the file system is mounted read-write, and it is transitioning to read-only, i.e. "mount -o remount,ro /" then you do want to write out all dirty data before you transition it to be read-only (otherwise you would lose data). It is my belief that this is the _only_ data integrity issue which is implied by remount (and this is more about not losing work done by previous system calls). The background reason for this commit is that a user complained on the ext4 list that he is using containers in a virtualization environment, and due to the init scripts which the user doesn't want to change, it is causing gazillions of no-op remounts, and this is causing ext4 (and xfs) to do send CACHE FLUSH commands because it is required by the syncfs(2) system call, which also calls sync_filesystem. These CACHE FLUSH commands are unneeded for anything, especially for these no-op remounts, and so I want them to go away for remounts, but they should still be there in response to syncfs(2) requests. Cheers, - Ted
Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
On Thu, Mar 13, 2014 at 04:28:23PM +, Steven Whitehouse wrote: > > I guess the same is true for other file systems which are mounted ro > too. So maybe a check for MS_RDONLY before doing the sync in those > cases? My original patch moved the sync_filesystem into the check for MS_RDONLY in the core VFS code. The objection was raised that there might be some file system out there that might depend on this behaviour. I can't imagine why, but I suppose it's at least theoretically possible. So the idea is that this particular patch is *guaranteed* not to make any difference. That way there can be no question about the patch'es correctness. I'm going to follow up with a patch for ext4 that does exactly that, but the idea is to allow each file system maintainer to do that for their own file system. I could do that as well for file systems that are "obviously" read-only, but then I'll find out that there's some wierd case where the file system can be used in a read-write fashion. (Example: UDF is normally used for DVD's, but at least in theory it can be used read/write --- I'm told that Windows supports read-write UDF file systems on USB sticks, and at least in theory it could be used as a inter-OS exchange format in situations where VFAT and exFAT might not be appropriate for various reasons.) Cheers, - Ted
[Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" Cc: linux-fsde...@vger.kernel.org Cc: Christoph Hellwig Cc: Artem Bityutskiy Cc: Adrian Hunter Cc: Evgeniy Dushistov Cc: Jan Kara Cc: OGAWA Hirofumi Cc: Anders Larsen Cc: Phillip Lougher Cc: Kees Cook Cc: Mikulas Patocka Cc: Petr Vandrovec Cc: x...@oss.sgi.com Cc: linux-bt...@vger.kernel.org Cc: linux-c...@vger.kernel.org Cc: samba-techni...@lists.samba.org Cc: codal...@coda.cs.cmu.edu Cc: linux-e...@vger.kernel.org Cc: linux-f2fs-de...@lists.sourceforge.net Cc: fuse-de...@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-...@lists.infradead.org Cc: jfs-discuss...@lists.sourceforge.net Cc: linux-...@vger.kernel.org Cc: linux-ni...@vger.kernel.org Cc: linux-ntfs-...@lists.sourceforge.net Cc: ocfs2-de...@oss.oracle.com Cc: reiserfs-de...@vger.kernel.org --- fs/adfs/super.c | 1 + fs/affs/super.c | 1 + fs/befs/linuxvfs.c | 1 + fs/btrfs/super.c | 1 + fs/cifs/cifsfs.c | 1 + fs/coda/inode.c | 1 + fs/cramfs/inode.c| 1 + fs/debugfs/inode.c | 1 + fs/devpts/inode.c| 1 + fs/efs/super.c | 1 + fs/ext2/super.c | 1 + fs/ext3/super.c | 2 ++ fs/ext4/super.c | 2 ++ fs/f2fs/super.c | 2 ++ fs/fat/inode.c | 2 ++ fs/freevxfs/vxfs_super.c | 1 + fs/fuse/inode.c | 1 + fs/gfs2/super.c | 2 ++ fs/hfs/super.c | 1 + fs/hfsplus/super.c | 1 + fs/hpfs/super.c | 2 ++ fs/isofs/inode.c | 1 + fs/jffs2/super.c | 1 + fs/jfs/super.c | 1 + fs/minix/inode.c | 1 + fs/ncpfs/inode.c | 1 + fs/nfs/super.c | 2 ++ fs/nilfs2/super.c| 1 + fs/ntfs/super.c | 2 ++ fs/ocfs2/super.c | 2 ++ fs/openpromfs/inode.c| 1 + fs/proc/root.c | 2 ++ fs/pstore/inode.c| 1 + fs/qnx4/inode.c | 1 + fs/qnx6/inode.c | 1 + fs/reiserfs/super.c | 1 + fs/romfs/super.c | 1 + fs/squashfs/super.c | 1 + fs/super.c | 2 -- fs/sysv/inode.c | 1 + fs/ubifs/super.c | 1 + fs/udf/super.c | 1 + fs/ufs/super.c | 1 + fs/xfs/xfs_super.c | 1 + 44 files changed, 53 insertions(+), 2 deletions(-) diff --git a/fs/adfs/super.c b/fs/adfs/super.c index 7b3003c..952aeb0 100644 --- a/fs/adfs/super.c +++ b/fs/adfs/super.c @@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char *options) static int adfs_remount(struct super_block *sb, int *flags, char *data) { + sync_filesystem(sb); *flags |= MS_NODIRATIME; return parse_options(sb, data); } diff --git a/fs/affs/super.c b/fs/affs/super.c index d098731..3074530 100644 --- a/fs/affs/super.c +++ b/fs/affs/super.c @@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char *data) pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data); + sync_filesystem(sb); *flags |= MS_NODIRATIME; memcpy(volume, sbi->s_volume, 32); diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c index 845d2d6..56d70c8 100644 --- a/fs/befs/linuxvfs.c +++ b/fs/befs/linuxvfs.c @@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int silent) static int befs_remount(struct super_block *sb, int *flags, char *data) { + sync_filesystem(sb); if (!(*flags & MS_RDONLY)) return -EINVAL; return 0; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 97cc241..00cd0c5 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1381,6 +1381,7 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data) unsigned int old_metadata_ratio = fs_info->metadata_ratio; int ret; + sync_filesystem(sb); btrfs_remount_prepare(fs_info); ret = btrfs_parse_options(root, data); diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c index 849f613..4942c94 100644 --- a/fs/cifs/cifsfs.c +++ b/fs/cifs/cifsfs.c @@ -541,6 +541,7 @@ static int cifs_show_stats(struct seq_file *s, struct dentry *root) static int cifs_remount(struct super_block *sb, int *flags, char *data) { + sync_file
Re: [Cluster-devel] [trivial] treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
On Wed, Jun 19, 2013 at 12:15:53PM -0700, Joe Perches wrote: > Don't emit OOM warnings when k.alloc calls fail when > there there is a v.alloc immediately afterwards. > > Signed-off-by: Joe Perches For fs/ext4/super.c: Acked-by: "Theodore Ts'o" - Ted
Re: [Cluster-devel] [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument
On Tue, Apr 09, 2013 at 11:14:17AM +0200, Lukas Czerner wrote: > ->invalidatepage() aop now accepts range to invalidate so we can make > use of it in gfs2_invalidatepage(). > > Signed-off-by: Lukas Czerner > Cc: cluster-devel@redhat.com To the gfs2 development team, Since half of this patch series modifies ext4 extensively, and changes to the other file systems are relatively small, I plan to carry the invalidatepage patch set in the ext4 tree for the next development cycle (i.e., not the upcoming merge window, but the next one). To that end, it would be great if you take a look at this patch set and send us an Acked-by signoff. Thanks!! - Ted
Re: [Cluster-devel] [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument
On Tue, Apr 23, 2013 at 10:16:31AM -0400, Theodore Ts'o wrote: > On Tue, Apr 09, 2013 at 11:14:17AM +0200, Lukas Czerner wrote: > > ->invalidatepage() aop now accepts range to invalidate so we can make > > use of it in gfs2_invalidatepage(). > > > > Signed-off-by: Lukas Czerner > > Cc: cluster-devel@redhat.com > > To the gfs2 development team, Whoops, I missed Steven Whitehouse's Acked-by. Sorry for the noise, - Ted