from:"\"Theodore Ts'o\""

Re: [Cluster-devel] [syzbot] [gfs2?] KASAN: use-after-free Read in qd_unlock (2)

2023-07-26 Thread Theodore Ts&#x27;o

On Wed, Jul 26, 2023 at 06:45:55PM +0300, Dmitry Baryshkov wrote:
> > > bisection log:  
> > > https://syzkaller.appspot.com/x/bisect.txt?x=17b48111a8
  ...
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=3f6a670108ce43356017

> I highly suspect that the bisect was wrong here. The only thing that
> was changed by the mentioned commit is the device tree for the pretty
> obscure platform, which is not 'Google Compute Engine'.

Yeah, it's not even close.  If you take a look at the bisection log
(which is *always* a good idea before you put any faith in the syzbot
bisection), you'd see the following:

testing commit e1c04510f521e853019afeca2a5991a5ef8d6a5b gcc
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
kernel signature: 
f262f513a4ba5708b69a5fdd8c218746223996a8b2134a22f2916d16f23d01e8
run #0: crashed: unregister_netdevice: waiting for DEV to become free
run #1: crashed: unregister_netdevice: waiting for DEV to become free
run #2: crashed: unregister_netdevice: waiting for DEV to become free
run #3: crashed: unregister_netdevice: waiting for DEV to become free
run #4: crashed: unregister_netdevice: waiting for DEV to become free
run #5: crashed: unregister_netdevice: waiting for DEV to become free
run #6: crashed: unregister_netdevice: waiting for DEV to become free
run #7: crashed: unregister_netdevice: waiting for DEV to become free
run #8: crashed: unregister_netdevice: waiting for DEV to become free

This is *nothing* like the problem reported on the dashboard, which is:

BUG: KASAN: use-after-free in instrument_atomic_read 
include/linux/instrumented.h:72 [inline]
BUG: KASAN: use-after-free in _test_bit 
include/asm-generic/bitops/instrumented-non-atomic.h:141 [inline]
BUG: KASAN: use-after-free in qd_unlock+0x30/0x2d0 fs/gfs2/quota.c:490
Read of size 8 at addr 888073997090 by task syz-executor221/5069

where the dereference had a stack trace which looked like this:

 _test_bit include/asm-generic/bitops/instrumented-non-atomic.h:141 [inline]
 qd_unlock+0x30/0x2d0 fs/gfs2/quota.c:490
 gfs2_quota_sync+0x768/0x8b0 fs/gfs2/quota.c:1325
 gfs2_sync_fs+0x49/0xb0 fs/gfs2/super.c:650
 sync_filesystem+0xe8/0x220 fs/sync.c:56
 generic_shutdown_super+0x6b/0x310 fs/super.c:474
 kill_block_super+0x79/0xd0 fs/super.c:1386
 deactivate_locked_super+0xa7/0xf0 fs/super.c:332
 cleanup_mnt+0x494/0x520 fs/namespace.c:1291
 task_work_run+0x243/0x300 kernel/task_work.c:179
 exit_task_work include/linux/task_work.h:38 [inline]
 do_exit+0x644/0x2150 kernel/exit.c:867

and the memory was allocated via this stack trace:

 kmem_cache_alloc+0x1b3/0x350 mm/slub.c:3476
 kmem_cache_zalloc include/linux/slab.h:710 [inline]
 qd_alloc+0x51/0x250 fs/gfs2/quota.c:216
 gfs2_quota_init+0x7c4/0x10e0 fs/gfs2/quota.c:1415
 gfs2_make_fs_rw+0x48e/0x590 fs/gfs2/super.c:153
 gfs2_fill_super+0x2357/0x2700 fs/gfs2/ops_fstype.c:1274
 get_tree_bdev+0x400/0x620 fs/super.c:1282
 gfs2_get_tree+0x50/0x210 fs/gfs2/ops_fstype.c:1330
 vfs_get_tree+0x88/0x270 fs/super.c:1489
 do_new_mount+0x289/0xad0 fs/namespace.c:3145
 do_mount fs/namespace.c:3488 [inline]
 __do_sys_mount fs/namespace.c:3697 [inline]
 __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80

(And the memory was freed from an RCU path)

- Ted

Re: [Cluster-devel] [PATCH 01/12] backing_dev: remove current->backing_dev_info

2023-05-31 Thread Theodore Ts&#x27;o

On Wed, May 31, 2023 at 09:50:15AM +0200, Christoph Hellwig wrote:
> The last user of current->backing_dev_info disappeared in commit
> b9b1335e6403 ("remove bdi_congested() and wb_congested() and related
> functions").  Remove the field and all assignments to it.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Hannes Reinecke 
> Reviewed-by: Darrick J. Wong 

Acked-by: Theodore Ts'o

Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write

2023-05-31 Thread Theodore Ts&#x27;o

On Wed, May 31, 2023 at 09:50:17AM +0200, Christoph Hellwig wrote:
> All callers of generic_perform_write need to updated ki_pos, move it into
> common code.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Xiubo Li 
> Reviewed-by: Damien Le Moal 
> Reviewed-by: Hannes Reinecke 
> Acked-by: Darrick J. Wong 

Acked-by: Theodore Ts'o

Re: [Cluster-devel] [GIT PULL] gfs2 writepage fix

2023-01-23 Thread Theodore Ts&#x27;o

On Mon, Jan 23, 2023 at 11:05:56AM +0100, Jan Kara wrote:
> Thanks for the heads up. So we had kind of a similar issue for ext4 but it
> got caught by Ted during his regression runs so we've basically done what
> GFS2 is doing for the merge window and now there's patchset pending to
> convert the data=journal path as well because as Andreas states in his
> changelog of the revert that's a bit more tricky. But at least for ext4
> the conversion of data=journal path resulted in much cleaner and shorter
> code.

https://thunk.org/tytso/images/firestarter-fs-development-without-testing.jpg

:-)

- Ted

Re: [Cluster-devel] [PATCH v2 06/14] jbd2: replace ll_rw_block()

2022-09-05 Thread Theodore Ts&#x27;o

On Thu, Sep 01, 2022 at 09:34:57PM +0800, Zhang Yi wrote:
> ll_rw_block() is not safe for the sync read path because it cannot
> guarantee that submitting read IO if the buffer has been locked. We
> could get false positive EIO after wait_on_buffer() if the buffer has
> been locked by others. So stop using ll_rw_block() in
> journal_get_superblock(). We also switch to new bh_readahead_batch()
> for the buffer array readahead path.
> 
> Signed-off-by: Zhang Yi 

Thanks, looks good.

Reviewed-by: Theodore Ts'o 


- Ted

Re: [Cluster-devel] [PATCH -v5] ext4: don't BUG if someone dirty pages without asking ext4 first

2022-03-03 Thread Theodore Ts&#x27;o

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz
[2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com

Reported-by: 
syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com
Reported-by: Lee Jones 
Signed-off-by: Theodore Ts'o 
---

 -v5 Argh, sent the wrong version of this patch for V4.  (Which, if you tried
 testing it, would cause generic/013 on ext4/4k to die horribly. :-)  This
 is the correct final version.

 fs/ext4/inode.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..531a94f48637 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;
 
+   /* Should never happen but for bugs in other kernel subsystems */
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(inode,
+  "page %lu does not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   return 0;
+   }
+
page_bufs = page_buffers(page);
/*
 * We cannot do block allocation or other extent handling in this
@@ -2594,6 +2603,22 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
 
+   /*
+* Should never happen but for buggy code in
+* other subsystems that call
+* set_page_dirty() without properly warning
+* the file system first.  See [1] for more
+* information.
+*
+* [1] 
https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz
+*/
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(mpd->inode, "page %lu does 
not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   continue;
+   }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;
-- 
2.31.0

Re: [Cluster-devel] [PATCH -v4] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-03-02 Thread Theodore Ts&#x27;o

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  A related race was noted by Jan Kara in
2018[1]; however, more recently instead of it being a very hard-to-hit
race, it could be reliably triggered by process_vm_writev(2) which was
discovered by Syzbot[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that if some other kernel subsystem dirty pages without properly
notifying the file system using page_mkwrite(), ext4 will BUG, while
other file systems will not BUG (although data will still be lost).

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz
[2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com

Reported-by: 
syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com
Reported-by: Lee Jones 
Signed-off-by: Theodore Ts'o 
Cc: sta...@kernel.org
---
v4 - only changes to the commit description to eliminate some inaccuracies
 and clarify the text.

 fs/ext4/inode.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..008fe8750109 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;
 
+   /* Should never happen but for bugs in other kernel subsystems */
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(inode,
+  "page %lu does not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   return 0;
+   }
+
page_bufs = page_buffers(page);
/*
 * We cannot do block allocation or other extent handling in this
@@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
 (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
unlikely(page->mapping != mapping)) {
unlock_page(page);
-   continue;
+   goto out;
}
 
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
 
+   /*
+* Should never happen but for buggy code in
+* other subsystems that call
+* set_page_dirty() without properly warning
+* the file system first.  See [1] for more
+* information.
+*
+* [1] 
https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz
+*/
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(mpd->inode, "page %lu does 
not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   continue;
+   }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;
-- 
2.31.0

Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Theodore Ts&#x27;o

On Fri, Feb 25, 2022 at 08:40:36PM -0500, Theodore Ts'o wrote:
> Well, that makes it process_vm_writev()'s is that it needs to know
> when to call pin_user_file_pages().

Sorry, typed too fast.  What I was trying to say is this make it
process_vm_writev()'s problem to figure out when it should call
pin_user_file_pages() versus some other pin_user_pages function.

> I suspect that for many use cases
> --- for example, if this is being used by a debugger to modify a
> variable on a stack, or an anonymous page in the program's data
> segment, process_vm_writev() *isn't* actually pinning a file.  So they
> want some kind of interface that automatically DTRT regardless of
> whether the user pages being edited are file-backed or not
> file-backed.

- Ted

Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Theodore Ts&#x27;o

On Fri, Feb 25, 2022 at 04:41:14PM -0800, John Hubbard wrote:
> 
> > f2fs and btrfs's compressed file write support, by making things work
> > much like the write(2) system call.  Imagine if we had a
> > "pin_user_pages_local()" which calls write_begin(), and a
> > "unpin_user_pages_local()" which calls write_end(), and the
> 
> Right, that would supply the missing connection to the filesystems.
> 
> In fact, maybe these names about right:
> 
> pin_user_file_pages()
> unpin_user_file_pages()
> 
> ...and then put them in a filesystem header file, because these are now
> tightly coupled to filesystems, what with the need to call
> .write_begin() and .write_end().

Well, that makes it process_vm_writev()'s is that it needs to know
when to call pin_user_file_pages().  I suspect that for many use cases
--- for example, if this is being used by a debugger to modify a
variable on a stack, or an anonymous page in the program's data
segment, process_vm_writev() *isn't* actually pinning a file.  So they
want some kind of interface that automatically DTRT regardless of
whether the user pages being edited are file-backed or not
file-backed.

So some kind of [un]pin_user_pages_local() which will call
write_{begin,end}() if necessary would be the most convenient for
users such as process_vm_writev().   

And perhaps would it make sense for pin_user_pages to optionally (or
by default?) check for file-backed pages, and if it finds any, return
an error or stop pinning pages at that point, so the system call can
return EOPNOSUPP to the user, instead of silently causing user data to
be lost or corrupted as is currently the case with xfs and btrfs (and
ext4 once I patch it so it doesn't BUG).

I'll note that at least one caller of pin_user_pages, in fs/io_uring.c
takes it upon itself to check for file-backed pages, and returns
EOPNOTSUPP if there are any found.  Many that should be lifted to
pin_user_pages()?

For that matter, maybe pin_user_pages() and friends should take some
new FOLL_ flags to indicate whether file-backed pages should be
rejected, or perhaps they can promise they will only be holding the
pin for a very short amount of time (FOLL_SHORTERM?), and then
pin_user_pages() and unpin_user_pages() can automagically call
write_begin() and write_end() if necessary?  I dunno

   - Ted

Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Theodore Ts&#x27;o

On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> On 2/25/22 13:23, Theodore Ts'o wrote:
> > [un]pin_user_pages_remote is dirtying pages without properly warning
> > the file system in advance.  This was noted by Jan Kara in 2018[1] and
> 
> In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> was actually that dio_bio_complete() was calling set_page_dirty_lock()
> on pages that were not (any longer) set up for that.

Fair enough, there are two problems that are getting conflated here,
and that's my bad.  The problem which Jan pointed out is one where the
Direct I/O read path triggered a page fault, so page_mkwrite() was
actually called.  So in this case, the file system was actually
notified, and the page was marked dirty after the file system was
notified.  But then the DIO read was racing with the page cleaner,
which would call writepage(), and then clear the page, and then remove
the buffer_heads.  Then dio_bio_complete() would call set_page_dirty()
a second time, and that's what would trigger the BUG.

But in the syzbot reproducer, it's a different problem.  In this case,
process_vm_writev() calling [un]pin_user_pages_remote(), and
page_mkwrite() is never getting called.  So there is no need to race
with the page cleaner, and so the BUG triggers much more reliably.

> > more recently has resulted in bug reports by Syzbot in various Android
> > kernels[2].
> > 
> > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
> 
> Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> call into mm/gup.c, but that merely refactored things. The callers are
> all over the kernel, and those callers are what need changing in order
> to fix this.

>From my perspective, the bug is calling set_page_dirty() without first
calling the file system's page_mkwrite().  This is necessary since the
file system needs to allocate file system data blocks in preparation
for a future writeback.

Now, calling page_mkwrite() by itself is not enough, since the moment
you make the page dirty, the page cleaner could go ahead and call
writepage() behind your back and clean it.  In actual practice, with a
Direct I/O read request racing with writeback, this is race was quite
hard to hit, because the that would imply that the background
writepage() call would have to complete ahead of the synchronous read
request, and the block layer generally prioritizes synchronous reads
ahead of background write requests.  So in practice, this race was
***very*** hard to hit.  Jan may have reported it in 2018, but I don't
think I've ever seen it happen myself.

For process_vm_writev() this is a case where user pages are pinned and
then released in short order, so I suspect that race with the page
cleaner would also be very hard to hit.  But we could completely
remove the potential for the race, and also make things kinder for
f2fs and btrfs's compressed file write support, by making things work
much like the write(2) system call.  Imagine if we had a
"pin_user_pages_local()" which calls write_begin(), and a
"unpin_user_pages_local()" which calls write_end(), and the
presumption with the "[un]pin_user_pages_local" API is that you don't
hold the pinned pages for very long --- say, not across a system call
boundary, and then it would work the same way the write(2) system call
works does except that in the case of process_vm_writev(2) the pages
are identified by another process's address space where they happen to
be mapped.

This obviously doesn't work when pinning pages for remote DMA, because
in that case the time between pin_user_pages_remote() and
unpin_user_pages_remote() could be a long, long time, so that means we
can't use using write_begin/write_end; we'd need to call page_mkwrite()
when the pages are first pinned and then somehow prevent the page
cleaner from touching a dirty page which is pinned for use by the
remote DMA.

Does that make sense?

- Ted

[Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Theodore Ts&#x27;o

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance.  This was noted by Jan Kara in 2018[1] and
more recently has resulted in bug reports by Syzbot in various Android
kernels[2].

This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
that a buggy kernel subsystem which dirty pages without properly
notifying the file system causes ext4 to BUG, while other file systems
are not (although user data likely will be lost).  I suspect in real
life it is rare that people are using RDMA into file-backed memory,
which is why no one has complained to ext4 developers except fuzzing
programs.

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz
[2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com

Reported-by: 
syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com
Reported-by: Lee Jones 
Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..008fe8750109 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;
 
+   /* Should never happen but for bugs in other kernel subsystems */
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(inode,
+  "page %lu does not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   return 0;
+   }
+
page_bufs = page_buffers(page);
/*
 * We cannot do block allocation or other extent handling in this
@@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
 (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
unlikely(page->mapping != mapping)) {
unlock_page(page);
-   continue;
+   goto out;
}
 
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
 
+   /*
+* Should never happen but for buggy code in
+* other subsystems that call
+* set_page_dirty() without properly warning
+* the file system first.  See [1] for more
+* information.
+*
+* [1] 
https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz
+*/
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(mpd->inode, "page %lu does 
not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   continue;
+   }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;
-- 
2.31.0

Re: [Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Theodore Ts&#x27;o

On Fri, Feb 25, 2022 at 12:51:28PM -0800, Eric Biggers wrote:
> > 
> > [1] https://www.spinics.net/lists/linux-mm/msg142700.html
> 
> Can you use a lore link
> (https://lore.kernel.org/linux-mm/20180103100430.ge4...@quack2.suse.cz/T/#u)?

Sure, thanks for finding the lore link!  I did try searching for it
last night, but for some reason I wasn't able to surface it.

> > +   /*
> > +* Should never happen but for buggy code in
> > +* other subsystemsa that call
> 
> subsystemsa => subsystems

Already fixed in my local version of the patch.

> > +* [1] 
> > https://www.spinics.net/lists/linux-mm/msg142700.html
> 
> Likewise, lore link here.

Ack.

- Ted

[Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

2022-02-25 Thread Theodore Ts&#x27;o

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance (or faulting in the file data if the page
is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
and more recently has resulted in bug reports by Syzbot in various
Android kernels[2].

This is technically a bug in the mm/gup.c codepath, but arguably ext4
is fragile in that a buggy get_user_pages() implementation causes ext4
to crash, where as other file systems are not crashing (although in
some cases the user data will be lost since gup code is not properly
informing the file system to potentially allocate blocks or reserve
space when writing into a sparse portion of file).  I suspect in real
life it is rare that people are using RDMA into file-backed memory,
which is why no one has complained to ext4 developers except fuzzing
programs.

So instead of crashing with a BUG, issue a warning (since there may be
potential data loss) and just mark the page as clean to avoid
unprivileged denial of service attacks until the problem can be
properly fixed.  More discussion and background can be found in the
thread starting at [2].

[1] https://www.spinics.net/lists/linux-mm/msg142700.html
[2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com

Reported-by: 
syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com
Reported-by: Lee Jones 
Signed-off-by: Theodore Ts'o 
---
 fs/ext4/inode.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..f8fefbf67306 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;
 
+   /* Should never happen but for buggy gup code */
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(inode,
+  "page %lu does not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   return 0;
+   }
+
page_bufs = page_buffers(page);
/*
 * We cannot do block allocation or other extent handling in this
@@ -2588,12 +2597,28 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
 (mpd->wbc->sync_mode == WB_SYNC_NONE)) ||
unlikely(page->mapping != mapping)) {
unlock_page(page);
-   continue;
+   goto out;
}
 
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
 
+   /*
+* Should never happen but for buggy code in
+* other subsystemsa that call
+* set_page_dirty() without properly warning
+* the file system first.  See [1] for more
+* information.
+*
+* [1] 
https://www.spinics.net/lists/linux-mm/msg142700.html
+*/
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(mpd->inode, "page %lu does 
not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   continue;
+   }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;
-- 
2.31.0

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-23 Thread Theodore Ts&#x27;o

On Wed, Feb 23, 2022 at 04:44:07PM -0800, John Hubbard wrote:
> 
> Actually...I can confirm that real customers really are doing *exactly* 
> that! Despite the kernel crashes--because the crashes don't always 
> happen unless you have a large (supercomputer-sized) installation. And 
> even then it is not always root-caused properly.

Interesting.  The syzbot reproducer triggers *reliably* on ext4 using
a 2 CPU qemu kernel running on a laptop, and it doesn't require root,
so it's reasonable that Lee is pushing for a fix --- even if for the
Android O or newer, Seccomp can probably prohibit trap
process_vm_writev(2), but it seems unfortunate if say, someone running
a Docker container could take down the entire host OS.

  - Ted

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-23 Thread Theodore Ts&#x27;o

On Thu, Feb 24, 2022 at 12:48:42PM +1100, Dave Chinner wrote:
> > Fair enough; on the other hand, we could also view this as making ext4
> > more robust against buggy code in other subsystems, and while other
> > file systems may be losing user data if they are actually trying to do
> > remote memory access to file-backed memory, apparently other file
> > systems aren't noticing and so they're not crashing.
> 
> Oh, we've noticed them, no question about that.  We've got bug
> reports going back years for systems being crashed, triggering BUGs
> and/or corrupting data on both XFS and ext4 filesystems due to users
> trying to run RDMA applications with file backed pages.

Is this issue causing XFS to crash?  I didn't know that.

I tried the Syzbot reproducer with XFS mounted, and it didn't trigger
any crashes.  I'm sure data was getting corrupted, but I figured I
should bring ext4 to the XFS level of "at least we're not reliably
killing the kernel".

On ext4, an unprivileged process can use process_vm_writev(2) to crash
the system.  I don't know how quickly we can get a fix into mm/gup.c,
but if some other kernel path tries calling set_page_dirty() on a
file-backed page without first asking permission from the file system,
it seems to be nice if the file system doesn't BUG() --- as near as I
can tell, xfs isn't crashing in this case, but ext4 is.

- Ted

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-23 Thread Theodore Ts&#x27;o

On Fri, Feb 18, 2022 at 08:51:54AM +0100, Greg Kroah-Hartman wrote:
> > The challenge is that fixing this "the right away" is probably not
> > something we can backport into an LTS kernel, whether it's 5.15 or
> > 5.10... or 4.19.
> 
> Don't worry about stable backports to start with.  Do it the "right way"
> first and then we can consider if it needs to be backported or not.

Fair enough; on the other hand, we could also view this as making ext4
more robust against buggy code in other subsystems, and while other
file systems may be losing user data if they are actually trying to do
remote memory access to file-backed memory, apparently other file
systems aren't noticing and so they're not crashing.  Issuing a
warning and then not crashing is arguably a better way for ext4 to
react, especially if there are other parts of the kernel that are
randomly calling set_page_dirty() on file-backed memory without
properly first informing the file system in a context where it can
block and potentially do I/O to do things like allocate blocks.

 - Ted

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-23 Thread Theodore Ts&#x27;o

On Thu, Feb 17, 2022 at 10:33:34PM -0800, John Hubbard wrote:
> 
> Just a small thing I'll say once, to get it out of my system. No action
> required here, I just want it understood:
> 
> Before commit 803e4572d7c5 ("mm/process_vm_access: set FOLL_PIN via
> pin_user_pages_remote()"), you would have written that like this:
> 
> "process_vm_writev() is dirtying pages without properly warning the file
> system in advance..."
> 
> Because, there were many callers that were doing this:
> 
> get_user_pages*()
> ...use the pages...
> 
> for_each_page() {
> set_page_dirty*()
> put_page()
> }

Sure, but that's not sufficient when modifying file-backed pages.
Previously, there was only two ways of modifying file-backed pages in
the page cache --- either using the write(2) system call, or when a
mmaped page is modified by the userspace.

In the case of write(2), the file system gets notified before the page
cache is modified by a call to the address operation's write_begin()
call, and after the page cache is modified, the address operation's
write_end() call lets the file system know that the modification is
done.  After the write is done, the 30 second writeback timer is
triggered, and in the case of ext4's data=journalling mode, we close
the ext4 micro-transation (and therefore the time between write_begin
and write_end calls needs to be minimal); otherwise this can block
ext4 transactions.

In the case of a user page fault, the address operation's
page_mkwrite() is called, and at that point we will allocate any
blocks needed to back memory if necessary (in the case of delayed
allocation, file system space has to get reserved).  The problem here
for remote access is that page_mkwrite() can block, and it can
potentially fail (e.g., with ENOSPC or ENOMEM).  This is also why we
can't just add the page buffers and do the file system allocation in
set_page_dirty(), since set_page_dirty() isn't allowed to block.

One approach might be to make all of the pages writeable when
pin_user_pages_remote() is called.  That that would mean that in the
case of remote access via process_vm_writev or RDMA, all of the blocks
will be allocated early.  But that's probably better since at that
point the userspace code is in a position to receive the error when
setting up the RDMA memory, and we don't want to be asking the file
system to do block allocation when incoming data is coming in from
Infiniband or iWARP.

> I see that ext4_warning_inode() has rate limiting, but it doesn't look
> like it's really intended for a per-page rate. It looks like it is
> per-superblock (yes?), so I suspect one instance of this problem, with
> many pages involved, could hit the limit.
> 
> Often, WARN_ON_ONCE() is used with per-page assertions. That's not great
> either, but it might be better here. OTOH, I have minimal experience
> with ext4_warning_inode() and maybe it really is just fine with per-page
> failure rates.

With the syzbot reproducer, we're not actually triggering the rate
limiter, since the ext4 warning is only getting hit a bit more than
once every 4 seconds.  And I'm guessing that in the real world, people
aren't actually trying to do remote direct access to file-backed
memory, at least not using ext4, since that's an invitation to a
kernel crash, and we would have gotten user complaints.  If some user
actually tries to use process_vm_writev for realsies, as opposed to a
random fuzzer or from a malicious program , we do want to warn them
about the potential data loss, so I'd prefer to warn once for each
inode --- but I'm not convinced that it's worth the effort.

Cheers,

- Ted

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-17 Thread Theodore Ts&#x27;o

On Fri, Feb 18, 2022 at 04:24:20AM +, Matthew Wilcox wrote:
> On Thu, Feb 17, 2022 at 09:54:30PM -0500, Theodore Ts'o wrote:
> > process_vm_writev() uses [un]pin_user_pages_remote() which is the same
> > interface uses for RDMA.  But it's not clear this is ever supposed to
> > work for memory which is mmap'ed region backed by a file.
> > pin_user_pages_remote() appears to assume that it is an anonymous
> > region, since the get_user_pages functions in mm/gup.c don't call
> > read_page() to read data into any pages that might not be mmaped in.
> 
> ... it doesn't end up calling handle_mm_fault() in faultin_page()?

Ah yes, sorry, I missed that.  This is what happens when a syzbot bug
is thrown to a file system developer, who then has to wade theough
mm code for which he is not understand

- Ted

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-17 Thread Theodore Ts&#x27;o

On Thu, Feb 17, 2022 at 05:06:45PM -0800, John Hubbard wrote:
> Yes. And looking at the pair of backtraces below, this looks very much
> like another aspect of the "get_user_pages problem" [1], originally
> described in Jan Kara's 2018 email [2].

Hmm... I just posted my analysis, which tracks with yours; but I had
forgotten about Jan's 2018 e-mail on the matter.

> I'm getting close to posting an RFC for the direct IO conversion to
> FOLL_PIN, but even after that, various parts of the kernel (reclaim,
> filesystems/block layer) still need to be changed so as to use
> page_maybe_dma_pinned() to help avoid this problem. There's a bit
> more than that, actually.

The challenge is that fixing this "the right away" is probably not
something we can backport into an LTS kernel, whether it's 5.15 or
5.10... or 4.19.

The only thing which can probably survive getting backported is
something like this.  It won't make the right thing happen if someone
is trying to RDMA or call process_vm_writev() into a file-backed
memory region --- but I'm not sure I care.  Certainly if the goal is
to make Android kernels, I'm pretty sure they are't either using RDMA,
and I suspect they are probably masking out the process_vm_writev(2)
system call (at least, for Android O and newer).  So if the goal is to
just to close some Syzbot bugs, what do folks think about this?

    - Ted

commit 7711b1fda6f7f04274fa1cba6f092410262b0296
Author: Theodore Ts'o 
Date:   Thu Feb 17 22:54:03 2022 -0500

ext4: work around bugs in mm/gup.c that can cause ext4 to BUG()

[un]pin_user_pages_remote is dirtying pages without properly warning
the file system in advance (or faulting in the file data if the page
is not yet in the page cache).  This was noted by Jan Kara in 2018[1]
and more recently has resulted in bug reports by Syzbot in various
Android kernels[2].

Fixing this for real is non-trivial, and will never be backportable
into stable kernels.  So this is a simple workaround that stops the
kernel from BUG()'ing.  The changed pages will not be properly written
back, but given that the current gup code is missing the "read" in
"read-modify-write", the dirty page in the page cache might contain
corrupted data anyway.

[1] https://www.spinics.net/lists/linux-mm/msg142700.html
[2] https://lore.kernel.org/r/yg0m6ijcnmfas...@google.com

Reported-by: 
syzbot+d59332e2db681cf18f0318a06e994ebbb529a...@syzkaller.appspotmail.com
Reported-by: Lee Jones 
Signed-off-by: Theodore Ts'o 

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 01c9e4f743ba..3b2f336a90d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1993,6 +1993,15 @@ static int ext4_writepage(struct page *page,
else
len = PAGE_SIZE;

+   /* Should never happen but for buggy gup code */
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(inode,
+  "page %lu does not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   return 0;
+   }
+
page_bufs = page_buffers(page);
/*
 * We cannot do block allocation or other extent handling in this
@@ -2594,6 +2603,14 @@ static int mpage_prepare_extent_to_map(struct 
mpage_da_data *mpd)
wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));

+   /* Should never happen but for buggy gup code */
+   if (!page_has_buffers(page)) {
+   ext4_warning_inode(mpd->inode, "page %lu does 
not have buffers attached", page->index);
+   ClearPageDirty(page);
+   unlock_page(page);
+   continue;
+   }
+
if (mpd->map.m_len == 0)
mpd->first_page = page->index;
mpd->next_page = page->index + 1;

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

2022-02-17 Thread Theodore Ts&#x27;o

On Wed, Feb 16, 2022 at 04:31:36PM +, Lee Jones wrote:
> 
> After recently receiving a bug report from Syzbot [0] which was raised
> specifically against the Android v5.10 kernel, I spent some time
> trying to get to the crux.  Firstly I reproduced the issue on the
> reported kernel, then did the same using the latest release kernel
> v5.16.
> 
> The full kernel trace can be seen below at [1].
> 

Lee, thanks for your work in trimming down the syzkaller reproducer.
The moral equivalent of what it is doing (except each system call is
done in a separate thread, with synchronization so each gets executed
in order, but perhaps on a different CPU) is:

int dest_fd, src_fd, truncate_fd, mmap_fd;
pid_t tid;
struct iovec local_iov, remote_iov;

dest_fd = open("./bus", O_RDWR|O_CREAT|O_NONBLOCK|O_SYNC|
   O_DIRECT|O_LARGEFILE|O_NOATIME, 0);
src_fd = openat(AT_FDCWD, "/bin/bash", O_RDONLY);
truncate_fd = syscall(__NR_open, "./bus", 
O_RDWR|O_CREAT|O_SYNC|O_NOATIME|O_ASYNC);
ftruncate(truncate_fd, 0x2008002ul);
mmap((void *) 0x2000ul /* addr */, 0x60ul /* length */,
 PROT_WRITE|PROT_EXEC|PROT_SEM||0x70, MAP_FIXED|MAP_SHARED, 
mmap_fd, 0);
sendfile(dest_fd, src_fd, 0 /* offset */, 0x8005ul /* count */);
local_iov.iov_base = (void *) 0x2034afa4;
local_iov.iov_len = 0x1f80;
remote_iov.iov_base = (void *) 0x2080;
remote_iov.iov_len = 0x2034afa5;
process_vm_writev(gettid(), &local_iov, 1, &remote_iov, 1, 0);
sendfile(dest_fd, src_fd, 0 /* offset */, 0x1f05ul /* count */);

Which is then executed repeataedly over and over again.  (With the
file descriptors closed at each loop, so the file is reopened each time.)

So basically, we have a scratch file which is initialized using an
sendfile using O_DIRECT.  The scratch file is also mmap'ed into the
process's address space, and we then *modify* that an mmap'ed reagion
using process_vm_writev() --- and this is where the problem starts.

process_vm_writev() uses [un]pin_user_pages_remote() which is the same
interface uses for RDMA.  But it's not clear this is ever supposed to
work for memory which is mmap'ed region backed by a file.
pin_user_pages_remote() appears to assume that it is an anonymous
region, since the get_user_pages functions in mm/gup.c don't call
read_page() to read data into any pages that might not be mmaped in.

They also don't follow the mm / file system protocol of calling the
file system's write_begin() or page_mkwrite() before modifying a page,
and so when when process_vm_writev() calls unpin_user_pages_remote(),
this tries to dirty the page, but since page_mkwrite() or
write_begin() hasn't been called, buffers haven't been attached to the
page, and so that triggers the following ext4 WARN_ON:

[ 1451.095939] WARNING: CPU: 1 PID: 449 at fs/ext4/inode.c:3565 
ext4_set_page_dirty+0x48/0x50
  ...
[ 1451.139877] Call Trace:
[ 1451.140833]  
[ 1451.141889]  folio_mark_dirty+0x2f/0x60
[ 1451.143408]  set_page_dirty_lock+0x3e/0x60
[ 1451.145130]  unpin_user_pages_dirty_lock+0x108/0x130
[ 1451.147405]  process_vm_rw_single_vec.constprop.0+0x1b9/0x260
[ 1451.150135]  process_vm_rw_core.constprop.0+0x156/0x280
[ 1451.159576]  process_vm_rw+0xc4/0x110

Then when ext4_writepages() gets called, we trigger the BUG because
buffers haven't been attached to the page.  We can certainly avoid the
BUG by checking to see if buffers are attached first, and if not, skip
writing the page trigger a WARN_ON, and then forcibly clear the page
dirty flag so don't permanently leak memory and allow the file system
to be unmounted.  (Note: we can't fix the problem by attaching the
buffers in set_page_dirty(), since is occasionally called under
spinlocks and without the page being locked, so we can't do any kind
of allocation, so ix-nay on calling create_empty_buffers() which calls
alloc_buffer_head().)

BUT, that is really papering over the problem, since it's not clear
it's valid to try to use get_user_pages() and friends (including
[un]pin_user_pages_remote() on file-backed memory.

And if it is supposed to be valid, then mm/gup.c needs to first call
readpage() if the page is not present, so that if process_vm_writev()
is only modifying a few bytes in the mmap'ed region, we need to fault
in the page first, and then mm/gup.c needs to inform the file system
to make sure that if pinned memory region is not yet allocated, than
whatever needs to happen to allocate space, via page_mkwrite() has
taken place.  (And by the way, that means that pin_user_pages_remote()
may need to return ENOSPC if there is not free space in the file
system, and hence ENOSPC may need to reflected all the way back to
process_vm_writev().

Alternatively, if we don't expect process_vm_writev() to work on
file-backed memory, perhaps it and pin_user_pages_remote() should
return some kind of error?

Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

2021-10-25 Thread Theodore Ts&#x27;o

On Mon, Oct 25, 2021 at 08:24:26PM +0200, Andreas Gruenbacher wrote:
> > For generic_perform_write() Dave Hansen attempted to move the fault-in
> > after the uaccess in commit 998ef75ddb57 ("fs: do not prefault
> > sys_write() user buffer pages"). This was reverted as it was exposing an
> > ext4 bug. I don't [know] whether it was fixed but re-applying Dave's commit
> > avoids the performance drop.
> 
> Interesting. The revert of commit 998ef75ddb57 is in commit
> 00a3d660cbac. Maybe Dave and Ted can tell us more about what went
> wrong in ext4 and whether it's still an issue.

The context for the revert can be found here[1].

[1] https://lore.kernel.org/lkml/20151005152236.ga8...@thunk.org/

And "what went wrong in ext4" was fixed here[2].

[2] https://lore.kernel.org/lkml/20151005152236.ga8...@thunk.org/

which landed upstream as commit b90197b65518 ("ext4: use private
version of page_zero_new_buffers() for data=journal mode").

So it looks like the original issue which triggered the revert in 2015
should be addressed, and we can easily test it by using generic/208
with data=journal mode.

There also seems to be a related discussion about whether we should
unrevert 998ef75ddb57 here[3].  Hmm. there is a mention on that thread
in [3], "Side note: search for "iov_iter_fault_in_writeable()" on lkml
for a gfs2 patch-series that is buggy, exactly because it does *not*
use the atomic user space accesses, and just tries to do the fault-in
to hide the real bug."  I assume that's related to the discussion on
this thread?

[3] https://lore.kernel.org/all/3221175.1624375...@warthog.procyon.org.uk/T/#u

- Ted

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-27 Thread Theodore Ts&#x27;o

On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> If this ever turn out to be a problem and with the vmapped stacks we
> have good chances to get a proper stack traces on a potential overflow
> we can add the scope API around the problematic code path with the
> explanation why it is needed.

Yeah, or maybe we can automate it?  Can the reclaim code check how
much stack space is left and do the right thing automatically?

The reason why I'm nervous is that nojournal mode is not a common
configuration, and "wait until production systems start failing" is
not a strategy that I or many SRE-types find comforting.

So if we can assure ourselves that the right thing will happen
automatically, or that lockdep will detect a required GFP_NOFS when
running tests, the happier I'll be.

- Ted

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-26 Thread Theodore Ts&#x27;o

On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > I'm convinced the current series is OK, only real life will tell us 
> > > whether
> > > we missed something or not ;)
> > 
> > I would like to extend the changelog of "jbd2: mark the transaction
> > context with the scope GFP_NOFS context".
> > 
> > "
> > Please note that setups without journal do not suffer from potential
> > recursion problems and so they do not need the scope protection because
> > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > points from the direct reclaim) can reenter a locked context which is
> > doing the allocation currently.
> > "
> 
> Could you comment on this Ted, please?

I guess   so there still is one way this could screw us, and it's this 
reason for GFP_NOFS:

- to prevent from stack overflows during the reclaim because
  the allocation is performed from a deep context already

The writepages call stack can be pretty deep.  (Especially if we're
using ext4 in no journal mode over, say, iSCSI.)

How much stack space can get consumed by a reclaim?

- Ted

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-17 Thread Theodore Ts&#x27;o

On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> 
> OK, so I've been staring into the code and AFAIU current->journal_info
> can contain my stored information. I could either hijack part of the
> word as the ref counting is only consuming low 12b. But that looks too
> ugly to live. Or I can allocate some placeholder.

Yeah, I was looking at something similar.  Can you guarantee that the
context will only take one or two bits?  (Looks like it only needs one
bit ATM, even though at the moment you're storing the whole GFP mask,
correct?)

> But before going to play with that I am really wondering whether we need
> all this with no journal at all. AFAIU what Jack told me it is the
> journal lock(s) which is the biggest problem from the reclaim recursion
> point of view. What would cause a deadlock in no journal mode?

We still have the original problem for why we need GFP_NOFS even in
ext2.  If we are in a writeback path, and we need to allocate memory,
we don't want to recurse back into the file system's writeback path.
Certainly not for the same inode, and while we could make it work if
the mm was writing back another inode, or another superblock, there
are also stack depth considerations that would make this be a bad
idea.  So we do need to be able to assert GFP_NOFS even in no journal
mode, and for any file system including ext2, for that matter.

Because of the fact that we're going to have to play games with
current->journal_info, maybe this is something that I should take
responsibility for, and to go through the the ext4 tree after the main
patch series go through?  Maybe you could use xfs and ext2 as sample
(simple) implementations?

My only ask is that the memalloc nofs context be a well defined N
bits, where N < 16, and I'll find some place to put them (probably
journal_info).

Thanks,

- Ted

Re: [Cluster-devel] [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

2017-01-16 Thread Theodore Ts&#x27;o

On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> sb_getblk_gfp is not really needed as
> sb_getblk
>   __getblk_gfp
> __getblk_slow
>   grow_buffers
> grow_dev_page
> gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> 
> so __GFP_FS is cleared unconditionally and therefore the above commit
> didn't have any real effect in fact.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko 
> Reviewed-by: Jan Kara 

If I'm not mistaken, this patch is not dependent on any of the other
patches in this series (and the other patches are not dependent on
this one).  Hence, I could take this patch via the ext4 tree, correct?

 - Ted

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

2017-01-16 Thread Theodore Ts&#x27;o

On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> From: Michal Hocko 
> 
> This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> the transaction context uses memalloc_nofs_save and all allocations
> within the this context inherit GFP_NOFS automatically, there is no
> reason to mark specific allocations explicitly.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> to make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko 
> Reviewed-by: Jan Kara 

Changes in the jbd2 layer aren't going to guarantee that
memalloc_nofs_save() will be executed if we are running ext4 without a
journal (aka in no journal mode).  And this is a *very* common
configuration; it's how ext4 is used inside Google in our production
servers.

So that means the earlier patches will probably need to be changed so
the nOFS scope is done in the ext4_journal_{start,stop} functions in
fs/ext4/ext4_jbd2.c.

- Ted

Re: [Cluster-devel] [PATCH 08/12] ext4: Convert to private i_dquot field

2014-11-04 Thread Theodore Ts&#x27;o

On Tue, Nov 04, 2014 at 12:19:49PM +0100, Jan Kara wrote:
> CC: linux-e...@vger.kernel.org
> CC: "Theodore Ts'o" 
> Signed-off-by: Jan Kara 

Acked-by: Theodore Ts'o 

- Ted

Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts&#x27;o

On Thu, Mar 13, 2014 at 07:33:02PM -0500, Steve French wrote:
> On Thu, Mar 13, 2014 at 9:20 AM, Theodore Ts'o  wrote:
> > Previously, the no-op "mount -o mount /dev/xxx" operation when the
> > file system is already mounted read-write causes an implied,
> > unconditional syncfs().  This seems pretty stupid, and it's certainly
> > documented or guaraunteed to do this, nor is it particularly useful,
> > except in the case where the file system was mounted rw and is getting
> > remounted read-only.
> 
> Is there a case where a file system, not mounted read-only,
> would want to skip the syncfs on remount?  I don't know
> of any particular reason to do a syncfs on remount unless
> caching behavior is changing (or moving to read-only mount),
> but if as you say it is documented and guaranteed...

If the file system is mounted read-write, and it is transitioning to
read-only, i.e. "mount -o remount,ro /" then you do want to write out
all dirty data before you transition it to be read-only (otherwise you
would lose data).

It is my belief that this is the _only_ data integrity issue which is
implied by remount (and this is more about not losing work done by
previous system calls).

The background reason for this commit is that a user complained on the
ext4 list that he is using containers in a virtualization environment,
and due to the init scripts which the user doesn't want to change, it
is causing gazillions of no-op remounts, and this is causing ext4 (and
xfs) to do send CACHE FLUSH commands because it is required by the
syncfs(2) system call, which also calls sync_filesystem.  These CACHE
FLUSH commands are unneeded for anything, especially for these no-op
remounts, and so I want them to go away for remounts, but they should
still be there in response to syncfs(2) requests.

Cheers,

- Ted

Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts&#x27;o

On Thu, Mar 13, 2014 at 04:28:23PM +, Steven Whitehouse wrote:
> 
> I guess the same is true for other file systems which are mounted ro
> too. So maybe a check for MS_RDONLY before doing the sync in those
> cases?

My original patch moved the sync_filesystem into the check for
MS_RDONLY in the core VFS code.  The objection was raised that there
might be some file system out there that might depend on this
behaviour.  I can't imagine why, but I suppose it's at least
theoretically possible.

So the idea is that this particular patch is *guaranteed* not to make
any difference.  That way there can be no question about the patch'es
correctness.

I'm going to follow up with a patch for ext4 that does exactly that,
but the idea is to allow each file system maintainer to do that for
their own file system.

I could do that as well for file systems that are "obviously"
read-only, but then I'll find out that there's some wierd case where
the file system can be used in a read-write fashion.  (Example: UDF is
normally used for DVD's, but at least in theory it can be used
read/write --- I'm told that Windows supports read-write UDF file
systems on USB sticks, and at least in theory it could be used as a
inter-OS exchange format in situations where VFAT and exFAT might not
be appropriate for various reasons.)

Cheers,

- Ted

[Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

2014-03-13 Thread Theodore Ts&#x27;o

Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" 
Cc: linux-fsde...@vger.kernel.org
Cc: Christoph Hellwig 
Cc: Artem Bityutskiy 
Cc: Adrian Hunter 
Cc: Evgeniy Dushistov 
Cc: Jan Kara 
Cc: OGAWA Hirofumi 
Cc: Anders Larsen 
Cc: Phillip Lougher 
Cc: Kees Cook 
Cc: Mikulas Patocka 
Cc: Petr Vandrovec 
Cc: x...@oss.sgi.com
Cc: linux-bt...@vger.kernel.org
Cc: linux-c...@vger.kernel.org
Cc: samba-techni...@lists.samba.org
Cc: codal...@coda.cs.cmu.edu
Cc: linux-e...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: fuse-de...@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-...@lists.infradead.org
Cc: jfs-discuss...@lists.sourceforge.net
Cc: linux-...@vger.kernel.org
Cc: linux-ni...@vger.kernel.org
Cc: linux-ntfs-...@lists.sourceforge.net
Cc: ocfs2-de...@oss.oracle.com
Cc: reiserfs-de...@vger.kernel.org
---
 fs/adfs/super.c  | 1 +
 fs/affs/super.c  | 1 +
 fs/befs/linuxvfs.c   | 1 +
 fs/btrfs/super.c | 1 +
 fs/cifs/cifsfs.c | 1 +
 fs/coda/inode.c  | 1 +
 fs/cramfs/inode.c| 1 +
 fs/debugfs/inode.c   | 1 +
 fs/devpts/inode.c| 1 +
 fs/efs/super.c   | 1 +
 fs/ext2/super.c  | 1 +
 fs/ext3/super.c  | 2 ++
 fs/ext4/super.c  | 2 ++
 fs/f2fs/super.c  | 2 ++
 fs/fat/inode.c   | 2 ++
 fs/freevxfs/vxfs_super.c | 1 +
 fs/fuse/inode.c  | 1 +
 fs/gfs2/super.c  | 2 ++
 fs/hfs/super.c   | 1 +
 fs/hfsplus/super.c   | 1 +
 fs/hpfs/super.c  | 2 ++
 fs/isofs/inode.c | 1 +
 fs/jffs2/super.c | 1 +
 fs/jfs/super.c   | 1 +
 fs/minix/inode.c | 1 +
 fs/ncpfs/inode.c | 1 +
 fs/nfs/super.c   | 2 ++
 fs/nilfs2/super.c| 1 +
 fs/ntfs/super.c  | 2 ++
 fs/ocfs2/super.c | 2 ++
 fs/openpromfs/inode.c| 1 +
 fs/proc/root.c   | 2 ++
 fs/pstore/inode.c| 1 +
 fs/qnx4/inode.c  | 1 +
 fs/qnx6/inode.c  | 1 +
 fs/reiserfs/super.c  | 1 +
 fs/romfs/super.c | 1 +
 fs/squashfs/super.c  | 1 +
 fs/super.c   | 2 --
 fs/sysv/inode.c  | 1 +
 fs/ubifs/super.c | 1 +
 fs/udf/super.c   | 1 +
 fs/ufs/super.c   | 1 +
 fs/xfs/xfs_super.c   | 1 +
 44 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/fs/adfs/super.c b/fs/adfs/super.c
index 7b3003c..952aeb0 100644
--- a/fs/adfs/super.c
+++ b/fs/adfs/super.c
@@ -212,6 +212,7 @@ static int parse_options(struct super_block *sb, char 
*options)
 
 static int adfs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
return parse_options(sb, data);
 }
diff --git a/fs/affs/super.c b/fs/affs/super.c
index d098731..3074530 100644
--- a/fs/affs/super.c
+++ b/fs/affs/super.c
@@ -530,6 +530,7 @@ affs_remount(struct super_block *sb, int *flags, char *data)
 
pr_debug("AFFS: remount(flags=0x%x,opts=\"%s\")\n",*flags,data);
 
+   sync_filesystem(sb);
*flags |= MS_NODIRATIME;
 
memcpy(volume, sbi->s_volume, 32);
diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 845d2d6..56d70c8 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -913,6 +913,7 @@ befs_fill_super(struct super_block *sb, void *data, int 
silent)
 static int
 befs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_filesystem(sb);
if (!(*flags & MS_RDONLY))
return -EINVAL;
return 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 97cc241..00cd0c5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1381,6 +1381,7 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
unsigned int old_metadata_ratio = fs_info->metadata_ratio;
int ret;
 
+   sync_filesystem(sb);
btrfs_remount_prepare(fs_info);
 
ret = btrfs_parse_options(root, data);
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f613..4942c94 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -541,6 +541,7 @@ static int cifs_show_stats(struct seq_file *s, struct 
dentry *root)
 
 static int cifs_remount(struct super_block *sb, int *flags, char *data)
 {
+   sync_file

Re: [Cluster-devel] [trivial] treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks

2013-06-23 Thread Theodore Ts&#x27;o

On Wed, Jun 19, 2013 at 12:15:53PM -0700, Joe Perches wrote:
> Don't emit OOM warnings when k.alloc calls fail when
> there there is a v.alloc immediately afterwards.
> 
> Signed-off-by: Joe Perches 

For fs/ext4/super.c:

Acked-by: "Theodore Ts'o" 

- Ted

Re: [Cluster-devel] [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument

2013-04-23 Thread Theodore Ts&#x27;o

On Tue, Apr 09, 2013 at 11:14:17AM +0200, Lukas Czerner wrote:
> ->invalidatepage() aop now accepts range to invalidate so we can make
> use of it in gfs2_invalidatepage().
> 
> Signed-off-by: Lukas Czerner 
> Cc: cluster-devel@redhat.com

To the gfs2 development team, 

Since half of this patch series modifies ext4 extensively, and changes
to the other file systems are relatively small, I plan to carry the
invalidatepage patch set in the ext4 tree for the next development
cycle (i.e., not the upcoming merge window, but the next one).  To
that end, it would be great if you take a look at this patch set and
send us an Acked-by signoff.

Thanks!!

- Ted

Re: [Cluster-devel] [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument

2013-04-23 Thread Theodore Ts&#x27;o

On Tue, Apr 23, 2013 at 10:16:31AM -0400, Theodore Ts'o wrote:
> On Tue, Apr 09, 2013 at 11:14:17AM +0200, Lukas Czerner wrote:
> > ->invalidatepage() aop now accepts range to invalidate so we can make
> > use of it in gfs2_invalidatepage().
> > 
> > Signed-off-by: Lukas Czerner 
> > Cc: cluster-devel@redhat.com
> 
> To the gfs2 development team, 

Whoops, I missed Steven Whitehouse's Acked-by.  Sorry for the noise,

  - Ted

Re: [Cluster-devel] [syzbot] [gfs2?] KASAN: use-after-free Read in qd_unlock (2)

Re: [Cluster-devel] [PATCH 01/12] backing_dev: remove current->backing_dev_info

Re: [Cluster-devel] [PATCH 03/12] filemap: update ki_pos in generic_perform_write

Re: [Cluster-devel] [GIT PULL] gfs2 writepage fix

Re: [Cluster-devel] [PATCH v2 06/14] jbd2: replace ll_rw_block()

Re: [Cluster-devel] [PATCH -v5] ext4: don't BUG if someone dirty pages without asking ext4 first

Re: [Cluster-devel] [PATCH -v4] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

Re: [Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

[Cluster-devel] [PATCH -v3] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

Re: [Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

[Cluster-devel] [PATCH -v2] ext4: don't BUG if kernel subsystems dirty pages without asking ext4 first

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [REPORT] kernel BUG at fs/ext4/inode.c:2620 - page_buffers()

Re: [Cluster-devel] [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

Re: [Cluster-devel] [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"

Re: [Cluster-devel] [PATCH 08/12] ext4: Convert to private i_dquot field

Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

Re: [Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

[Cluster-devel] [PATCH] fs: push sync_filesystem() down to the file system's remount_fs()

Re: [Cluster-devel] [trivial] treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks

Re: [Cluster-devel] [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument

Re: [Cluster-devel] [PATCH v3 08/18] gfs2: use ->invalidatepage() length argument

33 matches

Site Navigation

Mail list logo

Footer information