On Mon, Sep 08, 2025 at 05:27:37PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > We have introduced the f_op->mmap_prepare hook to allow for setting up a
> > VMA far earlier in the process of mapping memory, reducing problematic
> > error handling paths, but this does not provide what all
> > drivers/filesystems need.
> >
> > In order to supply this, and to be able to move forward with removing
> > f_op->mmap altogether, introduce f_op->mmap_complete.
> >
> > This hook is called once the VMA is fully mapped and everything is done,
> > however with the mmap write lock and VMA write locks held.
> >
> > The hook is then provided with a fully initialised VMA which it can do what
> > it needs with, though the mmap and VMA write locks must remain held
> > throughout.
> >
> > It is not intended that the VMA be modified at this point, attempts to do
> > so will end in tears.
> >
> > This allows for operations such as pre-population typically via a remap, or
> > really anything that requires access to the VMA once initialised.
> >
> > In addition, a caller may need to take a lock in mmap_prepare, when it is
> > possible to modify the VMA, and release it on mmap_complete. In order to
> > handle errors which may arise between the two operations, f_op->mmap_abort
> > is provided.
> >
> > This hook should be used to drop any lock and clean up anything before the
> > VMA mapping operation is aborted. After this point the VMA will not be
> > added to any mapping and will not exist.
> >
> > We also add a new mmap_context field to the vm_area_desc type which can be
> > used to pass information pertinent to any locks which are held or any state
> > which is required for mmap_complete, abort to operate correctly.
> >
> > We also update the compatibility layer for nested filesystems which
> > currently still only specify an f_op->mmap() handler so that it correctly
> > invokes f_op->mmap_complete as necessary (note that no error can occur
> > between mmap_prepare and mmap_complete so mmap_abort will never be called
> > in this case).
> >
> > Also update the VMA tests to account for the changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoa...@oracle.com>
> > ---
> >   include/linux/fs.h               |  4 ++
> >   include/linux/mm_types.h         |  5 ++
> >   mm/util.c                        | 18 +++++--
> >   mm/vma.c                         | 82 ++++++++++++++++++++++++++++++--
> >   tools/testing/vma/vma_internal.h | 31 ++++++++++--
> >   5 files changed, 129 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 594bd4d0521e..bb432924993a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2195,6 +2195,10 @@ struct file_operations {
> >     int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> >                             unsigned int poll_flags);
> >     int (*mmap_prepare)(struct vm_area_desc *);
> > +   int (*mmap_complete)(struct file *, struct vm_area_struct *,
> > +                        const void *context);
> > +   void (*mmap_abort)(const struct file *, const void *vm_private_data,
> > +                      const void *context);
>
> Do we have a description somewhere what these things do, when they are
> called, and what a driver may be allowed to do with a VMA?

Yeah there's a doc patch that follows this.

>
> In particular, the mmap_complete() looks like another candidate for letting
> a driver just go crazy on the vma? :)

Well there's only so much we can do. In an ideal world we'd treat VMAs as
entirely internal data structures and pass some sort of opaque thing around, but
we have to keep things real here :)

So the main purpose of these changes is not so much to be as ambitious as
_that_, but to only provide the VMA _when it's safe to do so_.

Before we were providing a pointer to an incompletely-initialised VMA that was
not yet in the maple tree, with which the driver could do _anything_, and then
afterwards have:

a. a bunch of stuff left to do with a VMA that might be in some broken state due
   to drivers.
b. (the really bad case) have error paths to handle because the driver returned
   an error, but did who-knows-what with the VMA and page tables.

So we address this by:

1. mmap_prepare being done _super early_ and _not_ providing a VMA. We
   essentially ask the driver 'hey what do you want these fields that you are
   allowed to change in the VMA to be?'

2. mmap_complete being done _super_ late, essentially just before we release the
   VMA/mmap locks. If an error arises - we can just unmap it, easy. And then
   there's a lot less damage the driver can do.

I think it's probably the most sensible means of doing something about the
legacy we have where we've been rather too 'free and easy' with allowing drivers
to do whatever.

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

Reply via email to