Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-05-05 Thread Suren Baghdasaryan
On Mon, May 5, 2025 at 6:50 AM Christian Brauner  wrote:
>
> On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan  wrote:
> > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn  wrote:
> > > >
> > > > Hi!
> > > >
> > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > > forgot all about that...)
> > >
> > > Does this fact affect your previous comments? Just want to make sure
> > > I'm not missing something...
> >
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
> >
> > I guess maybe that means you really do need to do the lookup from the
> > copied data, as you did in your patch; and that might require calling
> > get_file_active() on the copied ->vm_file pointer (instead of
> > get_file_rcu()), even though I think that is not really how
> > get_file_active() is supposed to be used (it's supposed to be used
> > when you know the original file hasn't been freed yet). Really what
>
> I think it's fine for get_file_active() to be used in this way. That
> ->vm_file pointer usage should get a fat comment above it explaining how
> what you're doing is safe.

Got it. Will use it in my next version. Thanks!

>
> > you'd want for that is basically a raw __get_file_rcu(), but that is
> > static and I think Christian wouldn't want to expose more of these
> > internals outside VFS...
>
> Yeah, no. I don't want that to be usable outside of that file.
>
> > (In that case, all the stuff below about get_file_rcu() would be moot.)
> >
> > Or you could pepper WRITE_ONCE() over all the places that write
> > ->vm_file, and ensure that ->vm_file is always NULLed before its
> > reference is dropped... but that seems a bit more ugly to me.
> >
> > > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan  wrote:
> > > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
> > > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > > > from blocking address space updates.
> > > > > > [...]
> > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > > > --- a/fs/proc/task_mmu.c
> > > > > > > +++ b/fs/proc/task_mmu.c
> > > > > > [...]
> > > > > > > +/*
> > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > > > + * show_map_vma.
> > > > > > > + */
> > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > > > +{
> > > > > > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > > > > > +   int ret = -EAGAIN;
> > > > > > > +
> > > > > > > +   memcpy(copy, vma, sizeof(*vma));
> > > > > > > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > > > +   goto out;
> > > > > >
> > > > > > I think this uses get_file_rcu() in a different way than intended.
> > > > > >
> > > > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > > > pointer which always points to a file with a non-zero refcount (except
> > > > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > > > it observes a zero refcount, it assumes that the pointer must have
> > > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > > > pointer that points to a file with zero refcount, which I think can
> > > > > > happen with this patch, will cause an endless loop.

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-05-05 Thread Christian Brauner
On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan  wrote:
> > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn  wrote:
> > >
> > > Hi!
> > >
> > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > forgot all about that...)
> >
> > Does this fact affect your previous comments? Just want to make sure
> > I'm not missing something...
> 
> When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> down a VMA, and using get_file_rcu() for the lockless lookup, I did
> not realize that you could actually also race with all the other
> places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> did not think about whether any of those code paths might leave a VMA
> with a dangling ->vm_file pointer.
> 
> I guess maybe that means you really do need to do the lookup from the
> copied data, as you did in your patch; and that might require calling
> get_file_active() on the copied ->vm_file pointer (instead of
> get_file_rcu()), even though I think that is not really how
> get_file_active() is supposed to be used (it's supposed to be used
> when you know the original file hasn't been freed yet). Really what

I think it's fine for get_file_active() to be used in this way. That
->vm_file pointer usage should get a fat comment above it explaining how
what you're doing is safe.

> you'd want for that is basically a raw __get_file_rcu(), but that is
> static and I think Christian wouldn't want to expose more of these
> internals outside VFS...

Yeah, no. I don't want that to be usable outside of that file.

> (In that case, all the stuff below about get_file_rcu() would be moot.)
> 
> Or you could pepper WRITE_ONCE() over all the places that write
> ->vm_file, and ensure that ->vm_file is always NULLed before its
> reference is dropped... but that seems a bit more ugly to me.
> 
> > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan  wrote:
> > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
> > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > > from blocking address space updates.
> > > > > [...]
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > [...]
> > > > > > +/*
> > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > > + * show_map_vma.
> > > > > > + */
> > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > > > > +   int ret = -EAGAIN;
> > > > > > +
> > > > > > +   memcpy(copy, vma, sizeof(*vma));
> > > > > > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > > +   goto out;
> > > > >
> > > > > I think this uses get_file_rcu() in a different way than intended.
> > > > >
> > > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > > pointer which always points to a file with a non-zero refcount (except
> > > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > > it observes a zero refcount, it assumes that the pointer must have
> > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > > pointer that points to a file with zero refcount, which I think can
> > > > > happen with this patch, will cause an endless loop.
> > > > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > > > still behave nicely and not spuriously return NULL when the file* is
> > > > > concurrently updated to point to another file*; that's what that loop
> > > > > is for.)
> > > >
> > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > > > checking the return value 

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Jann Horn
On Tue, Apr 29, 2025 at 10:33 PM Suren Baghdasaryan  wrote:
> On Tue, Apr 29, 2025 at 11:55 AM Jann Horn  wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan  wrote:
> > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn  wrote:
> > > >
> > > > Hi!
> > > >
> > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > > forgot all about that...)
> > >
> > > Does this fact affect your previous comments? Just want to make sure
> > > I'm not missing something...
> >
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
>
> So, let me summarize my understanding and see if it's correct.
>
> If we copy the original vma, ensure that it hasn't changed while we
> were copying (with mmap_lock_speculate_retry()) and then use
> get_file_rcu(&copy->vm_file) I think we are guaranteed no UAF because
> we are in RCU read section. At this point the only issue is that
> copy->vm_file might have lost its last refcount and get_file_rcu()
> would enter an infinite loop.

Yeah. (Using get_file_active() would avoid that.)

> So, to avoid that we have to use the
> original vma when calling get_file_rcu()

Sorry - I originally said that, but I didn't think about
SLAB_TYPESAFE_BY_RCU when I had that in mind.

> but then we should also
> ensure that vma itself does not change from under us due to
> SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change
> from under us we might end up accessing an invalid address if
> vma->vm_file is being modified concurrently.

Yeah, I think in theory we would have data races, since the file*
reads in get_file_rcu() could race with all the (plain) ->vm_file
pointer stores. So I guess it might actually be safer to use the
copied VMA's ->vm_file for this, with get_file_active(). Though that
would be abusing get_file_active() quite a bit, so brauner@ should
probably look over this early and see whether he thinks that's
acceptable...

> > I guess maybe that means you really do need to do the lookup from the
> > copied data, as you did in your patch; and that might require calling
> > get_file_active() on the copied ->vm_file pointer (instead of
> > get_file_rcu()), even though I think that is not really how
> > get_file_active() is supposed to be used (it's supposed to be used
> > when you know the original file hasn't been freed yet). Really what
> > you'd want for that is basically a raw __get_file_rcu(), but that is
> > static and I think Christian wouldn't want to expose more of these
> > internals outside VFS...
> > (In that case, all the stuff below about get_file_rcu() would be moot.)
> >
> > Or you could pepper WRITE_ONCE() over all the places that write
> > ->vm_file, and ensure that ->vm_file is always NULLed before its
> > reference is dropped... but that seems a bit more ugly to me.
>
> Ugh, yes. We either ensure no vma->vm_file tearing or use
> __get_file_rcu() on a copy of the vma. Or we have to stabilize the vma
> by locking it... Let me think about all these options. Thanks!



Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Suren Baghdasaryan
On Tue, Apr 29, 2025 at 11:55 AM Jann Horn  wrote:
>
> On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan  wrote:
> > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn  wrote:
> > >
> > > Hi!
> > >
> > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > forgot all about that...)
> >
> > Does this fact affect your previous comments? Just want to make sure
> > I'm not missing something...
>
> When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> down a VMA, and using get_file_rcu() for the lockless lookup, I did
> not realize that you could actually also race with all the other
> places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> did not think about whether any of those code paths might leave a VMA
> with a dangling ->vm_file pointer.

So, let me summarize my understanding and see if it's correct.

If we copy the original vma, ensure that it hasn't changed while we
were copying (with mmap_lock_speculate_retry()) and then use
get_file_rcu(&copy->vm_file) I think we are guaranteed no UAF because
we are in RCU read section. At this point the only issue is that
copy->vm_file might have lost its last refcount and get_file_rcu()
would enter an infinite loop. So, to avoid that we have to use the
original vma when calling get_file_rcu() but then we should also
ensure that vma itself does not change from under us due to
SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change
from under us we might end up accessing an invalid address if
vma->vm_file is being modified concurrently.
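
The retry scheme summarized above can be modelled in user space roughly like this. This is a seqcount-style mock, not the real mmap_lock speculation API; the names mirror the kernel ones purely for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

/* Simplified user-space model of the speculation scheme: readers copy
 * the VMA without a lock and discard the copy if a writer ran. */

struct mock_vma { unsigned long vm_start, vm_end; void *vm_file; };

static atomic_uint mm_wr_seq;    /* odd while a writer is in progress */
static struct mock_vma tree_vma; /* the "live" VMA in the tree */

/* Begin speculation: fail if a writer is currently active. */
static int speculate_try_begin(unsigned int *seq)
{
        *seq = atomic_load_explicit(&mm_wr_seq, memory_order_acquire);
        return !(*seq & 1);
}

/* Did any writer run while we were copying? */
static int speculate_retry(unsigned int seq)
{
        atomic_thread_fence(memory_order_acquire);
        return atomic_load_explicit(&mm_wr_seq, memory_order_relaxed) != seq;
}

/* Snapshot the VMA; on failure the caller retries or takes the lock. */
static int get_vma_snapshot(struct mock_vma *copy)
{
        unsigned int seq;

        if (!speculate_try_begin(&seq))
                return -1;
        memcpy(copy, &tree_vma, sizeof(*copy));
        if (speculate_retry(seq))
                return -1;      /* concurrent update: copy may be garbage */
        return 0;
}
```

The copy is only trusted after the sequence re-check; anything pinned from it (vm_file, anon_name) still needs the refcount games discussed in this thread.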

>
> I guess maybe that means you really do need to do the lookup from the
> copied data, as you did in your patch; and that might require calling
> get_file_active() on the copied ->vm_file pointer (instead of
> get_file_rcu()), even though I think that is not really how
> get_file_active() is supposed to be used (it's supposed to be used
> when you know the original file hasn't been freed yet). Really what
> you'd want for that is basically a raw __get_file_rcu(), but that is
> static and I think Christian wouldn't want to expose more of these
> internals outside VFS...
> (In that case, all the stuff below about get_file_rcu() would be moot.)
>
> Or you could pepper WRITE_ONCE() over all the places that write
> ->vm_file, and ensure that ->vm_file is always NULLed before its
> reference is dropped... but that seems a bit more ugly to me.

Ugh, yes. We either ensure no vma->vm_file tearing or use
__get_file_rcu() on a copy of the vma. Or we have to stabilize the vma
by locking it... Let me think about all these options. Thanks!

>
> > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan  wrote:
> > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
> > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > > from blocking address space updates.
> > > > > [...]
> > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > > --- a/fs/proc/task_mmu.c
> > > > > > +++ b/fs/proc/task_mmu.c
> > > > > [...]
> > > > > > +/*
> > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > > + * show_map_vma.
> > > > > > + */
> > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > > +{
> > > > > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > > > > +   int ret = -EAGAIN;
> > > > > > +
> > > > > > +   memcpy(copy, vma, sizeof(*vma));
> > > > > > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > > +   goto out;
> > > > >
> > > > > I think this uses get_file_rcu() in a different way than intended.
> > > > >
> > > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > > pointer which always points to a file with a non-zero refcount (except
> > > > > when it is NULL). That's why it takes a

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Jann Horn
On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan  wrote:
> On Tue, Apr 29, 2025 at 10:21 AM Jann Horn  wrote:
> >
> > Hi!
> >
> > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > forgot all about that...)
>
> Does this fact affect your previous comments? Just want to make sure
> I'm not missing something...

When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
down a VMA, and using get_file_rcu() for the lockless lookup, I did
not realize that you could actually also race with all the other
places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
did not think about whether any of those code paths might leave a VMA
with a dangling ->vm_file pointer.

I guess maybe that means you really do need to do the lookup from the
copied data, as you did in your patch; and that might require calling
get_file_active() on the copied ->vm_file pointer (instead of
get_file_rcu()), even though I think that is not really how
get_file_active() is supposed to be used (it's supposed to be used
when you know the original file hasn't been freed yet). Really what
you'd want for that is basically a raw __get_file_rcu(), but that is
static and I think Christian wouldn't want to expose more of these
internals outside VFS...
(In that case, all the stuff below about get_file_rcu() would be moot.)

Or you could pepper WRITE_ONCE() over all the places that write
->vm_file, and ensure that ->vm_file is always NULLed before its
reference is dropped... but that seems a bit more ugly to me.

> > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan  wrote:
> > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
> > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > pin pointer fields used when generating the output (currently only
> > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > space modifications, wait for them to end and retry. While we take
> > > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > > (often a low priority task, such as monitoring/data collection services)
> > > > > from blocking address space updates.
> > > > [...]
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > [...]
> > > > > +/*
> > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > > + * show_map_vma.
> > > > > + */
> > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > > +{
> > > > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > > > +   int ret = -EAGAIN;
> > > > > +
> > > > > +   memcpy(copy, vma, sizeof(*vma));
> > > > > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > > +   goto out;
> > > >
> > > > I think this uses get_file_rcu() in a different way than intended.
> > > >
> > > > As I understand it, get_file_rcu() is supposed to be called on a
> > > > pointer which always points to a file with a non-zero refcount (except
> > > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > > it observes a zero refcount, it assumes that the pointer must have
> > > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > > pointer that points to a file with zero refcount, which I think can
> > > > happen with this patch, will cause an endless loop.
> > > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > > still behave nicely and not spuriously return NULL when the file* is
> > > > concurrently updated to point to another file*; that's what that loop
> > > > is for.)
> > >
> > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > > checking the return value of get_file_rcu() and retrying speculation
> > > if it changed.
> >
> > I think you could probably still end up looping endlessly in get_file_rcu().

(Just to be clear: What I meant here is that get_file_rcu() loops
*internally*; get_file_rcu() is not guaranteed to ever return if the
pointed-to file has a zero refcount.)
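
To make the contrast concrete, here is a toy user-space model of the inc-not-zero grab that get_file_active() performs. The real helper lives in the VFS and is more involved, so treat every name below as a mock. The point is that a single failed inc-not-zero attempt returns NULL, whereas get_file_rcu() re-reads the file** and retries internally, which cannot terminate while the pointer keeps naming a zero-refcount file:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Mock of a file with a reference count; not the kernel struct file. */
struct mock_file { atomic_long f_count; };

/* get_file_active()-style grab: one inc-not-zero attempt on the file
 * we already hold a pointer to. If the refcount has reached zero we
 * give up and return NULL instead of looping; a get_file_rcu()-style
 * helper would instead reload the file pointer and retry forever. */
static struct mock_file *mock_get_file_active(struct mock_file *file)
{
        long cnt = atomic_load(&file->f_count);

        while (cnt > 0) {
                /* bump the count only while it is still nonzero */
                if (atomic_compare_exchange_weak(&file->f_count, &cnt, cnt + 1))
                        return file;
        }
        return NULL;    /* already dying: caller treats the copy as stale */
}
```

A NULL return here maps onto the "bail out and retry the snapshot" path discussed in this thread.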

> By "retrying speculation" I meant it in the sense of
> get_vma_snapshot() retry when it takes the mmap_read_lock and then
> does mmap_lock_speculate_try_begin to rest

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Suren Baghdasaryan
On Tue, Apr 29, 2025 at 10:21 AM Jann Horn  wrote:
>
> Hi!
>
> (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> forgot all about that...)

Does this fact affect your previous comments? Just want to make sure
I'm not missing something...

>
> On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan  wrote:
> > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
> > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > can change from under us, therefore we make a copy of the vma and we
> > > > pin pointer fields used when generating the output (currently only
> > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > space modifications, wait for them to end and retry. While we take
> > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > > (often a low priority task, such as monitoring/data collection services)
> > > > from blocking address space updates.
> > > [...]
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > [...]
> > > > +/*
> > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > > + * show_map_vma.
> > > > + */
> > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > > +{
> > > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > > +   int ret = -EAGAIN;
> > > > +
> > > > +   memcpy(copy, vma, sizeof(*vma));
> > > > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > > +   goto out;
> > >
> > > I think this uses get_file_rcu() in a different way than intended.
> > >
> > > As I understand it, get_file_rcu() is supposed to be called on a
> > > pointer which always points to a file with a non-zero refcount (except
> > > when it is NULL). That's why it takes a file** instead of a file* - if
> > > it observes a zero refcount, it assumes that the pointer must have
> > > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > > pointer that points to a file with zero refcount, which I think can
> > > happen with this patch, will cause an endless loop.
> > > (Just as background: For other usecases, get_file_rcu() is supposed to
> > > still behave nicely and not spuriously return NULL when the file* is
> > > concurrently updated to point to another file*; that's what that loop
> > > is for.)
> >
> > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> > checking the return value of get_file_rcu() and retrying speculation
> > if it changed.
>
> I think you could probably still end up looping endlessly in get_file_rcu().

By "retrying speculation" I meant it in the sense of
get_vma_snapshot() retry when it takes the mmap_read_lock and then
does mmap_lock_speculate_try_begin to restart speculation. I'm also
thinking about Liam's concern of guaranteeing forward progress for the
reader. Thinking maybe I should not drop mmap_read_lock immediately on
contention but generate some output (one vma or one page worth of
vmas) before dropping mmap_read_lock and proceeding with speculation.
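
One possible shape for that hybrid loop, reduced to a user-space simulation with a fake contention signal. Nothing here is kernel API; it only illustrates the forward-progress idea of emitting a small batch under the read lock before resuming lockless speculation:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VMAS 40
#define BATCH    8

/* Fake contention signal: alternating quiet/contended stretches. */
static bool contended(int i) { return (i / BATCH) % 2; }

/* Toy reader loop: speculative (lockless) output while the address
 * space is quiet; on contention, emit one batch "under the lock" so
 * the reader always makes forward progress. Returns total emitted,
 * and reports how many entries were emitted in locked mode. */
static int read_all(int *locked_emits)
{
        int i = 0, total = 0;

        *locked_emits = 0;
        while (i < NR_VMAS) {
                if (!contended(i)) {
                        i++;            /* speculative success */
                        total++;
                        continue;
                }
                /* mmap_read_lock(mm); -- contended: emit one batch locked */
                for (int n = 0; n < BATCH && i < NR_VMAS; n++, i++) {
                        total++;
                        (*locked_emits)++;
                }
                /* record new mm_wr_seq, then mmap_read_unlock(mm); */
        }
        return total;
}
```

With this contention pattern, half of the stretches are handled lock-free and the rest are drained in bounded locked batches, so a constantly-writing address space can no longer starve the reader.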

>
> > > (If my understanding is correct, maybe we should document that more
> > > explicitly...)
> >
> > Good point. I'll add comments for get_file_rcu() as a separate patch.
> >
> > >
> > > Also, I think you are introducing an implicit assumption that
> > > remove_vma() does not NULL out the ->vm_file pointer (because that
> > > could cause tearing and could theoretically lead to a torn pointer
> > > being accessed here).
> > >
> > > One alternative might be to change the paths that drop references to
> > > vma->vm_file (search for vma_close to find them) such that they first
> > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > > maybe with a new helper like this:
> > >
> > > static void vma_fput(struct vm_area_struct *vma)
> > > {
> > >   struct file *file = vma->vm_file;
> > >
> > >   if (file) {
> > > WRITE_ONCE(vma->vm_file, NULL);
> > > fput(file);
> > >   }
> > > }
> > >
> > > Then on the lockless lookup path you could use get_file_rcu() on the
> > > ->vm_file pointer _of the original VMA_, and store the returned file*
> > > into copy->vm_file.
> >
> > Ack. Except for storing the return value of get_file_rcu(). I think
> > once we detect that  get_file_rcu() returns a different file we should
> > bail out and retry. The change in file is an indication that the v

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Jann Horn
Hi!

(I just noticed that I incorrectly assumed that VMAs use kfree_rcu
(not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
forgot all about that...)

On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan  wrote:
> On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
> > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record new mm_wr_seq counter. This change is designed to reduce
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > [...]
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > [...]
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct vm_area_struct *vma)
> > > +{
> > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > +   int ret = -EAGAIN;
> > > +
> > > +   memcpy(copy, vma, sizeof(*vma));
> > > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > > +   goto out;
> >
> > I think this uses get_file_rcu() in a different way than intended.
> >
> > As I understand it, get_file_rcu() is supposed to be called on a
> > pointer which always points to a file with a non-zero refcount (except
> > when it is NULL). That's why it takes a file** instead of a file* - if
> > it observes a zero refcount, it assumes that the pointer must have
> > been updated in the meantime, and retries. Calling get_file_rcu() on a
> > pointer that points to a file with zero refcount, which I think can
> > happen with this patch, will cause an endless loop.
> > (Just as background: For other usecases, get_file_rcu() is supposed to
> > still behave nicely and not spuriously return NULL when the file* is
> > concurrently updated to point to another file*; that's what that loop
> > is for.)
>
> Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
> checking the return value of get_file_rcu() and retrying speculation
> if it changed.

I think you could probably still end up looping endlessly in get_file_rcu().

> > (If my understanding is correct, maybe we should document that more
> > explicitly...)
>
> Good point. I'll add comments for get_file_rcu() as a separate patch.
>
> >
> > Also, I think you are introducing an implicit assumption that
> > remove_vma() does not NULL out the ->vm_file pointer (because that
> > could cause tearing and could theoretically lead to a torn pointer
> > being accessed here).
> >
> > One alternative might be to change the paths that drop references to
> > vma->vm_file (search for vma_close to find them) such that they first
> > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> > maybe with a new helper like this:
> >
> > static void vma_fput(struct vm_area_struct *vma)
> > {
> >   struct file *file = vma->vm_file;
> >
> >   if (file) {
> > WRITE_ONCE(vma->vm_file, NULL);
> > fput(file);
> >   }
> > }
> >
> > Then on the lockless lookup path you could use get_file_rcu() on the
> > ->vm_file pointer _of the original VMA_, and store the returned file*
> > into copy->vm_file.
>
> Ack. Except for storing the return value of get_file_rcu(). I think
> once we detect that get_file_rcu() returns a different file we should
> bail out and retry. The change in file is an indication that the vma
> got changed from under us, so whatever we have is stale.

What does "different file" mean here - what file* would you compare
the returned one against?



Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Suren Baghdasaryan
On Tue, Apr 29, 2025 at 8:40 AM Jann Horn  wrote:
>
> On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do that momentarily
> > only to record new mm_wr_seq counter. This change is designed to reduce
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> [...]
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> [...]
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct 
> > vm_area_struct *vma)
> > +{
> > +   struct vm_area_struct *copy = &priv->vma_copy;
> > +   int ret = -EAGAIN;
> > +
> > +   memcpy(copy, vma, sizeof(*vma));
> > +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> > +   goto out;
>
> I think this uses get_file_rcu() in a different way than intended.
>
> As I understand it, get_file_rcu() is supposed to be called on a
> pointer which always points to a file with a non-zero refcount (except
> when it is NULL). That's why it takes a file** instead of a file* - if
> it observes a zero refcount, it assumes that the pointer must have
> been updated in the meantime, and retries. Calling get_file_rcu() on a
> pointer that points to a file with zero refcount, which I think can
> happen with this patch, will cause an endless loop.
> (Just as background: For other usecases, get_file_rcu() is supposed to
> still behave nicely and not spuriously return NULL when the file* is
> concurrently updated to point to another file*; that's what that loop
> is for.)

Ah, I see. I wasn't aware of this subtlety. I think this is fixable by
checking the return value of get_file_rcu() and retrying speculation
if it changed.

> (If my understanding is correct, maybe we should document that more
> explicitly...)

Good point. I'll add comments for get_file_rcu() as a separate patch.

>
> Also, I think you are introducing an implicit assumption that
> remove_vma() does not NULL out the ->vm_file pointer (because that
> could cause tearing and could theoretically lead to a torn pointer
> being accessed here).
>
> One alternative might be to change the paths that drop references to
> vma->vm_file (search for vma_close to find them) such that they first
> NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
> maybe with a new helper like this:
>
> static void vma_fput(struct vm_area_struct *vma)
> {
>   struct file *file = vma->vm_file;
>
>   if (file) {
> WRITE_ONCE(vma->vm_file, NULL);
> fput(file);
>   }
> }
>
> Then on the lockless lookup path you could use get_file_rcu() on the
> ->vm_file pointer _of the original VMA_, and store the returned file*
> into copy->vm_file.

Ack. Except for storing the return value of get_file_rcu(). I think
once we detect that get_file_rcu() returns a different file we should
bail out and retry. The change in file is an indication that the vma
got changed from under us, so whatever we have is stale.

>
> > +   if (!anon_vma_name_get_if_valid(copy))
> > +   goto put_file;
> > +
> > +   if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> > +   return 0;
>
> We only check for concurrent updates at this point, so up to here,
> anything we read from "copy" could contain torn pointers (both because
> memcpy() is not guaranteed to copy pointers atomically and because the
> updates to the original VMA are not done with WRITE_ONCE()).
> That probably means that something like the preceding
> anon_vma_name_get_if_valid() could crash on an access to a torn
> pointer.
> Please either do another mmap_lock_speculate_retry() check directly
> after the memcpy(), or ensure nothing before this point reads from
> "copy".

Ack. I'll add mmap_lock_speculate_retry() check right after memcpy().

>
> > +   /* Address space got modified, vma might be stale. Re-lock and 
> > retry. */
> > +   rcu_read_unlock();
> > +   ret = mmap_read_lock_killable(priv->mm);
> > +   if (!ret) {
> > +   /* mmap_lock_speculate_try_begin() succeeds when holding 
> > mmap_read_lock */
> > +   mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> > + 

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-29 Thread Jann Horn
On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan  wrote:
> With maple_tree supporting vma tree traversal under RCU and vma and
> its important members being RCU-safe, /proc/pid/maps can be read under
> RCU and without the need to read-lock mmap_lock. However vma content
> can change from under us, therefore we make a copy of the vma and we
> pin pointer fields used when generating the output (currently only
> vm_file and anon_name). Afterwards we check for concurrent address
> space modifications, wait for them to end and retry. While we take
> the mmap_lock for reading during such contention, we do that momentarily
> only to record new mm_wr_seq counter. This change is designed to reduce
> mmap_lock contention and prevent a process reading /proc/pid/maps files
> (often a low priority task, such as monitoring/data collection services)
> from blocking address space updates.
[...]
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f9d50a61167c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +/*
> + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> + * show_map_vma.
> + */
> +static int get_vma_snapshot(struct proc_maps_private *priv, struct 
> vm_area_struct *vma)
> +{
> +   struct vm_area_struct *copy = &priv->vma_copy;
> +   int ret = -EAGAIN;
> +
> +   memcpy(copy, vma, sizeof(*vma));
> +   if (copy->vm_file && !get_file_rcu(&copy->vm_file))
> +   goto out;

I think this uses get_file_rcu() in a different way than intended.

As I understand it, get_file_rcu() is supposed to be called on a
pointer which always points to a file with a non-zero refcount (except
when it is NULL). That's why it takes a file** instead of a file* - if
it observes a zero refcount, it assumes that the pointer must have
been updated in the meantime, and retries. Calling get_file_rcu() on a
pointer that points to a file with zero refcount, which I think can
happen with this patch, will cause an endless loop.
(Just as background: For other usecases, get_file_rcu() is supposed to
still behave nicely and not spuriously return NULL when the file* is
concurrently updated to point to another file*; that's what that loop
is for.)
(If my understanding is correct, maybe we should document that more
explicitly...)

Also, I think you are introducing an implicit assumption that
remove_vma() does not NULL out the ->vm_file pointer (because that
could cause tearing and could theoretically lead to a torn pointer
being accessed here).

One alternative might be to change the paths that drop references to
vma->vm_file (search for vma_close to find them) such that they first
NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that,
maybe with a new helper like this:

static void vma_fput(struct vm_area_struct *vma)
{
  struct file *file = vma->vm_file;

  if (file) {
WRITE_ONCE(vma->vm_file, NULL);
fput(file);
  }
}

Then on the lockless lookup path you could use get_file_rcu() on the
->vm_file pointer _of the original VMA_, and store the returned file*
into copy->vm_file.

> +   if (!anon_vma_name_get_if_valid(copy))
> +   goto put_file;
> +
> +   if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> +   return 0;

We only check for concurrent updates at this point, so up to here,
anything we read from "copy" could contain torn pointers (both because
memcpy() is not guaranteed to copy pointers atomically and because the
updates to the original VMA are not done with WRITE_ONCE()).
That probably means that something like the preceding
anon_vma_name_get_if_valid() could crash on an access to a torn
pointer.
Please either do another mmap_lock_speculate_retry() check directly
after the memcpy(), or ensure nothing before this point reads from
"copy".

> +   /* Address space got modified, vma might be stale. Re-lock and retry. 
> */
> +   rcu_read_unlock();
> +   ret = mmap_read_lock_killable(priv->mm);
> +   if (!ret) {
> +   /* mmap_lock_speculate_try_begin() succeeds when holding 
> mmap_read_lock */
> +   mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> +   mmap_read_unlock(priv->mm);
> +   ret = -EAGAIN;
> +   }
> +
> +   rcu_read_lock();
> +
> +   anon_vma_name_put_if_valid(copy);
> +put_file:
> +   if (copy->vm_file)
> +   fput(copy->vm_file);
> +out:
> +   return ret;
> +}
[...]
> @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma,
> } else {
> *path = file_user_path(vma->vm_file);
> }
> -   return;
> +   goto out;
> }
>
> if (vma->vm_ops && vma->vm_ops->name) {
> *name = vma->vm_ops->name(vma);

This seems to me like a big, subtle change of semantics. After this
change, vm_ops->name() will no longer receive a real VMA; and in
particular, I think the .name impl

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-24 Thread Andrii Nakryiko
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett  wrote:
>
> * Andrii Nakryiko  [250424 12:04]:
> > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett  
> > > wrote:
> > > >
> > > > * Andrii Nakryiko  [250423 18:06]:
> > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma 
> > > > > > > > and
> > > > > > > > its important members being RCU-safe, /proc/pid/maps can be 
> > > > > > > > read under
> > > > > > > > RCU and without the need to read-lock mmap_lock. However vma 
> > > > > > > > content
> > > > > > > > can change from under us, therefore we make a copy of the vma 
> > > > > > > > and we
> > > > > > > > pin pointer fields used when generating the output (currently 
> > > > > > > > only
> > > > > > > > vm_file and anon_name). Afterwards we check for concurrent 
> > > > > > > > address
> > > > > > > > space modifications, wait for them to end and retry. While we 
> > > > > > > > take
> > > > > > > > the mmap_lock for reading during such contention, we do that 
> > > > > > > > momentarily
> > > > > > > > only to record new mm_wr_seq counter. This change is designed 
> > > > > > > > to reduce
> > > > > > >
> > > > > > > This is probably a stupid question, but why do we need to take a 
> > > > > > > lock
> > > > > > > just to record this counter? uprobes get away without taking 
> > > > > > > mmap_lock
> > > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > > whether there were any modifications in between. Why does this 
> > > > > > > change
> > > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > > >
> > > > > > Not a stupid question. mmap_read_lock() is used to wait for the 
> > > > > > writer
> > > > > > to finish what it's doing and then we continue by recording a new
> > > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > > get_vma_snapshot() does. But your question made me realize that we 
> > > > > > can
> > > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > > dance we do in the get_vma_snapshot(). I think that should work.
> > > > >
> > > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > > >
> > > > We can check this counter once per 4k block and maintain the same
> > > > 'tearing' that exists today instead of per-vma.  Not that anyone said
> > > > they had an issue with changing it, but since we're on this road anyways
> > > > I thought I'd point out where we could end up.
> > >
> > > We would need to run that check on the last call to show_map() right
> > > before seq_file detects the overflow and flushes the page. On
> > > contention we will also be throwing away more prepared data (up to a
> > > page worth of records) vs only the last record. All in all I'm not
> > > convinced this is worth doing unless increased chances of data tearing
> > > is identified as a problem.
> > >
> >
> > Yep, I agree, with filling out 4K of data we run into much higher
> > chances of conflict, IMO. Not worth it, I'd say.
>
> Sounds good.
>
> If this is an issue we do have a path forward still.  Although it's less
> desirable.
>
> >
> > > >
> > > > I am concerned about live locking in either scenario, but I haven't
> > > > looked too deep into this pattern.
> > > >
> > > > I also don't love (as usual) the lack of ensured forward progress.
> > >
> > > Hmm. Maybe we should add a retry limit on
> > > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > > the mmap_read_lock and proceed with it? That would prevent a
> > > hyperactive writer from blocking the reader's forward progress
> > > indefinitely.
> >
> > Came here to say the same. I'd add a small number of retries (3-5?)
> > and then fall back to the read-locked approach. The main challenge is
> > to keep all this logic nicely isolated from the main VMA
> > search/printing logic.
> >
> > For a similar pattern in uprobes, we don't even bother to retry, we
> > just fall back to mmap_read_lock and proceed, under the assumption that
> > this is going to be very rare and thus not important from the overall
> > performance perspective.
>
> In this problem space we are dealing with a herd of readers caused by
> writers delaying an ever-growing line of readers, right?

I'm assuming that the common case is there is no writer, we attempt
lockless vma read, but then (very rarely) writer comes in and starts
to change something, disrupting the read. In uprobe vma loo

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-24 Thread Suren Baghdasaryan
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett  wrote:
>
> * Andrii Nakryiko  [250424 12:04]:
> > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett  
> > > wrote:
> > > >
> > > > * Andrii Nakryiko  [250423 18:06]:
> > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma 
> > > > > > > > and
> > > > > > > > its important members being RCU-safe, /proc/pid/maps can be 
> > > > > > > > read under
> > > > > > > > RCU and without the need to read-lock mmap_lock. However vma 
> > > > > > > > content
> > > > > > > > can change from under us, therefore we make a copy of the vma 
> > > > > > > > and we
> > > > > > > > pin pointer fields used when generating the output (currently 
> > > > > > > > only
> > > > > > > > vm_file and anon_name). Afterwards we check for concurrent 
> > > > > > > > address
> > > > > > > > space modifications, wait for them to end and retry. While we 
> > > > > > > > take
> > > > > > > > the mmap_lock for reading during such contention, we do that 
> > > > > > > > momentarily
> > > > > > > > only to record new mm_wr_seq counter. This change is designed 
> > > > > > > > to reduce
> > > > > > >
> > > > > > > This is probably a stupid question, but why do we need to take a 
> > > > > > > lock
> > > > > > > just to record this counter? uprobes get away without taking 
> > > > > > > mmap_lock
> > > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > > whether there were any modifications in between. Why does this 
> > > > > > > change
> > > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > > >
> > > > > > Not a stupid question. mmap_read_lock() is used to wait for the 
> > > > > > writer
> > > > > > to finish what it's doing and then we continue by recording a new
> > > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > > get_vma_snapshot() does. But your question made me realize that we 
> > > > > > can
> > > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > > dance we do in the get_vma_snapshot(). I think that should work.
> > > > >
> > > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > > >
> > > > We can check this counter once per 4k block and maintain the same
> > > > 'tearing' that exists today instead of per-vma.  Not that anyone said
> > > > they had an issue with changing it, but since we're on this road anyways
> > > > I thought I'd point out where we could end up.
> > >
> > > We would need to run that check on the last call to show_map() right
> > > before seq_file detects the overflow and flushes the page. On
> > > contention we will also be throwing away more prepared data (up to a
> > > page worth of records) vs only the last record. All in all I'm not
> > > convinced this is worth doing unless increased chances of data tearing
> > > is identified as a problem.
> > >
> >
> > Yep, I agree, with filling out 4K of data we run into much higher
> > chances of conflict, IMO. Not worth it, I'd say.
>
> Sounds good.
>
> If this is an issue we do have a path forward still.  Although it's less
> desirable.
>
> >
> > > >
> > > > I am concerned about live locking in either scenario, but I haven't
> > > > looked too deep into this pattern.
> > > >
> > > > I also don't love (as usual) the lack of ensured forward progress.
> > >
> > > Hmm. Maybe we should add a retry limit on
> > > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > > the mmap_read_lock and proceed with it? That would prevent a
> > > hyperactive writer from blocking the reader's forward progress
> > > indefinitely.
> >
> > Came here to say the same. I'd add a small number of retries (3-5?)
> > and then fall back to the read-locked approach. The main challenge is
> > to keep all this logic nicely isolated from the main VMA
> > search/printing logic.
> >
> > For a similar pattern in uprobes, we don't even bother to retry, we
> > just fall back to mmap_read_lock and proceed, under the assumption that
> > this is going to be very rare and thus not important from the overall
> > performance perspective.
>
> In this problem space we are dealing with a herd of readers caused by
> writers delaying an ever-growing line of readers, right?

I don't know if we have a herd of readers. The problem statement was
for a low-priority reader (monitors) blocking a high-priority writer.
Multiple readers are of course possible, so I think as lo

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-24 Thread Liam R. Howlett
* Andrii Nakryiko  [250424 12:04]:
> On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan  wrote:
> >
> > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett  
> > wrote:
> > >
> > > * Andrii Nakryiko  [250423 18:06]:
> > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan  
> > > > wrote:
> > > > >
> > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan 
> > > > > >  wrote:
> > > > > > >
> > > > > > > With maple_tree supporting vma tree traversal under RCU and vma 
> > > > > > > and
> > > > > > > its important members being RCU-safe, /proc/pid/maps can be read 
> > > > > > > under
> > > > > > > RCU and without the need to read-lock mmap_lock. However vma 
> > > > > > > content
> > > > > > > can change from under us, therefore we make a copy of the vma and 
> > > > > > > we
> > > > > > > pin pointer fields used when generating the output (currently only
> > > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > > the mmap_lock for reading during such contention, we do that 
> > > > > > > momentarily
> > > > > > > only to record new mm_wr_seq counter. This change is designed to 
> > > > > > > reduce
> > > > > >
> > > > > > This is probably a stupid question, but why do we need to take a 
> > > > > > lock
> > > > > > just to record this counter? uprobes get away without taking 
> > > > > > mmap_lock
> > > > > > even for reads, and still record this seq counter. And then detect
> > > > > > whether there were any modifications in between. Why does this 
> > > > > > change
> > > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > > >
> > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > > to finish what it's doing and then we continue by recording a new
> > > > > sequence counter value and call mmap_read_unlock. This is what
> > > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > > dance we do in the get_vma_snapshot(). I think that should work.
> > > >
> > > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> > >
> > > We can check this counter once per 4k block and maintain the same
> > > 'tearing' that exists today instead of per-vma.  Not that anyone said
> > > they had an issue with changing it, but since we're on this road anyways
> > > I'd thought I'd point out where we could end up.
> >
> > We would need to run that check on the last call to show_map() right
> > before seq_file detects the overflow and flushes the page. On
> > contention we will also be throwing away more prepared data (up to a
> > page worth of records) vs only the last record. All in all I'm not
> > convinced this is worth doing unless increased chances of data tearing
> > is identified as a problem.
> >
> 
> Yep, I agree, with filling out 4K of data we run into much higher
> chances of conflict, IMO. Not worth it, I'd say.

Sounds good.

If this is an issue we do have a path forward still.  Although it's less
desirable.

> 
> > >
> > > I am concerned about live locking in either scenario, but I haven't
> > > looked too deep into this pattern.
> > >
> > > I also don't love (as usual) the lack of ensured forward progress.
> >
> > Hmm. Maybe we should add a retry limit on
> > mmap_lock_speculate_try_begin() and once the limit is hit we just take
> > the mmap_read_lock and proceed with it? That would prevent a
> > hyperactive writer from blocking the reader's forward progress
> > indefinitely.
> 
> Came here to say the same. I'd add a small number of retries (3-5?)
> and then fall back to the read-locked approach. The main challenge is
> to keep all this logic nicely isolated from the main VMA
> search/printing logic.
> 
> For a similar pattern in uprobes, we don't even bother to retry, we
> just fall back to mmap_read_lock and proceed, under the assumption that
> this is going to be very rare and thus not important from the overall
> performance perspective.

In this problem space we are dealing with a herd of readers caused by
writers delaying an ever-growing line of readers, right?

Assuming there is a backup caused by a writer, then I don't know if the
retry is going to do anything more than heat the data centre.

The readers that take the read lock will get the data, while the others
who arrive during read locked time can try lockless, but will most
likely have a run time that extends beyond the readers holding the lock
and will probably be interrupted by the writer.

We can predict the new readers will also not make it through in time
because the earlier ones failed.  The new readers will then 

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-24 Thread Andrii Nakryiko
On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan  wrote:
>
> On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett  
> wrote:
> >
> > * Andrii Nakryiko  [250423 18:06]:
> > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > > >  wrote:
> > > > >
> > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan 
> > > > >  wrote:
> > > > > >
> > > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > > its important members being RCU-safe, /proc/pid/maps can be read 
> > > > > > under
> > > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > > pin pointer fields used when generating the output (currently only
> > > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > > space modifications, wait for them to end and retry. While we take
> > > > > > the mmap_lock for reading during such contention, we do that 
> > > > > > momentarily
> > > > > > only to record new mm_wr_seq counter. This change is designed to 
> > > > > > reduce
> > > > >
> > > > > This is probably a stupid question, but why do we need to take a lock
> > > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > > even for reads, and still record this seq counter. And then detect
> > > > > whether there were any modifications in between. Why does this change
> > > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > > >
> > > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > > to finish what it's doing and then we continue by recording a new
> > > > sequence counter value and call mmap_read_unlock. This is what
> > > > get_vma_snapshot() does. But your question made me realize that we can
> > > > optimize m_start() further by not taking mmap_read_lock at all.
> > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > > dance we do in the get_vma_snapshot(). I think that should work.
> > >
> > > Ok, yeah, it would be great to avoid taking a lock in a common case!
> >
> > We can check this counter once per 4k block and maintain the same
> > 'tearing' that exists today instead of per-vma.  Not that anyone said
> > they had an issue with changing it, but since we're on this road anyways
> > I'd thought I'd point out where we could end up.
>
> We would need to run that check on the last call to show_map() right
> before seq_file detects the overflow and flushes the page. On
> contention we will also be throwing away more prepared data (up to a
> page worth of records) vs only the last record. All in all I'm not
> convinced this is worth doing unless increased chances of data tearing
> is identified as a problem.
>

Yep, I agree, with filling out 4K of data we run into much higher
chances of conflict, IMO. Not worth it, I'd say.

> >
> > I am concerned about live locking in either scenario, but I haven't
> > looked too deep into this pattern.
> >
> > I also don't love (as usual) the lack of ensured forward progress.
>
> Hmm. Maybe we should add a retry limit on
> mmap_lock_speculate_try_begin() and once the limit is hit we just take
> the mmap_read_lock and proceed with it? That would prevent a
> hyperactive writer from blocking the reader's forward progress
> indefinitely.

Came here to say the same. I'd add a small number of retries (3-5?)
and then fall back to the read-locked approach. The main challenge is
to keep all this logic nicely isolated from the main VMA
search/printing logic.

For a similar pattern in uprobes, we don't even bother to retry, we
just fall back to mmap_read_lock and proceed, under the assumption that
this is going to be very rare and thus not important from the overall
performance perspective.

>
> >
> > It seems like we have four cases for the vm area state now:
> > 1. we want to read a stable vma or set of vmas (per-vma locking)
> > 2. we want to read a stable mm state for reading (the very short named
> > mmap_lock_speculate_try_begin)
>
> and we don't mind retrying on contention. This one should be done
> under RCU protection.
>
> > 3. we ensure a stable vma/mm state for reading (mmap read lock)
> > 4. we are writing - get out of my way (mmap write lock).
>
> I wouldn't call #2 a vma state. More of a usecase when we want to read
> vma under RCU (valid but can change from under us) and then retry if
> it might have been modified from under us.
>
> >
> > Cheers,
> > Liam
> >



Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-24 Thread Suren Baghdasaryan
On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett  wrote:
>
> * Andrii Nakryiko  [250423 18:06]:
> > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan  
> > wrote:
> > >
> > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> > >  wrote:
> > > >
> > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan  
> > > > wrote:
> > > > >
> > > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > > can change from under us, therefore we make a copy of the vma and we
> > > > > pin pointer fields used when generating the output (currently only
> > > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > > space modifications, wait for them to end and retry. While we take
> > > > > the mmap_lock for reading during such contention, we do that 
> > > > > momentarily
> > > > > only to record new mm_wr_seq counter. This change is designed to 
> > > > > reduce
> > > >
> > > > This is probably a stupid question, but why do we need to take a lock
> > > > just to record this counter? uprobes get away without taking mmap_lock
> > > > even for reads, and still record this seq counter. And then detect
> > > > whether there were any modifications in between. Why does this change
> > > > need more heavy-weight mmap_read_lock to do speculative reads?
> > >
> > > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > > to finish what it's doing and then we continue by recording a new
> > > sequence counter value and call mmap_read_unlock. This is what
> > > get_vma_snapshot() does. But your question made me realize that we can
> > > optimize m_start() further by not taking mmap_read_lock at all.
> > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > > dance we do in the get_vma_snapshot(). I think that should work.
> >
> > Ok, yeah, it would be great to avoid taking a lock in a common case!
>
> We can check this counter once per 4k block and maintain the same
> 'tearing' that exists today instead of per-vma.  Not that anyone said
> they had an issue with changing it, but since we're on this road anyways
> I'd thought I'd point out where we could end up.

We would need to run that check on the last call to show_map() right
before seq_file detects the overflow and flushes the page. On
contention we will also be throwing away more prepared data (up to a
page worth of records) vs only the last record. All in all I'm not
convinced this is worth doing unless increased chances of data tearing
is identified as a problem.

>
> I am concerned about live locking in either scenario, but I haven't
> looked too deep into this pattern.
>
> I also don't love (as usual) the lack of ensured forward progress.

Hmm. Maybe we should add a retry limit on
mmap_lock_speculate_try_begin() and once the limit is hit we just take
the mmap_read_lock and proceed with it? That would prevent a
hyperactive writer from blocking the reader's forward progress
indefinitely.

>
> It seems like we have four cases for the vm area state now:
> 1. we want to read a stable vma or set of vmas (per-vma locking)
> 2. we want to read a stable mm state for reading (the very short named
> mmap_lock_speculate_try_begin)

and we don't mind retrying on contention. This one should be done
under RCU protection.

> 3. we ensure a stable vma/mm state for reading (mmap read lock)
> 4. we are writing - get out of my way (mmap write lock).

I wouldn't call #2 a vma state. More of a usecase when we want to read
vma under RCU (valid but can change from under us) and then retry if
it might have been modified from under us.

>
> Cheers,
> Liam
>



Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-23 Thread Liam R. Howlett
* Andrii Nakryiko  [250423 18:06]:
> On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan  wrote:
> >
> > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
> >  wrote:
> > >
> > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan  
> > > wrote:
> > > >
> > > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > > RCU and without the need to read-lock mmap_lock. However vma content
> > > > can change from under us, therefore we make a copy of the vma and we
> > > > pin pointer fields used when generating the output (currently only
> > > > vm_file and anon_name). Afterwards we check for concurrent address
> > > > space modifications, wait for them to end and retry. While we take
> > > > the mmap_lock for reading during such contention, we do that momentarily
> > > > only to record new mm_wr_seq counter. This change is designed to reduce
> > >
> > > This is probably a stupid question, but why do we need to take a lock
> > > just to record this counter? uprobes get away without taking mmap_lock
> > > even for reads, and still record this seq counter. And then detect
> > > whether there were any modifications in between. Why does this change
> > > need more heavy-weight mmap_read_lock to do speculative reads?
> >
> > Not a stupid question. mmap_read_lock() is used to wait for the writer
> > to finish what it's doing and then we continue by recording a new
> > sequence counter value and call mmap_read_unlock. This is what
> > get_vma_snapshot() does. But your question made me realize that we can
> > optimize m_start() further by not taking mmap_read_lock at all.
> > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> > try mmap_lock_speculate_try_begin() and only if it fails do the same
> > dance we do in the get_vma_snapshot(). I think that should work.
> 
> Ok, yeah, it would be great to avoid taking a lock in a common case!

We can check this counter once per 4k block and maintain the same
'tearing' that exists today instead of per-vma.  Not that anyone said
they had an issue with changing it, but since we're on this road anyways
I'd thought I'd point out where we could end up.

I am concerned about live locking in either scenario, but I haven't
looked too deep into this pattern.

I also don't love (as usual) the lack of ensured forward progress.

It seems like we have four cases for the vm area state now:
1. we want to read a stable vma or set of vmas (per-vma locking)
2. we want to read a stable mm state for reading (the very short named
mmap_lock_speculate_try_begin)
3. we ensure a stable vma/mm state for reading (mmap read lock)
4. we are writing - get out of my way (mmap write lock).

Cheers,
Liam




Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-23 Thread Andrii Nakryiko
On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan  wrote:
>
> On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
>  wrote:
> >
> > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > With maple_tree supporting vma tree traversal under RCU and vma and
> > > its important members being RCU-safe, /proc/pid/maps can be read under
> > > RCU and without the need to read-lock mmap_lock. However vma content
> > > can change from under us, therefore we make a copy of the vma and we
> > > pin pointer fields used when generating the output (currently only
> > > vm_file and anon_name). Afterwards we check for concurrent address
> > > space modifications, wait for them to end and retry. While we take
> > > the mmap_lock for reading during such contention, we do that momentarily
> > > only to record new mm_wr_seq counter. This change is designed to reduce
> >
> > This is probably a stupid question, but why do we need to take a lock
> > just to record this counter? uprobes get away without taking mmap_lock
> > even for reads, and still record this seq counter. And then detect
> > whether there were any modifications in between. Why does this change
> > need more heavy-weight mmap_read_lock to do speculative reads?
>
> Not a stupid question. mmap_read_lock() is used to wait for the writer
> to finish what it's doing and then we continue by recording a new
> sequence counter value and call mmap_read_unlock. This is what
> get_vma_snapshot() does. But your question made me realize that we can
> optimize m_start() further by not taking mmap_read_lock at all.
> Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
> try mmap_lock_speculate_try_begin() and only if it fails do the same
> dance we do in the get_vma_snapshot(). I think that should work.

Ok, yeah, it would be great to avoid taking a lock in a common case!

>
> >
> > > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > > (often a low priority task, such as monitoring/data collection services)
> > > from blocking address space updates.
> > > Note that this change has a userspace visible disadvantage: it allows
> > > for sub-page data tearing as opposed to the previous mechanism where
> > > data tearing could happen only between pages of generated output data.
> > > Since current userspace considers data tearing between pages to be
> > > acceptable, we assume it will be able to handle sub-page data tearing
> > > as well.
> > >
> > > Signed-off-by: Suren Baghdasaryan 
> > > ---
> > >  fs/proc/internal.h|   6 ++
> > >  fs/proc/task_mmu.c| 170 ++
> > >  include/linux/mm_inline.h |  18 
> > >  3 files changed, 177 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > > index 96122e91c645..6e1169c1f4df 100644
> > > --- a/fs/proc/internal.h
> > > +++ b/fs/proc/internal.h
> > > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > > struct task_struct *task;
> > > struct mm_struct *mm;
> > > struct vma_iterator iter;
> > > +   bool mmap_locked;
> > > +   loff_t last_pos;
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +   unsigned int mm_wr_seq;
> > > +   struct vm_area_struct vma_copy;
> > > +#endif
> > >  #ifdef CONFIG_NUMA
> > > struct mempolicy *task_mempolicy;
> > >  #endif
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index b9e4fbbdf6e6..f9d50a61167c 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct 
> > > proc_maps_private *priv)
> > >  }
> > >  #endif
> > >
> > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private 
> > > *priv,
> > > -   loff_t *ppos)
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +
> > > +static const struct seq_operations proc_pid_maps_op;
> > > +
> > > +/*
> > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > > + * show_map_vma.
> > > + */
> > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct 
> > > vm_area_struct *vma)
> > > +{
> > > +   struct vm_area_struct *copy = &priv->vma_copy;
> > > +   int ret = -EAGAIN;
> > > +
> > > +   memcpy(copy, vma, sizeof(*vma));
> > > +   if (copy->vm_file && !get_file_rcu(©->vm_file))
> > > +   goto out;
> > > +
> > > +   if (!anon_vma_name_get_if_valid(copy))
> > > +   goto put_file;
> >
> > Given vm_file and anon_vma_name are both RCU-protected, if we take
> > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
> > even need getting/putting them, no?
>
> Yeah, anon_vma_name indeed looks safe without pinning but vm_file is
> using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer
> but pointing to a wrong object even if the rcu grace period did not
> pass. With my assumption that seq_file can't rollback once show_map()
> is done, 

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-23 Thread Suren Baghdasaryan
On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko
 wrote:
>
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan  wrote:
> >
> > With maple_tree supporting vma tree traversal under RCU and vma and
> > its important members being RCU-safe, /proc/pid/maps can be read under
> > RCU and without the need to read-lock mmap_lock. However vma content
> > can change from under us, therefore we make a copy of the vma and we
> > pin pointer fields used when generating the output (currently only
> > vm_file and anon_name). Afterwards we check for concurrent address
> > space modifications, wait for them to end and retry. While we take
> > the mmap_lock for reading during such contention, we do that momentarily
> > only to record new mm_wr_seq counter. This change is designed to reduce
>
> This is probably a stupid question, but why do we need to take a lock
> just to record this counter? uprobes get away without taking mmap_lock
> even for reads, and still record this seq counter. And then detect
> whether there were any modifications in between. Why does this change
> need more heavy-weight mmap_read_lock to do speculative reads?

Not a stupid question. mmap_read_lock() is used to wait for the writer
to finish what it's doing and then we continue by recording a new
sequence counter value and call mmap_read_unlock. This is what
get_vma_snapshot() does. But your question made me realize that we can
optimize m_start() further by not taking mmap_read_lock at all.
Instead of taking mmap_read_lock then doing drop_mmap_lock() we can
try mmap_lock_speculate_try_begin() and only if it fails do the same
dance we do in the get_vma_snapshot(). I think that should work.

>
> > mmap_lock contention and prevent a process reading /proc/pid/maps files
> > (often a low priority task, such as monitoring/data collection services)
> > from blocking address space updates.
> > Note that this change has a userspace visible disadvantage: it allows
> > for sub-page data tearing as opposed to the previous mechanism where
> > data tearing could happen only between pages of generated output data.
> > Since current userspace considers data tearing between pages to be
> > acceptable, we assume it will be able to handle sub-page data tearing
> > as well.
> >
> > Signed-off-by: Suren Baghdasaryan 
> > ---
> >  fs/proc/internal.h|   6 ++
> >  fs/proc/task_mmu.c| 170 ++
> >  include/linux/mm_inline.h |  18 
> >  3 files changed, 177 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> > index 96122e91c645..6e1169c1f4df 100644
> > --- a/fs/proc/internal.h
> > +++ b/fs/proc/internal.h
> > @@ -379,6 +379,12 @@ struct proc_maps_private {
> > struct task_struct *task;
> > struct mm_struct *mm;
> > struct vma_iterator iter;
> > +   bool mmap_locked;
> > +   loff_t last_pos;
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +   unsigned int mm_wr_seq;
> > +   struct vm_area_struct vma_copy;
> > +#endif
> >  #ifdef CONFIG_NUMA
> > struct mempolicy *task_mempolicy;
> >  #endif
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index b9e4fbbdf6e6..f9d50a61167c 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct 
> > proc_maps_private *priv)
> >  }
> >  #endif
> >
> > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> > -   loff_t *ppos)
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +
> > +static const struct seq_operations proc_pid_maps_op;
> > +
> > +/*
> > + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> > + * show_map_vma.
> > + */
> > +static int get_vma_snapshot(struct proc_maps_private *priv, struct 
> > vm_area_struct *vma)
> > +{
> > +   struct vm_area_struct *copy = &priv->vma_copy;
> > +   int ret = -EAGAIN;
> > +
> > +   memcpy(copy, vma, sizeof(*vma));
> > +   if (copy->vm_file && !get_file_rcu(©->vm_file))
> > +   goto out;
> > +
> > +   if (!anon_vma_name_get_if_valid(copy))
> > +   goto put_file;
>
> Given vm_file and anon_vma_name are both RCU-protected, if we take
> rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
> even need getting/putting them, no?

Yeah, anon_vma_name indeed looks safe without pinning but vm_file is
using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer
but pointing to a wrong object even if the rcu grace period did not
pass. With my assumption that seq_file can't rollback once show_map()
is done, I would need a completely stable vma at the time show_map()
is executed so that it does not change from under us while we are
generating the output.
OTOH, if we indeed can rollback while generating seq_file output then
show_map() could output potentially invalid vma, then check for vma
changes and when detected, rollback seq_file and retry again.

>
>

Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU

2025-04-22 Thread Andrii Nakryiko
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan  wrote:
>
> With maple_tree supporting vma tree traversal under RCU and vma and
> its important members being RCU-safe, /proc/pid/maps can be read under
> RCU and without the need to read-lock mmap_lock. However vma content
> can change from under us, therefore we make a copy of the vma and we
> pin pointer fields used when generating the output (currently only
> vm_file and anon_name). Afterwards we check for concurrent address
> space modifications, wait for them to end and retry. While we take
> the mmap_lock for reading during such contention, we do that momentarily
> only to record new mm_wr_seq counter. This change is designed to reduce

This is probably a stupid question, but why do we need to take a lock
just to record this counter? uprobes get away without taking mmap_lock
even for reads, and still record this seq counter. And then detect
whether there were any modifications in between. Why does this change
need more heavy-weight mmap_read_lock to do speculative reads?

> mmap_lock contention and prevent a process reading /proc/pid/maps files
> (often a low priority task, such as monitoring/data collection services)
> from blocking address space updates.
> Note that this change has a userspace visible disadvantage: it allows
> for sub-page data tearing as opposed to the previous mechanism where
> data tearing could happen only between pages of generated output data.
> Since current userspace considers data tearing between pages to be
> acceptable, we assume it will be able to handle sub-page data tearing
> as well.
>
> Signed-off-by: Suren Baghdasaryan 
> ---
>  fs/proc/internal.h|   6 ++
>  fs/proc/task_mmu.c| 170 ++
>  include/linux/mm_inline.h |  18 
>  3 files changed, 177 insertions(+), 17 deletions(-)
>
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index 96122e91c645..6e1169c1f4df 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -379,6 +379,12 @@ struct proc_maps_private {
> struct task_struct *task;
> struct mm_struct *mm;
> struct vma_iterator iter;
> +   bool mmap_locked;
> +   loff_t last_pos;
> +#ifdef CONFIG_PER_VMA_LOCK
> +   unsigned int mm_wr_seq;
> +   struct vm_area_struct vma_copy;
> +#endif
>  #ifdef CONFIG_NUMA
> struct mempolicy *task_mempolicy;
>  #endif
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index b9e4fbbdf6e6..f9d50a61167c 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct 
> proc_maps_private *priv)
>  }
>  #endif
>
> -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
> -   loff_t *ppos)
> +#ifdef CONFIG_PER_VMA_LOCK
> +
> +static const struct seq_operations proc_pid_maps_op;
> +
> +/*
> + * Take VMA snapshot and pin vm_file and anon_name as they are used by
> + * show_map_vma.
> + */
> +static int get_vma_snapshot(struct proc_maps_private *priv, struct 
> vm_area_struct *vma)
> +{
> +   struct vm_area_struct *copy = &priv->vma_copy;
> +   int ret = -EAGAIN;
> +
> +   memcpy(copy, vma, sizeof(*vma));
> +   if (copy->vm_file && !get_file_rcu(©->vm_file))
> +   goto out;
> +
> +   if (!anon_vma_name_get_if_valid(copy))
> +   goto put_file;

Given vm_file and anon_vma_name are both RCU-protected, if we take
rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't
even need getting/putting them, no?

I feel like I'm missing some important limitations, but I don't think
they are spelled out explicitly...

> +
> +   if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq))
> +   return 0;
> +
> +   /* Address space got modified, vma might be stale. Re-lock and retry. 
> */
> +   rcu_read_unlock();

Another question I have is whether we really need to copy vma into
priv->vma_copy to have a stable snapshot? Can't we just speculate like
with uprobes under assumption that data doesn't change. And once we
are done producing text output, confirm that speculation was
successful, and if not - retry?

We'd need an API for seq_file to rollback to previous good known
location for that, but that seems straightforward enough to do, no?
Just remember buffer position before speculation, write data, check
for no mm modifications, and if something changed, rollback seq file
to last position.


> +   ret = mmap_read_lock_killable(priv->mm);
> +   if (!ret) {
> +   /* mmap_lock_speculate_try_begin() succeeds when holding 
> mmap_read_lock */
> +   mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq);
> +   mmap_read_unlock(priv->mm);
> +   ret = -EAGAIN;
> +   }
> +
> +   rcu_read_lock();
> +
> +   anon_vma_name_put_if_valid(copy);
> +put_file:
> +   if (copy->vm_file)
> +   fput(copy->vm_f