Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Mon, May 5, 2025 at 6:50 AM Christian Brauner wrote:
>
> On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan
> > wrote:
> > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote:
> > > >
> > > > Hi!
> > > >
> > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > > forgot all about that...)
> > >
> > > Does this fact affect your previous comments? Just want to make sure
> > > I'm not missing something...
> >
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
> >
> > I guess maybe that means you really do need to do the lookup from the
> > copied data, as you did in your patch; and that might require calling
> > get_file_active() on the copied ->vm_file pointer (instead of
> > get_file_rcu()), even though I think that is not really how
> > get_file_active() is supposed to be used (it's supposed to be used
> > when you know the original file hasn't been freed yet). Really what
>
> I think it's fine for get_file_active() to be used in this way. That
> ->vm_file pointer usage should get a fat comment above it explaining how
> what you're doing is safe.

Got it. Will use it in my next version. Thanks!

> > you'd want for that is basically a raw __get_file_rcu(), but that is
> > static and I think Christian wouldn't want to expose more of these
> > internals outside VFS...
>
> Yeah, no. I don't want that to be usable outside of that file.
>
> > (In that case, all the stuff below about get_file_rcu() would be moot.)
> > > > Or you could pepper WRITE_ONCE() over all the places that write > > ->vm_file, and ensure that ->vm_file is always NULLed before its > > reference is dropped... but that seems a bit more ugly to me. > > > > > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan > > > > wrote: > > > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan > > > > > > wrote: > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma > > > > > > > and > > > > > > > its important members being RCU-safe, /proc/pid/maps can be read > > > > > > > under > > > > > > > RCU and without the need to read-lock mmap_lock. However vma > > > > > > > content > > > > > > > can change from under us, therefore we make a copy of the vma and > > > > > > > we > > > > > > > pin pointer fields used when generating the output (currently only > > > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > > > space modifications, wait for them to end and retry. While we take > > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > > momentarily > > > > > > > only to record new mm_wr_seq counter. This change is designed to > > > > > > > reduce > > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps > > > > > > > files > > > > > > > (often a low priority task, such as monitoring/data collection > > > > > > > services) > > > > > > > from blocking address space updates. > > > > > > [...] > > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > > > --- a/fs/proc/task_mmu.c > > > > > > > +++ b/fs/proc/task_mmu.c > > > > > > [...] > > > > > > > +/* > > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are > > > > > > > used by > > > > > > > + * show_map_vma. 
> > > > > > > + */ > > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, > > > > > > > struct vm_area_struct *vma) > > > > > > > +{ > > > > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > > > > + int ret = -EAGAIN; > > > > > > > + > > > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > > > + goto out; > > > > > > > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > > > pointer which always points to a file with a non-zero refcount > > > > > > (except > > > > > > when it is NULL). That's why it takes a file** instead of a file* - > > > > > > if > > > > > > it observes a zero refcount, it assumes that the pointer must have > > > > > > been updated in the meantime, and retries. Calling get_file_rcu() > > > > > > on a > > > > > > pointer that points to a file with zero refcount, which I think can > > > > > > happen with this patch, will cause an endless loop.
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 29, 2025 at 08:54:58PM +0200, Jann Horn wrote:
> On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan wrote:
> > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote:
> > >
> > > Hi!
> > >
> > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > forgot all about that...)
> >
> > Does this fact affect your previous comments? Just want to make sure
> > I'm not missing something...
>
> When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> down a VMA, and using get_file_rcu() for the lockless lookup, I did
> not realize that you could actually also race with all the other
> places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> did not think about whether any of those code paths might leave a VMA
> with a dangling ->vm_file pointer.
>
> I guess maybe that means you really do need to do the lookup from the
> copied data, as you did in your patch; and that might require calling
> get_file_active() on the copied ->vm_file pointer (instead of
> get_file_rcu()), even though I think that is not really how
> get_file_active() is supposed to be used (it's supposed to be used
> when you know the original file hasn't been freed yet). Really what

I think it's fine for get_file_active() to be used in this way. That
->vm_file pointer usage should get a fat comment above it explaining how
what you're doing is safe.

> you'd want for that is basically a raw __get_file_rcu(), but that is
> static and I think Christian wouldn't want to expose more of these
> internals outside VFS...

Yeah, no. I don't want that to be usable outside of that file.

> (In that case, all the stuff below about get_file_rcu() would be moot.)
>
> Or you could pepper WRITE_ONCE() over all the places that write
> ->vm_file, and ensure that ->vm_file is always NULLed before its
> reference is dropped... but that seems a bit more ugly to me.
> > > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan > > > wrote: > > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan > > > > > wrote: > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > > > its important members being RCU-safe, /proc/pid/maps can be read > > > > > > under > > > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > > > can change from under us, therefore we make a copy of the vma and we > > > > > > pin pointer fields used when generating the output (currently only > > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > > space modifications, wait for them to end and retry. While we take > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > momentarily > > > > > > only to record new mm_wr_seq counter. This change is designed to > > > > > > reduce > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps > > > > > > files > > > > > > (often a low priority task, such as monitoring/data collection > > > > > > services) > > > > > > from blocking address space updates. > > > > > [...] > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > > --- a/fs/proc/task_mmu.c > > > > > > +++ b/fs/proc/task_mmu.c > > > > > [...] > > > > > > +/* > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are > > > > > > used by > > > > > > + * show_map_vma. 
> > > > > > + */ > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > > > > vm_area_struct *vma) > > > > > > +{ > > > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > > > + int ret = -EAGAIN; > > > > > > + > > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > > + goto out; > > > > > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > > pointer which always points to a file with a non-zero refcount (except > > > > > when it is NULL). That's why it takes a file** instead of a file* - if > > > > > it observes a zero refcount, it assumes that the pointer must have > > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a > > > > > pointer that points to a file with zero refcount, which I think can > > > > > happen with this patch, will cause an endless loop. > > > > > (Just as background: For other usecases, get_file_rcu() is supposed to > > > > > still behave nicely and not spuriously return NULL when the file* is > > > > > concurrently updated to point to another file*; that's what that loop > > > > > is for.) > > > > > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by > > > > checking the return value
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 29, 2025 at 10:33 PM Suren Baghdasaryan wrote:
> On Tue, Apr 29, 2025 at 11:55 AM Jann Horn wrote:
> > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan
> > wrote:
> > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote:
> > > >
> > > > Hi!
> > > >
> > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu
> > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I
> > > > forgot all about that...)
> > >
> > > Does this fact affect your previous comments? Just want to make sure
> > > I'm not missing something...
> >
> > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing
> > down a VMA, and using get_file_rcu() for the lockless lookup, I did
> > not realize that you could actually also race with all the other
> > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I
> > did not think about whether any of those code paths might leave a VMA
> > with a dangling ->vm_file pointer.
>
> So, let me summarize my understanding and see if it's correct.
>
> If we copy the original vma, ensure that it hasn't changed while we
> were copying (with mmap_lock_speculate_retry()) and then use
> get_file_rcu(&copy->vm_file) I think we are guaranteed no UAF because
> we are in RCU read section. At this point the only issue is that
> copy->vm_file might have lost its last refcount and get_file_rcu()
> would enter an infinite loop.

Yeah. (Using get_file_active() would avoid that.)

> So, to avoid that we have to use the
> original vma when calling get_file_rcu()

Sorry - I originally said that, but I didn't think about
SLAB_TYPESAFE_BY_RCU when I had that in mind.

> but then we should also
> ensure that vma itself does not change from under us due to
> SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change
> from under us we might end up accessing an invalid address if
> vma->vm_file is being modified concurrently.
Yeah, I think in theory we would have data races, since the file* reads in get_file_rcu() could race with all the (plain) ->vm_file pointer stores. So I guess it might actually be safer to use the copied VMA's ->vm_file for this, with get_file_active(). Though that would be abusing get_file_active() quite a bit, so brauner@ should probably look over this early and see whether he thinks that's acceptable... > > I guess maybe that means you really do need to do the lookup from the > > copied data, as you did in your patch; and that might require calling > > get_file_active() on the copied ->vm_file pointer (instead of > > get_file_rcu()), even though I think that is not really how > > get_file_active() is supposed to be used (it's supposed to be used > > when you know the original file hasn't been freed yet). Really what > > you'd want for that is basically a raw __get_file_rcu(), but that is > > static and I think Christian wouldn't want to expose more of these > > internals outside VFS... > > (In that case, all the stuff below about get_file_rcu() would be moot.) > > > > Or you could pepper WRITE_ONCE() over all the places that write > > ->vm_file, and ensure that ->vm_file is always NULLed before its > > reference is dropped... but that seems a bit more ugly to me. > > Ugh, yes. We either ensure no vma->vm_file tearing or use > __get_file_rcu() on a copy of the vma. Or we have to stabilize the vma > by locking it... Let me think about all these options. Thanks!
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 29, 2025 at 11:55 AM Jann Horn wrote: > > On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan wrote: > > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote: > > > > > > Hi! > > > > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > > > forgot all about that...) > > > > Does this fact affect your previous comments? Just want to make sure > > I'm not missing something... > > When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing > down a VMA, and using get_file_rcu() for the lockless lookup, I did > not realize that you could actually also race with all the other > places that set ->vm_file, like __mmap_new_file_vma() and so on; and I > did not think about whether any of those code paths might leave a VMA > with a dangling ->vm_file pointer. So, let me summarize my understanding and see if it's correct. If we copy the original vma, ensure that it hasn't changed while we were copying (with mmap_lock_speculate_retry()) and then use get_file_rcu(©->vm_file) I think we are guaranteed no UAF because we are in RCU read section. At this point the only issue is that copy->vm_file might have lost its last refcount and get_file_rcu() would enter an infinite loop. So, to avoid that we have to use the original vma when calling get_file_rcu() but then we should also ensure that vma itself does not change from under us due to SLAB_TYPESAFE_BY_RCU used for vm_area_struct cache. If it does change from under us we might end up accessing an invalid address if vma->vm_file is being modified concurrently. 
> > I guess maybe that means you really do need to do the lookup from the > copied data, as you did in your patch; and that might require calling > get_file_active() on the copied ->vm_file pointer (instead of > get_file_rcu()), even though I think that is not really how > get_file_active() is supposed to be used (it's supposed to be used > when you know the original file hasn't been freed yet). Really what > you'd want for that is basically a raw __get_file_rcu(), but that is > static and I think Christian wouldn't want to expose more of these > internals outside VFS... > (In that case, all the stuff below about get_file_rcu() would be moot.) > > Or you could pepper WRITE_ONCE() over all the places that write > ->vm_file, and ensure that ->vm_file is always NULLed before its > reference is dropped... but that seems a bit more ugly to me. Ugh, yes. We either ensure no vma->vm_file tearing or use __get_file_rcu() on a copy of the vma. Or we have to stabilize the vma by locking it... Let me think about all these options. Thanks! > > > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan > > > wrote: > > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan > > > > > wrote: > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > > > its important members being RCU-safe, /proc/pid/maps can be read > > > > > > under > > > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > > > can change from under us, therefore we make a copy of the vma and we > > > > > > pin pointer fields used when generating the output (currently only > > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > > space modifications, wait for them to end and retry. While we take > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > momentarily > > > > > > only to record new mm_wr_seq counter. 
This change is designed to > > > > > > reduce > > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps > > > > > > files > > > > > > (often a low priority task, such as monitoring/data collection > > > > > > services) > > > > > > from blocking address space updates. > > > > > [...] > > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > > --- a/fs/proc/task_mmu.c > > > > > > +++ b/fs/proc/task_mmu.c > > > > > [...] > > > > > > +/* > > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are > > > > > > used by > > > > > > + * show_map_vma. > > > > > > + */ > > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > > > > vm_area_struct *vma) > > > > > > +{ > > > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > > > + int ret = -EAGAIN; > > > > > > + > > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > > + goto out; > > > > > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > > pointer which always points to a file with a non-zero refcount (except > > > > > when it is NULL). That's why it takes a
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 29, 2025 at 8:04 PM Suren Baghdasaryan wrote: > On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote: > > > > Hi! > > > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > > forgot all about that...) > > Does this fact affect your previous comments? Just want to make sure > I'm not missing something... When I suggested using "WRITE_ONCE(vma->vm_file, NULL)" when tearing down a VMA, and using get_file_rcu() for the lockless lookup, I did not realize that you could actually also race with all the other places that set ->vm_file, like __mmap_new_file_vma() and so on; and I did not think about whether any of those code paths might leave a VMA with a dangling ->vm_file pointer. I guess maybe that means you really do need to do the lookup from the copied data, as you did in your patch; and that might require calling get_file_active() on the copied ->vm_file pointer (instead of get_file_rcu()), even though I think that is not really how get_file_active() is supposed to be used (it's supposed to be used when you know the original file hasn't been freed yet). Really what you'd want for that is basically a raw __get_file_rcu(), but that is static and I think Christian wouldn't want to expose more of these internals outside VFS... (In that case, all the stuff below about get_file_rcu() would be moot.) Or you could pepper WRITE_ONCE() over all the places that write ->vm_file, and ensure that ->vm_file is always NULLed before its reference is dropped... but that seems a bit more ugly to me. 
> > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan > > wrote: > > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan > > > > wrote: > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > > can change from under us, therefore we make a copy of the vma and we > > > > > pin pointer fields used when generating the output (currently only > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > space modifications, wait for them to end and retry. While we take > > > > > the mmap_lock for reading during such contention, we do that > > > > > momentarily > > > > > only to record new mm_wr_seq counter. This change is designed to > > > > > reduce > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps > > > > > files > > > > > (often a low priority task, such as monitoring/data collection > > > > > services) > > > > > from blocking address space updates. > > > > [...] > > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > > --- a/fs/proc/task_mmu.c > > > > > +++ b/fs/proc/task_mmu.c > > > > [...] > > > > > +/* > > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used > > > > > by > > > > > + * show_map_vma. > > > > > + */ > > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > > > vm_area_struct *vma) > > > > > +{ > > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > > + int ret = -EAGAIN; > > > > > + > > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > > + goto out; > > > > > > > > I think this uses get_file_rcu() in a different way than intended. 
> > > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > > pointer which always points to a file with a non-zero refcount (except > > > > when it is NULL). That's why it takes a file** instead of a file* - if > > > > it observes a zero refcount, it assumes that the pointer must have > > > > been updated in the meantime, and retries. Calling get_file_rcu() on a > > > > pointer that points to a file with zero refcount, which I think can > > > > happen with this patch, will cause an endless loop. > > > > (Just as background: For other usecases, get_file_rcu() is supposed to > > > > still behave nicely and not spuriously return NULL when the file* is > > > > concurrently updated to point to another file*; that's what that loop > > > > is for.) > > > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by > > > checking the return value of get_file_rcu() and retrying speculation > > > if it changed. > > I think you could probably still end up looping endlessly in get_file_rcu(). (Just to be clear: What I meant here is that get_file_rcu() loops *internally*; get_file_rcu() is not guaranteed to ever return if the pointed-to file has a zero refcount.) > By "retrying speculation" I meant it in the sense of > get_vma_snapshot() retry when it takes the mmap_read_lock and then > does mmap_lock_speculate_try_begin to restart speculation.
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 29, 2025 at 10:21 AM Jann Horn wrote: > > Hi! > > (I just noticed that I incorrectly assumed that VMAs use kfree_rcu > (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I > forgot all about that...) Does this fact affect your previous comments? Just want to make sure I'm not missing something... > > On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan wrote: > > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan > > > wrote: > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > can change from under us, therefore we make a copy of the vma and we > > > > pin pointer fields used when generating the output (currently only > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > space modifications, wait for them to end and retry. While we take > > > > the mmap_lock for reading during such contention, we do that momentarily > > > > only to record new mm_wr_seq counter. This change is designed to reduce > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > > > (often a low priority task, such as monitoring/data collection services) > > > > from blocking address space updates. > > > [...] > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > > --- a/fs/proc/task_mmu.c > > > > +++ b/fs/proc/task_mmu.c > > > [...] > > > > +/* > > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > > > + * show_map_vma. 
> > > > + */ > > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > > vm_area_struct *vma) > > > > +{ > > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > > + int ret = -EAGAIN; > > > > + > > > > + memcpy(copy, vma, sizeof(*vma)); > > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > > + goto out; > > > > > > I think this uses get_file_rcu() in a different way than intended. > > > > > > As I understand it, get_file_rcu() is supposed to be called on a > > > pointer which always points to a file with a non-zero refcount (except > > > when it is NULL). That's why it takes a file** instead of a file* - if > > > it observes a zero refcount, it assumes that the pointer must have > > > been updated in the meantime, and retries. Calling get_file_rcu() on a > > > pointer that points to a file with zero refcount, which I think can > > > happen with this patch, will cause an endless loop. > > > (Just as background: For other usecases, get_file_rcu() is supposed to > > > still behave nicely and not spuriously return NULL when the file* is > > > concurrently updated to point to another file*; that's what that loop > > > is for.) > > > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by > > checking the return value of get_file_rcu() and retrying speculation > > if it changed. > > I think you could probably still end up looping endlessly in get_file_rcu(). By "retrying speculation" I meant it in the sense of get_vma_snapshot() retry when it takes the mmap_read_lock and then does mmap_lock_speculate_try_begin to restart speculation. I'm also thinking about Liam's concern of guaranteeing forward progress for the reader. Thinking maybe I should not drop mmap_read_lock immediately on contention but generate some output (one vma or one page worth of vmas) before dropping mmap_read_lock and proceeding with speculation. > > > > (If my understanding is correct, maybe we should document that more > > > explicitly...) 
> > > > Good point. I'll add comments for get_file_rcu() as a separate patch. > > > > > > > > Also, I think you are introducing an implicit assumption that > > > remove_vma() does not NULL out the ->vm_file pointer (because that > > > could cause tearing and could theoretically lead to a torn pointer > > > being accessed here). > > > > > > One alternative might be to change the paths that drop references to > > > vma->vm_file (search for vma_close to find them) such that they first > > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, > > > maybe with a new helper like this: > > > > > > static void vma_fput(struct vm_area_struct *vma) > > > { > > > struct file *file = vma->vm_file; > > > > > > if (file) { > > > WRITE_ONCE(vma->vm_file, NULL); > > > fput(file); > > > } > > > } > > > > > > Then on the lockless lookup path you could use get_file_rcu() on the > > > ->vm_file pointer _of the original VMA_, and store the returned file* > > > into copy->vm_file. > > > > Ack. Except for storing the return value of get_file_rcu(). I think > > once we detect that get_file_rcu() returns a different file we should > > bail out and retry. The change in file is an indication that the vma > > got changed from under us, so whatever we have is stale.
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
Hi! (I just noticed that I incorrectly assumed that VMAs use kfree_rcu (not SLAB_TYPESAFE_BY_RCU) when I wrote my review of this, somehow I forgot all about that...) On Tue, Apr 29, 2025 at 7:09 PM Suren Baghdasaryan wrote: > On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan > > wrote: > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > RCU and without the need to read-lock mmap_lock. However vma content > > > can change from under us, therefore we make a copy of the vma and we > > > pin pointer fields used when generating the output (currently only > > > vm_file and anon_name). Afterwards we check for concurrent address > > > space modifications, wait for them to end and retry. While we take > > > the mmap_lock for reading during such contention, we do that momentarily > > > only to record new mm_wr_seq counter. This change is designed to reduce > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > > (often a low priority task, such as monitoring/data collection services) > > > from blocking address space updates. > > [...] > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > [...] > > > +/* > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > > + * show_map_vma. > > > + */ > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > vm_area_struct *vma) > > > +{ > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > + int ret = -EAGAIN; > > > + > > > + memcpy(copy, vma, sizeof(*vma)); > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > + goto out; > > > > I think this uses get_file_rcu() in a different way than intended. 
> > > > As I understand it, get_file_rcu() is supposed to be called on a > > pointer which always points to a file with a non-zero refcount (except > > when it is NULL). That's why it takes a file** instead of a file* - if > > it observes a zero refcount, it assumes that the pointer must have > > been updated in the meantime, and retries. Calling get_file_rcu() on a > > pointer that points to a file with zero refcount, which I think can > > happen with this patch, will cause an endless loop. > > (Just as background: For other usecases, get_file_rcu() is supposed to > > still behave nicely and not spuriously return NULL when the file* is > > concurrently updated to point to another file*; that's what that loop > > is for.) > > Ah, I see. I wasn't aware of this subtlety. I think this is fixable by > checking the return value of get_file_rcu() and retrying speculation > if it changed. I think you could probably still end up looping endlessly in get_file_rcu(). > > (If my understanding is correct, maybe we should document that more > > explicitly...) > > Good point. I'll add comments for get_file_rcu() as a separate patch. > > > > > Also, I think you are introducing an implicit assumption that > > remove_vma() does not NULL out the ->vm_file pointer (because that > > could cause tearing and could theoretically lead to a torn pointer > > being accessed here). > > > > One alternative might be to change the paths that drop references to > > vma->vm_file (search for vma_close to find them) such that they first > > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, > > maybe with a new helper like this: > > > > static void vma_fput(struct vm_area_struct *vma) > > { > > struct file *file = vma->vm_file; > > > > if (file) { > > WRITE_ONCE(vma->vm_file, NULL); > > fput(file); > > } > > } > > > > Then on the lockless lookup path you could use get_file_rcu() on the > > ->vm_file pointer _of the original VMA_, and store the returned file* > > into copy->vm_file. 
> > Ack. Except for storing the return value of get_file_rcu(). I think > once we detect that get_file_rcu() returns a different file we should > bail out and retry. The change in file is an indication that the vma > got changed from under us, so whatever we have is stale. What does "different file" mean here - what file* would you compare the returned one against?
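The retry semantics under discussion can be illustrated with a small userspace sketch. This is an analogy in C11 atomics, not the kernel's actual get_file_rcu() implementation; the names get_ref_retry and struct file here are stand-ins. The key point is that the helper takes the *slot* (a file **) because, on observing a zero refcount, it assumes the slot was concurrently repointed and re-reads it. If the slot in fact keeps pointing at a dead object, the kernel's unbounded loop would never terminate; the sketch bounds the retries only so the hazard is observable:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct file { atomic_int refcount; };

/* Hypothetical analogue of get_file_rcu(): takes the slot (file **),
 * because on seeing a zero refcount it assumes the slot was updated
 * concurrently and re-reads it.  The real kernel loop is unbounded;
 * this sketch caps the retries so the failure mode is visible. */
static struct file *get_ref_retry(struct file *_Atomic *slot, int max_tries)
{
    while (max_tries-- > 0) {
        struct file *f = atomic_load(slot);

        if (!f)
            return NULL;
        /* refcount_inc_not_zero(): take a reference only if still live */
        for (int old = atomic_load(&f->refcount); old > 0; )
            if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
                return f;
        /* refcount was zero: assume the slot changed and retry */
    }
    return NULL; /* slot kept pointing at a dead object */
}
```

With SLAB_TYPESAFE_BY_RCU, a stale pointer in a copied vma can legitimately reference a file whose refcount has dropped to zero, which is exactly the case where this loop never exits on its own.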
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 29, 2025 at 8:40 AM Jann Horn wrote: > > On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan wrote: > > With maple_tree supporting vma tree traversal under RCU and vma and > > its important members being RCU-safe, /proc/pid/maps can be read under > > RCU and without the need to read-lock mmap_lock. However vma content > > can change from under us, therefore we make a copy of the vma and we > > pin pointer fields used when generating the output (currently only > > vm_file and anon_name). Afterwards we check for concurrent address > > space modifications, wait for them to end and retry. While we take > > the mmap_lock for reading during such contention, we do that momentarily > > only to record new mm_wr_seq counter. This change is designed to reduce > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > (often a low priority task, such as monitoring/data collection services) > > from blocking address space updates. > [...] > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > [...] > > +/* > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > + * show_map_vma. > > + */ > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > vm_area_struct *vma) > > +{ > > + struct vm_area_struct *copy = &priv->vma_copy; > > + int ret = -EAGAIN; > > + > > + memcpy(copy, vma, sizeof(*vma)); > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > + goto out; > > I think this uses get_file_rcu() in a different way than intended. > > As I understand it, get_file_rcu() is supposed to be called on a > pointer which always points to a file with a non-zero refcount (except > when it is NULL). That's why it takes a file** instead of a file* - if > it observes a zero refcount, it assumes that the pointer must have > been updated in the meantime, and retries. 
Calling get_file_rcu() on a > pointer that points to a file with zero refcount, which I think can > happen with this patch, will cause an endless loop. > (Just as background: For other usecases, get_file_rcu() is supposed to > still behave nicely and not spuriously return NULL when the file* is > concurrently updated to point to another file*; that's what that loop > is for.) Ah, I see. I wasn't aware of this subtlety. I think this is fixable by checking the return value of get_file_rcu() and retrying speculation if it changed. > (If my understanding is correct, maybe we should document that more > explicitly...) Good point. I'll add comments for get_file_rcu() as a separate patch. > > Also, I think you are introducing an implicit assumption that > remove_vma() does not NULL out the ->vm_file pointer (because that > could cause tearing and could theoretically lead to a torn pointer > being accessed here). > > One alternative might be to change the paths that drop references to > vma->vm_file (search for vma_close to find them) such that they first > NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, > maybe with a new helper like this: > > static void vma_fput(struct vm_area_struct *vma) > { > struct file *file = vma->vm_file; > > if (file) { > WRITE_ONCE(vma->vm_file, NULL); > fput(file); > } > } > > Then on the lockless lookup path you could use get_file_rcu() on the > ->vm_file pointer _of the original VMA_, and store the returned file* > into copy->vm_file. Ack. Except for storing the return value of get_file_rcu(). I think once we detect that get_file_rcu() returns a different file we should bail out and retry. The change in file is an indication that the vma got changed from under us, so whatever we have is stale. 
> > > + if (!anon_vma_name_get_if_valid(copy)) > > + goto put_file; > > + > > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) > > + return 0; > > We only check for concurrent updates at this point, so up to here, > anything we read from "copy" could contain torn pointers (both because > memcpy() is not guaranteed to copy pointers atomically and because the > updates to the original VMA are not done with WRITE_ONCE()). > That probably means that something like the preceding > anon_vma_name_get_if_valid() could crash on an access to a torn > pointer. > Please either do another mmap_lock_speculate_retry() check directly > after the memcpy(), or ensure nothing before this point reads from > "copy". Ack. I'll add mmap_lock_speculate_retry() check right after memcpy(). > > > + /* Address space got modified, vma might be stale. Re-lock and > > retry. */ > > + rcu_read_unlock(); > > + ret = mmap_read_lock_killable(priv->mm); > > + if (!ret) { > > + /* mmap_lock_speculate_try_begin() succeeds when holding > > mmap_read_lock */ > > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq); > > +
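The teardown ordering proposed above (publish NULL first, drop the reference after) can be sketched in userspace terms. This is an analogy, not the kernel helper: atomic_store stands in for WRITE_ONCE(), put_file for fput(), and the struct layouts are assumptions for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct file { atomic_int refcount; };

struct vm_area { struct file *_Atomic vm_file; };

static void put_file(struct file *f)
{
    if (atomic_fetch_sub(&f->refcount, 1) == 1)
        free(f); /* last reference dropped */
}

/* Analogue of the proposed vma_fput(): publish NULL before dropping the
 * reference, so a lockless reader re-reading ->vm_file sees either the
 * still-refcounted old file or NULL -- never a freed object via this slot. */
static void vma_fput(struct vm_area *vma)
{
    struct file *f = atomic_load(&vma->vm_file);

    if (f) {
        atomic_store(&vma->vm_file, NULL); /* like WRITE_ONCE(vma->vm_file, NULL) */
        put_file(f);                       /* fput() only after NULL is published */
    }
}
```

The ordering matters: if the fput() came first, a reader that loaded the pointer just before the store could try to take a reference on an already-freed object through this slot.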
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Fri, Apr 18, 2025 at 7:50 PM Suren Baghdasaryan wrote: > With maple_tree supporting vma tree traversal under RCU and vma and > its important members being RCU-safe, /proc/pid/maps can be read under > RCU and without the need to read-lock mmap_lock. However vma content > can change from under us, therefore we make a copy of the vma and we > pin pointer fields used when generating the output (currently only > vm_file and anon_name). Afterwards we check for concurrent address > space modifications, wait for them to end and retry. While we take > the mmap_lock for reading during such contention, we do that momentarily > only to record new mm_wr_seq counter. This change is designed to reduce > mmap_lock contention and prevent a process reading /proc/pid/maps files > (often a low priority task, such as monitoring/data collection services) > from blocking address space updates. [...] > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index b9e4fbbdf6e6..f9d50a61167c 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c [...] > +/* > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > + * show_map_vma. > + */ > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > vm_area_struct *vma) > +{ > + struct vm_area_struct *copy = &priv->vma_copy; > + int ret = -EAGAIN; > + > + memcpy(copy, vma, sizeof(*vma)); > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > + goto out; I think this uses get_file_rcu() in a different way than intended. As I understand it, get_file_rcu() is supposed to be called on a pointer which always points to a file with a non-zero refcount (except when it is NULL). That's why it takes a file** instead of a file* - if it observes a zero refcount, it assumes that the pointer must have been updated in the meantime, and retries. Calling get_file_rcu() on a pointer that points to a file with zero refcount, which I think can happen with this patch, will cause an endless loop. 
(Just as background: For other usecases, get_file_rcu() is supposed to still behave nicely and not spuriously return NULL when the file* is concurrently updated to point to another file*; that's what that loop is for.) (If my understanding is correct, maybe we should document that more explicitly...) Also, I think you are introducing an implicit assumption that remove_vma() does not NULL out the ->vm_file pointer (because that could cause tearing and could theoretically lead to a torn pointer being accessed here). One alternative might be to change the paths that drop references to vma->vm_file (search for vma_close to find them) such that they first NULL out ->vm_file with a WRITE_ONCE() and do the fput() after that, maybe with a new helper like this: static void vma_fput(struct vm_area_struct *vma) { struct file *file = vma->vm_file; if (file) { WRITE_ONCE(vma->vm_file, NULL); fput(file); } } Then on the lockless lookup path you could use get_file_rcu() on the ->vm_file pointer _of the original VMA_, and store the returned file* into copy->vm_file. > + if (!anon_vma_name_get_if_valid(copy)) > + goto put_file; > + > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) > + return 0; We only check for concurrent updates at this point, so up to here, anything we read from "copy" could contain torn pointers (both because memcpy() is not guaranteed to copy pointers atomically and because the updates to the original VMA are not done with WRITE_ONCE()). That probably means that something like the preceding anon_vma_name_get_if_valid() could crash on an access to a torn pointer. Please either do another mmap_lock_speculate_retry() check directly after the memcpy(), or ensure nothing before this point reads from "copy". > + /* Address space got modified, vma might be stale. Re-lock and retry. 
> */ > + rcu_read_unlock(); > + ret = mmap_read_lock_killable(priv->mm); > + if (!ret) { > + /* mmap_lock_speculate_try_begin() succeeds when holding > mmap_read_lock */ > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq); > + mmap_read_unlock(priv->mm); > + ret = -EAGAIN; > + } > + > + rcu_read_lock(); > + > + anon_vma_name_put_if_valid(copy); > +put_file: > + if (copy->vm_file) > + fput(copy->vm_file); > +out: > + return ret; > +} [...] > @@ -266,39 +399,41 @@ static void get_vma_name(struct vm_area_struct *vma, > } else { > *path = file_user_path(vma->vm_file); > } > - return; > + goto out; > } > > if (vma->vm_ops && vma->vm_ops->name) { > *name = vma->vm_ops->name(vma); This seems to me like a big, subtle change of semantics. After this change, vm_ops->name() will no longer receive a real VMA; and in particular, I think the .name impl
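The "re-check the sequence right after the memcpy()" rule from the review above is essentially the classic seqcount read pattern. The sketch below is a userspace analogy with assumed names (struct mm_like, snapshot_vma), not the mmap_lock_speculate_*() API; the data race on the memcpy() itself is intentional, since the point is that a torn copy is detected and discarded before anything in it is dereferenced:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

struct vma_data { unsigned long start, end; void *file; };

struct mm_like {
    atomic_uint seq;      /* stands in for mm_wr_seq: odd = writer active */
    struct vma_data vma;
};

/* Sketch of the "validate right after memcpy()" rule: the copy --
 * including its pointer fields -- must not be trusted or dereferenced
 * until the sequence count confirms no writer ran during the copy. */
static int snapshot_vma(struct mm_like *mm, struct vma_data *copy)
{
    unsigned int begin = atomic_load(&mm->seq);

    if (begin & 1)
        return -1;                /* writer in progress: caller retries */
    memcpy(copy, &mm->vma, sizeof(*copy));
    atomic_thread_fence(memory_order_acquire);
    if (atomic_load(&mm->seq) != begin)
        return -1;                /* copy may be torn: discard and retry */
    return 0;                     /* consistent snapshot; safe to use */
}
```

In the patch's terms, anon_vma_name_get_if_valid(copy) dereferences a pointer from the copy, so it must run only after such a check has confirmed the copy is not torn.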
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett wrote: > > * Andrii Nakryiko [250424 12:04]: > > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan > > wrote: > > > > > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett > > > wrote: > > > > > > > > * Andrii Nakryiko [250423 18:06]: > > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan > > > > > wrote: > > > > > > > > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > > > > > > wrote: > > > > > > > > > > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma > > > > > > > > and > > > > > > > > its important members being RCU-safe, /proc/pid/maps can be > > > > > > > > read under > > > > > > > > RCU and without the need to read-lock mmap_lock. However vma > > > > > > > > content > > > > > > > > can change from under us, therefore we make a copy of the vma > > > > > > > > and we > > > > > > > > pin pointer fields used when generating the output (currently > > > > > > > > only > > > > > > > > vm_file and anon_name). Afterwards we check for concurrent > > > > > > > > address > > > > > > > > space modifications, wait for them to end and retry. While we > > > > > > > > take > > > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > > > momentarily > > > > > > > > only to record new mm_wr_seq counter. This change is designed > > > > > > > > to reduce > > > > > > > > > > > > > > This is probably a stupid question, but why do we need to take a > > > > > > > lock > > > > > > > just to record this counter? uprobes get away without taking > > > > > > > mmap_lock > > > > > > > even for reads, and still record this seq counter. And then detect > > > > > > > whether there were any modifications in between. Why does this > > > > > > > change > > > > > > > need more heavy-weight mmap_read_lock to do speculative reads? > > > > > > > > > > > > Not a stupid question. 
mmap_read_lock() is used to wait for the > > > > > > writer > > > > > > to finish what it's doing and then we continue by recording a new > > > > > > sequence counter value and call mmap_read_unlock. This is what > > > > > > get_vma_snapshot() does. But your question made me realize that we > > > > > > can > > > > > > optimize m_start() further by not taking mmap_read_lock at all. > > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same > > > > > > dance we do in the get_vma_snapshot(). I think that should work. > > > > > > > > > > Ok, yeah, it would be great to avoid taking a lock in a common case! > > > > > > > > We can check this counter once per 4k block and maintain the same > > > > 'tearing' that exists today instead of per-vma. Not that anyone said > > > > they had an issue with changing it, but since we're on this road anyways > > > > I'd thought I'd point out where we could end up. > > > > > > We would need to run that check on the last call to show_map() right > > > before seq_file detects the overflow and flushes the page. On > > > contention we will also be throwing away more prepared data (up to a > > > page worth of records) vs only the last record. All in all I'm not > > > convinced this is worth doing unless increased chances of data tearing > > > is identified as a problem. > > > > > > > Yep, I agree, with filling out 4K of data we run into much higher > > chances of conflict, IMO. Not worth it, I'd say. > > Sounds good. > > If this is an issue we do have a path forward still. Although it's less > desirable. > > > > > > > > > > > I am concerned about live locking in either scenario, but I haven't > > > > looked too deep into this pattern. > > > > > > > > I also don't love (as usual) the lack of ensured forward progress. > > > > > > Hmm. 
Maybe we should add a retry limit on > > > mmap_lock_speculate_try_begin() and once the limit is hit we just take > > > the mmap_read_lock and proceed with it? That would prevent a > > > hyperactive writer from blocking the reader's forward progress > > > indefinitely. > > > > Came here to say the same. I'd add a small number of retries (3-5?) > > and then fallback to the read-locked approach. The main challenge is > > to keep all this logic nicely isolated from the main VMA > > search/printing logic. > > > > For a similar pattern in uprobes, we don't even bother to rety, we > > just fallback to mmap_read_lock and proceed, under the assumption that > > this is going to be very rare and thus not important from the overall > > performance perspective. > > In this problem space we are dealing with a herd of readers caused by > writers delaying an ever-growing line of readers, right? I'm assuming that the common case is there is no writer, we attempt lockless vma read, but then (very rarely) writer comes in and starts to change something, disrupting the read. In uprobe vma loo
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Thu, Apr 24, 2025 at 9:42 AM Liam R. Howlett wrote: > > * Andrii Nakryiko [250424 12:04]: > > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan > > wrote: > > > > > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett > > > wrote: > > > > > > > > * Andrii Nakryiko [250423 18:06]: > > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan > > > > > wrote: > > > > > > > > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > > > > > > wrote: > > > > > > > > > > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma > > > > > > > > and > > > > > > > > its important members being RCU-safe, /proc/pid/maps can be > > > > > > > > read under > > > > > > > > RCU and without the need to read-lock mmap_lock. However vma > > > > > > > > content > > > > > > > > can change from under us, therefore we make a copy of the vma > > > > > > > > and we > > > > > > > > pin pointer fields used when generating the output (currently > > > > > > > > only > > > > > > > > vm_file and anon_name). Afterwards we check for concurrent > > > > > > > > address > > > > > > > > space modifications, wait for them to end and retry. While we > > > > > > > > take > > > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > > > momentarily > > > > > > > > only to record new mm_wr_seq counter. This change is designed > > > > > > > > to reduce > > > > > > > > > > > > > > This is probably a stupid question, but why do we need to take a > > > > > > > lock > > > > > > > just to record this counter? uprobes get away without taking > > > > > > > mmap_lock > > > > > > > even for reads, and still record this seq counter. And then detect > > > > > > > whether there were any modifications in between. Why does this > > > > > > > change > > > > > > > need more heavy-weight mmap_read_lock to do speculative reads? > > > > > > > > > > > > Not a stupid question. 
mmap_read_lock() is used to wait for the > > > > > > writer > > > > > > to finish what it's doing and then we continue by recording a new > > > > > > sequence counter value and call mmap_read_unlock. This is what > > > > > > get_vma_snapshot() does. But your question made me realize that we > > > > > > can > > > > > > optimize m_start() further by not taking mmap_read_lock at all. > > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same > > > > > > dance we do in the get_vma_snapshot(). I think that should work. > > > > > > > > > > Ok, yeah, it would be great to avoid taking a lock in a common case! > > > > > > > > We can check this counter once per 4k block and maintain the same > > > > 'tearing' that exists today instead of per-vma. Not that anyone said > > > > they had an issue with changing it, but since we're on this road anyways > > > > I'd thought I'd point out where we could end up. > > > > > > We would need to run that check on the last call to show_map() right > > > before seq_file detects the overflow and flushes the page. On > > > contention we will also be throwing away more prepared data (up to a > > > page worth of records) vs only the last record. All in all I'm not > > > convinced this is worth doing unless increased chances of data tearing > > > is identified as a problem. > > > > > > > Yep, I agree, with filling out 4K of data we run into much higher > > chances of conflict, IMO. Not worth it, I'd say. > > Sounds good. > > If this is an issue we do have a path forward still. Although it's less > desirable. > > > > > > > > > > > I am concerned about live locking in either scenario, but I haven't > > > > looked too deep into this pattern. > > > > > > > > I also don't love (as usual) the lack of ensured forward progress. > > > > > > Hmm. 
Maybe we should add a retry limit on > > > mmap_lock_speculate_try_begin() and once the limit is hit we just take > > > the mmap_read_lock and proceed with it? That would prevent a > > > hyperactive writer from blocking the reader's forward progress > > > indefinitely. > > > > Came here to say the same. I'd add a small number of retries (3-5?) > > and then fallback to the read-locked approach. The main challenge is > > to keep all this logic nicely isolated from the main VMA > > search/printing logic. > > > > For a similar pattern in uprobes, we don't even bother to rety, we > > just fallback to mmap_read_lock and proceed, under the assumption that > > this is going to be very rare and thus not important from the overall > > performance perspective. > > In this problem space we are dealing with a herd of readers caused by > writers delaying an ever-growing line of readers, right? I don't know if we have a herd of readers. The problem statement was for a low-priority reader (monitors) blocking a high-priority writer. Multiple readers are of course possible, so I think as lo
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
* Andrii Nakryiko [250424 12:04]: > On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan wrote: > > > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett > > wrote: > > > > > > * Andrii Nakryiko [250423 18:06]: > > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan > > > > wrote: > > > > > > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > > > > > wrote: > > > > > > > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > > > > > wrote: > > > > > > > > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma > > > > > > > and > > > > > > > its important members being RCU-safe, /proc/pid/maps can be read > > > > > > > under > > > > > > > RCU and without the need to read-lock mmap_lock. However vma > > > > > > > content > > > > > > > can change from under us, therefore we make a copy of the vma and > > > > > > > we > > > > > > > pin pointer fields used when generating the output (currently only > > > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > > > space modifications, wait for them to end and retry. While we take > > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > > momentarily > > > > > > > only to record new mm_wr_seq counter. This change is designed to > > > > > > > reduce > > > > > > > > > > > > This is probably a stupid question, but why do we need to take a > > > > > > lock > > > > > > just to record this counter? uprobes get away without taking > > > > > > mmap_lock > > > > > > even for reads, and still record this seq counter. And then detect > > > > > > whether there were any modifications in between. Why does this > > > > > > change > > > > > > need more heavy-weight mmap_read_lock to do speculative reads? > > > > > > > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer > > > > > to finish what it's doing and then we continue by recording a new > > > > > sequence counter value and call mmap_read_unlock. 
This is what > > > > > get_vma_snapshot() does. But your question made me realize that we can > > > > > optimize m_start() further by not taking mmap_read_lock at all. > > > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same > > > > > dance we do in the get_vma_snapshot(). I think that should work. > > > > > > > > Ok, yeah, it would be great to avoid taking a lock in a common case! > > > > > > We can check this counter once per 4k block and maintain the same > > > 'tearing' that exists today instead of per-vma. Not that anyone said > > > they had an issue with changing it, but since we're on this road anyways > > > I'd thought I'd point out where we could end up. > > > > We would need to run that check on the last call to show_map() right > > before seq_file detects the overflow and flushes the page. On > > contention we will also be throwing away more prepared data (up to a > > page worth of records) vs only the last record. All in all I'm not > > convinced this is worth doing unless increased chances of data tearing > > is identified as a problem. > > > > Yep, I agree, with filling out 4K of data we run into much higher > chances of conflict, IMO. Not worth it, I'd say. Sounds good. If this is an issue we do have a path forward still. Although it's less desirable. > > > > > > > I am concerned about live locking in either scenario, but I haven't > > > looked too deep into this pattern. > > > > > > I also don't love (as usual) the lack of ensured forward progress. > > > > Hmm. Maybe we should add a retry limit on > > mmap_lock_speculate_try_begin() and once the limit is hit we just take > > the mmap_read_lock and proceed with it? That would prevent a > > hyperactive writer from blocking the reader's forward progress > > indefinitely. > > Came here to say the same. I'd add a small number of retries (3-5?) > and then fallback to the read-locked approach. 
The main challenge is > to keep all this logic nicely isolated from the main VMA > search/printing logic. > > For a similar pattern in uprobes, we don't even bother to rety, we > just fallback to mmap_read_lock and proceed, under the assumption that > this is going to be very rare and thus not important from the overall > performance perspective. In this problem space we are dealing with a herd of readers caused by writers delaying an ever-growing line of readers, right? Assuming there is a backup caused by a writer, then I don't know if the retry is going to do anything more than heat the data centre. The readers that take the read lock will get the data, while the others who arrive during read locked time can try lockless, but will most likely have a run time that extends beyond the readers holding the lock and will probably be interrupted by the writer. We can predict the new readers will also not make it through in time because the earlier ones failed. The new readers will then
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Thu, Apr 24, 2025 at 8:20 AM Suren Baghdasaryan wrote: > > On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett > wrote: > > > > * Andrii Nakryiko [250423 18:06]: > > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan > > > wrote: > > > > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > > > > wrote: > > > > > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > > > > wrote: > > > > > > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > > > its important members being RCU-safe, /proc/pid/maps can be read > > > > > > under > > > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > > > can change from under us, therefore we make a copy of the vma and we > > > > > > pin pointer fields used when generating the output (currently only > > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > > space modifications, wait for them to end and retry. While we take > > > > > > the mmap_lock for reading during such contention, we do that > > > > > > momentarily > > > > > > only to record new mm_wr_seq counter. This change is designed to > > > > > > reduce > > > > > > > > > > This is probably a stupid question, but why do we need to take a lock > > > > > just to record this counter? uprobes get away without taking mmap_lock > > > > > even for reads, and still record this seq counter. And then detect > > > > > whether there were any modifications in between. Why does this change > > > > > need more heavy-weight mmap_read_lock to do speculative reads? > > > > > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer > > > > to finish what it's doing and then we continue by recording a new > > > > sequence counter value and call mmap_read_unlock. This is what > > > > get_vma_snapshot() does. But your question made me realize that we can > > > > optimize m_start() further by not taking mmap_read_lock at all. 
> > > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > > > > try mmap_lock_speculate_try_begin() and only if it fails do the same > > > > dance we do in the get_vma_snapshot(). I think that should work. > > > > > > Ok, yeah, it would be great to avoid taking a lock in a common case! > > > > We can check this counter once per 4k block and maintain the same > > 'tearing' that exists today instead of per-vma. Not that anyone said > > they had an issue with changing it, but since we're on this road anyways > > I'd thought I'd point out where we could end up. > > We would need to run that check on the last call to show_map() right > before seq_file detects the overflow and flushes the page. On > contention we will also be throwing away more prepared data (up to a > page worth of records) vs only the last record. All in all I'm not > convinced this is worth doing unless increased chances of data tearing > is identified as a problem. > Yep, I agree, with filling out 4K of data we run into much higher chances of conflict, IMO. Not worth it, I'd say. > > > > I am concerned about live locking in either scenario, but I haven't > > looked too deep into this pattern. > > > > I also don't love (as usual) the lack of ensured forward progress. > > Hmm. Maybe we should add a retry limit on > mmap_lock_speculate_try_begin() and once the limit is hit we just take > the mmap_read_lock and proceed with it? That would prevent a > hyperactive writer from blocking the reader's forward progress > indefinitely. Came here to say the same. I'd add a small number of retries (3-5?) and then fall back to the read-locked approach. The main challenge is to keep all this logic nicely isolated from the main VMA search/printing logic. For a similar pattern in uprobes, we don't even bother to retry, we just fall back to mmap_read_lock and proceed, under the assumption that this is going to be very rare and thus not important from the overall performance perspective. 
> > > > > It seems like we have four cases for the vm area state now: > > 1. we want to read a stable vma or set of vmas (per-vma locking) > > 2. we want to read a stable mm state for reading (the very short named > > mmap_lock_speculate_try_begin) > > and we don't mind retrying on contention. This one should be done > under RCU protection. > > > 3. we ensure a stable vma/mm state for reading (mmap read lock) > > 4. we are writing - get out of my way (mmap write lock). > > I wouldn't call #2 a vma state. More of a usecase when we want to read > vma under RCU (valid but can change from under us) and then retry if > it might have been modified from under us. > > > > > Cheers, > > Liam > >
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Wed, Apr 23, 2025 at 5:24 PM Liam R. Howlett wrote: > > * Andrii Nakryiko [250423 18:06]: > > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan > > wrote: > > > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > > > wrote: > > > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > > > wrote: > > > > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > > can change from under us, therefore we make a copy of the vma and we > > > > > pin pointer fields used when generating the output (currently only > > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > > space modifications, wait for them to end and retry. While we take > > > > > the mmap_lock for reading during such contention, we do that > > > > > momentarily > > > > > only to record new mm_wr_seq counter. This change is designed to > > > > > reduce > > > > > > > > This is probably a stupid question, but why do we need to take a lock > > > > just to record this counter? uprobes get away without taking mmap_lock > > > > even for reads, and still record this seq counter. And then detect > > > > whether there were any modifications in between. Why does this change > > > > need more heavy-weight mmap_read_lock to do speculative reads? > > > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer > > > to finish what it's doing and then we continue by recording a new > > > sequence counter value and call mmap_read_unlock. This is what > > > get_vma_snapshot() does. But your question made me realize that we can > > > optimize m_start() further by not taking mmap_read_lock at all. 
> > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > > > try mmap_lock_speculate_try_begin() and only if it fails do the same > > > dance we do in the get_vma_snapshot(). I think that should work. > > > > Ok, yeah, it would be great to avoid taking a lock in a common case! > > We can check this counter once per 4k block and maintain the same > 'tearing' that exists today instead of per-vma. Not that anyone said > they had an issue with changing it, but since we're on this road anyways > I'd thought I'd point out where we could end up. We would need to run that check on the last call to show_map() right before seq_file detects the overflow and flushes the page. On contention we will also be throwing away more prepared data (up to a page worth of records) vs only the last record. All in all I'm not convinced this is worth doing unless increased chances of data tearing is identified as a problem. > > I am concerned about live locking in either scenario, but I haven't > looked too deep into this pattern. > > I also don't love (as usual) the lack of ensured forward progress. Hmm. Maybe we should add a retry limit on mmap_lock_speculate_try_begin() and once the limit is hit we just take the mmap_read_lock and proceed with it? That would prevent a hyperactive writer from blocking the reader's forward progress indefinitely. > > It seems like we have four cases for the vm area state now: > 1. we want to read a stable vma or set of vmas (per-vma locking) > 2. we want to read a stable mm state for reading (the very short named > mmap_lock_speculate_try_begin) and we don't mind retrying on contention. This one should be done under RCU protection. > 3. we ensure a stable vma/mm state for reading (mmap read lock) > 4. we are writing - get out of my way (mmap write lock). I wouldn't call #2 a vma state. More of a usecase when we want to read vma under RCU (valid but can change from under us) and then retry if it might have been modified from under us. 
> > Cheers, > Liam
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
* Andrii Nakryiko [250423 18:06]: > On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan wrote: > > > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > > wrote: > > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > > wrote: > > > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > > RCU and without the need to read-lock mmap_lock. However vma content > > > > can change from under us, therefore we make a copy of the vma and we > > > > pin pointer fields used when generating the output (currently only > > > > vm_file and anon_name). Afterwards we check for concurrent address > > > > space modifications, wait for them to end and retry. While we take > > > > the mmap_lock for reading during such contention, we do that momentarily > > > > only to record new mm_wr_seq counter. This change is designed to reduce > > > > > > This is probably a stupid question, but why do we need to take a lock > > > just to record this counter? uprobes get away without taking mmap_lock > > > even for reads, and still record this seq counter. And then detect > > > whether there were any modifications in between. Why does this change > > > need more heavy-weight mmap_read_lock to do speculative reads? > > > > Not a stupid question. mmap_read_lock() is used to wait for the writer > > to finish what it's doing and then we continue by recording a new > > sequence counter value and call mmap_read_unlock. This is what > > get_vma_snapshot() does. But your question made me realize that we can > > optimize m_start() further by not taking mmap_read_lock at all. > > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > > try mmap_lock_speculate_try_begin() and only if it fails do the same > > dance we do in the get_vma_snapshot(). I think that should work. > > Ok, yeah, it would be great to avoid taking a lock in a common case! 
We can check this counter once per 4k block and maintain the same 'tearing' that exists today instead of per-vma. Not that anyone said they had an issue with changing it, but since we're on this road anyways I'd thought I'd point out where we could end up. I am concerned about live locking in either scenario, but I haven't looked too deep into this pattern. I also don't love (as usual) the lack of ensured forward progress. It seems like we have four cases for the vm area state now: 1. we want to read a stable vma or set of vmas (per-vma locking) 2. we want to read a stable mm state for reading (the very short named mmap_lock_speculate_try_begin) 3. we ensure a stable vma/mm state for reading (mmap read lock) 4. we are writing - get out of my way (mmap write lock). Cheers, Liam
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Wed, Apr 23, 2025 at 2:49 PM Suren Baghdasaryan wrote: > > On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko > wrote: > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan > > wrote: > > > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > > its important members being RCU-safe, /proc/pid/maps can be read under > > > RCU and without the need to read-lock mmap_lock. However vma content > > > can change from under us, therefore we make a copy of the vma and we > > > pin pointer fields used when generating the output (currently only > > > vm_file and anon_name). Afterwards we check for concurrent address > > > space modifications, wait for them to end and retry. While we take > > > the mmap_lock for reading during such contention, we do that momentarily > > > only to record new mm_wr_seq counter. This change is designed to reduce > > > > This is probably a stupid question, but why do we need to take a lock > > just to record this counter? uprobes get away without taking mmap_lock > > even for reads, and still record this seq counter. And then detect > > whether there were any modifications in between. Why does this change > > need more heavy-weight mmap_read_lock to do speculative reads? > > Not a stupid question. mmap_read_lock() is used to wait for the writer > to finish what it's doing and then we continue by recording a new > sequence counter value and call mmap_read_unlock. This is what > get_vma_snapshot() does. But your question made me realize that we can > optimize m_start() further by not taking mmap_read_lock at all. > Instead of taking mmap_read_lock then doing drop_mmap_lock() we can > try mmap_lock_speculate_try_begin() and only if it fails do the same > dance we do in the get_vma_snapshot(). I think that should work. Ok, yeah, it would be great to avoid taking a lock in a common case! 
> > > > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > > (often a low priority task, such as monitoring/data collection services) > > > from blocking address space updates. > > > Note that this change has a userspace visible disadvantage: it allows > > > for sub-page data tearing as opposed to the previous mechanism where > > > data tearing could happen only between pages of generated output data. > > > Since current userspace considers data tearing between pages to be > > > acceptable, we assume is will be able to handle sub-page data tearing > > > as well. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > fs/proc/internal.h| 6 ++ > > > fs/proc/task_mmu.c| 170 ++ > > > include/linux/mm_inline.h | 18 > > > 3 files changed, 177 insertions(+), 17 deletions(-) > > > > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > > > index 96122e91c645..6e1169c1f4df 100644 > > > --- a/fs/proc/internal.h > > > +++ b/fs/proc/internal.h > > > @@ -379,6 +379,12 @@ struct proc_maps_private { > > > struct task_struct *task; > > > struct mm_struct *mm; > > > struct vma_iterator iter; > > > + bool mmap_locked; > > > + loff_t last_pos; > > > +#ifdef CONFIG_PER_VMA_LOCK > > > + unsigned int mm_wr_seq; > > > + struct vm_area_struct vma_copy; > > > +#endif > > > #ifdef CONFIG_NUMA > > > struct mempolicy *task_mempolicy; > > > #endif > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > > --- a/fs/proc/task_mmu.c > > > +++ b/fs/proc/task_mmu.c > > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct > > > proc_maps_private *priv) > > > } > > > #endif > > > > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private > > > *priv, > > > - loff_t *ppos) > > > +#ifdef CONFIG_PER_VMA_LOCK > > > + > > > +static const struct seq_operations proc_pid_maps_op; > > > + > > > +/* > > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > > + * 
show_map_vma. > > > + */ > > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > > vm_area_struct *vma) > > > +{ > > > + struct vm_area_struct *copy = &priv->vma_copy; > > > + int ret = -EAGAIN; > > > + > > > + memcpy(copy, vma, sizeof(*vma)); > > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > > + goto out; > > > + > > > + if (!anon_vma_name_get_if_valid(copy)) > > > + goto put_file; > > > > Given vm_file and anon_vma_name are both RCU-protected, if we take > > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't > > even need getting/putting them, no? > > Yeah, anon_vma_name indeed looks safe without pinning but vm_file is > using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer > but pointing to a wrong object even if the rcu grace period did not > pass. With my assumption that seq_file can't rollback once show_map() > is done,
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Tue, Apr 22, 2025 at 3:49 PM Andrii Nakryiko wrote: > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan wrote: > > > > With maple_tree supporting vma tree traversal under RCU and vma and > > its important members being RCU-safe, /proc/pid/maps can be read under > > RCU and without the need to read-lock mmap_lock. However vma content > > can change from under us, therefore we make a copy of the vma and we > > pin pointer fields used when generating the output (currently only > > vm_file and anon_name). Afterwards we check for concurrent address > > space modifications, wait for them to end and retry. While we take > > the mmap_lock for reading during such contention, we do that momentarily > > only to record new mm_wr_seq counter. This change is designed to reduce > > This is probably a stupid question, but why do we need to take a lock > just to record this counter? uprobes get away without taking mmap_lock > even for reads, and still record this seq counter. And then detect > whether there were any modifications in between. Why does this change > need more heavy-weight mmap_read_lock to do speculative reads? Not a stupid question. mmap_read_lock() is used to wait for the writer to finish what it's doing and then we continue by recording a new sequence counter value and call mmap_read_unlock. This is what get_vma_snapshot() does. But your question made me realize that we can optimize m_start() further by not taking mmap_read_lock at all. Instead of taking mmap_read_lock then doing drop_mmap_lock() we can try mmap_lock_speculate_try_begin() and only if it fails do the same dance we do in the get_vma_snapshot(). I think that should work. > > > mmap_lock contention and prevent a process reading /proc/pid/maps files > > (often a low priority task, such as monitoring/data collection services) > > from blocking address space updates. 
> > Note that this change has a userspace visible disadvantage: it allows > > for sub-page data tearing as opposed to the previous mechanism where > > data tearing could happen only between pages of generated output data. > > Since current userspace considers data tearing between pages to be > > acceptable, we assume is will be able to handle sub-page data tearing > > as well. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > fs/proc/internal.h| 6 ++ > > fs/proc/task_mmu.c| 170 ++ > > include/linux/mm_inline.h | 18 > > 3 files changed, 177 insertions(+), 17 deletions(-) > > > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > > index 96122e91c645..6e1169c1f4df 100644 > > --- a/fs/proc/internal.h > > +++ b/fs/proc/internal.h > > @@ -379,6 +379,12 @@ struct proc_maps_private { > > struct task_struct *task; > > struct mm_struct *mm; > > struct vma_iterator iter; > > + bool mmap_locked; > > + loff_t last_pos; > > +#ifdef CONFIG_PER_VMA_LOCK > > + unsigned int mm_wr_seq; > > + struct vm_area_struct vma_copy; > > +#endif > > #ifdef CONFIG_NUMA > > struct mempolicy *task_mempolicy; > > #endif > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index b9e4fbbdf6e6..f9d50a61167c 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct > > proc_maps_private *priv) > > } > > #endif > > > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, > > - loff_t *ppos) > > +#ifdef CONFIG_PER_VMA_LOCK > > + > > +static const struct seq_operations proc_pid_maps_op; > > + > > +/* > > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > > + * show_map_vma. 
> > + */ > > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > > vm_area_struct *vma) > > +{ > > + struct vm_area_struct *copy = &priv->vma_copy; > > + int ret = -EAGAIN; > > + > > + memcpy(copy, vma, sizeof(*vma)); > > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > > + goto out; > > + > > + if (!anon_vma_name_get_if_valid(copy)) > > + goto put_file; > > Given vm_file and anon_vma_name are both RCU-protected, if we take > rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't > even need getting/putting them, no? Yeah, anon_vma_name indeed looks safe without pinning but vm_file is using SLAB_TYPESAFE_BY_RCU cache, so it might still be a valid pointer but pointing to a wrong object even if the rcu grace period did not pass. With my assumption that seq_file can't rollback once show_map() is done, I would need a completely stable vma at the time show_map() is executed so that it does not change from under us while we are generating the output. OTOH, if we indeed can rollback while generating seq_file output then show_map() could output potentially invalid vma, then check for vma changes and when detected, rollback seq_file and retry again. > >
Re: [PATCH v3 7/8] mm/maps: read proc/pid/maps under RCU
On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan wrote: > > With maple_tree supporting vma tree traversal under RCU and vma and > its important members being RCU-safe, /proc/pid/maps can be read under > RCU and without the need to read-lock mmap_lock. However vma content > can change from under us, therefore we make a copy of the vma and we > pin pointer fields used when generating the output (currently only > vm_file and anon_name). Afterwards we check for concurrent address > space modifications, wait for them to end and retry. While we take > the mmap_lock for reading during such contention, we do that momentarily > only to record new mm_wr_seq counter. This change is probably a stupid question, but why do we need to take a lock just to record this counter? uprobes get away without taking mmap_lock even for reads, and still record this seq counter. And then detect whether there were any modifications in between. Why does this change need more heavy-weight mmap_read_lock to do speculative reads? > mmap_lock contention and prevent a process reading /proc/pid/maps files > (often a low priority task, such as monitoring/data collection services) > from blocking address space updates. > Note that this change has a userspace visible disadvantage: it allows > for sub-page data tearing as opposed to the previous mechanism where > data tearing could happen only between pages of generated output data. > Since current userspace considers data tearing between pages to be > acceptable, we assume it will be able to handle sub-page data tearing > as well. 
> > Signed-off-by: Suren Baghdasaryan > --- > fs/proc/internal.h| 6 ++ > fs/proc/task_mmu.c| 170 ++ > include/linux/mm_inline.h | 18 > 3 files changed, 177 insertions(+), 17 deletions(-) > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > index 96122e91c645..6e1169c1f4df 100644 > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -379,6 +379,12 @@ struct proc_maps_private { > struct task_struct *task; > struct mm_struct *mm; > struct vma_iterator iter; > + bool mmap_locked; > + loff_t last_pos; > +#ifdef CONFIG_PER_VMA_LOCK > + unsigned int mm_wr_seq; > + struct vm_area_struct vma_copy; > +#endif > #ifdef CONFIG_NUMA > struct mempolicy *task_mempolicy; > #endif > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index b9e4fbbdf6e6..f9d50a61167c 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -127,13 +127,130 @@ static void release_task_mempolicy(struct > proc_maps_private *priv) > } > #endif > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, > - loff_t *ppos) > +#ifdef CONFIG_PER_VMA_LOCK > + > +static const struct seq_operations proc_pid_maps_op; > + > +/* > + * Take VMA snapshot and pin vm_file and anon_name as they are used by > + * show_map_vma. > + */ > +static int get_vma_snapshot(struct proc_maps_private *priv, struct > vm_area_struct *vma) > +{ > + struct vm_area_struct *copy = &priv->vma_copy; > + int ret = -EAGAIN; > + > + memcpy(copy, vma, sizeof(*vma)); > + if (copy->vm_file && !get_file_rcu(©->vm_file)) > + goto out; > + > + if (!anon_vma_name_get_if_valid(copy)) > + goto put_file; Given vm_file and anon_vma_name are both RCU-protected, if we take rcu_read_lock() at m_start/m_next before fetching VMA, we shouldn't even need getting/putting them, no? I feel like I'm missing some important limitations, but I don't think they are spelled out explicitly... 
> + > + if (!mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) > + return 0; > + > + /* Address space got modified, vma might be stale. Re-lock and retry. > */ > + rcu_read_unlock(); Another question I have is whether we really need to copy vma into priv->vma_copy to have a stable snapshot? Can't we just speculate like with uprobes under assumption that data doesn't change. And once we are done producing text output, confirm that speculation was successful, and if not - retry? We'd need an API for seq_file to rollback to previous good known location for that, but that seems straightforward enough to do, no? Just remember buffer position before speculation, write data, check for no mm modifications, and if something changed, rollback seq file to last position. > + ret = mmap_read_lock_killable(priv->mm); > + if (!ret) { > + /* mmap_lock_speculate_try_begin() succeeds when holding > mmap_read_lock */ > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq); > + mmap_read_unlock(priv->mm); > + ret = -EAGAIN; > + } > + > + rcu_read_lock(); > + > + anon_vma_name_put_if_valid(copy); > +put_file: > + if (copy->vm_file) > + fput(copy->vm_f