Re: [PATCH v2 01/12] uprobes: update outdated comment
On 07/03, Oleg Nesterov wrote: > > > /* > > -* The NULL 'tsk' here ensures that any faults that occur here > > -* will not be accounted to the task. 'mm' *is* current->mm, > > -* but we treat this as a 'remote' access since it is > > -* essentially a kernel access to the memory. > > +* 'mm' *is* current->mm, but we treat this as a 'remote' access since > > +* it is essentially a kernel access to the memory. > > */ > > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); > > OK, this makes it less confusing, so > > Acked-by: Oleg Nesterov > > - > but it still looks confusing to me. This code used to pass tsk = NULL > only to avoid tsk->maj/min_flt++ in faultin_page(). > > But today mm_account_fault() increments these counters without checking > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to > just use get_user_pages() and remove this comment? Well, yes, it still looks confusing, imo. Andrii, I hope you won't mind if I redo/resend this and the next cleanup? The next one only updates the comment above uprobe_write_opcode(), but it would be nice to explain mmap_write_lock() in register_for_each_vma(). Oleg.
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
On 07/09, Andrii Nakryiko wrote: > > On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov wrote: > > > > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe > > > detects "improper" stack pointer progression, for example), > > > > In this case we a) assume that user-space tries to fool the kernel and > > Well, it's a bad assumption. User space might just be using fibers and > managing its own stack. Do you mean something like the "go" language? Yes, not supported. And from the kernel perspective it still looks as if user-space tries to fool the kernel. I mean, if you insert a ret-probe, the kernel assumes that it "owns" the stack, if nothing else the kernel has to change the ret-address on stack. I agree, this is not good. But again, what else the kernel can do in this case? > > Not really expected, and that is why the "TODO" comment in _unregister() > > was never implemented. Although the real reason is that we are lazy ;) > > Worked fine for 10+ years, which says something ;) Or may be it doesn't but we do not know because this code doesn't do uprobe_warn() ;) Oleg.
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
On 07/08, Andrii Nakryiko wrote: > > On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov wrote: > > > > And I forgot to mention... > > > > In any case __uprobe_unregister() can't ignore the error code from > > register_for_each_vma(). If it fails to restore the original insn, > > we should not remove this uprobe from uprobes_tree. > > > > Otherwise the next handle_swbp() will send SIGTRAP to the (no longer) > > probed application. > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe > detects "improper" stack pointer progression, for example), In this case we a) assume that user-space tries to fool the kernel and b) the kernel can't handle this case in any case, thus uprobe_warn(). > but from > what I gather it's not really expected to fail on unregistration given > we successfully registered uprobe. Not really expected, and that is why the "TODO" comment in _unregister() was never implemented. Although the real reason is that we are lazy ;) But register_for_each_vma(NULL) can fail. Say, simply because kmalloc(GFP_KERNEL) in build_map_info() can fail even if it "never" should. A lot of other reasons. > I guess it's a decision between > leaking memory with an uprobe stuck in the tree or killing process due > to some very rare (or buggy) condition? Yes. I think in this case it is better to leak uprobe than kill the no longer probed task. Oleg.
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
And I forgot to mention... In any case __uprobe_unregister() can't ignore the error code from register_for_each_vma(). If it fails to restore the original insn, we should not remove this uprobe from uprobes_tree. Otherwise the next handle_swbp() will send SIGTRAP to the (no longer) probed application. On 07/05, Oleg Nesterov wrote: > > Tried to read this patch, but I fail to understand it. It looks > obvioulsy wrong to me, see below. > > I tend to agree with the comments from Peter, but lets ignore them > for the moment. > > On 07/01, Andrii Nakryiko wrote: > > > > static void put_uprobe(struct uprobe *uprobe) > > { > > - if (refcount_dec_and_test(>ref)) { > > + s64 v; > > + > > + /* > > +* here uprobe instance is guaranteed to be alive, so we use Tasks > > +* Trace RCU to guarantee that uprobe won't be freed from under us, if > > +* we end up being a losing "destructor" inside uprobe_treelock'ed > > +* section double-checking uprobe->ref value below. > > +* Note call_rcu_tasks_trace() + uprobe_free_rcu below. > > +*/ > > + rcu_read_lock_trace(); > > + > > + v = atomic64_add_return(UPROBE_REFCNT_PUT, >ref); > > + > > + if (unlikely((u32)v == 0)) { > > I must have missed something, but how can this ever happen? > > Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose > that this binary is not used, so _register() doesn't install breakpoints/etc. > > IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds. > > Now suppose that uprobe_unregister() is called right after that. It does > > uprobe = find_uprobe(inode, offset); > > this increments the counter, (u32)uprobe->ref == 2 > > __uprobe_unregister(...); > > this wont't change the counter, > > put_uprobe(uprobe); > > this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1. > > Where should the "final" put_uprobe() come from? > > IIUC, this patch lacks another put_uprobe() after consumer_del(), no? > > Oleg.
Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
On 07/01, Andrii Nakryiko wrote: > > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -42,6 +42,11 @@ struct uprobe_consumer { > enum uprobe_filter_ctx ctx, > struct mm_struct *mm); > > + /* associated file offset of this probe */ > + loff_t offset; > + /* associated refctr file offset of this probe, or zero */ > + loff_t ref_ctr_offset; > + /* for internal uprobe infra use, consumers shouldn't touch fields > below */ > struct uprobe_consumer *next; Well, I don't really like this patch either... If nothing else because all the consumers in uprobe->consumers list must have the same offset/ref_ctr_offset. -- But I agree, the ugly uprobe_register_refctr() must die, we need a single function int uprobe_register(inode, offset, ref_ctr_offset, consumer); This change is trivial. -- And speaking of cleanups, I think another change makes sense: - int uprobe_register(...); + struct uprobe* uprobe_register(...); so that uprobe_register() returns uprobe or ERR_PTR. - void uprobe_unregister(inode, offset, consumer); + void uprobe_unregister(uprobe, consumer); this way unregister() doesn't need the extra find_uprobe() + put_uprobe(). The same for uprobe_apply(). The necessary changes in kernel/trace/trace_uprobe.c are trivial, we just need to change struct trace_uprobe - struct inode*inode; + struct uprobe *uprobe; and fix the compilation errors. As for kernel/trace/bpf_trace.c, I guess struct bpf_uprobe needs the new ->uprobe member, we can't kill bpf_uprobe->offset because of bpf_uprobe_multi_link_fill_link_info(), but I think this is not that bad. What do you think? Oleg.
Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
Tried to read this patch, but I fail to understand it. It looks obvioulsy wrong to me, see below. I tend to agree with the comments from Peter, but lets ignore them for the moment. On 07/01, Andrii Nakryiko wrote: > > static void put_uprobe(struct uprobe *uprobe) > { > - if (refcount_dec_and_test(>ref)) { > + s64 v; > + > + /* > + * here uprobe instance is guaranteed to be alive, so we use Tasks > + * Trace RCU to guarantee that uprobe won't be freed from under us, if > + * we end up being a losing "destructor" inside uprobe_treelock'ed > + * section double-checking uprobe->ref value below. > + * Note call_rcu_tasks_trace() + uprobe_free_rcu below. > + */ > + rcu_read_lock_trace(); > + > + v = atomic64_add_return(UPROBE_REFCNT_PUT, >ref); > + > + if (unlikely((u32)v == 0)) { I must have missed something, but how can this ever happen? Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose that this binary is not used, so _register() doesn't install breakpoints/etc. IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds. Now suppose that uprobe_unregister() is called right after that. It does uprobe = find_uprobe(inode, offset); this increments the counter, (u32)uprobe->ref == 2 __uprobe_unregister(...); this wont't change the counter, put_uprobe(uprobe); this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1. Where should the "final" put_uprobe() come from? IIUC, this patch lacks another put_uprobe() after consumer_del(), no? Oleg.
Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
On 07/01, Andrii Nakryiko wrote: > > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct > mm_struct *mm, > * @vaddr: the virtual address to store the opcode. > * @opcode: opcode to be written at @vaddr. > * > - * Called with mm->mmap_lock held for write. > + * Called with mm->mmap_lock held for read or write. > * Return 0 (success) or a negative errno. Thanks, Acked-by: Oleg Nesterov I'll try to send the patch which explains the reasons for mmap_write_lock() in register_for_each_vma() later. Oleg.
Re: [PATCH v2 01/12] uprobes: update outdated comment
Sorry for the late reply. I'll try to read this version/discussion when I have time... yes, I have already promised this before, sorry :/ On 07/01, Andrii Nakryiko wrote: > > There is no task_struct passed into get_user_pages_remote() anymore, > drop the parts of comment mentioning NULL tsk, it's just confusing at > this point. Agreed. > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, > unsigned long vaddr) > goto out; > > /* > - * The NULL 'tsk' here ensures that any faults that occur here > - * will not be accounted to the task. 'mm' *is* current->mm, > - * but we treat this as a 'remote' access since it is > - * essentially a kernel access to the memory. > + * 'mm' *is* current->mm, but we treat this as a 'remote' access since > + * it is essentially a kernel access to the memory. >*/ > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); OK, this makes it less confusing, so Acked-by: Oleg Nesterov - but it still looks confusing to me. This code used to pass tsk = NULL only to avoid tsk->maj/min_flt++ in faultin_page(). But today mm_account_fault() increments these counters without checking FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to just use get_user_pages() and remove this comment? Oleg.
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
On 06/25, Andrii Nakryiko wrote: > > On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov wrote: > > > > Why? > > > > So far I don't understand this change. Quite possibly I missed something, > > but in this case the changelog should explain the problem more clearly. > > > > I just went off of "Called with mm->mmap_lock held for write." comment > in uprobe_write_opcode(), tbh. Ah, indeed... and git blame makes me sad ;) I _think_ that 29dedee0e693a updated this comment without any thinking, but today I can't recall. In any case, today this nothing to do with mem_cgroup_charge(). Not sure __replace_page() is correct (in this respect) when it returns -EAGAIN but this is another story. > If we don't actually need writer > mmap_lock, we should probably update at least that comment. Agreed. > There is a > lot going on in uprobe_write_opcode(), and I don't understand all the > requirements there. Heh. Neither me today ;) Oleg.
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
On 06/25, Masami Hiramatsu wrote: > > On Mon, 24 Jun 2024 17:21:34 -0700 > Andrii Nakryiko wrote: > > > Given unapply_uprobe() can call remove_breakpoint() which eventually > > calls uprobe_write_opcode(), which can modify a set of memory pages and > > expects mm->mmap_lock held for write, it needs to have writer lock. > > Oops, it is an actual bug, right? Why? So far I don't understand this change. Quite possibly I missed something, but in this case the changelog should explain the problem more clearly. Oleg.
Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management
Again, I'll try to read (at least this) patch later this week, just one cosmetic nit for now... On 06/24, Andrii Nakryiko wrote: > > + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both* > + * epoch and refcnt parts atomically with one atomic_add(). > + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and > + * *increment* epoch part. > + */ > +#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL) > +#define UPROBE_REFCNT_PUT (0xLL) How about #define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL) #define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL) ? this makes them symmetrical and makes it clear why atomic64_add(UPROBE_REFCNT_PUT) works as described in the comment. Feel free to ignore if you don't like it. Oleg.
Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()
I don't think I can review, I forgot everything, but I'll try to read this series when I have time to (try to) understand what it does... On 06/24, Andrii Nakryiko wrote: > > Given unapply_uprobe() can call remove_breakpoint() which eventually > calls uprobe_write_opcode(), which can modify a set of memory pages and > expects mm->mmap_lock held for write, it needs to have writer lock. > > Fix this by switching to mmap_write_lock()/mmap_write_unlock(). > > Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed > task") > Signed-off-by: Andrii Nakryiko > --- > kernel/events/uprobes.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 197fbe4663b5..e896eeecb091 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct > mm_struct *mm) > struct vm_area_struct *vma; > int err = 0; > > - mmap_read_lock(mm); > + mmap_write_lock(mm); Can you explain what exactly is wrong? FOLL_FORCE/etc is fine under mmap_read_lock(), __replace_page() seems too... I recall that initially uprobes.c always took mmap_sem for reading, then register_for_each_vma() was changed by 77fc4af1b59d12 but there was other reasons for this change... Again, I don't understand this code today, quite possibly I missed something, I am just trying to understand. Well, it seems there is another reason for this change... Currently 2 unapply_uprobe()'s can race with each other if they try to update the same page. But in this case we can rely on -EAGAIN from __replace_page() ? Oleg.
Re: [PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()
On 05/20, Andrii Nakryiko wrote: > > Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily") > Reported-by: Breno Leitao > Signed-off-by: Andrii Nakryiko > --- > kernel/trace/trace_uprobe.c | 14 +- > 1 file changed, 9 insertions(+), 5 deletions(-) Reviewed-by: Oleg Nesterov