Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-10 Thread Oleg Nesterov
On 07/03, Oleg Nesterov wrote:
>
> > /*
> > -* The NULL 'tsk' here ensures that any faults that occur here
> > -* will not be accounted to the task.  'mm' *is* current->mm,
> > -* but we treat this as a 'remote' access since it is
> > -* essentially a kernel access to the memory.
> > +* 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +* it is essentially a kernel access to the memory.
> >  */
> > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov 
>
> -
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

Well, yes, it still looks confusing, imo.

Andrii, I hope you won't mind if I redo/resend this and the next cleanup?

The next one only updates the comment above uprobe_write_opcode(), but
it would be nice to explain mmap_write_lock() in register_for_each_vma().

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-09 Thread Oleg Nesterov
On 07/09, Andrii Nakryiko wrote:
>
> On Tue, Jul 9, 2024 at 11:49 AM Oleg Nesterov  wrote:
> >
> > > Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> > > detects "improper" stack pointer progression, for example),
> >
> > In this case we a) assume that user-space tries to fool the kernel and
>
> Well, it's a bad assumption. User space might just be using fibers and
> managing its own stack.

Do you mean something like the "go" language?

Yes, not supported. And from the kernel perspective it still looks as if
user-space tries to fool the kernel. I mean, if you insert a ret-probe,
the kernel assumes that it "owns" the stack, if nothing else the kernel
has to change the ret-address on stack.

I agree, this is not good. But again, what else the kernel can do in
this case?

> > Not really expected, and that is why the "TODO" comment in _unregister()
> > was never implemented. Although the real reason is that we are lazy ;)
>
> Worked fine for 10+ years, which says something ;)

Or may be it doesn't but we do not know because this code doesn't do
uprobe_warn() ;)

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-09 Thread Oleg Nesterov
On 07/08, Andrii Nakryiko wrote:
>
> On Sun, Jul 7, 2024 at 7:48 AM Oleg Nesterov  wrote:
> >
> > And I forgot to mention...
> >
> > In any case __uprobe_unregister() can't ignore the error code from
> > register_for_each_vma(). If it fails to restore the original insn,
> > we should not remove this uprobe from uprobes_tree.
> >
> > Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
> > probed application.
>
> Yep, that would be unfortunate (just like SIGILL sent when uretprobe
> detects "improper" stack pointer progression, for example),

In this case we a) assume that user-space tries to fool the kernel and
b) the kernel can't handle this case in any case, thus uprobe_warn().

> but from
> what I gather it's not really expected to fail on unregistration given
> we successfully registered uprobe.

Not really expected, and that is why the "TODO" comment in _unregister()
was never implemented. Although the real reason is that we are lazy ;)

But register_for_each_vma(NULL) can fail. Say, simply because
kmalloc(GFP_KERNEL) in build_map_info() can fail even if it "never" should.
A lot of other reasons.

> I guess it's a decision between
> leaking memory with an uprobe stuck in the tree or killing process due
> to some very rare (or buggy) condition?

Yes. I think in this case it is better to leak uprobe than kill the
no longer probed task.

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-07 Thread Oleg Nesterov
And I forgot to mention...

In any case __uprobe_unregister() can't ignore the error code from
register_for_each_vma(). If it fails to restore the original insn,
we should not remove this uprobe from uprobes_tree.

Otherwise the next handle_swbp() will send SIGTRAP to the (no longer)
probed application.

On 07/05, Oleg Nesterov wrote:
>
> Tried to read this patch, but I fail to understand it. It looks
> obvioulsy wrong to me, see below.
>
> I tend to agree with the comments from Peter, but lets ignore them
> for the moment.
>
> On 07/01, Andrii Nakryiko wrote:
> >
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -   if (refcount_dec_and_test(>ref)) {
> > +   s64 v;
> > +
> > +   /*
> > +* here uprobe instance is guaranteed to be alive, so we use Tasks
> > +* Trace RCU to guarantee that uprobe won't be freed from under us, if
> > +* we end up being a losing "destructor" inside uprobe_treelock'ed
> > +* section double-checking uprobe->ref value below.
> > +* Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> > +*/
> > +   rcu_read_lock_trace();
> > +
> > +   v = atomic64_add_return(UPROBE_REFCNT_PUT, >ref);
> > +
> > +   if (unlikely((u32)v == 0)) {
>
> I must have missed something, but how can this ever happen?
>
> Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
> that this binary is not used, so _register() doesn't install breakpoints/etc.
>
> IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.
>
> Now suppose that uprobe_unregister() is called right after that. It does
>
>   uprobe = find_uprobe(inode, offset);
>
> this increments the counter, (u32)uprobe->ref == 2
>
>   __uprobe_unregister(...);
>
> this wont't change the counter,
>
>   put_uprobe(uprobe);
>
> this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.
>
> Where should the "final" put_uprobe() come from?
>
> IIUC, this patch lacks another put_uprobe() after consumer_del(), no?
>
> Oleg.




Re: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer

2024-07-07 Thread Oleg Nesterov
On 07/01, Andrii Nakryiko wrote:
>
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -42,6 +42,11 @@ struct uprobe_consumer {
>   enum uprobe_filter_ctx ctx,
>   struct mm_struct *mm);
>  
> + /* associated file offset of this probe */
> + loff_t offset;
> + /* associated refctr file offset of this probe, or zero */
> + loff_t ref_ctr_offset;
> + /* for internal uprobe infra use, consumers shouldn't touch fields 
> below */
>   struct uprobe_consumer *next;


Well, I don't really like this patch either...

If nothing else because all the consumers in uprobe->consumers list
must have the same offset/ref_ctr_offset.

--
But I agree, the ugly uprobe_register_refctr() must die, we need a single
function

int uprobe_register(inode, offset, ref_ctr_offset, consumer);

This change is trivial.

--
And speaking of cleanups, I think another change makes sense:

-   int uprobe_register(...);
+   struct uprobe* uprobe_register(...);

so that uprobe_register() returns uprobe or ERR_PTR.

-   void uprobe_unregister(inode, offset, consumer);
+   void uprobe_unregister(uprobe, consumer);

this way unregister() doesn't need the extra find_uprobe() + put_uprobe().
The same for uprobe_apply().

The necessary changes in kernel/trace/trace_uprobe.c are trivial, we just
need to change struct trace_uprobe

-   struct inode*inode;
+   struct uprobe   *uprobe;

and fix the compilation errors.


As for kernel/trace/bpf_trace.c, I guess struct bpf_uprobe  needs the new
->uprobe member, we can't kill bpf_uprobe->offset because of
bpf_uprobe_multi_link_fill_link_info(), but I think this is not that bad.

What do you think?

Oleg.




Re: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-07-05 Thread Oleg Nesterov
Tried to read this patch, but I fail to understand it. It looks
obvioulsy wrong to me, see below.

I tend to agree with the comments from Peter, but lets ignore them
for the moment.

On 07/01, Andrii Nakryiko wrote:
>
>  static void put_uprobe(struct uprobe *uprobe)
>  {
> - if (refcount_dec_and_test(>ref)) {
> + s64 v;
> +
> + /*
> +  * here uprobe instance is guaranteed to be alive, so we use Tasks
> +  * Trace RCU to guarantee that uprobe won't be freed from under us, if
> +  * we end up being a losing "destructor" inside uprobe_treelock'ed
> +  * section double-checking uprobe->ref value below.
> +  * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
> +  */
> + rcu_read_lock_trace();
> +
> + v = atomic64_add_return(UPROBE_REFCNT_PUT, >ref);
> +
> + if (unlikely((u32)v == 0)) {

I must have missed something, but how can this ever happen?

Suppose uprobe_register(inode) is called the 1st time. To simplify, suppose
that this binary is not used, so _register() doesn't install breakpoints/etc.

IIUC, with this change (u32)uprobe->ref == 1 when uprobe_register() succeeds.

Now suppose that uprobe_unregister() is called right after that. It does

uprobe = find_uprobe(inode, offset);

this increments the counter, (u32)uprobe->ref == 2

__uprobe_unregister(...);

this wont't change the counter,

put_uprobe(uprobe);

this drops the reference added by find_uprobe(), (u32)uprobe->ref == 1.

Where should the "final" put_uprobe() come from?

IIUC, this patch lacks another put_uprobe() after consumer_del(), no?

Oleg.




Re: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()

2024-07-03 Thread Oleg Nesterov
On 07/01, Andrii Nakryiko wrote:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct 
> mm_struct *mm,
>   * @vaddr: the virtual address to store the opcode.
>   * @opcode: opcode to be written at @vaddr.
>   *
> - * Called with mm->mmap_lock held for write.
> + * Called with mm->mmap_lock held for read or write.
>   * Return 0 (success) or a negative errno.

Thanks,

Acked-by: Oleg Nesterov 


I'll try to send the patch which explains the reasons for mmap_write_lock()
in register_for_each_vma() later.

Oleg.




Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-03 Thread Oleg Nesterov
Sorry for the late reply. I'll try to read this version/discussion
when I have time... yes, I have already promised this before, sorry :/

On 07/01, Andrii Nakryiko wrote:
>
> There is no task_struct passed into get_user_pages_remote() anymore,
> drop the parts of comment mentioning NULL tsk, it's just confusing at
> this point.

Agreed.

> @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
> unsigned long vaddr)
>   goto out;
>
>   /*
> -  * The NULL 'tsk' here ensures that any faults that occur here
> -  * will not be accounted to the task.  'mm' *is* current->mm,
> -  * but we treat this as a 'remote' access since it is
> -  * essentially a kernel access to the memory.
> +  * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> +  * it is essentially a kernel access to the memory.
>*/
>   result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);

OK, this makes it less confusing, so

Acked-by: Oleg Nesterov 


-
but it still looks confusing to me. This code used to pass tsk = NULL
only to avoid tsk->maj/min_flt++ in faultin_page().

But today mm_account_fault() increments these counters without checking
FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
just use get_user_pages() and remove this comment?

Oleg.




Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-25 Thread Oleg Nesterov
On 06/25, Andrii Nakryiko wrote:
>
> On Tue, Jun 25, 2024 at 7:51 AM Oleg Nesterov  wrote:
> >
> > Why?
> >
> > So far I don't understand this change. Quite possibly I missed something,
> > but in this case the changelog should explain the problem more clearly.
> >
>
> I just went off of "Called with mm->mmap_lock held for write." comment
> in uprobe_write_opcode(), tbh.

Ah, indeed... and git blame makes me sad ;)

I _think_ that 29dedee0e693a updated this comment without any thinking,
but today I can't recall. In any case, today this nothing to do with
mem_cgroup_charge(). Not sure __replace_page() is correct (in this respect)
when it returns -EAGAIN but this is another story.

> If we don't actually need writer
> mmap_lock, we should probably update at least that comment.

Agreed.

> There is a
> lot going on in uprobe_write_opcode(), and I don't understand all the
> requirements there.

Heh. Neither me today ;)

Oleg.




Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-25 Thread Oleg Nesterov
On 06/25, Masami Hiramatsu wrote:
>
> On Mon, 24 Jun 2024 17:21:34 -0700
> Andrii Nakryiko  wrote:
>
> > Given unapply_uprobe() can call remove_breakpoint() which eventually
> > calls uprobe_write_opcode(), which can modify a set of memory pages and
> > expects mm->mmap_lock held for write, it needs to have writer lock.
>
> Oops, it is an actual bug, right?

Why?

So far I don't understand this change. Quite possibly I missed something,
but in this case the changelog should explain the problem more clearly.

Oleg.




Re: [PATCH 04/12] uprobes: revamp uprobe refcounting and lifetime management

2024-06-25 Thread Oleg Nesterov
Again, I'll try to read (at least this) patch later this week,
just one cosmetic nit for now...

On 06/24, Andrii Nakryiko wrote:
>
> + * UPROBE_REFCNT_GET constant is chosen such that it will *increment both*
> + * epoch and refcnt parts atomically with one atomic_add().
> + * UPROBE_REFCNT_PUT is chosen such that it will *decrement* refcnt part and
> + * *increment* epoch part.
> + */
> +#define UPROBE_REFCNT_GET ((1LL << 32) | 1LL)
> +#define UPROBE_REFCNT_PUT (0xLL)

How about

#define UPROBE_REFCNT_GET ((1ULL << 32) + 1ULL)
#define UPROBE_REFCNT_PUT ((1ULL << 32) - 1ULL)

? this makes them symmetrical and makes it clear why
atomic64_add(UPROBE_REFCNT_PUT) works as described in the comment.

Feel free to ignore if you don't like it.

Oleg.




Re: [PATCH 02/12] uprobes: grab write mmap lock in unapply_uprobe()

2024-06-25 Thread Oleg Nesterov
I don't think I can review, I forgot everything, but I'll try to read this
series when I have time to (try to) understand what it does...

On 06/24, Andrii Nakryiko wrote:
>
> Given unapply_uprobe() can call remove_breakpoint() which eventually
> calls uprobe_write_opcode(), which can modify a set of memory pages and
> expects mm->mmap_lock held for write, it needs to have writer lock.
>
> Fix this by switching to mmap_write_lock()/mmap_write_unlock().
>
> Fixes: da1816b1caec ("uprobes: Teach handler_chain() to filter out the probed 
> task")
> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/events/uprobes.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 197fbe4663b5..e896eeecb091 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1235,7 +1235,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct 
> mm_struct *mm)
>   struct vm_area_struct *vma;
>   int err = 0;
>
> - mmap_read_lock(mm);
> + mmap_write_lock(mm);

Can you explain what exactly is wrong?

FOLL_FORCE/etc is fine under mmap_read_lock(), __replace_page() seems too...

I recall that initially uprobes.c always took mmap_sem for reading, then
register_for_each_vma() was changed by 77fc4af1b59d12 but there was other
reasons for this change...

Again, I don't understand this code today, quite possibly I missed something,
I am just trying to understand.

Well, it seems there is another reason for this change... Currently 2
unapply_uprobe()'s can race with each other if they try to update the same
page. But in this case we can rely on -EAGAIN from __replace_page() ?

Oleg.




Re: [PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-21 Thread Oleg Nesterov
On 05/20, Andrii Nakryiko wrote:
>
> Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily")
> Reported-by: Breno Leitao 
> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/trace/trace_uprobe.c | 14 +-
>  1 file changed, 9 insertions(+), 5 deletions(-)

Reviewed-by: Oleg Nesterov