Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-09-02 Thread Linus Torvalds
On Mon, Sep 2, 2019 at 10:04 AM Eric W. Biederman  wrote:
>
> I like using the storage we will later use for the rcu_head.
>
> Is the intention your new variable xxx start as 0, and the only
> on the second write it becomes 1 and we take action?
>
> That should work but it is a funny way to encode a decrement.  I think
> it would be more straight forward to use refcount_dec_and_test.
>
> So something like this:

I like how this patch looks. It makes more sense to me than some of
the ad-hoc cases, and I wonder if this might be a pattern in general.

We have a very different "some users don't need RCU" in the dentry
code, and recently in the credential handling code. So I wonder if
this is a larger pattern, but I think your patch looks good
independently on its own.

But this is all based on "that patch _feels_ conceptually right",
rather than any deep thinking or (God forbid) any actual testing.

Linus


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-09-02 Thread Eric W. Biederman
Oleg Nesterov  writes:

> On 08/30, Eric W. Biederman wrote:
>>
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head 
>> *rhp)
>>  put_task_struct(tsk);
>>  }
>>  
>> +void put_dead_task_struct(struct task_struct *task)
>> +{
>> +bool delay = false;
>> +unsigned long flags;
>> +
>> +/* Is the task both reaped and no longer being scheduled? */
>> +raw_spin_lock_irqsave(>pi_lock, flags);
>> +if ((task->state == TASK_DEAD) &&
>> +(cmpxchg(>exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
>> +delay = true;
>> +raw_spin_lock_irqrestore(>pi_lock, flags);
>> +
>> +/* If both are true use rcu delay the put_task_struct */
>> +if (delay)
>> +call_rcu(>rcu, delayed_put_task_struct);
>> +else
>> +put_task_struct(task);
>> +}
>>  
>>  void release_task(struct task_struct *p)
>>  {
>> @@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
>>  
>>  write_unlock_irq(_lock);
>>  release_thread(p);
>> -call_rcu(>rcu, delayed_put_task_struct);
>> +put_dead_task_struct(p);
>
> I had a similar change in mind, see below. This is subjective, but to me
> it looks more simple and clean.
>
> Oleg.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8dc1811..1f9b021 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1134,7 +1134,10 @@ struct task_struct {
>  
>   struct tlbflush_unmap_batch tlb_ubc;
>  
> - struct rcu_head rcu;
> + union {
> + boolxxx;
> + struct rcu_head rcu;
> + };
>  
>   /* Cache last used pipe for splice(): */
>   struct pipe_inode_info  *splice_pipe;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a75b6a7..baacfce 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>   put_task_struct(tsk);
>  }
>  
> +void call_delayed_put_task_struct(struct task_struct *p)
> +{
> + if (xchg(>xxx, 1))
> + call_rcu(>rcu, delayed_put_task_struct);
> +}

I like using the storage we will later use for the rcu_head.

Is the intention your new variable xxx start as 0, and the only
on the second write it becomes 1 and we take action?

That should work but it is a funny way to encode a decrement.  I think
it would be more straight forward to use refcount_dec_and_test.

So something like this:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..99a4518b9b17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1142,7 +1142,10 @@ struct task_struct {
 
struct tlbflush_unmap_batch tlb_ubc;
 
-   struct rcu_head rcu;
+   union {
+   refcount_t  rcu_users;
+   struct rcu_head rcu;
+   };
 
/* Cache last used pipe for splice(): */
struct pipe_inode_info  *splice_pipe;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..8bd51af44bf8 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,7 @@ static inline void put_task_struct(struct task_struct *t)
__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_task_struct_rcu_user(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..a42fd8889036 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
 }
 
+void put_task_struct_rcu_user(struct task_struct *task)
+{
+   if (refcount_dec_and_test(>rcu_users))
+   call_rcu(>rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,10 +227,10 @@ void release_task(struct task_struct *p)
 
write_unlock_irq(_lock);
release_thread(p);
-   call_rcu(>rcu, delayed_put_task_struct);
+   put_task_struct_rcu_user(p);
 
p = leader;
if (unlikely(zap_leader))
goto repeat;
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 2852d0e76ea3..dc4799210e05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,11 +900,15 @@ static struct task_struct *dup_task_struct(struct 
task_struct *orig, int node)
if (orig->cpus_ptr == >cpus_mask)
tsk->cpus_ptr = >cpus_mask;
 
-   /*
-* One for us, one for whoever does the "release_task()" (usually
-* parent)
+   /* One for the user space visible state that goes away when
+* the processes zombie is reaped.
+* One for the reference from the scheduler.
+*
+* The reference count is ignored and free_task is called
+* 

Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-09-02 Thread Peter Zijlstra
On Mon, Sep 02, 2019 at 04:44:24PM +0200, Oleg Nesterov wrote:

> speaking of the users of task_rcu_dereference(), membarrier_global_expedited()
> does
> 
>   rcu_read_lock();
>   p = task_rcu_dereference(_rq(cpu)->curr);
>   if (p && p->mm && (atomic_read(>mm->membarrier_state) &
>  MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
> 
> This asks for READ_ONCE, but this is minor. Why can't p->mm be freed?
> 
> I guess it is fine to read the garbage from >mm->membarrier_state if we 
> race
> with the exiting task, but in theory this looks unsafe if 
> CONFIG_DEBUG_PAGEALLOC.
> 
> Another possible user of probe_slab_address() or I am totally confused?

You're quite right; that's busted.

Due to the lack of READ_ONCE() on p->mm, the above can in fact turn into
a NULL deref when we hit do_exit() around exit_mm(). The first p->mm
read is before and sees !NULL, the second is after and does observe
NULL, and *bang*.

I suppose it wants to be something like:

mm = READ_ONCE(p->mm);
if (mm && probe_address())

(I'm not sure _slab_ is a useful part of the name; it should work on
kernel memory irrespective of the allocator)

If it got freed, that CPU already just did something that implies
smp_mb() so we're good. So whatever garbage gets read, is fine. Either
we do a superfluous IPI or not is immaterial.




Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-09-02 Thread Oleg Nesterov
On 09/02, Peter Zijlstra wrote:
>
> On Mon, Sep 02, 2019 at 03:40:03PM +0200, Oleg Nesterov wrote:
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 8dc1811..1f9b021 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1134,7 +1134,10 @@ struct task_struct {
> >  
> > struct tlbflush_unmap_batch tlb_ubc;
> >  
> > -   struct rcu_head rcu;
> > +   union {
> > +   boolxxx;
> > +   struct rcu_head rcu;
> > +   };
> >  
> > /* Cache last used pipe for splice(): */
> > struct pipe_inode_info  *splice_pipe;
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index a75b6a7..baacfce 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head 
> > *rhp)
> > put_task_struct(tsk);
> >  }
> >  
> > +void call_delayed_put_task_struct(struct task_struct *p)
> > +{
> > +   if (xchg(>xxx, 1))
> > +   call_rcu(>rcu, delayed_put_task_struct);
> > +}
> 
> I think this is the first usage of xchg() on _Bool, also not all archs
> implement xchg8()

I didn't even notice I used "bool" ;)


speaking of the users of task_rcu_dereference(), membarrier_global_expedited()
does

rcu_read_lock();
p = task_rcu_dereference(_rq(cpu)->curr);
if (p && p->mm && (atomic_read(>mm->membarrier_state) &
   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {

This asks for READ_ONCE, but this is minor. Why can't p->mm be freed?

I guess it is fine to read the garbage from >mm->membarrier_state if we race
with the exiting task, but in theory this looks unsafe if 
CONFIG_DEBUG_PAGEALLOC.

Another possible user of probe_slab_address() or I am totally confused?

Oleg.



Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-09-02 Thread Peter Zijlstra
On Mon, Sep 02, 2019 at 03:40:03PM +0200, Oleg Nesterov wrote:
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8dc1811..1f9b021 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1134,7 +1134,10 @@ struct task_struct {
>  
>   struct tlbflush_unmap_batch tlb_ubc;
>  
> - struct rcu_head rcu;
> + union {
> + boolxxx;
> + struct rcu_head rcu;
> + };
>  
>   /* Cache last used pipe for splice(): */
>   struct pipe_inode_info  *splice_pipe;
> diff --git a/kernel/exit.c b/kernel/exit.c
> index a75b6a7..baacfce 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>   put_task_struct(tsk);
>  }
>  
> +void call_delayed_put_task_struct(struct task_struct *p)
> +{
> + if (xchg(>xxx, 1))
> + call_rcu(>rcu, delayed_put_task_struct);
> +}

I think this is the first usage of xchg() on _Bool, also not all archs
implement xchg8()



Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-09-02 Thread Oleg Nesterov
On 08/30, Eric W. Biederman wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
>   put_task_struct(tsk);
>  }
>  
> +void put_dead_task_struct(struct task_struct *task)
> +{
> + bool delay = false;
> + unsigned long flags;
> +
> + /* Is the task both reaped and no longer being scheduled? */
> + raw_spin_lock_irqsave(>pi_lock, flags);
> + if ((task->state == TASK_DEAD) &&
> + (cmpxchg(>exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
> + delay = true;
> + raw_spin_lock_irqrestore(>pi_lock, flags);
> +
> + /* If both are true use rcu delay the put_task_struct */
> + if (delay)
> + call_rcu(>rcu, delayed_put_task_struct);
> + else
> + put_task_struct(task);
> +}
>  
>  void release_task(struct task_struct *p)
>  {
> @@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
>  
>   write_unlock_irq(_lock);
>   release_thread(p);
> - call_rcu(>rcu, delayed_put_task_struct);
> + put_dead_task_struct(p);

I had a similar change in mind, see below. This is subjective, but to me
it looks more simple and clean.

Oleg.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811..1f9b021 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1134,7 +1134,10 @@ struct task_struct {
 
struct tlbflush_unmap_batch tlb_ubc;
 
-   struct rcu_head rcu;
+   union {
+   boolxxx;
+   struct rcu_head rcu;
+   };
 
/* Cache last used pipe for splice(): */
struct pipe_inode_info  *splice_pipe;
diff --git a/kernel/exit.c b/kernel/exit.c
index a75b6a7..baacfce 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,11 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
 }
 
+void call_delayed_put_task_struct(struct task_struct *p)
+{
+   if (xchg(>xxx, 1))
+   call_rcu(>rcu, delayed_put_task_struct);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,7 +227,7 @@ void release_task(struct task_struct *p)
 
write_unlock_irq(_lock);
release_thread(p);
-   call_rcu(>rcu, delayed_put_task_struct);
+   call_delayed_put_task_struct(p);
 
p = leader;
if (unlikely(zap_leader))
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1..e90f6de 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -900,11 +900,7 @@ static struct task_struct *dup_task_struct(struct 
task_struct *orig, int node)
if (orig->cpus_ptr == >cpus_mask)
tsk->cpus_ptr = >cpus_mask;
 
-   /*
-* One for us, one for whoever does the "release_task()" (usually
-* parent)
-*/
-   refcount_set(>usage, 2);
+   refcount_set(>usage, 1);
 #ifdef CONFIG_BLK_DEV_IO_TRACE
tsk->btrace_seq = 0;
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f1..e77389c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
/* Task is done with its stack. */
put_task_stack(prev);
 
-   put_task_struct(prev);
+   call_delayed_put_task_struct(prev);
}
 
tick_nohz_task_switch();



Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov  wrote:
>>
>>
>> Yes, please see
>>
>> [PATCH 2/3] introduce probe_slab_address()
>> https://lore.kernel.org/lkml/20141027195425.gc11...@redhat.com/
>>
>> I sent 5 years ago ;) Do you think
>>
>> /*
>>  * Same as probe_kernel_address(), but @addr must be the valid 
>> pointer
>>  * to a slab object, potentially freed/reused/unmapped.
>>  */
>> #ifdef CONFIG_DEBUG_PAGEALLOC
>> #define probe_slab_address(addr, retval)\
>> probe_kernel_address(addr, retval)
>> #else
>> #define probe_slab_address(addr, retval)\
>> ({  \
>> (retval) = *(typeof(retval) *)(addr);   \
>> 0;  \
>> })
>> #endif
>>
>> can work?
>
> Ugh. I would much rather handle the general case, because honestly,
> tracing has had a lot of issues with our hacky "probe_kernel_read()"
> stuff that bases itself on user addresses.
>
> It's also one of the few remaining users of "set_fs()" in core code,
> and we really should try to get rid of those.
>
> So your code would work for this particular case, but not for other
> cases that can trap simply because the pointer isn't reliable (tracing
> being the main case for that - but if the source of the pointer itself
> might have been free'd, you would also have that situation).
>
> So I'd really prefer to have something a bit fancier. On most
> architectures, doing a good exception fixup for kernel addresses is
> really not that hard.
>
> On x86, for example, we actually have *exactly* that. The
> "__get_user_asm()" macro is basically it. It purely does a load
> instruction from an unchecked address.
>
> (It's a really odd syntax, but you could remove the __chk_user_ptr()
> from the __get_user_size() macro, and now you'd have basically a "any
> regular size kernel access with exception handlng").
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock. We do that in other
> cases.

Sort of.  The rcu delay happens when release_task calls
delayed_put_task_struct.  Which unfortunately means that anytime after
exit_notify, release_task can operate on a task.  So it is possible
that by the time do_dead_task is called the rcu grace period is up.


Which is the problem the users of task_rcu_dereference are fighting.
They are performing an rcu walk on the set of cups in task_numa_migrate
and in the userspace membarrier system calls.

For a short while we the rcu delay in put_task_struct but that required
changes all of the place and was just a pain to work with.

Then I did:
> commit 8c7904a00b06d2ee51149794b619e07369fcf9d4
> Author: Eric W. Biederman 
> Date:   Fri Mar 31 02:31:37 2006 -0800
> 
> [PATCH] task: RCU protect task->usage
> 
> A big problem with rcu protected data structures that are also reference
> counted is that you must jump through several hoops to increase the 
> reference
> count.  I think someone finally implemented atomic_inc_not_zero() to
> automate the common case.  Unfortunately this means you must special case 
> the
> rcu access case.
> 
> When data structures are only visible via rcu in a manner that is not
> determined by the reference count on the object (i.e.  tasks are visible 
> until
> their zombies are reaped) there is a much simpler technique we can employ.
> Simply delaying the decrement of the reference count until the rcu 
> interval is
> over.
> 
> What that means is that the proc code that looks up a task and later
> wants to sleep can now do:
> 
> rcu_read_lock();
> task = find_task_by_pid(some_pid);
> if (task) {
> get_task_struct(task);
> }
> rcu_read_unlock();
> 
> The effect on the rest of the kernel is that put_task_struct becomes 
> cheaper
> and immediate, and in the case where the task has been reaped it frees the
> task immediate instead of unnecessarily waiting an until the rcu interval 
> is
> over.
> 
> Cleanup of task_struct does not happen when its reference count drops to
> zero, instead cleanup happens when release_task is called.  Tasks can only
> be looked up via rcu before release_task is called.  All rcu protected
> members of task_struct are freed by release_task.
> 
> Therefore we can move call_rcu from put_task_struct into release_task.  
> And
> we can modify release_task to not immediately release the reference count
> 

Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Linus Torvalds
On Fri, Aug 30, 2019 at 9:44 AM Oleg Nesterov  wrote:
>
> ->curr is not protected by RCU, the last schedule does put_task_struct()
> in finish_task_switch().

Right you are.

It's only the sighand allocation itself that is RCU-allocated (using
SLAB_TYPESAFE_BY_RCU, so only the backing page freeing is RCU-delayed,
it can be re-used immediately).

For some reason I thought the main thread struct was too, but that was
just my fevered imagination.

> Of course we can change this and add another call_rcu (actually we can do
> better), and after that we do not need task_rcu_dereference() at all.

No, we wouldn't do another RCU call, we'd just make task_struct_cachep
be SLAB_TYPESAFE_BY_RCU too. In the reuse case, you have no cost at
all.

However, the overhead of RCU freeing is real. It's much less for the
SLAB_TYPESAFE_BY_RCU case (at least for small allocations) than for
"free every single allocaiton by RCU", but it's still very real.

(For small allocations, you only take the RCU hit when you free the
backing store of the pages, which is much less frequent - but for
something big like the task_struct, I don't know how much of a
buffering the slab allocation ends up being)

So it's probably better to take the hit at task_rcu_dereference() than
add RCU logic to the task freeing.

  Linus


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Oleg Nesterov
On 08/30, Linus Torvalds wrote:
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock.

For example,

rcu_read_lock();
p = task_rcu_dereference(_rq(cpu)->curr);
rcu_read_unlock();

->curr is not protected by RCU, the last schedule does put_task_struct()
in finish_task_switch().

Of course we can change this and add another call_rcu (actually we can do
better), and after that we do not need task_rcu_dereference() at all.

Oleg.



Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Linus Torvalds
On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov  wrote:
>
>
> Yes, please see
>
> [PATCH 2/3] introduce probe_slab_address()
> https://lore.kernel.org/lkml/20141027195425.gc11...@redhat.com/
>
> I sent 5 years ago ;) Do you think
>
> /*
>  * Same as probe_kernel_address(), but @addr must be the valid pointer
>  * to a slab object, potentially freed/reused/unmapped.
>  */
> #ifdef CONFIG_DEBUG_PAGEALLOC
> #define probe_slab_address(addr, retval)\
> probe_kernel_address(addr, retval)
> #else
> #define probe_slab_address(addr, retval)\
> ({  \
> (retval) = *(typeof(retval) *)(addr);   \
> 0;  \
> })
> #endif
>
> can work?

Ugh. I would much rather handle the general case, because honestly,
tracing has had a lot of issues with our hacky "probe_kernel_read()"
stuff that bases itself on user addresses.

It's also one of the few remaining users of "set_fs()" in core code,
and we really should try to get rid of those.

So your code would work for this particular case, but not for other
cases that can trap simply because the pointer isn't reliable (tracing
being the main case for that - but if the source of the pointer itself
might have been free'd, you would also have that situation).

So I'd really prefer to have something a bit fancier. On most
architectures, doing a good exception fixup for kernel addresses is
really not that hard.

On x86, for example, we actually have *exactly* that. The
"__get_user_asm()" macro is basically it. It purely does a load
instruction from an unchecked address.

(It's a really odd syntax, but you could remove the __chk_user_ptr()
from the __get_user_size() macro, and now you'd have basically a "any
regular size kernel access with exception handlng").

But yes, your hack is I guess optimal for this particular case where
you simply can depend on "we know the pointer was valid, we just don't
know if it was freed".

Hmm. Don't we RCU-free the task struct? Because then we don't even
need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
the pointer as long as we have the RCU read lock. We do that in other
cases.

Linus


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Oleg Nesterov
On 08/30, Linus Torvalds wrote:
>
> Side note: that code had better not be performance-critical, because
> "probe_kernel_address()" is actually really really slow.

Yes, please see

[PATCH 2/3] introduce probe_slab_address()
https://lore.kernel.org/lkml/20141027195425.gc11...@redhat.com/

I sent 5 years ago ;) Do you think

/*
 * Same as probe_kernel_address(), but @addr must be the valid pointer
 * to a slab object, potentially freed/reused/unmapped.
 */
#ifdef CONFIG_DEBUG_PAGEALLOC
#define probe_slab_address(addr, retval)\
probe_kernel_address(addr, retval)
#else
#define probe_slab_address(addr, retval)\
({  \
(retval) = *(typeof(retval) *)(addr);   \
0;  \
})
#endif

can work?

Oleg.



Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Linus Torvalds
On Fri, Aug 30, 2019 at 8:40 AM Russell King - ARM Linux admin
 wrote:
>
> Ah, ok.  Might be worth some comments - I find the comments in that
> function particularly unhelpful (even after Oleg mentions this is
> case 2.)

Yeah, a comment like "This might fault if we race with the task
scheduling away and being destroyed, but we check that below".

But that code has some performance issues too, as mentioned. Which is
all kinds of sad since it clearly _tries_ to perform well with RCU
locking and optimistic accesses etc.

 Linus


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Linus Torvalds
On Fri, Aug 30, 2019 at 8:30 AM Linus Torvalds
 wrote:
>
> Do you actually see that behavior?
>
> Because the foillowing lines:
>
> smp_rmb();
> if (unlikely(task != READ_ONCE(*ptask)))
> goto retry;

Side note: that code had better not be performance-critical, because
"probe_kernel_address()" is actually really really slow.

We really should do a real set of "read kernel with fault handling" functions.

We have *one* right now: load_unaligned_zeropad(), but that one
assumes that at least the first byte is valid and that the fault can
only be because of unaligned page crossers.

The problem with "probe_kernel_address()" is that it really just
re-uses the user access functions, and then plays games to make them
work for kernel addresses. Games we shouldn't play, and it's all very
expensive when it really shouldn't need to be. Including changing
limits, but also doing all the system instructions to allow user
accesses (PAN on ARM, clac/stac on x86).

Doing a set of "access kernel with error handling" should be trivial,
it's just that every architecture needs to do it. So we'd probably
need to do something where architectures can say "I have it", and fall
back on the silly legacy implementation otherwise..

 Linus


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Russell King - ARM Linux admin
On Fri, Aug 30, 2019 at 08:30:00AM -0700, Linus Torvalds wrote:
> On Fri, Aug 30, 2019 at 7:08 AM Russell King - ARM Linux admin
>  wrote:
> >
> > which means that when probe_kernel_address() returns -EFAULT, the
> > destination is left uninitialised.  In the case of
> > task_rcu_dereference(), this means that "siginfo" can be used without
> > having been initialised, resulting in this function returning an
> > indeterminant result (based on the value of an uninitialised variable
> > on the stack.)
> 
> Do you actually see that behavior?

No, it was an observation of the code.

> Because the foillowing lines:
> 
> smp_rmb();
> if (unlikely(task != READ_ONCE(*ptask)))
> goto retry;
> 
> are what is supposed to protect it - yes, it could have faulted, but
> only if 'task' isn't valid any more, and we just re-checked it.

Ah, ok.  Might be worth some comments - I find the comments in that
function particularly unhelpful (even after Oleg mentions this is
case 2.)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Linus Torvalds
On Fri, Aug 30, 2019 at 7:08 AM Russell King - ARM Linux admin
 wrote:
>
> which means that when probe_kernel_address() returns -EFAULT, the
> destination is left uninitialised.  In the case of
> task_rcu_dereference(), this means that "siginfo" can be used without
> having been initialised, resulting in this function returning an
> indeterminant result (based on the value of an uninitialised variable
> on the stack.)

Do you actually see that behavior?

Because the foillowing lines:

smp_rmb();
if (unlikely(task != READ_ONCE(*ptask)))
goto retry;

are what is supposed to protect it - yes, it could have faulted, but
only if 'task' isn't valid any more, and we just re-checked it.

  Linus


Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Oleg Nesterov
On 08/30, Russell King - ARM Linux admin wrote:
>
> which means that when probe_kernel_address() returns -EFAULT, the
> destination is left uninitialised.  In the case of
> task_rcu_dereference(), this means that "siginfo" can be used without
> having been initialised,

Yes, but this is fine, please see the long comment below (case 2).

If probe_kernel_address() fails, "sighand" is not initialized. but this
doesn't differ from the case when we inspect the random value if this
task_struct was freed, then reallocated as another thing, then freed and
reallocated as task_struct again.

Oleg.



[BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

2019-08-30 Thread Russell King - ARM Linux admin
In commit 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and
try_get_task_struct()") probe_kernel_address() is used without checking
it's return value, resulting in undefined behaviour of this function.

I stumbled over this due to looking at the definition of this function,
due to a patch submission for ARM, and delving into the internals of the
probe_kernel_*() functions.

Essentially, probe_kernel_address() uses probe_kernel_read(), which
eventually uses __copy_from_user_inatomic().
__copy_from_user_inatomic() is defined thusly:

 * NOTE: only copy_from_user() zero-pads the destination in case of short copy.
 * Neither __copy_from_user() nor __copy_from_user_inatomic() zero anything
 * at all; their callers absolutely must check the return value.

which means that when probe_kernel_address() returns -EFAULT, the
destination is left uninitialised.  In the case of
task_rcu_dereference(), this means that "siginfo" can be used without
having been initialised, resulting in this function returning an
indeterminant result (based on the value of an uninitialised variable
on the stack.)

One option would be to explicitly initialise "sighand" on function
entry, or maybe check the return value of probe_kernel_address().  It
looks non-trivial due to the interaction with put_task_struct().  I
suggest someone who knows this code needs to patch this issue.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up