Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-25 Thread Mathieu Desnoyers
- On Sep 25, 2017, at 8:25 AM, Peter Zijlstra pet...@infradead.org wrote:

> On Mon, Sep 25, 2017 at 08:10:54PM +0800, Boqun Feng wrote:
>> > static void membarrier_register_private_expedited(void)
>> > {
>> > struct task_struct *p = current;
>> > 
>> > if (READ_ONCE(p->mm->membarrier_private_expedited))
>> > return;
>> > membarrier_arch_register_private_expedited(p);
> 
> Should we not then also do:
> 
>   barrier();
> 
>> > WRITE_ONCE(p->mm->membarrier_private_expedited, 1);
>> > }
> 
> to avoid the compiler lifting that store?

membarrier_arch_register_private_expedited() being a function call, I
recall that compilers cannot move loads/stores across it. Moreover, even
if that function were ever inlined, synchronize_sched() is needed at the
end of it to ensure the scheduler will observe the thread flags before
it returns, and that synchronize_sched() would then also act as a
compiler barrier.

So do you think we should still add the barrier() as documentation, or is
having synchronize_sched() in the callee enough ?
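
For reference, this is what it would look like if we keep an explicit
barrier() purely as documentation (a sketch of the same function quoted
above, nothing else changed):

static void membarrier_register_private_expedited(void)
{
	struct task_struct *p = current;

	if (READ_ONCE(p->mm->membarrier_private_expedited))
		return;
	membarrier_arch_register_private_expedited(p);
	/* Documentation: the call above already orders the store. */
	barrier();
	WRITE_ONCE(p->mm->membarrier_private_expedited, 1);
}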

By the way, I think I should add a READ_ONCE() in membarrier_private_expedited
to pair with the WRITE_ONCE() in registration, such as:

if (!READ_ONCE(current->mm->membarrier_private_expedited))
return -EPERM;

Thanks!

Mathieu



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-25 Thread Peter Zijlstra
On Mon, Sep 25, 2017 at 08:10:54PM +0800, Boqun Feng wrote:
> > static void membarrier_register_private_expedited(void)
> > {
> > struct task_struct *p = current;
> > 
> > if (READ_ONCE(p->mm->membarrier_private_expedited))
> > return;
> > membarrier_arch_register_private_expedited(p);

Should we not then also do:

barrier();

> > WRITE_ONCE(p->mm->membarrier_private_expedited, 1);
> > }

to avoid the compiler lifting that store?


Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-25 Thread Boqun Feng
On Sun, Sep 24, 2017 at 02:23:04PM +, Mathieu Desnoyers wrote:
[...]
> >> 
> >> copy_mm() is performed without holding current->sighand->siglock, so
> >> it appears to be racing with concurrent membarrier register cmd.
> > 
> > Speaking of racing, I think we currently have a problem if we do a
> > register_private_expedited in one thread and do a
> > membarrier_private_expedited in another thread (sharing the same mm), as
> > follows:
> > 
> > {t1,t2,t3 sharing the same ->mm; time flows top to bottom}
> > 
> > CPU 0 {in thread t1}:
> >   membarrier_register_private_expedited():
> >     ...
> >     WRITE_ONCE(->mm->membarrier_private_expedited, 1);
> >     membarrier_arch_register_private_expedited():
> >       ...  // TIF_MEMBARRIER_PRIVATE_EXPEDITED not yet set for t3
> > 
> > CPU 1 {in thread t2}:
> >   membarrier_private_expedited():
> >     READ_ONCE(->mm->membarrier_private_expedited); // == 1
> >     ...
> >     for_each_online_cpu()
> >       ...
> >       // reads CPU2's rq->curr, which is not t1/t2/t3 yet
> >       if (p && p->mm == current->mm) // false
> > 
> > CPU 2 {about to switch to t3}:
> >   rq->curr = t3;
> >   ...
> >   context_switch():
> >     ...
> >     finish_task_switch():
> >       membarrier_sched_in():
> >         // t3's TIF flag is not set, so
> >         // no smp_mb() here.
> > 
> > , and we will miss the smp_mb() on CPU2, right? And this could even
> > happen if t2 has a membarrier_register_private_expedited() preceding the
> > membarrier_private_expedited().
> > 
> > Am I missing something subtle here?
> 
> I think the problem sits in this function:
> 
> static void membarrier_register_private_expedited(void)
> {
> struct task_struct *p = current;
> 
> if (READ_ONCE(p->mm->membarrier_private_expedited))
> return;
> WRITE_ONCE(p->mm->membarrier_private_expedited, 1);
> membarrier_arch_register_private_expedited(p);
> }
> 
> I need to change the order between WRITE_ONCE() and invoking
> membarrier_arch_register_private_expedited. If I issue the
> WRITE_ONCE after the arch code (which sets the TIF flags),
> then concurrent membarrier priv exped commands will simply
> return an -EPERM error:
> 
> static void membarrier_register_private_expedited(void)
> {
> struct task_struct *p = current;
> 
> if (READ_ONCE(p->mm->membarrier_private_expedited))
> return;
> membarrier_arch_register_private_expedited(p);
> WRITE_ONCE(p->mm->membarrier_private_expedited, 1);
> }
> 
> Do you agree that this would fix the race you identified ?
> 

Yep, that will do the trick ;-)

Regards,
Boqun

> Thanks,
> 
> Mathieu
> 





Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-24 Thread Mathieu Desnoyers
- On Sep 24, 2017, at 9:30 AM, Boqun Feng boqun.f...@gmail.com wrote:

> On Fri, Sep 22, 2017 at 03:10:10PM +, Mathieu Desnoyers wrote:
>> - On Sep 22, 2017, at 4:59 AM, Boqun Feng boqun.f...@gmail.com wrote:
>> 
>> > On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
>> > [...]
>> >> +static inline void membarrier_arch_sched_in(struct task_struct *prev,
>> >> + struct task_struct *next)
>> >> +{
>> >> + /*
>> >> +  * Only need the full barrier when switching between processes.
>> >> +  */
>> >> + if (likely(!test_ti_thread_flag(task_thread_info(next),
>> >> + TIF_MEMBARRIER_PRIVATE_EXPEDITED)
>> >> + || prev->mm == next->mm))
>> > 
>> > And we also don't need the smp_mb() if !prev->mm, because switching from
>> > kernel to user will have a smp_mb() implied by mmdrop()?
>> 
>> Right. And we also don't need it when switching from userspace to kernel
> 
> Yep, but this case is covered already, as I think we don't allow kernel
> thread to have TIF_MEMBARRIER_PRIVATE_EXPEDITED set, right?

Good point.

> 
>> thread neither. Something like this:
>> 
>> static inline void membarrier_arch_sched_in(struct task_struct *prev,
>> struct task_struct *next)
>> {
>> /*
>>  * Only need the full barrier when switching between processes.
>>  * Barrier when switching from kernel to userspace is not
>>  * required here, given that it is implied by mmdrop(). Barrier
>>  * when switching from userspace to kernel is not needed after
>>  * store to rq->curr.
>>  */
>> if (likely(!test_ti_thread_flag(task_thread_info(next),
>> TIF_MEMBARRIER_PRIVATE_EXPEDITED)
>> || !prev->mm || !next->mm || prev->mm == next->mm))
> 
> , so no need to test next->mm here.
> 

Right, it's redundant wrt testing the thread flag; an updated sketch of the
check follows below the quoted function.

>> return;
>> 
>> /*
>>  * The membarrier system call requires a full memory barrier
>>  * after storing to rq->curr, before going back to user-space.
>>  */
>> smp_mb();
>> }
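
With next->mm dropped, the test would boil down to something like this
(sketch only; the rest of the function stays as quoted above):

	if (likely(!test_ti_thread_flag(task_thread_info(next),
			TIF_MEMBARRIER_PRIVATE_EXPEDITED)
			|| !prev->mm || prev->mm == next->mm))
		return;
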
>> 
>> > 
>> >> + return;
>> >> +
>> >> + /*
>> >> +  * The membarrier system call requires a full memory barrier
>> >> +  * after storing to rq->curr, before going back to user-space.
>> >> +  */
>> >> + smp_mb();
>> >> +}
>> > 
>> > [...]
>> > 
>> >> +static inline void membarrier_fork(struct task_struct *t,
>> >> + unsigned long clone_flags)
>> >> +{
>> >> + if (!current->mm || !t->mm)
>> >> + return;
>> >> + t->mm->membarrier_private_expedited =
>> >> + current->mm->membarrier_private_expedited;
>> > 
>> > Have we already done the copy of ->membarrier_private_expedited in
>> > copy_mm()?
>> 
>> copy_mm() is performed without holding current->sighand->siglock, so
>> it appears to be racing with concurrent membarrier register cmd.
> 
> Speaking of racing, I think we currently have a problem if we do a
> register_private_expedited in one thread and do a
> membarrier_private_expedited in another thread (sharing the same mm), as
> follows:
> 
>   {t1,t2,t3 sharing the same ->mm; time flows top to bottom}
> 
>   CPU 0 {in thread t1}:
>     membarrier_register_private_expedited():
>       ...
>       WRITE_ONCE(->mm->membarrier_private_expedited, 1);
>       membarrier_arch_register_private_expedited():
>         ...  // TIF_MEMBARRIER_PRIVATE_EXPEDITED not yet set for t3
> 
>   CPU 1 {in thread t2}:
>     membarrier_private_expedited():
>       READ_ONCE(->mm->membarrier_private_expedited); // == 1
>       ...
>       for_each_online_cpu()
>         ...
>         // reads CPU2's rq->curr, which is not t1/t2/t3 yet
>         if (p && p->mm == current->mm) // false
> 
>   CPU 2 {about to switch to t3}:
>     rq->curr = t3;
>     ...
>     context_switch():
>       ...
>       finish_task_switch():
>         membarrier_sched_in():
>           // t3's TIF flag is not set, so
>           // no smp_mb() here.


Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-24 Thread Boqun Feng
On Fri, Sep 22, 2017 at 03:10:10PM +, Mathieu Desnoyers wrote:
> - On Sep 22, 2017, at 4:59 AM, Boqun Feng boqun.f...@gmail.com wrote:
> 
> > On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
> > [...]
> >> +static inline void membarrier_arch_sched_in(struct task_struct *prev,
> >> +  struct task_struct *next)
> >> +{
> >> +  /*
> >> +   * Only need the full barrier when switching between processes.
> >> +   */
> >> +  if (likely(!test_ti_thread_flag(task_thread_info(next),
> >> +  TIF_MEMBARRIER_PRIVATE_EXPEDITED)
> >> +  || prev->mm == next->mm))
> > 
> > And we also don't need the smp_mb() if !prev->mm, because switching from
> > kernel to user will have a smp_mb() implied by mmdrop()?
> 
> Right. And we also don't need it when switching from userspace to kernel

Yep, but this case is covered already, as I think we don't allow kernel
thread to have TIF_MEMBARRIER_PRIVATE_EXPEDITED set, right?

> thread neither. Something like this:
> 
> static inline void membarrier_arch_sched_in(struct task_struct *prev,
> struct task_struct *next)
> {
> /*
>  * Only need the full barrier when switching between processes.
>  * Barrier when switching from kernel to userspace is not
>  * required here, given that it is implied by mmdrop(). Barrier
>  * when switching from userspace to kernel is not needed after
>  * store to rq->curr.
>  */
> if (likely(!test_ti_thread_flag(task_thread_info(next),
> TIF_MEMBARRIER_PRIVATE_EXPEDITED)
> || !prev->mm || !next->mm || prev->mm == next->mm))

, so no need to test next->mm here.

> return;
> 
> /*
>  * The membarrier system call requires a full memory barrier
>  * after storing to rq->curr, before going back to user-space.
>  */
> smp_mb();
> }
> 
> > 
> >> +  return;
> >> +
> >> +  /*
> >> +   * The membarrier system call requires a full memory barrier
> >> +   * after storing to rq->curr, before going back to user-space.
> >> +   */
> >> +  smp_mb();
> >> +}
> > 
> > [...]
> > 
> >> +static inline void membarrier_fork(struct task_struct *t,
> >> +  unsigned long clone_flags)
> >> +{
> >> +  if (!current->mm || !t->mm)
> >> +  return;
> >> +  t->mm->membarrier_private_expedited =
> >> +  current->mm->membarrier_private_expedited;
> > 
> > Have we already done the copy of ->membarrier_private_expedited in
> > copy_mm()?
> 
> copy_mm() is performed without holding current->sighand->siglock, so
> it appears to be racing with concurrent membarrier register cmd.

Speaking of racing, I think we currently have a problem if we do a
register_private_expedited in one thread and do a
membarrier_private_expedited in another thread (sharing the same mm), as
follows:

	{t1,t2,t3 sharing the same ->mm; time flows top to bottom}

	CPU 0 {in thread t1}:
	  membarrier_register_private_expedited():
	    ...
	    WRITE_ONCE(->mm->membarrier_private_expedited, 1);
	    membarrier_arch_register_private_expedited():
	      ...  // TIF_MEMBARRIER_PRIVATE_EXPEDITED not yet set for t3

	CPU 1 {in thread t2}:
	  membarrier_private_expedited():
	    READ_ONCE(->mm->membarrier_private_expedited); // == 1
	    ...
	    for_each_online_cpu()
	      ...
	      // reads CPU2's rq->curr, which is not t1/t2/t3 yet
	      if (p && p->mm == current->mm) // false

	CPU 2 {about to switch to t3}:
	  rq->curr = t3;
	  ...
	  context_switch():
	    ...
	    finish_task_switch():
	      membarrier_sched_in():
	        // t3's TIF flag is not set, so
	        // no smp_mb() here.

, and we will miss the smp_mb() on CPU2, right? And this could even
happen if t2 has a membarrier_register_private_expedited() preceding the
membarrier_private_expedited().


Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-22 Thread Mathieu Desnoyers
- On Sep 22, 2017, at 4:59 AM, Boqun Feng boqun.f...@gmail.com wrote:

> On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
> [...]
>> +static inline void membarrier_arch_sched_in(struct task_struct *prev,
>> +struct task_struct *next)
>> +{
>> +/*
>> + * Only need the full barrier when switching between processes.
>> + */
>> +if (likely(!test_ti_thread_flag(task_thread_info(next),
>> +TIF_MEMBARRIER_PRIVATE_EXPEDITED)
>> +|| prev->mm == next->mm))
> 
> And we also don't need the smp_mb() if !prev->mm, because switching from
> kernel to user will have a smp_mb() implied by mmdrop()?

Right. And we also don't need it when switching from a userspace task to a
kernel thread either. Something like this:

static inline void membarrier_arch_sched_in(struct task_struct *prev,
		struct task_struct *next)
{
	/*
	 * Only need the full barrier when switching between processes.
	 * Barrier when switching from kernel to userspace is not
	 * required here, given that it is implied by mmdrop(). Barrier
	 * when switching from userspace to kernel is not needed after
	 * store to rq->curr.
	 */
	if (likely(!test_ti_thread_flag(task_thread_info(next),
			TIF_MEMBARRIER_PRIVATE_EXPEDITED)
			|| !prev->mm || !next->mm || prev->mm == next->mm))
		return;

	/*
	 * The membarrier system call requires a full memory barrier
	 * after storing to rq->curr, before going back to user-space.
	 */
	smp_mb();
}

> 
>> +return;
>> +
>> +/*
>> + * The membarrier system call requires a full memory barrier
>> + * after storing to rq->curr, before going back to user-space.
>> + */
>> +smp_mb();
>> +}
> 
> [...]
> 
>> +static inline void membarrier_fork(struct task_struct *t,
>> +unsigned long clone_flags)
>> +{
>> +if (!current->mm || !t->mm)
>> +return;
>> +t->mm->membarrier_private_expedited =
>> +current->mm->membarrier_private_expedited;
> 
> Have we already done the copy of ->membarrier_private_expedited in
> copy_mm()?

copy_mm() is performed without holding current->sighand->siglock, so
it appears to be racing with concurrent membarrier register cmd.
However, given that it is a single flag updated with WRITE_ONCE()
and read with READ_ONCE(), it might be OK to rely on copy_mm() there.
If userspace runs registration concurrently with fork, it should not
expect the child to end up specifically registered or unregistered.

So yes, I think you are right about removing this copy and relying on
copy_mm() instead. I also think we can improve membarrier_arch_fork()
on powerpc to test the current thread flag rather than using current->mm.

Which leads to those two changes:

static inline void membarrier_fork(struct task_struct *t,
		unsigned long clone_flags)
{
	/*
	 * Prior copy_mm() copies the membarrier_private_expedited field
	 * from current->mm to t->mm.
	 */
	membarrier_arch_fork(t, clone_flags);
}

And on PowerPC:

static inline void membarrier_arch_fork(struct task_struct *t,
		unsigned long clone_flags)
{
	/*
	 * Coherence of TIF_MEMBARRIER_PRIVATE_EXPEDITED against thread
	 * fork is protected by siglock. membarrier_arch_fork is called
	 * with siglock held.
	 */
	if (test_thread_flag(TIF_MEMBARRIER_PRIVATE_EXPEDITED))
		set_ti_thread_flag(task_thread_info(t),
				TIF_MEMBARRIER_PRIVATE_EXPEDITED);
}

Thanks,

Mathieu


> 
> Regards,
> Boqun
> 
>> +membarrier_arch_fork(t, clone_flags);
>> +}
>> +static inline void membarrier_execve(struct task_struct *t)
>> +{
>> +t->mm->membarrier_private_expedited = 0;
>> +membarrier_arch_execve(t);
>> +}
>> +#else
>> +static inline void membarrier_sched_in(struct task_struct *prev,
>> +struct task_struct *next)
>> +{
>> +}
>> +static inline void membarrier_fork(struct task_struct *t,
>> +unsigned long clone_flags)
>> +{
>> +}
>> +static inline void membarrier_execve(struct task_struct *t)
>> +{
>> +}
>> +#endif
>> +
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com



Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-22 Thread Boqun Feng
On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
[...]
> +static inline void membarrier_arch_sched_in(struct task_struct *prev,
> + struct task_struct *next)
> +{
> + /*
> +  * Only need the full barrier when switching between processes.
> +  */
> + if (likely(!test_ti_thread_flag(task_thread_info(next),
> + TIF_MEMBARRIER_PRIVATE_EXPEDITED)
> + || prev->mm == next->mm))

And we also don't need the smp_mb() if !prev->mm, because switching from
kernel to user will have a smp_mb() implied by mmdrop()?

> + return;
> +
> + /*
> +  * The membarrier system call requires a full memory barrier
> +  * after storing to rq->curr, before going back to user-space.
> +  */
> + smp_mb();
> +}
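
(For reference, the smp_mb() implied by mmdrop() mentioned above comes from
the mm drop in finish_task_switch(); roughly, as a paraphrased sketch rather
than exact kernel source:

	/*
	 * finish_task_switch(): rq->prev_mm holds the mm that was
	 * borrowed by the previous kernel thread, if any.
	 */
	struct mm_struct *mm = rq->prev_mm;
	...
	if (mm)
		mmdrop(mm);	/* atomic_dec_and_test() => full barrier */

so the kernel-to-user direction already gets a full barrier there.)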

[...]

> +static inline void membarrier_fork(struct task_struct *t,
> + unsigned long clone_flags)
> +{
> + if (!current->mm || !t->mm)
> + return;
> + t->mm->membarrier_private_expedited =
> + current->mm->membarrier_private_expedited;

Have we already done the copy of ->membarrier_private_expedited in
copy_mm()?

Regards,
Boqun

> + membarrier_arch_fork(t, clone_flags);
> +}
> +static inline void membarrier_execve(struct task_struct *t)
> +{
> + t->mm->membarrier_private_expedited = 0;
> + membarrier_arch_execve(t);
> +}
> +#else
> +static inline void membarrier_sched_in(struct task_struct *prev,
> + struct task_struct *next)
> +{
> +}
> +static inline void membarrier_fork(struct task_struct *t,
> + unsigned long clone_flags)
> +{
> +}
> +static inline void membarrier_execve(struct task_struct *t)
> +{
> +}
> +#endif
> +
[...]





Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-22 Thread Boqun Feng
On Fri, Sep 22, 2017 at 10:24:41AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 11:22:06AM +0800, Boqun Feng wrote:
> 
> > The idea is in membarrier_private_expedited(), we go through all ->curr
> > on each CPU and 
> > 
> > 1)  If it's a userspace task and its ->mm is matched, we send an ipi
> > 
> > 2)  If it's a kernel task, we skip
> > 
> > (Because there will be a smp_mb() implied by mmdrop(), when it
> > switches to a userspace task).
> > 
> > 3)  If it's a userspace task and its ->mm is not matched, we take
> > the corresponding rq->lock and check rq->curr again, if its ->mm
> > matched, we send an ipi, otherwise we do nothing.
> > 
> > (Because if we observe rq->curr is not matched with rq->lock
> > held, when a task having matched ->mm schedules in, the rq->lock
> > pairing along with the smp_mb__after_spinlock() will guarantee
> > it observes all memory ops before sys_membarrier()).
> 
> 3) is an insta DoS.

OK, fair enough.

Regards,
Boqun





Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-22 Thread Peter Zijlstra
On Fri, Sep 22, 2017 at 11:22:06AM +0800, Boqun Feng wrote:

> The idea is in membarrier_private_expedited(), we go through all ->curr
> on each CPU and 
> 
> 1)If it's a userspace task and its ->mm is matched, we send an ipi
> 
> 2)If it's a kernel task, we skip
> 
>   (Because there will be a smp_mb() implied by mmdrop(), when it
>   switches to a userspace task).
> 
> 3)If it's a userspace task and its ->mm is not matched, we take
>   the corresponding rq->lock and check rq->curr again, if its ->mm
>   matched, we send an ipi, otherwise we do nothing.
> 
>   (Because if we observe rq->curr is not matched with rq->lock
>   held, when a task having matched ->mm schedules in, the rq->lock
>   pairing along with the smp_mb__after_spinlock() will guarantee
>   it observes all memory ops before sys_membarrier()).

3) is an insta DoS.



Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-21 Thread Mathieu Desnoyers
- On Sep 21, 2017, at 11:30 PM, Boqun Feng boqun.f...@gmail.com wrote:

> On Fri, Sep 22, 2017 at 11:22:06AM +0800, Boqun Feng wrote:
>> Hi Mathieu,
>> 
>> On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
>> > Provide a new command allowing processes to register their intent to use
>> > the private expedited command.
>> > 
>> > This allows PowerPC to skip the full memory barrier in switch_mm(), and
>> > only issue the barrier when scheduling into a task belonging to a
>> > process that has registered to use expedited private.
>> > 
>> > Processes are now required to register before using
>> > MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
>> > 
>> 
>> Sorry I'm late for the party, but I couldn't stop thinking whether we
>> could avoid the register thing at all, because the registering makes
>> sys_membarrier() more complex(both for the interface and the
>> implementation). So how about we trade-off a little bit by taking
>> some(not all) the rq->locks?
>> 
>> The idea is in membarrier_private_expedited(), we go through all ->curr
>> on each CPU and
>> 
>> 1)   If it's a userspace task and its ->mm is matched, we send an ipi
>> 
>> 2)   If it's a kernel task, we skip
>> 
>>  (Because there will be a smp_mb() implied by mmdrop(), when it
>>  switches to a userspace task).
>> 
>> 3)   If it's a userspace task and its ->mm is not matched, we take
>>  the corresponding rq->lock and check rq->curr again, if its ->mm
>>  matched, we send an ipi, otherwise we do nothing.
>> 
>>  (Because if we observe rq->curr is not matched with rq->lock
>>  held, when a task having matched ->mm schedules in, the rq->lock
>>  pairing along with the smp_mb__after_spinlock() will guarantee
>>  it observes all memory ops before sys_membarrier()).
>> 
>> membarrier_private_expedited() will look like this if we choose this
>> way:
>> 
>> void membarrier_private_expedited()
>> {
>>  int cpu;
>>  bool fallback = false;
>>  cpumask_var_t tmpmask;
>>  struct rq_flags rf;
>> 
>> 
>>  if (num_online_cpus() == 1)
>>  return;
>> 
>>  smp_mb();
>> 
>>  if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>>  /* Fallback for OOM. */
>>  fallback = true;
>>  }
>> 
>>  cpus_read_lock();
>>  for_each_online_cpu(cpu) {
>>  struct task_struct *p;
>> 
>>  if (cpu == raw_smp_processor_id())
>>  continue;
>> 
>>  rcu_read_lock();
>>  p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> 
>>  if (!p) {
>>  rcu_read_unlock();
>>  continue;
>>  }
>> 
>>  if (p->mm == current->mm) {
>>  if (!fallback)
>>  __cpumask_set_cpu(cpu, tmpmask);
>>  else
>>  smp_call_function_single(cpu, ipi_mb, NULL, 1);
>>  }
>> 
>>  if (p->mm == current->mm || !p->mm) {
>>  rcu_read_unlock();
>>  continue;
>>  }
>> 
>>  rcu_read_unlock();
>>  
>>  /*
>>   * This should be arch-specific code, as we don't
>>   * need it anywhere other than on archs without
>>   * a smp_mb() in switch_mm() (i.e. powerpc)
>>   */
>>  rq_lock_irq(cpu_rq(cpu), &rf);
>>  if (p->mm == current->mm) {
> 
> Oops, this one should be
> 
>   if (cpu_curr(cpu)->mm == current->mm)
> 
>>  if (!fallback)
>>  __cpumask_set_cpu(cpu, tmpmask);
>>  else
>>  smp_call_function_single(cpu, ipi_mb, NULL, 1);
> 
> , and this better be moved out of the lock rq->lock critical section.
> 
> Regards,
> Boqun
> 
>>  }
>>  rq_unlock_irq(cpu_rq(cpu), &rf);
>>  }
>>  if (!fallback) {
>>  smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>>  free_cpumask_var(tmpmask);
>>  }
>>  cpus_read_unlock();
>> 
>>  smp_mb();
>> }
>> 
>> Thoughts?

Hi Boqun,

The main concern Peter has with the runqueue locking approach
is interference with the scheduler caused by hitting all CPUs'
runqueue locks repeatedly if someone performs membarrier system
calls in a short loop.

Just reading the rq->curr pointer does not generate as much
overhead as grabbing each rq lock.
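
To make that concrete, the pattern to worry about is just an unprivileged
process doing something like the following (illustrative sketch; it assumes
a uapi header exposing MEMBARRIER_CMD_PRIVATE_EXPEDITED, and that the command
is usable without prior registration, as in your proposal):

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Under the rq->lock scheme, each call may grab remote rq locks. */
	for (;;)
		syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
}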

Thanks,

Mathieu


>> 
>> Regards,
>> Boqun
>> 
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com



Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-21 Thread Boqun Feng
On Fri, Sep 22, 2017 at 11:22:06AM +0800, Boqun Feng wrote:
> Hi Mathieu,
> 
> On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
> > Provide a new command allowing processes to register their intent to use
> > the private expedited command.
> > 
> > This allows PowerPC to skip the full memory barrier in switch_mm(), and
> > only issue the barrier when scheduling into a task belonging to a
> > process that has registered to use expedited private.
> > 
> > Processes are now required to register before using
> > MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
> > 
> 
> Sorry I'm late for the party, but I couldn't stop thinking whether we
> could avoid the register thing at all, because the registering makes
> sys_membarrier() more complex(both for the interface and the
> implementation). So how about we trade-off a little bit by taking
> some(not all) the rq->locks?
> 
> The idea is in membarrier_private_expedited(), we go through all ->curr
> on each CPU and 
> 
> 1)If it's a userspace task and its ->mm is matched, we send an ipi
> 
> 2)If it's a kernel task, we skip
> 
>   (Because there will be a smp_mb() implied by mmdrop(), when it
>   switches to a userspace task).
> 
> 3)If it's a userspace task and its ->mm is not matched, we take
>   the corresponding rq->lock and check rq->curr again, if its ->mm
>   matched, we send an ipi, otherwise we do nothing.
> 
>   (Because if we observe rq->curr is not matched with rq->lock
>   held, when a task having matched ->mm schedules in, the rq->lock
>   pairing along with the smp_mb__after_spinlock() will guarantee
>   it observes all memory ops before sys_membarrier()).
> 
> membarrier_private_expedited() will look like this if we choose this
> way:
> 
> void membarrier_private_expedited()
> {
>   int cpu;
>   bool fallback = false;
>   cpumask_var_t tmpmask;
>   struct rq_flags rf;
> 
> 
>   if (num_online_cpus() == 1)
>   return;
> 
>   smp_mb();
> 
>   if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
>   /* Fallback for OOM. */
>   fallback = true;
>   }
> 
>   cpus_read_lock();
>   for_each_online_cpu(cpu) {
>   struct task_struct *p;
> 
>   if (cpu == raw_smp_processor_id())
>   continue;
> 
>   rcu_read_lock();
>   p = task_rcu_dereference(&cpu_rq(cpu)->curr);
> 
>   if (!p) {
>   rcu_read_unlock();
>   continue;
>   }
> 
>   if (p->mm == current->mm) {
>   if (!fallback)
>   __cpumask_set_cpu(cpu, tmpmask);
>   else
>   smp_call_function_single(cpu, ipi_mb, NULL, 1);
>   }
> 
>   if (p->mm == current->mm || !p->mm) {
>   rcu_read_unlock();
>   continue;
>   }
> 
>   rcu_read_unlock();
>   
>   /*
>    * This should be arch-specific code, as we don't
>    * need it anywhere other than on archs without
>    * a smp_mb() in switch_mm() (i.e. powerpc)
>    */
>   rq_lock_irq(cpu_rq(cpu), &rf);
>   if (p->mm == current->mm) {

Oops, this one should be

if (cpu_curr(cpu)->mm == current->mm)

>   if (!fallback)
>   __cpumask_set_cpu(cpu, tmpmask);
>   else
>   smp_call_function_single(cpu, ipi_mb, NULL, 1);

, and this had better be moved out of the rq->lock critical section.
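
Something like this for that hunk, i.e. only record the decision under
rq->lock and do the cpumask update/IPI after dropping it (sketch only; the
send_ipi flag is just illustrative and would be declared with the other
locals):

		bool send_ipi = false;

		rq_lock_irq(cpu_rq(cpu), &rf);
		if (cpu_curr(cpu)->mm == current->mm)
			send_ipi = true;
		rq_unlock_irq(cpu_rq(cpu), &rf);

		if (send_ipi) {
			if (!fallback)
				__cpumask_set_cpu(cpu, tmpmask);
			else
				smp_call_function_single(cpu, ipi_mb, NULL, 1);
		}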

Regards,
Boqun

>   }
>   rq_unlock_irq(cpu_rq(cpu), &rf);
>   }
>   if (!fallback) {
>   smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>   free_cpumask_var(tmpmask);
>   }
>   cpus_read_unlock();
> 
>   smp_mb();
> }
> 
> Thoughts?
> 
> Regards,
> Boqun
> 
[...]




Re: [RFC PATCH v3 1/2] membarrier: Provide register expedited private command

2017-09-21 Thread Boqun Feng
Hi Mathieu,

On Tue, Sep 19, 2017 at 06:13:41PM -0400, Mathieu Desnoyers wrote:
> Provide a new command allowing processes to register their intent to use
> the private expedited command.
> 
> This allows PowerPC to skip the full memory barrier in switch_mm(), and
> only issue the barrier when scheduling into a task belonging to a
> process that has registered to use expedited private.
> 
> Processes are now required to register before using
> MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
> 

Sorry I'm late for the party, but I couldn't stop thinking whether we
could avoid the register thing at all, because the registering makes
sys_membarrier() more complex (both for the interface and the
implementation). So how about we trade off a little bit by taking
some (not all) of the rq->locks?

The idea is that in membarrier_private_expedited(), we go through all ->curr
on each CPU and 

1)  If it's a userspace task and its ->mm is matched, we send an ipi

2)  If it's a kernel task, we skip

(Because there will be a smp_mb() implied by mmdrop(), when it
switches to a userspace task).

3)  If it's a userspace task and its ->mm is not matched, we take
the corresponding rq->lock and check rq->curr again, if its ->mm
matched, we send an ipi, otherwise we do nothing.

(Because if we observe rq->curr is not matched with rq->lock
held, when a task having matched ->mm schedules in, the rq->lock
pairing along with the smp_mb__after_spinlock() will guarantee
it observes all memory ops before sys_membarrier()).

membarrier_private_expedited() will look like this if we choose this
way:

void membarrier_private_expedited()
{
	int cpu;
	bool fallback = false;
	cpumask_var_t tmpmask;
	struct rq_flags rf;

	if (num_online_cpus() == 1)
		return;

	smp_mb();

	if (!zalloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
		/* Fallback for OOM. */
		fallback = true;
	}

	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		if (cpu == raw_smp_processor_id())
			continue;

		rcu_read_lock();
		p = task_rcu_dereference(&cpu_rq(cpu)->curr);

		if (!p) {
			rcu_read_unlock();
			continue;
		}

		if (p->mm == current->mm) {
			if (!fallback)
				__cpumask_set_cpu(cpu, tmpmask);
			else
				smp_call_function_single(cpu, ipi_mb, NULL, 1);
		}

		if (p->mm == current->mm || !p->mm) {
			rcu_read_unlock();
			continue;
		}

		rcu_read_unlock();

		/*
		 * This should be arch-specific code, as we don't
		 * need it anywhere other than on archs without
		 * a smp_mb() in switch_mm() (i.e. powerpc)
		 */
		rq_lock_irq(cpu_rq(cpu), &rf);
		if (p->mm == current->mm) {
			if (!fallback)
				__cpumask_set_cpu(cpu, tmpmask);
			else
				smp_call_function_single(cpu, ipi_mb, NULL, 1);
		}
		rq_unlock_irq(cpu_rq(cpu), &rf);
	}
	if (!fallback) {
		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
		free_cpumask_var(tmpmask);
	}
	cpus_read_unlock();

	smp_mb();
}

Thoughts?

Regards,
Boqun

> Changes since v1:
> - Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
>   powerpc membarrier_arch_sched_in(), given that we want to specifically
>   check the next thread state.
> - Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
> - Use task_thread_info() to pass thread_info from task to
>   *_ti_thread_flag().
> 
> Changes since v2:
> - Move membarrier_arch_sched_in() call to finish_task_switch().
> - Check for NULL t->mm in membarrier_arch_fork().
> - Use membarrier_sched_in() in generic code, which invokes the
>   arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
>   build on PowerPC.
> - Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
>   allnoconfig build on PowerPC.
> - Build and runtime tested on PowerPC.
> 
> Signed-off-by: Mathieu Desnoyers 
> CC: Peter Zijlstra 
> CC: Paul E. McKenney 
> CC: Boqun Feng 
> CC: Andrew Hunter 
> CC: Maged Michael 
> CC: gro...@google.com
> CC: Avi Kivity 
> CC: Benjamin Herrenschmidt 
> CC: Paul Mackerras 
> CC: Michael Ellerman 
> CC: Dave Watson 
> CC: Alan Stern 
> CC: Will Deacon 
> CC: Andy Lutomirski 
> CC: linux-a...@vger.kernel.org
> ---
>  MAINTAINERS|  2 ++
>  arch/powerpc/Kconfig