Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > on return from idle we'd do:
> >
> > rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
> >
> > etc. On all boundary transitions we can use a constant ADD on a 
> > suitable percpu variable.
> 
> Sounds good to me, except that we need to be careful to distinguish 
> between non-syscall entries from quiescent states
> and non-syscall entries from quiescent states.

It might be hard to make that distinction! ;-)

I suspect you wanted to raise the issue of various contexts nesting on 
each other, such as syscall triggering a page fault, which gets an irq 
nested, etc. - versus non-nested contexts such as user-space 
triggering a page fault or user-space getting an irq?

> [...]  We could save the old state (as the current exception_enter 
> code does) or we could allocate enough low bits for the state that 
> the problem goes away.

So I think, if it all works in practice just as well as it does in 
email, we might be better off with more state bits: that would tell 
any remote statistics/sampling code more as well.

It might also be easier to patch in/out, because this kind of state 
tracking will affect non-RT CPUs as well. (Later on we could do a 
separate IDT for RT CPUs as well, with a patched version of the entry 
code.)
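
As a concrete illustration of the constant-ADD scheme quoted above, 
here is a minimal sketch - assuming IN_KERNEL=0, IN_USER=1, IN_IDLE=2 
live in the low two bits and the QS count starts at bit 2; the names 
are made up, this is not actual patch code:

/* Low two bits: mode. Bits 2 and up: quiescent state count. */
#define RCU_QS_MODE_MASK	3UL
#define RCU_QS_CTR_STEP		4UL	/* one QS == += 4 */

#define IN_KERNEL		0UL
#define IN_USER			1UL
#define IN_IDLE			2UL

DEFINE_PER_CPU(unsigned long, rcu_qs_ctr);

/*
 * Return-from-idle boundary: register a QS (+4), clear the idle
 * bit (-2), set the user bit (+1) - a single constant ADD:
 */
static inline void rcu_qs_idle_to_user(void)
{
	__this_cpu_add(rcu_qs_ctr, RCU_QS_CTR_STEP - IN_IDLE + IN_USER);
}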

> I don't think the TIF_RCU_QS variant is worthwhile -- merging the 
> counter and state is probably both easier and faster.

Yeah.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Andy Lutomirski
On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar  wrote:
>
> * Andy Lutomirski  wrote:
>
>> On May 8, 2015 12:07 PM, "Ingo Molnar"  wrote:
>> >
>> >
>> > * Andy Lutomirski  wrote:
>> >
>> > > > So do you mean:
>> > > >
>> > > >this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >...
>> > > >this_cpu_inc(rcu_qs_ctr);
>> > > >this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > > > ?
>> > > >
>> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
>> > > > it just two simple stores into a byte flag, one on entry and one on
>> > > > exit:
>> > > >
>> > > >this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >...
>> > > >this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > >
>> > > I was thinking that either a counter or a state flag could make sense.
>> > > Doing both would be pointless.  The counter could use the low bits to
>> > > indicate the state.  The benefit of the counter would be that the
>> > > RCU-waiting CPU could observe that the counter has incremented and
>> > > that therefore a grace period has elapsed.  Getting it right would
>> > > require lots of care.
>> >
>> > So if you mean:
>> >
>> >    <syscall entry>
>> >    ...
>> >    this_cpu_inc(rcu_qs_ctr);
>> >    <syscall exit>
>> >
>> > I don't see how this would work reliably: how do you handle the case
>> > of a SCHED_FIFO task never returning from user-space (common technique
>> > in RT apps)? synchronize_rcu() would block indefinitely as it would
>> > never see rcu_qs_ctr increase.
>> >
>> > We have to be able to observe user-mode anyway, for system-time
>> > statistics purposes, and that flag could IMHO also drive the RCU GC
>> > machinery.
>>
>> The counter would have to be fancier than that to work.  We could
>> say that an even value means user mode, for example.  IOW the high
>> bits of the counter would count transitions to quiescent and the low
>> bits would indicate the state.
>>
>> Actually, I'd count backwards.  We could use an andl instruction to
>> go to user mode and a decl to go to kernel mode.
>
> at which point it's not really a monotonic quiescent state counter -
> which your naming suggested before.
>
> Also it would have to be done both at entry and at exit.
>
> But yes, if you replace a state flag with a recursive state flag that
> is INC/DEC maintained that would work as well. That's not a counter
> though.

I won't quibble about the name :)

>
>> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
>> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
>> > > because stores can be delayed.  (It could even happen after
>> > > sysret, IIRC, but IRET is serializing.)
>> >
>> > All it has to do in the synchronize_rcu() slowpath is something
>> > like:
>>
>> I don't think this works.  Suppose rt_cpu starts in kernel mode.
>> Then it checks ti flags.
>>
>> >
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> > smp_mb__before_atomic();
>> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> > smp_rmb();
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>>
>> Now it sets rcu_state and exits to user mode.  It never notices
>> TIF_RCU_NOTIFY.
>
> Indeed, you are right, that's racy.
>
>> [...]  We could fix this by sending an IPI to kick it.
>
> With an IPI we won't need any flags or counters on the RT CPU - it's
> the IPI we are trying to avoid.
>
> So how about the following way to fix the race: simply do a poll loop
> with a fixed timeout sleep: like it would do anyway if
> synchronize_rcu() was waiting for the timer irq to end the grace
> period on the RT-CPU.

Seems reasonable to me.

>
> The TIF flag would cause an RCU grace period to elapse, no need to wake 
> up the synchronize_rcu() side: it will notice the (regular) increased
> RCU counter.
>
>> > We are talking about a dozen cycles, while a typical
>> > synchronize_rcu() will wait millions (sometimes billions) of
>> > cycles. There's absolutely zero performance concern here and it's
>> > all CPU local in any case.
>>
>> The barrier would affect all syscalls, though. [...]
>
> What barrier? I never suggested any barrier in the syscall fast path,
> this bit:

Oh, I misunderstood.

>
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> > smp_mb__before_atomic();
>> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> > smp_rmb();
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> runs inside synchronize_rcu() or so.
>
> But none of that is needed if we do:
>
> - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context
>   entry/exit time, straight in the assembly
>
> - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
>   nothing but: this_cpu_inc(rcu_qs_ctr).
>
> - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
>   then does a timeout-poll-loop with jiffies granular
>   timeouts, simulating a timer IRQ in essence. It will either 
>   observe IN_USER or will see the regular RCU qs counter 
>   increase.

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On May 8, 2015 12:07 PM, "Ingo Molnar"  wrote:
> >
> >
> > * Andy Lutomirski  wrote:
> >
> > > > So do you mean:
> > > >
> > > >this_cpu_set(rcu_state) = IN_KERNEL;
> > > >...
> > > >this_cpu_inc(rcu_qs_ctr);
> > > >this_cpu_set(rcu_state) = IN_USER;
> > > >
> > > > ?
> > > >
> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > > it just two simple stores into a byte flag, one on entry and one on
> > > > exit:
> > > >
> > > >this_cpu_set(rcu_state) = IN_KERNEL;
> > > >...
> > > >this_cpu_set(rcu_state) = IN_USER;
> > > >
> > >
> > > I was thinking that either a counter or a state flag could make sense.
> > > Doing both would be pointless.  The counter could use the low bits to
> > > indicate the state.  The benefit of the counter would be that the
> > > RCU-waiting CPU could observe that the counter has incremented and
> > > that therefore a grace period has elapsed.  Getting it right would
> > > require lots of care.
> >
> > So if you mean:
> >
> >    <syscall entry>
> >    ...
> >    this_cpu_inc(rcu_qs_ctr);
> >    <syscall exit>
> >
> > I don't see how this would work reliably: how do you handle the case
> > of a SCHED_FIFO task never returning from user-space (common technique
> > in RT apps)? synchronize_rcu() would block indefinitely as it would
> > never see rcu_qs_ctr increase.
> >
> > We have to be able to observe user-mode anyway, for system-time
> > statistics purposes, and that flag could IMHO also drive the RCU GC
> > machinery.
> 
> The counter would have to be fancier than that to work.  We could 
> say that an even value means user mode, for example.  IOW the high 
> bits of the counter would count transitions to quiescent and the low 
> bits would indicate the state.
> 
> Actually, I'd count backwards.  We could use an andl instruction to 
> go to user mode and a decl to go to kernel mode.

at which point it's not really a monotonic quiescent state counter - 
which your naming suggested before.

Also it would have to be done both at entry and at exit.

But yes, if you replace a state flag with a recursive state flag that 
is INC/DEC maintained that would work as well. That's not a counter 
though.

> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> > > because stores can be delayed.  (It could even happen after 
> > > sysret, IIRC, but IRET is serializing.)
> >
> > All it has to do in the synchronize_rcu() slowpath is something 
> > like:
> 
> I don't think this works.  Suppose rt_cpu starts in kernel mode.  
> Then it checks ti flags.
>
> >
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> > smp_mb__before_atomic();
> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> > smp_rmb();
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
> 
> Now it sets rcu_state and exits to user mode.  It never notices
> TIF_RCU_NOTIFY.

Indeed, you are right, that's racy.

> [...]  We could fix this by sending an IPI to kick it.

With an IPI we won't need any flags or counters on the RT CPU - it's 
the IPI we are trying to avoid.

So how about the following way to fix the race: simply do a poll loop 
with a fixed timeout sleep: like it would do anyway if 
synchronize_rcu() was waiting for the timer irq to end the grace 
period on the RT-CPU.

The TIF flag would cause an RCU grace period to elapse, no need to wake 
up the synchronize_rcu() side: it will notice the (regular) increased 
RCU counter.

> > We are talking about a dozen cycles, while a typical 
> > synchronize_rcu() will wait millions (sometimes billions) of 
> > cycles. There's absolutely zero performance concern here and it's 
> > all CPU local in any case.
> 
> The barrier would affect all syscalls, though. [...]

What barrier? I never suggested any barrier in the syscall fast path, 
this bit:

> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> > smp_mb__before_atomic();
> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> > smp_rmb();
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

runs inside synchronize_rcu() or so.

But none of that is needed if we do:

- simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context 
  entry/exit time, straight in the assembly

- TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does 
  nothing but: this_cpu_inc(rcu_qs_ctr).

- synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and 
  then does a timeout-poll-loop with jiffies granular 
  timeouts, simulating a timer IRQ in essence. It will either 
  observe IN_USER or will see the regular RCU qs counter 
  increase.

this should be race free.
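
A sketch of the synchronize_rcu() side under those assumptions 
(TIF_RCU_QS, rcu_state and rcu_qs_ctr as proposed above; the helper 
name is made up):

static void rcu_wait_for_rt_cpu(int rt_cpu, struct task_struct *rt_task)
{
	unsigned long snap = per_cpu(rcu_qs_ctr, rt_cpu);

	/* Ask the RT-CPU to register a QS on its next kernel exit: */
	set_tsk_thread_flag(rt_task, TIF_RCU_QS);

	for (;;) {
		/* Extended quiescent state observed directly: */
		if (per_cpu(rcu_state, rt_cpu) == IN_USER)
			break;

		/* ... or the regular QS counter increased meanwhile: */
		if (per_cpu(rcu_qs_ctr, rt_cpu) != snap)
			break;

		/* Jiffies granular timeout sleep, like a timer irq: */
		schedule_timeout_interruptible(1);
	}
}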

Alternatively we can even get rid of the TIF flag by merging the 
percpu counter with the percpu state. Quiescent states are recorded 
via a counter shifted by two bits:

rcu_qs_ctr += 4;

while user/kernel/idle mode is recorded in the low two bits.

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Andy Lutomirski
On May 8, 2015 12:07 PM, "Ingo Molnar"  wrote:
>
>
> * Andy Lutomirski  wrote:
>
> > > So do you mean:
> > >
> > >this_cpu_set(rcu_state) = IN_KERNEL;
> > >...
> > >this_cpu_inc(rcu_qs_ctr);
> > >this_cpu_set(rcu_state) = IN_USER;
> > >
> > > ?
> > >
> > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > it just two simple stores into a byte flag, one on entry and one on
> > > exit:
> > >
> > >this_cpu_set(rcu_state) = IN_KERNEL;
> > >...
> > >this_cpu_set(rcu_state) = IN_USER;
> > >
> >
> > I was thinking that either a counter or a state flag could make sense.
> > Doing both would be pointless.  The counter could use the low bits to
> > indicate the state.  The benefit of the counter would be that the
> > RCU-waiting CPU could observe that the counter has incremented and
> > that therefore a grace period has elapsed.  Getting it right would
> > require lots of care.
>
> So if you mean:
>
>    <syscall entry>
>    ...
>    this_cpu_inc(rcu_qs_ctr);
>    <syscall exit>
>
> I don't see how this would work reliably: how do you handle the case
> of a SCHED_FIFO task never returning from user-space (common technique
> in RT apps)? synchronize_rcu() would block indefinitely as it would
> never see rcu_qs_ctr increase.
>
> We have to be able to observe user-mode anyway, for system-time
> statistics purposes, and that flag could IMHO also drive the RCU GC
> machinery.

The counter would have to be fancier than that to work.  We could say
that an even value means user mode, for example.  IOW the high bits of
the counter would count transitions to quiescent and the low bits
would indicate the state.

Actually, I'd count backwards.  We could use an andl instruction to go
to user mode and a decl to go to kernel mode.
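
A sketch of how those two instructions could maintain such a backwards 
counter - assuming, per the above, that an even value means user mode 
(quiescent) and an odd value means kernel mode:

DEFINE_PER_CPU(unsigned long, rcu_qs_ctr) = ~1UL;	/* largest even value */

static inline void rcu_qs_enter_kernel(void)
{
	__this_cpu_dec(rcu_qs_ctr);	/* decl: even -> odd */
}

static inline void rcu_qs_enter_user(void)
{
	/* andl: clears bit 0, taking an odd value to the next lower even: */
	__this_cpu_and(rcu_qs_ctr, ~1UL);
}

Each user -> kernel -> user round trip moves the counter down by two, 
so a remote CPU that sees the value change knows a quiescent state has 
passed.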

>
> > > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > > reliably. If the remote CPU sets it, it'll be too late and we'll
> > > > still enter user mode without seeing it.  If it's just an
> > > > optimization, though, then it should be fine.
> > >
> > > Well, after setting it, the remote CPU has to re-check whether the
> > > RT CPU has entered user-mode - before it goes to wait.
> >
> > How?
> >
> > Suppose the exit path looked like:
> >
> > this_cpu_write(rcu_state, IN_USER);
> >
> > if (ti->flags & _TIF_RCU_NOTIFY) {
> > if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> > slow_notify_rcu_that_we_are_exiting();
> > }
> >
> > iret or sysret;
>
> No, it would look like this:
>
>this_cpu_write(rcu_state, IN_USER);
>iret or sysret;
>
> I.e. IN_USER is set well after all notifications are checked. No
> kernel execution happens afterwards. (No extra checks added - the
> regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)
>
> ( Same goes for idle: we just mark it IN_IDLE and move it back to
>   IN_KERNEL after the idling ends. )
>
> > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
> > because stores can be delayed.  (It could even happen after sysret,
> > IIRC, but IRET is serializing.)
>
> All it has to do in the synchronize_rcu() slowpath is something like:

I don't think this works.  Suppose rt_cpu starts in kernel mode.  Then
it checks ti flags.

>
> if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> smp_mb__before_atomic();
> set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> smp_rmb();
> if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

Now it sets rcu_state and exits to user mode.  It never notices
TIF_RCU_NOTIFY.  We could fix this by sending an IPI to kick it.

> ... go wait ...
> }
> /* Cool, we observed quiescent state: */
>
> The cost of the trivial barrier is nothing compared to the 'go wait'
> cost which we will pay in 99.9% of the cases!
>
> > If we put an mfence after this_cpu_set or did an unconditional
> > test_and_clear_bit on ti->flags then this problem goes away, but
> > that would probably be slower than we'd like.
>
> We are talking about a dozen cycles, while a typical synchronize_rcu()
> will wait millions (sometimes billions) of cycles. There's absolutely
> zero performance concern here and it's all CPU local in any case.
>

The barrier would affect all syscalls, though.  Admittedly that's
still much cheaper than the current thing, but given that we can round
trip a syscall in 110 cycles, we'd take at least 10% overhead.

My preference would be to use a counter and skip the barrier.  If the
waiting CPU polled the counter, then, even if it lost this race, it
would be guaranteed to see the counter change reasonably soon as long
as no CPU implementation sits on its store buffer forever.
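
A sketch of that barrier-free polling, building on the even/odd 
counter above (even == user mode):

unsigned long snap = per_cpu(rcu_qs_ctr, rt_cpu);

/* An even snapshot already means user mode, i.e. a quiescent state: */
while ((snap & 1) && per_cpu(rcu_qs_ctr, rt_cpu) == snap)
	schedule_timeout_interruptible(1);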

> In fact a user-mode/kernel-mode flag speeds up naive implementations
> of synchronize_rcu(): because it's able to observe extended quiescent
> state immediately, without having to wait for a counter to increase
> (which was always the classic way to observe grace periods).

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker  wrote:
> > 
>  We cannot take the lock_trace(task) from irq context, and we 
>  probably do not need to anyway, since we do not care about a 
>  precise stack trace for the task.
> >>>
> >>> So one worry with this and similar approaches of statistically 
> >>> detecting user mode would be the fact that on the way out to 
> >>> user-space we don't really destroy the previous call trace - we 
> >>> just pop off the stack (non-destructively), restore RIPs and are 
> >>> gone.
> >>>
> >>> We'll need that percpu flag I suspect.
> >>
> >> Note we have the context tracking state which tells where the 
> >> current task is: user/system/guest.
> > 
> > Yes, but that overhead is what I'm suggesting we get rid of, I thought 
> > Rik was trying to find a mechanism that would be independent of that?
> 
> One thing at a time :)
> 
> I am working on the timer sampling stuff, which should be easy to 
> adapt to a different user/system/guest/irq/softirq/... tracking 
> thing, if somebody else comes up with a more efficient way to do 
> that.

So if you make the timer sampling use a percpu variable, and set that 
variable from the existing callbacks, then we could do this gradually: 
first the timer sampling uses the flag, then RCU could use it, and 
finally we could push it out to minimal assembly code.

But it's important to start out with a percpu flag to track this all.
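
A sketch of that first step - a percpu variable written from the 
existing context tracking callbacks; the hook names here are made up 
for illustration:

DEFINE_PER_CPU(int, cpu_ctx_state);	/* IN_KERNEL / IN_USER / IN_IDLE ... */

/* Wired into the existing context tracking user-enter callback: */
void ctx_state_user_enter(void)
{
	__this_cpu_write(cpu_ctx_state, IN_USER);
}

/* Wired into the existing context tracking user-exit callback: */
void ctx_state_user_exit(void)
{
	__this_cpu_write(cpu_ctx_state, IN_KERNEL);
}

The timer sampling code then just reads cpu_ctx_state of the sampled 
CPU, and RCU can reuse the same flag later.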

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > So do you mean:
> >
> >this_cpu_set(rcu_state) = IN_KERNEL;
> >...
> >this_cpu_inc(rcu_qs_ctr);
> >this_cpu_set(rcu_state) = IN_USER;
> >
> > ?
> >
> > So in your proposal we'd have an INC and two MOVs. I think we can make
> > it just two simple stores into a byte flag, one on entry and one on
> > exit:
> >
> >this_cpu_set(rcu_state) = IN_KERNEL;
> >...
> >this_cpu_set(rcu_state) = IN_USER;
> >
> 
> I was thinking that either a counter or a state flag could make sense.
> Doing both would be pointless.  The counter could use the low bits to
> indicate the state.  The benefit of the counter would be that the
> RCU-waiting CPU could observe that the counter has incremented and
> that therefore a grace period has elapsed.  Getting it right would
> require lots of care.

So if you mean:

   <syscall entry>
   ...
   this_cpu_inc(rcu_qs_ctr);
   <syscall exit>

I don't see how this would work reliably: how do you handle the case 
of a SCHED_FIFO task never returning from user-space (common technique 
in RT apps)? synchronize_rcu() would block indefinitely as it would 
never see rcu_qs_ctr increase.

We have to be able to observe user-mode anyway, for system-time 
statistics purposes, and that flag could IMHO also drive the RCU GC 
machinery.

> > > The problem is that I don't see how TIF_RCU_THINGY can work 
> > > reliably. If the remote CPU sets it, it'll be too late and we'll 
> > > still enter user mode without seeing it.  If it's just an 
> > > optimization, though, then it should be fine.
> >
> > Well, after setting it, the remote CPU has to re-check whether the 
> > RT CPU has entered user-mode - before it goes to wait.
> 
> How?
> 
> Suppose the exit path looked like:
> 
> this_cpu_write(rcu_state, IN_USER);
> 
> if (ti->flags & _TIF_RCU_NOTIFY) {
> if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> slow_notify_rcu_that_we_are_exiting();
> }
> 
> iret or sysret;

No, it would look like this:

   this_cpu_write(rcu_state, IN_USER);
   iret or sysret;

I.e. IN_USER is set well after all notifications are checked. No 
kernel execution happens afterwards. (No extra checks added - the 
regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)

( Same goes for idle: we just mark it IN_IDLE and move it back to 
  IN_KERNEL after the idling ends. )

> The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> because stores can be delayed.  (It could even happen after sysret, 
> IIRC, but IRET is serializing.)

All it has to do in the synchronize_rcu() slowpath is something like:

if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
smp_mb__before_atomic();
set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
smp_rmb();
if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
... go wait ...
}
/* Cool, we observed quiescent state: */

The cost of the trivial barrier is nothing compared to the 'go wait' 
cost which we will pay in 99.9% of the cases!

> If we put an mfence after this_cpu_set or did an unconditional 
> test_and_clear_bit on ti->flags then this problem goes away, but 
> that would probably be slower than we'd like.

We are talking about a dozen cycles, while a typical synchronize_rcu() 
will wait millions (sometimes billions) of cycles. There's absolutely 
zero performance concern here and it's all CPU local in any case.

In fact a user-mode/kernel-mode flag speeds up naive implementations 
of synchronize_rcu(): because it's able to observe extended quiescent 
state immediately, without having to wait for a counter to increase 
(which was always the classic way to observe grace periods).

If all CPUs are in user mode or are idle (which is rather common!) 
then synchronize_rcu() could return almost immediately - while 
previously it had to wait for scheduling or periodic timer irqs to 
trigger on all CPUs - adding many millisecs of delay even in the best 
of cases.
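
A sketch of that fast path, assuming the percpu rcu_state flag above 
(the helper name is made up):

static bool rcu_all_cpus_quiescent(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		int state = per_cpu(rcu_state, cpu);

		/* User mode and idle are extended quiescent states: */
		if (state != IN_USER && state != IN_IDLE)
			return false;
	}
	return true;
}

If this returns true, synchronize_rcu() has observed every CPU in an 
extended quiescent state and can finish without further waiting.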

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Paul E. McKenney
On Thu, May 07, 2015 at 02:49:39PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() 
> > (being executed on some other CPU not doing RT work) to 
> > intelligently wait for the remote (RT work doing) CPU to finish 
> > executing kernel code, without polling or so.
> 
> it's basically a cheap IPI being inserted on the remote CPU.
> 
> We need the TIF_RCU_QS callback not just to wait intelligently, but 
> mainly to elapse a grace period, otherwise synchronize_rcu() might not 
> ever make progress: think a SCHED_FIFO task doing some kernel work, 
> synchronize_rcu() stumbling upon it - but the SCHED_FIFO task 
> otherwise never scheduling and never getting any timer irqs either, 
> and thus never entering quiescent state.
> 
> (Cc:-ed Paul too, he might be interested in this as well.)

Hmmm...  So the point is that a NO_HZ_FULL CPU periodically posts
callbacks to indicate that it has passed through a quiescent state,
for example, upon entry to and/or exit from userspace?  These callbacks
would then be offloaded to some other CPU.

But the callback would not be invoked until RCU saw a grace period,
so I must be missing something here...  Probably that the TIF_RCU_QS
callback is not an RCU callback, but something else?

Thanx, Paul

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On May 7, 2015 8:38 PM, "Ingo Molnar"  wrote:
>
>
> * Andy Lutomirski  wrote:
>
> > I think one or both of us is missing something or we're just talking
> > about different things.
>
> That's very much possible!
>
> I think part of the problem is that I called the 'remote CPU' the RT
> CPU, while you seem to be calling it the CPU that does the
> synchronize_rcu().
>
> So let's start again, calling the CPU doing the synchronize_rcu() the
> 'remote CPU', and the one doing the RT workload the 'RT CPU':
>
> > If the nohz/RT cpu is about to enter user mode and stay there for a
> > long time, it does:
> >
> >   this_cpu_inc(rcu_qs_ctr);
> >
> > or whatever.  Maybe we add:
> >
> >   this_cpu_set(rcu_state) = IN_USER;
> >
> > or however it's spelled.
> >
> > The remote CPU wants to check our state.  If this happens just
> > before the IN_USER write or rcu_qs_ctr increment becomes visible,
> > then it'll think we're in kernel mode.  Now it either has to poll
> > (which is fine) or try to get us to tell the RCU core when we become
> > quiescent by setting TIF_RCU_THINGY.
>
> So do you mean:
>
>this_cpu_set(rcu_state) = IN_KERNEL;
>...
>this_cpu_inc(rcu_qs_ctr);
>this_cpu_set(rcu_state) = IN_USER;
>
> ?
>
> So in your proposal we'd have an INC and two MOVs. I think we can make
> it just two simple stores into a byte flag, one on entry and one on
> exit:
>
>this_cpu_set(rcu_state) = IN_KERNEL;
>...
>this_cpu_set(rcu_state) = IN_USER;
>

I was thinking that either a counter or a state flag could make sense.
Doing both would be pointless.  The counter could use the low bits to
indicate the state.  The benefit of the counter would be that the
RCU-waiting CPU could observe that the counter has incremented and
that therefore a grace period has elapsed.  Getting it right would
require lots of care.

> plus the rare but magic TIF_RCU_THINGY that tells a waiting
> synchronize_rcu() about the next quiescent state.
>
> > The problem is that I don't see how TIF_RCU_THINGY can work
> > reliably. If the remote CPU sets it, it'll be too late and we'll
> > still enter user mode without seeing it.  If it's just an
> > optimization, though, then it should be fine.
>
> Well, after setting it, the remote CPU has to re-check whether the RT
> CPU has entered user-mode - before it goes to wait.

How?

Suppose the exit path looked like:

this_cpu_write(rcu_state, IN_USER);

if (ti->flags & _TIF_RCU_NOTIFY) {
if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
slow_notify_rcu_that_we_are_exiting();
}

iret or sysret;

The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
_TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
because stores can be delayed.  (It could even happen after sysret,
IIRC, but IRET is serializing.)

If we put an mfence after this_cpu_set or did an unconditional
test_and_clear_bit on ti->flags then this problem goes away, but that
would probably be slower than we'd like.
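
For reference, a sketch combining the two heavier variants just 
mentioned (smp_mb() is an mfence on x86; this mirrors the exit path 
sketched above):

this_cpu_write(rcu_state, IN_USER);
smp_mb();	/* order the rcu_state store before the ti->flags read */
if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))	/* unconditional atomic */
	slow_notify_rcu_that_we_are_exiting();

iret or sysret;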

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Rik van Riel
On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> 
> * Frederic Weisbecker  wrote:
> 
 We cannot take the lock_trace(task) from irq context, and we 
 probably do not need to anyway, since we do not care about a 
 precise stack trace for the task.
>>>
>>> So one worry with this and similar approaches of statistically 
>>> detecting user mode would be the fact that on the way out to 
>>> user-space we don't really destroy the previous call trace - we 
>>> just pop off the stack (non-destructively), restore RIPs and are 
>>> gone.
>>>
>>> We'll need that percpu flag I suspect.
>>
>> Note we have the context tracking state which tells where the 
>> current task is: user/system/guest.
> 
> Yes, but that overhead is what I'm suggesting we get rid of, I thought 
> Rik was trying to find a mechanism that would be independent of that?

One thing at a time :)

I am working on the timer sampling stuff, which should be easy
to adapt to a different user/system/guest/irq/softirq/... tracking
thing, if somebody else comes up with a more efficient way to do
that.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> I think one or both of us is missing something or we're just talking 
> about different things.

That's very much possible!

I think part of the problem is that I called the 'remote CPU' the RT 
CPU, while you seem to be calling it the CPU that does the 
synchronize_rcu().

So let's start again, calling the CPU doing the synchronize_rcu() the 
'remote CPU', and the one doing the RT workload the 'RT CPU':

> If the nohz/RT cpu is about to enter user mode and stay there for a 
> long time, it does:
> 
>   this_cpu_inc(rcu_qs_ctr);
> 
> or whatever.  Maybe we add:
> 
>   this_cpu_set(rcu_state) = IN_USER;
> 
> or however it's spelled.
> 
> The remote CPU wants to check our state.  If this happens just 
> before the IN_USER write or rcu_qs_ctr increment becomes visible, 
> then it'll think we're in kernel mode.  Now it either has to poll 
> (which is fine) or try to get us to tell the RCU core when we become 
> quiescent by setting TIF_RCU_THINGY.

So do you mean:

   this_cpu_set(rcu_state) = IN_KERNEL;
   ...
   this_cpu_inc(rcu_qs_ctr);
   this_cpu_set(rcu_state) = IN_USER;

?

So in your proposal we'd have an INC and two MOVs. I think we can make 
it just two simple stores into a byte flag, one on entry and one on 
exit:

   this_cpu_set(rcu_state) = IN_KERNEL;
   ...
   this_cpu_set(rcu_state) = IN_USER;

plus the rare but magic TIF_RCU_THINGY that tells a waiting 
synchronize_rcu() about the next quiescent state.

> The problem is that I don't see how TIF_RCU_THINGY can work 
> reliably. If the remote CPU sets it, it'll be too late and we'll 
> still enter user mode without seeing it.  If it's just an 
> optimization, though, then it should be fine.

Well, after setting it, the remote CPU has to re-check whether the RT 
CPU has entered user-mode - before it goes to wait.

If it has entered user mode then the remote CPU has observed quiescent 
state and is done. If it's still in kernel mode then it waits until 
the TIF_RCU_THINGY does its completion.

> I still feel like this is all overengineered.  Shouldn't rcu_qs_ctr 
> be enough for all of the above?

How would just INC rcu_qs_ctr (without the IN_KERNEL/IN_USER flag) 
work?

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On Thu, May 7, 2015 at 5:44 AM, Ingo Molnar  wrote:
>
> * Andy Lutomirski  wrote:
>
>> > > We cannot take the lock_trace(task) from irq context, and we
>> > > probably do not need to anyway, since we do not care about a
>> > > precise stack trace for the task.
>> >
>> > So one worry with this and similar approaches of statistically
>> > detecting user mode would be the fact that on the way out to
>> > user-space we don't really destroy the previous call trace - we
>> > just pop off the stack (non-destructively), restore RIPs and are
>> > gone.
>> >
>> > We'll need that percpu flag I suspect.
>> >
>> > And once we have the flag, we can get rid of the per syscall RCU
>> > callback as well, relatively easily: with CMPXCHG (in
>> > synchronize_rcu()!) we can reliably sample whether a CPU is in
>> > user mode right now, while the syscall entry/exit path does not
>> > use any atomics, we can just use a simple MOV.
>> >
>> > Once we observe 'user mode', then we have observed quiescent state
>> > and synchronize_rcu() can continue. If we've observed kernel mode
>> > we can frob the remote task's TIF_ flags to make it go into a
>> > quiescent state publishing routine on syscall-return.
>> >
>>
>> How does that work?
>>
>> If the exit code writes the per-cpu flag and then checks
>> TIF_whatever, we need a barrier to avoid a race where we end up in
>> user mode without seeing the flag.
>
> No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.
>
> Also, the user-mode flag would be changed from the 'kernel-mode' value
> to the 'user-mode' value after we do the regular TIF checks - just
> before we hit user-space.
>
> The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being
> executed on some other CPU not doing RT work) to intelligently wait
> for the remote (RT work doing) CPU to finish executing kernel code,
> without polling or so.
>
> And yes, the TIF_RCU_QS handler on the remote CPU would have to
> execute an SMP barrier, before it allows an RCU quiescent state to
> pass. Note that the regular RCU code for quiescent state publishing
> already has that barrier, typically something like:
>
> this_cpu_inc(rcu_qs_ctr);
>
> That in itself is enough for synchronize_rcu(), it could do a small
> timing loop with short sleeps, until it waits for rcu_qs_ctr to
> increase.

I think one or both of us is missing something or we're just talking
about different things.

If the nohz/RT cpu is about to enter user mode and stay there for a
long time, it does:

this_cpu_inc(rcu_qs_ctr);

or whatever.  Maybe we add:

this_cpu_set(rcu_state) = IN_USER;

or however it's spelled.

The remote CPU wants to check our state.  If this happens just before
the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll
think we're in kernel mode.  Now it either has to poll (which is fine)
or try to get us to tell the RCU core when we become quiescent by
setting TIF_RCU_THINGY.

The problem is that I don't see how TIF_RCU_THINGY can work reliably.
If the remote CPU sets it, it'll be too late and we'll still enter
user mode without seeing it.  If it's just an optimization, though,
then it should be fine.

I still feel like this is all overengineered.  Shouldn't rcu_qs_ctr be
enough for all of the above?

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Ingo Molnar  wrote:

> The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() 
> (being executed on some other CPU not doing RT work) to 
> intelligently wait for the remote (RT work doing) CPU to finish 
> executing kernel code, without polling or so.

it's basically a cheap IPI being inserted on the remote CPU.

We need the TIF_RCU_QS callback not just to wait intelligently, but 
mainly to elapse a grace period, otherwise synchronize_rcu() might 
never make progress: think of a SCHED_FIFO task doing some kernel 
work, synchronize_rcu() stumbling upon it - but the SCHED_FIFO task 
otherwise never scheduling and never getting any timer irqs either, 
and thus never entering a quiescent state.

(Cc:-ed Paul too, he might be interested in this as well.)

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> >
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> >
> > We'll need that percpu flag I suspect.
> >
> > And once we have the flag, we can get rid of the per syscall RCU 
> > callback as well, relatively easily: with CMPXCHG (in 
> > synchronize_rcu()!) we can reliably sample whether a CPU is in 
> > user mode right now, while the syscall entry/exit path does not 
> > use any atomics, we can just use a simple MOV.
> >
> > Once we observe 'user mode', then we have observed quiescent state 
> > and synchronize_rcu() can continue. If we've observed kernel mode 
> > we can frob the remote task's TIF_ flags to make it go into a 
> > quiescent state publishing routine on syscall-return.
> >
> 
> How does that work?
>
> If the exit code writes the per-cpu flag and then checks 
> TIF_whatever, we need a barrier to avoid a race where we end up in 
> user mode without seeing the flag.

No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.

Also, the user-mode flag would be changed from the 'kernel-mode' value 
to the 'user-mode' value after we do the regular TIF checks - just 
before we hit user-space.

The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being 
executed on some other CPU not doing RT work) to intelligently wait 
for the remote (RT work doing) CPU to finish executing kernel code, 
without polling or so.

And yes, the TIF_RCU_QS handler on the remote CPU would have to 
execute an SMP barrier, before it allows an RCU quiescent state to 
pass. Note that the regular RCU code for quiescent state publishing 
already has that barrier, typically something like:

this_cpu_inc(rcu_qs_ctr);

That in itself is enough for synchronize_rcu(), it could do a small 
timing loop with short sleeps, until it waits for rcu_qs_ctr to 
increase.
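
Such a loop could be as simple as this sketch (rcu_qs_ctr standing in
for whatever per-cpu counter the quiescent state publishing bumps):

    unsigned long snap = per_cpu(rcu_qs_ctr, cpu);

    while (READ_ONCE(per_cpu(rcu_qs_ctr, cpu)) == snap)
            schedule_timeout_uninterruptible(1);    /* short sleep, re-check */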

> I think the right solution is to accept that race and just have the 
> RCU code send an IPI (or check again) if it sees too long of a 
> period elapse in kernel mode.

I don't think there's any need for an IPI.

> I think the flag should be a counter, though.  That way a workload 
> that makes lots of syscalls will be quickly detected as going 
> through quiescent states even if it's never actually observed in 
> user mode.

A flag write is easier on the CPU than an INC/DEC: only a store to a 
percpu location, no load needed, and no value dependencies. The store 
will be program-ordered, so it's a spin_unlock() work-alike.

> > The only hard requirement of this scheme from the RCU 
> > synchronization POV is that all kernel contexts that may touch RCU 
> > state need to flip this flag reliably to 'kernel mode': i.e. all 
> > irq handlers, traps, NMIs and all syscall variants need to do 
> > this.
> >
> > But once it's there, it's really neat.
> 
> We already have to do this with the current code.  I went through 
> and checked all of the IST entries a couple versions ago.

Yeah - I just mean if the primary flag is _not_ TIF driven (which is 
my suggestion, and which I thought you suggested as well) then all 
code paths need to be covered.

Things like the irq-tracking callbacks are debugging and 
instrumentation code - the user/kernel mode flag would be a hard 
requirement on RCU correctness: missing a flag update might cause the 
kernel to crash in non-obvious ways.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Frederic Weisbecker  wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> > 
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> > 
> > We'll need that percpu flag I suspect.
> 
> Note we have the context tracking state which tells where the 
> current task is: user/system/guest.

Yes, but that overhead is what I'm suggesting we get rid of - I 
thought Rik was trying to find a mechanism that would be independent 
of that?

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On May 7, 2015 4:18 PM, "Ingo Molnar"  wrote:
>
>
> * Rik van Riel  wrote:
>
> > > If, on the other hand, you're just going to remotely sample the
> > > in-memory context, that sounds good.
> >
> > It's the latter.
> >
> > If you look at /proc//{stack,syscall,wchan} and other files,
> > you will see we already have ways to determine, from in memory
> > content, where a program is running at a certain point in time.
> >
> > In fact, the timer interrupt based accounting does a similar thing.
> > It has a task examine its own in-memory state to figure out what it
> > was doing before the timer interrupt happened.
> >
> > The kernel side stack pointer is probably enough to tell us whether
> > a task is active in kernel space, on an irq stack, or (maybe) in
> > user space. Not convinced about the latter, we may need to look at
> > the same state the RCU code keeps track of to see what mode a task
> > is in...
> >
> > I am looking at the code to see what locks we need to grab.
> >
> > I suspect the runqueue lock may be enough, to ensure that the task
> > struct, and stack do not go away while we are looking at them.
>
> That will be enough, especially if you get to the task reference via
> rq->curr.
>
> > We cannot take the lock_trace(task) from irq context, and we
> > probably do not need to anyway, since we do not care about a precise
> > stack trace for the task.
>
> So one worry with this and similar approaches of statistically
> detecting user mode would be the fact that on the way out to
> user-space we don't really destroy the previous call trace - we just
> pop off the stack (non-destructively), restore RIPs and are gone.
>
> We'll need that percpu flag I suspect.
>
> And once we have the flag, we can get rid of the per syscall RCU
> callback as well, relatively easily: with CMPXCHG (in
> synchronize_rcu()!) we can reliably sample whether a CPU is in user
> mode right now, while the syscall entry/exit path does not use any
> atomics, we can just use a simple MOV.
>
> Once we observe 'user mode', then we have observed quiescent state and
> synchronize_rcu() can continue. If we've observed kernel mode we can
> frob the remote task's TIF_ flags to make it go into a quiescent state
> publishing routine on syscall-return.
>

How does that work?

If the exit code writes the per-cpu flag and then checks TIF_whatever,
we need a barrier to avoid a race where we end up in user mode without
seeing the flag.

I think the right solution is to accept that race and just have the
RCU code send an IPI (or check again) if it sees too long of a period
elapse in kernel mode.

I think the flag should be a counter, though.  That way a workload
that makes lots of syscalls will be quickly detected as going through
quiescent states even if it's never actually observed in user mode.

> The only hard requirement of this scheme from the RCU synchronization
> POV is that all kernel contexts that may touch RCU state need to flip
> this flag reliably to 'kernel mode': i.e. all irq handlers, traps,
> NMIs and all syscall variants need to do this.
>
> But once it's there, it's really neat.
>

We already have to do this with the current code.  I went through and
checked all of the IST entries a couple versions ago.

I think we need to clean up the current garbage asm first, though.  See:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=d22f1dca4c7c93fdd1ce754e38d71d1961c0f9ac

(Very much unfinished, and it should probably be split up, but AFAICT
it works.  Don't hold your breath for a real version.)

--Andy

> Thanks,
>
> Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Frederic Weisbecker
On Thu, May 07, 2015 at 12:48:45PM +0200, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
> > > If, on the other hand, you're just going to remotely sample the 
> > > in-memory context, that sounds good.
> > 
> > It's the latter.
> > 
> > If you look at /proc//{stack,syscall,wchan} and other files, 
> > you will see we already have ways to determine, from in memory 
> > content, where a program is running at a certain point in time.
> > 
> > In fact, the timer interrupt based accounting does a similar thing. 
> > It has a task examine its own in-memory state to figure out what it 
> > was doing before the timer interrupt happened.
> > 
> > The kernel side stack pointer is probably enough to tell us whether 
> > a task is active in kernel space, on an irq stack, or (maybe) in 
> > user space. Not convinced about the latter, we may need to look at 
> > the same state the RCU code keeps track of to see what mode a task 
> > is in...
> > 
> > I am looking at the code to see what locks we need to grab.
> > 
> > I suspect the runqueue lock may be enough, to ensure that the task 
> > struct, and stack do not go away while we are looking at them.
> 
> That will be enough, especially if you get to the task reference via 
> rq->curr.
> 
> > We cannot take the lock_trace(task) from irq context, and we 
> > probably do not need to anyway, since we do not care about a precise 
> > stack trace for the task.
> 
> So one worry with this and similar approaches of statistically 
> detecting user mode would be the fact that on the way out to 
> user-space we don't really destroy the previous call trace - we just 
> pop off the stack (non-destructively), restore RIPs and are gone.
> 
> We'll need that percpu flag I suspect.

Note we have the context tracking state which tells where the current
task is: user/system/guest.

> 
> And once we have the flag, we can get rid of the per syscall RCU 
> callback as well, relatively easily: with CMPXCHG (in 
> synchronize_rcu()!) we can reliably sample whether a CPU is in user 
> mode right now, while the syscall entry/exit path does not use any 
> atomics, we can just use a simple MOV.
> 
> Once we observe 'user mode', then we have observed quiescent state and 
> synchronize_rcu() can continue. If we've observed kernel mode we can 
> frob the remote task's TIF_ flags to make it go into a quiescent state 
> publishing routine on syscall-return.
> 
> The only hard requirement of this scheme from the RCU synchronization 
> POV is that all kernel contexts that may touch RCU state need to flip 
> this flag reliably to 'kernel mode': i.e. all irq handlers, traps, 
> NMIs and all syscall variants need to do this.
> 
> But once it's there, it's really neat.
> 
> Thanks,
> 
>   Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Rik van Riel  wrote:

> > If, on the other hand, you're just going to remotely sample the 
> > in-memory context, that sounds good.
> 
> It's the latter.
> 
> If you look at /proc//{stack,syscall,wchan} and other files, 
> you will see we already have ways to determine, from in memory 
> content, where a program is running at a certain point in time.
> 
> In fact, the timer interrupt based accounting does a similar thing. 
> It has a task examine its own in-memory state to figure out what it 
> was doing before the timer interrupt happened.
> 
> The kernel side stack pointer is probably enough to tell us whether 
> a task is active in kernel space, on an irq stack, or (maybe) in 
> user space. Not convinced about the latter, we may need to look at 
> the same state the RCU code keeps track of to see what mode a task 
> is in...
> 
> I am looking at the code to see what locks we need to grab.
> 
> I suspect the runqueue lock may be enough, to ensure that the task 
> struct, and stack do not go away while we are looking at them.

That will be enough, especially if you get to the task reference via 
rq->curr.

> We cannot take the lock_trace(task) from irq context, and we 
> probably do not need to anyway, since we do not care about a precise 
> stack trace for the task.

So one worry with this and similar approaches of statistically 
detecting user mode would be the fact that on the way out to 
user-space we don't really destroy the previous call trace - we just 
pop off the stack (non-destructively), restore RIPs and are gone.

We'll need that percpu flag I suspect.

And once we have the flag, we can get rid of the per syscall RCU 
callback as well, relatively easily: with CMPXCHG (in 
synchronize_rcu()!) we can reliably sample whether a CPU is in user 
mode right now, while the syscall entry/exit path does not use any 
atomics, we can just use a simple MOV.

Once we observe 'user mode', then we have observed quiescent state and 
synchronize_rcu() can continue. If we've observed kernel mode we can 
frob the remote task's TIF_ flags to make it go into a quiescent state 
publishing routine on syscall-return.

The only hard requirement of this scheme from the RCU synchronization 
POV is that all kernel contexts that may touch RCU state need to flip 
this flag reliably to 'kernel mode': i.e. all irq handlers, traps, 
NMIs and all syscall variants need to do this.

But once it's there, it's really neat.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> >> [...]  Also, we'd have to audit all the entries, and 
> >> system_call_after_swapgs currently enables interrupts early 
> >> enough that an interrupt before all the pushes will do 
> >> unpredictable things to pt_regs.
> >
> > An irq hardware frame won't push zero to that selector value, 
> > right? That's the only bad thing that would confuse the code.
> >
> 
> I think it's not quite that simple.  The syscall entry looks like, 
> roughly:
> 
> fix rsp;
> sti;
> push ss;
> push rsp;
> push flags;
> push cs;
> push rip;
> 
> We can get an interrupt part-way through those pushes.  Maybe there's
> no bad place where we could get an IRQ since SS is first, [...]

... and it should be covered by the 'STI window' where the instruction 
following a STI is still part of the irqs-off section.
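
A minimal illustration of that window, following Andy's pseudo-code
above (sketch only - on x86 the instruction immediately after STI
executes before any pending interrupt is delivered):

    sti                     /* irqs on, but held off for one insn */
    pushq   $__USER_DS      /* pt_regs->ss: covered by the STI shadow */
    pushq   %rsp            /* from here on, an irq can be delivered */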

> [...] but this is still nasty.

True.

Another approach would be to set up two aliases in the GDT, so we 
could freely change 'ss' between them and thus store information, 
without possibly confusing the syscall entry/exit code.

That still gets complicated by IST entries, which creates multiple 
positions for the 'flag'.

> I think I prefer a per-cpu approach over a per-task approach, since 
> it's easier to reason about and it should still only require one 
> instruction on entry and one instruction on exit.

Yes, agreed.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On May 7, 2015 8:38 PM, "Ingo Molnar" <mi...@kernel.org> wrote:
>
>
> * Andy Lutomirski <l...@amacapital.net> wrote:
>
> > I think one or both of us is missing something or we're just talking
> > about different things.
>
> That's very much possible!
>
> I think part of the problem is that I called the 'remote CPU' the RT
> CPU, while you seem to be calling it the CPU that does the
> synchronize_rcu().
>
> So let's start again, calling the CPU that does the synchronize_rcu()
> the 'remote CPU', and the one doing the RT workload the 'RT CPU':
>
> > If the nohz/RT cpu is about to enter user mode and stay there for a
> > long time, it does:
> >
> >   this_cpu_inc(rcu_qs_ctr);
> >
> > or whatever.  Maybe we add:
> >
> >   this_cpu_set(rcu_state) = IN_USER;
> >
> > or however it's spelled.
> >
> > The remote CPU wants to check our state.  If this happens just
> > before the IN_USER write or rcu_qs_ctr increment becomes visible,
> > then it'll think we're in kernel mode.  Now it either has to poll
> > (which is fine) or try to get us to tell the RCU core when we become
> > quiescent by setting TIF_RCU_THINGY.
>
> So do you mean:
>
>    this_cpu_set(rcu_state) = IN_KERNEL;
>    ...
>    this_cpu_inc(rcu_qs_ctr);
>    this_cpu_set(rcu_state) = IN_USER;
>
> ?
>
> So in your proposal we'd have an INC and two MOVs. I think we can make
> it just two simple stores into a byte flag, one on entry and one on
> exit:
>
>    this_cpu_set(rcu_state) = IN_KERNEL;
>    ...
>    this_cpu_set(rcu_state) = IN_USER;
>

I was thinking that either a counter or a state flag could make sense.
Doing both would be pointless.  The counter could use the low bits to
indicate the state.  The benefit of the counter would be that the
RCU-waiting CPU could observe that the counter has incremented and
that therefore a grace period has elapsed.  Getting it right would
require lots of care.
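
One possible shape of that combined encoding - bit layout and names
invented for illustration, and the "lots of care" (ordering, wraparound,
remote snapshotting) is exactly what this sketch leaves out:

    /* Bit 0: 1 = user mode.  Bits 1 and up: quiescent state count. */
    #define RCU_IN_USER     0x1UL
    #define RCU_QS_INC      0x2UL

    DEFINE_PER_CPU(unsigned long, rcu_qs_state);

    static inline void rcu_qs_user_enter(void)
    {
            unsigned long v = __this_cpu_read(rcu_qs_state);

            /* Bump the count and flip to user state in one store. */
            __this_cpu_write(rcu_qs_state, (v + RCU_QS_INC) | RCU_IN_USER);
    }

    static inline void rcu_qs_kernel_enter(void)
    {
            __this_cpu_write(rcu_qs_state,
                             __this_cpu_read(rcu_qs_state) & ~RCU_IN_USER);
    }

A remote reader that sees either the user bit set or the count changed
since its snapshot has observed a quiescent state.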

> plus the rare but magic TIF_RCU_THINGY that tells a waiting
> synchronize_rcu() about the next quiescent state.
>
> > The problem is that I don't see how TIF_RCU_THINGY can work
> > reliably. If the remote CPU sets it, it'll be too late and we'll
> > still enter user mode without seeing it.  If it's just an
> > optimization, though, then it should be fine.
>
> Well, after setting it, the remote CPU has to re-check whether the RT
> CPU has entered user-mode - before it goes to wait.

How?

Suppose the exit path looked like:

    this_cpu_write(rcu_state, IN_USER);

    if (ti->flags & _TIF_RCU_NOTIFY) {
            if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                    slow_notify_rcu_that_we_are_exiting();
    }

    iret or sysret;

The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
_TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
because stores can be delayed.  (It could even happen after sysret,
IIRC, but IRET is serializing.)

If we put an mfence after this_cpu_set or did an unconditional
test_and_clear_bit on ti->flags then this problem goes away, but that
would probably be slower than we'd like.
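
The race-free (but slower) variant would be something like this, with
the same hypothetical names as above - the barrier is the key:

    this_cpu_write(rcu_state, IN_USER);
    smp_mb();   /* order the store before the flag read; mfence on x86 */
    if (test_thread_flag(TIF_RCU_NOTIFY)) {
            if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                    slow_notify_rcu_that_we_are_exiting();
    }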

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Rik van Riel
On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweis...@gmail.com> wrote:
> 
>>>> We cannot take the lock_trace(task) from irq context, and we 
>>>> probably do not need to anyway, since we do not care about a 
>>>> precise stack trace for the task.
>>>
>>> So one worry with this and similar approaches of statistically 
>>> detecting user mode would be the fact that on the way out to 
>>> user-space we don't really destroy the previous call trace - we 
>>> just pop off the stack (non-destructively), restore RIPs and are 
>>> gone.
>>>
>>> We'll need that percpu flag I suspect.
>>
>> Note we have the context tracking state which tells where the 
>> current task is: user/system/guest.
> 
> Yes, but that overhead is what I'm suggesting we get rid of - I 
> thought Rik was trying to find a mechanism that would be independent 
> of that?

One thing at a time :)

I am working on the timer sampling stuff, which should be easy
to adapt to a different user/system/guest/irq/softirq/... tracking
thing, if somebody else comes up with a more efficient way to do
that.

-- 
All rights reversed




Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU             nohz CPU
>   ----------------------------------------
>                                   set flag = 0
>   read flag = 0
>                                   return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Actually, the "race" in this direction is fine. If flag==0, then
the nohz CPU is not accessing any RCU protected data structures,
and the synchronize_rcu CPU will not be setting TIF_NOHZ.

The race is only a concern if the synchronize_rcu CPU reads
flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after
the nohz CPU has cleared the flag (and is unable to handle RCU
stuff). An atomic compare-and-swap prevents that issue.

The other race, of the synchronize_rcu CPU reading 0, followed
by the nohz CPU going into kernel space, and setting the flag
to 1, should be fine. After all, this means the nohz_full CPU
just went into a new RCU grace period, which is just what the
synchronize_rcu CPU was waiting for.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU             nohz CPU
>   ----------------------------------------
>                                   set flag = 0
>   read flag = 0
>                                   return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

As an aside, I suspect the remote timer sampling stuff is probably
best off piggybacking on the RCU status (and PF_VCPU) of the tasks
being sampled.

That seems like the most reliable in-memory state we can examine,
since the frame pointer may be in registers only, and not in
memory to begin with.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU             nohz CPU
>   ----------------------------------------
>                                   set flag = 0
>   read flag = 0
>                                   return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

If the flag and TIF_NOHZ live in the same word, we can cmpxchg
and be guaranteed atomicity.
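
A sketch of how that combined word could work - bit numbers and names
invented for illustration:

    #define IN_KERNEL_BIT   0       /* set while in kernel mode */
    #define NOTIFY_BIT      1       /* the TIF_NOHZ-style request */

    /* synchronize_rcu() side: request a callback only if the nohz CPU
     * is still in kernel mode - one atomic decision, no lost update. */
    static bool request_nohz_notify(unsigned long *word)
    {
            unsigned long old, new;

            do {
                    old = READ_ONCE(*word);
                    if (!(old & BIT(IN_KERNEL_BIT)))
                            return false;   /* quiescent: nothing to wait for */
                    new = old | BIT(NOTIFY_BIT);
            } while (cmpxchg(word, old, new) != old);

            return true;
    }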

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Paolo Bonzini


On 02/05/2015 07:27, Ingo Molnar wrote:
> 
>   - synchronize_rcu() avoids having to send an IPI by taking a 
> peek at rq->curr's pt_regs::flag, and if:
> 
>  - the flag is 0 then it has observed a quiescent state.
> 
>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>completion from a TIF_NOHZ callback.

Isn't this racy?

synchronize_rcu CPU             nohz CPU
----------------------------------------
                                set flag = 0
read flag = 0
                                return to userspace
set TIF_NOHZ

and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>    completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU          nohz CPU
>   -----------------------------------------
>                                set flag = 0
>   read flag = 0
>                                return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Actually, the race in this direction is fine. If flag==0, then
the nohz CPU is not accessing any RCU protected data structures,
and the synchronize_rcu CPU will not be setting TIF_NOHZ.

The race is only a concern if the synchronize_rcu CPU reads
flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after
the nohz CPU has cleared flag (and is unable to handle RCU
stuff). An atomic compare and swap prevents that issue.

The other race, of the synchronize_rcu CPU reading 0, followed
by the nohz CPU going into kernel space, and setting the flag
to 1, should be fine. After all, this means the nohz_full CPU
just went into a new RCU grace period, which is just what the
synchronize_rcu CPU was waiting for.
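To make the ordering concrete, here is a minimal sketch of the cmpxchg
idea, assuming a hypothetical combined word in which bit 0 is the
in-kernel flag and bit 1 stands in for TIF_NOHZ; the names and layout
are illustrative, not existing kernel API:

#define CTX_IN_KERNEL	0x1
#define CTX_WANT_NOHZ	0x2

/* synchronize_rcu() side: request the callback only while still in kernel */
static bool request_nohz_callback(atomic_t *ctx)
{
	int old, cur = atomic_read(ctx);

	do {
		if (!(cur & CTX_IN_KERNEL))
			return false;	/* quiescent state already observed */
		old = cur;
		cur = atomic_cmpxchg(ctx, old, old | CTX_WANT_NOHZ);
	} while (cur != old);

	return true;	/* flag set while the CPU was still in the kernel */
}

Because the flag bit and the callback-request bit are updated in one
atomic operation, the nohz CPU can never clear the flag without also
observing (or invalidating) the pending request.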

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Rik van Riel
On 05/03/2015 02:24 PM, Andy Lutomirski wrote:
> On Sun, May 3, 2015 at 10:30 AM, Rik van Riel  wrote:
>> On 05/03/2015 09:23 AM, Mike Galbraith wrote:
>>
>>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
>>>
>>> 100M * stat() on isolated cpu
>>>
>>> NO_HZ_FULL      off          inactive     housekeeper    nohz_full
>>> real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
>>> user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
>>> sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
>>> (real)          1.000        1.007        1.431          1.957

>> I'm convinced.
>>
>> Time to try the remote sampling of CPU use statistics, and
>> lighten up the RCU overhead of context tracking.
>>
> 
> I don't understand the remote sampling proposal.  Isn't the whole
> point of full nohz to avoid periodically interrupting a busy CPU?  If
> what you have in mind is sending IPIs, then that's just as bad, right?
> 
> If, on the other hand, you're just going to remotely sample the
> in-memory context, that sounds good.

It's the latter.

If you look at /proc/<pid>/{stack,syscall,wchan} and other files,
you will see we already have ways to determine, from in memory
content, where a program is running at a certain point in time.

In fact, the timer interrupt based accounting does a similar thing.
It has a task examine its own in-memory state to figure out what it
was doing before the timer interrupt happened.

The kernel side stack pointer is probably enough to tell us whether
a task is active in kernel space, on an irq stack, or (maybe) in user
space. Not convinced about the latter, we may need to look at the
same state the RCU code keeps track of to see what mode a task is in...

I am looking at the code to see what locks we need to grab.

I suspect the runqueue lock may be enough to ensure that the task
struct and stack do not go away while we are looking at them.

We cannot take the lock_trace(task) from irq context, and we probably
do not need to anyway, since we do not care about a precise stack trace
for the task.
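A rough sketch of what such a sampler could look like, run from a
housekeeping CPU's timer. Only PF_VCPU, cpu_rq() and the rq locking
are real; the accounting helpers and the task_in_kernel_mode() test
are hypothetical placeholders for whatever in-memory state ends up
being used:

/* rq->lock keeps rq->curr and its stack alive while we look at them */
static void sample_remote_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);	/* sched-internal accessor */
	struct task_struct *p;

	raw_spin_lock_irq(&rq->lock);
	p = rq->curr;
	if (p->flags & PF_VCPU)
		account_guest_sample(p);	/* hypothetical helpers */
	else if (task_in_kernel_mode(p))	/* hypothetical test */
		account_system_sample(p);
	else
		account_user_sample(p);
	raw_spin_unlock_irq(&rq->lock);
}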

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Andy Lutomirski
On Fri, May 1, 2015 at 10:27 PM, Ingo Molnar  wrote:
>
> * Andy Lutomirski  wrote:
>
>> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel  wrote:
>> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>> >
>> >> Or we could do that in the syscall path with a single store of a
>> >> constant flag to a location in the task struct. We have a number of
>> >> natural flags that get written on syscall entry, such as:
>> >>
>> >> pushq_cfi $__USER_DS/* pt_regs->ss */
>> >>
>> >> That goes to a constant location on the kernel stack. On return from
>> >> system calls we could write 0 to that location.
>>
>> Huh?  IRET with zero there will fault, and we can't guarantee that
>> all syscalls return using sysret. [...]
>
> So IRET is a slowpath - I was thinking about the fastpath mainly.

Slowpath or not, if we're using part of the hardware frame in pt_regs
as an indication of whether we're in user or kernel mode, we don't get
to arbitrarily change it to the "user" state before IRET or the IRET
will fail.  We could work around that with some kind of trampoline,
but that's complicated.

>
>> [...]  Also, we'd have to audit all the entries, and
>> system_call_after_swapgs currently enables interrupts early enough
>> that an interrupt before all the pushes will do unpredictable things
>> to pt_regs.
>
> An irq hardware frame won't push zero to that selector value, right?
> That's the only bad thing that would confuse the code.
>

I think it's not quite that simple.  The syscall entry looks like, roughly:

fix rsp;
sti;
push ss;
push rsp;
push flags;
push cs;
push rip;

We can get an interrupt part-way through those pushes.  Maybe there's
no bad place where we could get an IRQ since SS is first, but this is
still nasty.

>
> But yes, safely accessing the remote task is a complication, but with
> such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So
> even if we do:
>
>> If we went that route, I'd advocate sticking the flag in tss->sp1.
>> That cacheline is unconditionally read on kernel entry already, and
>> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
>> lost track). [1]
>>
>> Alternatively, I'd suggest that we actually add a whole new word to
>> pt_regs.
>
> ... we'd still have to set TIF_NOHZ and are back to square one in
> terms of race complexity.
>
> But it's not overly complex: by taking the remote CPU's rq-lock from
> synchronize_rcu() we could get a stable pointer to the currently
> executing task.
>
> And if we do that, we might as well use the opportunity and take a
> look at pt_regs as well - this is why sticking it into pt_regs might
> be better.
>
> So I'd:
>
>   - enlarge pt_regs by a word and stick the flag there (this
> allocation is essentially free)
>
>   - update the word from entry/exit
>
>   - synchronize_rcu() avoids having to send an IPI by taking a
> peek at rq->curr's pt_regs::flag, and if:
>
>  - the flag is 0 then it has observed a quiescent state.
>
>  - the flag is 1, then it would set TIF_NOHZ and wait for a
>completion from a TIF_NOHZ callback.

This seems complicated, especially given IST entries that put their
pt_regs in different places.  We get a bit of a free pass on some of
the IST entries due to the IST stack switch thing, but we still need
to worry about NMI and MCE at least.

I think I want to understand the considerations a bit better before I
express a real opinion.  IIUC the remote CPU needs to reliably detect
that we've had a grace period.  This can potentially happen by
observing us with some quiescent (user-mode, guest-mode or idle) flag
set or by an interrupt incrementing a counter.  In CONFIG_PREEMPT_RCU
mode, the remote CPU can presumably observe that we're quiescent
directly.

Does this mean that, in CONFIG_PREEMPT_RCU mode, all of this is
unnecessary?  We may still want context tracking for the timing stats
sampling, but I don't see why we need it for RCU.

In non-CONFIG_PREEMPT_RCU mode, I think we need to worry about
possible races that could cause us to fail to ever detect a grace
period.  If we have the periodic tick turned off and if the remote CPU
checks our context and sees us non-quiescent, then either we need a
non-racy way to get us to do work when we become quiescent or the
remote CPU needs to eventually send us an IPI.  I don't see any way to
do the former without an RMW or barrier on every transition into a
quiescent state.  Even if we used ti->flags, I still think that would
require us to upgrade the cli; test on exit to userspace to a
test-and-set or test-and-clear, either of which is much more
expensive.

I think I prefer a per-cpu approach over a per-task approach, since
it's easier to reason about and it should still only require one
instruction on entry and one instruction on exit.

FWIW, if we go the per-cpu route, what if we used a counter instead of
a simple context flag?  With a counter, the remote CPU could observe
that we changed state and therefore went through a grace period even
if the remote CPU never actually observes us in a quiescent state.
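A minimal sketch of that counter-plus-state idea, ignoring nested
entries (an irq during a syscall would need extra care) and using
made-up names: a per-cpu word whose low bit encodes the context and
whose value bumps on every transition:

DEFINE_PER_CPU(unsigned long, ctx_ctr);	/* bit 0: 1 = in kernel */

/* one increment on entry and one on exit, no atomics or barriers */
static inline void ctx_enter_kernel(void)
{
	this_cpu_add(ctx_ctr, 1);	/* even (user) -> odd (kernel) */
}

static inline void ctx_exit_kernel(void)
{
	this_cpu_add(ctx_ctr, 1);	/* odd (kernel) -> even (user) */
}

/* remote observer: quiescent if in user mode, or if the counter moved */
static bool cpu_passed_quiescent(int cpu, unsigned long snap)
{
	unsigned long cur = per_cpu(ctx_ctr, cpu);

	return !(cur & 1) || cur != snap;
}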

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Andy Lutomirski
On Sun, May 3, 2015 at 10:30 AM, Rik van Riel  wrote:
> On 05/03/2015 09:23 AM, Mike Galbraith wrote:
>
>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
>>
>> 100M * stat() on isolated cpu
>>
>> NO_HZ_FULL      off          inactive     housekeeper    nohz_full
>> real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
>> user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
>> sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
>> (real)          1.000        1.007        1.431          1.957
>>                                           1.000          1.000
>>
>>  real  0m20.423s  0m27.930s  +rik 1,2
>>  user  0m2.072s   0m10.450s
>>  sys   0m18.304s  0m17.471s
>>  vs off1.431  1.957
>>  vs prev   1.000  1.000
>>
>>  real  0m20.256s  0m27.803s  +paolo 1,2 (2 
>> missing prototypes)
>>  user  0m1.884s   0m10.551s
>>  sys   0m18.353s  0m17.242s
>>  vs off1.419  1.948
>>  vs prev.991   .995
>>
>>  real  0m19.122s  0m26.946s  +rik 3
>>  user  0m1.896s   0m10.292s
>>  sys   0m17.198s  0m16.644s
>>  vs off1.340  1.888
>>  vs prev   .944.969
>
> I'm convinced.
>
> Time to try the remote sampling of CPU use statistics, and
> lighten up the RCU overhead of context tracking.
>

I don't understand the remote sampling proposal.  Isn't the whole
point of full nohz to avoid periodically interrupting a busy CPU?  If
what you have in mind is sending IPIs, then that's just as bad, right?

If, on the other hand, you're just going to remotely sample the
in-memory context, that sounds good.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Rik van Riel
On 05/03/2015 09:23 AM, Mike Galbraith wrote:

> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
> 
> 100M * stat() on isolated cpu
> 
> NO_HZ_FULL      off          inactive     housekeeper    nohz_full
> real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
> user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
> sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
> (real)          1.000        1.007        1.431          1.957
>                                           1.000          1.000
> 
>  real  0m20.423s  0m27.930s  +rik 1,2
>  user  0m2.072s   0m10.450s
>  sys   0m18.304s  0m17.471s
>  vs off1.431  1.957
>  vs prev   1.000  1.000
> 
>  real  0m20.256s  0m27.803s  +paolo 1,2 (2 
> missing prototypes)
>  user  0m1.884s   0m10.551s
>  sys   0m18.353s  0m17.242s
>  vs off1.419  1.948
>  vs prev.991   .995
> 
>  real  0m19.122s  0m26.946s  +rik 3
>  user  0m1.896s   0m10.292s
>  sys   0m17.198s  0m16.644s
>  vs off1.340  1.888
>  vs prev   .944.969

I'm convinced.

Time to try the remote sampling of CPU use statistics, and
lighten up the RCU overhead of context tracking.

I should be able to start on the remote sampling of CPU use
statistics this week.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Mike Galbraith
On Fri, 2015-05-01 at 11:20 -0400, Rik van Riel wrote:
> On 05/01/2015 02:40 AM, Ingo Molnar wrote:
> 
> >> This patch builds on top of these patches by Paolo:
> >> https://lkml.org/lkml/2015/4/28/188
> >> https://lkml.org/lkml/2015/4/29/139
> >>
> >> Together with this patch I posted earlier this week, the syscall path
> >> on a nohz_full cpu seems to be about 10% faster.
> >> https://lkml.org/lkml/2015/4/24/394
> >>
> >> My test is a simple microbenchmark that calls getpriority() in a loop
> >> 10 million times:
> >>
> >>                run time    system time
> >> vanilla        5.49s       2.08s
> >> __acct patch   5.21s       1.92s
> >> both patches   4.88s       1.71s
> >
> > Just curious, what are the numbers if you don't have context tracking 
> > enabled, i.e. without nohz_full?
> > 
> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without nohz_full,
> just a CPU without nohz_full running the same kernel I tested with
> yesterday:
> 
>               run time    system time
> vanilla       5.49s       2.08s
> __acct patch  5.21s       1.92s
> both patches  4.88s       1.71s
> CPU w/o nohz  3.12s       1.63s   <-- your numbers, mostly

Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.

100M * stat() on isolated cpu

NO_HZ_FULL      off          inactive     housekeeper    nohz_full
real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
(real)          1.000        1.007        1.431          1.957
                                          1.000          1.000

 real  0m20.423s  0m27.930s  +rik 1,2
 user  0m2.072s   0m10.450s
 sys   0m18.304s  0m17.471s
 vs off1.431  1.957
 vs prev   1.000  1.000

 real  0m20.256s  0m27.803s  +paolo 1,2 (2 
missing prototypes)
 user  0m1.884s   0m10.551s
 sys   0m18.353s  0m17.242s
 vs off1.419  1.948
 vs prev.991   .995

 real  0m19.122s  0m26.946s  +rik 3
 user  0m1.896s   0m10.292s
 sys   0m17.198s  0m16.644s
 vs off1.340  1.888
 vs prev   .944.969



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-02 Thread Rik van Riel
On 05/02/2015 01:27 AM, Ingo Molnar wrote:

> Regarding the user/kernel execution time split measurement:
> 
> 1) the same flag could be used to sample a remote CPU's statistics 
> from another CPU and update the stats in the currently executing task. 
> As long as there's at least one non-nohz-full CPU, this would work. Or 
> are there systems were all CPUs are nohz-full?

On a NO_HZ_FULL system, you need at least one CPU to execute
RCU callbacks, and do other system things like that, so there
is at least one CPU that is not nohz_full.

On NUMA systems, I could even see the sane option being one
CPU that is not isolated or nohz_full per NUMA node, so we
have a place to route irqs, etc...
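For example, a plausible (entirely made-up) layout on a two-node,
16-CPU box would keep CPU 0 and CPU 8 as per-node housekeeping CPUs,
using the existing boot parameters:

    isolcpus=1-7,9-15 nohz_full=1-7,9-15 rcu_nocbs=1-7,9-15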

> 2) Alternatively we could just drive user/kernel split statistics from 
> context switches, which would be inaccurate if the workload is 
> SCHED_FIFO that only rarely context switches.
> 
> How does this sound?

I think option (1) sounds nicer :)

What locks do we need, besides the runqueue lock to make sure
the task does not go away, and later the task's vtime_lock to
update its time statistics?

Do we even need the lock_trace(task) as taken in proc_pid_stack(),
since all we care is whether or not the thing is in kernel, user,
or guest mode?

For guest mode, we set a flag in the task struct somewhere, that
part is easy.

It also looks like dump_trace() can distinguish between normal,
exception, and irq stacks. Not sure how fancy we need to get...

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel  wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> >> Or we could do that in the syscall path with a single store of a
> >> constant flag to a location in the task struct. We have a number of
> >> natural flags that get written on syscall entry, such as:
> >>
> >> pushq_cfi $__USER_DS/* pt_regs->ss */
> >>
> >> That goes to a constant location on the kernel stack. On return from
> >> system calls we could write 0 to that location.
> 
> Huh?  IRET with zero there will fault, and we can't guarantee that 
> all syscalls return using sysret. [...]

So IRET is a slowpath - I was thinking about the fastpath mainly.

> [...]  Also, we'd have to audit all the entries, and 
> system_call_after_swapgs currently enables interrupts early enough 
> that an interrupt before all the pushes will do unpredictable things 
> to pt_regs.

An irq hardware frame won't push zero to that selector value, right? 
That's the only bad thing that would confuse the code.

> We could further abuse orig_ax, but that would require a lot of 
> auditing.  Honestly, though, I think keeping a flag in an 
> otherwise-hot cache line is a better bet. [...]

That would work too, at the cost of one more instruction, as now we'd 
have to both set and clear it.

> [...]  Also, making it per-cpu instead of per-task will probably be 
> easier on the RCU code, since otherwise the RCU code will have to do 
> some kind of synchronization (even if it's a wait-free probe of the 
> rq lock or similar) to figure out which pt_regs to look at.

So the synchronize_rcu() part is an even slower slow path, in 
comparison with system call entry overhead.

But yes, safely accessing the remote task is a complication, but with 
such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So 
even if we do:

> If we went that route, I'd advocate sticking the flag in tss->sp1. 
> That cacheline is unconditionally read on kernel entry already, and 
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I 
> lost track). [1]
> 
> Alternatively, I'd suggest that we actually add a whole new word to 
> pt_regs.

... we'd still have to set TIF_NOHZ and are back to square one in 
terms of race complexity.

But it's not overly complex: by taking the remote CPU's rq-lock from 
synchronize_rcu() we could get a stable pointer to the currently 
executing task.

And if we do that, we might as well use the opportunity and take a 
look at pt_regs as well - this is why sticking it into pt_regs might 
be better.

So I'd:

  - enlarge pt_regs by a word and stick the flag there (this 
allocation is essentially free)

  - update the word from entry/exit

  - synchronize_rcu() avoids having to send an IPI by taking a 
peek at rq->curr's pt_regs::flag, and if:

 - the flag is 0 then it has observed a quiescent state.

 - the flag is 1, then it would set TIF_NOHZ and wait for a 
   completion from a TIF_NOHZ callback.

synchronize_rcu() often involves waiting for (timer tick driven) grace 
periods anyway, so this is a relatively fast solution - and it would 
limit the overhead to 2 instructions.

On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.
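The C-level analogue of that patching would be a static-key test
around the new store - a sketch, assuming pt_regs grew a hypothetical
ctx_flag word as described above (the static key itself already
exists as context_tracking_enabled):

static inline void ct_set_ctx_flag(struct pt_regs *regs, int val)
{
	/* compiles to a NOP when no CPU is nohz-full */
	if (static_key_false(&context_tracking_enabled))
		regs->ctx_flag = val;	/* hypothetical new pt_regs word */
}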

Regarding the user/kernel execution time split measurement:

1) the same flag could be used to sample a remote CPU's statistics 
from another CPU and update the stats in the currently executing task. 
As long as there's at least one non-nohz-full CPU, this would work. Or 
are there systems where all CPUs are nohz-full?

2) Alternatively we could just drive user/kernel split statistics from 
context switches, which would be inaccurate if the workload is 
SCHED_FIFO that only rarely context switches.

How does this sound?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Mike Galbraith
On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote:
> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel  wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest,
> to another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,
> plus the cost of the IPI on the host, plus the cost of
> the emulation layer, plus VMEXIT/VMENTER on the second
> VCPU to trigger the IPI work, and possibly a second
> VMEXIT/VMENTER for IPI completion.
> 
> I suspect it would be better to do RCU callback offload
> in some other way.

I don't get it.  How the heck do people manage to talk about realtime in
virtual boxen, and not at least crack a smile.  Real, virtual, real,
virtual... what's wrong with this picture?

Why is virtual realtime not an oxymoron?

I did that for grins once, and it was either really funny, or really
sad, not sure which... but it did not look really really useful.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 12:11 PM, Rik van Riel  wrote:
> On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>
>> Or we could do that in the syscall path with a single store of a
>> constant flag to a location in the task struct. We have a number of
>> natural flags that get written on syscall entry, such as:
>>
>> pushq_cfi $__USER_DS/* pt_regs->ss */
>>
>> That goes to a constant location on the kernel stack. On return from
>> system calls we could write 0 to that location.

Huh?  IRET with zero there will fault, and we can't guarantee that all
syscalls return using sysret.  Also, we'd have to audit all the entries,
and system_call_after_swapgs currently enables interrupts early enough
that an interrupt before all the pushes will do unpredictable things to
pt_regs.

We could further abuse orig_ax, but that would require a lot of
auditing.  Honestly, though, I think keeping a flag in an
otherwise-hot cache line is a better bet.  Also, making it per-cpu
instead of per-task will probably be easier on the RCU code, since
otherwise the RCU code will have to do some kind of synchronization
(even if it's a wait-free probe of the rq lock or similar) to figure
out which pt_regs to look at.

If we went that route, I'd advocate sticking the flag in tss->sp1.
That cacheline is unconditionally read on kernel entry already, and
sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
lost track). [1]

Alternatively, I'd suggest that we actually add a whole new word to pt_regs.

[1] It's not unconditionally accessed yet, but it will be once Denys'
latest patches are in.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 PM, Ingo Molnar wrote:

> Or we could do that in the syscall path with a single store of a 
> constant flag to a location in the task struct. We have a number of 
> natural flags that get written on syscall entry, such as:
> 
> pushq_cfi $__USER_DS/* pt_regs->ss */
> 
> That goes to a constant location on the kernel stack. On return from 
> system calls we could write 0 to that location.
> 
> So the remote CPU would have to do a read of this location. There are 
> two cases:
> 
>  - If it's 0, then it has observed quiescent state on that CPU. (It 
>does not have to be atomics anymore, as we'd only observe the value 
>and MESI coherency takes care of it.)

That should do the trick.

>  - If it's not 0 then the remote CPU is not executing user-space code 
>and we can install (remotely) a TIF_NOHZ flag in it and expect it 
>to process it either on return to user-space or on a context 
>switch.

I may have to think about this a little more, but
it seems like it should work.

Can we use a separate byte in the flags word for
flags that can get set remotely, so we can do
stores and clearing of local-only flags without
atomic instructions?
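Something along these lines, purely illustrative and ignoring
endianness, is what that split might look like:

union split_tif_flags {
	u32 word;		/* cmpxchg target when setting remote bits */
	struct {
		u8  remote;	/* bits other CPUs may set, e.g. TIF_NOHZ */
		u8  local;	/* plain stores only, no lock prefix needed */
		u16 unused;
	};
};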

> This way, unless I'm missing something, reduces the overhead to a 
> single store to a hot cacheline on return-to-userspace - which 
> instruction if we place it well might as well be close to zero cost. 
> No syscall entry cost. Slow-return cost only in the (rare) case of 
> someone using synchronize_rcu().

I think that should take care of the RCU aspect of
nohz_full.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel  wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, but it's about the basic 
synchronization guarantee of synchronize_rcu(): that all RCU read-side 
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

pushq_cfi $__USER_DS/* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are 
two cases:

 - If it's 0, then it has observed quiescent state on that CPU. (It 
   does not have to be atomics anymore, as we'd only observe the value 
   and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.

This way, unless I'm missing something, reduces the overhead to a 
single store to a hot cacheline on return-to-userspace - which 
instruction if we place it well might as well be close to zero cost. 
No syscall entry cost. Slow-return cost only in the (rare) case of 
someone using synchronize_rcu().
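Putting the two cases together, the synchronize_rcu() side might look
roughly like this; the ctx_flag field and the completion wait are
hypothetical, only task_pt_regs() and the thread-flag helper are real:

static void rcu_observe_remote(struct task_struct *curr)
{
	/* plain load; MESI makes the 0/non-0 observation safe */
	if (READ_ONCE(task_pt_regs(curr)->ctx_flag) == 0)
		return;				/* quiescent state observed */

	set_tsk_thread_flag(curr, TIF_NOHZ);	/* handled on return or ctxsw */
	wait_for_nohz_completion(curr);		/* hypothetical completion wait */
}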

Hm?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
>>> I can understand people running hard-RT workloads not wanting to 
>>> see the overhead of a timer tick or a scheduler tick with variable 
>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>> caused by a single trivial IPI with constant work should be very, 
>>> very low and constant.
>>
>> Not if the realtime workload is running inside a KVM guest.
> 
> I don't buy this:
> 
>> At that point an IPI, either on the host or in the guest, involves a 
>> full VMEXIT & VMENTER cycle.
> 
> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> usec on recent hardware, and I bet it will get better with time.
> 
> I'm not aware of any hard-RT workload that cannot take 1 usec 
> latencies.

Now think about doing this kind of IPI from inside a guest,
to another VCPU on the same guest.

Now you are looking at VMEXIT/VMENTER on the first VCPU,
plus the cost of the IPI on the host, plus the cost of
the emulation layer, plus VMEXIT/VMENTER on the second
VCPU to trigger the IPI work, and possibly a second
VMEXIT/VMENTER for IPI completion.

I suspect it would be better to do RCU callback offload
in some other way.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> > I.e. much of what we do today, except that we could skip variable 
> > work such as the scheduler tick or (unforced) RCU processing like 
> > the RCU softirq work.
> 
> Any ideas how we could avoid that sampling timer interrupt latency 
> stacking up when dealing with both guest and host?

Well, it would be host_latency+guest_latency worst-case, right?

That should still be bounded, as long as both guest and host is 
running nohz-full mode.

Also, technically when the host gets such an IRQ, the guest is 
interrupted as well. So the host could do the guest's work and pass 
through the result via paravirt or so.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 01:12 PM, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
>> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
>>>
>>> * Rik van Riel  wrote:
>>>
 On 05/01/2015 12:37 PM, Ingo Molnar wrote:

> Also note that this bit in context_tracking_enter():
>
> if (state == CONTEXT_USER) {
> trace_user_enter(0);
> vtime_user_enter(current);
> }
>
>
> is related to precise time measurement of user/kernel execution 
> times, it's not needed by the scheduler at all, it's just exported 
> to tooling. It's not fundamental to the scheduler.

 Any objections to the idea from the other thread to simply keep the 
 time accumulating in buffers in local_clock() units, and only update 
 the task vtime once a second or so?
>>>
>>> So I really think per syscall overhead is the wrong thing to do for 
>>> anything that a common distro enables.
>>>
>>> I see very little use for such precise, high-freq measurements on 
>>> normal systems - and abnormal systems could enable it dynamically just 
>>> like they can enable syscall auditing.
>>>
>>> I.e. I don't mind the occasional crazy piece of code, as long as it 
>>> does not hurt the innocent.
>>
>> Then how should/could we keep a rough idea of user / system / guest 
>> time when running without a periodic timer tick?
> 
> So I'd split the timer tick into two parts: just the constant-work 
> sampling bit that doesn't do much, and the variable-work part which 
> gets essentially shut off when the timeout is far into the future.
> 
> Then we could do IRQ driven sampling without introducing variable 
> amount jitter into hard-RT execution time.
> 
> I.e. much of what we do today, except that we could skip variable work 
> such as the scheduler tick or (unforced) RCU processing like the RCU 
> softirq work.

Any ideas how we could avoid that sampling timer interrupt
latency stacking up when dealing with both guest and host?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel  wrote:
> > 
> >> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
> >>
> >>> Also note that this bit in context_tracking_enter():
> >>>
> >>> if (state == CONTEXT_USER) {
> >>> trace_user_enter(0);
> >>> vtime_user_enter(current);
> >>> }
> >>>
> >>>
> >>> is related to precise time measurement of user/kernel execution 
> >>> times, it's not needed by the scheduler at all, it's just exported 
> >>> to tooling. It's not fundamental to the scheduler.
> >>
> >> Any objections to the idea from the other thread to simply keep the 
> >> time accumulating in buffers in local_clock() units, and only update 
> >> the task vtime once a second or so?
> > 
> > So I really think per syscall overhead is the wrong thing to do for 
> > anything that a common distro enables.
> > 
> > I see very little use for such precise, high-freq measurements on 
> > normal systems - and abnormal systems could enable it dynamically just 
> > like they can enable syscall auditing.
> > 
> > I.e. I don't mind the occasional crazy piece of code, as long as it 
> > does not hurt the innocent.
> 
> Then how should/could we keep a rough idea of user / system / guest 
> time when running without a periodic timer tick?

So I'd split the timer tick into two parts: just the constant-work 
sampling bit that doesn't do much, and the variable-work part which 
gets essentially shut off when the timeout is far into the future.

Then we could do IRQ driven sampling without introducing variable 
amount jitter into hard-RT execution time.

I.e. much of what we do today, except that we could skip variable work 
such as the scheduler tick or (unforced) RCU processing like the RCU 
softirq work.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:45 PM, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
>> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
>>
>>> Also note that this bit in context_tracking_enter():
>>>
>>> if (state == CONTEXT_USER) {
>>> trace_user_enter(0);
>>> vtime_user_enter(current);
>>> }
>>>
>>>
>>> is related to precise time measurement of user/kernel execution 
>>> times, it's not needed by the scheduler at all, it's just exported 
>>> to tooling. It's not fundamental to the scheduler.
>>
>> Any objections to the idea from the other thread to simply keep the 
>> time accumulating in buffers in local_clock() units, and only update 
>> the task vtime once a second or so?
> 
> So I really think per syscall overhead is the wrong thing to do for 
> anything that a common distro enables.
> 
> I see very little use for such precise, high-freq measurements on 
> normal systems - and abnormal systems could enable it dynamically just 
> like they can enable syscall auditing.
> 
> I.e. I don't mind the occasional crazy piece of code, as long as it 
> does not hurt the innocent.

Then how should/could we keep a rough idea of user / system / guest
time when running without a periodic timer tick?

These statistics are useful, and gathering them in a low impact
way (local_clock read + variable read + subtraction + addition +
store) may be acceptable overhead.

This is a few function calls, one tsc read, a possible cache
line miss, and an unlikely branch.
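A sketch of that accumulation path, with invented names (vtime_buf
and the flush helper exist only for illustration):

struct vtime_buf {
	u64 last;			/* local_clock() at the last transition */
	u64 last_flush;			/* when we last folded into task vtime */
	u64 user_ns, system_ns;		/* accumulated, not yet in task vtime */
};

static DEFINE_PER_CPU(struct vtime_buf, vtime_buf);

static inline void vtime_mode_switch(bool was_user)
{
	struct vtime_buf *b = this_cpu_ptr(&vtime_buf);
	u64 now = local_clock();	/* the one tsc read */

	if (was_user)
		b->user_ns += now - b->last;
	else
		b->system_ns += now - b->last;
	b->last = now;

	if (unlikely(now - b->last_flush > NSEC_PER_SEC))
		vtime_flush_to_task(current);	/* hypothetical, ~once a second */
}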

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
> 
> > Also note that this bit in context_tracking_enter():
> > 
> > if (state == CONTEXT_USER) {
> > trace_user_enter(0);
> > vtime_user_enter(current);
> > }
> > 
> > 
> > is related to precise time measurement of user/kernel execution 
> > times, it's not needed by the scheduler at all, it's just exported 
> > to tooling. It's not fundamental to the scheduler.
> 
> Any objections to the idea from the other thread to simply keep the 
> time accumulating in buffers in local_clock() units, and only update 
> the task vtime once a second or so?

So I really think per syscall overhead is the wrong thing to do for 
anything that a common distro enables.

I see very little use for such precise, high-freq measurements on 
normal systems - and abnormal systems could enable it dynamically just 
like they can enable syscall auditing.

I.e. I don't mind the occasional crazy piece of code, as long as it 
does not hurt the innocent.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:37 PM, Ingo Molnar wrote:

> Also note that this bit in context_tracking_enter():
> 
> if (state == CONTEXT_USER) {
> trace_user_enter(0);
> vtime_user_enter(current);
> }
> 
> 
> is related to precise time measurement of user/kernel execution times, 
> it's not needed by the scheduler at all, it's just exported to 
> tooling. It's not fundamental to the scheduler.

Any objections to the idea from the other thread to simply keep
the time accumulating in buffers in local_clock() units, and
only update the task vtime once a second or so?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > 2. To suppress the timing tick, we need to get some timing for, 
> > um, the scheduler?  I wasn't really sure about this one.
> 
> So we have variable timeslice timers for the scheduler implemented, 
> they are off by default but they worked the last time someone tried 
> them. See 
> the 'HRTICK' scheduler feature.
> 
> And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
> stopping the scheduler tick. (within reason.)

Also note that this bit in context_tracking_enter():

if (state == CONTEXT_USER) {
trace_user_enter(0);
vtime_user_enter(current);
}


is related to precise time measurement of user/kernel execution times, 
it's not needed by the scheduler at all, it's just exported to 
tooling. It's not fundamental to the scheduler.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> > I can understand people running hard-RT workloads not wanting to 
> > see the overhead of a timer tick or a scheduler tick with variable 
> > (and occasionally heavy) work done in IRQ context, but the jitter 
> > caused by a single trivial IPI with constant work should be very, 
> > very low and constant.
> 
> Not if the realtime workload is running inside a KVM guest.

I don't buy this:

> At that point an IPI, either on the host or in the guest, involves a 
> full VMEXIT & VMENTER cycle.

So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
usec on recent hardware, and I bet it will get better with time.

I'm not aware of any hard-RT workload that cannot take 1 usec 
latencies.
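
For reference, the arithmetic behind that estimate (the clock rates
are illustrative):

    2000 cycles / 2.0 GHz = 1.0 usec
    2000 cycles / 3.0 GHz ~ 0.67 usec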

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:03 PM, Andy Lutomirski wrote:
> 
> > The last time I asked, the impression I got was that we needed two things:
> > 
> > 1. We can't pluck things from the RCU list without knowing whether the
> > CPU is in an RCU read-side critical section, and we can't know that
> > unless we have regular grace periods or we know that the CPU is idle.
> > To make the CPU detectably idle, we need to set a bit somewhere.
> 
> More than that. We also need a way for another CPU to identify the 
> callbacks they could run for us, without confusing them with new 
> callbacks queued after we transitioned back from USER to KERNEL 
> context.

That's easy: a simple, constant-work IPI sent to nohz-full CPUs could 
just dequeue these callbacks, if it sees that we are still in 
user-mode.

It's super easy because such an IPI would essentially be equivalent to 
a system call context from RCU's POV, if it uses this test:

if (user_mode(regs)) {
... pluck RCU callbacks ...
}

That way we can have lots of overhead removed from the syscall 
path ...
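
A minimal sketch of such an IPI (the handler name and the hand-off
helper are invented for illustration; the rest are existing kernel
calls):

    /* Runs on the nohz-full CPU, in hard irq context. */
    static void rcu_pluck_ipi(void *unused)
    {
            struct pt_regs *regs = get_irq_regs();

            /*
             * If we interrupted pure user mode, this CPU is in a
             * quiescent state and its pending callbacks can be
             * handed to the grace-period machinery.
             */
            if (user_mode(regs))
                    rcu_hand_off_callbacks();       /* hypothetical */
    }

    /* Caller side, e.g. from synchronize_rcu(): */
    smp_call_function_single(cpu, rcu_pluck_ipi, NULL, 1);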

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:21 PM, Ingo Molnar wrote:
> 
> * Andy Lutomirski  wrote:
> 
>>> So what's the point? Why not remove this big source of overhead 
>>> altogether?
>>
>> The last time I asked, the impression I got was that we needed two 
>> things:
>>
>> 1. We can't pluck things from the RCU list without knowing whether 
>> the CPU is in an RCU read-side critical section, and we can't know 
>> that unless we have regular grace periods or we know that the CPU is 
>> idle. To make the CPU detectably idle, we need to set a bit 
>> somewhere.
> 
> 'Idle' as in 'executing pure user-space mode, without entering the 
> kernel and possibly doing an rcu_read_lock()', right?
> 
> So we don't have to test it from the remote CPU: we could probe such 
> CPUs via a single low-overhead IPI. I'd much rather push such overhead 
> to sync_rcu() than to the syscall entry code!
>
> I can understand people running hard-RT workloads not wanting to see 
> the overhead of a timer tick or a scheduler tick with variable (and 
> occasionally heavy) work done in IRQ context, but the jitter caused by 
> a single trivial IPI with constant work should be very, very low and 
> constant.

Not if the realtime workload is running inside a KVM
guest.

At that point an IPI, either on the host or in the
guest, involves a full VMEXIT & VMENTER cycle.

I suspect it would be easy enough at user_enter() or
guest_enter() time to:
1) flip the bit
2) check to see if we have any callbacks queued on our own queue, and
3) if so, move the callbacks from that queue to the "another CPU can
   take these" queue

At that point, another CPU wanting to do an RCU grace period
advance can:
1) skip trying to advance the grace period on the CPU that
   is in userspace or guest mode, and
2) take the "for somebody else" callbacks of that CPU, and
   run them

This does not seem overly complicated.
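
As a sketch, the enter-side steps 1-3 above could look like this (the
struct and its fields are invented for illustration):

    struct cb_queues {
            struct list_head mine;       /* new callbacks land here     */
            struct list_head for_others; /* a remote CPU may take these */
            bool             in_user;
    };
    static DEFINE_PER_CPU(struct cb_queues, cb_queues);

    /* Called from user_enter()/guest_enter(); ordering vs. the
     * remote reader (locking or memory barriers) is elided here. */
    static void enter_user_handoff(void)
    {
            struct cb_queues *q = this_cpu_ptr(&cb_queues);

            WRITE_ONCE(q->in_user, true);                   /* 1 */
            if (!list_empty(&q->mine))                      /* 2 */
                    list_splice_tail_init(&q->mine,
                                          &q->for_others);  /* 3 */
    }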

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:03 PM, Andy Lutomirski wrote:

> The last time I asked, the impression I got was that we needed two things:
> 
> 1. We can't pluck things from the RCU list without knowing whether the
> CPU is in an RCU read-side critical section, and we can't know that
> unless we have regular grace periods or we know that the CPU is idle.
> To make the CPU detectably idle, we need to set a bit somewhere.

More than that. We also need a way for another CPU to identify the
callbacks they could run for us, without confusing them with new
callbacks queued after we transitioned back from USER to KERNEL
context.

> 2. To suppress the timing tick, we need to get some timing for, um,
> the scheduler?  I wasn't really sure about this one.
> 
> Could we reduce the overhead by making the IN_USER vs IN_KERNEL
> indication be a single bit and, worst case, an rdtsc and maybe a
> subtraction?  We could probably get away with banning full nohz on
> non-invariant tsc systems.

I suspect we can.

There is no need to update the vtime every single time
we enter vtime_user_enter and functions like it.

We can keep a buffer, which:
1) keeps values in TSC cycles (or whatever unit local_clock does)
2) is only ever accessed by the current task, so it requires no
   locking
3) values can be actually folded into vtime periodically, when
   they exceed a certain threshold (1 second ?)

That means the vtime_seqlock is something that we would only take
once a second or so, and the calculations in account_user_time
would only be done on a fairly infrequent basis. That has the
potential to reduce overhead by a lot.
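
A sketch of that buffering (the field names and the once-a-second
threshold are illustrative; vtime_fold() stands in for a hypothetical
helper that takes the vtime_seqlock and calls account_user_time()):

    #define VTIME_FLUSH_NS  NSEC_PER_SEC  /* fold roughly once a second */

    struct vtime_buf {
            u64 user_ns;    /* unflushed user time, local_clock() units */
            u64 last;       /* local_clock() at the last transition     */
    };

    /* USER -> KERNEL transition; only 'current' touches this buffer,
     * so no locking is needed. */
    static void vtime_buf_user_exit(struct vtime_buf *vb)
    {
            u64 now = local_clock();

            vb->user_ns += now - vb->last;  /* read + sub + add + store */
            vb->last = now;

            if (unlikely(vb->user_ns > VTIME_FLUSH_NS)) {
                    vtime_fold(current, vb->user_ns);   /* hypothetical */
                    vb->user_ns = 0;
            }
    }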

If nobody has any objections to that kind of change, I would be
happy to implement it.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > So what's the point? Why not remove this big source of overhead 
> > altogether?
> 
> The last time I asked, the impression I got was that we needed two 
> things:
> 
> 1. We can't pluck things from the RCU list without knowing whether 
> the CPU is in an RCU read-side critical section, and we can't know 
> that unless we have regular grace periods or we know that the CPU is 
> idle. To make the CPU detectably idle, we need to set a bit 
> somewhere.

'Idle' as in 'executing pure user-space mode, without entering the 
kernel and possibly doing an rcu_read_lock()', right?

So we don't have to test it from the remote CPU: we could probe such 
CPUs via a single low-overhead IPI. I'd much rather push such overhead 
to sync_rcu() than to the syscall entry code!

I can understand people running hard-RT workloads not wanting to see 
the overhead of a timer tick or a scheduler tick with variable (and 
occasionally heavy) work done in IRQ context, but the jitter caused by 
a single trivial IPI with constant work should be very, very low and 
constant.

If user-space RT code does not tolerate _that_ kind of latencies then 
it really has its priorities wrong and we should not try to please it. 
It should not hurt the other 99.9% of sane hard-RT users.

And the other usecase, virtualization, obviously does not care and 
could take the IPI just fine.

> 2. To suppress the timing tick, we need to get some timing for, um, 
> the scheduler?  I wasn't really sure about this one.

So we have variable timeslice timers for the scheduler implemented, 
> they are off by default but they worked the last time someone tried 
> them. See 
the 'HRTICK' scheduler feature.

And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
stopping the scheduler tick. (within reason.)

> Could we reduce the overhead by making the IN_USER vs IN_KERNEL 
> indication be a single bit and, worst case, an rdtsc and maybe a 
> subtraction?  We could probably get away with banning full nohz on 
> non-invariant tsc systems.
> 
> (I do understand why it would be tricky to transition from IN_USER 
> to IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

We can make it literally zero overhead: by using an IPI from 
synchronize_rcu() and friends.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar  wrote:
>
> * Rik van Riel  wrote:
>
>> > I.e. what's the baseline we are talking about?
>>
>> It's an astounding difference. This is not a kernel without
>> nohz_full, just a CPU without nohz_full running the same kernel I
>> tested with yesterday:
>>
>>               run time   system time
>> vanilla       5.49s      2.08s
>> __acct patch  5.21s      1.92s
>> both patches  4.88s      1.71s
>> CPU w/o nohz  3.12s      1.63s      <-- your numbers, mostly
>>
>> What is even more interesting is that the majority of the time
>> difference seems to come from _user_ time, which has gone down from
>> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
>> the CPU without nohz_full enabled...
>>
>> At syscall entry time, the nohz_full context tracking code is very
>> straightforward. We check thread_info->flags &
>> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
>> handles USER -> KERNEL context transition.
>>
>> Syscall exit time is a convoluted mess. Both do_notify_resume and
>> syscall_trace_leave call exit_user() on entry and enter_user() on
>> exit, leaving the time spent looping around between int_with_check
>> and syscall_return: in entry_64.S accounted as user time.
>>
>> I sent an email about this last night, it may be useful to add a
>> third test & function call point to the syscall return code, where
>> we can call user_enter() just ONCE, and remove the other context
>> tracking calls from that loop.
>
> So what I'm wondering about is the big picture:
>
>  - This is crazy big overhead in something as fundamental as system
>calls!
>
>  - We don't even have the excuse of the syscall auditing code, which
>kind of has to run for every syscall if it wants to do its job!
>
>  - [ The 'precise vtime' stuff that is driven from syscall entry/exit
>  is crazy, and I hope not enabled in any distro. ]
>
>  - So why are we doing this in every syscall time at all?
>
> Basically the whole point of user-context tracking is to be able to
> flush pending RCU callbacks. But that's crazy, we can sure defer a few
> kfree()s on this CPU, even indefinitely!
>
> If some other CPU does a sync_rcu(), then it can very well pluck those
> callbacks from this super low latency CPU's RCU lists (with due care)
> and go and free stuff itself ... There's no need to disturb this CPU
> for that!
>
> If user-space does not do anything kernel-ish then there won't be any
> new RCU callbacks piled up, so it's not like it's a resource leak
> issue either.
>
> So what's the point? Why not remove this big source of overhead
> altogether?

The last time I asked, the impression I got was that we needed two things:

1. We can't pluck things from the RCU list without knowing whether the
CPU is in an RCU read-side critical section, and we can't know that
unless we have regular grace periods or we know that the CPU is idle.
To make the CPU detectably idle, we need to set a bit somewhere.

2. To suppress the timing tick, we need to get some timing for, um,
the scheduler?  I wasn't really sure about this one.

Could we reduce the overhead by making the IN_USER vs IN_KERNEL
indication be a single bit and, worst case, an rdtsc and maybe a
subtraction?  We could probably get away with banning full nohz on
non-invariant tsc systems.

(I do understand why it would be tricky to transition from IN_USER to
IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)
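
A sketch of that single-bit indication (the percpu variable is
invented for illustration):

    /* 0 == IN_USER, 1 == IN_KERNEL; one plain store per transition. */
    static DEFINE_PER_CPU(int, context_state);

    static inline void ct_enter_kernel(void)
    {
            __this_cpu_write(context_state, 1);
            /* worst case, add an rdtsc + subtraction for the time split */
    }

    static inline void ct_enter_user(void)
    {
            __this_cpu_write(context_state, 0);
    }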

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without 
> nohz_full, just a CPU without nohz_full running the same kernel I 
> tested with yesterday:
> 
>               run time   system time
> vanilla       5.49s      2.08s
> __acct patch  5.21s      1.92s
> both patches  4.88s      1.71s
> CPU w/o nohz  3.12s      1.63s      <-- your numbers, mostly
> 
> What is even more interesting is that the majority of the time 
> difference seems to come from _user_ time, which has gone down from 
> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on 
> the CPU without nohz_full enabled...
> 
> At syscall entry time, the nohz_full context tracking code is very 
> straightforward. We check thread_info->flags & 
> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which 
> handles USER -> KERNEL context transition.
> 
> Syscall exit time is a convoluted mess. Both do_notify_resume and 
> syscall_trace_leave call exit_user() on entry and enter_user() on 
> exit, leaving the time spent looping around between int_with_check 
> and syscall_return: in entry_64.S accounted as user time.
> 
> I sent an email about this last night, it may be useful to add a 
> third test & function call point to the syscall return code, where 
> we can call user_enter() just ONCE, and remove the other context 
> tracking calls from that loop.

So what I'm wondering about is the big picture:

 - This is crazy big overhead in something as fundamental as system
   calls!

 - We don't even have the excuse of the syscall auditing code, which
   kind of has to run for every syscall if it wants to do its job!

 - [ The 'precise vtime' stuff that is driven from syscall entry/exit 
 is crazy, and I hope not enabled in any distro. ]

 - So why are we doing this in every syscall time at all?

Basically the whole point of user-context tracking is to be able to 
flush pending RCU callbacks. But that's crazy, we can sure defer a few 
kfree()s on this CPU, even indefinitely!

If some other CPU does a sync_rcu(), then it can very well pluck those 
callbacks from this super low latency CPU's RCU lists (with due care) 
and go and free stuff itself ... There's no need to disturb this CPU 
for that!

If user-space does not do anything kernel-ish then there won't be any 
new RCU callbacks piled up, so it's not like it's a resource leak 
issue either.

So what's the point? Why not remove this big source of overhead 
altogether?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 AM, Ingo Molnar wrote:

>> This patch builds on top of these patches by Paolo:
>> https://lkml.org/lkml/2015/4/28/188
>> https://lkml.org/lkml/2015/4/29/139
>>
>> Together with this patch I posted earlier this week, the syscall path
>> on a nohz_full cpu seems to be about 10% faster.
>> https://lkml.org/lkml/2015/4/24/394
>>
>> My test is a simple microbenchmark that calls getpriority() in a loop
>> 10 million times:
>>
>>              run time   system time
>> vanilla      5.49s      2.08s
>> __acct patch 5.21s      1.92s
>> both patches 4.88s      1.71s
>
> Just curious, what are the numbers if you don't have context tracking 
> enabled, i.e. without nohz_full?
> 
> I.e. what's the baseline we are talking about?

It's an astounding difference. This is not a kernel without nohz_full,
just a CPU without nohz_full running the same kernel I tested with
yesterday:

                run time   system time
vanilla         5.49s      2.08s
__acct patch    5.21s      1.92s
both patches    4.88s      1.71s
CPU w/o nohz    3.12s      1.63s      <-- your numbers, mostly

What is even more interesting is that the majority of the time
difference seems to come from _user_ time, which has gone down
from around 3.4 seconds in the vanilla kernel to around 1.5 seconds
on the CPU without nohz_full enabled...

At syscall entry time, the nohz_full context tracking code is very
straightforward. We check thread_info->flags & _TIF_WORK_SYSCALL_ENTRY,
and call syscall_trace_enter_phase1, which handles USER -> KERNEL
context transition.

Syscall exit time is a convoluted mess. Both do_notify_resume and
syscall_trace_leave call exit_user() on entry and enter_user()
on exit, leaving the time spent looping around between int_with_check
and syscall_return: in entry_64.S accounted as user time.

I sent an email about this last night, it may be useful to add a
third test & function call point to the syscall return code, where
we can call user_enter() just ONCE, and remove the other context
tracking calls from that loop.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* r...@redhat.com  wrote:

> From: Rik van Riel 
> 
> On syscall entry with nohz_full on, we enable interrupts, call user_exit,
> disable interrupts, do something, re-enable interrupts, and go on our
> merry way.
> 
> Profiling shows that a large amount of the nohz_full overhead comes
> from the extraneous disabling and re-enabling of interrupts. Andy
> suggested simply not enabling interrupts until after the context
> tracking code has done its thing, which allows us to skip a whole
> interrupt disable & re-enable cycle.
> 
> This patch builds on top of these patches by Paolo:
> https://lkml.org/lkml/2015/4/28/188
> https://lkml.org/lkml/2015/4/29/139
> 
> Together with this patch I posted earlier this week, the syscall path
> on a nohz_full cpu seems to be about 10% faster.
> https://lkml.org/lkml/2015/4/24/394
> 
> My test is a simple microbenchmark that calls getpriority() in a loop
> 10 million times:
> 
>               run time   system time
> vanilla       5.49s      2.08s
> __acct patch  5.21s      1.92s
> both patches  4.88s      1.71s

Just curious, what are the numbers if you don't have context tracking 
enabled, i.e. without nohz_full?

I.e. what's the baseline we are talking about?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Mike Galbraith
On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote:
> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <r...@redhat.com> wrote:
> > 
> > > > I can understand people running hard-RT workloads not wanting to 
> > > > see the overhead of a timer tick or a scheduler tick with variable 
> > > > (and occasionally heavy) work done in IRQ context, but the jitter 
> > > > caused by a single trivial IPI with constant work should be very, 
> > > > very low and constant.
> > > 
> > > Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> > > At that point an IPI, either on the host or in the guest, involves a 
> > > full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest,
> to another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,
> plus the cost of the IPI on the host, plus the cost of
> the emulation layer, plus VMEXIT/VMENTER on the second
> VCPU to trigger the IPI work, and possibly a second
> VMEXIT/VMENTER for IPI completion.
> 
> I suspect it would be better to do RCU callback offload
> in some other way.

I don't get it.  How the heck do people manage to talk about realtime in
virtual boxen, and not at least crack a smile.  Real, virtual, real,
virtual... what's wrong with this picture?

Why is virtual realtime not an oxymoron?

I did that for grins once, and it was either really funny, or really
sad, not sure which... but it did not look really really useful.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski <l...@amacapital.net> wrote:

> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <r...@redhat.com> wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> > > Or we could do that in the syscall path with a single store of a
> > > constant flag to a location in the task struct. We have a number of
> > > natural flags that get written on syscall entry, such as:
> > >
> > >         pushq_cfi $__USER_DS            /* pt_regs->ss */
> > >
> > > That goes to a constant location on the kernel stack. On return from
> > > system calls we could write 0 to that location.
> 
> Huh?  IRET with zero there will fault, and we can't guarantee that 
> all syscalls return using sysret. [...]

So IRET is a slowpath - I was thinking about the fastpath mainly.

> [...]  Also, we'd have to audit all the entries, and 
> system_call_after_swapgs currently enables interrupts early enough 
> that an interrupt before all the pushes will do unpredictable things 
> to pt_regs.

An irq hardware frame won't push zero to that selector value, right? 
That's the only bad thing that would confuse the code.

> We could further abuse orig_ax, but that would require a lot of 
> auditing.  Honestly, though, I think keeping a flag in an 
> otherwise-hot cache line is a better bet. [...]

That would work too, at the cost of one more instruction, as now we'd 
have to both set and clear it.

> [...]  Also, making it per-cpu instead of per-task will probably be 
> easier on the RCU code, since otherwise the RCU code will have to do 
> some kind of synchronization (even if it's a wait-free probe of the 
> rq lock or similar) to figure out which pt_regs to look at.

So the synchronize_rcu() part is an even slower slow path, in 
comparison with system call entry overhead.

But yes, safely accessing the remote task is a complication, but with 
such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So 
even if we do:

> If we went that route, I'd advocate sticking the flag in tss->sp1. 
> That cacheline is unconditionally read on kernel entry already, and 
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I 
> lost track). [1]
> 
> Alternatively, I'd suggest that we actually add a whole new word to 
> pt_regs.

... we'd still have to set TIF_NOHZ and are back to square one in 
terms of race complexity.

But it's not overly complex: by taking the remote CPU's rq->lock from 
synchronize_rcu() we could get a stable pointer to the currently 
executing task.

And if we do that, we might as well use the opportunity and take a 
look at pt_regs as well - this is why sticking it into pt_regs might 
be better.

So I'd:

  - enlarge pt_regs by a word and stick the flag there (this 
    allocation is essentially free)

  - update the word from entry/exit

  - synchronize_rcu() avoids having to send an IPI by taking a 
    peek at rq->curr's pt_regs::flag, and if:

     - the flag is 0 then it has observed a quiescent state.

     - the flag is 1, then it would set TIF_NOHZ and wait for a 
       completion from a TIF_NOHZ callback.

synchronize_rcu() often involves waiting for (timer tick driven) grace 
periods anyway, so this is a relatively fast solution - and it would 
limit the overhead to 2 instructions.
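
In rough pseudo-C, the synchronize_rcu() side of that scheme might
look like this (pt_regs::ct_flag and the ct_qs_done completion are
hypothetical; the other calls are existing kernel interfaces):

    static void rcu_probe_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            struct task_struct *p;
            bool in_kernel;

            raw_spin_lock_irq(&rq->lock);    /* stable view of rq->curr */
            p = rq->curr;
            in_kernel = task_pt_regs(p)->ct_flag;       /* hypothetical */
            raw_spin_unlock_irq(&rq->lock);

            if (!in_kernel)
                    return;          /* flag 0: quiescent state observed */

            /* flag 1: ask for a TIF_NOHZ callback and wait for it */
            set_tsk_thread_flag(p, TIF_NOHZ);
            wait_for_completion(&per_cpu(ct_qs_done, cpu));
    }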

On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.

Regarding the user/kernel execution time split measurement:

1) the same flag could be used to sample a remote CPU's statistics 
from another CPU and update the stats in the currently executing task. 
As long as there's at least one non-nohz-full CPU, this would work. Or 
are there systems where all CPUs are nohz-full?

2) Alternatively we could just drive user/kernel split statistics from 
context switches, which would be inaccurate if the workload is 
SCHED_FIFO that only rarely context switches.

How does this sound?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 01:12 PM, Ingo Molnar wrote:
> 
> * Rik van Riel <r...@redhat.com> wrote:
> 
>> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
>>>
>>> * Rik van Riel <r...@redhat.com> wrote:
>>>
>>>> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
>>>>
>>>>> Also note that this bit in context_tracking_enter():
>>>>>
>>>>> if (state == CONTEXT_USER) {
>>>>> trace_user_enter(0);
>>>>> vtime_user_enter(current);
>>>>> }
>>>>>
>>>>>
>>>>> is related to precise time measurement of user/kernel execution 
>>>>> times, it's not needed by the scheduler at all, it's just exported 
>>>>> to tooling. It's not fundamental to the scheduler.
>>>>
>>>> Any objections to the idea from the other thread to simply keep the 
>>>> time accumulating in buffers in local_clock() units, and only update 
>>>> the task vtime once a second or so?
>>>
>>> So I really think per syscall overhead is the wrong thing to do for 
>>> anything that a common distro enables.
>>>
>>> I see very little use for such precise, high-freq measurements on 
>>> normal systems - and abnormal systems could enable it dynamically just 
>>> like they can enable syscall auditing.
>>>
>>> I.e. I don't mind the occasional crazy piece of code, as long as it 
>>> does not hurt the innocent.
>>
>> Then how should/could we keep a rough idea of user / system / guest 
>> time when running without a periodic timer tick?
> 
> So I'd split the timer tick into two parts: just the constant-work 
> sampling bit that doesn't do much, and the variable-work part which 
> gets essentially shut off when the timeout is far into the future.
> 
> Then we could do IRQ driven sampling without introducing a variable 
> amount of jitter into hard-RT execution time.
> 
> I.e. much of what we do today, except that we could skip variable work 
> such as the scheduler tick or (unforced) RCU processing like the RCU 
> softirq work.

Any ideas how we could avoid that sampling timer interrupt
latency stacking up when dealing with both guest and host?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:21 PM, Ingo Molnar wrote:
 
 * Andy Lutomirski l...@amacapital.net wrote:
 
 So what's the point? Why not remove this big source of overhead 
 altogether?

 The last time I asked, the impression I got was that we needed two 
 things:

 1. We can't pluck things from the RCU list without knowing whether 
 the CPU is in an RCU read-side critical section, and we can't know 
 that unless we have regular grade periods or we know that the CPU is 
 idle. To make the CPU detectably idle, we need to set a bit 
 somewhere.
 
 'Idle' as in 'executing pure user-space mode, without entering the 
 kernel and possibly doing an rcu_read_lock()', right?
 
 So we don't have to test it from the remote CPU: we could probe such 
 CPUs via a single low-overhead IPI. I'd much rather push such overhead 
 to sync_rcu() than to the syscall entry code!

 I can understand people running hard-RT workloads not wanting to see 
 the overhead of a timer tick or a scheduler tick with variable (and 
 occasionally heavy) work done in IRQ context, but the jitter caused by 
 a single trivial IPI with constant work should be very, very low and 
 constant.

Not if the realtime workload is running inside a KVM
guest.

At that point an IPI, either on the host or in the
guest, involves a full VMEXIT  VMENTER cycle.

I suspect it would be easy enough at user_enter() or
guest_enter() time to:
1) flip the bit
2) check to see if we have any callbacks queued on our on queue, and
3) if so, move the callbacks from that queue to the another CPU can
   take these queue

At that point, another CPU wanting to do an RCU grace period
advance can:
1) skip trying to advance the grace period on the CPU that
   is in userspace or guest mode, and
2) take the for somebody else callbacks of that CPU, and
   run them

This does not seem overly complicated.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:37 PM, Ingo Molnar wrote:

 Also note that this bit in context_tracking_enter():
 
 if (state == CONTEXT_USER) {
 trace_user_enter(0);
 vtime_user_enter(current);
 }
 
 
 is related to precise time measurement of user/kernel execution times, 
 it's not needed by the scheduler at all, it's just exported to 
 tooling. It's not fundamental to the scheduler.

Any objections to the idea from the other thread to simply keep
the time accumulating in buffers in local_clock() units, and
only update the task vtime once a second or so?

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski l...@amacapital.net wrote:

  So what's the point? Why not remove this big source of overhead 
  altogether?
 
 The last time I asked, the impression I got was that we needed two 
 things:
 
 1. We can't pluck things from the RCU list without knowing whether 
 the CPU is in an RCU read-side critical section, and we can't know 
 that unless we have regular grade periods or we know that the CPU is 
 idle. To make the CPU detectably idle, we need to set a bit 
 somewhere.

'Idle' as in 'executing pure user-space mode, without entering the 
kernel and possibly doing an rcu_read_lock()', right?

So we don't have to test it from the remote CPU: we could probe such 
CPUs via a single low-overhead IPI. I'd much rather push such overhead 
to sync_rcu() than to the syscall entry code!

I can understand people running hard-RT workloads not wanting to see 
the overhead of a timer tick or a scheduler tick with variable (and 
occasionally heavy) work done in IRQ context, but the jitter caused by 
a single trivial IPI with constant work should be very, very low and 
constant.

If user-space RT code does not tolerate _that_ kind of latencies then 
it really has its priorities wrong and we should not try to please it. 
It should not hurt the other 99.9% of sane hard-RT users.

And the other usecase, virtualization, obviously does not care and 
could take the IPI just fine.

 2. To suppress the timing tick, we need to get some timing for, um, 
 the scheduler?  I wasn't really sure about this one.

So we have variable timeslice timers for the scheduler implemented, 
they are off by default but they worked last someone tried them. See 
the 'HRTICK' scheduler feature.

And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
stopping the scheduler tick. (within reason.)

 Could we reduce the overhead by making the IN_USER vs IN_KERNEL 
 indication be a single bit and, worst case, an rdtsc and maybe a 
 subtraction?  We could probably get away with banning full nohz on 
 non-invariant tsc systems.
 
 (I do understand why it would be tricky to transition from IN_USER 
 to IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

We can make it literally zero overhead: by using an IPI from 
synchronize_rcu() and friend.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:03 PM, Andy Lutomirski wrote:

 The last time I asked, the impression I got was that we needed two things:
 
 1. We can't pluck things from the RCU list without knowing whether the
 CPU is in an RCU read-side critical section, and we can't know that
 unless we have regular grade periods or we know that the CPU is idle.
 To make the CPU detectably idle, we need to set a bit somewhere.

More than that. We also need a way for another CPU to identify the
callbacks they could run for us, without confusing them with new
callbacks queued after we transitioned back from USER to KERNEL
context.

 2. To suppress the timing tick, we need to get some timing for, um,
 the scheduler?  I wasn't really sure about this one.
 
 Could we reduce the overhead by making the IN_USER vs IN_KERNEL
 indication be a single bit and, worst case, an rdtsc and maybe a
 subtraction?  We could probably get away with banning full nohz on
 non-invariant tsc systems.

I suspect we can.

There is no need to update the vtime every single time
we enter vtime_user_enter and functions like it.

We can keep a buffer, which:
1) keeps values in TSC cycles (or whatever unit local_clock does)
2) is only ever accessed by the current task, so it requires no
   locking
3) values can be actually folded into vtime periodically, when
   they exceed a certain threshold (1 second ?)

That means the vtime_seqlock is something that we would only take
once a second or so, and the calculations in account_user_time
would only be done on a fairly infrequent basis. That has the
potential to reduce overhead by a lot.

If nobody has any objections to that kind of change, I would be
happy to implement it.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar mi...@kernel.org wrote:

 * Rik van Riel r...@redhat.com wrote:

  I.e. what's the baseline we are talking about?

 It's an astounding difference. This is not a kernel without
 nohz_full, just a CPU without nohz_full running the same kernel I
 tested with yesterday:

   run timesystem time
 vanilla   5.49s   2.08s
 __acct patch  5.21s   1.92s
 both patches  4.88s   1.71s
 CPU w/o nohz  3.12s   1.63s-- your numbers, mostly

 What is even more interesting is that the majority of the time
 difference seems to come from _user_ time, which has gone down from
 around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
 the CPU without nohz_full enabled...

 At syscall entry time, the nohz_full context tracking code is very
 straightforward. We check thread_info-flags 
 _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
 handles USER - KERNEL context transition.

 Syscall exit time is a convoluted mess. Both do_notify_resume and
 syscall_trace_leave call exit_user() on entry and enter_user() on
 exit, leaving the time spent looping around between int_with_check
 and syscall_return: in entry_64.S accounted as user time.

 I sent an email about this last night, it may be useful to add a
 third test  function call point to the syscall return code, where
 we can call user_enter() just ONCE, and remove the other context
 tracking calls from that loop.

 So what I'm wondering about is the big picture:

  - This is crazy big overhead in something as fundamental as system
calls!

  - We don't even have the excuse of the syscall auditing code, which
kind of has to run for every syscall if it wants to do its job!

  - [ The 'precise vtime' stuff that is driven from syscall entry/exit
  is crazy, and I hope not enabled in any distro. ]

  - So why are we doing this in every syscall time at all?

 Basically the whole point of user-context tracking is to be able to
 flush pending RCU callbacks. But that's crazy, we can sure defer a few
 kfree()s on this CPU, even indefinitely!

 If some other CPU does a sync_rcu(), then it can very well pluck those
 callbacks from this super low latency CPU's RCU lists (with due care)
 and go and free stuff itself ... There's no need to disturb this CPU
 for that!

 If user-space does not do anything kernel-ish then there won't be any
 new RCU callbacks piled up, so it's not like it's a resource leak
 issue either.

 So what's the point? Why not remove this big source of overhead
 altogether?

The last time I asked, the impression I got was that we needed two things:

1. We can't pluck things from the RCU list without knowing whether the
CPU is in an RCU read-side critical section, and we can't know that
unless we have regular grade periods or we know that the CPU is idle.
To make the CPU detectably idle, we need to set a bit somewhere.

2. To suppress the timing tick, we need to get some timing for, um,
the scheduler?  I wasn't really sure about this one.

Could we reduce the overhead by making the IN_USER vs IN_KERNEL
indication be a single bit and, worst case, an rdtsc and maybe a
subtraction?  We could probably get away with banning full nohz on
non-invariant tsc systems.

(I do understand why it would be tricky to transition from IN_USER to
IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 On 05/01/2015 12:03 PM, Andy Lutomirski wrote:
 
  The last time I asked, the impression I got was that we needed two things:
  
  1. We can't pluck things from the RCU list without knowing whether the
  CPU is in an RCU read-side critical section, and we can't know that
  unless we have regular grade periods or we know that the CPU is idle.
  To make the CPU detectably idle, we need to set a bit somewhere.
 
 More than that. We also need a way for another CPU to identify the 
 callbacks they could run for us, without confusing them with new 
 callbacks queued after we transitioned back from USER to KERNEL 
 context.

That's easy: a simple, constant-work IPI sent to nohz-full CPUs could 
just dequeue these callbacks, if it sees that we are still in 
user-mode.

It's super easy because such an IPI would essentially be equivalent to 
a system call context from RCU's POV, if it uses this test:

if (user_mode(regs)) {
... pluck RCU callbacks ...
}

That way we can have lots of overhead removed from the syscall call 
path ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

  I can understand people running hard-RT workloads not wanting to 
  see the overhead of a timer tick or a scheduler tick with variable 
  (and occasionally heavy) work done in IRQ context, but the jitter 
  caused by a single trivial IPI with constant work should be very, 
  very low and constant.
 
 Not if the realtime workload is running inside a KVM guest.

I don't buy this:

 At that point an IPI, either on the host or in the guest, involves a 
 full VMEXIT  VMENTER cycle.

So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
usec on recent hardware, and I bet it will get better with time.

I'm not aware of any hard-RT workload that cannot take 1 usec 
latencies.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 On 05/01/2015 12:37 PM, Ingo Molnar wrote:
 
  Also note that this bit in context_tracking_enter():
  
  if (state == CONTEXT_USER) {
  trace_user_enter(0);
  vtime_user_enter(current);
  }
  
  
  is related to precise time measurement of user/kernel execution 
  times, it's not needed by the scheduler at all, it's just exported 
  to tooling. It's not fundamental to the scheduler.
 
 Any objections to the idea from the other thread to simply keep the 
 time accumulating in buffers in local_clock() units, and only update 
 the task vtime once a second or so?

So I really think per syscall overhead is the wrong thing to do for 
anything that a common distro enables.

I see very little use for such precise, high-freq measurements on 
normal systems - and abnormal systems could enable it dynamically just 
like they can enable syscall auditing.

I.e. I don't mind the occasional crazy piece of code, as long as it 
does not hurt the innocent.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:45 PM, Ingo Molnar wrote:
 
 * Rik van Riel r...@redhat.com wrote:
 
 On 05/01/2015 12:37 PM, Ingo Molnar wrote:

 Also note that this bit in context_tracking_enter():

 if (state == CONTEXT_USER) {
 trace_user_enter(0);
 vtime_user_enter(current);
 }


 is related to precise time measurement of user/kernel execution 
 times, it's not needed by the scheduler at all, it's just exported 
 to tooling. It's not fundamental to the scheduler.

 Any objections to the idea from the other thread to simply keep the 
 time accumulating in buffers in local_clock() units, and only update 
 the task vtime once a second or so?
 
 So I really think per syscall overhead is the wrong thing to do for 
 anything that a common distro enables.
 
 I see very little use for such precise, high-freq measurements on 
 normal systems - and abnormal systems could enable it dynamically just 
 like they can enable syscall auditing.
 
 I.e. I don't mind the occasional crazy piece of code, as long as it 
 does not hurt the innocent.

Then how should/could we keep a rough idea of user / system / guest
time when running without a periodic timer tick?

These statistics are useful, and gathering them in a low impact
way (local_clock read + variable read + subtraction + addition +
store) may be acceptable overhead.

This is a few function calls, one tsc read, a possible cache
line miss, and an unlikely branch.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 On 05/01/2015 12:45 PM, Ingo Molnar wrote:
  
  * Rik van Riel r...@redhat.com wrote:
  
  On 05/01/2015 12:37 PM, Ingo Molnar wrote:
 
  Also note that this bit in context_tracking_enter():
 
  if (state == CONTEXT_USER) {
  trace_user_enter(0);
  vtime_user_enter(current);
  }
 
 
  is related to precise time measurement of user/kernel execution 
  times, it's not needed by the scheduler at all, it's just exported 
  to tooling. It's not fundamental to the scheduler.
 
  Any objections to the idea from the other thread to simply keep the 
  time accumulating in buffers in local_clock() units, and only update 
  the task vtime once a second or so?
  
  So I really think per syscall overhead is the wrong thing to do for 
  anything that a common distro enables.
  
  I see very little use for such precise, high-freq measurements on 
  normal systems - and abnormal systems could enable it dynamically just 
  like they can enable syscall auditing.
  
  I.e. I don't mind the occasional crazy piece of code, as long as it 
  does not hurt the innocent.
 
 Then how should/could we keep a rough idea of user / system / guest 
 time when running without a periodic timer tick?

So I'd split the timer tick into two parts: just the constant-work 
sampling bit that doesn't do much, and the variable-work part which 
gets essentially shut off when the timeout is far into the future.

Then we could do IRQ driven sampling without introducing variable 
amount jitter into hard-RT execution time.

I.e. much of what we do today, except that we could skip variable work 
such as the scheduler tick or (unforced) RCU processing like the RCU 
softirq work.
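As a rough sketch of that split (all names here are made up; this
only illustrates the constant-work vs. variable-work separation):

struct tick_sample {
        unsigned long long when;        /* timestamp of the sample */
        int user;                       /* was the CPU in user mode? */
};

static struct tick_sample last_sample;

/* Constant-work part: bounded bookkeeping, tolerable on RT CPUs. */
static void sampling_tick(unsigned long long now, int user_mode)
{
        last_sample.when = now;
        last_sample.user = user_mode;
}

/* Variable-work part: skipped when the next timeout is far away. */
static void full_tick(unsigned long long now, int user_mode)
{
        sampling_tick(now, user_mode);
        /* scheduler tick, (unforced) RCU processing, softirqs, ... */
}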

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 AM, Ingo Molnar wrote:

>> This patch builds on top of these patches by Paolo:
>> https://lkml.org/lkml/2015/4/28/188
>> https://lkml.org/lkml/2015/4/29/139
>>
>> Together with this patch I posted earlier this week, the syscall path
>> on a nohz_full cpu seems to be about 10% faster.
>> https://lkml.org/lkml/2015/4/24/394
>>
>> My test is a simple microbenchmark that calls getpriority() in a loop
>> 10 million times:
>>
>>                 run time    system time
>> vanilla         5.49s       2.08s
>> __acct patch    5.21s       1.92s
>> both patches    4.88s       1.71s
> 
> Just curious, what are the numbers if you don't have context tracking 
> enabled, i.e. without nohz_full?
> 
> I.e. what's the baseline we are talking about?

It's an astounding difference. This is not a kernel without nohz_full,
just a CPU without nohz_full running the same kernel I tested with
yesterday:

                run time    system time
vanilla         5.49s       2.08s
__acct patch    5.21s       1.92s
both patches    4.88s       1.71s
CPU w/o nohz    3.12s       1.63s   <-- your numbers, mostly

What is even more interesting is that the majority of the time
difference seems to come from _user_ time, which has gone down
from around 3.4 seconds in the vanilla kernel to around 1.5 seconds
on the CPU without nohz_full enabled...

At syscall entry time, the nohz_full context tracking code is very
straightforward. We check thread_info->flags & _TIF_WORK_SYSCALL_ENTRY,
and call syscall_trace_enter_phase1, which handles the USER -> KERNEL
context transition.

Syscall exit time is a convoluted mess. Both do_notify_resume and
syscall_trace_leave call user_exit() on entry and user_enter()
on exit, leaving the time spent looping around between int_with_check
and syscall_return: in entry_64.S accounted as user time.

I sent an email about this last night; it may be useful to add a
third test & function call point to the syscall return code, where
we can call user_enter() just ONCE, and remove the other context
tracking calls from that loop.
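For reference, a minimal userspace sketch of such a microbenchmark;
the loop count matches the description above, while the use of
getrusage() for timing is an assumption:

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
        struct rusage start, end;

        getrusage(RUSAGE_SELF, &start);

        /* 10 million cheap, read-only syscalls */
        for (int i = 0; i < 10 * 1000 * 1000; i++)
                getpriority(PRIO_PROCESS, 0);

        getrusage(RUSAGE_SELF, &end);

        double sys = (end.ru_stime.tv_sec - start.ru_stime.tv_sec) +
                     (end.ru_stime.tv_usec - start.ru_stime.tv_usec) / 1e6;
        printf("system time: %.2fs\n", sys);
        return 0;
}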

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 12:11 PM, Rik van Riel r...@redhat.com wrote:
> On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>>
>> Or we could do that in the syscall path with a single store of a
>> constant flag to a location in the task struct. We have a number of
>> natural flags that get written on syscall entry, such as:
>>
>>         pushq_cfi $__USER_DS    /* pt_regs->ss */
>>
>> That goes to a constant location on the kernel stack. On return from
>> system calls we could write 0 to that location.

Huh?  IRET with zero there will fault, and we can't guarantee that all
syscalls return using sysret.  Also, we'd have to audit all the entries,
and system_call_after_swapgs currently enables interrupts early enough
that an interrupt before all the pushes will do unpredictable things to
pt_regs.

We could further abuse orig_ax, but that would require a lot of
auditing.  Honestly, though, I think keeping a flag in an
otherwise-hot cache line is a better bet.  Also, making it per-cpu
instead of per-task will probably be easier on the RCU code, since
otherwise the RCU code will have to do some kind of synchronization
(even if it's a wait-free probe of the rq lock or similar) to figure
out which pt_regs to look at.

If we went that route, I'd advocate sticking the flag in tss->sp1.
That cacheline is unconditionally read on kernel entry already, and
sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
lost track). [1]

Alternatively, I'd suggest that we actually add a whole new word to pt_regs.

[1] It's not unconditionally accessed yet, but it will be once Denys'
latest patches are in.

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

>> I.e. much of what we do today, except that we could skip variable 
>> work such as the scheduler tick or (unforced) RCU processing like 
>> the RCU softirq work.
> 
> Any ideas how we could avoid that sampling timer interrupt latency 
> stacking up when dealing with both guest and host?

Well, it would be host_latency+guest_latency worst-case, right?

That should still be bounded, as long as both guest and host are 
running in nohz-full mode.

Also, technically when the host gets such an IRQ, the guest is 
interrupted as well. So the host could do the guest's work and pass 
through the result via paravirt or so.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> 
> Or we could do that in the syscall path with a single store of a 
> constant flag to a location in the task struct. We have a number of 
> natural flags that get written on syscall entry, such as:
> 
>         pushq_cfi $__USER_DS    /* pt_regs->ss */
> 
> That goes to a constant location on the kernel stack. On return from 
> system calls we could write 0 to that location.
> 
> So the remote CPU would have to do a read of this location. There are 
> two cases:
> 
>  - If it's 0, then it has observed a quiescent state on that CPU. (It 
>    does not have to be atomic anymore, as we'd only observe the value 
>    and MESI coherency takes care of it.)

That should do the trick.

>  - If it's not 0 then the remote CPU is not executing user-space code 
>    and we can install (remotely) a TIF_NOHZ flag in it and expect it 
>    to process it either on return to user-space or on a context 
>    switch.

I may have to think about this a little more, but it seems like it
should work.

Can we use a separate byte in the flags word for flags that can get
set remotely, so we can do stores and clearing of local-only flags
without atomic instructions?
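As a rough sketch of that layout, with C11 atomics standing in for
kernel primitives (purely illustrative, not the actual thread_info
layout):

#include <stdatomic.h>

struct ti_flags {
        unsigned char local;              /* written only by the owning CPU */
        _Atomic unsigned char remote;     /* other CPUs may set bits here */
};

static inline void set_local_flag(struct ti_flags *f, unsigned char bit)
{
        f->local |= bit;                  /* plain RMW, no lock prefix */
}

static inline void set_remote_flag(struct ti_flags *f, unsigned char bit)
{
        atomic_fetch_or(&f->remote, bit); /* atomic RMW required */
}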

> This way, unless I'm missing something, the overhead is reduced to a 
> single store to a hot cacheline on return-to-userspace - an 
> instruction which, if we place it well, might as well be close to 
> zero cost. No syscall entry cost. Slow-return cost only in the 
> (rare) case of someone using synchronize_rcu().

I think that should take care of the RCU aspect of
nohz_full.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
>> 
>> * Rik van Riel r...@redhat.com wrote:
>> 
>>>> I can understand people running hard-RT workloads not wanting to 
>>>> see the overhead of a timer tick or a scheduler tick with variable 
>>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>>> caused by a single trivial IPI with constant work should be very, 
>>>> very low and constant.
>>>
>>> Not if the realtime workload is running inside a KVM guest.
>> 
>> I don't buy this:
>> 
>>> At that point an IPI, either on the host or in the guest, involves a 
>>> full VMEXIT & VMENTER cycle.
>> 
>> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
>> usec on recent hardware, and I bet it will get better with time.
>> 
>> I'm not aware of any hard-RT workload that cannot take 1 usec 
>> latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, but it's about the basic 
synchronization guarantee of synchronize_rcu(): that all RCU read-side 
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

        pushq_cfi $__USER_DS    /* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are 
two cases:

 - If it's 0, then it has observed a quiescent state on that CPU. (It 
   does not have to be atomic anymore, as we'd only observe the value 
   and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.

This way, unless I'm missing something, the overhead is reduced to a 
single store to a hot cacheline on return-to-userspace - an 
instruction which, if we place it well, might as well be close to 
zero cost. No syscall entry cost. Slow-return cost only in the 
(rare) case of someone using synchronize_rcu().
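A sketch of the remote observer's two cases, with C11 atomics
standing in for the kernel's accessors and all names made up:

#include <stdatomic.h>
#include <stdbool.h>

#define TIF_NOHZ (1UL << 0)              /* illustrative bit value */

struct remote_task {
        atomic_ulong saved_ss;           /* __USER_DS in kernel, 0 in user */
        atomic_ulong tif_flags;          /* flags word, remotely settable */
};

static bool observed_quiescent(struct remote_task *t)
{
        /* Case 1: 0 means the CPU is back in user space, i.e. it has
         * passed a quiescent state; a plain load is enough, cache
         * coherency does the rest. */
        if (atomic_load_explicit(&t->saved_ss, memory_order_acquire) == 0)
                return true;

        /* Case 2: still in the kernel; set TIF_NOHZ remotely and let
         * it report on return to user-space or at a context switch. */
        atomic_fetch_or(&t->tif_flags, TIF_NOHZ);
        return false;
}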

Hm?

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> 
> * Rik van Riel r...@redhat.com wrote:
> 
>>> I can understand people running hard-RT workloads not wanting to 
>>> see the overhead of a timer tick or a scheduler tick with variable 
>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>> caused by a single trivial IPI with constant work should be very, 
>>> very low and constant.
>>
>> Not if the realtime workload is running inside a KVM guest.
> 
> I don't buy this:
> 
>> At that point an IPI, either on the host or in the guest, involves a 
>> full VMEXIT & VMENTER cycle.
> 
> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> usec on recent hardware, and I bet it will get better with time.
> 
> I'm not aware of any hard-RT workload that cannot take 1 usec 
> latencies.

Now think about doing this kind of IPI from inside a guest,
to another VCPU on the same guest.

Now you are looking at VMEXIT/VMENTER on the first VCPU,
plus the cost of the IPI on the host, plus the cost of
the emulation layer, plus VMEXIT/VMENTER on the second
VCPU to trigger the IPI work, and possibly a second
VMEXIT/VMENTER for IPI completion.

I suspect it would be better to do RCU callback offload
in some other way.

-- 
All rights reversed

