Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > on return from idle we'd do:
> >
> > rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */
> >
> > etc. On all boundary transitions we can use a constant ADD on a 
> > suitable percpu variable.
> 
> Sounds good to me, except that we need to be careful to distinguish 
> between non-syscall entries from quiescent states
> and non-syscall entries from quiescent states.

It might be hard to make that distinction! ;-)

I suspect you wanted to raise the issue of various contexts nesting on 
each other, such as syscall triggering a page fault, which gets an irq 
nested, etc. - versus non-nested contexts such as user-space 
triggering a page fault or user-space getting an irq?

> [...]  We could save the old state (as the current exception_enter 
> code does) or we could allocate enough low bits for the state that 
> the problem goes away.

So I think, if it all works in practice just as well as it does in 
email, we might be better off with more state bits: that would tell 
any remote statistics/sampling code more as well.

It might also be easier to patch in/out, because this kind of state 
tracking will affect non-RT CPUs as well. (Later on we could do a 
separate IDT for RT CPUs as well, with a patched version of the entry 
code.)
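
As a concrete illustration of the constant-ADD scheme quoted above, 
here is a minimal sketch - assuming IN_KERNEL=0, IN_USER=1, IN_IDLE=2 
live in the low two bits and the QS count starts at bit 2; the names 
are made up, this is not actual patch code:

/* Low two bits: mode. Bits 2 and up: quiescent state count. */
#define RCU_QS_MODE_MASK	3UL
#define RCU_QS_CTR_STEP		4UL	/* one QS == += 4 */

#define IN_KERNEL		0UL
#define IN_USER			1UL
#define IN_IDLE			2UL

DEFINE_PER_CPU(unsigned long, rcu_qs_ctr);

/*
 * Return-from-idle boundary: register a QS (+4), clear the idle
 * bit (-2), set the user bit (+1) - a single constant ADD:
 */
static inline void rcu_qs_idle_to_user(void)
{
	__this_cpu_add(rcu_qs_ctr, RCU_QS_CTR_STEP - IN_IDLE + IN_USER);
}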

> I don't think the TIF_RCU_QS variant is worthwhile -- merging the 
> counter and state is probably both easier and faster.

Yeah.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Andy Lutomirski
On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar  wrote:
>
> * Andy Lutomirski  wrote:
>
>> On May 8, 2015 12:07 PM, "Ingo Molnar"  wrote:
>> >
>> >
>> > * Andy Lutomirski  wrote:
>> >
>> > > > So do you mean:
>> > > >
>> > > >this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >...
>> > > >this_cpu_inc(rcu_qs_ctr);
>> > > >this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > > > ?
>> > > >
>> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
>> > > > it just two simple stores into a byte flag, one on entry and one on
>> > > > exit:
>> > > >
>> > > >this_cpu_set(rcu_state) = IN_KERNEL;
>> > > >...
>> > > >this_cpu_set(rcu_state) = IN_USER;
>> > > >
>> > >
>> > > I was thinking that either a counter or a state flag could make sense.
>> > > Doing both would be pointless.  The counter could use the low bits to
>> > > indicate the state.  The benefit of the counter would be that the
>> > > RCU-waiting CPU could observe that the counter has incremented and
>> > > that therefore a grace period has elapsed.  Getting it right would
>> > > require lots of care.
>> >
>> > So if you mean:
>> >
>> >    <syscall entry>
>> >    ...
>> >    this_cpu_inc(rcu_qs_ctr);
>> >    <syscall exit>
>> >
>> > I don't see how this would work reliably: how do you handle the case
>> > of a SCHED_FIFO task never returning from user-space (common technique
>> > in RT apps)? synchronize_rcu() would block indefinitely as it would
>> > never see rcu_qs_ctr increase.
>> >
>> > We have to be able to observe user-mode anyway, for system-time
>> > statistics purposes, and that flag could IMHO also drive the RCU GC
>> > machinery.
>>
>> The counter would have to be fancier than that to work.  We could
>> say that an even value means user mode, for example.  IOW the high
>> bits of the counter would count transitions to quiescent and the low
>> bits would indicate the state.
>>
>> Actually, I'd count backwards.  We could use an andl instruction to
>> go to user mode and a decl to go to kernel mode.
>
> at which point it's not really a monotonic quiescent state counter -
> which your naming suggested before.
>
> Also it would have to be done both at entry and at exit.
>
> But yes, if you replace a state flag with a recursive state flag that
> is INC/DEC maintained that would work as well. That's not a counter
> though.

I won't quibble about the name :)

>
>> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
>> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
>> > > because stores can be delayed.  (It could even happen after
>> > > sysret, IIRC, but IRET is serializing.)
>> >
>> > All it has to do in the synchronize_rcu() slowpath is something
>> > like:
>>
>> I don't think this works.  Suppose rt_cpu starts in kernel mode.
>> Then it checks ti flags.
>>
>> >
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> > smp_mb__before_atomic();
>> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> > smp_rmb();
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>>
>> Now it sets rcu_state and exits to user mode.  It never notices
>> TIF_RCU_NOTIFY.
>
> Indeed, you are right, that's racy.
>
>> [...]  We could fix this by sending an IPI to kick it.
>
> With an IPI we won't need any flags or counters on the RT CPU - it's
> the IPI we are trying to avoid.
>
> So how about the following way to fix the race: simply do a poll loop
> with a fixed timeout sleep: like it would do anyway if
> synchronize_rcu() was waiting for the timer irq to end the grace
> period on the RT-CPU.

Seems reasonable to me.

>
> The TIF flag would cause an RCU grace period to elapse, no need to wake 
> up the synchronize_rcu() side: it will notice the (regular) increased
> RCU counter.
>
>> > We are talking about a dozen cycles, while a typical
>> > synchronize_rcu() will wait millions (sometimes billions) of
>> > cycles. There's absolutely zero performance concern here and it's
>> > all CPU local in any case.
>>
>> The barrier would affect all syscalls, though. [...]
>
> What barrier? I never suggested any barrier in the syscall fast path,
> this bit:

Oh, I misunderstood.

>
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
>> > smp_mb__before_atomic();
>> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
>> > smp_rmb();
>> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
>
> runs inside synchronize_rcu() or so.
>
> But none of that is needed if we do:
>
> - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context
>   entry/exit time, straight in the assembly
>
> - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does
>   nothing but: this_cpu_inc(rcu_qs_ctr).
>
> - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and
>   then does a timeout-poll-loop with jiffies granular
>   timeouts, simulating a timer IRQ in essence. It will either 
>   observe IN_USER or will see the regular RCU qs counter 
>   increase.

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On May 8, 2015 12:07 PM, "Ingo Molnar"  wrote:
> >
> >
> > * Andy Lutomirski  wrote:
> >
> > > > So do you mean:
> > > >
> > > >this_cpu_set(rcu_state) = IN_KERNEL;
> > > >...
> > > >this_cpu_inc(rcu_qs_ctr);
> > > >this_cpu_set(rcu_state) = IN_USER;
> > > >
> > > > ?
> > > >
> > > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > > it just two simple stores into a byte flag, one on entry and one on
> > > > exit:
> > > >
> > > >this_cpu_set(rcu_state) = IN_KERNEL;
> > > >...
> > > >this_cpu_set(rcu_state) = IN_USER;
> > > >
> > >
> > > I was thinking that either a counter or a state flag could make sense.
> > > Doing both would be pointless.  The counter could use the low bits to
> > > indicate the state.  The benefit of the counter would be that the
> > > RCU-waiting CPU could observe that the counter has incremented and
> > > that therefore a grace period has elapsed.  Getting it right would
> > > require lots of care.
> >
> > So if you mean:
> >
> >    <syscall entry>
> >    ...
> >    this_cpu_inc(rcu_qs_ctr);
> >    <syscall exit>
> >
> > I don't see how this would work reliably: how do you handle the case
> > of a SCHED_FIFO task never returning from user-space (common technique
> > in RT apps)? synchronize_rcu() would block indefinitely as it would
> > never see rcu_qs_ctr increase.
> >
> > We have to be able to observe user-mode anyway, for system-time
> > statistics purposes, and that flag could IMHO also drive the RCU GC
> > machinery.
> 
> The counter would have to be fancier than that to work.  We could 
> say that an even value means user mode, for example.  IOW the high 
> bits of the counter would count transitions to quiescent and the low 
> bits would indicate the state.
> 
> Actually, I'd count backwards.  We could use an andl instruction to 
> go to user mode and a decl to go to kernel mode.

at which point it's not really a monotonic quiescent state counter - 
which your naming suggested before.

Also it would have to be done both at entry and at exit.

But yes, if you replace a state flag with a recursive state flag that 
is INC/DEC maintained that would work as well. That's not a counter 
though.

> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> > > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> > > because stores can be delayed.  (It could even happen after 
> > > sysret, IIRC, but IRET is serializing.)
> >
> > All it has to do in the synchronize_rcu() slowpath is something 
> > like:
> 
> I don't think this works.  Suppose rt_cpu starts in kernel mode.  
> Then it checks ti flags.
>
> >
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> > smp_mb__before_atomic();
> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> > smp_rmb();
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
> 
> Now it sets rcu_state and exits to user mode.  It never notices
> TIF_RCU_NOTIFY.

Indeed, you are right, that's racy.

> [...]  We could fix this by sending an IPI to kick it.

With an IPI we won't need any flags or counters on the RT CPU - it's 
the IPI we are trying to avoid.

So how about the following way to fix the race: simply do a poll loop 
with a fixed timeout sleep: like it would do anyway if 
synchronize_rcu() was waiting for the timer irq to end the grace 
period on the RT-CPU.

The TIF flag would cause an RCU grace period to elapse, no need to wake 
up the synchronize_rcu() side: it will notice the (regular) increased 
RCU counter.

> > We are talking about a dozen cycles, while a typical 
> > synchronize_rcu() will wait millions (sometimes billions) of 
> > cycles. There's absolutely zero performance concern here and it's 
> > all CPU local in any case.
> 
> The barrier would affect all syscalls, though. [...]

What barrier? I never suggested any barrier in the syscall fast path, 
this bit:

> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> > smp_mb__before_atomic();
> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> > smp_rmb();
> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

runs inside synchronize_rcu() or so.

But none of that is needed if we do:

- simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context 
  entry/exit time, straight in the assembly

- TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does 
  nothing but: this_cpu_inc(rcu_qs_ctr).

- synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and 
  then does a timeout-poll-loop with jiffies granular 
  timeouts, simulating a timer IRQ in essence. It will either 
  observe IN_USER or will see the regular RCU qs counter 
  increase.

this should be race free.
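
A sketch of the synchronize_rcu() side under those assumptions 
(TIF_RCU_QS, rcu_state and rcu_qs_ctr as proposed above; the helper 
name is made up):

static void rcu_wait_for_rt_cpu(int rt_cpu, struct task_struct *rt_task)
{
	unsigned long snap = per_cpu(rcu_qs_ctr, rt_cpu);

	/* Ask the RT-CPU to register a QS on its next kernel exit: */
	set_tsk_thread_flag(rt_task, TIF_RCU_QS);

	for (;;) {
		/* Extended quiescent state observed directly: */
		if (per_cpu(rcu_state, rt_cpu) == IN_USER)
			break;

		/* ... or the regular QS counter increased meanwhile: */
		if (per_cpu(rcu_qs_ctr, rt_cpu) != snap)
			break;

		/* Jiffies granular timeout sleep, like a timer irq: */
		schedule_timeout_interruptible(1);
	}
}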

Alternatively we can even get rid of the TIF flag by merging the 
percpu counter with the percpu state. Quiescent states are recorded 
via a counter shifted by two bits:

rcu_qs_ctr += 4;

while user/kernel/idle mode is recorded in the low two bits.

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Andy Lutomirski
On May 8, 2015 12:07 PM, "Ingo Molnar"  wrote:
>
>
> * Andy Lutomirski  wrote:
>
> > > So do you mean:
> > >
> > >this_cpu_set(rcu_state) = IN_KERNEL;
> > >...
> > >this_cpu_inc(rcu_qs_ctr);
> > >this_cpu_set(rcu_state) = IN_USER;
> > >
> > > ?
> > >
> > > So in your proposal we'd have an INC and two MOVs. I think we can make
> > > it just two simple stores into a byte flag, one on entry and one on
> > > exit:
> > >
> > >this_cpu_set(rcu_state) = IN_KERNEL;
> > >...
> > >this_cpu_set(rcu_state) = IN_USER;
> > >
> >
> > I was thinking that either a counter or a state flag could make sense.
> > Doing both would be pointless.  The counter could use the low bits to
> > indicate the state.  The benefit of the counter would be that the
> > RCU-waiting CPU could observe that the counter has incremented and
> > that therefore a grace period has elapsed.  Getting it right would
> > require lots of care.
>
> So if you mean:
>
>    <syscall entry>
>    ...
>    this_cpu_inc(rcu_qs_ctr);
>    <syscall exit>
>
> I don't see how this would work reliably: how do you handle the case
> of a SCHED_FIFO task never returning from user-space (common technique
> in RT apps)? synchronize_rcu() would block indefinitely as it would
> never see rcu_qs_ctr increase.
>
> We have to be able to observe user-mode anyway, for system-time
> statistics purposes, and that flag could IMHO also drive the RCU GC
> machinery.

The counter would have to be fancier than that to work.  We could say
that an even value means user mode, for example.  IOW the high bits of
the counter would count transitions to quiescent and the low bits
would indicate the state.

Actually, I'd count backwards.  We could use an andl instruction to go
to user mode and a decl to go to kernel mode.
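
A sketch of how those two instructions could maintain such a backwards 
counter - assuming, per the above, that an even value means user mode 
(quiescent) and an odd value means kernel mode:

DEFINE_PER_CPU(unsigned long, rcu_qs_ctr) = ~1UL;	/* largest even value */

static inline void rcu_qs_enter_kernel(void)
{
	__this_cpu_dec(rcu_qs_ctr);	/* decl: even -> odd */
}

static inline void rcu_qs_enter_user(void)
{
	/* andl: clears bit 0, taking an odd value to the next lower even: */
	__this_cpu_and(rcu_qs_ctr, ~1UL);
}

Each user -> kernel -> user round trip moves the counter down by two, 
so a remote CPU that sees the value change knows a quiescent state has 
passed.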

>
> > > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > > reliably. If the remote CPU sets it, it'll be too late and we'll
> > > > still enter user mode without seeing it.  If it's just an
> > > > optimization, though, then it should be fine.
> > >
> > > Well, after setting it, the remote CPU has to re-check whether the
> > > RT CPU has entered user-mode - before it goes to wait.
> >
> > How?
> >
> > Suppose the exit path looked like:
> >
> > this_cpu_write(rcu_state, IN_USER);
> >
> > if (ti->flags & _TIF_RCU_NOTIFY) {
> > if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> > slow_notify_rcu_that_we_are_exiting();
> > }
> >
> > iret or sysret;
>
> No, it would look like this:
>
>this_cpu_write(rcu_state, IN_USER);
>iret or sysret;
>
> I.e. IN_USER is set well after all notifications are checked. No
> kernel execution happens afterwards. (No extra checks added - the
> regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)
>
> ( Same goes for idle: we just mark it IN_IDLE and move it back to
>   IN_KERNEL after the idling ends. )
>
> > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> > _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
> > because stores can be delayed.  (It could even happen after sysret,
> > IIRC, but IRET is serializing.)
>
> All it has to do in the synchronize_rcu() slowpath is something like:

I don't think this works.  Suppose rt_cpu starts in kernel mode.  Then
it checks ti flags.

>
> if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
> smp_mb__before_atomic();
> set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
> smp_rmb();
> if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)

Now it sets rcu_state and exits to user mode.  It never notices
TIF_RCU_NOTIFY.  We could fix this by sending an IPI to kick it.

> ... go wait ...
> }
> /* Cool, we observed quiescent state: */
>
> The cost of the trivial barrier is nothing compared to the 'go wait'
> cost which we will pay in 99.9% of the cases!
>
> > If we put an mfence after this_cpu_set or did an unconditional
> > test_and_clear_bit on ti->flags then this problem goes away, but
> > that would probably be slower than we'd like.
>
> We are talking about a dozen cycles, while a typical synchronize_rcu()
> will wait millions (sometimes billions) of cycles. There's absolutely
> zero performance concern here and it's all CPU local in any case.
>

The barrier would affect all syscalls, though.  Admittedly that's
still much cheaper than the current thing, but given that we can round
trip a syscall in 110 cycles, we'd take at least 10% overhead.

My preference would be to use a counter and skip the barrier.  If the
waiting CPU polled the counter, then, even if it lost this race, it
would be guaranteed to see the counter change reasonably soon as long
as no CPU implementation sits on its store buffer forever.
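
A sketch of that barrier-free polling, building on the even/odd 
counter above (even == user mode):

unsigned long snap = per_cpu(rcu_qs_ctr, rt_cpu);

/* An even snapshot already means user mode, i.e. a quiescent state: */
while ((snap & 1) && per_cpu(rcu_qs_ctr, rt_cpu) == snap)
	schedule_timeout_interruptible(1);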

> In fact a user-mode/kernel-mode flag speeds up naive implementations
> of synchronize_rcu(): because it's able to observe extended quiescent
> state immediately, without having to wait for a counter to increase
> (which was always the classic way to observe grace periods).

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker  wrote:
> > 
>  We cannot take the lock_trace(task) from irq context, and we 
>  probably do not need to anyway, since we do not care about a 
>  precise stack trace for the task.
> >>>
> >>> So one worry with this and similar approaches of statistically 
> >>> detecting user mode would be the fact that on the way out to 
> >>> user-space we don't really destroy the previous call trace - we 
> >>> just pop off the stack (non-destructively), restore RIPs and are 
> >>> gone.
> >>>
> >>> We'll need that percpu flag I suspect.
> >>
> >> Note we have the context tracking state which tells where the 
> >> current task is: user/system/guest.
> > 
> > Yes, but that overhead is what I'm suggesting we get rid of, I thought 
> > Rik was trying to find a mechanism that would be independent of that?
> 
> One thing at a time :)
> 
> I am working on the timer sampling stuff, which should be easy to 
> adapt to a different user/system/guest/irq/softirq/... tracking 
> thing, if somebody else comes up with a more efficient way to do 
> that.

So if you make the timer sampling use a percpu variable, and set that 
variable from the existing callbacks, then we could do this gradually: 
first the timer sampling uses the flag, then RCU could use it, and 
finally we could push it out to minimal assembly code.

But it's important to start out with a percpu flag to track this all.
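
A sketch of that first step - a percpu variable written from the 
existing context tracking callbacks; the hook names here are made up 
for illustration:

DEFINE_PER_CPU(int, cpu_ctx_state);	/* IN_KERNEL / IN_USER / IN_IDLE ... */

/* Wired into the existing context tracking user-enter callback: */
void ctx_state_user_enter(void)
{
	__this_cpu_write(cpu_ctx_state, IN_USER);
}

/* Wired into the existing context tracking user-exit callback: */
void ctx_state_user_exit(void)
{
	__this_cpu_write(cpu_ctx_state, IN_KERNEL);
}

The timer sampling code then just reads cpu_ctx_state of the sampled 
CPU, and RCU can reuse the same flag later.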

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > So do you mean:
> >
> >this_cpu_set(rcu_state) = IN_KERNEL;
> >...
> >this_cpu_inc(rcu_qs_ctr);
> >this_cpu_set(rcu_state) = IN_USER;
> >
> > ?
> >
> > So in your proposal we'd have an INC and two MOVs. I think we can make
> > it just two simple stores into a byte flag, one on entry and one on
> > exit:
> >
> >this_cpu_set(rcu_state) = IN_KERNEL;
> >...
> >this_cpu_set(rcu_state) = IN_USER;
> >
> 
> I was thinking that either a counter or a state flag could make sense.
> Doing both would be pointless.  The counter could use the low bits to
> indicate the state.  The benefit of the counter would be that the
> RCU-waiting CPU could observe that the counter has incremented and
> that therefore a grace period has elapsed.  Getting it right would
> require lots of care.

So if you mean:

   <syscall entry>
   ...
   this_cpu_inc(rcu_qs_ctr);
   <syscall exit>

I don't see how this would work reliably: how do you handle the case 
of a SCHED_FIFO task never returning from user-space (common technique 
in RT apps)? synchronize_rcu() would block indefinitely as it would 
never see rcu_qs_ctr increase.

We have to be able to observe user-mode anyway, for system-time 
statistics purposes, and that flag could IMHO also drive the RCU GC 
machinery.

> > > The problem is that I don't see how TIF_RCU_THINGY can work 
> > > reliably. If the remote CPU sets it, it'll be too late and we'll 
> > > still enter user mode without seeing it.  If it's just an 
> > > optimization, though, then it should be fine.
> >
> > Well, after setting it, the remote CPU has to re-check whether the 
> > RT CPU has entered user-mode - before it goes to wait.
> 
> How?
> 
> Suppose the exit path looked like:
> 
> this_cpu_write(rcu_state, IN_USER);
> 
> if (ti->flags & _TIF_RCU_NOTIFY) {
> if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
> slow_notify_rcu_that_we_are_exiting();
> }
> 
> iret or sysret;

No, it would look like this:

   this_cpu_write(rcu_state, IN_USER);
   iret or sysret;

I.e. IN_USER is set well after all notifications are checked. No 
kernel execution happens afterwards. (No extra checks added - the 
regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)

( Same goes for idle: we just mark it IN_IDLE and move it back to 
  IN_KERNEL after the idling ends. )

> The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets 
> _TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET 
> because stores can be delayed.  (It could even happen after sysret, 
> IIRC, but IRET is serializing.)

All it has to do in the synchronize_rcu() slowpath is something like:

if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
smp_mb__before_atomic();
set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
smp_rmb();
if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
... go wait ...
}
/* Cool, we observed quiescent state: */

The cost of the trivial barrier is nothing compared to the 'go wait' 
cost which we will pay in 99.9% of the cases!

> If we put an mfence after this_cpu_set or did an unconditional 
> test_and_clear_bit on ti->flags then this problem goes away, but 
> that would probably be slower than we'd like.

We are talking about a dozen cycles, while a typical synchronize_rcu() 
will wait millions (sometimes billions) of cycles. There's absolutely 
zero performance concern here and it's all CPU local in any case.

In fact a user-mode/kernel-mode flag speeds up naive implementations 
of synchronize_rcu(): because it's able to observe extended quiescent 
state immediately, without having to wait for a counter to increase 
(which was always the classic way to observe grace periods).

If all CPUs are in user mode or are idle (which is rather common!) 
then synchronize_rcu() could return almost immediately - while 
previously it had to wait for scheduling or periodic timer irqs to 
trigger on all CPUs - adding many millisecs of delay even in the best 
of cases.
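
A sketch of that fast path, assuming the percpu rcu_state flag above 
(the helper name is made up):

static bool rcu_all_cpus_quiescent(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		int state = per_cpu(rcu_state, cpu);

		/* User mode and idle are extended quiescent states: */
		if (state != IN_USER && state != IN_IDLE)
			return false;
	}
	return true;
}

If this returns true, synchronize_rcu() has observed every CPU in an 
extended quiescent state and can finish without further waiting.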

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-08 Thread Paul E. McKenney
On Thu, May 07, 2015 at 02:49:39PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() 
> > (being executed on some other CPU not doing RT work) to 
> > intelligently wait for the remote (RT work doing) CPU to finish 
> > executing kernel code, without polling or so.
> 
> it's basically a cheap IPI being inserted on the remote CPU.
> 
> We need the TIF_RCU_QS callback not just to wait intelligently, but 
> mainly to elapse a grace period, otherwise synchronize_rcu() might not 
> ever make progress: think a SCHED_FIFO task doing some kernel work, 
> synchronize_rcu() stumbling upon it - but the SCHED_FIFO task 
> otherwise never scheduling and never getting any timer irqs either, 
> and thus never entering quiescent state.
> 
> (Cc:-ed Paul too, he might be interested in this as well.)

Hmmm...  So the point is that a NO_HZ_FULL CPU periodically posts
callbacks to indicate that it has passed through a quiescent state,
for example, upon entry to and/or exit from userspace?  These callbacks
would then be offloaded to some other CPU.

But the callback would not be invoked until RCU saw a grace period,
so I must be missing something here...  Probably that the TIF_RCU_QS
callback is not an RCU callback, but something else?

Thanx, Paul

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On May 7, 2015 8:38 PM, "Ingo Molnar"  wrote:
>
>
> * Andy Lutomirski  wrote:
>
> > I think one or both of us is missing something or we're just talking
> > about different things.
>
> That's very much possible!
>
> I think part of the problem is that I called the 'remote CPU' the RT
> CPU, while you seem to be calling it the CPU that does the
> synchronize_rcu().
>
> So let's start again, calling the CPU doing the synchronize_rcu() the
> 'remote CPU', and the one doing the RT workload the 'RT CPU':
>
> > If the nohz/RT cpu is about to enter user mode and stay there for a
> > long time, it does:
> >
> >   this_cpu_inc(rcu_qs_ctr);
> >
> > or whatever.  Maybe we add:
> >
> >   this_cpu_set(rcu_state) = IN_USER;
> >
> > or however it's spelled.
> >
> > The remote CPU wants to check our state.  If this happens just
> > before the IN_USER write or rcu_qs_ctr increment becomes visible,
> > then it'll think we're in kernel mode.  Now it either has to poll
> > (which is fine) or try to get us to tell the RCU core when we become
> > quiescent by setting TIF_RCU_THINGY.
>
> So do you mean:
>
>this_cpu_set(rcu_state) = IN_KERNEL;
>...
>this_cpu_inc(rcu_qs_ctr);
>this_cpu_set(rcu_state) = IN_USER;
>
> ?
>
> So in your proposal we'd have an INC and two MOVs. I think we can make
> it just two simple stores into a byte flag, one on entry and one on
> exit:
>
>this_cpu_set(rcu_state) = IN_KERNEL;
>...
>this_cpu_set(rcu_state) = IN_USER;
>

I was thinking that either a counter or a state flag could make sense.
Doing both would be pointless.  The counter could use the low bits to
indicate the state.  The benefit of the counter would be that the
RCU-waiting CPU could observe that the counter has incremented and
that therefore a grace period has elapsed.  Getting it right would
require lots of care.

> plus the rare but magic TIF_RCU_THINGY that tells a waiting
> synchronize_rcu() about the next quiescent state.
>
> > The problem is that I don't see how TIF_RCU_THINGY can work
> > reliably. If the remote CPU sets it, it'll be too late and we'll
> > still enter user mode without seeing it.  If it's just an
> > optimization, though, then it should be fine.
>
> Well, after setting it, the remote CPU has to re-check whether the RT
> CPU has entered user-mode - before it goes to wait.

How?

Suppose the exit path looked like:

this_cpu_write(rcu_state, IN_USER);

if (ti->flags & _TIF_RCU_NOTIFY) {
if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
slow_notify_rcu_that_we_are_exiting();
}

iret or sysret;

The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
_TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
because stores can be delayed.  (It could even happen after sysret,
IIRC, but IRET is serializing.)

If we put an mfence after this_cpu_set or did an unconditional
test_and_clear_bit on ti->flags then this problem goes away, but that
would probably be slower than we'd like.
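
For reference, a sketch combining the two heavier variants just 
mentioned (smp_mb() is an mfence on x86; this mirrors the exit path 
sketched above):

this_cpu_write(rcu_state, IN_USER);
smp_mb();	/* order the rcu_state store before the ti->flags read */
if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))	/* unconditional atomic */
	slow_notify_rcu_that_we_are_exiting();

iret or sysret;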

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Rik van Riel
On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> 
> * Frederic Weisbecker  wrote:
> 
 We cannot take the lock_trace(task) from irq context, and we 
 probably do not need to anyway, since we do not care about a 
 precise stack trace for the task.
>>>
>>> So one worry with this and similar approaches of statistically 
>>> detecting user mode would be the fact that on the way out to 
>>> user-space we don't really destroy the previous call trace - we 
>>> just pop off the stack (non-destructively), restore RIPs and are 
>>> gone.
>>>
>>> We'll need that percpu flag I suspect.
>>
>> Note we have the context tracking state which tells where the 
>> current task is: user/system/guest.
> 
> Yes, but that overhead is what I'm suggesting we get rid of, I thought 
> Rik was trying to find a mechanism that would be independent of that?

One thing at a time :)

I am working on the timer sampling stuff, which should be easy
to adapt to a different user/system/guest/irq/softirq/... tracking
thing, if somebody else comes up with a more efficient way to do
that.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> I think one or both of us is missing something or we're just talking 
> about different things.

That's very much possible!

I think part of the problem is that I called the 'remote CPU' the RT 
CPU, while you seem to be calling it the CPU that does the 
synchronize_rcu().

So let's start again, calling the CPU doing the synchronize_rcu() the 
'remote CPU', and the one doing the RT workload the 'RT CPU':

> If the nohz/RT cpu is about to enter user mode and stay there for a 
> long time, it does:
> 
>   this_cpu_inc(rcu_qs_ctr);
> 
> or whatever.  Maybe we add:
> 
>   this_cpu_set(rcu_state) = IN_USER;
> 
> or however it's spelled.
> 
> The remote CPU wants to check our state.  If this happens just 
> before the IN_USER write or rcu_qs_ctr increment becomes visible, 
> then it'll think we're in kernel mode.  Now it either has to poll 
> (which is fine) or try to get us to tell the RCU core when we become 
> quiescent by setting TIF_RCU_THINGY.

So do you mean:

   this_cpu_set(rcu_state) = IN_KERNEL;
   ...
   this_cpu_inc(rcu_qs_ctr);
   this_cpu_set(rcu_state) = IN_USER;

?

So in your proposal we'd have an INC and two MOVs. I think we can make 
it just two simple stores into a byte flag, one on entry and one on 
exit:

   this_cpu_set(rcu_state) = IN_KERNEL;
   ...
   this_cpu_set(rcu_state) = IN_USER;

plus the rare but magic TIF_RCU_THINGY that tells a waiting 
synchronize_rcu() about the next quiescent state.

> The problem is that I don't see how TIF_RCU_THINGY can work 
> reliably. If the remote CPU sets it, it'll be too late and we'll 
> still enter user mode without seeing it.  If it's just an 
> optimization, though, then it should be fine.

Well, after setting it, the remote CPU has to re-check whether the RT 
CPU has entered user-mode - before it goes to wait.

If it has entered user mode then the remote CPU has observed quiescent 
state and is done. If it's still in kernel mode then it waits until 
the TIF_RCU_THINGY does its completion.

> I still feel like this is all overengineered.  Shouldn't rcu_qs_ctr 
> be enough for all of the above?

How would just INC rcu_qs_ctr (without the IN_KERNEL/IN_USER flag) 
work?

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On Thu, May 7, 2015 at 5:44 AM, Ingo Molnar  wrote:
>
> * Andy Lutomirski  wrote:
>
>> > > We cannot take the lock_trace(task) from irq context, and we
>> > > probably do not need to anyway, since we do not care about a
>> > > precise stack trace for the task.
>> >
>> > So one worry with this and similar approaches of statistically
>> > detecting user mode would be the fact that on the way out to
>> > user-space we don't really destroy the previous call trace - we
>> > just pop off the stack (non-destructively), restore RIPs and are
>> > gone.
>> >
>> > We'll need that percpu flag I suspect.
>> >
>> > And once we have the flag, we can get rid of the per syscall RCU
>> > callback as well, relatively easily: with CMPXCHG (in
>> > synchronize_rcu()!) we can reliably sample whether a CPU is in
>> > user mode right now, while the syscall entry/exit path does not
>> > use any atomics, we can just use a simple MOV.
>> >
>> > Once we observe 'user mode', then we have observed quiescent state
>> > and synchronize_rcu() can continue. If we've observed kernel mode
>> > we can frob the remote task's TIF_ flags to make it go into a
>> > quiescent state publishing routine on syscall-return.
>> >
>>
>> How does that work?
>>
>> If the exit code writes the per-cpu flag and then checks
>> TIF_whatever, we need a barrier to avoid a race where we end up in
>> user mode without seeing the flag.
>
> No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.
>
> Also, the user-mode flag would be changed from the 'kernel-mode' value
> to the 'user-mode' value after we do the regular TIF checks - just
> before we hit user-space.
>
> The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being
> executed on some other CPU not doing RT work) to intelligently wait
> for the remote (RT work doing) CPU to finish executing kernel code,
> without polling or so.
>
> And yes, the TIF_RCU_QS handler on the remote CPU would have to
> execute an SMP barrier, before it allows an RCU quiescent state to
> pass. Note that the regular RCU code for quiescent state publishing
> already has that barrier, typically something like:
>
> this_cpu_inc(rcu_qs_ctr);
>
> That in itself is enough for synchronize_rcu(), it could do a small
> timing loop with short sleeps, until it waits for rcu_qs_ctr to
> increase.

I think one or both of us is missing something or we're just talking
about different things.

If the nohz/RT cpu is about to enter user mode and stay there for a
long time, it does:

this_cpu_inc(rcu_qs_ctr);

or whatever.  Maybe we add:

this_cpu_set(rcu_state) = IN_USER;

or however it's spelled.

The remote CPU wants to check our state.  If this happens just before
the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll
think we're in kernel mode.  Now it either has to poll (which is fine)
or try to get us to tell the RCU core when we become quiescent by
setting TIF_RCU_THINGY.

The problem is that I don't see how TIF_RCU_THINGY can work reliably.
If the remote CPU sets it, it'll be too late and we'll still enter
user mode without seeing it.  If it's just an optimization, though,
then it should be fine.

I still feel like this is all overengineered.  Shouldn't rcu_qs_ctr be
enough for all of the above?

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Ingo Molnar  wrote:

> The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() 
> (being executed on some other CPU not doing RT work) to 
> intelligently wait for the remote (RT work doing) CPU to finish 
> executing kernel code, without polling or so.

it's basically a cheap IPI being inserted on the remote CPU.

We need the TIF_RCU_QS callback not just to wait intelligently, but 
mainly to elapse a grace period, otherwise synchronize_rcu() might 
never make progress: think of a SCHED_FIFO task doing some kernel 
work, synchronize_rcu() stumbling upon it - but the SCHED_FIFO task 
otherwise never scheduling and never getting any timer irqs either, 
and thus never entering a quiescent state.

(Cc:-ed Paul too, he might be interested in this as well.)

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> >
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> >
> > We'll need that percpu flag I suspect.
> >
> > And once we have the flag, we can get rid of the per syscall RCU 
> > callback as well, relatively easily: with CMPXCHG (in 
> > synchronize_rcu()!) we can reliably sample whether a CPU is in 
> > user mode right now, while the syscall entry/exit path does not 
> > use any atomics, we can just use a simple MOV.
> >
> > Once we observe 'user mode', then we have observed quiescent state 
> > and synchronize_rcu() can continue. If we've observed kernel mode 
> > we can frob the remote task's TIF_ flags to make it go into a 
> > quiescent state publishing routine on syscall-return.
> >
> 
> How does that work?
>
> If the exit code writes the per-cpu flag and then checks 
> TIF_whatever, we need a barrier to avoid a race where we end up in 
> user mode without seeing the flag.

No, the TIF_RCU_SYNC flag would be independent of the user-mode flag.

Also, the user-mode flag would be changed from the 'kernel-mode' value 
to the 'user-mode' value after we do the regular TIF checks - just 
before we hit user-space.

The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being 
executed on some other CPU not doing RT work) to intelligently wait 
for the remote (RT work doing) CPU to finish executing kernel code, 
without polling or so.

And yes, the TIF_RCU_QS handler on the remote CPU would have to 
execute an SMP barrier, before it allows an RCU quiescent state to 
pass. Note that the regular RCU code for quiescent state publishing 
already has that barrier, typically something like:

this_cpu_inc(rcu_qs_ctr);

That in itself is enough for synchronize_rcu(), it could do a small 
timing loop with short sleeps, until it waits for rcu_qs_ctr to 
increase.
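
Such a loop could be as simple as this sketch (rcu_qs_ctr standing in
for whatever per-cpu counter the quiescent state publishing bumps):

    unsigned long snap = per_cpu(rcu_qs_ctr, cpu);

    while (READ_ONCE(per_cpu(rcu_qs_ctr, cpu)) == snap)
            schedule_timeout_uninterruptible(1);    /* short sleep, re-check */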

> I think the right solution is to accept that race and just have the 
> RCU code send an IPI (or check again) if it sees too long of a 
> period elapse in kernel mode.

I don't think there's any need for an IPI.

> I think the flag should be a counter, though.  That way a workload 
> that makes lots of syscalls will be quickly detected as going 
> through quiescent states even if it's never actually observed in 
> user mode.

A flag write is easier on the CPU than an INC/DEC: only a store to a 
percpu location, no load needed, and no value dependencies. The store 
will be program-ordered, so it's a spin_unlock() work-alike.

> > The only hard requirement of this scheme from the RCU 
> > synchronization POV is that all kernel contexts that may touch RCU 
> > state need to flip this flag reliably to 'kernel mode': i.e. all 
> > irq handlers, traps, NMIs and all syscall variants need to do 
> > this.
> >
> > But once it's there, it's really neat.
> 
> We already have to do this with the current code.  I went through 
> and checked all of the IST entries a couple versions ago.

Yeah - I just mean if the primary flag is _not_ TIF driven (which is 
my suggestion, and which I thought you suggested as well) then all 
code paths need to be covered.

Things like the irq-tracking callbacks are debugging and 
instrumentation code - the user/kernel mode flag would be a hard 
requirement on RCU correctness: missing a flag update might cause the 
kernel to crash in non-obvious ways.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Frederic Weisbecker  wrote:

> > > We cannot take the lock_trace(task) from irq context, and we 
> > > probably do not need to anyway, since we do not care about a 
> > > precise stack trace for the task.
> > 
> > So one worry with this and similar approaches of statistically 
> > detecting user mode would be the fact that on the way out to 
> > user-space we don't really destroy the previous call trace - we 
> > just pop off the stack (non-destructively), restore RIPs and are 
> > gone.
> > 
> > We'll need that percpu flag I suspect.
> 
> Note we have the context tracking state which tells where the 
> current task is: user/system/guest.

Yes, but that overhead is what I'm suggesting we get rid of - I 
thought Rik was trying to find a mechanism that would be independent 
of that?

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On May 7, 2015 4:18 PM, "Ingo Molnar"  wrote:
>
>
> * Rik van Riel  wrote:
>
> > > If, on the other hand, you're just going to remotely sample the
> > > in-memory context, that sounds good.
> >
> > It's the latter.
> >
> > If you look at /proc//{stack,syscall,wchan} and other files,
> > you will see we already have ways to determine, from in memory
> > content, where a program is running at a certain point in time.
> >
> > In fact, the timer interrupt based accounting does a similar thing.
> > It has a task examine its own in-memory state to figure out what it
> > was doing before the timer interrupt happened.
> >
> > The kernel side stack pointer is probably enough to tell us whether
> > a task is active in kernel space, on an irq stack, or (maybe) in
> > user space. Not convinced about the latter, we may need to look at
> > the same state the RCU code keeps track of to see what mode a task
> > is in...
> >
> > I am looking at the code to see what locks we need to grab.
> >
> > I suspect the runqueue lock may be enough, to ensure that the task
> > struct, and stack do not go away while we are looking at them.
>
> That will be enough, especially if you get to the task reference via
> rq->curr.
>
> > We cannot take the lock_trace(task) from irq context, and we
> > probably do not need to anyway, since we do not care about a precise
> > stack trace for the task.
>
> So one worry with this and similar approaches of statistically
> detecting user mode would be the fact that on the way out to
> user-space we don't really destroy the previous call trace - we just
> pop off the stack (non-destructively), restore RIPs and are gone.
>
> We'll need that percpu flag I suspect.
>
> And once we have the flag, we can get rid of the per syscall RCU
> callback as well, relatively easily: with CMPXCHG (in
> synchronize_rcu()!) we can reliably sample whether a CPU is in user
> mode right now, while the syscall entry/exit path does not use any
> atomics, we can just use a simple MOV.
>
> Once we observe 'user mode', then we have observed quiescent state and
> synchronize_rcu() can continue. If we've observed kernel mode we can
> frob the remote task's TIF_ flags to make it go into a quiescent state
> publishing routine on syscall-return.
>

How does that work?

If the exit code writes the per-cpu flag and then checks TIF_whatever,
we need a barrier to avoid a race where we end up in user mode without
seeing the flag.

I think the right solution is to accept that race and just have the
RCU code send an IPI (or check again) if it sees too long of a period
elapse in kernel mode.

I think the flag should be a counter, though.  That way a workload
that makes lots of syscalls will be quickly detected as going through
quiescent states even if it's never actually observed in user mode.

> The only hard requirement of this scheme from the RCU synchronization
> POV is that all kernel contexts that may touch RCU state need to flip
> this flag reliably to 'kernel mode': i.e. all irq handlers, traps,
> NMIs and all syscall variants need to do this.
>
> But once it's there, it's really neat.
>

We already have to do this with the current code.  I went through and
checked all of the IST entries a couple versions ago.

I think we need to clean up the current garbage asm first, though.  See:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=d22f1dca4c7c93fdd1ce754e38d71d1961c0f9ac

(Very much unfinished, and it should probably be split up, but AFAICT
it works.  Don't hold your breath for a real version.)

--Andy

> Thanks,
>
> Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Frederic Weisbecker
On Thu, May 07, 2015 at 12:48:45PM +0200, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
> > > If, on the other hand, you're just going to remotely sample the 
> > > in-memory context, that sounds good.
> > 
> > It's the latter.
> > 
> > If you look at /proc//{stack,syscall,wchan} and other files, 
> > you will see we already have ways to determine, from in memory 
> > content, where a program is running at a certain point in time.
> > 
> > In fact, the timer interrupt based accounting does a similar thing. 
> > It has a task examine its own in-memory state to figure out what it 
> > was doing before the timer interrupt happened.
> > 
> > The kernel side stack pointer is probably enough to tell us whether 
> > a task is active in kernel space, on an irq stack, or (maybe) in 
> > user space. Not convinced about the latter, we may need to look at 
> > the same state the RCU code keeps track of to see what mode a task 
> > is in...
> > 
> > I am looking at the code to see what locks we need to grab.
> > 
> > I suspect the runqueue lock may be enough, to ensure that the task 
> > struct, and stack do not go away while we are looking at them.
> 
> That will be enough, especially if you get to the task reference via 
> rq->curr.
> 
> > We cannot take the lock_trace(task) from irq context, and we 
> > probably do not need to anyway, since we do not care about a precise 
> > stack trace for the task.
> 
> So one worry with this and similar approaches of statistically 
> detecting user mode would be the fact that on the way out to 
> user-space we don't really destroy the previous call trace - we just 
> pop off the stack (non-destructively), restore RIPs and are gone.
> 
> We'll need that percpu flag I suspect.

Note we have the context tracking state which tells where the current
task is: user/system/guest.

> 
> And once we have the flag, we can get rid of the per syscall RCU 
> callback as well, relatively easily: with CMPXCHG (in 
> synchronize_rcu()!) we can reliably sample whether a CPU is in user 
> mode right now, while the syscall entry/exit path does not use any 
> atomics, we can just use a simple MOV.
> 
> Once we observe 'user mode', then we have observed quiescent state and 
> synchronize_rcu() can continue. If we've observed kernel mode we can 
> frob the remote task's TIF_ flags to make it go into a quiescent state 
> publishing routine on syscall-return.
> 
> The only hard requirement of this scheme from the RCU synchronization 
> POV is that all kernel contexts that may touch RCU state need to flip 
> this flag reliably to 'kernel mode': i.e. all irq handlers, traps, 
> NMIs and all syscall variants need to do this.
> 
> But once it's there, it's really neat.
> 
> Thanks,
> 
>   Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Rik van Riel  wrote:

> > If, on the other hand, you're just going to remotely sample the 
> > in-memory context, that sounds good.
> 
> It's the latter.
> 
> If you look at /proc//{stack,syscall,wchan} and other files, 
> you will see we already have ways to determine, from in memory 
> content, where a program is running at a certain point in time.
> 
> In fact, the timer interrupt based accounting does a similar thing. 
> It has a task examine its own in-memory state to figure out what it 
> was doing before the timer interrupt happened.
> 
> The kernel side stack pointer is probably enough to tell us whether 
> a task is active in kernel space, on an irq stack, or (maybe) in 
> user space. Not convinced about the latter, we may need to look at 
> the same state the RCU code keeps track of to see what mode a task 
> is in...
> 
> I am looking at the code to see what locks we need to grab.
> 
> I suspect the runqueue lock may be enough, to ensure that the task 
> struct, and stack do not go away while we are looking at them.

That will be enough, especially if you get to the task reference via 
rq->curr.

> We cannot take the lock_trace(task) from irq context, and we 
> probably do not need to anyway, since we do not care about a precise 
> stack trace for the task.

So one worry with this and similar approaches of statistically 
detecting user mode would be the fact that on the way out to 
user-space we don't really destroy the previous call trace - we just 
pop off the stack (non-destructively), restore RIPs and are gone.

We'll need that percpu flag I suspect.

And once we have the flag, we can get rid of the per syscall RCU 
callback as well, relatively easily: with CMPXCHG (in 
synchronize_rcu()!) we can reliably sample whether a CPU is in user 
mode right now, while the syscall entry/exit path does not use any 
atomics, we can just use a simple MOV.

Once we observe 'user mode', then we have observed quiescent state and 
synchronize_rcu() can continue. If we've observed kernel mode we can 
frob the remote task's TIF_ flags to make it go into a quiescent state 
publishing routine on syscall-return.

The only hard requirement of this scheme from the RCU synchronization 
POV is that all kernel contexts that may touch RCU state need to flip 
this flag reliably to 'kernel mode': i.e. all irq handlers, traps, 
NMIs and all syscall variants need to do this.

But once it's there, it's really neat.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> >> [...]  Also, we'd have to audit all the entries, and 
> >> system_call_after_swapgs currently enables interrupts early 
> >> enough that an interrupt before all the pushes will do 
> >> unpredictable things to pt_regs.
> >
> > An irq hardware frame won't push zero to that selector value, 
> > right? That's the only bad thing that would confuse the code.
> >
> 
> I think it's not quite that simple.  The syscall entry looks like, 
> roughly:
> 
> fix rsp;
> sti;
> push ss;
> push rsp;
> push flags;
> push cs;
> push rip;
> 
> We can get an interrupt part-way through those pushes.  Maybe there's
> no bad place where we could get an IRQ since SS is first, [...]

... and it should be covered by the 'STI window' where the instruction 
following a STI is still part of the irqs-off section.
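
A minimal illustration of that window, following Andy's pseudo-code
above (sketch only - on x86 the instruction immediately after STI
executes before any pending interrupt is delivered):

    sti                     /* irqs on, but held off for one insn */
    pushq   $__USER_DS      /* pt_regs->ss: covered by the STI shadow */
    pushq   %rsp            /* from here on, an irq can be delivered */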

> [...] but this is still nasty.

True.

Another approach would be to set up two aliases in the GDT, so we 
could freely change 'ss' between them and thus store information, 
without possibly confusing the syscall entry/exit code.

That still gets complicated by IST entries, which creates multiple 
positions for the 'flag'.

> I think I prefer a per-cpu approach over a per-task approach, since 
> it's easier to reason about and it should still only require one 
> instruction on entry and one instruction on exit.

Yes, agreed.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Andy Lutomirski
On May 7, 2015 8:38 PM, "Ingo Molnar" <mi...@kernel.org> wrote:
>
>
> * Andy Lutomirski <l...@amacapital.net> wrote:
>
> > I think one or both of us is missing something or we're just talking
> > about different things.
>
> That's very much possible!
>
> I think part of the problem is that I called the 'remote CPU' the RT
> CPU, while you seem to be calling it the CPU that does the
> synchronize_rcu().
>
> So let's start again, calling the CPU that does the synchronize_rcu()
> the 'remote CPU', and the one doing the RT workload the 'RT CPU':
>
> > If the nohz/RT cpu is about to enter user mode and stay there for a
> > long time, it does:
> >
> >   this_cpu_inc(rcu_qs_ctr);
> >
> > or whatever.  Maybe we add:
> >
> >   this_cpu_set(rcu_state) = IN_USER;
> >
> > or however it's spelled.
> >
> > The remote CPU wants to check our state.  If this happens just
> > before the IN_USER write or rcu_qs_ctr increment becomes visible,
> > then it'll think we're in kernel mode.  Now it either has to poll
> > (which is fine) or try to get us to tell the RCU core when we become
> > quiescent by setting TIF_RCU_THINGY.
>
> So do you mean:
>
>    this_cpu_set(rcu_state) = IN_KERNEL;
>    ...
>    this_cpu_inc(rcu_qs_ctr);
>    this_cpu_set(rcu_state) = IN_USER;
>
> ?
>
> So in your proposal we'd have an INC and two MOVs. I think we can make
> it just two simple stores into a byte flag, one on entry and one on
> exit:
>
>    this_cpu_set(rcu_state) = IN_KERNEL;
>    ...
>    this_cpu_set(rcu_state) = IN_USER;
>

I was thinking that either a counter or a state flag could make sense.
Doing both would be pointless.  The counter could use the low bits to
indicate the state.  The benefit of the counter would be that the
RCU-waiting CPU could observe that the counter has incremented and
that therefore a grace period has elapsed.  Getting it right would
require lots of care.
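
One possible shape of that combined encoding - bit layout and names
invented for illustration, and the "lots of care" (ordering, wraparound,
remote snapshotting) is exactly what this sketch leaves out:

    /* Bit 0: 1 = user mode.  Bits 1 and up: quiescent state count. */
    #define RCU_IN_USER     0x1UL
    #define RCU_QS_INC      0x2UL

    DEFINE_PER_CPU(unsigned long, rcu_qs_state);

    static inline void rcu_qs_user_enter(void)
    {
            unsigned long v = __this_cpu_read(rcu_qs_state);

            /* Bump the count and flip to user state in one store. */
            __this_cpu_write(rcu_qs_state, (v + RCU_QS_INC) | RCU_IN_USER);
    }

    static inline void rcu_qs_kernel_enter(void)
    {
            __this_cpu_write(rcu_qs_state,
                             __this_cpu_read(rcu_qs_state) & ~RCU_IN_USER);
    }

A remote reader that sees either the user bit set or the count changed
since its snapshot has observed a quiescent state.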

> plus the rare but magic TIF_RCU_THINGY that tells a waiting
> synchronize_rcu() about the next quiescent state.
>
> > The problem is that I don't see how TIF_RCU_THINGY can work
> > reliably. If the remote CPU sets it, it'll be too late and we'll
> > still enter user mode without seeing it.  If it's just an
> > optimization, though, then it should be fine.
>
> Well, after setting it, the remote CPU has to re-check whether the RT
> CPU has entered user-mode - before it goes to wait.

How?

Suppose the exit path looked like:

    this_cpu_write(rcu_state, IN_USER);

    if (ti->flags & _TIF_RCU_NOTIFY) {
            if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                    slow_notify_rcu_that_we_are_exiting();
    }

    iret or sysret;

The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
_TIF_RCU_NOTIFY.  This could happen arbitrarily late before IRET
because stores can be delayed.  (It could even happen after sysret,
IIRC, but IRET is serializing.)

If we put an mfence after this_cpu_set or did an unconditional
test_and_clear_bit on ti->flags then this problem goes away, but that
would probably be slower than we'd like.
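
The race-free (but slower) variant would be something like this, with
the same hypothetical names as above - the barrier is the key:

    this_cpu_write(rcu_state, IN_USER);
    smp_mb();   /* order the store before the flag read; mfence on x86 */
    if (test_thread_flag(TIF_RCU_NOTIFY)) {
            if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
                    slow_notify_rcu_that_we_are_exiting();
    }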

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-07 Thread Rik van Riel
On 05/07/2015 08:29 AM, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweis...@gmail.com> wrote:
> 
>>>> We cannot take the lock_trace(task) from irq context, and we 
>>>> probably do not need to anyway, since we do not care about a 
>>>> precise stack trace for the task.
>>>
>>> So one worry with this and similar approaches of statistically 
>>> detecting user mode would be the fact that on the way out to 
>>> user-space we don't really destroy the previous call trace - we 
>>> just pop off the stack (non-destructively), restore RIPs and are 
>>> gone.
>>>
>>> We'll need that percpu flag I suspect.
>>
>> Note we have the context tracking state which tells where the 
>> current task is: user/system/guest.
> 
> Yes, but that overhead is what I'm suggesting we get rid of - I 
> thought Rik was trying to find a mechanism that would be independent 
> of that?

One thing at a time :)

I am working on the timer sampling stuff, which should be easy
to adapt to a different user/system/guest/irq/softirq/... tracking
thing, if somebody else comes up with a more efficient way to do
that.

-- 
All rights reversed




Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU             nohz CPU
>   ----------------------------------------
>                                   set flag = 0
>   read flag = 0
>                                   return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Actually, the "race" in this direction is fine. If flag==0, then
the nohz CPU is not accessing any RCU protected data structures,
and the synchronize_rcu CPU will not be setting TIF_NOHZ.

The race is only a concern if the synchronize_rcu CPU reads
flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after
the nohz CPU has cleared the flag (and is unable to handle RCU
stuff). An atomic compare-and-swap prevents that issue.

The other race, of the synchronize_rcu CPU reading 0, followed
by the nohz CPU going into kernel space, and setting the flag
to 1, should be fine. After all, this means the nohz_full CPU
just went into a new RCU grace period, which is just what the
synchronize_rcu CPU was waiting for.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU             nohz CPU
>   ----------------------------------------
>                                   set flag = 0
>   read flag = 0
>                                   return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

As an aside, I suspect the remote timer sampling stuff is probably
best off piggybacking on the RCU status (and PF_VCPU) of the tasks
being sampled.

That seems like the most reliable in-memory state we can examine,
since the frame pointer may be in registers only, and not in
memory to begin with.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU             nohz CPU
>   ----------------------------------------
>                                   set flag = 0
>   read flag = 0
>                                   return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

If the flag and TIF_NOHZ live in the same word, we can cmpxchg
and be guaranteed atomicity.
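
A sketch of how that combined word could work - bit numbers and names
invented for illustration:

    #define IN_KERNEL_BIT   0       /* set while in kernel mode */
    #define NOTIFY_BIT      1       /* the TIF_NOHZ-style request */

    /* synchronize_rcu() side: request a callback only if the nohz CPU
     * is still in kernel mode - one atomic decision, no lost update. */
    static bool request_nohz_notify(unsigned long *word)
    {
            unsigned long old, new;

            do {
                    old = READ_ONCE(*word);
                    if (!(old & BIT(IN_KERNEL_BIT)))
                            return false;   /* quiescent: nothing to wait for */
                    new = old | BIT(NOTIFY_BIT);
            } while (cmpxchg(word, old, new) != old);

            return true;
    }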

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Paolo Bonzini


On 02/05/2015 07:27, Ingo Molnar wrote:
> 
>   - synchronize_rcu() avoids having to send an IPI by taking a 
> peek at rq->curr's pt_regs::flag, and if:
> 
>  - the flag is 0 then it has observed a quiescent state.
> 
>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>completion from a TIF_NOHZ callback.

Isn't this racy?

synchronize_rcu CPU             nohz CPU
----------------------------------------
                                set flag = 0
read flag = 0
                                return to userspace
set TIF_NOHZ

and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-04 Thread Rik van Riel
On 05/04/2015 05:26 AM, Paolo Bonzini wrote:
> 
> 
> On 02/05/2015 07:27, Ingo Molnar wrote:
>>
>>   - synchronize_rcu() avoids having to send an IPI by taking a 
>> peek at rq->curr's pt_regs::flag, and if:
>>
>>  - the flag is 0 then it has observed a quiescent state.
>>
>>  - the flag is 1, then it would set TIF_NOHZ and wait for a 
>>    completion from a TIF_NOHZ callback.
> 
> Isn't this racy?
> 
>   synchronize_rcu CPU          nohz CPU
>   -----------------------------------------
>                                set flag = 0
>   read flag = 0
>                                return to userspace
>   set TIF_NOHZ
> 
> and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU.

Actually, the race in this direction is fine. If flag==0, then
the nohz CPU is not accessing any RCU protected data structures,
and the synchronize_rcu CPU will not be setting TIF_NOHZ.

The race is only a concern if the synchronize_rcu CPU reads
flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after
the nohz CPU has cleared flag (and is unable to handle RCU
stuff). An atomic compare and swap prevents that issue.

The other race, of the synchronize_rcu CPU reading 0, followed
by the nohz CPU going into kernel space, and setting the flag
to 1, should be fine. After all, this means the nohz_full CPU
just went into a new RCU grace period, which is just what the
synchronize_rcu CPU was waiting for.
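To make the ordering concrete, here is a minimal sketch of the cmpxchg
idea, assuming a hypothetical combined word in which bit 0 is the
in-kernel flag and bit 1 stands in for TIF_NOHZ; the names and layout
are illustrative, not existing kernel API:

#define CTX_IN_KERNEL	0x1
#define CTX_WANT_NOHZ	0x2

/* synchronize_rcu() side: request the callback only while still in kernel */
static bool request_nohz_callback(atomic_t *ctx)
{
	int old, cur = atomic_read(ctx);

	do {
		if (!(cur & CTX_IN_KERNEL))
			return false;	/* quiescent state already observed */
		old = cur;
		cur = atomic_cmpxchg(ctx, old, old | CTX_WANT_NOHZ);
	} while (cur != old);

	return true;	/* flag set while the CPU was still in the kernel */
}

Because the flag bit and the callback-request bit are updated in one
atomic operation, the nohz CPU can never clear the flag without also
observing (or invalidating) the pending request.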

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Rik van Riel
On 05/03/2015 02:24 PM, Andy Lutomirski wrote:
> On Sun, May 3, 2015 at 10:30 AM, Rik van Riel  wrote:
>> On 05/03/2015 09:23 AM, Mike Galbraith wrote:
>>
>>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
>>>
>>> 100M * stat() on isolated cpu
>>>
>>> NO_HZ_FULL      off          inactive     housekeeper    nohz_full
>>> real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
>>> user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
>>> sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
>>> (real)          1.000        1.007        1.431          1.957

>> I'm convinced.
>>
>> Time to try the remote sampling of CPU use statistics, and
>> lighten up the RCU overhead of context tracking.
>>
> 
> I don't understand the remote sampling proposal.  Isn't the whole
> point of full nohz to avoid periodically interrupting a busy CPU?  If
> what you have in mind is sending IPIs, then that's just as bad, right?
> 
> If, on the other hand, you're just going to remotely sample the
> in-memory context, that sounds good.

It's the latter.

If you look at /proc/<pid>/{stack,syscall,wchan} and other files,
you will see we already have ways to determine, from in memory
content, where a program is running at a certain point in time.

In fact, the timer interrupt based accounting does a similar thing.
It has a task examine its own in-memory state to figure out what it
was doing before the timer interrupt happened.

The kernel side stack pointer is probably enough to tell us whether
a task is active in kernel space, on an irq stack, or (maybe) in user
space. Not convinced about the latter, we may need to look at the
same state the RCU code keeps track of to see what mode a task is in...

I am looking at the code to see what locks we need to grab.

I suspect the runqueue lock may be enough to ensure that the task
struct and stack do not go away while we are looking at them.

We cannot take the lock_trace(task) from irq context, and we probably
do not need to anyway, since we do not care about a precise stack trace
for the task.
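A rough sketch of what such a sampler could look like, run from a
housekeeping CPU's timer. Only PF_VCPU, cpu_rq() and the rq locking
are real; the accounting helpers and the task_in_kernel_mode() test
are hypothetical placeholders for whatever in-memory state ends up
being used:

/* rq->lock keeps rq->curr and its stack alive while we look at them */
static void sample_remote_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);	/* sched-internal accessor */
	struct task_struct *p;

	raw_spin_lock_irq(&rq->lock);
	p = rq->curr;
	if (p->flags & PF_VCPU)
		account_guest_sample(p);	/* hypothetical helpers */
	else if (task_in_kernel_mode(p))	/* hypothetical test */
		account_system_sample(p);
	else
		account_user_sample(p);
	raw_spin_unlock_irq(&rq->lock);
}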

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Andy Lutomirski
On Fri, May 1, 2015 at 10:27 PM, Ingo Molnar  wrote:
>
> * Andy Lutomirski  wrote:
>
>> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel  wrote:
>> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>> >
>> >> Or we could do that in the syscall path with a single store of a
>> >> constant flag to a location in the task struct. We have a number of
>> >> natural flags that get written on syscall entry, such as:
>> >>
>> >> pushq_cfi $__USER_DS/* pt_regs->ss */
>> >>
>> >> That goes to a constant location on the kernel stack. On return from
>> >> system calls we could write 0 to that location.
>>
>> Huh?  IRET with zero there will fault, and we can't guarantee that
>> all syscalls return using sysret. [...]
>
> So IRET is a slowpath - I was thinking about the fastpath mainly.

Slowpath or not, if we're using part of the hardware frame in pt_regs
as an indication of whether we're in user or kernel mode, we don't get
to arbitrarily change it to the "user" state before IRET or the IRET
will fail.  We could work around that with some kind of trampoline,
but that's complicated.

>
>> [...]  Also, we'd have to audit all the entries, and
>> system_call_after_swapgs currently enables interrupts early enough
>> that an interrupt before all the pushes will do unpredictable things
>> to pt_regs.
>
> An irq hardware frame won't push zero to that selector value, right?
> That's the only bad thing that would confuse the code.
>

I think it's not quite that simple.  The syscall entry looks like, roughly:

fix rsp;
sti;
push ss;
push rsp;
push flags;
push cs;
push rip;

We can get an interrupt part-way through those pushes.  Maybe there's
no bad place where we could get an IRQ since SS is first, but this is
still nasty.

>
> But yes, safely accessing the remote task is a complication, but with
> such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So
> even if we do:
>
>> If we went that route, I'd advocate sticking the flag in tss->sp1.
>> That cacheline is unconditionally read on kernel entry already, and
>> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
>> lost track). [1]
>>
>> Alternatively, I'd suggest that we actually add a whole new word to
>> pt_regs.
>
> ... we'd still have to set TIF_NOHZ and are back to square one in
> terms of race complexity.
>
> But it's not overly complex: by taking the remote CPU's rq-lock from
> synchronize_rcu() we could get a stable pointer to the currently
> executing task.
>
> And if we do that, we might as well use the opportunity and take a
> look at pt_regs as well - this is why sticking it into pt_regs might
> be better.
>
> So I'd:
>
>   - enlarge pt_regs by a word and stick the flag there (this
> allocation is essentially free)
>
>   - update the word from entry/exit
>
>   - synchronize_rcu() avoids having to send an IPI by taking a
> peek at rq->curr's pt_regs::flag, and if:
>
>  - the flag is 0 then it has observed a quiescent state.
>
>  - the flag is 1, then it would set TIF_NOHZ and wait for a
>completion from a TIF_NOHZ callback.

This seems complicated, especially given IST entries that put their
pt_regs in different places.  We get a bit of a free pass on some of
the IST entries due to the IST stack switch thing, but we still need
to worry about NMI and MCE at least.

I think I want to understand the considerations a bit better before I
express a real opinion.  IIUC the remote CPU needs to reliably detect
that we've had a grace period.  This can potentially happen by
observing us with some quiescent (user-mode, guest-mode or idle) flag
set or by an interrupt incrementing a counter.  In CONFIG_PREEMPT_RCU
mode, the remote CPU can presumably observe that we're quiescent
directly.

Does this mean that, in CONFIG_PREEMPT_RCU mode, all of this is
unnecessary?  We may still want context tracking for the timing stats
sampling, but I don't see why we need it for RCU.

In non-CONFIG_PREEMPT_RCU mode, I think we need to worry about
possible races that could cause us to fail to ever detect a grace
period.  If we have the periodic tick turned off and if the remote CPU
checks our context and sees us non-quiescent, then either we need a
non-racy way to get us to do work when we become quiescent or the
remote CPU needs to eventually send us an IPI.  I don't see any way to
do the former without an RMW or barrier on every transition into a
quiescent state.  Even if we used ti->flags, I still think that would
require us to upgrade the cli; test on exit to userspace to a
test-and-set or test-and-clear, either of which is much more
expensive.

I think I prefer a per-cpu approach over a per-task approach, since
it's easier to reason about and it should still only require one
instruction on entry and one instruction on exit.

FWIW, if we go the per-cpu route, what if we used a counter instead of
a simple context flag?  With a counter, the remote CPU could observe
that we changed state and therefore went through a grace period even
if the remote CPU never actually observes us in a quiescent state.
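A minimal sketch of that counter-plus-state idea, ignoring nested
entries (an irq during a syscall would need extra care) and using
made-up names: a per-cpu word whose low bit encodes the context and
whose value bumps on every transition:

DEFINE_PER_CPU(unsigned long, ctx_ctr);	/* bit 0: 1 = in kernel */

/* one increment on entry and one on exit, no atomics or barriers */
static inline void ctx_enter_kernel(void)
{
	this_cpu_add(ctx_ctr, 1);	/* even (user) -> odd (kernel) */
}

static inline void ctx_exit_kernel(void)
{
	this_cpu_add(ctx_ctr, 1);	/* odd (kernel) -> even (user) */
}

/* remote observer: quiescent if in user mode, or if the counter moved */
static bool cpu_passed_quiescent(int cpu, unsigned long snap)
{
	unsigned long cur = per_cpu(ctx_ctr, cpu);

	return !(cur & 1) || cur != snap;
}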

Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Andy Lutomirski
On Sun, May 3, 2015 at 10:30 AM, Rik van Riel  wrote:
> On 05/03/2015 09:23 AM, Mike Galbraith wrote:
>
>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
>>
>> 100M * stat() on isolated cpu
>>
>> NO_HZ_FULL      off          inactive     housekeeper    nohz_full
>> real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
>> user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
>> sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
>> (real)          1.000        1.007        1.431          1.957
>>                                           1.000          1.000
>>
>>  real  0m20.423s  0m27.930s  +rik 1,2
>>  user  0m2.072s   0m10.450s
>>  sys   0m18.304s  0m17.471s
>>  vs off1.431  1.957
>>  vs prev   1.000  1.000
>>
>>  real  0m20.256s  0m27.803s  +paolo 1,2 (2 
>> missing prototypes)
>>  user  0m1.884s   0m10.551s
>>  sys   0m18.353s  0m17.242s
>>  vs off1.419  1.948
>>  vs prev.991   .995
>>
>>  real  0m19.122s  0m26.946s  +rik 3
>>  user  0m1.896s   0m10.292s
>>  sys   0m17.198s  0m16.644s
>>  vs off1.340  1.888
>>  vs prev   .944.969
>
> I'm convinced.
>
> Time to try the remote sampling of CPU use statistics, and
> lighten up the RCU overhead of context tracking.
>

I don't understand the remote sampling proposal.  Isn't the whole
point of full nohz to avoid periodically interrupting a busy CPU?  If
what you have in mind is sending IPIs, then that's just as bad, right?

If, on the other hand, you're just going to remotely sample the
in-memory context, that sounds good.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Rik van Riel
On 05/03/2015 09:23 AM, Mike Galbraith wrote:

> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.
> 
> 100M * stat() on isolated cpu
> 
> NO_HZ_FULL      off          inactive     housekeeper    nohz_full
> real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
> user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
> sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
> (real)          1.000        1.007        1.431          1.957
>                                           1.000          1.000
> 
>  real  0m20.423s  0m27.930s  +rik 1,2
>  user  0m2.072s   0m10.450s
>  sys   0m18.304s  0m17.471s
>  vs off1.431  1.957
>  vs prev   1.000  1.000
> 
>  real  0m20.256s  0m27.803s  +paolo 1,2 (2 
> missing prototypes)
>  user  0m1.884s   0m10.551s
>  sys   0m18.353s  0m17.242s
>  vs off1.419  1.948
>  vs prev.991   .995
> 
>  real  0m19.122s  0m26.946s  +rik 3
>  user  0m1.896s   0m10.292s
>  sys   0m17.198s  0m16.644s
>  vs off1.340  1.888
>  vs prev   .944.969

I'm convinced.

Time to try the remote sampling of CPU use statistics, and
lighten up the RCU overhead of context tracking.

I should be able to start on the remote sampling of CPU use
statistics this week.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-03 Thread Mike Galbraith
On Fri, 2015-05-01 at 11:20 -0400, Rik van Riel wrote:
> On 05/01/2015 02:40 AM, Ingo Molnar wrote:
> 
> >> This patch builds on top of these patches by Paolo:
> >> https://lkml.org/lkml/2015/4/28/188
> >> https://lkml.org/lkml/2015/4/29/139
> >>
> >> Together with this patch I posted earlier this week, the syscall path
> >> on a nohz_full cpu seems to be about 10% faster.
> >> https://lkml.org/lkml/2015/4/24/394
> >>
> >> My test is a simple microbenchmark that calls getpriority() in a loop
> >> 10 million times:
> >>
> >>                run time    system time
> >> vanilla        5.49s       2.08s
> >> __acct patch   5.21s       1.92s
> >> both patches   4.88s       1.71s
> >
> > Just curious, what are the numbers if you don't have context tracking 
> > enabled, i.e. without nohz_full?
> > 
> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without nohz_full,
> just a CPU without nohz_full running the same kernel I tested with
> yesterday:
> 
>               run time    system time
> vanilla       5.49s       2.08s
> __acct patch  5.21s       1.92s
> both patches  4.88s       1.71s
> CPU w/o nohz  3.12s       1.63s   <-- your numbers, mostly

Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements.

100M * stat() on isolated cpu

NO_HZ_FULL      off          inactive     housekeeper    nohz_full
real            0m14.266s    0m14.367s    0m20.427s      0m27.921s
user            0m1.756s     0m1.553s     0m1.976s       0m10.447s
sys             0m12.508s    0m12.769s    0m18.400s      0m17.464s
(real)          1.000        1.007        1.431          1.957
                                          1.000          1.000

 real  0m20.423s  0m27.930s  +rik 1,2
 user  0m2.072s   0m10.450s
 sys   0m18.304s  0m17.471s
 vs off1.431  1.957
 vs prev   1.000  1.000

 real  0m20.256s  0m27.803s  +paolo 1,2 (2 
missing prototypes)
 user  0m1.884s   0m10.551s
 sys   0m18.353s  0m17.242s
 vs off1.419  1.948
 vs prev.991   .995

 real  0m19.122s  0m26.946s  +rik 3
 user  0m1.896s   0m10.292s
 sys   0m17.198s  0m16.644s
 vs off1.340  1.888
 vs prev   .944.969



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-02 Thread Rik van Riel
On 05/02/2015 01:27 AM, Ingo Molnar wrote:

> Regarding the user/kernel execution time split measurement:
> 
> 1) the same flag could be used to sample a remote CPU's statistics 
> from another CPU and update the stats in the currently executing task. 
> As long as there's at least one non-nohz-full CPU, this would work. Or 
> are there systems were all CPUs are nohz-full?

On a NO_HZ_FULL system, you need at least one CPU to execute
RCU callbacks, and do other system things like that, so there
is at least one CPU that is not nohz_full.

On NUMA systems, I could even see the sane option being one
CPU that is not isolated or nohz_full per NUMA node, so we
have a place to route irqs, etc...
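For example, a plausible (entirely made-up) layout on a two-node,
16-CPU box would keep CPU 0 and CPU 8 as per-node housekeeping CPUs,
using the existing boot parameters:

    isolcpus=1-7,9-15 nohz_full=1-7,9-15 rcu_nocbs=1-7,9-15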

> 2) Alternatively we could just drive user/kernel split statistics from 
> context switches, which would be inaccurate if the workload is 
> SCHED_FIFO that only rarely context switches.
> 
> How does this sound?

I think option (1) sounds nicer :)

What locks do we need, besides the runqueue lock to make sure
the task does not go away, and later the task's vtime_lock to
update its time statistics?

Do we even need the lock_trace(task) as taken in proc_pid_stack(),
since all we care is whether or not the thing is in kernel, user,
or guest mode?

For guest mode, we set a flag in the task struct somewhere, that
part is easy.

It also looks like dump_trace() can distinguish between normal,
exception, and irq stacks. Not sure how fancy we need to get...

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel  wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> >> Or we could do that in the syscall path with a single store of a
> >> constant flag to a location in the task struct. We have a number of
> >> natural flags that get written on syscall entry, such as:
> >>
> >> pushq_cfi $__USER_DS/* pt_regs->ss */
> >>
> >> That goes to a constant location on the kernel stack. On return from
> >> system calls we could write 0 to that location.
> 
> Huh?  IRET with zero there will fault, and we can't guarantee that 
> all syscalls return using sysret. [...]

So IRET is a slowpath - I was thinking about the fastpath mainly.

> [...]  Also, we'd have to audit all the entries, and 
> system_call_after_swapgs currently enables interrupts early enough 
> that an interrupt before all the pushes will do unpredictable things 
> to pt_regs.

An irq hardware frame won't push zero to that selector value, right? 
That's the only bad thing that would confuse the code.

> We could further abuse orig_ax, but that would require a lot of 
> auditing.  Honestly, though, I think keeping a flag in an 
> otherwise-hot cache line is a better bet. [...]

That would work too, at the cost of one more instruction, as now we'd 
have to both set and clear it.

> [...]  Also, making it per-cpu instead of per-task will probably be 
> easier on the RCU code, since otherwise the RCU code will have to do 
> some kind of synchronization (even if it's a wait-free probe of the 
> rq lock or similar) to figure out which pt_regs to look at.

So the synchronize_rcu() part is an even slower slow path, in 
comparison with system call entry overhead.

But yes, safely accessing the remote task is a complication, but with 
such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So 
even if we do:

> If we went that route, I'd advocate sticking the flag in tss->sp1. 
> That cacheline is unconditionally read on kernel entry already, and 
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I 
> lost track). [1]
> 
> Alternatively, I'd suggest that we actually add a whole new word to 
> pt_regs.

... we'd still have to set TIF_NOHZ and are back to square one in 
terms of race complexity.

But it's not overly complex: by taking the remote CPU's rq-lock from 
synchronize_rcu() we could get a stable pointer to the currently 
executing task.

And if we do that, we might as well use the opportunity and take a 
look at pt_regs as well - this is why sticking it into pt_regs might 
be better.

So I'd:

  - enlarge pt_regs by a word and stick the flag there (this 
allocation is essentially free)

  - update the word from entry/exit

  - synchronize_rcu() avoids having to send an IPI by taking a 
peek at rq->curr's pt_regs::flag, and if:

 - the flag is 0 then it has observed a quiescent state.

 - the flag is 1, then it would set TIF_NOHZ and wait for a 
   completion from a TIF_NOHZ callback.

synchronize_rcu() often involves waiting for (timer tick driven) grace 
periods anyway, so this is a relatively fast solution - and it would 
limit the overhead to 2 instructions.

On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.
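The C-level analogue of that patching would be a static-key test
around the new store - a sketch, assuming pt_regs grew a hypothetical
ctx_flag word as described above (the static key itself already
exists as context_tracking_enabled):

static inline void ct_set_ctx_flag(struct pt_regs *regs, int val)
{
	/* compiles to a NOP when no CPU is nohz-full */
	if (static_key_false(&context_tracking_enabled))
		regs->ctx_flag = val;	/* hypothetical new pt_regs word */
}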

Regarding the user/kernel execution time split measurement:

1) the same flag could be used to sample a remote CPU's statistics 
from another CPU and update the stats in the currently executing task. 
As long as there's at least one non-nohz-full CPU, this would work. Or 
are there systems where all CPUs are nohz-full?

2) Alternatively we could just drive user/kernel split statistics from 
context switches, which would be inaccurate if the workload is 
SCHED_FIFO that only rarely context switches.

How does this sound?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Mike Galbraith
On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote:
> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel  wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest,
> to another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,
> plus the cost of the IPI on the host, plus the cost of
> the emulation layer, plus VMEXIT/VMENTER on the second
> VCPU to trigger the IPI work, and possibly a second
> VMEXIT/VMENTER for IPI completion.
> 
> I suspect it would be better to do RCU callback offload
> in some other way.

I don't get it.  How the heck do people manage to talk about realtime in
virtual boxen, and not at least crack a smile.  Real, virtual, real,
virtual... what's wrong with this picture?

Why is virtual realtime not an oxymoron?

I did that for grins once, and it was either really funny, or really
sad, not sure which... but it did not look really really useful.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 12:11 PM, Rik van Riel  wrote:
> On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>
>> Or we could do that in the syscall path with a single store of a
>> constant flag to a location in the task struct. We have a number of
>> natural flags that get written on syscall entry, such as:
>>
>> pushq_cfi $__USER_DS/* pt_regs->ss */
>>
>> That goes to a constant location on the kernel stack. On return from
>> system calls we could write 0 to that location.

Huh?  IRET with zero there will fault, and we can't guarantee that all
syscalls return using sysret.  Also, we'd have to audit all the entries,
and system_call_after_swapgs currently enables interrupts early enough
that an interrupt before all the pushes will do unpredictable things to
pt_regs.

We could further abuse orig_ax, but that would require a lot of
auditing.  Honestly, though, I think keeping a flag in an
otherwise-hot cache line is a better bet.  Also, making it per-cpu
instead of per-task will probably be easier on the RCU code, since
otherwise the RCU code will have to do some kind of synchronization
(even if it's a wait-free probe of the rq lock or similar) to figure
out which pt_regs to look at.

If we went that route, I'd advocate sticking the flag in tss->sp1.
That cacheline is unconditionally read on kernel entry already, and
sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
lost track). [1]

Alternatively, I'd suggest that we actually add a whole new word to pt_regs.

[1] It's not unconditionally accessed yet, but it will be once Denys'
latest patches are in.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 PM, Ingo Molnar wrote:

> Or we could do that in the syscall path with a single store of a 
> constant flag to a location in the task struct. We have a number of 
> natural flags that get written on syscall entry, such as:
> 
> pushq_cfi $__USER_DS/* pt_regs->ss */
> 
> That goes to a constant location on the kernel stack. On return from 
> system calls we could write 0 to that location.
> 
> So the remote CPU would have to do a read of this location. There are 
> two cases:
> 
>  - If it's 0, then it has observed quiescent state on that CPU. (It 
>does not have to be atomics anymore, as we'd only observe the value 
>and MESI coherency takes care of it.)

That should do the trick.

>  - If it's not 0 then the remote CPU is not executing user-space code 
>and we can install (remotely) a TIF_NOHZ flag in it and expect it 
>to process it either on return to user-space or on a context 
>switch.

I may have to think about this a little more, but
it seems like it should work.

Can we use a separate byte in the flags word for
flags that can get set remotely, so we can do
stores and clearing of local-only flags without
atomic instructions?
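Something along these lines, purely illustrative and ignoring
endianness, is what that split might look like:

union split_tif_flags {
	u32 word;		/* cmpxchg target when setting remote bits */
	struct {
		u8  remote;	/* bits other CPUs may set, e.g. TIF_NOHZ */
		u8  local;	/* plain stores only, no lock prefix needed */
		u16 unused;
	};
};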

> This way, unless I'm missing something, reduces the overhead to a 
> single store to a hot cacheline on return-to-userspace - which 
> instruction if we place it well might as well be close to zero cost. 
> No syscall entry cost. Slow-return cost only in the (rare) case of 
> someone using synchronize_rcu().

I think that should take care of the RCU aspect of
nohz_full.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel  wrote:
> > 
> >>> I can understand people running hard-RT workloads not wanting to 
> >>> see the overhead of a timer tick or a scheduler tick with variable 
> >>> (and occasionally heavy) work done in IRQ context, but the jitter 
> >>> caused by a single trivial IPI with constant work should be very, 
> >>> very low and constant.
> >>
> >> Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> >> At that point an IPI, either on the host or in the guest, involves a 
> >> full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, but it's about the basic 
synchronization guarantee of synchronize_rcu(): that all RCU read-side 
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

pushq_cfi $__USER_DS/* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are 
two cases:

 - If it's 0, then it has observed quiescent state on that CPU. (It 
   does not have to be atomics anymore, as we'd only observe the value 
   and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.

This way, unless I'm missing something, reduces the overhead to a 
single store to a hot cacheline on return-to-userspace - which 
instruction if we place it well might as well be close to zero cost. 
No syscall entry cost. Slow-return cost only in the (rare) case of 
someone using synchronize_rcu().
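Putting the two cases together, the synchronize_rcu() side might look
roughly like this; the ctx_flag field and the completion wait are
hypothetical, only task_pt_regs() and the thread-flag helper are real:

static void rcu_observe_remote(struct task_struct *curr)
{
	/* plain load; MESI makes the 0/non-0 observation safe */
	if (READ_ONCE(task_pt_regs(curr)->ctx_flag) == 0)
		return;				/* quiescent state observed */

	set_tsk_thread_flag(curr, TIF_NOHZ);	/* handled on return or ctxsw */
	wait_for_nohz_completion(curr);		/* hypothetical completion wait */
}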

Hm?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
>>> I can understand people running hard-RT workloads not wanting to 
>>> see the overhead of a timer tick or a scheduler tick with variable 
>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>> caused by a single trivial IPI with constant work should be very, 
>>> very low and constant.
>>
>> Not if the realtime workload is running inside a KVM guest.
> 
> I don't buy this:
> 
>> At that point an IPI, either on the host or in the guest, involves a 
>> full VMEXIT & VMENTER cycle.
> 
> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> usec on recent hardware, and I bet it will get better with time.
> 
> I'm not aware of any hard-RT workload that cannot take 1 usec 
> latencies.

Now think about doing this kind of IPI from inside a guest,
to another VCPU on the same guest.

Now you are looking at VMEXIT/VMENTER on the first VCPU,
plus the cost of the IPI on the host, plus the cost of
the emulation layer, plus VMEXIT/VMENTER on the second
VCPU to trigger the IPI work, and possibly a second
VMEXIT/VMENTER for IPI completion.

I suspect it would be better to do RCU callback offload
in some other way.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> > I.e. much of what we do today, except that we could skip variable 
> > work such as the scheduler tick or (unforced) RCU processing like 
> > the RCU softirq work.
> 
> Any ideas how we could avoid that sampling timer interrupt latency 
> stacking up when dealing with both guest and host?

Well, it would be host_latency+guest_latency worst-case, right?

That should still be bounded, as long as both guest and host is 
running nohz-full mode.

Also, technically when the host gets such an IRQ, the guest is 
interrupted as well. So the host could do the guest's work and pass 
through the result via paravirt or so.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 01:12 PM, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
>> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
>>>
>>> * Rik van Riel  wrote:
>>>
 On 05/01/2015 12:37 PM, Ingo Molnar wrote:

> Also note that this bit in context_tracking_enter():
>
> if (state == CONTEXT_USER) {
> trace_user_enter(0);
> vtime_user_enter(current);
> }
>
>
> is related to precise time measurement of user/kernel execution 
> times, it's not needed by the scheduler at all, it's just exported 
> to tooling. It's not fundamental to the scheduler.

 Any objections to the idea from the other thread to simply keep the 
 time accumulating in buffers in local_clock() units, and only update 
 the task vtime once a second or so?
>>>
>>> So I really think per syscall overhead is the wrong thing to do for 
>>> anything that a common distro enables.
>>>
>>> I see very little use for such precise, high-freq measurements on 
>>> normal systems - and abnormal systems could enable it dynamically just 
>>> like they can enable syscall auditing.
>>>
>>> I.e. I don't mind the occasional crazy piece of code, as long as it 
>>> does not hurt the innocent.
>>
>> Then how should/could we keep a rough idea of user / system / guest 
>> time when running without a periodic timer tick?
> 
> So I'd split the timer tick into two parts: just the constant-work 
> sampling bit that doesn't do much, and the variable-work part which 
> gets essentially shut off when the timeout is far into the future.
> 
> Then we could do IRQ driven sampling without introducing variable 
> amount jitter into hard-RT execution time.
> 
> I.e. much of what we do today, except that we could skip variable work 
> such as the scheduler tick or (unforced) RCU processing like the RCU 
> softirq work.

Any ideas how we could avoid that sampling timer interrupt
latency stacking up when dealing with both guest and host?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel  wrote:
> > 
> >> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
> >>
> >>> Also note that this bit in context_tracking_enter():
> >>>
> >>> if (state == CONTEXT_USER) {
> >>> trace_user_enter(0);
> >>> vtime_user_enter(current);
> >>> }
> >>>
> >>>
> >>> is related to precise time measurement of user/kernel execution 
> >>> times, it's not needed by the scheduler at all, it's just exported 
> >>> to tooling. It's not fundamental to the scheduler.
> >>
> >> Any objections to the idea from the other thread to simply keep the 
> >> time accumulating in buffers in local_clock() units, and only update 
> >> the task vtime once a second or so?
> > 
> > So I really think per syscall overhead is the wrong thing to do for 
> > anything that a common distro enables.
> > 
> > I see very little use for such precise, high-freq measurements on 
> > normal systems - and abnormal systems could enable it dynamically just 
> > like they can enable syscall auditing.
> > 
> > I.e. I don't mind the occasional crazy piece of code, as long as it 
> > does not hurt the innocent.
> 
> Then how should/could we keep a rough idea of user / system / guest 
> time when running without a periodic timer tick?

So I'd split the timer tick into two parts: just the constant-work 
sampling bit that doesn't do much, and the variable-work part which 
gets essentially shut off when the timeout is far into the future.

Then we could do IRQ driven sampling without introducing variable 
amount jitter into hard-RT execution time.

I.e. much of what we do today, except that we could skip variable work 
such as the scheduler tick or (unforced) RCU processing like the RCU 
softirq work.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:45 PM, Ingo Molnar wrote:
> 
> * Rik van Riel  wrote:
> 
>> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
>>
>>> Also note that this bit in context_tracking_enter():
>>>
>>> if (state == CONTEXT_USER) {
>>> trace_user_enter(0);
>>> vtime_user_enter(current);
>>> }
>>>
>>>
>>> is related to precise time measurement of user/kernel execution 
>>> times, it's not needed by the scheduler at all, it's just exported 
>>> to tooling. It's not fundamental to the scheduler.
>>
>> Any objections to the idea from the other thread to simply keep the 
>> time accumulating in buffers in local_clock() units, and only update 
>> the task vtime once a second or so?
> 
> So I really think per syscall overhead is the wrong thing to do for 
> anything that a common distro enables.
> 
> I see very little use for such precise, high-freq measurements on 
> normal systems - and abnormal systems could enable it dynamically just 
> like they can enable syscall auditing.
> 
> I.e. I don't mind the occasional crazy piece of code, as long as it 
> does not hurt the innocent.

Then how should/could we keep a rough idea of user / system / guest
time when running without a periodic timer tick?

These statistics are useful, and gathering them in a low impact
way (local_clock read + variable read + subtraction + addition +
store) may be acceptable overhead.

This is a few function calls, one tsc read, a possible cache
line miss, and an unlikely branch.
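A sketch of that accumulation path, with invented names (vtime_buf
and the flush helper exist only for illustration):

struct vtime_buf {
	u64 last;			/* local_clock() at the last transition */
	u64 last_flush;			/* when we last folded into task vtime */
	u64 user_ns, system_ns;		/* accumulated, not yet in task vtime */
};

static DEFINE_PER_CPU(struct vtime_buf, vtime_buf);

static inline void vtime_mode_switch(bool was_user)
{
	struct vtime_buf *b = this_cpu_ptr(&vtime_buf);
	u64 now = local_clock();	/* the one tsc read */

	if (was_user)
		b->user_ns += now - b->last;
	else
		b->system_ns += now - b->last;
	b->last = now;

	if (unlikely(now - b->last_flush > NSEC_PER_SEC))
		vtime_flush_to_task(current);	/* hypothetical, ~once a second */
}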

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
> 
> > Also note that this bit in context_tracking_enter():
> > 
> > if (state == CONTEXT_USER) {
> > trace_user_enter(0);
> > vtime_user_enter(current);
> > }
> > 
> > 
> > is related to precise time measurement of user/kernel execution 
> > times, it's not needed by the scheduler at all, it's just exported 
> > to tooling. It's not fundamental to the scheduler.
> 
> Any objections to the idea from the other thread to simply keep the 
> time accumulating in buffers in local_clock() units, and only update 
> the task vtime once a second or so?

So I really think per syscall overhead is the wrong thing to do for 
anything that a common distro enables.

I see very little use for such precise, high-freq measurements on 
normal systems - and abnormal systems could enable it dynamically just 
like they can enable syscall auditing.

I.e. I don't mind the occasional crazy piece of code, as long as it 
does not hurt the innocent.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:37 PM, Ingo Molnar wrote:

> Also note that this bit in context_tracking_enter():
> 
> if (state == CONTEXT_USER) {
> trace_user_enter(0);
> vtime_user_enter(current);
> }
> 
> 
> is related to precise time measurement of user/kernel execution times, 
> it's not needed by the scheduler at all, it's just exported to 
> tooling. It's not fundamental to the scheduler.

Any objections to the idea from the other thread to simply keep
the time accumulating in buffers in local_clock() units, and
only update the task vtime once a second or so?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > 2. To suppress the timing tick, we need to get some timing for, 
> > um, the scheduler?  I wasn't really sure about this one.
> 
> So we have variable timeslice timers for the scheduler implemented, 
> they are off by default but they worked the last time someone tried 
> them. See 
> the 'HRTICK' scheduler feature.
> 
> And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
> stopping the scheduler tick. (within reason.)

Also note that this bit in context_tracking_enter():

if (state == CONTEXT_USER) {
trace_user_enter(0);
vtime_user_enter(current);
}


is related to precise time measurement of user/kernel execution times, 
it's not needed by the scheduler at all, it's just exported to 
tooling. It's not fundamental to the scheduler.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> > I can understand people running hard-RT workloads not wanting to 
> > see the overhead of a timer tick or a scheduler tick with variable 
> > (and occasionally heavy) work done in IRQ context, but the jitter 
> > caused by a single trivial IPI with constant work should be very, 
> > very low and constant.
> 
> Not if the realtime workload is running inside a KVM guest.

I don't buy this:

> At that point an IPI, either on the host or in the guest, involves a 
> full VMEXIT & VMENTER cycle.

So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
usec on recent hardware, and I bet it will get better with time.

I'm not aware of any hard-RT workload that cannot take 1 usec 
latencies.
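
For reference, the arithmetic behind that estimate (the clock rates
are illustrative):

    2000 cycles / 2.0 GHz = 1.0 usec
    2000 cycles / 3.0 GHz ~ 0.67 usec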

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> On 05/01/2015 12:03 PM, Andy Lutomirski wrote:
> 
> > The last time I asked, the impression I got was that we needed two things:
> > 
> > 1. We can't pluck things from the RCU list without knowing whether the
> > CPU is in an RCU read-side critical section, and we can't know that
> > unless we have regular grace periods or we know that the CPU is idle.
> > To make the CPU detectably idle, we need to set a bit somewhere.
> 
> More than that. We also need a way for another CPU to identify the 
> callbacks they could run for us, without confusing them with new 
> callbacks queued after we transitioned back from USER to KERNEL 
> context.

That's easy: a simple, constant-work IPI sent to nohz-full CPUs could 
just dequeue these callbacks, if it sees that we are still in 
user-mode.

It's super easy because such an IPI would essentially be equivalent to 
a system call context from RCU's POV, if it uses this test:

if (user_mode(regs)) {
... pluck RCU callbacks ...
}

That way we can have lots of overhead removed from the syscall 
path ...
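
A minimal sketch of such an IPI (the handler name and the hand-off
helper are invented for illustration; the rest are existing kernel
calls):

    /* Runs on the nohz-full CPU, in hard irq context. */
    static void rcu_pluck_ipi(void *unused)
    {
            struct pt_regs *regs = get_irq_regs();

            /*
             * If we interrupted pure user mode, this CPU is in a
             * quiescent state and its pending callbacks can be
             * handed to the grace-period machinery.
             */
            if (user_mode(regs))
                    rcu_hand_off_callbacks();       /* hypothetical */
    }

    /* Caller side, e.g. from synchronize_rcu(): */
    smp_call_function_single(cpu, rcu_pluck_ipi, NULL, 1);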

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:21 PM, Ingo Molnar wrote:
> 
> * Andy Lutomirski  wrote:
> 
>>> So what's the point? Why not remove this big source of overhead 
>>> altogether?
>>
>> The last time I asked, the impression I got was that we needed two 
>> things:
>>
>> 1. We can't pluck things from the RCU list without knowing whether 
>> the CPU is in an RCU read-side critical section, and we can't know 
>> that unless we have regular grace periods or we know that the CPU is 
>> idle. To make the CPU detectably idle, we need to set a bit 
>> somewhere.
> 
> 'Idle' as in 'executing pure user-space mode, without entering the 
> kernel and possibly doing an rcu_read_lock()', right?
> 
> So we don't have to test it from the remote CPU: we could probe such 
> CPUs via a single low-overhead IPI. I'd much rather push such overhead 
> to sync_rcu() than to the syscall entry code!
>
> I can understand people running hard-RT workloads not wanting to see 
> the overhead of a timer tick or a scheduler tick with variable (and 
> occasionally heavy) work done in IRQ context, but the jitter caused by 
> a single trivial IPI with constant work should be very, very low and 
> constant.

Not if the realtime workload is running inside a KVM
guest.

At that point an IPI, either on the host or in the
guest, involves a full VMEXIT & VMENTER cycle.

I suspect it would be easy enough at user_enter() or
guest_enter() time to:
1) flip the bit
2) check to see if we have any callbacks queued on our own queue, and
3) if so, move the callbacks from that queue to the "another CPU can
   take these" queue

At that point, another CPU wanting to do an RCU grace period
advance can:
1) skip trying to advance the grace period on the CPU that
   is in userspace or guest mode, and
2) take the "for somebody else" callbacks of that CPU, and
   run them

This does not seem overly complicated.
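
As a sketch, the enter-side steps 1-3 above could look like this (the
struct and its fields are invented for illustration):

    struct cb_queues {
            struct list_head mine;       /* new callbacks land here     */
            struct list_head for_others; /* a remote CPU may take these */
            bool             in_user;
    };
    static DEFINE_PER_CPU(struct cb_queues, cb_queues);

    /* Called from user_enter()/guest_enter(); ordering vs. the
     * remote reader (locking or memory barriers) is elided here. */
    static void enter_user_handoff(void)
    {
            struct cb_queues *q = this_cpu_ptr(&cb_queues);

            WRITE_ONCE(q->in_user, true);                   /* 1 */
            if (!list_empty(&q->mine))                      /* 2 */
                    list_splice_tail_init(&q->mine,
                                          &q->for_others);  /* 3 */
    }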

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:03 PM, Andy Lutomirski wrote:

> The last time I asked, the impression I got was that we needed two things:
> 
> 1. We can't pluck things from the RCU list without knowing whether the
> CPU is in an RCU read-side critical section, and we can't know that
> unless we have regular grace periods or we know that the CPU is idle.
> To make the CPU detectably idle, we need to set a bit somewhere.

More than that. We also need a way for another CPU to identify the
callbacks they could run for us, without confusing them with new
callbacks queued after we transitioned back from USER to KERNEL
context.

> 2. To suppress the timing tick, we need to get some timing for, um,
> the scheduler?  I wasn't really sure about this one.
> 
> Could we reduce the overhead by making the IN_USER vs IN_KERNEL
> indication be a single bit and, worst case, an rdtsc and maybe a
> subtraction?  We could probably get away with banning full nohz on
> non-invariant tsc systems.

I suspect we can.

There is no need to update the vtime every single time
we enter vtime_user_enter and functions like it.

We can keep a buffer, which:
1) keeps values in TSC cycles (or whatever unit local_clock does)
2) is only ever accessed by the current task, so it requires no
   locking
3) values can be actually folded into vtime periodically, when
   they exceed a certain threshold (1 second ?)

That means the vtime_seqlock is something that we would only take
once a second or so, and the calculations in account_user_time
would only be done on a fairly infrequent basis. That has the
potential to reduce overhead by a lot.
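
A sketch of that buffering (the field names and the once-a-second
threshold are illustrative; vtime_fold() stands in for a hypothetical
helper that takes the vtime_seqlock and calls account_user_time()):

    #define VTIME_FLUSH_NS  NSEC_PER_SEC  /* fold roughly once a second */

    struct vtime_buf {
            u64 user_ns;    /* unflushed user time, local_clock() units */
            u64 last;       /* local_clock() at the last transition     */
    };

    /* USER -> KERNEL transition; only 'current' touches this buffer,
     * so no locking is needed. */
    static void vtime_buf_user_exit(struct vtime_buf *vb)
    {
            u64 now = local_clock();

            vb->user_ns += now - vb->last;  /* read + sub + add + store */
            vb->last = now;

            if (unlikely(vb->user_ns > VTIME_FLUSH_NS)) {
                    vtime_fold(current, vb->user_ns);   /* hypothetical */
                    vb->user_ns = 0;
            }
    }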

If nobody has any objections to that kind of change, I would be
happy to implement it.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > So what's the point? Why not remove this big source of overhead 
> > altogether?
> 
> The last time I asked, the impression I got was that we needed two 
> things:
> 
> 1. We can't pluck things from the RCU list without knowing whether 
> the CPU is in an RCU read-side critical section, and we can't know 
> that unless we have regular grace periods or we know that the CPU is 
> idle. To make the CPU detectably idle, we need to set a bit 
> somewhere.

'Idle' as in 'executing pure user-space mode, without entering the 
kernel and possibly doing an rcu_read_lock()', right?

So we don't have to test it from the remote CPU: we could probe such 
CPUs via a single low-overhead IPI. I'd much rather push such overhead 
to sync_rcu() than to the syscall entry code!

I can understand people running hard-RT workloads not wanting to see 
the overhead of a timer tick or a scheduler tick with variable (and 
occasionally heavy) work done in IRQ context, but the jitter caused by 
a single trivial IPI with constant work should be very, very low and 
constant.

If user-space RT code does not tolerate _that_ kind of latencies then 
it really has its priorities wrong and we should not try to please it. 
It should not hurt the other 99.9% of sane hard-RT users.

And the other usecase, virtualization, obviously does not care and 
could take the IPI just fine.

> 2. To suppress the timing tick, we need to get some timing for, um, 
> the scheduler?  I wasn't really sure about this one.

So we have variable timeslice timers for the scheduler implemented, 
> they are off by default but they worked the last time someone tried 
> them. See 
the 'HRTICK' scheduler feature.

And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
stopping the scheduler tick. (within reason.)

> Could we reduce the overhead by making the IN_USER vs IN_KERNEL 
> indication be a single bit and, worst case, an rdtsc and maybe a 
> subtraction?  We could probably get away with banning full nohz on 
> non-invariant tsc systems.
> 
> (I do understand why it would be tricky to transition from IN_USER 
> to IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

We can make it literally zero overhead: by using an IPI from 
synchronize_rcu() and friends.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar  wrote:
>
> * Rik van Riel  wrote:
>
>> > I.e. what's the baseline we are talking about?
>>
>> It's an astounding difference. This is not a kernel without
>> nohz_full, just a CPU without nohz_full running the same kernel I
>> tested with yesterday:
>>
>>               run time   system time
>> vanilla       5.49s      2.08s
>> __acct patch  5.21s      1.92s
>> both patches  4.88s      1.71s
>> CPU w/o nohz  3.12s      1.63s      <-- your numbers, mostly
>>
>> What is even more interesting is that the majority of the time
>> difference seems to come from _user_ time, which has gone down from
>> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
>> the CPU without nohz_full enabled...
>>
>> At syscall entry time, the nohz_full context tracking code is very
>> straightforward. We check thread_info->flags &
>> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
>> handles USER -> KERNEL context transition.
>>
>> Syscall exit time is a convoluted mess. Both do_notify_resume and
>> syscall_trace_leave call exit_user() on entry and enter_user() on
>> exit, leaving the time spent looping around between int_with_check
>> and syscall_return: in entry_64.S accounted as user time.
>>
>> I sent an email about this last night, it may be useful to add a
>> third test & function call point to the syscall return code, where
>> we can call user_enter() just ONCE, and remove the other context
>> tracking calls from that loop.
>
> So what I'm wondering about is the big picture:
>
>  - This is crazy big overhead in something as fundamental as system
>calls!
>
>  - We don't even have the excuse of the syscall auditing code, which
>kind of has to run for every syscall if it wants to do its job!
>
>  - [ The 'precise vtime' stuff that is driven from syscall entry/exit
>  is crazy, and I hope not enabled in any distro. ]
>
>  - So why are we doing this in every syscall time at all?
>
> Basically the whole point of user-context tracking is to be able to
> flush pending RCU callbacks. But that's crazy, we can sure defer a few
> kfree()s on this CPU, even indefinitely!
>
> If some other CPU does a sync_rcu(), then it can very well pluck those
> callbacks from this super low latency CPU's RCU lists (with due care)
> and go and free stuff itself ... There's no need to disturb this CPU
> for that!
>
> If user-space does not do anything kernel-ish then there won't be any
> new RCU callbacks piled up, so it's not like it's a resource leak
> issue either.
>
> So what's the point? Why not remove this big source of overhead
> altogether?

The last time I asked, the impression I got was that we needed two things:

1. We can't pluck things from the RCU list without knowing whether the
CPU is in an RCU read-side critical section, and we can't know that
unless we have regular grace periods or we know that the CPU is idle.
To make the CPU detectably idle, we need to set a bit somewhere.

2. To suppress the timing tick, we need to get some timing for, um,
the scheduler?  I wasn't really sure about this one.

Could we reduce the overhead by making the IN_USER vs IN_KERNEL
indication be a single bit and, worst case, an rdtsc and maybe a
subtraction?  We could probably get away with banning full nohz on
non-invariant tsc systems.

(I do understand why it would be tricky to transition from IN_USER to
IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)
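
A sketch of that single-bit indication (the percpu variable is
invented for illustration):

    /* 0 == IN_USER, 1 == IN_KERNEL; one plain store per transition. */
    static DEFINE_PER_CPU(int, context_state);

    static inline void ct_enter_kernel(void)
    {
            __this_cpu_write(context_state, 1);
            /* worst case, add an rdtsc + subtraction for the time split */
    }

    static inline void ct_enter_user(void)
    {
            __this_cpu_write(context_state, 0);
    }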

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel  wrote:

> > I.e. what's the baseline we are talking about?
> 
> It's an astounding difference. This is not a kernel without 
> nohz_full, just a CPU without nohz_full running the same kernel I 
> tested with yesterday:
> 
>               run time   system time
> vanilla       5.49s      2.08s
> __acct patch  5.21s      1.92s
> both patches  4.88s      1.71s
> CPU w/o nohz  3.12s      1.63s      <-- your numbers, mostly
> 
> What is even more interesting is that the majority of the time 
> difference seems to come from _user_ time, which has gone down from 
> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on 
> the CPU without nohz_full enabled...
> 
> At syscall entry time, the nohz_full context tracking code is very 
> straightforward. We check thread_info->flags & 
> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which 
> handles USER -> KERNEL context transition.
> 
> Syscall exit time is a convoluted mess. Both do_notify_resume and 
> syscall_trace_leave call exit_user() on entry and enter_user() on 
> exit, leaving the time spent looping around between int_with_check 
> and syscall_return: in entry_64.S accounted as user time.
> 
> I sent an email about this last night, it may be useful to add a 
> third test & function call point to the syscall return code, where 
> we can call user_enter() just ONCE, and remove the other context 
> tracking calls from that loop.

So what I'm wondering about is the big picture:

 - This is crazy big overhead in something as fundamental as system
   calls!

 - We don't even have the excuse of the syscall auditing code, which
   kind of has to run for every syscall if it wants to do its job!

 - [ The 'precise vtime' stuff that is driven from syscall entry/exit 
 is crazy, and I hope not enabled in any distro. ]

 - So why are we doing this in every syscall time at all?

Basically the whole point of user-context tracking is to be able to 
flush pending RCU callbacks. But that's crazy, we can sure defer a few 
kfree()s on this CPU, even indefinitely!

If some other CPU does a sync_rcu(), then it can very well pluck those 
callbacks from this super low latency CPU's RCU lists (with due care) 
and go and free stuff itself ... There's no need to disturb this CPU 
for that!

If user-space does not do anything kernel-ish then there won't be any 
new RCU callbacks piled up, so it's not like it's a resource leak 
issue either.

So what's the point? Why not remove this big source of overhead 
altogether?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 AM, Ingo Molnar wrote:

>> This patch builds on top of these patches by Paolo:
>> https://lkml.org/lkml/2015/4/28/188
>> https://lkml.org/lkml/2015/4/29/139
>>
>> Together with this patch I posted earlier this week, the syscall path
>> on a nohz_full cpu seems to be about 10% faster.
>> https://lkml.org/lkml/2015/4/24/394
>>
>> My test is a simple microbenchmark that calls getpriority() in a loop
>> 10 million times:
>>
>>              run time   system time
>> vanilla      5.49s      2.08s
>> __acct patch 5.21s      1.92s
>> both patches 4.88s      1.71s
>
> Just curious, what are the numbers if you don't have context tracking 
> enabled, i.e. without nohz_full?
> 
> I.e. what's the baseline we are talking about?

It's an astounding difference. This is not a kernel without nohz_full,
just a CPU without nohz_full running the same kernel I tested with
yesterday:

                run time   system time
vanilla         5.49s      2.08s
__acct patch    5.21s      1.92s
both patches    4.88s      1.71s
CPU w/o nohz    3.12s      1.63s      <-- your numbers, mostly

What is even more interesting is that the majority of the time
difference seems to come from _user_ time, which has gone down
from around 3.4 seconds in the vanilla kernel to around 1.5 seconds
on the CPU without nohz_full enabled...

At syscall entry time, the nohz_full context tracking code is very
straightforward. We check thread_info->flags & _TIF_WORK_SYSCALL_ENTRY,
and call syscall_trace_enter_phase1, which handles USER -> KERNEL
context transition.

Syscall exit time is a convoluted mess. Both do_notify_resume and
syscall_trace_leave call exit_user() on entry and enter_user()
on exit, leaving the time spent looping around between int_with_check
and syscall_return: in entry_64.S accounted as user time.

I sent an email about this last night, it may be useful to add a
third test & function call point to the syscall return code, where
we can call user_enter() just ONCE, and remove the other context
tracking calls from that loop.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* r...@redhat.com  wrote:

> From: Rik van Riel 
> 
> On syscall entry with nohz_full on, we enable interrupts, call user_exit,
> disable interrupts, do something, re-enable interrupts, and go on our
> merry way.
> 
> Profiling shows that a large amount of the nohz_full overhead comes
> from the extraneous disabling and re-enabling of interrupts. Andy
> suggested simply not enabling interrupts until after the context
> tracking code has done its thing, which allows us to skip a whole
> interrupt disable & re-enable cycle.
> 
> This patch builds on top of these patches by Paolo:
> https://lkml.org/lkml/2015/4/28/188
> https://lkml.org/lkml/2015/4/29/139
> 
> Together with this patch I posted earlier this week, the syscall path
> on a nohz_full cpu seems to be about 10% faster.
> https://lkml.org/lkml/2015/4/24/394
> 
> My test is a simple microbenchmark that calls getpriority() in a loop
> 10 million times:
> 
>               run time   system time
> vanilla       5.49s      2.08s
> __acct patch  5.21s      1.92s
> both patches  4.88s      1.71s

Just curious, what are the numbers if you don't have context tracking 
enabled, i.e. without nohz_full?

I.e. what's the baseline we are talking about?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Mike Galbraith
On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote:
> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> > 
> > * Rik van Riel <r...@redhat.com> wrote:
> > 
> > > > I can understand people running hard-RT workloads not wanting to 
> > > > see the overhead of a timer tick or a scheduler tick with variable 
> > > > (and occasionally heavy) work done in IRQ context, but the jitter 
> > > > caused by a single trivial IPI with constant work should be very, 
> > > > very low and constant.
> > > 
> > > Not if the realtime workload is running inside a KVM guest.
> > 
> > I don't buy this:
> > 
> > > At that point an IPI, either on the host or in the guest, involves a 
> > > full VMEXIT & VMENTER cycle.
> > 
> > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> > usec on recent hardware, and I bet it will get better with time.
> > 
> > I'm not aware of any hard-RT workload that cannot take 1 usec 
> > latencies.
> 
> Now think about doing this kind of IPI from inside a guest,
> to another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,
> plus the cost of the IPI on the host, plus the cost of
> the emulation layer, plus VMEXIT/VMENTER on the second
> VCPU to trigger the IPI work, and possibly a second
> VMEXIT/VMENTER for IPI completion.
> 
> I suspect it would be better to do RCU callback offload
> in some other way.

I don't get it.  How the heck do people manage to talk about realtime in
virtual boxen, and not at least crack a smile.  Real, virtual, real,
virtual... what's wrong with this picture?

Why is virtual realtime not an oxymoron?

I did that for grins once, and it was either really funny, or really
sad, not sure which... but it did not look really really useful.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski <l...@amacapital.net> wrote:

> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel <r...@redhat.com> wrote:
> > On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> >
> > > Or we could do that in the syscall path with a single store of a
> > > constant flag to a location in the task struct. We have a number of
> > > natural flags that get written on syscall entry, such as:
> > >
> > >         pushq_cfi $__USER_DS            /* pt_regs->ss */
> > >
> > > That goes to a constant location on the kernel stack. On return from
> > > system calls we could write 0 to that location.
> 
> Huh?  IRET with zero there will fault, and we can't guarantee that 
> all syscalls return using sysret. [...]

So IRET is a slowpath - I was thinking about the fastpath mainly.

> [...]  Also, we'd have to audit all the entries, and 
> system_call_after_swapgs currently enables interrupts early enough 
> that an interrupt before all the pushes will do unpredictable things 
> to pt_regs.

An irq hardware frame won't push zero to that selector value, right? 
That's the only bad thing that would confuse the code.

> We could further abuse orig_ax, but that would require a lot of 
> auditing.  Honestly, though, I think keeping a flag in an 
> otherwise-hot cache line is a better bet. [...]

That would work too, at the cost of one more instruction, as now we'd 
have to both set and clear it.

> [...]  Also, making it per-cpu instead of per-task will probably be 
> easier on the RCU code, since otherwise the RCU code will have to do 
> some kind of synchronization (even if it's a wait-free probe of the 
> rq lock or similar) to figure out which pt_regs to look at.

So the synchronize_rcu() part is an even slower slow path, in 
comparison with system call entry overhead.

But yes, safely accessing the remote task is a complication, but with 
such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So 
even if we do:

> If we went that route, I'd advocate sticking the flag in tss->sp1. 
> That cacheline is unconditionally read on kernel entry already, and 
> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I 
> lost track). [1]
> 
> Alternatively, I'd suggest that we actually add a whole new word to 
> pt_regs.

... we'd still have to set TIF_NOHZ and are back to square one in 
terms of race complexity.

But it's not overly complex: by taking the remote CPU's rq->lock from 
synchronize_rcu() we could get a stable pointer to the currently 
executing task.

And if we do that, we might as well use the opportunity and take a 
look at pt_regs as well - this is why sticking it into pt_regs might 
be better.

So I'd:

  - enlarge pt_regs by a word and stick the flag there (this 
    allocation is essentially free)

  - update the word from entry/exit

  - synchronize_rcu() avoids having to send an IPI by taking a 
    peek at rq->curr's pt_regs::flag, and if:

     - the flag is 0 then it has observed a quiescent state.

     - the flag is 1, then it would set TIF_NOHZ and wait for a 
       completion from a TIF_NOHZ callback.

synchronize_rcu() often involves waiting for (timer tick driven) grace 
periods anyway, so this is a relatively fast solution - and it would 
limit the overhead to 2 instructions.
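
In rough pseudo-C, the synchronize_rcu() side of that scheme might
look like this (pt_regs::ct_flag and the ct_qs_done completion are
hypothetical; the other calls are existing kernel interfaces):

    static void rcu_probe_cpu(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            struct task_struct *p;
            bool in_kernel;

            raw_spin_lock_irq(&rq->lock);    /* stable view of rq->curr */
            p = rq->curr;
            in_kernel = task_pt_regs(p)->ct_flag;       /* hypothetical */
            raw_spin_unlock_irq(&rq->lock);

            if (!in_kernel)
                    return;          /* flag 0: quiescent state observed */

            /* flag 1: ask for a TIF_NOHZ callback and wait for it */
            set_tsk_thread_flag(p, TIF_NOHZ);
            wait_for_completion(&per_cpu(ct_qs_done, cpu));
    }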

On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled)
we could patch out those two instructions into NOPs, which would be
eliminated in the decoder.

Regarding the user/kernel execution time split measurement:

1) the same flag could be used to sample a remote CPU's statistics 
from another CPU and update the stats in the currently executing task. 
As long as there's at least one non-nohz-full CPU, this would work. Or 
are there systems where all CPUs are nohz-full?

2) Alternatively we could just drive user/kernel split statistics from 
context switches, which would be inaccurate if the workload is 
SCHED_FIFO that only rarely context switches.

How does this sound?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 01:12 PM, Ingo Molnar wrote:
> 
> * Rik van Riel <r...@redhat.com> wrote:
> 
>> On 05/01/2015 12:45 PM, Ingo Molnar wrote:
>>>
>>> * Rik van Riel <r...@redhat.com> wrote:
>>>
>>>> On 05/01/2015 12:37 PM, Ingo Molnar wrote:
>>>>
>>>>> Also note that this bit in context_tracking_enter():
>>>>>
>>>>> if (state == CONTEXT_USER) {
>>>>> trace_user_enter(0);
>>>>> vtime_user_enter(current);
>>>>> }
>>>>>
>>>>>
>>>>> is related to precise time measurement of user/kernel execution 
>>>>> times, it's not needed by the scheduler at all, it's just exported 
>>>>> to tooling. It's not fundamental to the scheduler.
>>>>
>>>> Any objections to the idea from the other thread to simply keep the 
>>>> time accumulating in buffers in local_clock() units, and only update 
>>>> the task vtime once a second or so?
>>>
>>> So I really think per syscall overhead is the wrong thing to do for 
>>> anything that a common distro enables.
>>>
>>> I see very little use for such precise, high-freq measurements on 
>>> normal systems - and abnormal systems could enable it dynamically just 
>>> like they can enable syscall auditing.
>>>
>>> I.e. I don't mind the occasional crazy piece of code, as long as it 
>>> does not hurt the innocent.
>>
>> Then how should/could we keep a rough idea of user / system / guest 
>> time when running without a periodic timer tick?
> 
> So I'd split the timer tick into two parts: just the constant-work 
> sampling bit that doesn't do much, and the variable-work part which 
> gets essentially shut off when the timeout is far into the future.
> 
> Then we could do IRQ driven sampling without introducing a variable 
> amount of jitter into hard-RT execution time.
> 
> I.e. much of what we do today, except that we could skip variable work 
> such as the scheduler tick or (unforced) RCU processing like the RCU 
> softirq work.

Any ideas how we could avoid that sampling timer interrupt
latency stacking up when dealing with both guest and host?

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:21 PM, Ingo Molnar wrote:
 
 * Andy Lutomirski l...@amacapital.net wrote:
 
 So what's the point? Why not remove this big source of overhead 
 altogether?

 The last time I asked, the impression I got was that we needed two 
 things:

 1. We can't pluck things from the RCU list without knowing whether 
 the CPU is in an RCU read-side critical section, and we can't know 
 that unless we have regular grade periods or we know that the CPU is 
 idle. To make the CPU detectably idle, we need to set a bit 
 somewhere.
 
 'Idle' as in 'executing pure user-space mode, without entering the 
 kernel and possibly doing an rcu_read_lock()', right?
 
 So we don't have to test it from the remote CPU: we could probe such 
 CPUs via a single low-overhead IPI. I'd much rather push such overhead 
 to sync_rcu() than to the syscall entry code!

 I can understand people running hard-RT workloads not wanting to see 
 the overhead of a timer tick or a scheduler tick with variable (and 
 occasionally heavy) work done in IRQ context, but the jitter caused by 
 a single trivial IPI with constant work should be very, very low and 
 constant.

Not if the realtime workload is running inside a KVM
guest.

At that point an IPI, either on the host or in the
guest, involves a full VMEXIT  VMENTER cycle.

I suspect it would be easy enough at user_enter() or
guest_enter() time to:
1) flip the bit
2) check to see if we have any callbacks queued on our on queue, and
3) if so, move the callbacks from that queue to the another CPU can
   take these queue

At that point, another CPU wanting to do an RCU grace period
advance can:
1) skip trying to advance the grace period on the CPU that
   is in userspace or guest mode, and
2) take the for somebody else callbacks of that CPU, and
   run them

This does not seem overly complicated.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:37 PM, Ingo Molnar wrote:

 Also note that this bit in context_tracking_enter():
 
 if (state == CONTEXT_USER) {
 trace_user_enter(0);
 vtime_user_enter(current);
 }
 
 
 is related to precise time measurement of user/kernel execution times, 
 it's not needed by the scheduler at all, it's just exported to 
 tooling. It's not fundamental to the scheduler.

Any objections to the idea from the other thread to simply keep
the time accumulating in buffers in local_clock() units, and
only update the task vtime once a second or so?

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Andy Lutomirski l...@amacapital.net wrote:

  So what's the point? Why not remove this big source of overhead 
  altogether?
 
 The last time I asked, the impression I got was that we needed two 
 things:
 
 1. We can't pluck things from the RCU list without knowing whether 
 the CPU is in an RCU read-side critical section, and we can't know 
 that unless we have regular grade periods or we know that the CPU is 
 idle. To make the CPU detectably idle, we need to set a bit 
 somewhere.

'Idle' as in 'executing pure user-space mode, without entering the 
kernel and possibly doing an rcu_read_lock()', right?

So we don't have to test it from the remote CPU: we could probe such 
CPUs via a single low-overhead IPI. I'd much rather push such overhead 
to sync_rcu() than to the syscall entry code!

I can understand people running hard-RT workloads not wanting to see 
the overhead of a timer tick or a scheduler tick with variable (and 
occasionally heavy) work done in IRQ context, but the jitter caused by 
a single trivial IPI with constant work should be very, very low and 
constant.

If user-space RT code does not tolerate _that_ kind of latencies then 
it really has its priorities wrong and we should not try to please it. 
It should not hurt the other 99.9% of sane hard-RT users.

And the other usecase, virtualization, obviously does not care and 
could take the IPI just fine.

 2. To suppress the timing tick, we need to get some timing for, um, 
 the scheduler?  I wasn't really sure about this one.

So we have variable timeslice timers for the scheduler implemented, 
they are off by default but they worked last someone tried them. See 
the 'HRTICK' scheduler feature.

And for SCHED_FIFO that timeout can be 'never' - i.e. essentially 
stopping the scheduler tick. (within reason.)

 Could we reduce the overhead by making the IN_USER vs IN_KERNEL 
 indication be a single bit and, worst case, an rdtsc and maybe a 
 subtraction?  We could probably get away with banning full nohz on 
 non-invariant tsc systems.
 
 (I do understand why it would be tricky to transition from IN_USER 
 to IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

We can make it literally zero overhead: by using an IPI from 
synchronize_rcu() and friend.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:03 PM, Andy Lutomirski wrote:

 The last time I asked, the impression I got was that we needed two things:
 
 1. We can't pluck things from the RCU list without knowing whether the
 CPU is in an RCU read-side critical section, and we can't know that
 unless we have regular grade periods or we know that the CPU is idle.
 To make the CPU detectably idle, we need to set a bit somewhere.

More than that. We also need a way for another CPU to identify the
callbacks they could run for us, without confusing them with new
callbacks queued after we transitioned back from USER to KERNEL
context.

 2. To suppress the timing tick, we need to get some timing for, um,
 the scheduler?  I wasn't really sure about this one.
 
 Could we reduce the overhead by making the IN_USER vs IN_KERNEL
 indication be a single bit and, worst case, an rdtsc and maybe a
 subtraction?  We could probably get away with banning full nohz on
 non-invariant tsc systems.

I suspect we can.

There is no need to update the vtime every single time
we enter vtime_user_enter and functions like it.

We can keep a buffer, which:
1) keeps values in TSC cycles (or whatever unit local_clock does)
2) is only ever accessed by the current task, so it requires no
   locking
3) values can be actually folded into vtime periodically, when
   they exceed a certain threshold (1 second ?)

That means the vtime_seqlock is something that we would only take
once a second or so, and the calculations in account_user_time
would only be done on a fairly infrequent basis. That has the
potential to reduce overhead by a lot.

If nobody has any objections to that kind of change, I would be
happy to implement it.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar mi...@kernel.org wrote:

 * Rik van Riel r...@redhat.com wrote:

  I.e. what's the baseline we are talking about?

 It's an astounding difference. This is not a kernel without
 nohz_full, just a CPU without nohz_full running the same kernel I
 tested with yesterday:

   run timesystem time
 vanilla   5.49s   2.08s
 __acct patch  5.21s   1.92s
 both patches  4.88s   1.71s
 CPU w/o nohz  3.12s   1.63s-- your numbers, mostly

 What is even more interesting is that the majority of the time
 difference seems to come from _user_ time, which has gone down from
 around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
 the CPU without nohz_full enabled...

 At syscall entry time, the nohz_full context tracking code is very
 straightforward. We check thread_info-flags 
 _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
 handles USER - KERNEL context transition.

 Syscall exit time is a convoluted mess. Both do_notify_resume and
 syscall_trace_leave call exit_user() on entry and enter_user() on
 exit, leaving the time spent looping around between int_with_check
 and syscall_return: in entry_64.S accounted as user time.

 I sent an email about this last night, it may be useful to add a
 third test  function call point to the syscall return code, where
 we can call user_enter() just ONCE, and remove the other context
 tracking calls from that loop.

 So what I'm wondering about is the big picture:

  - This is crazy big overhead in something as fundamental as system
calls!

  - We don't even have the excuse of the syscall auditing code, which
kind of has to run for every syscall if it wants to do its job!

  - [ The 'precise vtime' stuff that is driven from syscall entry/exit
  is crazy, and I hope not enabled in any distro. ]

  - So why are we doing this in every syscall time at all?

 Basically the whole point of user-context tracking is to be able to
 flush pending RCU callbacks. But that's crazy, we can sure defer a few
 kfree()s on this CPU, even indefinitely!

 If some other CPU does a sync_rcu(), then it can very well pluck those
 callbacks from this super low latency CPU's RCU lists (with due care)
 and go and free stuff itself ... There's no need to disturb this CPU
 for that!

 If user-space does not do anything kernel-ish then there won't be any
 new RCU callbacks piled up, so it's not like it's a resource leak
 issue either.

 So what's the point? Why not remove this big source of overhead
 altogether?

The last time I asked, the impression I got was that we needed two things:

1. We can't pluck things from the RCU list without knowing whether the
CPU is in an RCU read-side critical section, and we can't know that
unless we have regular grade periods or we know that the CPU is idle.
To make the CPU detectably idle, we need to set a bit somewhere.

2. To suppress the timing tick, we need to get some timing for, um,
the scheduler?  I wasn't really sure about this one.

Could we reduce the overhead by making the IN_USER vs IN_KERNEL
indication be a single bit and, worst case, an rdtsc and maybe a
subtraction?  We could probably get away with banning full nohz on
non-invariant tsc systems.

(I do understand why it would be tricky to transition from IN_USER to
IN_KERNEL with IRQs on.  Solvable, maybe, but tricky.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 On 05/01/2015 12:03 PM, Andy Lutomirski wrote:
 
  The last time I asked, the impression I got was that we needed two things:
  
  1. We can't pluck things from the RCU list without knowing whether the
  CPU is in an RCU read-side critical section, and we can't know that
  unless we have regular grade periods or we know that the CPU is idle.
  To make the CPU detectably idle, we need to set a bit somewhere.
 
 More than that. We also need a way for another CPU to identify the 
 callbacks they could run for us, without confusing them with new 
 callbacks queued after we transitioned back from USER to KERNEL 
 context.

That's easy: a simple, constant-work IPI sent to nohz-full CPUs could 
just dequeue these callbacks, if it sees that we are still in 
user-mode.

It's super easy because such an IPI would essentially be equivalent to 
a system call context from RCU's POV, if it uses this test:

if (user_mode(regs)) {
... pluck RCU callbacks ...
}

That way we can have lots of overhead removed from the syscall call 
path ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

  I can understand people running hard-RT workloads not wanting to 
  see the overhead of a timer tick or a scheduler tick with variable 
  (and occasionally heavy) work done in IRQ context, but the jitter 
  caused by a single trivial IPI with constant work should be very, 
  very low and constant.
 
 Not if the realtime workload is running inside a KVM guest.

I don't buy this:

 At that point an IPI, either on the host or in the guest, involves a 
 full VMEXIT  VMENTER cycle.

So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
usec on recent hardware, and I bet it will get better with time.

I'm not aware of any hard-RT workload that cannot take 1 usec 
latencies.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 On 05/01/2015 12:37 PM, Ingo Molnar wrote:
 
  Also note that this bit in context_tracking_enter():
  
  if (state == CONTEXT_USER) {
  trace_user_enter(0);
  vtime_user_enter(current);
  }
  
  
  is related to precise time measurement of user/kernel execution 
  times, it's not needed by the scheduler at all, it's just exported 
  to tooling. It's not fundamental to the scheduler.
 
 Any objections to the idea from the other thread to simply keep the 
 time accumulating in buffers in local_clock() units, and only update 
 the task vtime once a second or so?

So I really think per syscall overhead is the wrong thing to do for 
anything that a common distro enables.

I see very little use for such precise, high-freq measurements on 
normal systems - and abnormal systems could enable it dynamically just 
like they can enable syscall auditing.

I.e. I don't mind the occasional crazy piece of code, as long as it 
does not hurt the innocent.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:45 PM, Ingo Molnar wrote:
 
 * Rik van Riel r...@redhat.com wrote:
 
 On 05/01/2015 12:37 PM, Ingo Molnar wrote:

 Also note that this bit in context_tracking_enter():

 if (state == CONTEXT_USER) {
 trace_user_enter(0);
 vtime_user_enter(current);
 }


 is related to precise time measurement of user/kernel execution 
 times, it's not needed by the scheduler at all, it's just exported 
 to tooling. It's not fundamental to the scheduler.

 Any objections to the idea from the other thread to simply keep the 
 time accumulating in buffers in local_clock() units, and only update 
 the task vtime once a second or so?
 
 So I really think per syscall overhead is the wrong thing to do for 
 anything that a common distro enables.
 
 I see very little use for such precise, high-freq measurements on 
 normal systems - and abnormal systems could enable it dynamically just 
 like they can enable syscall auditing.
 
 I.e. I don't mind the occasional crazy piece of code, as long as it 
 does not hurt the innocent.

Then how should/could we keep a rough idea of user / system / guest
time when running without a periodic timer tick?

These statistics are useful, and gathering them in a low impact
way (local_clock read + variable read + subtraction + addition +
store) may be acceptable overhead.

This is a few function calls, one tsc read, a possible cache
line miss, and an unlikely branch.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 On 05/01/2015 12:45 PM, Ingo Molnar wrote:
  
  * Rik van Riel r...@redhat.com wrote:
  
  On 05/01/2015 12:37 PM, Ingo Molnar wrote:
 
  Also note that this bit in context_tracking_enter():
 
  if (state == CONTEXT_USER) {
  trace_user_enter(0);
  vtime_user_enter(current);
  }
 
 
  is related to precise time measurement of user/kernel execution 
  times, it's not needed by the scheduler at all, it's just exported 
  to tooling. It's not fundamental to the scheduler.
 
  Any objections to the idea from the other thread to simply keep the 
  time accumulating in buffers in local_clock() units, and only update 
  the task vtime once a second or so?
  
  So I really think per syscall overhead is the wrong thing to do for 
  anything that a common distro enables.
  
  I see very little use for such precise, high-freq measurements on 
  normal systems - and abnormal systems could enable it dynamically just 
  like they can enable syscall auditing.
  
  I.e. I don't mind the occasional crazy piece of code, as long as it 
  does not hurt the innocent.
 
 Then how should/could we keep a rough idea of user / system / guest 
 time when running without a periodic timer tick?

So I'd split the timer tick into two parts: just the constant-work 
sampling bit that doesn't do much, and the variable-work part which 
gets essentially shut off when the timeout is far into the future.

Then we could do IRQ driven sampling without introducing variable 
amount jitter into hard-RT execution time.

I.e. much of what we do today, except that we could skip variable work 
such as the scheduler tick or (unforced) RCU processing like the RCU 
softirq work.
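As a rough sketch of that split (all names here are made up; this
only illustrates the constant-work vs. variable-work separation):

struct tick_sample {
        unsigned long long when;        /* timestamp of the sample */
        int user;                       /* was the CPU in user mode? */
};

static struct tick_sample last_sample;

/* Constant-work part: bounded bookkeeping, tolerable on RT CPUs. */
static void sampling_tick(unsigned long long now, int user_mode)
{
        last_sample.when = now;
        last_sample.user = user_mode;
}

/* Variable-work part: skipped when the next timeout is far away. */
static void full_tick(unsigned long long now, int user_mode)
{
        sampling_tick(now, user_mode);
        /* scheduler tick, (unforced) RCU processing, softirqs, ... */
}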

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 AM, Ingo Molnar wrote:

>> This patch builds on top of these patches by Paolo:
>> https://lkml.org/lkml/2015/4/28/188
>> https://lkml.org/lkml/2015/4/29/139
>>
>> Together with this patch I posted earlier this week, the syscall path
>> on a nohz_full cpu seems to be about 10% faster.
>> https://lkml.org/lkml/2015/4/24/394
>>
>> My test is a simple microbenchmark that calls getpriority() in a loop
>> 10 million times:
>>
>>                 run time    system time
>> vanilla         5.49s       2.08s
>> __acct patch    5.21s       1.92s
>> both patches    4.88s       1.71s
> 
> Just curious, what are the numbers if you don't have context tracking 
> enabled, i.e. without nohz_full?
> 
> I.e. what's the baseline we are talking about?

It's an astounding difference. This is not a kernel without nohz_full,
just a CPU without nohz_full running the same kernel I tested with
yesterday:

                run time    system time
vanilla         5.49s       2.08s
__acct patch    5.21s       1.92s
both patches    4.88s       1.71s
CPU w/o nohz    3.12s       1.63s   <-- your numbers, mostly

What is even more interesting is that the majority of the time
difference seems to come from _user_ time, which has gone down
from around 3.4 seconds in the vanilla kernel to around 1.5 seconds
on the CPU without nohz_full enabled...

At syscall entry time, the nohz_full context tracking code is very
straightforward. We check thread_info->flags & _TIF_WORK_SYSCALL_ENTRY,
and call syscall_trace_enter_phase1, which handles the USER -> KERNEL
context transition.

Syscall exit time is a convoluted mess. Both do_notify_resume and
syscall_trace_leave call user_exit() on entry and user_enter()
on exit, leaving the time spent looping around between int_with_check
and syscall_return: in entry_64.S accounted as user time.

I sent an email about this last night; it may be useful to add a
third test & function call point to the syscall return code, where
we can call user_enter() just ONCE, and remove the other context
tracking calls from that loop.
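For reference, a minimal userspace sketch of such a microbenchmark;
the loop count matches the description above, while the use of
getrusage() for timing is an assumption:

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(void)
{
        struct rusage start, end;

        getrusage(RUSAGE_SELF, &start);

        /* 10 million cheap, read-only syscalls */
        for (int i = 0; i < 10 * 1000 * 1000; i++)
                getpriority(PRIO_PROCESS, 0);

        getrusage(RUSAGE_SELF, &end);

        double sys = (end.ru_stime.tv_sec - start.ru_stime.tv_sec) +
                     (end.ru_stime.tv_usec - start.ru_stime.tv_usec) / 1e6;
        printf("system time: %.2fs\n", sys);
        return 0;
}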

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Andy Lutomirski
On Fri, May 1, 2015 at 12:11 PM, Rik van Riel r...@redhat.com wrote:
> On 05/01/2015 02:40 PM, Ingo Molnar wrote:
>>
>> Or we could do that in the syscall path with a single store of a
>> constant flag to a location in the task struct. We have a number of
>> natural flags that get written on syscall entry, such as:
>>
>>         pushq_cfi $__USER_DS    /* pt_regs->ss */
>>
>> That goes to a constant location on the kernel stack. On return from
>> system calls we could write 0 to that location.

Huh?  IRET with zero there will fault, and we can't guarantee that all
syscalls return using sysret.  Also, we'd have to audit all the entries,
and system_call_after_swapgs currently enables interrupts early enough
that an interrupt before all the pushes will do unpredictable things to
pt_regs.

We could further abuse orig_ax, but that would require a lot of
auditing.  Honestly, though, I think keeping a flag in an
otherwise-hot cache line is a better bet.  Also, making it per-cpu
instead of per-task will probably be easier on the RCU code, since
otherwise the RCU code will have to do some kind of synchronization
(even if it's a wait-free probe of the rq lock or similar) to figure
out which pt_regs to look at.

If we went that route, I'd advocate sticking the flag in tss->sp1.
That cacheline is unconditionally read on kernel entry already, and
sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I
lost track). [1]

Alternatively, I'd suggest that we actually add a whole new word to pt_regs.

[1] It's not unconditionally accessed yet, but it will be once Denys'
latest patches are in.

--Andy


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

>> I.e. much of what we do today, except that we could skip variable 
>> work such as the scheduler tick or (unforced) RCU processing like 
>> the RCU softirq work.
> 
> Any ideas how we could avoid that sampling timer interrupt latency 
> stacking up when dealing with both guest and host?

Well, it would be host_latency+guest_latency worst-case, right?

That should still be bounded, as long as both guest and host are 
running in nohz-full mode.

Also, technically when the host gets such an IRQ, the guest is 
interrupted as well. So the host could do the guest's work and pass 
through the result via paravirt or so.

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 02:40 PM, Ingo Molnar wrote:
> 
> Or we could do that in the syscall path with a single store of a 
> constant flag to a location in the task struct. We have a number of 
> natural flags that get written on syscall entry, such as:
> 
>         pushq_cfi $__USER_DS    /* pt_regs->ss */
> 
> That goes to a constant location on the kernel stack. On return from 
> system calls we could write 0 to that location.
> 
> So the remote CPU would have to do a read of this location. There are 
> two cases:
> 
>  - If it's 0, then it has observed a quiescent state on that CPU. (It 
>    does not have to be atomic anymore, as we'd only observe the value 
>    and MESI coherency takes care of it.)

That should do the trick.

>  - If it's not 0 then the remote CPU is not executing user-space code 
>    and we can install (remotely) a TIF_NOHZ flag in it and expect it 
>    to process it either on return to user-space or on a context 
>    switch.

I may have to think about this a little more, but it seems like it
should work.

Can we use a separate byte in the flags word for flags that can get
set remotely, so we can do stores and clearing of local-only flags
without atomic instructions?
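As a rough sketch of that layout, with C11 atomics standing in for
kernel primitives (purely illustrative, not the actual thread_info
layout):

#include <stdatomic.h>

struct ti_flags {
        unsigned char local;              /* written only by the owning CPU */
        _Atomic unsigned char remote;     /* other CPUs may set bits here */
};

static inline void set_local_flag(struct ti_flags *f, unsigned char bit)
{
        f->local |= bit;                  /* plain RMW, no lock prefix */
}

static inline void set_remote_flag(struct ti_flags *f, unsigned char bit)
{
        atomic_fetch_or(&f->remote, bit); /* atomic RMW required */
}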

> This way, unless I'm missing something, the overhead is reduced to a 
> single store to a hot cacheline on return-to-userspace - an 
> instruction which, if we place it well, might as well be close to 
> zero cost. No syscall entry cost. Slow-return cost only in the 
> (rare) case of someone using synchronize_rcu().

I think that should take care of the RCU aspect of
nohz_full.

-- 
All rights reversed


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

> On 05/01/2015 12:34 PM, Ingo Molnar wrote:
>> 
>> * Rik van Riel r...@redhat.com wrote:
>> 
>>>> I can understand people running hard-RT workloads not wanting to 
>>>> see the overhead of a timer tick or a scheduler tick with variable 
>>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>>> caused by a single trivial IPI with constant work should be very, 
>>>> very low and constant.
>>>
>>> Not if the realtime workload is running inside a KVM guest.
>> 
>> I don't buy this:
>> 
>>> At that point an IPI, either on the host or in the guest, involves a 
>>> full VMEXIT & VMENTER cycle.
>> 
>> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
>> usec on recent hardware, and I bet it will get better with time.
>> 
>> I'm not aware of any hard-RT workload that cannot take 1 usec 
>> latencies.
> 
> Now think about doing this kind of IPI from inside a guest, to 
> another VCPU on the same guest.
> 
> Now you are looking at VMEXIT/VMENTER on the first VCPU,

Does it matter? It's not the hard-RT CPU, and this is a slowpath of 
synchronize_rcu().

> plus the cost of the IPI on the host, plus the cost of the emulation 
> layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI 
> work, and possibly a second VMEXIT/VMENTER for IPI completion.

Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies.

> I suspect it would be better to do RCU callback offload in some 
> other way.

Well, it's not just about callback offload, but it's about the basic 
synchronization guarantee of synchronize_rcu(): that all RCU read-side 
critical sections have finished executing after the call returns.

So even if a nohz-full CPU never actually queues a callback, it needs 
to stop using resources that a synchronize_rcu() caller expects it to 
stop using.

We can do that only if we know, in an SMP-coherent way, that the 
remote CPU is not in an rcu_read_lock() section.

Sending an IPI is one way to achieve that.

Or we could do that in the syscall path with a single store of a 
constant flag to a location in the task struct. We have a number of 
natural flags that get written on syscall entry, such as:

        pushq_cfi $__USER_DS    /* pt_regs->ss */

That goes to a constant location on the kernel stack. On return from 
system calls we could write 0 to that location.

So the remote CPU would have to do a read of this location. There are 
two cases:

 - If it's 0, then it has observed a quiescent state on that CPU. (It 
   does not have to be atomic anymore, as we'd only observe the value 
   and MESI coherency takes care of it.)

 - If it's not 0 then the remote CPU is not executing user-space code 
   and we can install (remotely) a TIF_NOHZ flag in it and expect it 
   to process it either on return to user-space or on a context 
   switch.

This way, unless I'm missing something, the overhead is reduced to a 
single store to a hot cacheline on return-to-userspace - an 
instruction which, if we place it well, might as well be close to 
zero cost. No syscall entry cost. Slow-return cost only in the 
(rare) case of someone using synchronize_rcu().
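A sketch of the remote observer's two cases, with C11 atomics
standing in for the kernel's accessors and all names made up:

#include <stdatomic.h>
#include <stdbool.h>

#define TIF_NOHZ (1UL << 0)              /* illustrative bit value */

struct remote_task {
        atomic_ulong saved_ss;           /* __USER_DS in kernel, 0 in user */
        atomic_ulong tif_flags;          /* flags word, remotely settable */
};

static bool observed_quiescent(struct remote_task *t)
{
        /* Case 1: 0 means the CPU is back in user space, i.e. it has
         * passed a quiescent state; a plain load is enough, cache
         * coherency does the rest. */
        if (atomic_load_explicit(&t->saved_ss, memory_order_acquire) == 0)
                return true;

        /* Case 2: still in the kernel; set TIF_NOHZ remotely and let
         * it report on return to user-space or at a context switch. */
        atomic_fetch_or(&t->tif_flags, TIF_NOHZ);
        return false;
}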

Hm?

Thanks,

Ingo


Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-05-01 Thread Rik van Riel
On 05/01/2015 12:34 PM, Ingo Molnar wrote:
> 
> * Rik van Riel r...@redhat.com wrote:
> 
>>> I can understand people running hard-RT workloads not wanting to 
>>> see the overhead of a timer tick or a scheduler tick with variable 
>>> (and occasionally heavy) work done in IRQ context, but the jitter 
>>> caused by a single trivial IPI with constant work should be very, 
>>> very low and constant.
>>
>> Not if the realtime workload is running inside a KVM guest.
> 
> I don't buy this:
> 
>> At that point an IPI, either on the host or in the guest, involves a 
>> full VMEXIT & VMENTER cycle.
> 
> So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 
> usec on recent hardware, and I bet it will get better with time.
> 
> I'm not aware of any hard-RT workload that cannot take 1 usec 
> latencies.

Now think about doing this kind of IPI from inside a guest,
to another VCPU on the same guest.

Now you are looking at VMEXIT/VMENTER on the first VCPU,
plus the cost of the IPI on the host, plus the cost of
the emulation layer, plus VMEXIT/VMENTER on the second
VCPU to trigger the IPI work, and possibly a second
VMEXIT/VMENTER for IPI completion.

I suspect it would be better to do RCU callback offload
in some other way.

-- 
All rights reversed

