Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > > on return from idle we'd do: > > > > rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */ > > > > etc. On all boundary transitions we can use a constant ADD on a > > suitable percpu variable. > > Sounds good to me, except that we need to be careful to distinguish > between non-syscall entries from quiescent states > and non-syscall entries from quiescent states. It might be hard to make that distinction! ;-) I suspect you wanted to raise the issue of various contexts nesting on each other, such as syscall triggering a page fault, which gets an irq nested, etc. - versus non-nested contexts such as user-space triggering a page fault or user-space getting an irq? > [...] We could save the old state (as the current exception_enter > code does) or we could allocate enough low bits for the state that > the problem goes away. So I think, if it all works in practice just as well as it does in email, we might be better off with more state bits: that would tell any remote statistics/sampling code more as well. It might also be easier to patch in/out, because this kind of state tracking will affect non-RT CPUs as well. (Later on we could do a separate IDT for RT CPUs as well, with a patched version of the entry code.) > I don't think the TIF_RCU_QS variant is worthwhile -- merging the > counter and state is probably both easier and faster. Yeah. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > >> On May 8, 2015 12:07 PM, "Ingo Molnar" wrote: >> > >> > >> > * Andy Lutomirski wrote: >> > >> > > > So do you mean: >> > > > >> > > >this_cpu_set(rcu_state) = IN_KERNEL; >> > > >... >> > > >this_cpu_inc(rcu_qs_ctr); >> > > >this_cpu_set(rcu_state) = IN_USER; >> > > > >> > > > ? >> > > > >> > > > So in your proposal we'd have an INC and two MOVs. I think we can make >> > > > it just two simple stores into a byte flag, one on entry and one on >> > > > exit: >> > > > >> > > >this_cpu_set(rcu_state) = IN_KERNEL; >> > > >... >> > > >this_cpu_set(rcu_state) = IN_USER; >> > > > >> > > >> > > I was thinking that either a counter or a state flag could make sense. >> > > Doing both would be pointless. The counter could use the low bits to >> > > indicate the state. The benefit of the counter would be that the >> > > RCU-waiting CPU could observe that the counter has incremented and >> > > that therefore a grace period has elapsed. Getting it right would >> > > require lots of care. >> > >> > So if you mean: >> > >> > >> >... >> >this_cpu_inc(rcu_qs_ctr); >> > >> > >> > I don't see how this would work reliably: how do you handle the case >> > of a SCHED_FIFO task never returning from user-space (common technique >> > in RT apps)? synchronize_rcu() would block indefinitely as it would >> > never see rcu_qs_ctr increase. >> > >> > We have to be able to observe user-mode anyway, for system-time >> > statistics purposes, and that flag could IMHO also drive the RCU GC >> > machinery. >> >> The counter would have to be fancier than that to work. We could >> say that an even value means user mode, for example. IOW the high >> bits of the counter would count transitions to quiescent and the low >> bits would indicate the state. >> >> Actually, I'd count backwards. We could use an andl instruction to >> go to user mode and a decl to go to kernel mode. > > at which point it's not really a monotonic quiescent state counter - > which your naming suggested before. > > Also it would have to be done both at entry and at exit. > > But yes, if you replace a state flag with a recursive state flag that > is INC/DEC maintained that would work as well. That's not a counter > though. I won't quibble about the name :) > >> > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets >> > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET >> > > because stores can be delayed. (It could even happen after >> > > sysret, IIRC, but IRET is serializing.) >> > >> > All it has to do in the synchronize_rcu() slowpath is something >> > like: >> >> I don't think this works. Suppose rt_cpu starts in kernel mode. >> Then it checks ti flags. >> >> > >> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { >> > smp_mb__before_atomic(); >> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); >> > smp_rmb(); >> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) >> >> Now it sets rcu_state and exits to user mode. It never notices >> TIF_RCU_NOTIFY. > > Indeed, you are right, that's racy. > >> [...] We could fix this by sending an IPI to kick it. > > With an IPI we won't need any flags or counters on the RT CPU - it's > the IPI we are trying to avoid. > > So how about the following way to fix the race: simply do a poll loop > with a fixed timeout sleep: like it would do anyway if > synchronize_rcu() was waiting for the timer irq to end the grace > period on the RT-CPU. Seems reasonable to me. > > The TIF flag would cause an RCU grace period to lapse, no need to wake > up the synchronize_rcu() side: it will notice the (regular) increased > RCU counter. > >> > We are talking about a dozen cycles, while a typical >> > synchronize_rcu() will wait millions (sometimes billions) of >> > cycles. There's absolutely zero performance concern here and it's >> > all CPU local in any case. >> >> The barrier would affect all syscalls, though. [...] > > What barrier? I never suggested any barrier in the syscall fast path, > this bit: Oh, I misunderstood. > >> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { >> > smp_mb__before_atomic(); >> > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); >> > smp_rmb(); >> > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) > > runs inside synchronize_rcu() or so. > > But none of that is needed if we do: > > - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context > entry/exit time, straight in the assembly > > - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does > nothing but: this_cpu_inc(rcu_qs_ctr). > > - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and > then does a timeout-poll-loop with jiffies granular > timeouts, simulating a timer
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > On May 8, 2015 12:07 PM, "Ingo Molnar" wrote: > > > > > > * Andy Lutomirski wrote: > > > > > > So do you mean: > > > > > > > >this_cpu_set(rcu_state) = IN_KERNEL; > > > >... > > > >this_cpu_inc(rcu_qs_ctr); > > > >this_cpu_set(rcu_state) = IN_USER; > > > > > > > > ? > > > > > > > > So in your proposal we'd have an INC and two MOVs. I think we can make > > > > it just two simple stores into a byte flag, one on entry and one on > > > > exit: > > > > > > > >this_cpu_set(rcu_state) = IN_KERNEL; > > > >... > > > >this_cpu_set(rcu_state) = IN_USER; > > > > > > > > > > I was thinking that either a counter or a state flag could make sense. > > > Doing both would be pointless. The counter could use the low bits to > > > indicate the state. The benefit of the counter would be that the > > > RCU-waiting CPU could observe that the counter has incremented and > > > that therefore a grace period has elapsed. Getting it right would > > > require lots of care. > > > > So if you mean: > > > > > >... > >this_cpu_inc(rcu_qs_ctr); > > > > > > I don't see how this would work reliably: how do you handle the case > > of a SCHED_FIFO task never returning from user-space (common technique > > in RT apps)? synchronize_rcu() would block indefinitely as it would > > never see rcu_qs_ctr increase. > > > > We have to be able to observe user-mode anyway, for system-time > > statistics purposes, and that flag could IMHO also drive the RCU GC > > machinery. > > The counter would have to be fancier than that to work. We could > say that an even value means user mode, for example. IOW the high > bits of the counter would count transitions to quiescent and the low > bits would indicate the state. > > Actually, I'd count backwards. We could use an andl instruction to > go to user mode and a decl to go to kernel mode. at which point it's not really a monotonic quiescent state counter - which your naming suggested before. Also it would have to be done both at entry and at exit. But yes, if you replace a state flag with a recursive state flag that is INC/DEC maintained that would work as well. That's not a counter though. > > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets > > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET > > > because stores can be delayed. (It could even happen after > > > sysret, IIRC, but IRET is serializing.) > > > > All it has to do in the synchronize_rcu() slowpath is something > > like: > > I don't think this works. Suppose rt_cpu starts in kernel mode. > Then it checks ti flags. > > > > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { > > smp_mb__before_atomic(); > > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); > > smp_rmb(); > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) > > Now it sets rcu_state and exits to user mode. It never notices > TIF_RCU_NOTIFY. Indeed, you are right, that's racy. > [...] We could fix this by sending an IPI to kick it. With an IPI we won't need any flags or counters on the RT CPU - it's the IPI we are trying to avoid. So how about the following way to fix the race: simply do a poll loop with a fixed timeout sleep: like it would do anyway if synchronize_rcu() was waiting for the timer irq to end the grace period on the RT-CPU. The TIF flag would cause an RCU grace period to lapse, no need to wake up the synchronize_rcu() side: it will notice the (regular) increased RCU counter. > > We are talking about a dozen cycles, while a typical > > synchronize_rcu() will wait millions (sometimes billions) of > > cycles. There's absolutely zero performance concern here and it's > > all CPU local in any case. > > The barrier would affect all syscalls, though. [...] What barrier? I never suggested any barrier in the syscall fast path, this bit: > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { > > smp_mb__before_atomic(); > > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); > > smp_rmb(); > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) runs inside synchronize_rcu() or so. But none of that is needed if we do: - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context entry/exit time, straight in the assembly - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does nothing but: this_cpu_inc(rcu_qs_ctr). - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and then does a timeout-poll-loop with jiffies granular timeouts, simulating a timer IRQ in essence. It will either observe IN_USER or will see the regular RCU qs counter increase. this should be race free. Alternatively we can even get rid of the TIF flag by merging the percpu counter with the percpu state. Quiescent states
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On May 8, 2015 12:07 PM, "Ingo Molnar" wrote: > > > * Andy Lutomirski wrote: > > > > So do you mean: > > > > > >this_cpu_set(rcu_state) = IN_KERNEL; > > >... > > >this_cpu_inc(rcu_qs_ctr); > > >this_cpu_set(rcu_state) = IN_USER; > > > > > > ? > > > > > > So in your proposal we'd have an INC and two MOVs. I think we can make > > > it just two simple stores into a byte flag, one on entry and one on > > > exit: > > > > > >this_cpu_set(rcu_state) = IN_KERNEL; > > >... > > >this_cpu_set(rcu_state) = IN_USER; > > > > > > > I was thinking that either a counter or a state flag could make sense. > > Doing both would be pointless. The counter could use the low bits to > > indicate the state. The benefit of the counter would be that the > > RCU-waiting CPU could observe that the counter has incremented and > > that therefore a grace period has elapsed. Getting it right would > > require lots of care. > > So if you mean: > > >... >this_cpu_inc(rcu_qs_ctr); > > > I don't see how this would work reliably: how do you handle the case > of a SCHED_FIFO task never returning from user-space (common technique > in RT apps)? synchronize_rcu() would block indefinitely as it would > never see rcu_qs_ctr increase. > > We have to be able to observe user-mode anyway, for system-time > statistics purposes, and that flag could IMHO also drive the RCU GC > machinery. The counter would have to be fancier than that to work. We could say that an even value means user mode, for example. IOW the high bits of the counter would count transitions to quiescent and the low bits would indicate the state. Actually, I'd count backwards. We could use an andl instruction to go to user mode and a decl to go to kernel mode. > > > > > The problem is that I don't see how TIF_RCU_THINGY can work > > > > reliably. If the remote CPU sets it, it'll be too late and we'll > > > > still enter user mode without seeing it. If it's just an > > > > optimization, though, then it should be fine. > > > > > > Well, after setting it, the remote CPU has to re-check whether the > > > RT CPU has entered user-mode - before it goes to wait. > > > > How? > > > > Suppose the exit path looked like: > > > > this_cpu_write(rcu_state, IN_USER); > > > > if (ti->flags & _TIF_RCU_NOTIFY) { > > if (test_and_clear_bit(TIF_RCU_NOTIFY, >flags)) > > slow_notify_rcu_that_we_are_exiting(); > > } > > > > iret or sysret; > > No, it would look like this: > >this_cpu_write(rcu_state, IN_USER); >iret or sysret; > > I.e. IN_USER is set well after all notifications are checked. No > kernel execution happens afterwards. (No extra checks added - the > regular return-to-user-work checks would handle TIF_RCU_NOTIFY.) > > ( Same goes for idle: we just mark it IN_IDLE and move it back to > IN_KERNEL after the idling ends. ) > > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET > > because stores can be delayed. (It could even happen after sysret, > > IIRC, but IRET is serializing.) > > All it has to do in the synchronize_rcu() slowpath is something like: I don't think this works. Suppose rt_cpu starts in kernel mode. Then it checks ti flags. > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { > smp_mb__before_atomic(); > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); > smp_rmb(); > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) Now it sets rcu_state and exits to user mode. It never notices TIF_RCU_NOTIFY. We could fix this by sending an IPI to kick it. > ... go wait ... > } > /* Cool, we observed quiescent state: */ > > The cost of the trivial barrier is nothing compared to the 'go wait' > cost which we will pay in 99.9% of the cases! > > > If we put an mfence after this_cpu_set or did an unconditional > > test_and_clear_bit on ti->flags then this problem goes away, but > > that would probably be slower than we'd like. > > We are talking about a dozen cycles, while a typical synchronize_rcu() > will wait millions (sometimes billions) of cycles. There's absolutely > zero performance concern here and it's all CPU local in any case. > The barrier would affect all syscalls, though. Admittedly that's still much cheaper than the current thing, but given that we can round trip a syscall in 110 cycles, we'd take at least 10% overhead. My preference would be to use a counter and skip the barrier. If the waiting CPU polled the counter, then, even if it lost this race, it would be guaranteed to see the counter change reasonably soon as long as no CPU implementation sits on its store buffer forever. > In fact a user-mode/kernel-mode flag speeds up naive implementations > of synchronize_rcu(): because it's able to observe extended quiescent > state immediately, without having to wait for a counter to increase >
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > On 05/07/2015 08:29 AM, Ingo Molnar wrote: > > > > * Frederic Weisbecker wrote: > > > We cannot take the lock_trace(task) from irq context, and we > probably do not need to anyway, since we do not care about a > precise stack trace for the task. > >>> > >>> So one worry with this and similar approaches of statistically > >>> detecting user mode would be the fact that on the way out to > >>> user-space we don't really destroy the previous call trace - we > >>> just pop off the stack (non-destructively), restore RIPs and are > >>> gone. > >>> > >>> We'll need that percpu flag I suspect. > >> > >> Note we have the context tracking state which tells where the > >> current task is: user/system/guest. > > > > Yes, but that overhead is what I'm suggesting we get rid of, I thought > > Rik was trying to find a mechanism that would be independent of that? > > One thing at a time :) > > I am working on the timer sampling stuff, which should be easy to > adapt to a different user/system/guest/irq/softirq/... tracking > thing, if somebody else comes up with a more efficient way to do > that. So if you make the timer sampling use a percpu variable, and set that variable from the existing callbacks, then we could do this gradually: first the timer sampling uses the flag, then RCU could use it, and finally we could push it out to minimal assembly code. But it's important to start out with a percpu flag to track this all. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > > So do you mean: > > > >this_cpu_set(rcu_state) = IN_KERNEL; > >... > >this_cpu_inc(rcu_qs_ctr); > >this_cpu_set(rcu_state) = IN_USER; > > > > ? > > > > So in your proposal we'd have an INC and two MOVs. I think we can make > > it just two simple stores into a byte flag, one on entry and one on > > exit: > > > >this_cpu_set(rcu_state) = IN_KERNEL; > >... > >this_cpu_set(rcu_state) = IN_USER; > > > > I was thinking that either a counter or a state flag could make sense. > Doing both would be pointless. The counter could use the low bits to > indicate the state. The benefit of the counter would be that the > RCU-waiting CPU could observe that the counter has incremented and > that therefore a grace period has elapsed. Getting it right would > require lots of care. So if you mean: ... this_cpu_inc(rcu_qs_ctr); I don't see how this would work reliably: how do you handle the case of a SCHED_FIFO task never returning from user-space (common technique in RT apps)? synchronize_rcu() would block indefinitely as it would never see rcu_qs_ctr increase. We have to be able to observe user-mode anyway, for system-time statistics purposes, and that flag could IMHO also drive the RCU GC machinery. > > > The problem is that I don't see how TIF_RCU_THINGY can work > > > reliably. If the remote CPU sets it, it'll be too late and we'll > > > still enter user mode without seeing it. If it's just an > > > optimization, though, then it should be fine. > > > > Well, after setting it, the remote CPU has to re-check whether the > > RT CPU has entered user-mode - before it goes to wait. > > How? > > Suppose the exit path looked like: > > this_cpu_write(rcu_state, IN_USER); > > if (ti->flags & _TIF_RCU_NOTIFY) { > if (test_and_clear_bit(TIF_RCU_NOTIFY, >flags)) > slow_notify_rcu_that_we_are_exiting(); > } > > iret or sysret; No, it would look like this: this_cpu_write(rcu_state, IN_USER); iret or sysret; I.e. IN_USER is set well after all notifications are checked. No kernel execution happens afterwards. (No extra checks added - the regular return-to-user-work checks would handle TIF_RCU_NOTIFY.) ( Same goes for idle: we just mark it IN_IDLE and move it back to IN_KERNEL after the idling ends. ) > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET > because stores can be delayed. (It could even happen after sysret, > IIRC, but IRET is serializing.) All it has to do in the synchronize_rcu() slowpath is something like: if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) ... go wait ... } /* Cool, we observed quiescent state: */ The cost of the trivial barrier is nothing compared to the 'go wait' cost which we will pay in 99.9% of the cases! > If we put an mfence after this_cpu_set or did an unconditional > test_and_clear_bit on ti->flags then this problem goes away, but > that would probably be slower than we'd like. We are talking about a dozen cycles, while a typical synchronize_rcu() will wait millions (sometimes billions) of cycles. There's absolutely zero performance concern here and it's all CPU local in any case. In fact a user-mode/kernel-mode flag speeds up naive implementations of synchronize_rcu(): because it's able to observe extended quiescent state immediately, without having to wait for a counter to increase (which was always the classic way to observe grace periods). If all CPUs are in user mode or are idle (which is rather common!) then synchronize_rcu() could return almost immediately - while previously it had to wait for scheduling or periodic timer irqs to trigger on all CPUs - adding many millisecs of delay even in the best of cases. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Thu, May 07, 2015 at 02:49:39PM +0200, Ingo Molnar wrote: > > * Ingo Molnar wrote: > > > The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() > > (being executed on some other CPU not doing RT work) to > > intelligently wait for the remote (RT work doing) CPU to finish > > executing kernel code, without polling or so. > > it's basically a cheap IPI being inserted on the remote CPU. > > We need the TIF_RCU_QS callback not just to wait intelligently, but > mainly to elapse a grace period, otherwise synchronize_rcu() might not > ever make progress: think a SCHED_FIFO task doing some kernel work, > synchronize_rcu() stumbling upon it - but the SCHED_FIFO task > otherwise never scheduling and never getting any timer irqs either, > and thus never entering quiescent state. > > (Cc:-ed Paul too, he might be interested in this as well.) Hmmm... So the point is that a NO_HZ_FULL CPU periodically posts callbacks to indicate that it has passed through a quiescent state, for example, upon entry to and/or exit from userspace? These callbacks would then be offloaded to some other CPU. But the callback would not be invoked until RCU saw a grace period, so I must be missing something here... Probably that the TIF_RCU_QS callback is not an RCU callback, but something else? Thanx, Paul -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Thu, May 07, 2015 at 02:49:39PM +0200, Ingo Molnar wrote: * Ingo Molnar mi...@kernel.org wrote: The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being executed on some other CPU not doing RT work) to intelligently wait for the remote (RT work doing) CPU to finish executing kernel code, without polling or so. it's basically a cheap IPI being inserted on the remote CPU. We need the TIF_RCU_QS callback not just to wait intelligently, but mainly to elapse a grace period, otherwise synchronize_rcu() might not ever make progress: think a SCHED_FIFO task doing some kernel work, synchronize_rcu() stumbling upon it - but the SCHED_FIFO task otherwise never scheduling and never getting any timer irqs either, and thus never entering quiescent state. (Cc:-ed Paul too, he might be interested in this as well.) Hmmm... So the point is that a NO_HZ_FULL CPU periodically posts callbacks to indicate that it has passed through a quiescent state, for example, upon entry to and/or exit from userspace? These callbacks would then be offloaded to some other CPU. But the callback would not be invoked until RCU saw a grace period, so I must be missing something here... Probably that the TIF_RCU_QS callback is not an RCU callback, but something else? Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; I was thinking that either a counter or a state flag could make sense. Doing both would be pointless. The counter could use the low bits to indicate the state. The benefit of the counter would be that the RCU-waiting CPU could observe that the counter has incremented and that therefore a grace period has elapsed. Getting it right would require lots of care. So if you mean: syscall entry ... this_cpu_inc(rcu_qs_ctr); syscall exit I don't see how this would work reliably: how do you handle the case of a SCHED_FIFO task never returning from user-space (common technique in RT apps)? synchronize_rcu() would block indefinitely as it would never see rcu_qs_ctr increase. We have to be able to observe user-mode anyway, for system-time statistics purposes, and that flag could IMHO also drive the RCU GC machinery. The problem is that I don't see how TIF_RCU_THINGY can work reliably. If the remote CPU sets it, it'll be too late and we'll still enter user mode without seeing it. If it's just an optimization, though, then it should be fine. Well, after setting it, the remote CPU has to re-check whether the RT CPU has entered user-mode - before it goes to wait. How? Suppose the exit path looked like: this_cpu_write(rcu_state, IN_USER); if (ti-flags _TIF_RCU_NOTIFY) { if (test_and_clear_bit(TIF_RCU_NOTIFY, ti-flags)) slow_notify_rcu_that_we_are_exiting(); } iret or sysret; No, it would look like this: this_cpu_write(rcu_state, IN_USER); iret or sysret; I.e. IN_USER is set well after all notifications are checked. No kernel execution happens afterwards. (No extra checks added - the regular return-to-user-work checks would handle TIF_RCU_NOTIFY.) ( Same goes for idle: we just mark it IN_IDLE and move it back to IN_KERNEL after the idling ends. ) The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET because stores can be delayed. (It could even happen after sysret, IIRC, but IRET is serializing.) All it has to do in the synchronize_rcu() slowpath is something like: if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) ... go wait ... } /* Cool, we observed quiescent state: */ The cost of the trivial barrier is nothing compared to the 'go wait' cost which we will pay in 99.9% of the cases! If we put an mfence after this_cpu_set or did an unconditional test_and_clear_bit on ti-flags then this problem goes away, but that would probably be slower than we'd like. We are talking about a dozen cycles, while a typical synchronize_rcu() will wait millions (sometimes billions) of cycles. There's absolutely zero performance concern here and it's all CPU local in any case. In fact a user-mode/kernel-mode flag speeds up naive implementations of synchronize_rcu(): because it's able to observe extended quiescent state immediately, without having to wait for a counter to increase (which was always the classic way to observe grace periods). If all CPUs are in user mode or are idle (which is rather common!) then synchronize_rcu() could return almost immediately - while previously it had to wait for scheduling or periodic timer irqs to trigger on all CPUs - adding many millisecs of delay even in the best of cases. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: On 05/07/2015 08:29 AM, Ingo Molnar wrote: * Frederic Weisbecker fweis...@gmail.com wrote: We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. Note we have the context tracking state which tells where the current task is: user/system/guest. Yes, but that overhead is what I'm suggesting we get rid of, I thought Rik was trying to find a mechanism that would be independent of that? One thing at a time :) I am working on the timer sampling stuff, which should be easy to adapt to a different user/system/guest/irq/softirq/... tracking thing, if somebody else comes up with a more efficient way to do that. So if you make the timer sampling use a percpu variable, and set that variable from the existing callbacks, then we could do this gradually: first the timer sampling uses the flag, then RCU could use it, and finally we could push it out to minimal assembly code. But it's important to start out with a percpu flag to track this all. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On May 8, 2015 12:07 PM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; I was thinking that either a counter or a state flag could make sense. Doing both would be pointless. The counter could use the low bits to indicate the state. The benefit of the counter would be that the RCU-waiting CPU could observe that the counter has incremented and that therefore a grace period has elapsed. Getting it right would require lots of care. So if you mean: syscall entry ... this_cpu_inc(rcu_qs_ctr); syscall exit I don't see how this would work reliably: how do you handle the case of a SCHED_FIFO task never returning from user-space (common technique in RT apps)? synchronize_rcu() would block indefinitely as it would never see rcu_qs_ctr increase. We have to be able to observe user-mode anyway, for system-time statistics purposes, and that flag could IMHO also drive the RCU GC machinery. The counter would have to be fancier than that to work. We could say that an even value means user mode, for example. IOW the high bits of the counter would count transitions to quiescent and the low bits would indicate the state. Actually, I'd count backwards. We could use an andl instruction to go to user mode and a decl to go to kernel mode. The problem is that I don't see how TIF_RCU_THINGY can work reliably. If the remote CPU sets it, it'll be too late and we'll still enter user mode without seeing it. If it's just an optimization, though, then it should be fine. Well, after setting it, the remote CPU has to re-check whether the RT CPU has entered user-mode - before it goes to wait. How? Suppose the exit path looked like: this_cpu_write(rcu_state, IN_USER); if (ti-flags _TIF_RCU_NOTIFY) { if (test_and_clear_bit(TIF_RCU_NOTIFY, ti-flags)) slow_notify_rcu_that_we_are_exiting(); } iret or sysret; No, it would look like this: this_cpu_write(rcu_state, IN_USER); iret or sysret; I.e. IN_USER is set well after all notifications are checked. No kernel execution happens afterwards. (No extra checks added - the regular return-to-user-work checks would handle TIF_RCU_NOTIFY.) ( Same goes for idle: we just mark it IN_IDLE and move it back to IN_KERNEL after the idling ends. ) The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET because stores can be delayed. (It could even happen after sysret, IIRC, but IRET is serializing.) All it has to do in the synchronize_rcu() slowpath is something like: I don't think this works. Suppose rt_cpu starts in kernel mode. Then it checks ti flags. if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) Now it sets rcu_state and exits to user mode. It never notices TIF_RCU_NOTIFY. We could fix this by sending an IPI to kick it. ... go wait ... } /* Cool, we observed quiescent state: */ The cost of the trivial barrier is nothing compared to the 'go wait' cost which we will pay in 99.9% of the cases! If we put an mfence after this_cpu_set or did an unconditional test_and_clear_bit on ti-flags then this problem goes away, but that would probably be slower than we'd like. We are talking about a dozen cycles, while a typical synchronize_rcu() will wait millions (sometimes billions) of cycles. There's absolutely zero performance concern here and it's all CPU local in any case. The barrier would affect all syscalls, though. Admittedly that's still much cheaper than the current thing, but given that we can round trip a syscall in 110 cycles, we'd take at least 10% overhead. My preference would be to use a counter and skip the barrier. If the waiting CPU polled the counter, then, even if it lost this race, it would be guaranteed to see the counter change reasonably soon as long as no CPU implementation sits on its store buffer forever. In fact a user-mode/kernel-mode flag speeds up naive implementations of synchronize_rcu(): because it's able to observe extended quiescent state immediately, without having to wait for a counter to increase (which was always the classic way to observe grace periods). If all CPUs are in user mode or are idle (which is rather common!) then
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: on return from idle we'd do: rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */ etc. On all boundary transitions we can use a constant ADD on a suitable percpu variable. Sounds good to me, except that we need to be careful to distinguish between non-syscall entries from quiescent states and non-syscall entries from quiescent states. It might be hard to make that distinction! ;-) I suspect you wanted to raise the issue of various contexts nesting on each other, such as syscall triggering a page fault, which gets an irq nested, etc. - versus non-nested contexts such as user-space triggering a page fault or user-space getting an irq? [...] We could save the old state (as the current exception_enter code does) or we could allocate enough low bits for the state that the problem goes away. So I think, if it all works in practice just as well as it does in email, we might be better off with more state bits: that would tell any remote statistics/sampling code more as well. It might also be easier to patch in/out, because this kind of state tracking will affect non-RT CPUs as well. (Later on we could do a separate IDT for RT CPUs as well, with a patched version of the entry code.) I don't think the TIF_RCU_QS variant is worthwhile -- merging the counter and state is probably both easier and faster. Yeah. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Fri, May 8, 2015 at 4:27 AM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: On May 8, 2015 12:07 PM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; I was thinking that either a counter or a state flag could make sense. Doing both would be pointless. The counter could use the low bits to indicate the state. The benefit of the counter would be that the RCU-waiting CPU could observe that the counter has incremented and that therefore a grace period has elapsed. Getting it right would require lots of care. So if you mean: syscall entry ... this_cpu_inc(rcu_qs_ctr); syscall exit I don't see how this would work reliably: how do you handle the case of a SCHED_FIFO task never returning from user-space (common technique in RT apps)? synchronize_rcu() would block indefinitely as it would never see rcu_qs_ctr increase. We have to be able to observe user-mode anyway, for system-time statistics purposes, and that flag could IMHO also drive the RCU GC machinery. The counter would have to be fancier than that to work. We could say that an even value means user mode, for example. IOW the high bits of the counter would count transitions to quiescent and the low bits would indicate the state. Actually, I'd count backwards. We could use an andl instruction to go to user mode and a decl to go to kernel mode. at which point it's not really a monotonic quiescent state counter - which your naming suggested before. Also it would have to be done both at entry and at exit. But yes, if you replace a state flag with a recursive state flag that is INC/DEC maintained that would work as well. That's not a counter though. I won't quibble about the name :) The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET because stores can be delayed. (It could even happen after sysret, IIRC, but IRET is serializing.) All it has to do in the synchronize_rcu() slowpath is something like: I don't think this works. Suppose rt_cpu starts in kernel mode. Then it checks ti flags. if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) Now it sets rcu_state and exits to user mode. It never notices TIF_RCU_NOTIFY. Indeed, you are right, that's racy. [...] We could fix this by sending an IPI to kick it. With an IPI we won't need any flags or counters on the RT CPU - it's the IPI we are trying to avoid. So how about the following way to fix the race: simply do a poll loop with a fixed timeout sleep: like it would do anyway if synchronize_rcu() was waiting for the timer irq to end the grace period on the RT-CPU. Seems reasonable to me. The TIF flag would cause an RCU grace period to lapse, no need to wake up the synchronize_rcu() side: it will notice the (regular) increased RCU counter. We are talking about a dozen cycles, while a typical synchronize_rcu() will wait millions (sometimes billions) of cycles. There's absolutely zero performance concern here and it's all CPU local in any case. The barrier would affect all syscalls, though. [...] What barrier? I never suggested any barrier in the syscall fast path, this bit: Oh, I misunderstood. if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) runs inside synchronize_rcu() or so. But none of that is needed if we do: - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context entry/exit time, straight in the assembly - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does nothing but: this_cpu_inc(rcu_qs_ctr). - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and then does a timeout-poll-loop with jiffies granular timeouts, simulating a timer IRQ in essence. It will either observe IN_USER or will see the regular RCU qs counter increase. this should be race free. Alternatively we can even get rid of the TIF flag by merging the percpu counter with
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: On May 8, 2015 12:07 PM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; I was thinking that either a counter or a state flag could make sense. Doing both would be pointless. The counter could use the low bits to indicate the state. The benefit of the counter would be that the RCU-waiting CPU could observe that the counter has incremented and that therefore a grace period has elapsed. Getting it right would require lots of care. So if you mean: syscall entry ... this_cpu_inc(rcu_qs_ctr); syscall exit I don't see how this would work reliably: how do you handle the case of a SCHED_FIFO task never returning from user-space (common technique in RT apps)? synchronize_rcu() would block indefinitely as it would never see rcu_qs_ctr increase. We have to be able to observe user-mode anyway, for system-time statistics purposes, and that flag could IMHO also drive the RCU GC machinery. The counter would have to be fancier than that to work. We could say that an even value means user mode, for example. IOW the high bits of the counter would count transitions to quiescent and the low bits would indicate the state. Actually, I'd count backwards. We could use an andl instruction to go to user mode and a decl to go to kernel mode. at which point it's not really a monotonic quiescent state counter - which your naming suggested before. Also it would have to be done both at entry and at exit. But yes, if you replace a state flag with a recursive state flag that is INC/DEC maintained that would work as well. That's not a counter though. The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET because stores can be delayed. (It could even happen after sysret, IIRC, but IRET is serializing.) All it has to do in the synchronize_rcu() slowpath is something like: I don't think this works. Suppose rt_cpu starts in kernel mode. Then it checks ti flags. if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) Now it sets rcu_state and exits to user mode. It never notices TIF_RCU_NOTIFY. Indeed, you are right, that's racy. [...] We could fix this by sending an IPI to kick it. With an IPI we won't need any flags or counters on the RT CPU - it's the IPI we are trying to avoid. So how about the following way to fix the race: simply do a poll loop with a fixed timeout sleep: like it would do anyway if synchronize_rcu() was waiting for the timer irq to end the grace period on the RT-CPU. The TIF flag would cause an RCU grace period to lapse, no need to wake up the synchronize_rcu() side: it will notice the (regular) increased RCU counter. We are talking about a dozen cycles, while a typical synchronize_rcu() will wait millions (sometimes billions) of cycles. There's absolutely zero performance concern here and it's all CPU local in any case. The barrier would affect all syscalls, though. [...] What barrier? I never suggested any barrier in the syscall fast path, this bit: if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { smp_mb__before_atomic(); set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); smp_rmb(); if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) runs inside synchronize_rcu() or so. But none of that is needed if we do: - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context entry/exit time, straight in the assembly - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does nothing but: this_cpu_inc(rcu_qs_ctr). - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and then does a timeout-poll-loop with jiffies granular timeouts, simulating a timer IRQ in essence. It will either observe IN_USER or will see the regular RCU qs counter increase. this should be race free. Alternatively we can even get rid of the TIF flag by merging the percpu counter with the percpu state. Quiescent states are recorded via a counter shifted by two bits: rcu_qs_ctr += 4; while user/kernel/idle mode is recorded in
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On May 7, 2015 8:38 PM, "Ingo Molnar" wrote: > > > * Andy Lutomirski wrote: > > > I think one or both of us is missing something or we're just talking > > about different things. > > That's very much possible! > > I think part of the problem is that I called the 'remote CPU' the RT > CPU, while you seem to be calling it the CPU that does the > synchronize_rcu(). > > So lets start again, with calling the synchronize_rcu() the 'remote > CPU', and the one doing the RT workload the 'RT CPU': > > > If the nohz/RT cpu is about to enter user mode and stay there for a > > long time, it does: > > > > this_cpu_inc(rcu_qs_ctr); > > > > or whatever. Maybe we add: > > > > this_cpu_set(rcu_state) = IN_USER; > > > > or however it's spelled. > > > > The remote CPU wants to check our state. If this happens just > > before the IN_USER write or rcu_qs_ctr increment becomes visible, > > then it'll think we're in kernel mode. Now it either has to poll > > (which is fine) or try to get us to tell the RCU core when we become > > quiescent by setting TIF_RCU_THINGY. > > So do you mean: > >this_cpu_set(rcu_state) = IN_KERNEL; >... >this_cpu_inc(rcu_qs_ctr); >this_cpu_set(rcu_state) = IN_USER; > > ? > > So in your proposal we'd have an INC and two MOVs. I think we can make > it just two simple stores into a byte flag, one on entry and one on > exit: > >this_cpu_set(rcu_state) = IN_KERNEL; >... >this_cpu_set(rcu_state) = IN_USER; > I was thinking that either a counter or a state flag could make sense. Doing both would be pointless. The counter could use the low bits to indicate the state. The benefit of the counter would be that the RCU-waiting CPU could observe that the counter has incremented and that therefore a grace period has elapsed. Getting it right would require lots of care. > plus the rare but magic TIF_RCU_THINGY that tells a waiting > synchronize_rcu() about the next quiescent state. > > > The problem is that I don't see how TIF_RCU_THINGY can work > > reliably. If the remote CPU sets it, it'll be too late and we'll > > still enter user mode without seeing it. If it's just an > > optimization, though, then it should be fine. > > Well, after setting it, the remote CPU has to re-check whether the RT > CPU has entered user-mode - before it goes to wait. How? Suppose the exit path looked like: this_cpu_write(rcu_state, IN_USER); if (ti->flags & _TIF_RCU_NOTIFY) { if (test_and_clear_bit(TIF_RCU_NOTIFY, >flags)) slow_notify_rcu_that_we_are_exiting(); } iret or sysret; The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET because stores can be delayed. (It could even happen after sysret, IIRC, but IRET is serializing.) If we put an mfence after this_cpu_set or did an unconditional test_and_clear_bit on ti->flags then this problem goes away, but that would probably be slower than we'd like. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/07/2015 08:29 AM, Ingo Molnar wrote: > > * Frederic Weisbecker wrote: > We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. >>> >>> So one worry with this and similar approaches of statistically >>> detecting user mode would be the fact that on the way out to >>> user-space we don't really destroy the previous call trace - we >>> just pop off the stack (non-destructively), restore RIPs and are >>> gone. >>> >>> We'll need that percpu flag I suspect. >> >> Note we have the context tracking state which tells where the >> current task is: user/system/guest. > > Yes, but that overhead is what I'm suggesting we get rid of, I thought > Rik was trying to find a mechanism that would be independent of that? One thing at a time :) I am working on the timer sampling stuff, which should be easy to adapt to a different user/system/guest/irq/softirq/... tracking thing, if somebody else comes up with a more efficient way to do that. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > I think one or both of us is missing something or we're just talking > about different things. That's very much possible! I think part of the problem is that I called the 'remote CPU' the RT CPU, while you seem to be calling it the CPU that does the synchronize_rcu(). So lets start again, with calling the synchronize_rcu() the 'remote CPU', and the one doing the RT workload the 'RT CPU': > If the nohz/RT cpu is about to enter user mode and stay there for a > long time, it does: > > this_cpu_inc(rcu_qs_ctr); > > or whatever. Maybe we add: > > this_cpu_set(rcu_state) = IN_USER; > > or however it's spelled. > > The remote CPU wants to check our state. If this happens just > before the IN_USER write or rcu_qs_ctr increment becomes visible, > then it'll think we're in kernel mode. Now it either has to poll > (which is fine) or try to get us to tell the RCU core when we become > quiescent by setting TIF_RCU_THINGY. So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; plus the rare but magic TIF_RCU_THINGY that tells a waiting synchronize_rcu() about the next quiescent state. > The problem is that I don't see how TIF_RCU_THINGY can work > reliably. If the remote CPU sets it, it'll be too late and we'll > still enter user mode without seeing it. If it's just an > optimization, though, then it should be fine. Well, after setting it, the remote CPU has to re-check whether the RT CPU has entered user-mode - before it goes to wait. If it has entered user mode then the remote CPU has observed quiescent state and is done. If it's still in kernel mode then it waits until the TIF_RCU_THINGY does its completion. > I still feel like this is all overengineered. Shouldn't rcu_qs_ctr > be enough for all of the above? How would just INC rcu_qs_ctr (without the IN_KERNEL/IN_USER flag) work? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Thu, May 7, 2015 at 5:44 AM, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > >> > > We cannot take the lock_trace(task) from irq context, and we >> > > probably do not need to anyway, since we do not care about a >> > > precise stack trace for the task. >> > >> > So one worry with this and similar approaches of statistically >> > detecting user mode would be the fact that on the way out to >> > user-space we don't really destroy the previous call trace - we >> > just pop off the stack (non-destructively), restore RIPs and are >> > gone. >> > >> > We'll need that percpu flag I suspect. >> > >> > And once we have the flag, we can get rid of the per syscall RCU >> > callback as well, relatively easily: with CMPXCHG (in >> > synchronize_rcu()!) we can reliably sample whether a CPU is in >> > user mode right now, while the syscall entry/exit path does not >> > use any atomics, we can just use a simple MOV. >> > >> > Once we observe 'user mode', then we have observed quiescent state >> > and synchronize_rcu() can continue. If we've observed kernel mode >> > we can frob the remote task's TIF_ flags to make it go into a >> > quiescent state publishing routine on syscall-return. >> > >> >> How does that work? >> >> If the exit code writes the per-cpu flag and then checks >> TIF_whatever, we need a barrier to avoid a race where we end up in >> user mode without seeing the flag. > > No, the TIF_RCU_SYNC flag would be independent of the user-mode flag. > > Also, the user-mode flag would be changed from the 'kernel-mode' value > to the 'user-mode' value after we do the regular TIF checks - just > before we hit user-space. > > The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being > executed on some other CPU not doing RT work) to intelligently wait > for the remote (RT work doing) CPU to finish executing kernel code, > without polling or so. > > And yes, the TIF_RCU_QS handler on the remote CPU would have to > execute an SMP barrier, before it allows an RCU quiescent state to > pass. Note that the regular RCU code for quiescent state publishing > already has that barrier, typically something like: > > this_cpu_inc(rcu_qs_ctr); > > That in itself is enough for synchronize_rcu(), it could do a small > timing loop with short sleeps, until it waits for rcu_qs_ctr to > increase. I think one or both of us is missing something or we're just talking about different things. If the nohz/RT cpu is about to enter user mode and stay there for a long time, it does: this_cpu_inc(rcu_qs_ctr); or whatever. Maybe we add: this_cpu_set(rcu_state) = IN_USER; or however it's spelled. The remote CPU wants to check our state. If this happens just before the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll think we're in kernel mode. Now it either has to poll (which is fine) or try to get us to tell the RCU core when we become quiescent by setting TIF_RCU_THINGY. The problem is that I don't see how TIF_RCU_THINGY can work reliably. If the remote CPU sets it, it'll be too late and we'll still enter user mode without seeing it. If it's just an optimization, though, then it should be fine. I still feel like this is all overengineered. Shouldn't rcu_qs_ctr be enough for all of the above? --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Ingo Molnar wrote: > The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() > (being executed on some other CPU not doing RT work) to > intelligently wait for the remote (RT work doing) CPU to finish > executing kernel code, without polling or so. it's basically a cheap IPI being inserted on the remote CPU. We need the TIF_RCU_QS callback not just to wait intelligently, but mainly to elapse a grace period, otherwise synchronize_rcu() might not ever make progress: think a SCHED_FIFO task doing some kernel work, synchronize_rcu() stumbling upon it - but the SCHED_FIFO task otherwise never scheduling and never getting any timer irqs either, and thus never entering quiescent state. (Cc:-ed Paul too, he might be interested in this as well.) Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > > > We cannot take the lock_trace(task) from irq context, and we > > > probably do not need to anyway, since we do not care about a > > > precise stack trace for the task. > > > > So one worry with this and similar approaches of statistically > > detecting user mode would be the fact that on the way out to > > user-space we don't really destroy the previous call trace - we > > just pop off the stack (non-destructively), restore RIPs and are > > gone. > > > > We'll need that percpu flag I suspect. > > > > And once we have the flag, we can get rid of the per syscall RCU > > callback as well, relatively easily: with CMPXCHG (in > > synchronize_rcu()!) we can reliably sample whether a CPU is in > > user mode right now, while the syscall entry/exit path does not > > use any atomics, we can just use a simple MOV. > > > > Once we observe 'user mode', then we have observed quiescent state > > and synchronize_rcu() can continue. If we've observed kernel mode > > we can frob the remote task's TIF_ flags to make it go into a > > quiescent state publishing routine on syscall-return. > > > > How does that work? > > If the exit code writes the per-cpu flag and then checks > TIF_whatever, we need a barrier to avoid a race where we end up in > user mode without seeing the flag. No, the TIF_RCU_SYNC flag would be independent of the user-mode flag. Also, the user-mode flag would be changed from the 'kernel-mode' value to the 'user-mode' value after we do the regular TIF checks - just before we hit user-space. The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being executed on some other CPU not doing RT work) to intelligently wait for the remote (RT work doing) CPU to finish executing kernel code, without polling or so. And yes, the TIF_RCU_QS handler on the remote CPU would have to execute an SMP barrier, before it allows an RCU quiescent state to pass. Note that the regular RCU code for quiescent state publishing already has that barrier, typically something like: this_cpu_inc(rcu_qs_ctr); That in itself is enough for synchronize_rcu(), it could do a small timing loop with short sleeps, until it waits for rcu_qs_ctr to increase. > I think the right solution is to accept that race and just have the > RCU code send an IPI (or check again) if it sees too long of a > period elapse in kernel mode. I don't think there's any need for an IPI. > I think the flag should be a counter, though. That way a workload > that makes lots of syscalls will be quickly detected as going > through quiescent states even if it's never actually observed in > user mode. Flag write is easier on the CPU than an INC/DEC: only a store to a percpu location, no load needed, and no value dependencies. The store will be program ordered, so it's a spin_unlock() work-alike. > > The only hard requirement of this scheme from the RCU > > synchronization POV is that all kernel contexts that may touch RCU > > state need to flip this flag reliably to 'kernel mode': i.e. all > > irq handlers, traps, NMIs and all syscall variants need to do > > this. > > > > But once it's there, it's really neat. > > We already have to do this with the current code. I went through > and checked all of the IST entries a couple versions ago. Yeah - I just mean if the primary flag is _not_ TIF driven (which is my suggestion, and which I thought you suggested as well) then all code paths need to be covered. Things like the irq-tracking callbacks are debugging and instrumentation code - the user/kernel mode flag would be a hard requirement on RCU correctness: missing a flag update might cause the kernel to crash in non-obvious ways. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Frederic Weisbecker wrote: > > > We cannot take the lock_trace(task) from irq context, and we > > > probably do not need to anyway, since we do not care about a > > > precise stack trace for the task. > > > > So one worry with this and similar approaches of statistically > > detecting user mode would be the fact that on the way out to > > user-space we don't really destroy the previous call trace - we > > just pop off the stack (non-destructively), restore RIPs and are > > gone. > > > > We'll need that percpu flag I suspect. > > Note we have the context tracking state which tells where the > current task is: user/system/guest. Yes, but that overhead is what I'm suggesting we get rid of, I thought Rik was trying to find a mechanism that would be independent of that? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On May 7, 2015 4:18 PM, "Ingo Molnar" wrote: > > > * Rik van Riel wrote: > > > > If, on the other hand, you're just going to remotely sample the > > > in-memory context, that sounds good. > > > > It's the latter. > > > > If you look at /proc//{stack,syscall,wchan} and other files, > > you will see we already have ways to determine, from in memory > > content, where a program is running at a certain point in time. > > > > In fact, the timer interrupt based accounting does a similar thing. > > It has a task examine its own in-memory state to figure out what it > > was doing before the timer interrupt happened. > > > > The kernel side stack pointer is probably enough to tell us whether > > a task is active in kernel space, on an irq stack, or (maybe) in > > user space. Not convinced about the latter, we may need to look at > > the same state the RCU code keeps track of to see what mode a task > > is in... > > > > I am looking at the code to see what locks we need to grab. > > > > I suspect the runqueue lock may be enough, to ensure that the task > > struct, and stack do not go away while we are looking at them. > > That will be enough, especially if you get to the task reference via > rq->curr. > > > We cannot take the lock_trace(task) from irq context, and we > > probably do not need to anyway, since we do not care about a precise > > stack trace for the task. > > So one worry with this and similar approaches of statistically > detecting user mode would be the fact that on the way out to > user-space we don't really destroy the previous call trace - we just > pop off the stack (non-destructively), restore RIPs and are gone. > > We'll need that percpu flag I suspect. > > And once we have the flag, we can get rid of the per syscall RCU > callback as well, relatively easily: with CMPXCHG (in > synchronize_rcu()!) we can reliably sample whether a CPU is in user > mode right now, while the syscall entry/exit path does not use any > atomics, we can just use a simple MOV. > > Once we observe 'user mode', then we have observed quiescent state and > synchronize_rcu() can continue. If we've observed kernel mode we can > frob the remote task's TIF_ flags to make it go into a quiescent state > publishing routine on syscall-return. > How does that work? If the exit code writes the per-cpu flag and then checks TIF_whatever, we need a barrier to avoid a race where we end up in user mode without seeing the flag. I think the right solution is to accept that race and just have the RCU code send an IPI (or check again) if it sees too long of a period elapse in kernel mode. I think the flag should be a counter, though. That way a workload that makes lots of syscalls will be quickly detected as going through quiescent states even if it's never actually observed in user mode. > The only hard requirement of this scheme from the RCU synchronization > POV is that all kernel contexts that may touch RCU state need to flip > this flag reliably to 'kernel mode': i.e. all irq handlers, traps, > NMIs and all syscall variants need to do this. > > But once it's there, it's really neat. > We already have to do this with the current code. I went through and checked all of the IST entries a couple versions ago. I think we need to clean up the current garbage asm first, though. See: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry=d22f1dca4c7c93fdd1ce754e38d71d1961c0f9ac (Very much unfinished, and it should probably be split up, but AFAICT it works. Don't hold your breath for a real version.) --Andy > Thanks, > > Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Thu, May 07, 2015 at 12:48:45PM +0200, Ingo Molnar wrote: > > * Rik van Riel wrote: > > > > If, on the other hand, you're just going to remotely sample the > > > in-memory context, that sounds good. > > > > It's the latter. > > > > If you look at /proc//{stack,syscall,wchan} and other files, > > you will see we already have ways to determine, from in memory > > content, where a program is running at a certain point in time. > > > > In fact, the timer interrupt based accounting does a similar thing. > > It has a task examine its own in-memory state to figure out what it > > was doing before the timer interrupt happened. > > > > The kernel side stack pointer is probably enough to tell us whether > > a task is active in kernel space, on an irq stack, or (maybe) in > > user space. Not convinced about the latter, we may need to look at > > the same state the RCU code keeps track of to see what mode a task > > is in... > > > > I am looking at the code to see what locks we need to grab. > > > > I suspect the runqueue lock may be enough, to ensure that the task > > struct, and stack do not go away while we are looking at them. > > That will be enough, especially if you get to the task reference via > rq->curr. > > > We cannot take the lock_trace(task) from irq context, and we > > probably do not need to anyway, since we do not care about a precise > > stack trace for the task. > > So one worry with this and similar approaches of statistically > detecting user mode would be the fact that on the way out to > user-space we don't really destroy the previous call trace - we just > pop off the stack (non-destructively), restore RIPs and are gone. > > We'll need that percpu flag I suspect. Note we have the context tracking state which tells where the current task is: user/system/guest. > > And once we have the flag, we can get rid of the per syscall RCU > callback as well, relatively easily: with CMPXCHG (in > synchronize_rcu()!) we can reliably sample whether a CPU is in user > mode right now, while the syscall entry/exit path does not use any > atomics, we can just use a simple MOV. > > Once we observe 'user mode', then we have observed quiescent state and > synchronize_rcu() can continue. If we've observed kernel mode we can > frob the remote task's TIF_ flags to make it go into a quiescent state > publishing routine on syscall-return. > > The only hard requirement of this scheme from the RCU synchronization > POV is that all kernel contexts that may touch RCU state need to flip > this flag reliably to 'kernel mode': i.e. all irq handlers, traps, > NMIs and all syscall variants need to do this. > > But once it's there, it's really neat. > > Thanks, > > Ingo > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > > If, on the other hand, you're just going to remotely sample the > > in-memory context, that sounds good. > > It's the latter. > > If you look at /proc//{stack,syscall,wchan} and other files, > you will see we already have ways to determine, from in memory > content, where a program is running at a certain point in time. > > In fact, the timer interrupt based accounting does a similar thing. > It has a task examine its own in-memory state to figure out what it > was doing before the timer interrupt happened. > > The kernel side stack pointer is probably enough to tell us whether > a task is active in kernel space, on an irq stack, or (maybe) in > user space. Not convinced about the latter, we may need to look at > the same state the RCU code keeps track of to see what mode a task > is in... > > I am looking at the code to see what locks we need to grab. > > I suspect the runqueue lock may be enough, to ensure that the task > struct, and stack do not go away while we are looking at them. That will be enough, especially if you get to the task reference via rq->curr. > We cannot take the lock_trace(task) from irq context, and we > probably do not need to anyway, since we do not care about a precise > stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. And once we have the flag, we can get rid of the per syscall RCU callback as well, relatively easily: with CMPXCHG (in synchronize_rcu()!) we can reliably sample whether a CPU is in user mode right now, while the syscall entry/exit path does not use any atomics, we can just use a simple MOV. Once we observe 'user mode', then we have observed quiescent state and synchronize_rcu() can continue. If we've observed kernel mode we can frob the remote task's TIF_ flags to make it go into a quiescent state publishing routine on syscall-return. The only hard requirement of this scheme from the RCU synchronization POV is that all kernel contexts that may touch RCU state need to flip this flag reliably to 'kernel mode': i.e. all irq handlers, traps, NMIs and all syscall variants need to do this. But once it's there, it's really neat. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > >> [...] Also, we'd have to audit all the entries, and > >> system_call_after_swapgs currently enables interrupts early > >> enough that an interrupt before all the pushes will do > >> unpredictable things to pt_regs. > > > > An irq hardware frame won't push zero to that selector value, > > right? That's the only bad thing that would confuse the code. > > > > I think it's not quite that simple. The syscall entry looks like, > roughly: > > fix rsp; > sti; > push ss; > push rsp; > push flags; > push cs; > push rip; > > We can get an interrupt part-way through those pushes. Maybe there's > no bad place where we could get an IRQ since SS is first, [...] ... and it should be covered by the 'STI window' where the instruction following a STI is still part of the irqs-off section. > [...] but this is still nasty. True. Another approach would be to set up two aliases in the GDT, so we could freely change 'ss' between them and thus store information, without possibly confusing the syscall entry/exit code. That still gets complicated by IST entries, which creates multiple positions for the 'flag'. > I think I prefer a per-cpu approach over a per-task approach, since > it's easier to reason about and it should still only require one > instruction on entry and one instruction on exit. Yes, agreed. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On May 7, 2015 8:38 PM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: I think one or both of us is missing something or we're just talking about different things. That's very much possible! I think part of the problem is that I called the 'remote CPU' the RT CPU, while you seem to be calling it the CPU that does the synchronize_rcu(). So lets start again, with calling the synchronize_rcu() the 'remote CPU', and the one doing the RT workload the 'RT CPU': If the nohz/RT cpu is about to enter user mode and stay there for a long time, it does: this_cpu_inc(rcu_qs_ctr); or whatever. Maybe we add: this_cpu_set(rcu_state) = IN_USER; or however it's spelled. The remote CPU wants to check our state. If this happens just before the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll think we're in kernel mode. Now it either has to poll (which is fine) or try to get us to tell the RCU core when we become quiescent by setting TIF_RCU_THINGY. So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; I was thinking that either a counter or a state flag could make sense. Doing both would be pointless. The counter could use the low bits to indicate the state. The benefit of the counter would be that the RCU-waiting CPU could observe that the counter has incremented and that therefore a grace period has elapsed. Getting it right would require lots of care. plus the rare but magic TIF_RCU_THINGY that tells a waiting synchronize_rcu() about the next quiescent state. The problem is that I don't see how TIF_RCU_THINGY can work reliably. If the remote CPU sets it, it'll be too late and we'll still enter user mode without seeing it. If it's just an optimization, though, then it should be fine. Well, after setting it, the remote CPU has to re-check whether the RT CPU has entered user-mode - before it goes to wait. How? Suppose the exit path looked like: this_cpu_write(rcu_state, IN_USER); if (ti-flags _TIF_RCU_NOTIFY) { if (test_and_clear_bit(TIF_RCU_NOTIFY, ti-flags)) slow_notify_rcu_that_we_are_exiting(); } iret or sysret; The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET because stores can be delayed. (It could even happen after sysret, IIRC, but IRET is serializing.) If we put an mfence after this_cpu_set or did an unconditional test_and_clear_bit on ti-flags then this problem goes away, but that would probably be slower than we'd like. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Thu, May 07, 2015 at 12:48:45PM +0200, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: If, on the other hand, you're just going to remotely sample the in-memory context, that sounds good. It's the latter. If you look at /proc/pid/{stack,syscall,wchan} and other files, you will see we already have ways to determine, from in memory content, where a program is running at a certain point in time. In fact, the timer interrupt based accounting does a similar thing. It has a task examine its own in-memory state to figure out what it was doing before the timer interrupt happened. The kernel side stack pointer is probably enough to tell us whether a task is active in kernel space, on an irq stack, or (maybe) in user space. Not convinced about the latter, we may need to look at the same state the RCU code keeps track of to see what mode a task is in... I am looking at the code to see what locks we need to grab. I suspect the runqueue lock may be enough, to ensure that the task struct, and stack do not go away while we are looking at them. That will be enough, especially if you get to the task reference via rq-curr. We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. Note we have the context tracking state which tells where the current task is: user/system/guest. And once we have the flag, we can get rid of the per syscall RCU callback as well, relatively easily: with CMPXCHG (in synchronize_rcu()!) we can reliably sample whether a CPU is in user mode right now, while the syscall entry/exit path does not use any atomics, we can just use a simple MOV. Once we observe 'user mode', then we have observed quiescent state and synchronize_rcu() can continue. If we've observed kernel mode we can frob the remote task's TIF_ flags to make it go into a quiescent state publishing routine on syscall-return. The only hard requirement of this scheme from the RCU synchronization POV is that all kernel contexts that may touch RCU state need to flip this flag reliably to 'kernel mode': i.e. all irq handlers, traps, NMIs and all syscall variants need to do this. But once it's there, it's really neat. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: If, on the other hand, you're just going to remotely sample the in-memory context, that sounds good. It's the latter. If you look at /proc/pid/{stack,syscall,wchan} and other files, you will see we already have ways to determine, from in memory content, where a program is running at a certain point in time. In fact, the timer interrupt based accounting does a similar thing. It has a task examine its own in-memory state to figure out what it was doing before the timer interrupt happened. The kernel side stack pointer is probably enough to tell us whether a task is active in kernel space, on an irq stack, or (maybe) in user space. Not convinced about the latter, we may need to look at the same state the RCU code keeps track of to see what mode a task is in... I am looking at the code to see what locks we need to grab. I suspect the runqueue lock may be enough, to ensure that the task struct, and stack do not go away while we are looking at them. That will be enough, especially if you get to the task reference via rq-curr. We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. And once we have the flag, we can get rid of the per syscall RCU callback as well, relatively easily: with CMPXCHG (in synchronize_rcu()!) we can reliably sample whether a CPU is in user mode right now, while the syscall entry/exit path does not use any atomics, we can just use a simple MOV. Once we observe 'user mode', then we have observed quiescent state and synchronize_rcu() can continue. If we've observed kernel mode we can frob the remote task's TIF_ flags to make it go into a quiescent state publishing routine on syscall-return. The only hard requirement of this scheme from the RCU synchronization POV is that all kernel contexts that may touch RCU state need to flip this flag reliably to 'kernel mode': i.e. all irq handlers, traps, NMIs and all syscall variants need to do this. But once it's there, it's really neat. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. And once we have the flag, we can get rid of the per syscall RCU callback as well, relatively easily: with CMPXCHG (in synchronize_rcu()!) we can reliably sample whether a CPU is in user mode right now, while the syscall entry/exit path does not use any atomics, we can just use a simple MOV. Once we observe 'user mode', then we have observed quiescent state and synchronize_rcu() can continue. If we've observed kernel mode we can frob the remote task's TIF_ flags to make it go into a quiescent state publishing routine on syscall-return. How does that work? If the exit code writes the per-cpu flag and then checks TIF_whatever, we need a barrier to avoid a race where we end up in user mode without seeing the flag. No, the TIF_RCU_SYNC flag would be independent of the user-mode flag. Also, the user-mode flag would be changed from the 'kernel-mode' value to the 'user-mode' value after we do the regular TIF checks - just before we hit user-space. The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being executed on some other CPU not doing RT work) to intelligently wait for the remote (RT work doing) CPU to finish executing kernel code, without polling or so. And yes, the TIF_RCU_QS handler on the remote CPU would have to execute an SMP barrier, before it allows an RCU quiescent state to pass. Note that the regular RCU code for quiescent state publishing already has that barrier, typically something like: this_cpu_inc(rcu_qs_ctr); That in itself is enough for synchronize_rcu(), it could do a small timing loop with short sleeps, until it waits for rcu_qs_ctr to increase. I think the right solution is to accept that race and just have the RCU code send an IPI (or check again) if it sees too long of a period elapse in kernel mode. I don't think there's any need for an IPI. I think the flag should be a counter, though. That way a workload that makes lots of syscalls will be quickly detected as going through quiescent states even if it's never actually observed in user mode. Flag write is easier on the CPU than an INC/DEC: only a store to a percpu location, no load needed, and no value dependencies. The store will be program ordered, so it's a spin_unlock() work-alike. The only hard requirement of this scheme from the RCU synchronization POV is that all kernel contexts that may touch RCU state need to flip this flag reliably to 'kernel mode': i.e. all irq handlers, traps, NMIs and all syscall variants need to do this. But once it's there, it's really neat. We already have to do this with the current code. I went through and checked all of the IST entries a couple versions ago. Yeah - I just mean if the primary flag is _not_ TIF driven (which is my suggestion, and which I thought you suggested as well) then all code paths need to be covered. Things like the irq-tracking callbacks are debugging and instrumentation code - the user/kernel mode flag would be a hard requirement on RCU correctness: missing a flag update might cause the kernel to crash in non-obvious ways. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/07/2015 08:29 AM, Ingo Molnar wrote: * Frederic Weisbecker fweis...@gmail.com wrote: We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. Note we have the context tracking state which tells where the current task is: user/system/guest. Yes, but that overhead is what I'm suggesting we get rid of, I thought Rik was trying to find a mechanism that would be independent of that? One thing at a time :) I am working on the timer sampling stuff, which should be easy to adapt to a different user/system/guest/irq/softirq/... tracking thing, if somebody else comes up with a more efficient way to do that. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: I think one or both of us is missing something or we're just talking about different things. That's very much possible! I think part of the problem is that I called the 'remote CPU' the RT CPU, while you seem to be calling it the CPU that does the synchronize_rcu(). So lets start again, with calling the synchronize_rcu() the 'remote CPU', and the one doing the RT workload the 'RT CPU': If the nohz/RT cpu is about to enter user mode and stay there for a long time, it does: this_cpu_inc(rcu_qs_ctr); or whatever. Maybe we add: this_cpu_set(rcu_state) = IN_USER; or however it's spelled. The remote CPU wants to check our state. If this happens just before the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll think we're in kernel mode. Now it either has to poll (which is fine) or try to get us to tell the RCU core when we become quiescent by setting TIF_RCU_THINGY. So do you mean: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_inc(rcu_qs_ctr); this_cpu_set(rcu_state) = IN_USER; ? So in your proposal we'd have an INC and two MOVs. I think we can make it just two simple stores into a byte flag, one on entry and one on exit: this_cpu_set(rcu_state) = IN_KERNEL; ... this_cpu_set(rcu_state) = IN_USER; plus the rare but magic TIF_RCU_THINGY that tells a waiting synchronize_rcu() about the next quiescent state. The problem is that I don't see how TIF_RCU_THINGY can work reliably. If the remote CPU sets it, it'll be too late and we'll still enter user mode without seeing it. If it's just an optimization, though, then it should be fine. Well, after setting it, the remote CPU has to re-check whether the RT CPU has entered user-mode - before it goes to wait. If it has entered user mode then the remote CPU has observed quiescent state and is done. If it's still in kernel mode then it waits until the TIF_RCU_THINGY does its completion. I still feel like this is all overengineered. Shouldn't rcu_qs_ctr be enough for all of the above? How would just INC rcu_qs_ctr (without the IN_KERNEL/IN_USER flag) work? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: [...] Also, we'd have to audit all the entries, and system_call_after_swapgs currently enables interrupts early enough that an interrupt before all the pushes will do unpredictable things to pt_regs. An irq hardware frame won't push zero to that selector value, right? That's the only bad thing that would confuse the code. I think it's not quite that simple. The syscall entry looks like, roughly: fix rsp; sti; push ss; push rsp; push flags; push cs; push rip; We can get an interrupt part-way through those pushes. Maybe there's no bad place where we could get an IRQ since SS is first, [...] ... and it should be covered by the 'STI window' where the instruction following a STI is still part of the irqs-off section. [...] but this is still nasty. True. Another approach would be to set up two aliases in the GDT, so we could freely change 'ss' between them and thus store information, without possibly confusing the syscall entry/exit code. That still gets complicated by IST entries, which creates multiple positions for the 'flag'. I think I prefer a per-cpu approach over a per-task approach, since it's easier to reason about and it should still only require one instruction on entry and one instruction on exit. Yes, agreed. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Ingo Molnar mi...@kernel.org wrote: The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being executed on some other CPU not doing RT work) to intelligently wait for the remote (RT work doing) CPU to finish executing kernel code, without polling or so. it's basically a cheap IPI being inserted on the remote CPU. We need the TIF_RCU_QS callback not just to wait intelligently, but mainly to elapse a grace period, otherwise synchronize_rcu() might not ever make progress: think a SCHED_FIFO task doing some kernel work, synchronize_rcu() stumbling upon it - but the SCHED_FIFO task otherwise never scheduling and never getting any timer irqs either, and thus never entering quiescent state. (Cc:-ed Paul too, he might be interested in this as well.) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Thu, May 7, 2015 at 5:44 AM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. And once we have the flag, we can get rid of the per syscall RCU callback as well, relatively easily: with CMPXCHG (in synchronize_rcu()!) we can reliably sample whether a CPU is in user mode right now, while the syscall entry/exit path does not use any atomics, we can just use a simple MOV. Once we observe 'user mode', then we have observed quiescent state and synchronize_rcu() can continue. If we've observed kernel mode we can frob the remote task's TIF_ flags to make it go into a quiescent state publishing routine on syscall-return. How does that work? If the exit code writes the per-cpu flag and then checks TIF_whatever, we need a barrier to avoid a race where we end up in user mode without seeing the flag. No, the TIF_RCU_SYNC flag would be independent of the user-mode flag. Also, the user-mode flag would be changed from the 'kernel-mode' value to the 'user-mode' value after we do the regular TIF checks - just before we hit user-space. The TIF_RCU_QS thing is just a fancy way for synchronize_rcu() (being executed on some other CPU not doing RT work) to intelligently wait for the remote (RT work doing) CPU to finish executing kernel code, without polling or so. And yes, the TIF_RCU_QS handler on the remote CPU would have to execute an SMP barrier, before it allows an RCU quiescent state to pass. Note that the regular RCU code for quiescent state publishing already has that barrier, typically something like: this_cpu_inc(rcu_qs_ctr); That in itself is enough for synchronize_rcu(), it could do a small timing loop with short sleeps, until it waits for rcu_qs_ctr to increase. I think one or both of us is missing something or we're just talking about different things. If the nohz/RT cpu is about to enter user mode and stay there for a long time, it does: this_cpu_inc(rcu_qs_ctr); or whatever. Maybe we add: this_cpu_set(rcu_state) = IN_USER; or however it's spelled. The remote CPU wants to check our state. If this happens just before the IN_USER write or rcu_qs_ctr increment becomes visible, then it'll think we're in kernel mode. Now it either has to poll (which is fine) or try to get us to tell the RCU core when we become quiescent by setting TIF_RCU_THINGY. The problem is that I don't see how TIF_RCU_THINGY can work reliably. If the remote CPU sets it, it'll be too late and we'll still enter user mode without seeing it. If it's just an optimization, though, then it should be fine. I still feel like this is all overengineered. Shouldn't rcu_qs_ctr be enough for all of the above? --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On May 7, 2015 4:18 PM, Ingo Molnar mi...@kernel.org wrote: * Rik van Riel r...@redhat.com wrote: If, on the other hand, you're just going to remotely sample the in-memory context, that sounds good. It's the latter. If you look at /proc/pid/{stack,syscall,wchan} and other files, you will see we already have ways to determine, from in memory content, where a program is running at a certain point in time. In fact, the timer interrupt based accounting does a similar thing. It has a task examine its own in-memory state to figure out what it was doing before the timer interrupt happened. The kernel side stack pointer is probably enough to tell us whether a task is active in kernel space, on an irq stack, or (maybe) in user space. Not convinced about the latter, we may need to look at the same state the RCU code keeps track of to see what mode a task is in... I am looking at the code to see what locks we need to grab. I suspect the runqueue lock may be enough, to ensure that the task struct, and stack do not go away while we are looking at them. That will be enough, especially if you get to the task reference via rq-curr. We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. And once we have the flag, we can get rid of the per syscall RCU callback as well, relatively easily: with CMPXCHG (in synchronize_rcu()!) we can reliably sample whether a CPU is in user mode right now, while the syscall entry/exit path does not use any atomics, we can just use a simple MOV. Once we observe 'user mode', then we have observed quiescent state and synchronize_rcu() can continue. If we've observed kernel mode we can frob the remote task's TIF_ flags to make it go into a quiescent state publishing routine on syscall-return. How does that work? If the exit code writes the per-cpu flag and then checks TIF_whatever, we need a barrier to avoid a race where we end up in user mode without seeing the flag. I think the right solution is to accept that race and just have the RCU code send an IPI (or check again) if it sees too long of a period elapse in kernel mode. I think the flag should be a counter, though. That way a workload that makes lots of syscalls will be quickly detected as going through quiescent states even if it's never actually observed in user mode. The only hard requirement of this scheme from the RCU synchronization POV is that all kernel contexts that may touch RCU state need to flip this flag reliably to 'kernel mode': i.e. all irq handlers, traps, NMIs and all syscall variants need to do this. But once it's there, it's really neat. We already have to do this with the current code. I went through and checked all of the IST entries a couple versions ago. I think we need to clean up the current garbage asm first, though. See: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entryid=d22f1dca4c7c93fdd1ce754e38d71d1961c0f9ac (Very much unfinished, and it should probably be split up, but AFAICT it works. Don't hold your breath for a real version.) --Andy Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Frederic Weisbecker fweis...@gmail.com wrote: We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. So one worry with this and similar approaches of statistically detecting user mode would be the fact that on the way out to user-space we don't really destroy the previous call trace - we just pop off the stack (non-destructively), restore RIPs and are gone. We'll need that percpu flag I suspect. Note we have the context tracking state which tells where the current task is: user/system/guest. Yes, but that overhead is what I'm suggesting we get rid of, I thought Rik was trying to find a mechanism that would be independent of that? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/04/2015 05:26 AM, Paolo Bonzini wrote: > > > On 02/05/2015 07:27, Ingo Molnar wrote: >> >> - synchronize_rcu() avoids having to send an IPI by taking a >> peak at rq->curr's pt_regs::flag, and if: >> >> - the flag is 0 then it has observed a quiescent state. >> >> - the flag is 1, then it would set TIF_NOHZ and wait for a >>completion from a TIF_NOHZ callback. > > Isn't this racy? > > synchronize_rcu CPU nohz CPU > - > set flag = 0 > read flag = 0 > return to userspace > set TIF_NOHZ > > and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. Actually, the "race" in this direction is fine. If flag==0, then the nohz CPU is not accessing any RCU protected data structures, and the synhcronize_rcu CPU will not be setting TIF_NOHZ. The race is only a concern if the synchronize_rcu CPU reads flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after the nohz CPU has cleared flag (and is unable to handle RCU stuff). An atomic compare and swap prevents that issue. The other race, of the synchronize_rcu CPU reading 0, followed by the nohz CPU going into kernel space, and setting the flag to 1, should be fine. After all, this means the nohz_full CPU just went into a new RCU grace period, which is just what the synchronize_rcu CPU was waiting for. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/04/2015 05:26 AM, Paolo Bonzini wrote: > On 02/05/2015 07:27, Ingo Molnar wrote: >> >> - synchronize_rcu() avoids having to send an IPI by taking a >> peak at rq->curr's pt_regs::flag, and if: >> >> - the flag is 0 then it has observed a quiescent state. >> >> - the flag is 1, then it would set TIF_NOHZ and wait for a >>completion from a TIF_NOHZ callback. > > Isn't this racy? > > synchronize_rcu CPU nohz CPU > - > set flag = 0 > read flag = 0 > return to userspace > set TIF_NOHZ > > and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. As an aside, I suspect the remote timer sampling stuff is probably best off piggybacking on the RCU status (and PF_VCPU) of the tasks being sampled. That seems like the most reliable in-memory state we can examine, since the frame pointer may be in registers only, and not in memory to begin with. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/04/2015 05:26 AM, Paolo Bonzini wrote: > > > On 02/05/2015 07:27, Ingo Molnar wrote: >> >> - synchronize_rcu() avoids having to send an IPI by taking a >> peak at rq->curr's pt_regs::flag, and if: >> >> - the flag is 0 then it has observed a quiescent state. >> >> - the flag is 1, then it would set TIF_NOHZ and wait for a >>completion from a TIF_NOHZ callback. > > Isn't this racy? > > synchronize_rcu CPU nohz CPU > - > set flag = 0 > read flag = 0 > return to userspace > set TIF_NOHZ > > and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. If the flag and TIF_NOHZ live in the same word, we can cmpxchg and be guaranteed atomicity. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 02/05/2015 07:27, Ingo Molnar wrote: > > - synchronize_rcu() avoids having to send an IPI by taking a > peak at rq->curr's pt_regs::flag, and if: > > - the flag is 0 then it has observed a quiescent state. > > - the flag is 1, then it would set TIF_NOHZ and wait for a >completion from a TIF_NOHZ callback. Isn't this racy? synchronize_rcu CPU nohz CPU - set flag = 0 read flag = 0 return to userspace set TIF_NOHZ and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. Paolo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 02/05/2015 07:27, Ingo Molnar wrote: - synchronize_rcu() avoids having to send an IPI by taking a peak at rq-curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. Isn't this racy? synchronize_rcu CPU nohz CPU - set flag = 0 read flag = 0 return to userspace set TIF_NOHZ and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. Paolo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/04/2015 05:26 AM, Paolo Bonzini wrote: On 02/05/2015 07:27, Ingo Molnar wrote: - synchronize_rcu() avoids having to send an IPI by taking a peak at rq-curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. Isn't this racy? synchronize_rcu CPU nohz CPU - set flag = 0 read flag = 0 return to userspace set TIF_NOHZ and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. If the flag and TIF_NOHZ live in the same word, we can cmpxchg and be guaranteed atomicity. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/04/2015 05:26 AM, Paolo Bonzini wrote: On 02/05/2015 07:27, Ingo Molnar wrote: - synchronize_rcu() avoids having to send an IPI by taking a peak at rq-curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. Isn't this racy? synchronize_rcu CPU nohz CPU - set flag = 0 read flag = 0 return to userspace set TIF_NOHZ and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. As an aside, I suspect the remote timer sampling stuff is probably best off piggybacking on the RCU status (and PF_VCPU) of the tasks being sampled. That seems like the most reliable in-memory state we can examine, since the frame pointer may be in registers only, and not in memory to begin with. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/04/2015 05:26 AM, Paolo Bonzini wrote: On 02/05/2015 07:27, Ingo Molnar wrote: - synchronize_rcu() avoids having to send an IPI by taking a peak at rq-curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. Isn't this racy? synchronize_rcu CPU nohz CPU - set flag = 0 read flag = 0 return to userspace set TIF_NOHZ and there's no guarantee that TIF_NOHZ is ever processed by the nohz CPU. Actually, the race in this direction is fine. If flag==0, then the nohz CPU is not accessing any RCU protected data structures, and the synhcronize_rcu CPU will not be setting TIF_NOHZ. The race is only a concern if the synchronize_rcu CPU reads flag==1 (nohz CPU is in kernel space), and sets TIF_NOHZ after the nohz CPU has cleared flag (and is unable to handle RCU stuff). An atomic compare and swap prevents that issue. The other race, of the synchronize_rcu CPU reading 0, followed by the nohz CPU going into kernel space, and setting the flag to 1, should be fine. After all, this means the nohz_full CPU just went into a new RCU grace period, which is just what the synchronize_rcu CPU was waiting for. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/03/2015 02:24 PM, Andy Lutomirski wrote: > On Sun, May 3, 2015 at 10:30 AM, Rik van Riel wrote: >> On 05/03/2015 09:23 AM, Mike Galbraith wrote: >> >>> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. >>> >>> 100M * stat() on isolated cpu >>> >>> NO_HZ_FULL offinactive housekeepernohz_full >>> real0m14.266s 0m14.367s0m20.427s 0m27.921s >>> user0m1.756s 0m1.553s 0m1.976s 0m10.447s >>> sys 0m12.508s 0m12.769s0m18.400s 0m17.464s >>> (real) 1.000 1.0071.431 1.957 >> I'm convinced. >> >> Time to try the remote sampling of CPU use statistics, and >> lighten up the RCU overhead of context tracking. >> > > I don't understand the remote sampling proposal. Isn't the whole > point of full nohz to avoid periodically interrupting a busy CPU? If > what you have in mind is sending IPIs, then that's just as bad, right? > > If, on the other hand, you're just going to remotely sample the > in-memory context, that sounds good. It's the latter. If you look at /proc//{stack,syscall,wchan} and other files, you will see we already have ways to determine, from in memory content, where a program is running at a certain point in time. In fact, the timer interrupt based accounting does a similar thing. It has a task examine its own in-memory state to figure out what it was doing before the timer interrupt happened. The kernel side stack pointer is probably enough to tell us whether a task is active in kernel space, on an irq stack, or (maybe) in user space. Not convinced about the latter, we may need to look at the same state the RCU code keeps track of to see what mode a task is in... I am looking at the code to see what locks we need to grab. I suspect the runqueue lock may be enough, to ensure that the task struct, and stack do not go away while we are looking at them. We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Fri, May 1, 2015 at 10:27 PM, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > >> On Fri, May 1, 2015 at 12:11 PM, Rik van Riel wrote: >> > On 05/01/2015 02:40 PM, Ingo Molnar wrote: >> > >> >> Or we could do that in the syscall path with a single store of a >> >> constant flag to a location in the task struct. We have a number of >> >> natural flags that get written on syscall entry, such as: >> >> >> >> pushq_cfi $__USER_DS/* pt_regs->ss */ >> >> >> >> That goes to a constant location on the kernel stack. On return from >> >> system calls we could write 0 to that location. >> >> Huh? IRET with zero there will fault, and we can't guarantee that >> all syscalls return using sysret. [...] > > So IRET is a slowpath - I was thinking about the fastpath mainly. Slowpath or not, if we're using part of the hardware frame in pt_regs as an indication of whether we're in user or kernel mode, we don't get to arbitrarily change it to the "user" state before IRET or the IRET will fail. We could work around that with some kind of trampoline, but that's complicated. > >> [...] Also, we'd have to audit all the entries, and >> system_call_after_swapgs currently enables interrupts early enough >> that an interrupt before all the pushes will do unpredictable things >> to pt_regs. > > An irq hardware frame won't push zero to that selector value, right? > That's the only bad thing that would confuse the code. > I think it's not quite that simple. The syscall entry looks like, roughly: fix rsp; sti; push ss; push rsp; push flags; push cs; push rip; We can get an interrupt part-way through those pushes. Maybe there's no bad place where we could get an IRQ since SS is first, but this is still nasty. > > But yes, safely accessing the remote task is a complication, but with > such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So > even if we do: > >> If we went that route, I'd advocate sticking the flag in tss->sp1. >> That cacheline is unconditionally read on kernel entry already, and >> sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I >> lost track). [1] >> >> Alternatively, I'd suggest that we actually add a whole new word to >> pt_regs. > > ... we'd still have to set TIF_NOHZ and are back to square one in > terms of race complexity. > > But it's not overly complex: by taking the remote CPU's rq-lock from > synchronize_rcu() we could get a stable pointer to the currently > executing task. > > And if we do that, we might as well use the opportunity and take a > look at pt_regs as well - this is why sticking it into pt_regs might > be better. > > So I'd: > > - enlarge pt_regs by a word and stick the flag there (this > allocation is essentially free) > > - update the word from entry/exit > > - synchronize_rcu() avoids having to send an IPI by taking a > peak at rq->curr's pt_regs::flag, and if: > > - the flag is 0 then it has observed a quiescent state. > > - the flag is 1, then it would set TIF_NOHZ and wait for a >completion from a TIF_NOHZ callback. This seems complicated, especially given IST entries that put their pt_regs in different places. We get a bit of a free pass on some of the IST entries due to the IST stack switch thing, but we still need to worry about NMI and MCE at least. I think I want to understand the considerations a bit better before I express a real opinion. IIUC the remote CPU needs to reliably detect that we've had a grace period. This can potentially happen by observing us with some quiescent (user-mode, guest-mode or idle) flag set or by an interrupt incrementing a counter. In CONFIG_PREEMPT_RCU mode, the remote CPU can presumably observe that we're quiescent directly. Does this mean that, in CONFIG_PREEMPT_RCU mode, all of this is unnecessary? We may still want context tracking for the timing stats sampling, but I don't see why we need it for RCU. In non-CONFIG_PREEMPT_RCU mode, I think we need to worry about possible races that could cause us to fail to ever detect a grace period. If we have the periodic tick turned off and if the remote CPU checks our context and sees us non-quiescent, then either we need a non-racy way to get us to do work when we become quiescent or the remote CPU needs to eventually send us an IPI. I don't see any way to do the former without an RMW or barrier on every transition into a quiescent state. Even if we used ti->flags, I still think that would require us to upgrade the cli; test on exit to userspace to a test-and-set or test-and-clear, either of which is much more expensive. I think I prefer a per-cpu approach over a per-task approach, since it's easier to reason about and it should still only require one instruction on entry and one instruction on exit. FWIW, if we go the per-cpu route, what if we used a counter instead of a simple context flag? With a counter, the remote CPU could observe that we changed state and therefore went through a
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Sun, May 3, 2015 at 10:30 AM, Rik van Riel wrote: > On 05/03/2015 09:23 AM, Mike Galbraith wrote: > >> Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. >> >> 100M * stat() on isolated cpu >> >> NO_HZ_FULL offinactive housekeepernohz_full >> real0m14.266s 0m14.367s0m20.427s 0m27.921s >> user0m1.756s 0m1.553s 0m1.976s 0m10.447s >> sys 0m12.508s 0m12.769s0m18.400s 0m17.464s >> (real) 1.000 1.0071.431 1.957 >>1.000 1.000 >> >> real 0m20.423s 0m27.930s +rik 1,2 >> user 0m2.072s 0m10.450s >> sys 0m18.304s 0m17.471s >> vs off1.431 1.957 >> vs prev 1.000 1.000 >> >> real 0m20.256s 0m27.803s +paolo 1,2 (2 >> missing prototypes) >> user 0m1.884s 0m10.551s >> sys 0m18.353s 0m17.242s >> vs off1.419 1.948 >> vs prev.991 .995 >> >> real 0m19.122s 0m26.946s +rik 3 >> user 0m1.896s 0m10.292s >> sys 0m17.198s 0m16.644s >> vs off1.340 1.888 >> vs prev .944.969 > > I'm convinced. > > Time to try the remote sampling of CPU use statistics, and > lighten up the RCU overhead of context tracking. > I don't understand the remote sampling proposal. Isn't the whole point of full nohz to avoid periodically interrupting a busy CPU? If what you have in mind is sending IPIs, then that's just as bad, right? If, on the other hand, you're just going to remotely sample the in-memory context, that sounds good. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/03/2015 09:23 AM, Mike Galbraith wrote: > Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. > > 100M * stat() on isolated cpu > > NO_HZ_FULL offinactive housekeepernohz_full > real0m14.266s 0m14.367s0m20.427s 0m27.921s > user0m1.756s 0m1.553s 0m1.976s 0m10.447s > sys 0m12.508s 0m12.769s0m18.400s 0m17.464s > (real) 1.000 1.0071.431 1.957 >1.000 1.000 > > real 0m20.423s 0m27.930s +rik 1,2 > user 0m2.072s 0m10.450s > sys 0m18.304s 0m17.471s > vs off1.431 1.957 > vs prev 1.000 1.000 > > real 0m20.256s 0m27.803s +paolo 1,2 (2 > missing prototypes) > user 0m1.884s 0m10.551s > sys 0m18.353s 0m17.242s > vs off1.419 1.948 > vs prev.991 .995 > > real 0m19.122s 0m26.946s +rik 3 > user 0m1.896s 0m10.292s > sys 0m17.198s 0m16.644s > vs off1.340 1.888 > vs prev .944.969 I'm convinced. Time to try the remote sampling of CPU use statistics, and lighten up the RCU overhead of context tracking. I should be able to start on the remote sampling of CPU use statistics this week. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Fri, 2015-05-01 at 11:20 -0400, Rik van Riel wrote: > On 05/01/2015 02:40 AM, Ingo Molnar wrote: > > >> This patch builds on top of these patches by Paolo: > >> https://lkml.org/lkml/2015/4/28/188 > >> https://lkml.org/lkml/2015/4/29/139 > >> > >> Together with this patch I posted earlier this week, the syscall path > >> on a nohz_full cpu seems to be about 10% faster. > >> https://lkml.org/lkml/2015/4/24/394 > >> > >> My test is a simple microbenchmark that calls getpriority() in a loop > >> 10 million times: > >> > >>run timesystem time > >> vanilla5.49s 2.08s > >> __acct patch 5.21s 1.92s > >> both patches 4.88s 1.71s > > > > Just curious, what are the numbers if you don't have context tracking > > enabled, i.e. without nohz_full? > > > > I.e. what's the baseline we are talking about? > > It's an astounding difference. This is not a kernel without nohz_full, > just a CPU without nohz_full running the same kernel I tested with > yesterday: > > run timesystem time > vanilla 5.49s 2.08s > __acct patch 5.21s 1.92s > both patches 4.88s 1.71s > CPU w/o nohz 3.12s 1.63s<-- your numbers, mostly Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. 100M * stat() on isolated cpu NO_HZ_FULL offinactive housekeepernohz_full real0m14.266s 0m14.367s0m20.427s 0m27.921s user0m1.756s 0m1.553s 0m1.976s 0m10.447s sys 0m12.508s 0m12.769s0m18.400s 0m17.464s (real) 1.000 1.0071.431 1.957 1.000 1.000 real 0m20.423s 0m27.930s +rik 1,2 user 0m2.072s 0m10.450s sys 0m18.304s 0m17.471s vs off1.431 1.957 vs prev 1.000 1.000 real 0m20.256s 0m27.803s +paolo 1,2 (2 missing prototypes) user 0m1.884s 0m10.551s sys 0m18.353s 0m17.242s vs off1.419 1.948 vs prev.991 .995 real 0m19.122s 0m26.946s +rik 3 user 0m1.896s 0m10.292s sys 0m17.198s 0m16.644s vs off1.340 1.888 vs prev .944.969 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Fri, 2015-05-01 at 11:20 -0400, Rik van Riel wrote: On 05/01/2015 02:40 AM, Ingo Molnar wrote: This patch builds on top of these patches by Paolo: https://lkml.org/lkml/2015/4/28/188 https://lkml.org/lkml/2015/4/29/139 Together with this patch I posted earlier this week, the syscall path on a nohz_full cpu seems to be about 10% faster. https://lkml.org/lkml/2015/4/24/394 My test is a simple microbenchmark that calls getpriority() in a loop 10 million times: run timesystem time vanilla5.49s 2.08s __acct patch 5.21s 1.92s both patches 4.88s 1.71s Just curious, what are the numbers if you don't have context tracking enabled, i.e. without nohz_full? I.e. what's the baseline we are talking about? It's an astounding difference. This is not a kernel without nohz_full, just a CPU without nohz_full running the same kernel I tested with yesterday: run timesystem time vanilla 5.49s 2.08s __acct patch 5.21s 1.92s both patches 4.88s 1.71s CPU w/o nohz 3.12s 1.63s-- your numbers, mostly Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. 100M * stat() on isolated cpu NO_HZ_FULL offinactive housekeepernohz_full real0m14.266s 0m14.367s0m20.427s 0m27.921s user0m1.756s 0m1.553s 0m1.976s 0m10.447s sys 0m12.508s 0m12.769s0m18.400s 0m17.464s (real) 1.000 1.0071.431 1.957 1.000 1.000 real 0m20.423s 0m27.930s +rik 1,2 user 0m2.072s 0m10.450s sys 0m18.304s 0m17.471s vs off1.431 1.957 vs prev 1.000 1.000 real 0m20.256s 0m27.803s +paolo 1,2 (2 missing prototypes) user 0m1.884s 0m10.551s sys 0m18.353s 0m17.242s vs off1.419 1.948 vs prev.991 .995 real 0m19.122s 0m26.946s +rik 3 user 0m1.896s 0m10.292s sys 0m17.198s 0m16.644s vs off1.340 1.888 vs prev .944.969 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/03/2015 02:24 PM, Andy Lutomirski wrote: On Sun, May 3, 2015 at 10:30 AM, Rik van Riel r...@redhat.com wrote: On 05/03/2015 09:23 AM, Mike Galbraith wrote: Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. 100M * stat() on isolated cpu NO_HZ_FULL offinactive housekeepernohz_full real0m14.266s 0m14.367s0m20.427s 0m27.921s user0m1.756s 0m1.553s 0m1.976s 0m10.447s sys 0m12.508s 0m12.769s0m18.400s 0m17.464s (real) 1.000 1.0071.431 1.957 I'm convinced. Time to try the remote sampling of CPU use statistics, and lighten up the RCU overhead of context tracking. I don't understand the remote sampling proposal. Isn't the whole point of full nohz to avoid periodically interrupting a busy CPU? If what you have in mind is sending IPIs, then that's just as bad, right? If, on the other hand, you're just going to remotely sample the in-memory context, that sounds good. It's the latter. If you look at /proc/pid/{stack,syscall,wchan} and other files, you will see we already have ways to determine, from in memory content, where a program is running at a certain point in time. In fact, the timer interrupt based accounting does a similar thing. It has a task examine its own in-memory state to figure out what it was doing before the timer interrupt happened. The kernel side stack pointer is probably enough to tell us whether a task is active in kernel space, on an irq stack, or (maybe) in user space. Not convinced about the latter, we may need to look at the same state the RCU code keeps track of to see what mode a task is in... I am looking at the code to see what locks we need to grab. I suspect the runqueue lock may be enough, to ensure that the task struct, and stack do not go away while we are looking at them. We cannot take the lock_trace(task) from irq context, and we probably do not need to anyway, since we do not care about a precise stack trace for the task. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/03/2015 09:23 AM, Mike Galbraith wrote: Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. 100M * stat() on isolated cpu NO_HZ_FULL offinactive housekeepernohz_full real0m14.266s 0m14.367s0m20.427s 0m27.921s user0m1.756s 0m1.553s 0m1.976s 0m10.447s sys 0m12.508s 0m12.769s0m18.400s 0m17.464s (real) 1.000 1.0071.431 1.957 1.000 1.000 real 0m20.423s 0m27.930s +rik 1,2 user 0m2.072s 0m10.450s sys 0m18.304s 0m17.471s vs off1.431 1.957 vs prev 1.000 1.000 real 0m20.256s 0m27.803s +paolo 1,2 (2 missing prototypes) user 0m1.884s 0m10.551s sys 0m18.353s 0m17.242s vs off1.419 1.948 vs prev.991 .995 real 0m19.122s 0m26.946s +rik 3 user 0m1.896s 0m10.292s sys 0m17.198s 0m16.644s vs off1.340 1.888 vs prev .944.969 I'm convinced. Time to try the remote sampling of CPU use statistics, and lighten up the RCU overhead of context tracking. I should be able to start on the remote sampling of CPU use statistics this week. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Fri, May 1, 2015 at 10:27 PM, Ingo Molnar mi...@kernel.org wrote: * Andy Lutomirski l...@amacapital.net wrote: On Fri, May 1, 2015 at 12:11 PM, Rik van Riel r...@redhat.com wrote: On 05/01/2015 02:40 PM, Ingo Molnar wrote: Or we could do that in the syscall path with a single store of a constant flag to a location in the task struct. We have a number of natural flags that get written on syscall entry, such as: pushq_cfi $__USER_DS/* pt_regs-ss */ That goes to a constant location on the kernel stack. On return from system calls we could write 0 to that location. Huh? IRET with zero there will fault, and we can't guarantee that all syscalls return using sysret. [...] So IRET is a slowpath - I was thinking about the fastpath mainly. Slowpath or not, if we're using part of the hardware frame in pt_regs as an indication of whether we're in user or kernel mode, we don't get to arbitrarily change it to the user state before IRET or the IRET will fail. We could work around that with some kind of trampoline, but that's complicated. [...] Also, we'd have to audit all the entries, and system_call_after_swapgs currently enables interrupts early enough that an interrupt before all the pushes will do unpredictable things to pt_regs. An irq hardware frame won't push zero to that selector value, right? That's the only bad thing that would confuse the code. I think it's not quite that simple. The syscall entry looks like, roughly: fix rsp; sti; push ss; push rsp; push flags; push cs; push rip; We can get an interrupt part-way through those pushes. Maybe there's no bad place where we could get an IRQ since SS is first, but this is still nasty. But yes, safely accessing the remote task is a complication, but with such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So even if we do: If we went that route, I'd advocate sticking the flag in tss-sp1. That cacheline is unconditionally read on kernel entry already, and sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I lost track). [1] Alternatively, I'd suggest that we actually add a whole new word to pt_regs. ... we'd still have to set TIF_NOHZ and are back to square one in terms of race complexity. But it's not overly complex: by taking the remote CPU's rq-lock from synchronize_rcu() we could get a stable pointer to the currently executing task. And if we do that, we might as well use the opportunity and take a look at pt_regs as well - this is why sticking it into pt_regs might be better. So I'd: - enlarge pt_regs by a word and stick the flag there (this allocation is essentially free) - update the word from entry/exit - synchronize_rcu() avoids having to send an IPI by taking a peak at rq-curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. This seems complicated, especially given IST entries that put their pt_regs in different places. We get a bit of a free pass on some of the IST entries due to the IST stack switch thing, but we still need to worry about NMI and MCE at least. I think I want to understand the considerations a bit better before I express a real opinion. IIUC the remote CPU needs to reliably detect that we've had a grace period. This can potentially happen by observing us with some quiescent (user-mode, guest-mode or idle) flag set or by an interrupt incrementing a counter. In CONFIG_PREEMPT_RCU mode, the remote CPU can presumably observe that we're quiescent directly. Does this mean that, in CONFIG_PREEMPT_RCU mode, all of this is unnecessary? We may still want context tracking for the timing stats sampling, but I don't see why we need it for RCU. In non-CONFIG_PREEMPT_RCU mode, I think we need to worry about possible races that could cause us to fail to ever detect a grace period. If we have the periodic tick turned off and if the remote CPU checks our context and sees us non-quiescent, then either we need a non-racy way to get us to do work when we become quiescent or the remote CPU needs to eventually send us an IPI. I don't see any way to do the former without an RMW or barrier on every transition into a quiescent state. Even if we used ti-flags, I still think that would require us to upgrade the cli; test on exit to userspace to a test-and-set or test-and-clear, either of which is much more expensive. I think I prefer a per-cpu approach over a per-task approach, since it's easier to reason about and it should still only require one instruction on entry and one instruction on exit. FWIW, if we go the per-cpu route, what if we used a counter instead of a simple context flag? With a counter, the remote CPU could observe that we changed state and therefore went through a grace period even if the remote CPU never actually observes us in
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Sun, May 3, 2015 at 10:30 AM, Rik van Riel r...@redhat.com wrote: On 05/03/2015 09:23 AM, Mike Galbraith wrote: Below are v4.1-rc1-172-g6c3c1eb3c35e + patches measurements. 100M * stat() on isolated cpu NO_HZ_FULL offinactive housekeepernohz_full real0m14.266s 0m14.367s0m20.427s 0m27.921s user0m1.756s 0m1.553s 0m1.976s 0m10.447s sys 0m12.508s 0m12.769s0m18.400s 0m17.464s (real) 1.000 1.0071.431 1.957 1.000 1.000 real 0m20.423s 0m27.930s +rik 1,2 user 0m2.072s 0m10.450s sys 0m18.304s 0m17.471s vs off1.431 1.957 vs prev 1.000 1.000 real 0m20.256s 0m27.803s +paolo 1,2 (2 missing prototypes) user 0m1.884s 0m10.551s sys 0m18.353s 0m17.242s vs off1.419 1.948 vs prev.991 .995 real 0m19.122s 0m26.946s +rik 3 user 0m1.896s 0m10.292s sys 0m17.198s 0m16.644s vs off1.340 1.888 vs prev .944.969 I'm convinced. Time to try the remote sampling of CPU use statistics, and lighten up the RCU overhead of context tracking. I don't understand the remote sampling proposal. Isn't the whole point of full nohz to avoid periodically interrupting a busy CPU? If what you have in mind is sending IPIs, then that's just as bad, right? If, on the other hand, you're just going to remotely sample the in-memory context, that sounds good. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/02/2015 01:27 AM, Ingo Molnar wrote: > Regarding the user/kernel execution time split measurement: > > 1) the same flag could be used to sample a remote CPU's statistics > from another CPU and update the stats in the currently executing task. > As long as there's at least one non-nohz-full CPU, this would work. Or > are there systems were all CPUs are nohz-full? On a NO_HZ_FULL system, you need at least one CPU to execute RCU callbacks, and do other system things like that, so there is at least one CPU that is not nohz_full. On NUMA systems, I could even see the sane option being one CPU that is not isolated or nohz_full per NUMA node, so we have a place to route irqs, etc... > 2) Alternatively we could just drive user/kernel split statistics from > context switches, which would be inaccurate if the workload is > SCHED_FIFO that only rarely context switches. > > How does this sound? I think option (1) sounds nicer :) What locks do we need, besides the runqueue lock to make sure the task does not go away, and later the task's vtime_lock to update its time statistics? Do we even need the lock_trace(task) as taken in proc_pid_stack(), since all we care is whether or not the thing is in kernel, user, or guest mode? For guest mode, we set a flag in the task struct somewhere, that part is easy. It also looks like dump_trace() can distinguish between normal, exception, and irq stacks. Not sure how fancy we need to get... -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/02/2015 01:27 AM, Ingo Molnar wrote: Regarding the user/kernel execution time split measurement: 1) the same flag could be used to sample a remote CPU's statistics from another CPU and update the stats in the currently executing task. As long as there's at least one non-nohz-full CPU, this would work. Or are there systems were all CPUs are nohz-full? On a NO_HZ_FULL system, you need at least one CPU to execute RCU callbacks, and do other system things like that, so there is at least one CPU that is not nohz_full. On NUMA systems, I could even see the sane option being one CPU that is not isolated or nohz_full per NUMA node, so we have a place to route irqs, etc... 2) Alternatively we could just drive user/kernel split statistics from context switches, which would be inaccurate if the workload is SCHED_FIFO that only rarely context switches. How does this sound? I think option (1) sounds nicer :) What locks do we need, besides the runqueue lock to make sure the task does not go away, and later the task's vtime_lock to update its time statistics? Do we even need the lock_trace(task) as taken in proc_pid_stack(), since all we care is whether or not the thing is in kernel, user, or guest mode? For guest mode, we set a flag in the task struct somewhere, that part is easy. It also looks like dump_trace() can distinguish between normal, exception, and irq stacks. Not sure how fancy we need to get... -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > On Fri, May 1, 2015 at 12:11 PM, Rik van Riel wrote: > > On 05/01/2015 02:40 PM, Ingo Molnar wrote: > > > >> Or we could do that in the syscall path with a single store of a > >> constant flag to a location in the task struct. We have a number of > >> natural flags that get written on syscall entry, such as: > >> > >> pushq_cfi $__USER_DS/* pt_regs->ss */ > >> > >> That goes to a constant location on the kernel stack. On return from > >> system calls we could write 0 to that location. > > Huh? IRET with zero there will fault, and we can't guarantee that > all syscalls return using sysret. [...] So IRET is a slowpath - I was thinking about the fastpath mainly. > [...] Also, we'd have to audit all the entries, and > system_call_after_swapgs currently enables interrupts early enough > that an interrupt before all the pushes will do unpredictable things > to pt_regs. An irq hardware frame won't push zero to that selector value, right? That's the only bad thing that would confuse the code. > We could further abuse orig_ax, but that would require a lot of > auditing. Honestly, though, I think keeping a flag in an > otherwise-hot cache line is a better bet. [...] That would work too, at the cost of one more instruction, as now we'd have to both set and clear it. > [...] Also, making it per-cpu instead of per-task will probably be > easier on the RCU code, since otherwise the RCU code will have to do > some kind of synchronization (even if it's a wait-free probe of the > rq lock or similar) to figure out which pt_regs to look at. So the synchronize_rcu() part is an even slower slow path, in comparison with system call entry overhead. But yes, safely accessing the remote task is a complication, but with such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So even if we do: > If we went that route, I'd advocate sticking the flag in tss->sp1. > That cacheline is unconditionally read on kernel entry already, and > sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I > lost track). [1] > > Alternatively, I'd suggest that we actually add a whole new word to > pt_regs. ... we'd still have to set TIF_NOHZ and are back to square one in terms of race complexity. But it's not overly complex: by taking the remote CPU's rq-lock from synchronize_rcu() we could get a stable pointer to the currently executing task. And if we do that, we might as well use the opportunity and take a look at pt_regs as well - this is why sticking it into pt_regs might be better. So I'd: - enlarge pt_regs by a word and stick the flag there (this allocation is essentially free) - update the word from entry/exit - synchronize_rcu() avoids having to send an IPI by taking a peak at rq->curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. synchronize_rcu() often involves waiting for (timer tick driven) grace periods anyway, so this is a relatively fast solution - and it would limit the overhead to 2 instructions. On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled) we could patch out those two instructions into NOPs, which would be eliminated in the decoder. Regarding the user/kernel execution time split measurement: 1) the same flag could be used to sample a remote CPU's statistics from another CPU and update the stats in the currently executing task. As long as there's at least one non-nohz-full CPU, this would work. Or are there systems were all CPUs are nohz-full? 2) Alternatively we could just drive user/kernel split statistics from context switches, which would be inaccurate if the workload is SCHED_FIFO that only rarely context switches. How does this sound? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote: > On 05/01/2015 12:34 PM, Ingo Molnar wrote: > > > > * Rik van Riel wrote: > > > >>> I can understand people running hard-RT workloads not wanting to > >>> see the overhead of a timer tick or a scheduler tick with variable > >>> (and occasionally heavy) work done in IRQ context, but the jitter > >>> caused by a single trivial IPI with constant work should be very, > >>> very low and constant. > >> > >> Not if the realtime workload is running inside a KVM guest. > > > > I don't buy this: > > > >> At that point an IPI, either on the host or in the guest, involves a > >> full VMEXIT & VMENTER cycle. > > > > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 > > usec on recent hardware, and I bet it will get better with time. > > > > I'm not aware of any hard-RT workload that cannot take 1 usec > > latencies. > > Now think about doing this kind of IPI from inside a guest, > to another VCPU on the same guest. > > Now you are looking at VMEXIT/VMENTER on the first VCPU, > plus the cost of the IPI on the host, plus the cost of > the emulation layer, plus VMEXIT/VMENTER on the second > VCPU to trigger the IPI work, and possibly a second > VMEXIT/VMENTER for IPI completion. > > I suspect it would be better to do RCU callback offload > in some other way. I don't get it. How the heck do people manage to talk about realtime in virtual boxen, and not at least crack a smile. Real, virtual, real, virtual... what's wrong with this picture? Why is virtual realtime not an oxymoron? I did that for grins once, and it was either really funny, or really sad, not sure which... but it did not look really really useful. -Mike -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Fri, May 1, 2015 at 12:11 PM, Rik van Riel wrote: > On 05/01/2015 02:40 PM, Ingo Molnar wrote: > >> Or we could do that in the syscall path with a single store of a >> constant flag to a location in the task struct. We have a number of >> natural flags that get written on syscall entry, such as: >> >> pushq_cfi $__USER_DS/* pt_regs->ss */ >> >> That goes to a constant location on the kernel stack. On return from >> system calls we could write 0 to that location. Huh? IRET with zero there will fault, and we can't guarantee that all syscalls return using sysret. Also, we'd have to audit all the entries, and system_call_after_swapgs currently enables interrupts early enough that an interrupt before all the pushes will do unpredictable things to pt_regs. We could further abuse orig_ax, but that would require a lot of auditing. Honestly, though, I think keeping a flag in an otherwise-hot cache line is a better bet. Also, making it per-cpu instead of per-task will probably be easier on the RCU code, since otherwise the RCU code will have to do some kind of synchronization (even if it's a wait-free probe of the rq lock or similar) to figure out which pt_regs to look at. If we went that route, I'd advocate sticking the flag in tss->sp1. That cacheline is unconditionally read on kernel entry already, and sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I lost track). [1] Alternatively, I'd suggest that we actually add a whole new word to pt_regs. [1] It's not unconditionally accessed yet, but it wil be once Denys' latest patches are in. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 02:40 PM, Ingo Molnar wrote: > Or we could do that in the syscall path with a single store of a > constant flag to a location in the task struct. We have a number of > natural flags that get written on syscall entry, such as: > > pushq_cfi $__USER_DS/* pt_regs->ss */ > > That goes to a constant location on the kernel stack. On return from > system calls we could write 0 to that location. > > So the remote CPU would have to do a read of this location. There are > two cases: > > - If it's 0, then it has observed quiescent state on that CPU. (It >does not have to be atomics anymore, as we'd only observe the value >and MESI coherency takes care of it.) That should do the trick. > - If it's not 0 then the remote CPU is not executing user-space code >and we can install (remotely) a TIF_NOHZ flag in it and expect it >to process it either on return to user-space or on a context >switch. I may have to think about this a little more, but it seems like it should work. Can we use a separate byte in the flags word for flags that can get set remotely, so we can do stores and clearing of local-only flags without atomic instructions? > This way, unless I'm missing something, reduces the overhead to a > single store to a hot cacheline on return-to-userspace - which > instruction if we place it well might as well be close to zero cost. > No syscall entry cost. Slow-return cost only in the (rare) case of > someone using synchronize_rcu(). I think that should take care of the RCU aspect of nohz_full. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > On 05/01/2015 12:34 PM, Ingo Molnar wrote: > > > > * Rik van Riel wrote: > > > >>> I can understand people running hard-RT workloads not wanting to > >>> see the overhead of a timer tick or a scheduler tick with variable > >>> (and occasionally heavy) work done in IRQ context, but the jitter > >>> caused by a single trivial IPI with constant work should be very, > >>> very low and constant. > >> > >> Not if the realtime workload is running inside a KVM guest. > > > > I don't buy this: > > > >> At that point an IPI, either on the host or in the guest, involves a > >> full VMEXIT & VMENTER cycle. > > > > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 > > usec on recent hardware, and I bet it will get better with time. > > > > I'm not aware of any hard-RT workload that cannot take 1 usec > > latencies. > > Now think about doing this kind of IPI from inside a guest, to > another VCPU on the same guest. > > Now you are looking at VMEXIT/VMENTER on the first VCPU, Does it matter? It's not the hard-RT CPU, and this is a slowpath of synchronize_rcu(). > plus the cost of the IPI on the host, plus the cost of the emulation > layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI > work, and possibly a second VMEXIT/VMENTER for IPI completion. Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies. > I suspect it would be better to do RCU callback offload in some > other way. Well, it's not just about callback offload, but it's about the basic synchronization guarantee of synchronize_rcu(): that all RCU read-side critical sections have finished executing after the call returns. So even if a nohz-full CPU never actually queues a callback, it needs to stop using resources that a synchronize_rcu() caller expects it to stop using. We can do that only if we know it in an SMP-coherent way that the remote CPU is not in an rcu_read_lock() section. Sending an IPI is one way to achieve that. Or we could do that in the syscall path with a single store of a constant flag to a location in the task struct. We have a number of natural flags that get written on syscall entry, such as: pushq_cfi $__USER_DS/* pt_regs->ss */ That goes to a constant location on the kernel stack. On return from system calls we could write 0 to that location. So the remote CPU would have to do a read of this location. There are two cases: - If it's 0, then it has observed quiescent state on that CPU. (It does not have to be atomics anymore, as we'd only observe the value and MESI coherency takes care of it.) - If it's not 0 then the remote CPU is not executing user-space code and we can install (remotely) a TIF_NOHZ flag in it and expect it to process it either on return to user-space or on a context switch. This way, unless I'm missing something, reduces the overhead to a single store to a hot cacheline on return-to-userspace - which instruction if we place it well might as well be close to zero cost. No syscall entry cost. Slow-return cost only in the (rare) case of someone using synchronize_rcu(). Hm? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 12:34 PM, Ingo Molnar wrote: > > * Rik van Riel wrote: > >>> I can understand people running hard-RT workloads not wanting to >>> see the overhead of a timer tick or a scheduler tick with variable >>> (and occasionally heavy) work done in IRQ context, but the jitter >>> caused by a single trivial IPI with constant work should be very, >>> very low and constant. >> >> Not if the realtime workload is running inside a KVM guest. > > I don't buy this: > >> At that point an IPI, either on the host or in the guest, involves a >> full VMEXIT & VMENTER cycle. > > So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 > usec on recent hardware, and I bet it will get better with time. > > I'm not aware of any hard-RT workload that cannot take 1 usec > latencies. Now think about doing this kind of IPI from inside a guest, to another VCPU on the same guest. Now you are looking at VMEXIT/VMENTER on the first VCPU, plus the cost of the IPI on the host, plus the cost of the emulation layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI work, and possibly a second VMEXIT/VMENTER for IPI completion. I suspect it would be better to do RCU callback offload in some other way. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > > I.e. much of what we do today, except that we could skip variable > > work such as the scheduler tick or (unforced) RCU processing like > > the RCU softirq work. > > Any ideas how we could avoid that sampling timer interrupt latency > stacking up when dealing with both guest and host? Well, it would be host_latency+guest_latency worst-case, right? That should still be bounded, as long as both guest and host is running nohz-full mode. Also, technically when the host gets such an IRQ, the guest is interrupted as well. So the host could do the guest's work and pass through the result via paravirt or so. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 01:12 PM, Ingo Molnar wrote: > > * Rik van Riel wrote: > >> On 05/01/2015 12:45 PM, Ingo Molnar wrote: >>> >>> * Rik van Riel wrote: >>> On 05/01/2015 12:37 PM, Ingo Molnar wrote: > Also note that this bit in context_tracking_enter(): > > if (state == CONTEXT_USER) { > trace_user_enter(0); > vtime_user_enter(current); > } > > > is related to precise time measurement of user/kernel execution > times, it's not needed by the scheduler at all, it's just exported > to tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? >>> >>> So I really think per syscall overhead is the wrong thing to do for >>> anything that a common distro enables. >>> >>> I see very little use for such precise, high-freq measurements on >>> normal systems - and abnormal systems could enable it dynamically just >>> like they can enable syscall auditing. >>> >>> I.e. I don't mind the occasional crazy piece of code, as long as it >>> does not hurt the innocent. >> >> Then how should/could we keep a rough idea of user / system / guest >> time when running without a periodic timer tick? > > So I'd split the timer tick into two parts: just the constant-work > sampling bit that doesn't do much, and the variable-work part which > gets essentially shut off when the timeout is far into the future. > > Then we could do IRQ driven sampling without introducing variable > amount jitter into hard-RT execution time. > > I.e. much of what we do today, except that we could skip variable work > such as the scheduler tick or (unforced) RCU processing like the RCU > softirq work. Any ideas how we could avoid that sampling timer interrupt latency stacking up when dealing with both guest and host? -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > On 05/01/2015 12:45 PM, Ingo Molnar wrote: > > > > * Rik van Riel wrote: > > > >> On 05/01/2015 12:37 PM, Ingo Molnar wrote: > >> > >>> Also note that this bit in context_tracking_enter(): > >>> > >>> if (state == CONTEXT_USER) { > >>> trace_user_enter(0); > >>> vtime_user_enter(current); > >>> } > >>> > >>> > >>> is related to precise time measurement of user/kernel execution > >>> times, it's not needed by the scheduler at all, it's just exported > >>> to tooling. It's not fundamental to the scheduler. > >> > >> Any objections to the idea from the other thread to simply keep the > >> time accumulating in buffers in local_clock() units, and only update > >> the task vtime once a second or so? > > > > So I really think per syscall overhead is the wrong thing to do for > > anything that a common distro enables. > > > > I see very little use for such precise, high-freq measurements on > > normal systems - and abnormal systems could enable it dynamically just > > like they can enable syscall auditing. > > > > I.e. I don't mind the occasional crazy piece of code, as long as it > > does not hurt the innocent. > > Then how should/could we keep a rough idea of user / system / guest > time when running without a periodic timer tick? So I'd split the timer tick into two parts: just the constant-work sampling bit that doesn't do much, and the variable-work part which gets essentially shut off when the timeout is far into the future. Then we could do IRQ driven sampling without introducing variable amount jitter into hard-RT execution time. I.e. much of what we do today, except that we could skip variable work such as the scheduler tick or (unforced) RCU processing like the RCU softirq work. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 12:45 PM, Ingo Molnar wrote: > > * Rik van Riel wrote: > >> On 05/01/2015 12:37 PM, Ingo Molnar wrote: >> >>> Also note that this bit in context_tracking_enter(): >>> >>> if (state == CONTEXT_USER) { >>> trace_user_enter(0); >>> vtime_user_enter(current); >>> } >>> >>> >>> is related to precise time measurement of user/kernel execution >>> times, it's not needed by the scheduler at all, it's just exported >>> to tooling. It's not fundamental to the scheduler. >> >> Any objections to the idea from the other thread to simply keep the >> time accumulating in buffers in local_clock() units, and only update >> the task vtime once a second or so? > > So I really think per syscall overhead is the wrong thing to do for > anything that a common distro enables. > > I see very little use for such precise, high-freq measurements on > normal systems - and abnormal systems could enable it dynamically just > like they can enable syscall auditing. > > I.e. I don't mind the occasional crazy piece of code, as long as it > does not hurt the innocent. Then how should/could we keep a rough idea of user / system / guest time when running without a periodic timer tick? These statistics are useful, and gathering them in a low impact way (local_clock read + variable read + subtraction + addition + store) may be acceptable overhead. This is a few function calls, one tsc read, a possible cache line miss, and an unlikely branch. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > On 05/01/2015 12:37 PM, Ingo Molnar wrote: > > > Also note that this bit in context_tracking_enter(): > > > > if (state == CONTEXT_USER) { > > trace_user_enter(0); > > vtime_user_enter(current); > > } > > > > > > is related to precise time measurement of user/kernel execution > > times, it's not needed by the scheduler at all, it's just exported > > to tooling. It's not fundamental to the scheduler. > > Any objections to the idea from the other thread to simply keep the > time accumulating in buffers in local_clock() units, and only update > the task vtime once a second or so? So I really think per syscall overhead is the wrong thing to do for anything that a common distro enables. I see very little use for such precise, high-freq measurements on normal systems - and abnormal systems could enable it dynamically just like they can enable syscall auditing. I.e. I don't mind the occasional crazy piece of code, as long as it does not hurt the innocent. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 12:37 PM, Ingo Molnar wrote: > Also note that this bit in context_tracking_enter(): > > if (state == CONTEXT_USER) { > trace_user_enter(0); > vtime_user_enter(current); > } > > > is related to precise time measurement of user/kernel execution times, > it's not needed by the scheduler at all, it's just exported to > tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Ingo Molnar wrote: > > 2. To suppress the timing tick, we need to get some timing for, > > um, the scheduler? I wasn't really sure about this one. > > So we have variable timeslice timers for the scheduler implemented, > they are off by default but they worked last someone tried them. See > the 'HRTICK' scheduler feature. > > And for SCHED_FIFO that timeout can be 'never' - i.e. essentially > stopping the scheduler tick. (within reason.) Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > > I can understand people running hard-RT workloads not wanting to > > see the overhead of a timer tick or a scheduler tick with variable > > (and occasionally heavy) work done in IRQ context, but the jitter > > caused by a single trivial IPI with constant work should be very, > > very low and constant. > > Not if the realtime workload is running inside a KVM guest. I don't buy this: > At that point an IPI, either on the host or in the guest, involves a > full VMEXIT & VMENTER cycle. So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 usec on recent hardware, and I bet it will get better with time. I'm not aware of any hard-RT workload that cannot take 1 usec latencies. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > On 05/01/2015 12:03 PM, Andy Lutomirski wrote: > > > The last time I asked, the impression I got was that we needed two things: > > > > 1. We can't pluck things from the RCU list without knowing whether the > > CPU is in an RCU read-side critical section, and we can't know that > > unless we have regular grade periods or we know that the CPU is idle. > > To make the CPU detectably idle, we need to set a bit somewhere. > > More than that. We also need a way for another CPU to identify the > callbacks they could run for us, without confusing them with new > callbacks queued after we transitioned back from USER to KERNEL > context. That's easy: a simple, constant-work IPI sent to nohz-full CPUs could just dequeue these callbacks, if it sees that we are still in user-mode. It's super easy because such an IPI would essentially be equivalent to a system call context from RCU's POV, if it uses this test: if (user_mode(regs)) { ... pluck RCU callbacks ... } That way we can have lots of overhead removed from the syscall call path ... Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 12:21 PM, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > >>> So what's the point? Why not remove this big source of overhead >>> altogether? >> >> The last time I asked, the impression I got was that we needed two >> things: >> >> 1. We can't pluck things from the RCU list without knowing whether >> the CPU is in an RCU read-side critical section, and we can't know >> that unless we have regular grade periods or we know that the CPU is >> idle. To make the CPU detectably idle, we need to set a bit >> somewhere. > > 'Idle' as in 'executing pure user-space mode, without entering the > kernel and possibly doing an rcu_read_lock()', right? > > So we don't have to test it from the remote CPU: we could probe such > CPUs via a single low-overhead IPI. I'd much rather push such overhead > to sync_rcu() than to the syscall entry code! > > I can understand people running hard-RT workloads not wanting to see > the overhead of a timer tick or a scheduler tick with variable (and > occasionally heavy) work done in IRQ context, but the jitter caused by > a single trivial IPI with constant work should be very, very low and > constant. Not if the realtime workload is running inside a KVM guest. At that point an IPI, either on the host or in the guest, involves a full VMEXIT & VMENTER cycle. I suspect it would be easy enough at user_enter() or guest_enter() time to: 1) flip the bit 2) check to see if we have any callbacks queued on our on queue, and 3) if so, move the callbacks from that queue to the "another CPU can take these" queue At that point, another CPU wanting to do an RCU grace period advance can: 1) skip trying to advance the grace period on the CPU that is in userspace or guest mode, and 2) take the "for somebody else" callbacks of that CPU, and run them This does not seem overly complicated. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 12:03 PM, Andy Lutomirski wrote: > The last time I asked, the impression I got was that we needed two things: > > 1. We can't pluck things from the RCU list without knowing whether the > CPU is in an RCU read-side critical section, and we can't know that > unless we have regular grade periods or we know that the CPU is idle. > To make the CPU detectably idle, we need to set a bit somewhere. More than that. We also need a way for another CPU to identify the callbacks they could run for us, without confusing them with new callbacks queued after we transitioned back from USER to KERNEL context. > 2. To suppress the timing tick, we need to get some timing for, um, > the scheduler? I wasn't really sure about this one. > > Could we reduce the overhead by making the IN_USER vs IN_KERNEL > indication be a single bit and, worst case, an rdtsc and maybe a > subtraction? We could probably get away with banning full nohz on > non-invariant tsc systems. I suspect we can. There is no need to update the vtime every single time we enter vtime_user_enter and functions like it. We can keep a buffer, which: 1) keeps values in TSC cycles (or whatever unit local_clock does) 2) is only ever accessed by the current task, so it requires no locking 3) values can be actually folded into vtime periodically, when they exceed a certain threshold (1 second ?) That means the vtime_seqlock is something that we would only take once a second or so, and the calculations in account_user_time would only be done on a fairly infrequent basis. That has the potential to reduce overhead by a lot. If nobody has any objections to that kind of change, I would be happy to implement it. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Andy Lutomirski wrote: > > So what's the point? Why not remove this big source of overhead > > altogether? > > The last time I asked, the impression I got was that we needed two > things: > > 1. We can't pluck things from the RCU list without knowing whether > the CPU is in an RCU read-side critical section, and we can't know > that unless we have regular grade periods or we know that the CPU is > idle. To make the CPU detectably idle, we need to set a bit > somewhere. 'Idle' as in 'executing pure user-space mode, without entering the kernel and possibly doing an rcu_read_lock()', right? So we don't have to test it from the remote CPU: we could probe such CPUs via a single low-overhead IPI. I'd much rather push such overhead to sync_rcu() than to the syscall entry code! I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. If user-space RT code does not tolerate _that_ kind of latencies then it really has its priorities wrong and we should not try to please it. It should not hurt the other 99.9% of sane hard-RT users. And the other usecase, virtualization, obviously does not care and could take the IPI just fine. > 2. To suppress the timing tick, we need to get some timing for, um, > the scheduler? I wasn't really sure about this one. So we have variable timeslice timers for the scheduler implemented, they are off by default but they worked last someone tried them. See the 'HRTICK' scheduler feature. And for SCHED_FIFO that timeout can be 'never' - i.e. essentially stopping the scheduler tick. (within reason.) > Could we reduce the overhead by making the IN_USER vs IN_KERNEL > indication be a single bit and, worst case, an rdtsc and maybe a > subtraction? We could probably get away with banning full nohz on > non-invariant tsc systems. > > (I do understand why it would be tricky to transition from IN_USER > to IN_KERNEL with IRQs on. Solvable, maybe, but tricky.) We can make it literally zero overhead: by using an IPI from synchronize_rcu() and friend. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar wrote: > > * Rik van Riel wrote: > >> > I.e. what's the baseline we are talking about? >> >> It's an astounding difference. This is not a kernel without >> nohz_full, just a CPU without nohz_full running the same kernel I >> tested with yesterday: >> >> run timesystem time >> vanilla 5.49s 2.08s >> __acct patch 5.21s 1.92s >> both patches 4.88s 1.71s >> CPU w/o nohz 3.12s 1.63s<-- your numbers, mostly >> >> What is even more interesting is that the majority of the time >> difference seems to come from _user_ time, which has gone down from >> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on >> the CPU without nohz_full enabled... >> >> At syscall entry time, the nohz_full context tracking code is very >> straightforward. We check thread_info->flags & >> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which >> handles USER -> KERNEL context transition. >> >> Syscall exit time is a convoluted mess. Both do_notify_resume and >> syscall_trace_leave call exit_user() on entry and enter_user() on >> exit, leaving the time spent looping around between int_with_check >> and syscall_return: in entry_64.S accounted as user time. >> >> I sent an email about this last night, it may be useful to add a >> third test & function call point to the syscall return code, where >> we can call user_enter() just ONCE, and remove the other context >> tracking calls from that loop. > > So what I'm wondering about is the big picture: > > - This is crazy big overhead in something as fundamental as system >calls! > > - We don't even have the excuse of the syscall auditing code, which >kind of has to run for every syscall if it wants to do its job! > > - [ The 'precise vtime' stuff that is driven from syscall entry/exit > is crazy, and I hope not enabled in any distro. ] > > - So why are we doing this in every syscall time at all? > > Basically the whole point of user-context tracking is to be able to > flush pending RCU callbacks. But that's crazy, we can sure defer a few > kfree()s on this CPU, even indefinitely! > > If some other CPU does a sync_rcu(), then it can very well pluck those > callbacks from this super low latency CPU's RCU lists (with due care) > and go and free stuff itself ... There's no need to disturb this CPU > for that! > > If user-space does not do anything kernel-ish then there won't be any > new RCU callbacks piled up, so it's not like it's a resource leak > issue either. > > So what's the point? Why not remove this big source of overhead > altogether? The last time I asked, the impression I got was that we needed two things: 1. We can't pluck things from the RCU list without knowing whether the CPU is in an RCU read-side critical section, and we can't know that unless we have regular grade periods or we know that the CPU is idle. To make the CPU detectably idle, we need to set a bit somewhere. 2. To suppress the timing tick, we need to get some timing for, um, the scheduler? I wasn't really sure about this one. Could we reduce the overhead by making the IN_USER vs IN_KERNEL indication be a single bit and, worst case, an rdtsc and maybe a subtraction? We could probably get away with banning full nohz on non-invariant tsc systems. (I do understand why it would be tricky to transition from IN_USER to IN_KERNEL with IRQs on. Solvable, maybe, but tricky.) --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* Rik van Riel wrote: > > I.e. what's the baseline we are talking about? > > It's an astounding difference. This is not a kernel without > nohz_full, just a CPU without nohz_full running the same kernel I > tested with yesterday: > > run timesystem time > vanilla 5.49s 2.08s > __acct patch 5.21s 1.92s > both patches 4.88s 1.71s > CPU w/o nohz 3.12s 1.63s<-- your numbers, mostly > > What is even more interesting is that the majority of the time > difference seems to come from _user_ time, which has gone down from > around 3.4 seconds in the vanilla kernel to around 1.5 seconds on > the CPU without nohz_full enabled... > > At syscall entry time, the nohz_full context tracking code is very > straightforward. We check thread_info->flags & > _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which > handles USER -> KERNEL context transition. > > Syscall exit time is a convoluted mess. Both do_notify_resume and > syscall_trace_leave call exit_user() on entry and enter_user() on > exit, leaving the time spent looping around between int_with_check > and syscall_return: in entry_64.S accounted as user time. > > I sent an email about this last night, it may be useful to add a > third test & function call point to the syscall return code, where > we can call user_enter() just ONCE, and remove the other context > tracking calls from that loop. So what I'm wondering about is the big picture: - This is crazy big overhead in something as fundamental as system calls! - We don't even have the excuse of the syscall auditing code, which kind of has to run for every syscall if it wants to do its job! - [ The 'precise vtime' stuff that is driven from syscall entry/exit is crazy, and I hope not enabled in any distro. ] - So why are we doing this in every syscall time at all? Basically the whole point of user-context tracking is to be able to flush pending RCU callbacks. But that's crazy, we can sure defer a few kfree()s on this CPU, even indefinitely! If some other CPU does a sync_rcu(), then it can very well pluck those callbacks from this super low latency CPU's RCU lists (with due care) and go and free stuff itself ... There's no need to disturb this CPU for that! If user-space does not do anything kernel-ish then there won't be any new RCU callbacks piled up, so it's not like it's a resource leak issue either. So what's the point? Why not remove this big source of overhead altogether? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
On 05/01/2015 02:40 AM, Ingo Molnar wrote: >> This patch builds on top of these patches by Paolo: >> https://lkml.org/lkml/2015/4/28/188 >> https://lkml.org/lkml/2015/4/29/139 >> >> Together with this patch I posted earlier this week, the syscall path >> on a nohz_full cpu seems to be about 10% faster. >> https://lkml.org/lkml/2015/4/24/394 >> >> My test is a simple microbenchmark that calls getpriority() in a loop >> 10 million times: >> >> run timesystem time >> vanilla 5.49s 2.08s >> __acct patch 5.21s 1.92s >> both patches 4.88s 1.71s > > Just curious, what are the numbers if you don't have context tracking > enabled, i.e. without nohz_full? > > I.e. what's the baseline we are talking about? It's an astounding difference. This is not a kernel without nohz_full, just a CPU without nohz_full running the same kernel I tested with yesterday: run timesystem time vanilla 5.49s 2.08s __acct patch5.21s 1.92s both patches4.88s 1.71s CPU w/o nohz3.12s 1.63s<-- your numbers, mostly What is even more interesting is that the majority of the time difference seems to come from _user_ time, which has gone down from around 3.4 seconds in the vanilla kernel to around 1.5 seconds on the CPU without nohz_full enabled... At syscall entry time, the nohz_full context tracking code is very straightforward. We check thread_info->flags & _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which handles USER -> KERNEL context transition. Syscall exit time is a convoluted mess. Both do_notify_resume and syscall_trace_leave call exit_user() on entry and enter_user() on exit, leaving the time spent looping around between int_with_check and syscall_return: in entry_64.S accounted as user time. I sent an email about this last night, it may be useful to add a third test & function call point to the syscall return code, where we can call user_enter() just ONCE, and remove the other context tracking calls from that loop. -- All rights reversed -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
* r...@redhat.com wrote: > From: Rik van Riel > > On syscall entry with nohz_full on, we enable interrupts, call user_exit, > disable interrupts, do something, re-enable interrupts, and go on our > merry way. > > Profiling shows that a large amount of the nohz_full overhead comes > from the extraneous disabling and re-enabling of interrupts. Andy > suggested simply not enabling interrupts until after the context > tracking code has done its thing, which allows us to skip a whole > interrupt disable & re-enable cycle. > > This patch builds on top of these patches by Paolo: > https://lkml.org/lkml/2015/4/28/188 > https://lkml.org/lkml/2015/4/29/139 > > Together with this patch I posted earlier this week, the syscall path > on a nohz_full cpu seems to be about 10% faster. > https://lkml.org/lkml/2015/4/24/394 > > My test is a simple microbenchmark that calls getpriority() in a loop > 10 million times: > > run timesystem time > vanilla 5.49s 2.08s > __acct patch 5.21s 1.92s > both patches 4.88s 1.71s Just curious, what are the numbers if you don't have context tracking enabled, i.e. without nohz_full? I.e. what's the baseline we are talking about? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Fri, 2015-05-01 at 14:05 -0400, Rik van Riel wrote: On 05/01/2015 12:34 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. Not if the realtime workload is running inside a KVM guest. I don't buy this: At that point an IPI, either on the host or in the guest, involves a full VMEXIT VMENTER cycle. So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 usec on recent hardware, and I bet it will get better with time. I'm not aware of any hard-RT workload that cannot take 1 usec latencies. Now think about doing this kind of IPI from inside a guest, to another VCPU on the same guest. Now you are looking at VMEXIT/VMENTER on the first VCPU, plus the cost of the IPI on the host, plus the cost of the emulation layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI work, and possibly a second VMEXIT/VMENTER for IPI completion. I suspect it would be better to do RCU callback offload in some other way. I don't get it. How the heck do people manage to talk about realtime in virtual boxen, and not at least crack a smile. Real, virtual, real, virtual... what's wrong with this picture? Why is virtual realtime not an oxymoron? I did that for grins once, and it was either really funny, or really sad, not sure which... but it did not look really really useful. -Mike -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: On Fri, May 1, 2015 at 12:11 PM, Rik van Riel r...@redhat.com wrote: On 05/01/2015 02:40 PM, Ingo Molnar wrote: Or we could do that in the syscall path with a single store of a constant flag to a location in the task struct. We have a number of natural flags that get written on syscall entry, such as: pushq_cfi $__USER_DS/* pt_regs-ss */ That goes to a constant location on the kernel stack. On return from system calls we could write 0 to that location. Huh? IRET with zero there will fault, and we can't guarantee that all syscalls return using sysret. [...] So IRET is a slowpath - I was thinking about the fastpath mainly. [...] Also, we'd have to audit all the entries, and system_call_after_swapgs currently enables interrupts early enough that an interrupt before all the pushes will do unpredictable things to pt_regs. An irq hardware frame won't push zero to that selector value, right? That's the only bad thing that would confuse the code. We could further abuse orig_ax, but that would require a lot of auditing. Honestly, though, I think keeping a flag in an otherwise-hot cache line is a better bet. [...] That would work too, at the cost of one more instruction, as now we'd have to both set and clear it. [...] Also, making it per-cpu instead of per-task will probably be easier on the RCU code, since otherwise the RCU code will have to do some kind of synchronization (even if it's a wait-free probe of the rq lock or similar) to figure out which pt_regs to look at. So the synchronize_rcu() part is an even slower slow path, in comparison with system call entry overhead. But yes, safely accessing the remote task is a complication, but with such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So even if we do: If we went that route, I'd advocate sticking the flag in tss-sp1. That cacheline is unconditionally read on kernel entry already, and sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I lost track). [1] Alternatively, I'd suggest that we actually add a whole new word to pt_regs. ... we'd still have to set TIF_NOHZ and are back to square one in terms of race complexity. But it's not overly complex: by taking the remote CPU's rq-lock from synchronize_rcu() we could get a stable pointer to the currently executing task. And if we do that, we might as well use the opportunity and take a look at pt_regs as well - this is why sticking it into pt_regs might be better. So I'd: - enlarge pt_regs by a word and stick the flag there (this allocation is essentially free) - update the word from entry/exit - synchronize_rcu() avoids having to send an IPI by taking a peak at rq-curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. synchronize_rcu() often involves waiting for (timer tick driven) grace periods anyway, so this is a relatively fast solution - and it would limit the overhead to 2 instructions. On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled) we could patch out those two instructions into NOPs, which would be eliminated in the decoder. Regarding the user/kernel execution time split measurement: 1) the same flag could be used to sample a remote CPU's statistics from another CPU and update the stats in the currently executing task. As long as there's at least one non-nohz-full CPU, this would work. Or are there systems were all CPUs are nohz-full? 2) Alternatively we could just drive user/kernel split statistics from context switches, which would be inaccurate if the workload is SCHED_FIFO that only rarely context switches. How does this sound? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: I.e. what's the baseline we are talking about? It's an astounding difference. This is not a kernel without nohz_full, just a CPU without nohz_full running the same kernel I tested with yesterday: run timesystem time vanilla 5.49s 2.08s __acct patch 5.21s 1.92s both patches 4.88s 1.71s CPU w/o nohz 3.12s 1.63s-- your numbers, mostly What is even more interesting is that the majority of the time difference seems to come from _user_ time, which has gone down from around 3.4 seconds in the vanilla kernel to around 1.5 seconds on the CPU without nohz_full enabled... At syscall entry time, the nohz_full context tracking code is very straightforward. We check thread_info-flags _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which handles USER - KERNEL context transition. Syscall exit time is a convoluted mess. Both do_notify_resume and syscall_trace_leave call exit_user() on entry and enter_user() on exit, leaving the time spent looping around between int_with_check and syscall_return: in entry_64.S accounted as user time. I sent an email about this last night, it may be useful to add a third test function call point to the syscall return code, where we can call user_enter() just ONCE, and remove the other context tracking calls from that loop. So what I'm wondering about is the big picture: - This is crazy big overhead in something as fundamental as system calls! - We don't even have the excuse of the syscall auditing code, which kind of has to run for every syscall if it wants to do its job! - [ The 'precise vtime' stuff that is driven from syscall entry/exit is crazy, and I hope not enabled in any distro. ] - So why are we doing this in every syscall time at all? Basically the whole point of user-context tracking is to be able to flush pending RCU callbacks. But that's crazy, we can sure defer a few kfree()s on this CPU, even indefinitely! If some other CPU does a sync_rcu(), then it can very well pluck those callbacks from this super low latency CPU's RCU lists (with due care) and go and free stuff itself ... There's no need to disturb this CPU for that! If user-space does not do anything kernel-ish then there won't be any new RCU callbacks piled up, so it's not like it's a resource leak issue either. So what's the point? Why not remove this big source of overhead altogether? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* r...@redhat.com r...@redhat.com wrote: From: Rik van Riel r...@redhat.com On syscall entry with nohz_full on, we enable interrupts, call user_exit, disable interrupts, do something, re-enable interrupts, and go on our merry way. Profiling shows that a large amount of the nohz_full overhead comes from the extraneous disabling and re-enabling of interrupts. Andy suggested simply not enabling interrupts until after the context tracking code has done its thing, which allows us to skip a whole interrupt disable re-enable cycle. This patch builds on top of these patches by Paolo: https://lkml.org/lkml/2015/4/28/188 https://lkml.org/lkml/2015/4/29/139 Together with this patch I posted earlier this week, the syscall path on a nohz_full cpu seems to be about 10% faster. https://lkml.org/lkml/2015/4/24/394 My test is a simple microbenchmark that calls getpriority() in a loop 10 million times: run timesystem time vanilla 5.49s 2.08s __acct patch 5.21s 1.92s both patches 4.88s 1.71s Just curious, what are the numbers if you don't have context tracking enabled, i.e. without nohz_full? I.e. what's the baseline we are talking about? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Ingo Molnar mi...@kernel.org wrote: 2. To suppress the timing tick, we need to get some timing for, um, the scheduler? I wasn't really sure about this one. So we have variable timeslice timers for the scheduler implemented, they are off by default but they worked last someone tried them. See the 'HRTICK' scheduler feature. And for SCHED_FIFO that timeout can be 'never' - i.e. essentially stopping the scheduler tick. (within reason.) Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 01:12 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:45 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:37 PM, Ingo Molnar wrote: Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? So I really think per syscall overhead is the wrong thing to do for anything that a common distro enables. I see very little use for such precise, high-freq measurements on normal systems - and abnormal systems could enable it dynamically just like they can enable syscall auditing. I.e. I don't mind the occasional crazy piece of code, as long as it does not hurt the innocent. Then how should/could we keep a rough idea of user / system / guest time when running without a periodic timer tick? So I'd split the timer tick into two parts: just the constant-work sampling bit that doesn't do much, and the variable-work part which gets essentially shut off when the timeout is far into the future. Then we could do IRQ driven sampling without introducing variable amount jitter into hard-RT execution time. I.e. much of what we do today, except that we could skip variable work such as the scheduler tick or (unforced) RCU processing like the RCU softirq work. Any ideas how we could avoid that sampling timer interrupt latency stacking up when dealing with both guest and host? -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 12:21 PM, Ingo Molnar wrote: * Andy Lutomirski l...@amacapital.net wrote: So what's the point? Why not remove this big source of overhead altogether? The last time I asked, the impression I got was that we needed two things: 1. We can't pluck things from the RCU list without knowing whether the CPU is in an RCU read-side critical section, and we can't know that unless we have regular grade periods or we know that the CPU is idle. To make the CPU detectably idle, we need to set a bit somewhere. 'Idle' as in 'executing pure user-space mode, without entering the kernel and possibly doing an rcu_read_lock()', right? So we don't have to test it from the remote CPU: we could probe such CPUs via a single low-overhead IPI. I'd much rather push such overhead to sync_rcu() than to the syscall entry code! I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. Not if the realtime workload is running inside a KVM guest. At that point an IPI, either on the host or in the guest, involves a full VMEXIT VMENTER cycle. I suspect it would be easy enough at user_enter() or guest_enter() time to: 1) flip the bit 2) check to see if we have any callbacks queued on our on queue, and 3) if so, move the callbacks from that queue to the another CPU can take these queue At that point, another CPU wanting to do an RCU grace period advance can: 1) skip trying to advance the grace period on the CPU that is in userspace or guest mode, and 2) take the for somebody else callbacks of that CPU, and run them This does not seem overly complicated. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 12:37 PM, Ingo Molnar wrote: Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Andy Lutomirski l...@amacapital.net wrote: So what's the point? Why not remove this big source of overhead altogether? The last time I asked, the impression I got was that we needed two things: 1. We can't pluck things from the RCU list without knowing whether the CPU is in an RCU read-side critical section, and we can't know that unless we have regular grade periods or we know that the CPU is idle. To make the CPU detectably idle, we need to set a bit somewhere. 'Idle' as in 'executing pure user-space mode, without entering the kernel and possibly doing an rcu_read_lock()', right? So we don't have to test it from the remote CPU: we could probe such CPUs via a single low-overhead IPI. I'd much rather push such overhead to sync_rcu() than to the syscall entry code! I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. If user-space RT code does not tolerate _that_ kind of latencies then it really has its priorities wrong and we should not try to please it. It should not hurt the other 99.9% of sane hard-RT users. And the other usecase, virtualization, obviously does not care and could take the IPI just fine. 2. To suppress the timing tick, we need to get some timing for, um, the scheduler? I wasn't really sure about this one. So we have variable timeslice timers for the scheduler implemented, they are off by default but they worked last someone tried them. See the 'HRTICK' scheduler feature. And for SCHED_FIFO that timeout can be 'never' - i.e. essentially stopping the scheduler tick. (within reason.) Could we reduce the overhead by making the IN_USER vs IN_KERNEL indication be a single bit and, worst case, an rdtsc and maybe a subtraction? We could probably get away with banning full nohz on non-invariant tsc systems. (I do understand why it would be tricky to transition from IN_USER to IN_KERNEL with IRQs on. Solvable, maybe, but tricky.) We can make it literally zero overhead: by using an IPI from synchronize_rcu() and friend. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 12:03 PM, Andy Lutomirski wrote: The last time I asked, the impression I got was that we needed two things: 1. We can't pluck things from the RCU list without knowing whether the CPU is in an RCU read-side critical section, and we can't know that unless we have regular grade periods or we know that the CPU is idle. To make the CPU detectably idle, we need to set a bit somewhere. More than that. We also need a way for another CPU to identify the callbacks they could run for us, without confusing them with new callbacks queued after we transitioned back from USER to KERNEL context. 2. To suppress the timing tick, we need to get some timing for, um, the scheduler? I wasn't really sure about this one. Could we reduce the overhead by making the IN_USER vs IN_KERNEL indication be a single bit and, worst case, an rdtsc and maybe a subtraction? We could probably get away with banning full nohz on non-invariant tsc systems. I suspect we can. There is no need to update the vtime every single time we enter vtime_user_enter and functions like it. We can keep a buffer, which: 1) keeps values in TSC cycles (or whatever unit local_clock does) 2) is only ever accessed by the current task, so it requires no locking 3) values can be actually folded into vtime periodically, when they exceed a certain threshold (1 second ?) That means the vtime_seqlock is something that we would only take once a second or so, and the calculations in account_user_time would only be done on a fairly infrequent basis. That has the potential to reduce overhead by a lot. If nobody has any objections to that kind of change, I would be happy to implement it. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar mi...@kernel.org wrote: * Rik van Riel r...@redhat.com wrote: I.e. what's the baseline we are talking about? It's an astounding difference. This is not a kernel without nohz_full, just a CPU without nohz_full running the same kernel I tested with yesterday: run timesystem time vanilla 5.49s 2.08s __acct patch 5.21s 1.92s both patches 4.88s 1.71s CPU w/o nohz 3.12s 1.63s-- your numbers, mostly What is even more interesting is that the majority of the time difference seems to come from _user_ time, which has gone down from around 3.4 seconds in the vanilla kernel to around 1.5 seconds on the CPU without nohz_full enabled... At syscall entry time, the nohz_full context tracking code is very straightforward. We check thread_info-flags _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which handles USER - KERNEL context transition. Syscall exit time is a convoluted mess. Both do_notify_resume and syscall_trace_leave call exit_user() on entry and enter_user() on exit, leaving the time spent looping around between int_with_check and syscall_return: in entry_64.S accounted as user time. I sent an email about this last night, it may be useful to add a third test function call point to the syscall return code, where we can call user_enter() just ONCE, and remove the other context tracking calls from that loop. So what I'm wondering about is the big picture: - This is crazy big overhead in something as fundamental as system calls! - We don't even have the excuse of the syscall auditing code, which kind of has to run for every syscall if it wants to do its job! - [ The 'precise vtime' stuff that is driven from syscall entry/exit is crazy, and I hope not enabled in any distro. ] - So why are we doing this in every syscall time at all? Basically the whole point of user-context tracking is to be able to flush pending RCU callbacks. But that's crazy, we can sure defer a few kfree()s on this CPU, even indefinitely! If some other CPU does a sync_rcu(), then it can very well pluck those callbacks from this super low latency CPU's RCU lists (with due care) and go and free stuff itself ... There's no need to disturb this CPU for that! If user-space does not do anything kernel-ish then there won't be any new RCU callbacks piled up, so it's not like it's a resource leak issue either. So what's the point? Why not remove this big source of overhead altogether? The last time I asked, the impression I got was that we needed two things: 1. We can't pluck things from the RCU list without knowing whether the CPU is in an RCU read-side critical section, and we can't know that unless we have regular grade periods or we know that the CPU is idle. To make the CPU detectably idle, we need to set a bit somewhere. 2. To suppress the timing tick, we need to get some timing for, um, the scheduler? I wasn't really sure about this one. Could we reduce the overhead by making the IN_USER vs IN_KERNEL indication be a single bit and, worst case, an rdtsc and maybe a subtraction? We could probably get away with banning full nohz on non-invariant tsc systems. (I do understand why it would be tricky to transition from IN_USER to IN_KERNEL with IRQs on. Solvable, maybe, but tricky.) --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:03 PM, Andy Lutomirski wrote: The last time I asked, the impression I got was that we needed two things: 1. We can't pluck things from the RCU list without knowing whether the CPU is in an RCU read-side critical section, and we can't know that unless we have regular grade periods or we know that the CPU is idle. To make the CPU detectably idle, we need to set a bit somewhere. More than that. We also need a way for another CPU to identify the callbacks they could run for us, without confusing them with new callbacks queued after we transitioned back from USER to KERNEL context. That's easy: a simple, constant-work IPI sent to nohz-full CPUs could just dequeue these callbacks, if it sees that we are still in user-mode. It's super easy because such an IPI would essentially be equivalent to a system call context from RCU's POV, if it uses this test: if (user_mode(regs)) { ... pluck RCU callbacks ... } That way we can have lots of overhead removed from the syscall call path ... Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. Not if the realtime workload is running inside a KVM guest. I don't buy this: At that point an IPI, either on the host or in the guest, involves a full VMEXIT VMENTER cycle. So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 usec on recent hardware, and I bet it will get better with time. I'm not aware of any hard-RT workload that cannot take 1 usec latencies. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:37 PM, Ingo Molnar wrote: Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? So I really think per syscall overhead is the wrong thing to do for anything that a common distro enables. I see very little use for such precise, high-freq measurements on normal systems - and abnormal systems could enable it dynamically just like they can enable syscall auditing. I.e. I don't mind the occasional crazy piece of code, as long as it does not hurt the innocent. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 12:45 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:37 PM, Ingo Molnar wrote: Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? So I really think per syscall overhead is the wrong thing to do for anything that a common distro enables. I see very little use for such precise, high-freq measurements on normal systems - and abnormal systems could enable it dynamically just like they can enable syscall auditing. I.e. I don't mind the occasional crazy piece of code, as long as it does not hurt the innocent. Then how should/could we keep a rough idea of user / system / guest time when running without a periodic timer tick? These statistics are useful, and gathering them in a low impact way (local_clock read + variable read + subtraction + addition + store) may be acceptable overhead. This is a few function calls, one tsc read, a possible cache line miss, and an unlikely branch. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:45 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:37 PM, Ingo Molnar wrote: Also note that this bit in context_tracking_enter(): if (state == CONTEXT_USER) { trace_user_enter(0); vtime_user_enter(current); } is related to precise time measurement of user/kernel execution times, it's not needed by the scheduler at all, it's just exported to tooling. It's not fundamental to the scheduler. Any objections to the idea from the other thread to simply keep the time accumulating in buffers in local_clock() units, and only update the task vtime once a second or so? So I really think per syscall overhead is the wrong thing to do for anything that a common distro enables. I see very little use for such precise, high-freq measurements on normal systems - and abnormal systems could enable it dynamically just like they can enable syscall auditing. I.e. I don't mind the occasional crazy piece of code, as long as it does not hurt the innocent. Then how should/could we keep a rough idea of user / system / guest time when running without a periodic timer tick? So I'd split the timer tick into two parts: just the constant-work sampling bit that doesn't do much, and the variable-work part which gets essentially shut off when the timeout is far into the future. Then we could do IRQ driven sampling without introducing variable amount jitter into hard-RT execution time. I.e. much of what we do today, except that we could skip variable work such as the scheduler tick or (unforced) RCU processing like the RCU softirq work. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 02:40 AM, Ingo Molnar wrote: This patch builds on top of these patches by Paolo: https://lkml.org/lkml/2015/4/28/188 https://lkml.org/lkml/2015/4/29/139 Together with this patch I posted earlier this week, the syscall path on a nohz_full cpu seems to be about 10% faster. https://lkml.org/lkml/2015/4/24/394 My test is a simple microbenchmark that calls getpriority() in a loop 10 million times: run timesystem time vanilla 5.49s 2.08s __acct patch 5.21s 1.92s both patches 4.88s 1.71s Just curious, what are the numbers if you don't have context tracking enabled, i.e. without nohz_full? I.e. what's the baseline we are talking about? It's an astounding difference. This is not a kernel without nohz_full, just a CPU without nohz_full running the same kernel I tested with yesterday: run timesystem time vanilla 5.49s 2.08s __acct patch5.21s 1.92s both patches4.88s 1.71s CPU w/o nohz3.12s 1.63s-- your numbers, mostly What is even more interesting is that the majority of the time difference seems to come from _user_ time, which has gone down from around 3.4 seconds in the vanilla kernel to around 1.5 seconds on the CPU without nohz_full enabled... At syscall entry time, the nohz_full context tracking code is very straightforward. We check thread_info-flags _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which handles USER - KERNEL context transition. Syscall exit time is a convoluted mess. Both do_notify_resume and syscall_trace_leave call exit_user() on entry and enter_user() on exit, leaving the time spent looping around between int_with_check and syscall_return: in entry_64.S accounted as user time. I sent an email about this last night, it may be useful to add a third test function call point to the syscall return code, where we can call user_enter() just ONCE, and remove the other context tracking calls from that loop. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On Fri, May 1, 2015 at 12:11 PM, Rik van Riel r...@redhat.com wrote: On 05/01/2015 02:40 PM, Ingo Molnar wrote: Or we could do that in the syscall path with a single store of a constant flag to a location in the task struct. We have a number of natural flags that get written on syscall entry, such as: pushq_cfi $__USER_DS/* pt_regs-ss */ That goes to a constant location on the kernel stack. On return from system calls we could write 0 to that location. Huh? IRET with zero there will fault, and we can't guarantee that all syscalls return using sysret. Also, we'd have to audit all the entries, and system_call_after_swapgs currently enables interrupts early enough that an interrupt before all the pushes will do unpredictable things to pt_regs. We could further abuse orig_ax, but that would require a lot of auditing. Honestly, though, I think keeping a flag in an otherwise-hot cache line is a better bet. Also, making it per-cpu instead of per-task will probably be easier on the RCU code, since otherwise the RCU code will have to do some kind of synchronization (even if it's a wait-free probe of the rq lock or similar) to figure out which pt_regs to look at. If we went that route, I'd advocate sticking the flag in tss-sp1. That cacheline is unconditionally read on kernel entry already, and sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I lost track). [1] Alternatively, I'd suggest that we actually add a whole new word to pt_regs. [1] It's not unconditionally accessed yet, but it wil be once Denys' latest patches are in. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: I.e. much of what we do today, except that we could skip variable work such as the scheduler tick or (unforced) RCU processing like the RCU softirq work. Any ideas how we could avoid that sampling timer interrupt latency stacking up when dealing with both guest and host? Well, it would be host_latency+guest_latency worst-case, right? That should still be bounded, as long as both guest and host is running nohz-full mode. Also, technically when the host gets such an IRQ, the guest is interrupted as well. So the host could do the guest's work and pass through the result via paravirt or so. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 02:40 PM, Ingo Molnar wrote: Or we could do that in the syscall path with a single store of a constant flag to a location in the task struct. We have a number of natural flags that get written on syscall entry, such as: pushq_cfi $__USER_DS/* pt_regs-ss */ That goes to a constant location on the kernel stack. On return from system calls we could write 0 to that location. So the remote CPU would have to do a read of this location. There are two cases: - If it's 0, then it has observed quiescent state on that CPU. (It does not have to be atomics anymore, as we'd only observe the value and MESI coherency takes care of it.) That should do the trick. - If it's not 0 then the remote CPU is not executing user-space code and we can install (remotely) a TIF_NOHZ flag in it and expect it to process it either on return to user-space or on a context switch. I may have to think about this a little more, but it seems like it should work. Can we use a separate byte in the flags word for flags that can get set remotely, so we can do stores and clearing of local-only flags without atomic instructions? This way, unless I'm missing something, reduces the overhead to a single store to a hot cacheline on return-to-userspace - which instruction if we place it well might as well be close to zero cost. No syscall entry cost. Slow-return cost only in the (rare) case of someone using synchronize_rcu(). I think that should take care of the RCU aspect of nohz_full. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
* Rik van Riel r...@redhat.com wrote: On 05/01/2015 12:34 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. Not if the realtime workload is running inside a KVM guest. I don't buy this: At that point an IPI, either on the host or in the guest, involves a full VMEXIT VMENTER cycle. So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 usec on recent hardware, and I bet it will get better with time. I'm not aware of any hard-RT workload that cannot take 1 usec latencies. Now think about doing this kind of IPI from inside a guest, to another VCPU on the same guest. Now you are looking at VMEXIT/VMENTER on the first VCPU, Does it matter? It's not the hard-RT CPU, and this is a slowpath of synchronize_rcu(). plus the cost of the IPI on the host, plus the cost of the emulation layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI work, and possibly a second VMEXIT/VMENTER for IPI completion. Only the VMEXIT/VMENTER on the second VCPU matters to RT latencies. I suspect it would be better to do RCU callback offload in some other way. Well, it's not just about callback offload, but it's about the basic synchronization guarantee of synchronize_rcu(): that all RCU read-side critical sections have finished executing after the call returns. So even if a nohz-full CPU never actually queues a callback, it needs to stop using resources that a synchronize_rcu() caller expects it to stop using. We can do that only if we know it in an SMP-coherent way that the remote CPU is not in an rcu_read_lock() section. Sending an IPI is one way to achieve that. Or we could do that in the syscall path with a single store of a constant flag to a location in the task struct. We have a number of natural flags that get written on syscall entry, such as: pushq_cfi $__USER_DS/* pt_regs-ss */ That goes to a constant location on the kernel stack. On return from system calls we could write 0 to that location. So the remote CPU would have to do a read of this location. There are two cases: - If it's 0, then it has observed quiescent state on that CPU. (It does not have to be atomics anymore, as we'd only observe the value and MESI coherency takes care of it.) - If it's not 0 then the remote CPU is not executing user-space code and we can install (remotely) a TIF_NOHZ flag in it and expect it to process it either on return to user-space or on a context switch. This way, unless I'm missing something, reduces the overhead to a single store to a hot cacheline on return-to-userspace - which instruction if we place it well might as well be close to zero cost. No syscall entry cost. Slow-return cost only in the (rare) case of someone using synchronize_rcu(). Hm? Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable enable from context tracking on syscall entry
On 05/01/2015 12:34 PM, Ingo Molnar wrote: * Rik van Riel r...@redhat.com wrote: I can understand people running hard-RT workloads not wanting to see the overhead of a timer tick or a scheduler tick with variable (and occasionally heavy) work done in IRQ context, but the jitter caused by a single trivial IPI with constant work should be very, very low and constant. Not if the realtime workload is running inside a KVM guest. I don't buy this: At that point an IPI, either on the host or in the guest, involves a full VMEXIT VMENTER cycle. So a full VMEXIT/VMENTER costs how much, 2000 cycles? That's around 1 usec on recent hardware, and I bet it will get better with time. I'm not aware of any hard-RT workload that cannot take 1 usec latencies. Now think about doing this kind of IPI from inside a guest, to another VCPU on the same guest. Now you are looking at VMEXIT/VMENTER on the first VCPU, plus the cost of the IPI on the host, plus the cost of the emulation layer, plus VMEXIT/VMENTER on the second VCPU to trigger the IPI work, and possibly a second VMEXIT/VMENTER for IPI completion. I suspect it would be better to do RCU callback offload in some other way. -- All rights reversed -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/