Re: [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode)

2016-09-28 Thread Frederic Weisbecker
On Wed, Aug 17, 2016 at 02:37:46PM -0500, Christoph Lameter wrote:
> On Tue, 16 Aug 2016, Chris Metcalf wrote:
> Subject: NOHZ: Correctly display increasing cputime when processor is busy
> 
> The tick may be switched off when the processor gets busy with nohz full.
> The user time fields in /proc/stat will then no longer increase because
> the tick is not run to update the cpustat values anymore.
> 
> Compensate for the missing ticks by checking if a processor is in
> such a mode. If so then add the ticks that have passed since
> the tick was switched off to the usertime.
> 
> Note that this introduces a slight inaccuracy. The process may
> actually do syscalls without triggering a tick again but the
> processing time in those calls is negligible. Any wait or sleep
> occurrence during syscalls would activate the tick again.
> 
> Any inaccuracy is corrected once the tick is switched on again
> since the actual value where cputime aggregates is not changed.
> 
> Signed-off-by: Christoph Lameter 
> 
> Index: linux/fs/proc/stat.c
> ===
> --- linux.orig/fs/proc/stat.c 2016-08-04 09:04:57.681480937 -0500
> +++ linux/fs/proc/stat.c  2016-08-17 14:27:37.813445675 -0500
> @@ -77,6 +77,12 @@ static u64 get_iowait_time(int cpu)
> 
>  #endif
> 
> +static unsigned long inline get_cputime_user(int cpu)
> +{
> + return kcpustat_cpu(cpu).cpustat[CPUTIME_USER] +
> + tick_stopped_busy_ticks(cpu);
> +}
> +
>  static int show_stat(struct seq_file *p, void *v)
>  {
>   int i, j;
> @@ -93,7 +99,7 @@ static int show_stat(struct seq_file *p,
>   getboottime64(&boottime);
> 
>   for_each_possible_cpu(i) {
> - user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
> + user += get_cputime_user(i);
>   nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
>   system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>   idle += get_idle_time(i);
> @@ -130,7 +136,7 @@ static int show_stat(struct seq_file *p,
> 
>   for_each_online_cpu(i) {
>   /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
> - user = kcpustat_cpu(i).cpustat[CPUTIME_USER];
> + user = get_cputime_user(i);
>   nice = kcpustat_cpu(i).cpustat[CPUTIME_NICE];
>   system = kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
>   idle = get_idle_time(i);
> Index: linux/kernel/time/tick-sched.c
> ===
> --- linux.orig/kernel/time/tick-sched.c	2016-07-27 08:41:17.109862517 -0500
> +++ linux/kernel/time/tick-sched.c	2016-08-17 14:16:42.073835333 -0500
> @@ -990,6 +990,24 @@ ktime_t tick_nohz_get_sleep_length(void)
>   return ts->sleep_length;
>  }
> 
> +/**
> + * tick_stopped_busy_ticks - return the ticks that did not occur while the
> + *   processor was busy and the tick was off
> + *
> + * Called from sysfs to correctly calculate cputime of nohz full processors
> + */
> +unsigned long tick_stopped_busy_ticks(int cpu)
> +{
> +#ifdef CONFIG_NOHZ_FULL
> +	struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);
> +
> + if (!ts->inidle && ts->tick_stopped)
> + return jiffies - ts->idle_jiffies;


It won't work; ts->idle_jiffies only accounts for idle time.

That said, the tick is supposed to fire once per second, so the reason for the
freeze is still unknown. Now, in order to get rid of the residual 1Hz tick,
we'll need to force updates on cpustat like this patch intends to.

But I see only two sane ways to do so:

_ fetch the task running on CPU X and deduce from its vtime values where it is
  executing and how much delta must be added to cpustat. The problem here is
  that we may need to do that under the rq lock to make sure the task really
  is on CPU X and stays there. Perhaps we could cheat, though, and record the
  CPU number in the vtime fields; then vtime_seqcount would be enough to get
  stable results (see the sketch after this list).

_ have housekeeping update all those CPUs' cpustat periodically. But that means
  we need to turn vtime_seqcount back into a seqlock, and that would be a shame
  for nohz_full performance.
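For the first option, a rough and untested sketch could look like the following.
The helper and its name are hypothetical, the field names approximate the vtime
implementation of that time, and cpu_curr() is used without the rq lock, which
is precisely the problem mentioned above:

	static u64 nohz_full_user_delta(int cpu)
	{
		/* Racy without the rq lock: the task may migrate away. */
		struct task_struct *tsk = cpu_curr(cpu);
		unsigned int seq;
		u64 delta = 0;

		do {
			seq = read_seqcount_begin(&tsk->vtime_seqcount);
			/* Only account a delta while the task runs in userspace. */
			if (tsk->vtime_snap_whence == VTIME_USER)
				delta = jiffies_to_cputime(jiffies - tsk->vtime_snap);
		} while (read_seqcount_retry(&tsk->vtime_seqcount, seq));

		return delta;
	}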



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-09-27 Thread Frederic Weisbecker
On Tue, Sep 27, 2016 at 04:39:26PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 27, 2016 at 04:22:20PM +0200, Frederic Weisbecker wrote:
> 
> > The RCU context tracking doesn't take care of callbacks. It's only there
> > to tell the RCU core whether the CPU runs code that may or may not run
> > RCU read side critical sections. This is assumed by "kernel may use RCU,
> > userspace can't".
> 
> Userspace never can use the kernel's RCU in any case. What you mean to
> say is that userspace is treated like an idle CPU in that the CPU will
> no longer be part of the RCU quiescent state machine.
> 
> The transition to userspace (as per context tracking) must ensure that
> the CPU's RCU state is 'complete', just like our transition to idle (mostly)
> does.

Exactly!


Re: Ping: [PATCH v15 00/13] support "task_isolation" mode

2016-09-27 Thread Frederic Weisbecker
On Mon, Sep 12, 2016 at 08:20:16PM -0400, Francis Giraldeau wrote:
> 
> The args are valid, but the system has an unstable clock, therefore the
> operation is not supported. From the user's point of view, maybe ENOTSUPP
> would be more appropriate? But then, we need to check the reason and
> can_stop_my_full_tick() returns only a boolean.
> 
> On a side note, the NOSIG mode may be confusing for the users. At first,
> I was expecting that NOSIG behaves the same way as the normal task isolation
> mode. In the current situation, if the user wants the normal behavior, but
> does not care about the signal, the user must register an empty signal handler.
> 
> However, if I understand correctly, other settings beside NOHZ and isolcpus
> are required to support quiet CPUs, such as irq_affinity and rcu_nocb. It would
> be very convenient from the user's point of view if these other settings were
> configured correctly.
> 
> I can work on that and also write some doc (Documentation/task-isolation.txt ?).

That would be lovely! Part of this documentation already exists in
Documentation/timers/NO_HZ.txt and also in Documentation/kernel-per-CPU-kthreads.txt.

I think we should extract the isolation information that isn't related to the
tick from NO_HZ.txt and put it in task-isolation.txt, and perhaps merge
kernel-per-CPU-kthreads.txt into it or at least add a pointer to it. Then add
all the missing information, as many things have evolved since then.
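For example, an illustrative boot command line combining those settings (the CPU
ranges are arbitrary, assuming CPUs 1-3 are the ones to isolate) would be
something like:

	isolcpus=1-3 nohz_full=1-3 rcu_nocbs=1-3 irqaffinity=0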

I'll gladly help.

Thanks.


Re: Ping: [PATCH v15 00/13] support "task_isolation" mode

2016-09-27 Thread Frederic Weisbecker
On Mon, Aug 29, 2016 at 12:27:06PM -0400, Chris Metcalf wrote:
> On 8/16/2016 5:19 PM, Chris Metcalf wrote:
> >Here is a respin of the task-isolation patch set.
> >
> >Again, I have been getting email asking me when and where this patch
> >will be upstreamed so folks can start using it.  I had been thinking
> >the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
> >thing.  But perhaps it touches enough other subsystems that that
> >doesn't really make sense?  Andrew, would it make sense to take it
> >directly via your tree?  Frederic, Ingo, what do you think?
> 
> Ping!
> 
> No concerns have been raised yet with the v15 version of the patch series
> in the two weeks since I posted it, and I think I have addressed all
> previously-raised concerns (or perhaps people have just given up arguing
> with me).
> 
> I did add Catalin's Reviewed-by to 08/13 (thanks!) and updated my
> kernel.org repo.
> 
> Does this feel like something we can merge when the 4.9 merge window opens?
> If so, whose tree is best suited for it?  Or should I ask Stephen to put it into
> linux-next now and then ask Linus to merge it directly?  I recall Ingo thought
> this was a bad idea when I suggested it back in January, but I'm not sure where
> we got to in terms of a better approach.

As it seems we are still debating a lot of things in this patchset, which has
already reached v15, I think you should split it into smaller steps in order to
move forward, and only get into the next step once the previous one is merged.

You could start with a first batch that introduces the prctl() and does the
best-effort one-shot isolation part, which means the actions that only need to
be performed once on the prctl call.

Once we get that merged we can focus on what needs to be performed on every
return to userspace, if that's really needed, including possibly waiting on
some completion.

Then once we rip out the residual 1Hz tick we can start to think about
signaling the user on any interruption, etc...

Does that sound feasible to you?

Thanks.


Re: [PATCH v15 04/13] task_isolation: add initial support

2016-09-27 Thread Frederic Weisbecker
On Fri, Sep 02, 2016 at 10:28:00AM -0700, Andy Lutomirski wrote:
> 
> Unless I'm missing something (which is reasonably likely), couldn't
> the isolation code just force or require rcu_nocbs on the isolated
> CPUs to avoid this problem entirely.

rcu_nocb is already implied by nohz_full, which means that RCU callbacks
are offloaded to CPUs outside the nohz_full set.

> 
> I admit I still don't understand why the RCU context tracking code
> can't just run the callback right away instead of waiting however many
> microseconds in general.  I feel like paulmck has explained it to me
> at least once, but that doesn't mean I remember the answer.

The RCU context tracking doesn't take care of callbacks. It's only there
to tell the RCU core whether the CPU runs code that may or may not run
RCU read side critical sections. This is assumed by "kernel may use RCU,
userspace can't".
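In a simplified view, the hooks involved on the user/kernel boundary are:

	user_enter();	/* context_tracking_enter(CONTEXT_USER) -> rcu_user_enter():
			 * the CPU enters an RCU extended quiescent state, like idle */
	/* ... userspace runs, no RCU read-side critical sections possible ... */
	user_exit();	/* context_tracking_exit(CONTEXT_USER) -> rcu_user_exit():
			 * the CPU participates in the RCU state machine again */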


Re: [PATCH v14 04/14] task_isolation: add initial support

2016-09-03 Thread Frederic Weisbecker
On Tue, Aug 30, 2016 at 02:17:27PM -0400, Chris Metcalf wrote:
> On 8/30/2016 1:36 PM, Chris Metcalf wrote:
> >>>See the other thread with Peter Z for the longer discussion of this.
> >>>At this point I'm leaning towards replacing the set_tsk_need_resched() with
> >>>
> >>> set_current_state(TASK_INTERRUPTIBLE);
> >>> schedule();
> >>> __set_current_state(TASK_RUNNING);
> >>I don't see how that helps. What will wake the thread up except a signal?
> >
> >The answer is that the scheduler will keep bringing us back to this
> >point (either after running another runnable task if there is one,
> >or else just returning to this point immediately without doing a
> >context switch), and we will then go around the "prepare exit to
> >userspace" loop and perhaps discover that enough time has gone
> >by that the last dyntick interrupt has triggered and the kernel has
> >quiesced the dynticks.  At that point we stop calling schedule()
> >over and over and can return normally to userspace.
> 
> Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
> schedule(), I never get control back.  So I don't want to do that.
> 
> I suppose I could do a schedule_timeout() here instead and try
> to figure out how long to wait for the next dyntick.  But since we
> don't expect anything else running on the core anyway, it seems
> like we are probably working too hard at this point.  I don't think
> it's worth it just to go into the idle task and (possibly) save some
> power for a few microseconds.
> 
> The more I think about this, the more I think I am micro-optimizing
> by trying to poke the scheduler prior to some external thing setting
> need_resched, so I think the thing to do here is in fact, nothing.

Exactly, I fear there is nothing you can do about that.

> I won't worry about rescheduling but will just continue going around
> the prepare-exit-to-userspace loop until the last dyn tick fires.

You mean waiting in prepare-exit-to-userspace until the last tick fires?
I'm not sure it's a good idea either; this could take ages, or it could just
as well never happen.

I'd rather say that if we are in signal mode, fire the signal; otherwise just
return to userspace. If there is a tick, it means that the environment is
not suitable for isolation anyway.



Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Frederic Weisbecker
On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote:
> On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:
> >On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
> >>On 8/11/2016 2:50 PM, Christoph Lameter wrote:
> >>>On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
> >>>
> >>>>Do we need to quiesce vmstat everytime before entering userspace?
> >>>>I thought that vmstat only need to be offlined once and for all?
> >>>Once is sufficient after disabling the tick.
> >>It's true that task_isolation_enter() is called every time before
> >>returning to user space while task isolation is enabled.
> >>
> >>But once we enter the kernel again after returning from the initial
> >>prctl() -- assuming we are in NOSIG mode so doing so is legal in the
> >>first place -- almost anything can happen, certainly including
> >>restarting the tick.  Thus, we have to make sure that normal quiescing
> >>happens again before we return to userspace.
> >Yes but we need to sort out what needs to be called only once on prctl().
> >
> >Once vmstat is quiesced, it's not going to need quiescing again even if we
> >restart the tick.
> 
> That's true, but I really do like the idea of having a clean structure
> where we verify all our prerequisites in task_isolation_ready(), and
> have code to try to get things fixed up in task_isolation_enter().
> If we start moving some things here and some things there, it gets
> harder to manage.  I think by testing "!vmstat_idle()" in
> task_isolation_enter() we are avoiding any substantial overhead.

I think that making the code clearer on what needs to be done once and for
all on prctl() and what needs to be done on every actual syscall return
is more important for readability.

> 
> I think it would be clearer to rename task_isolation_enter()
> to task_isolation_prepare(); it might be less confusing.

For the prctl part, why not.

> 
> Remember too that in general, we really don't need to think about
> return-to-userspace as a hot path for task isolation, unlike how we
> think about it all the rest of the time.  So it makes sense to
> prioritize keeping things clean from a software development
> perspective first, and high-performance only secondarily.
> 
> >>The thing to remember is that this is only relevant if the user has
> >>explicitly requested the NOSIG behavior from task isolation, which we
> >>don't really expect to be the default - we are implicitly encouraging
> >>use of the default semantics of "you can't enter the kernel again
> >>until you turn off isolation".
> >That's right. Although NOSIG is the only thing we can afford as long as
> >we drag around the 1Hz.
> 
> True enough.  Hopefully we'll finish sorting that out soon enough.
> 
> >>>>+ if (!tick_nohz_tick_stopped())
> >>>>+ set_tsk_need_resched(current);
> >>>>Again, that won't help
> >>It won't be better than spinning in a loop if there aren't any other
> >>schedulable processes, but it won't be worse either.  If there is
> >>another schedulable process, we at least will schedule it sooner than
> >>if we just sat in a busy loop and waited for the scheduler to kick
> >>us. But there's nothing else we can do anyway if we want to maintain
> >>the guarantee that the dyn tick is stopped before return to userspace.
> >I don't think it helps either way. If reschedule is pending, the current
> >task already has TIF_RESCHED set.
> 
> See the other thread with Peter Z for the longer discussion of this.
> At this point I'm leaning towards replacing the set_tsk_need_resched() with
> 
> set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-29 Thread Frederic Weisbecker
On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:
> On 8/11/2016 2:50 PM, Christoph Lameter wrote:
> >On Thu, 11 Aug 2016, Frederic Weisbecker wrote:
> >
> >>Do we need to quiesce vmstat everytime before entering userspace?
> >>I thought that vmstat only need to be offlined once and for all?
> >Once is sufficient after disabling the tick.
> 
> It's true that task_isolation_enter() is called every time before
> returning to user space while task isolation is enabled.
> 
> But once we enter the kernel again after returning from the initial
> prctl() -- assuming we are in NOSIG mode so doing so is legal in the
> first place -- almost anything can happen, certainly including
> restarting the tick.  Thus, we have to make sure that normal quiescing
> happens again before we return to userspace.

Yes but we need to sort out what needs to be called only once on prctl().

Once vmstat is quiesced, it's not going to need quiescing again even if we
restart the tick.

> 
> For vmstat, you're right that it's somewhat heavyweight to do the
> quiesce, and if we don't need it, it's wasted time on the return path.
> So I will add a guard call to the new vmstat_idle() before invoking
> quiet_vmstat_sync().  This slows down the path where it turns out we
> do need to quieten vmstat, but not by too much.

Why not do this on prctl() only?
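Wherever it ends up being called, the guard itself would presumably be just the
following sketch (vmstat_idle() and quiet_vmstat_sync() come from this patch
series, not mainline):

	/* Only pay for the sync if vmstat actually has pending work. */
	if (!vmstat_idle())
		quiet_vmstat_sync();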

> The LRU quiesce is quite light-weight.  We just check pagevec_count()
> on a handful of pagevec's, confirm they are all zero, and return
> without further work.  So for that one, adding a separate
> lru_add_drain_needed() guard test would just be wasted effort.

Ok, if this one is justified, i.e. the LRU may need updating every time we
re-enter the kernel, then we can keep it (I can't tell, I don't know much
about -mm).

> The thing to remember is that this is only relevant if the user has
> explicitly requested the NOSIG behavior from task isolation, which we
> don't really expect to be the default - we are implicitly encouraging
> use of the default semantics of "you can't enter the kernel again
> until you turn off isolation".

That's right. Although NOSIG is the only thing we can afford as long as
we drag around the 1Hz.

> 
> >> +  if (!tick_nohz_tick_stopped())
> >> +  set_tsk_need_resched(current);
> >> Again, that won't help
> 
> It won't be better than spinning in a loop if there aren't any other
> schedulable processes, but it won't be worse either.  If there is
> another schedulable process, we at least will schedule it sooner than
> if we just sat in a busy loop and waited for the scheduler to kick
> us. But there's nothing else we can do anyway if we want to maintain
> the guarantee that the dyn tick is stopped before return to userspace.

I don't think it helps either way. If reschedule is pending, the current
task already has TIF_RESCHED set.


Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-08-13 Thread Frederic Weisbecker
On Fri, Aug 12, 2016 at 09:19:19AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 12, 2016 at 04:26:13PM +0200, Frederic Weisbecker wrote:
> > On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> > > On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> > > 
> > > > Heh!  The only really good idea is for clocks to be reliably in sync.
> > > >
> > > > But if they go out of sync, what do you want to do instead?
> > > 
> > > For a NOHZ task? Write a message to the syslog and reenable tick.
> 
> Fair enough!  Kicking off a low-priority task would achieve the latter
> but not necessarily the former.  And of course assumes that the worker
> thread is at real-time priority with various scheduler anti-starvation
> features disabled.
> 
> > Indeed, a strong clocksource is a requirement for a full tickless machine.
> 
> No disagrement here!  ;-)

I have a bot in my mind that randomly posts obvious statements about nohz_full
here every now and then :-)


Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-08-12 Thread Frederic Weisbecker
On Fri, Aug 12, 2016 at 09:23:13AM -0500, Christoph Lameter wrote:
> On Thu, 11 Aug 2016, Paul E. McKenney wrote:
> 
> > Heh!  The only really good idea is for clocks to be reliably in sync.
> >
> > But if they go out of sync, what do you want to do instead?
> 
> For a NOHZ task? Write a message to the syslog and reenable tick.

Indeed, a strong clocksource is a requirement for a full tickless machine.


Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-11 Thread Frederic Weisbecker
On Tue, Aug 09, 2016 at 04:29:46PM -0400, Chris Metcalf wrote:
> +/*
> + * Each time we try to prepare for return to userspace in a process
> + * with task isolation enabled, we run this code to quiesce whatever
> + * subsystems we can readily quiesce to avoid later interrupts.
> + */
> +void task_isolation_enter(void)
> +{
> + WARN_ON_ONCE(irqs_disabled());
> +
> + /* Drain the pagevecs to avoid unnecessary IPI flushes later. */
> + lru_add_drain();
> +
> + /* Quieten the vmstat worker so it won't interrupt us. */
> + quiet_vmstat_sync();

So, this is going to be called every time we resume to userspace
while in task isolation mode, right?

Do we need to quiesce vmstat every time before entering userspace?
I thought that vmstat only needs to be offlined once and for all?

And how about the LRU?

> +
> + /*
> +  * Request rescheduling unless we are in full dynticks mode.
> +  * We would eventually get pre-empted without this, and if
> +  * there's another task waiting, it would run; but by
> +  * explicitly requesting the reschedule, we may reduce the
> +  * latency.  We could directly call schedule() here as well,
> +  * but since our caller is the standard place where schedule()
> +  * is called, we defer to the caller.
> +  *
> +  * A more substantive approach here would be to use a struct
> +  * completion here explicitly, and complete it when we shut
> +  * down dynticks, but since we presumably have nothing better
> +  * to do on this core anyway, just spinning seems plausible.
> +  */
> + if (!tick_nohz_tick_stopped())
> + set_tsk_need_resched(current);

Again, that won't help :-)


Re: [PATCH v9 04/13] task_isolation: add initial support

2016-06-29 Thread Frederic Weisbecker
On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:
> On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:
> >I don't remember how much I answered this email, but I need to finish that 
> >:-)
> 
> Sorry for the slow response - it's been a busy week.

I'm certainly much slower ;-)

> 
> >On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> >>On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >>>On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>>>   TL;DR: Let's make an explicit decision about whether task isolation
> >>>>   should be "persistent" or "one-shot".  Both have some advantages.
> >>>>   =
> >>>>
> >>>>An important high-level issue is how "sticky" task isolation mode is.
> >>>>We need to choose one of these two options:
> >>>>
> >>>>"Persistent mode": A task switches state to "task isolation" mode
> >>>>(kind of a level-triggered analogy) and stays there indefinitely.  It
> >>>>can make a syscall, take a page fault, etc., if it wants to, but the
> >>>>kernel protects it from incurring any further asynchronous interrupts.
> >>>>This is the model I've been advocating for.
> >>>But then in this mode, what happens when an interrupt triggers.
> >>So what happens if an interrupt does occur?
> >>
> >>In the "base" task isolation mode, you just take the interrupt, then
> >>wait to quiesce any further kernel timer ticks, etc., and return to
> >>the process.  This at least limits the damage to being a single
> >>interruption rather than potentially additional ones, if the interrupt
> >>also caused timers to get queued, etc.
> >Good, although that quiescing on kernel return must be an option.
> 
> Can you spell out why you think turning it off is helpful?  I'll admit
> this is the default mode in the commercial version of task isolation
> that we ship, and was also the default in the first LKML patch series.
> But on consideration I haven't found scenarios where skipping the
> quiescing is helpful.  Admittedly you get out of the kernel faster,
> but then you're back in userspace and vulnerable to yet more
> unexpected interrupts until the timer quiesces.  If you're asking for
> task isolation, this is surely not what you want.

I just feel that quiescing, on the way back to userspace after an unwanted
interruption, is awkward. The quiescing should work once and for all
on return from the prctl. If we still get disturbed afterward,
either the quiescing is buggy or incomplete, or something is in the
way that cannot be quiesced.

> 
> >>If you enable "strict" mode, we disable task isolation mode for that
> >>core and deliver a signal to it.  This lets the application know that
> >>an interrupt occurred, and it can take whatever kind of logging or
> >>debugging action it wants to, re-enable task isolation if it wants to
> >>and continue, or just exit or abort, etc.
> >Good.
> >
> >>If you don't enable "strict" mode, but you do have
> >>task_isolation_debug enabled as a boot flag, you will at least get a
> >>console dump with a backtrace and whatever other data we have.
> >>(Sometimes the debug info actually includes a backtrace of the
> >>interrupting core, if it's an IPI or TLB flush from another core,
> >>which can be pretty useful.)
> >Right, I suggest we use trace events btw.
> 
> This is probably a good idea, although I wonder if it's worth deferring
> until after the main patch series goes in - I'm reluctant to expand the scope
> of this patch series and add more reasons for it to get delayed :-)
> What do you think?

Yeah definitely, the patchset is big enough :-)

> 
> >>>>"One-shot mode": A task requests isolation via prctl(), the kernel
> >>>>ensures it is isolated on return from the prctl(), but then as soon as
> >>>>it enters the kernel again, task isolation is switched off until
> >>>>another prctl is issued.  This is what you recommended in your last
> >>>>email.
> >>>No, I think we can issue syscalls for example. But asynchronous interruptions
> >>>such as exceptions (actually somewhat synchronous but can be unexpected) and
> >>>interrupts are what we want to avoid.
> >>Hmm, so I think I'm not really understanding what you are suggesting.
> >>
> >>We're certainly in agreement that avoiding interrupts and exceptions
> >

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-04-22 Thread Frederic Weisbecker
On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:
> On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:
> >On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> >>   TL;DR: Let's make an explicit decision about whether task isolation
> >>   should be "persistent" or "one-shot".  Both have some advantages.
> >>   =
> >>
> >> An important high-level issue is how "sticky" task isolation mode is.
> >> We need to choose one of these two options:
> >>
> >> "Persistent mode": A task switches state to "task isolation" mode
> >> (kind of a level-triggered analogy) and stays there indefinitely.  It
> >> can make a syscall, take a page fault, etc., if it wants to, but the
> >> kernel protects it from incurring any further asynchronous interrupts.
> >> This is the model I've been advocating for.
> >
> >But then in this mode, what happens when an interrupt triggers.
> 
> So here I'm taking "interrupt" to mean an external, asynchronous
> interrupt, from another core or device, or asynchronously triggered
> on the local core, like a timer interrupt.  By contrast I use "exception"
> or "fault" to refer to synchronous, locally-triggered interruptions.

Ok.

> So for interrupts, the short answer is, it's a bug! :-)
> 
> An interrupt could be a kernel bug, in which case we consider it a
> "true" bug.  This could be a timer interrupt occurring even after the
> task isolation code thought there were none pending, or a hardware
> device that incorrectly distributes interrupts to a task-isolation
> cpu, or a global IPI that should be sent to fewer cores, or a kernel
> TLB flush that could be deferred until the task-isolation task
> re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
> bug.  I'm sure there are more such bugs that we can continue to fix
> going forward; it depends on how arbitrary you want to allow code
> running on other cores to be.  For example, can another core unload a
> kernel module without interrupting a task-isolation task?  Not right now.
> 
> Or, it could be an application bug: the standard example is if you
> have an application with task-isolated cores that also does occasional
> unmaps on another thread in the same process, on another core.  This
> causes TLB flush interrupts under application control.  The
> application shouldn't do this, and we tell our customers not to build
> their applications this way.  The typical way we encourage our
> customers to arrange this kind of "multi-threading" is by having a
> pure memory API between the task isolation threads and what are
> typically "control" threads running on non-task-isolated cores.  The
> two types of threads just both mmap some common, shared memory but run
> as different processes.
> 
> So what happens if an interrupt does occur?
> 
> In the "base" task isolation mode, you just take the interrupt, then
> wait to quiesce any further kernel timer ticks, etc., and return to
> the process.  This at least limits the damage to being a single
> interruption rather than potentially additional ones, if the interrupt
> also caused timers to get queued, etc.

So if we take an interrupt that we didn't expect, we want to wait some more
at the end of that interrupt for things to quiesce some more?

That doesn't look right. Things should be quiesced once and for all on
return from the initial prctl() call. We can't even expect to quiesce more
in case of interruptions; the tick can't be forced off anyway.

> 
> If you enable "strict" mode, we disable task isolation mode for that
> core and deliver a signal to it.  This lets the application know that
> an interrupt occurred, and it can take whatever kind of logging or
> debugging action it wants to, re-enable task isolation if it wants to
> and continue, or just exit or abort, etc.

That sounds sensible.

> 
> If you don't enable "strict" mode, but you do have
> task_isolation_debug enabled as a boot flag, you will at least get a
> console dump with a backtrace and whatever other data we have.
> (Sometimes the debug info actually includes a backtrace of the
> interrupting core, if it's an IPI or TLB flush from another core,
> which can be pretty useful.)

Ok.

> 
> >> "One-shot mode": A task requests isolation via prctl(), the kernel
> >> ensures it is isolated on return from the prctl(), but then as soon as
> >> it enters the kernel again, task isolation is switched off until
> >> another prctl is issued.  This is what you recommended in your last
> >> email.
> >
> >No

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-04-08 Thread Frederic Weisbecker
On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
> Frederic,
> 
> Thanks for the detailed feedback on the task isolation stuff.
> 
> This reply kind of turned into an essay, so I've added a little "TL;DR"
> sentence before each section.

I think I'm going to cut my reply into several threads, because really
I can't get myself to make a giant reply all at once :-)

> 
> 
>   TL;DR: Let's make an explicit decision about whether task isolation
>   should be "persistent" or "one-shot".  Both have some advantages.
>   =
> 
> An important high-level issue is how "sticky" task isolation mode is.
> We need to choose one of these two options:
> 
> "Persistent mode": A task switches state to "task isolation" mode
> (kind of a level-triggered analogy) and stays there indefinitely.  It
> can make a syscall, take a page fault, etc., if it wants to, but the
> kernel protects it from incurring any further asynchronous interrupts.
> This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers.

> 
> "One-shot mode": A task requests isolation via prctl(), the kernel
> ensures it is isolated on return from the prctl(), but then as soon as
> it enters the kernel again, task isolation is switched off until
> another prctl is issued.  This is what you recommended in your last
> email.

No, I think we can issue syscalls for example. But asynchronous interruptions
such as exceptions (actually somewhat synchronous but can be unexpected) and
interrupts are what we want to avoid.

> 
> There are a number of pros and cons to the two models.  I think on
> balance I still like the "persistent mode" approach, but here's all
> the pros/cons I can think of:
> 
> PRO for persistent mode: A somewhat easier programming model.  Users
> can just imagine "task isolation" as a way for them to still be able
> to use the kernel exactly as they always have; it's just slower to get
> back out of the kernel so you use it judiciously. For example, a
> process is free to call write() on a socket to perform a diagnostic,
> but when returning from the write() syscall, the kernel will hold the
> task in kernel mode until any timer ticks (perhaps from networking
> stuff) are complete, and then let it return to userspace to continue
> in task isolation mode.

So this is not hard isolation anymore. This is rather soft isolation with
best effort to avoid disturbance.

Surely we can have different levels of isolation.

I'm still wondering what to do if the task migrates to another CPU. In fact,
perhaps what you're trying to do is rather a CPU property than a process
property?

> This is convenient to the user since they
> don't have to fret about re-enabling task isolation after that
> syscall, page fault, or whatever; they can just continue running.
> With your suggestion, the user pretty much has to leave STRICT mode
> enabled so he gets notified of any unexpected return to kernel space
> (in fact we might make it required so you always get a signal when
> leaving task isolation unless it's via a prctl or exit syscall).

Right. Although we can allow all syscalls in this mode actually.

> 
> PRO for one-shot mode: A somewhat crisper interaction with
> sched_setaffinity() etc.  With a persistent mode approach, a task can
> start up task isolation, then later another task can be placed on its
> cpu and break it (it won't return to userspace until killed or the new
> process affinitizes itself away or stops running).  By contrast, in
> one-shot mode, any return to kernel spaces turns off task isolation
> anyway, so it's very clear what the interaction looks like.  I suspect
> this is more a theoretical advantage to one-shot mode than a practical
> one, though.

I think I heard about workloads that need such strict hard isolation,
workloads that really cannot afford any disturbance. They even
use a userspace network stack. Maybe HFT?

> CON for one-shot mode: It's actually hard to catch every kernel entry
> so we can turn the task-isolation flag off again - and we really do
> need to have a flag, just so that we can suitably debug any bad
> actions that bring us into the kernel when we're not expecting it.
> Right now there are things that bring us into the kernel that we don't
> bother annotating for task isolation STRICT mode, just because they're
> visible to the user anyway: e.g., a bus fault or segmentation
> violation.
> 
> I think we can actually make both modes available to users with just
> another flag bit, so maybe we can look at what that looks like in v11:
> adding a PR_TASK_ISOLATION_ONESHOT flag would turn off task
> isolation at the next syscall entry, page fault, etc.  Then we can
> think more specifically about whether we want to remove the flag or
> not, and if we remove it, whether we want to make the code that was
> controlled by it unconditionally true or unconditionally false
> (i.e. remove it again).
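For reference, the userspace side of that one-shot request would presumably look
like the sketch below. The PR_* constants are the ones proposed in this series,
including the new PR_TASK_ISOLATION_ONESHOT flag; none of them are mainline:

	#include <sys/prctl.h>
	#include <stdio.h>

	/* PR_SET_TASK_ISOLATION and its flags come from the proposed series. */
	static int enter_isolation_oneshot(void)
	{
		if (prctl(PR_SET_TASK_ISOLATION,
			  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_ONESHOT,
			  0, 0, 0) != 0) {
			perror("prctl(PR_SET_TASK_ISOLATION)");
			return -1;
		}
		return 0;
	}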

I think we shouldn't bother with strict hard isolation if we don't need it

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-01-30 Thread Frederic Weisbecker
On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:
> On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:
> >On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> >>You asked what happens if nohz_full= is given as well, which is a very
> >>good question.  Perhaps the right answer is to have an early_initcall
> >>that suppresses task isolation on any cores that lost their nohz_full
> >>or isolcpus status due to later boot command line arguments (and
> >>generate a console warning, obviously).
> >I'd rather imagine that the final nohz full cpumask is "nohz_full=" | "task_isolation=".
> >That's the easiest way to deal with it, and both nohz and task isolation can call
> >a common initializer that takes care of the allocation and adds the cpus to
> >the mask.
> 
> I like it!
> 
> And by the same token, the final isolcpus cpumask is "isolcpus=" |
> "task_isolation="?
> That seems like we'd want to do it to keep things parallel.

We have reverted the patch that made isolcpus |= nohz_full. Too many people
complained about unusable machines with NO_HZ_FULL_ALL.

But the user can still set that parameter manually.
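The combined-mask idea quoted above would boil down to something like this
sketch (task_isolation_map here stands for the mask parsed from
"task_isolation=", as named in the proposed series):

	/* Final nohz_full mask = "nohz_full=" | "task_isolation=" */
	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, task_isolation_map);
	tick_nohz_full_running = true;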

> 
> >>>>+bool _task_isolation_ready(void)
> >>>>+{
> >>>>+ WARN_ON_ONCE(!irqs_disabled());
> >>>>+
> >>>>+ /* If we need to drain the LRU cache, we're not ready. */
> >>>>+ if (lru_add_drain_needed(smp_processor_id()))
> >>>>+ return false;
> >>>>+
> >>>>+ /* If vmstats need updating, we're not ready. */
> >>>>+ if (!vmstat_idle())
> >>>>+ return false;
> >>>>+
> >>>>+ /* Request rescheduling unless we are in full dynticks mode. */
> >>>>+ if (!tick_nohz_tick_stopped()) {
> >>>>+ set_tsk_need_resched(current);
> >>>I'm not sure doing this will help getting the tick to get stopped.
> >>Well, I don't know that there is anything else we CAN do, right?  If there's
> >>another task that can run, great - it may be that that's why full dynticks
> >>isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
> >>there's nothing else we can do, in which case we basically spend our time
> >>going around through the scheduler code and back out to the
> >>task_isolation_ready() test, but again, there's really nothing else more
> >>useful we can be doing at this point.  Once the RCU tick fires (or whatever
> >>it was that was preventing full dynticks from engaging), we will pass this
> >>test and return to user space.
> >There is nothing at all you can do and setting TIF_RESCHED won't help either.
> >If there is another task that can run, the scheduler takes care of resched
> >by itself :-)
> 
> The problem is that the scheduler will only take care of resched at a
> later time, typically when we get a timer interrupt later.

When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
target is remote it sends an IPI; if it's local then we wait for the next
reschedule point (preemption points, voluntary reschedule, interrupts).
There is just nothing you can do to accelerate that.


> By invoking the scheduler here, we allow any tasks that are ready to run to run
> immediately, rather than waiting for an interrupt to wake the scheduler.

Well, in this case we are interested in the current CPU. And if a task
got awoken and waits for the current CPU, it will have an opportunity to get
scheduled on syscall exit.

> Plenty of places in the kernel just call schedule() directly when they are
> waiting.  Since we're waiting here regardless, we might as well
> immediately get any other runnable tasks dealt with.
> 
> We could also just return "false" in _task_isolation_ready(), and then
> check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
> call schedule() explicitly there, but that seems a little more roundabout.
> Admittedly it's more usual to see kernel code call schedule() directly
> to yield the processor, but in this case I'm not convinced it's cleaner
> given we're already in a loop where the caller is checking TIF_RESCHED
> and then calling schedule() when it's set.

You could call cond_resched(), but really syscall exit is enough for what
you want. And the problem here, if a task prevents the CPU from stopping the
tick, is that task itself, not the fact that it doesn't get scheduled. If we
have other tasks than the current isolated one on the CPU, it means that the
environment is not ready for hard isolation.

And in general we shouldn't loop at all there: if something depends on the
tick, the CPU is not ready for isolation and something needs to be done:
setting some task affinity, etc... So we should just fail the prctl and let
the user deal with it.
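In code terms that is roughly the following, at the end of the prctl path
(the error value is only illustrative):

	/* Don't loop waiting for quiescence: if the tick is still needed,
	 * this CPU is simply not ready for isolation, so tell the caller. */
	if (!tick_nohz_tick_stopped())
		return -EAGAIN;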

> 
> -- 
> Chris Metcalf, EZChip Semiconductor
> http://www.ezchip.com
> 