Re: [PATCH v16 00/13] support "task_isolation" mode

2017-11-07 Thread Chris Metcalf

On 11/7/2017 12:10 PM, Christopher Lameter wrote:

On Mon, 6 Nov 2017, Chris Metcalf wrote:


On 11/6/2017 10:38 AM, Christopher Lameter wrote:

What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack).  Frederic suggested a while ago
forcing updates on cpustats was required as the last gating factor; do
we think that is still true?  Christoph was working on this at one
point - any progress from your point of view?

Well, if you still have the 1 Hz clock then you can simply defer the
NUMA remote page cleanup of the page allocator to the time you execute
that tick.

We have to get rid of the 1 Hz tick, so we don't want to tie anything
else to it...

Yes, we want to get rid of the 1 Hz tick, but the work on that could also
include dealing with the remote page cleanup issue that we have deferred.

Presumably we have another context there where we may be able to call
into the cleanup code with interrupts enabled.


Right now for task isolation we run with interrupts enabled during the
initial sys_prctl() call, and call quiet_vmstat_sync() there, which
currently calls refresh_cpu_vm_stats(false).  In fact we could certainly
pass "true" there instead (and probably should), since we can handle
dealing with the pagesets at this time.  As we return to userspace we
test whether anything surprising happened with vmstat; if it did, we jam
an EAGAIN into the syscall result value, but if not, we will be in
userspace and won't need to touch the vmstat counters until we next go
back into the kernel.
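
For concreteness, a minimal sketch of that exit-time check
(vmstat_idle() is the helper added earlier in this series; the function
name and exact call site below are illustrative, not the patch's actual
code):

    /* Sketch only: runs with interrupts disabled, just before the
     * return to userspace that completes the prctl() call.
     */
    static void task_isolation_check_vmstat(struct pt_regs *regs)
    {
            if (!vmstat_idle()) {
                    /* Counters were touched after quiet_vmstat_sync();
                     * turn the prctl() success into -EAGAIN so that
                     * userspace can simply retry.
                     */
                    syscall_set_return_value(current, regs, -EAGAIN, 0);
            }
    }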

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH v16 06/13] task_isolation: userspace hard isolation from kernel

2017-11-03 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
"nohz_full=CPULIST isolcpus=CPULIST" boot argument to enable
nohz_full and isolcpus.  The "task_isolation" state is then indicated
by setting a new task struct field, task_isolation_flag, to the
value passed by prctl(), and also setting a TIF_TASK_ISOLATION
bit in the thread_info flags.  When the kernel is returning to
userspace from the prctl() call and sees TIF_TASK_ISOLATION set,
it calls the new task_isolation_start() routine to arrange for
the task to avoid being interrupted in the future.

With interrupts disabled, task_isolation_start() ensures that kernel
subsystems that might cause a future interrupt are quiesced.  If it
doesn't succeed, it adjusts the syscall return value to indicate that
fact, and userspace can retry as desired.  In addition to stopping
the scheduler tick, the code takes any actions that might avoid
a future interrupt to the core, such as a worker thread being
scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
other exception or irq, the kernel will kill it with SIGKILL.
In addition to sending a signal, the code supports a kernel
command-line "task_isolation_debug" flag which causes a stack
backtrace to be generated whenever a task loses isolation.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can
clear the bit again later, and ignores exit/exit_group to allow
exiting the task without a pointless signal being delivered.
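
For illustration, that screening amounts to something like this (a
simplified sketch; the actual routine lives in kernel/isolation.c and
its name here is illustrative):

    /* Which syscalls an isolated task may issue without breaking
     * isolation: prctl() so it can turn isolation off again, and
     * exit()/exit_group() so it can die without a pointless signal.
     */
    static bool task_isolation_syscall_allowed(int syscall_nr)
    {
            switch (syscall_nr) {
            case __NR_prctl:
            case __NR_exit:
            case __NR_exit_group:
                    return true;
            default:
                    return false;   /* caller delivers the signal */
            }
    }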

The prctl() API allows for specifying a signal number to use instead
of the default SIGKILL, to allow for catching the notification
signal; for example, in a production environment, it might be
helpful to log information to the application logging mechanism
before exiting.  Or, the signal handler might choose to reset the
program counter back to the code segment intended to be run isolated
via prctl() to continue execution.
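
By way of example, a task might enter isolation with a custom signal
like this (the PR_* values shown are this series' uapi additions as
reproduced here; verify them against the patched <linux/prctl.h>, since
the signal-selection encoding in particular may differ by version):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_TASK_ISOLATION
    /* From the series' <linux/prctl.h> changes; not in mainline, so
     * double-check these values against the patched headers.
     */
    # define PR_SET_TASK_ISOLATION          48
    # define PR_TASK_ISOLATION_ENABLE       (1 << 0)
    # define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
    #endif

    static void lost_isolation(int sig)
    {
            /* Isolation was revoked before delivery; log and exit. */
            fprintf(stderr, "task isolation lost (signal %d)\n", sig);
            exit(1);
    }

    int main(void)
    {
            signal(SIGUSR1, lost_isolation);

            /* Retry until the kernel reports the core fully quiesced. */
            while (prctl(PR_SET_TASK_ISOLATION,
                         PR_TASK_ISOLATION_ENABLE |
                         PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0)) {
                    if (errno != EAGAIN) {
                            perror("prctl");
                            return 1;
                    }
            }

            /* ... userspace-only work, free of kernel interruptions ... */
            return 0;
    }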

In a number of cases we can tell on a remote cpu that we are
going to be interrupting the cpu, e.g. via an IPI or a TLB flush.
In that case we generate the diagnostic (and optional stack dump)
on the remote core to be able to deliver better diagnostics.
If the interrupt is not something caught by Linux (e.g. a
hypervisor interrupt) we can also request a reschedule IPI to
be sent to the remote core so it can be sure to generate a
signal to notify the process.

Separate patches that follow provide these changes for x86, tile,
arm, and arm64.

Signed-off-by: Chris Metcalf 
---
 Documentation/admin-guide/kernel-parameters.txt |   6 +
 include/linux/isolation.h   | 175 +++
 include/linux/sched.h   |   4 +
 include/uapi/linux/prctl.h  |   6 +
 init/Kconfig|  28 ++
 kernel/Makefile |   1 +
 kernel/context_tracking.c   |   2 +
 kernel/isolation.c  | 402 
 kernel/signal.c |   2 +
 kernel/sys.c|   6 +
 10 files changed, 631 insertions(+)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 05496622b4ef..aaf278f2cfc3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4025,6 +4025,12 @@
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation_debug[KNL]
+   In kernels built with CONFIG_TASK_ISOLATION, this
+   setting will generate console backtraces to
+   accompany the diagnostics generated about
+   interrupting tasks running with task isolation.
+
tcpm

[PATCH v16 00/13] support "task_isolation" mode

2017-11-03 Thread Chris Metcalf
time we are exiting to userspace, we just
jam in an EAGAIN and let userspace retry.  In practice, this doesn't
seem to ever happen.


What about using a per-cpu flag to stop doing new deferred work?

Andy also suggested we could structure the code to have the prctl()
set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu
data, or lru page cache).  Then, we could flush those structures right
from the sys_prctl() call, and when we were returning to user space,
we'd be confident that there wasn't going to be any new work added.

With the current set of things that we are disabling for task
isolation, though, it didn't seem necessary.  Quiescing the vmstat
shepherd seems like it is generally pretty safe since we will likely
be able to sync up the per-cpu cache and kill the deferred work with
high probability, with no expectation that additional work will show
up.  And since we can flush the LRU page cache with interrupts
disabled, that turns out not to be an issue either.

I could imagine that if we have to deal with some new kind of deferred
work, we might find the per-cpu flag becomes a good solution, but for
now we don't have a good use case for that approach.


How about stopping the dyn tick?

Right now we try to stop it on return to userspace, but if we can't,
we just return EAGAIN to userspace.  In practice, what I see is that
usually the tick stops immediately, but occasionally it doesn't; in
this case I've always seen that nr_running is >1, presumably with some
temporary kernel worker threads, and the user code just needs to call
prctl() until those threads are done.  We could structure things with
a completion that we wait for, which is set by the timer code when it
finally does stop the tick, but this may be overkill, particularly
since we'll only be running this prctl() loop from userspace on cores
where we have no other useful work that we're trying to run anyway.
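
For reference, the return path's use of the try_stop_full_tick() API
added earlier in this series might look roughly like this (the
surrounding function is illustrative, and I'm assuming the API returns
nonzero when the tick cannot be stopped yet):

    static int task_isolation_wait_for_tick(void)
    {
            /* If another runnable task keeps the tick going (e.g.
             * nr_running > 1), fail with -EAGAIN so the userspace
             * prctl() loop retries.
             */
            if (!tick_nohz_tick_stopped() && try_stop_full_tick())
                    return -EAGAIN;
            return 0;
    }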


What about TLB flushing?

We talked about this at Plumbers and some of the email discussion also
was about TLB flushing.  I haven't tried to add it to this patch set,
because I really want to avoid scope creep; in any case, I think I
managed to convince Andy that he was going to work on it himself. :)
Paul McKenney already contributed some framework for such a patch, in
commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of
->dynticks counter").

What about that d*mn 1 Hz clock?

It's still there, so this code still requires some further work before
it can actually get a process into long-term task isolation (without
the obvious one-line kernel hack).  Frederic suggested a while ago
forcing updates on cpustats was required as the last gating factor; do
we think that is still true?  Christoph was working on this at one
point - any progress from your point of view?


Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  Revert "sched/core: Drop the unused try_get_task_struct() helper
function"
  Add try_get_task_struct_on_cpu() to scheduler for task isolation
  Add try_stop_full_tick() API for NO_HZ_FULL
  task_isolation: userspace hard isolation from kernel
  Add task isolation hooks to arch-independent code
  arch/x86: enable task isolation functionality
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation self test

Francis Giraldeau (1):
  arch/arm: enable task isolation functionality

 Documentation/admin-guide/kernel-parameters.txt|   6 +
 arch/arm/Kconfig   |   1 +
 arch/arm/include/asm/thread_info.h |  10 +-
 arch/arm/kernel/entry-common.S |  12 +-
 arch/arm/kernel/ptrace.c   |  10 +
 arch/arm/kernel/signal.c   |  10 +-
 arch/arm/kernel/smp.c  |   4 +
 arch/arm/mm/fault.c|   8 +-
 arch/arm64/Kconfig |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/ptrace.c |  18 +-
 arch/arm64/kernel/signal.c |  10 +-
 arch/arm64/kernel/smp.c|   7 +
 arch/arm64/mm/fault.c  |   5 +
 arch/tile/Kconfig  |   1 +
 arch/tile/include/asm/thread_info.h|   2 +
 arch/tile/kernel/hardwall.c|   2 +
 arch/tile/kernel/irq.c |   3 +
 arch/tile/kernel/messaging.c   |   4 +
 arch/tile/kernel/process.c |   4 +
 arch/tile/kernel/ptrace.c  |  10 +
 arch/tile/kernel/single_step.c |   7 +
 arch/tile/kernel/smp.c

Re: [PATCH 02/18] arm64: ilp32: add documentation on the ILP32 ABI for ARM64

2016-10-24 Thread Chris Metcalf

On 10/21/2016 4:33 PM, Yury Norov wrote:

Based on Andrew Pinski's patch-series.

Signed-off-by: Yury Norov 
---
  Documentation/arm64/ilp32.txt | 46 +++
  1 file changed, 46 insertions(+)
  create mode 100644 Documentation/arm64/ilp32.txt

diff --git a/Documentation/arm64/ilp32.txt b/Documentation/arm64/ilp32.txt
new file mode 100644
index 000..b96c18f
--- /dev/null
+++ b/Documentation/arm64/ilp32.txt
@@ -0,0 +1,46 @@
+ILP32 AARCH64 SYSCALL ABI
+=========================
+
+This document describes the ILP32 syscall ABI and where it differs
+from the generic compat linux syscall interface.
+
+AARCH64/ILP32 userspace can potentially access top halves of registers that
+are passed as syscall arguments, so such registers (w0-w7) are deloused.


I'm not sure what "potentially access" here means: I think what you want to say
is that userspace can pass garbage in the top half, but you should be clearer 
about
what you mean here.  Also, you shouldn't use "deloused" here, since it's not a 
term
that's defined elsewhere in the kernel, even though it's been used colloquially 
on LKML.
Provide an actual implementation definition, like "have their top 32 bits 
zeroed".


+AARCH64/ILP32 provides next types turned to 64-bit (comparing to AARCH32):


What does "turned" mean here?  And I "next types" isn't standard English; you 
want
to say something like "the following types".  Likewise later with "next 
syscalls".

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH 01/18] 32-bit ABI: introduce ARCH_32BIT_OFF_T config option

2016-10-24 Thread Chris Metcalf

On 10/21/2016 4:33 PM, Yury Norov wrote:

All new 32-bit architectures should have 64-bit off_t type, but existing
architectures has 32-bit ones.

[...]
For syscalls sys_openat() and sys_open_by_handle_at() force_o_largefile()
is called, to set O_LARGEFILE flag, and this is the only difference
comparing to compat versions. All compat ABIs are already turned to use
64-bit off_t, except tile. So, compat versions for this syscalls are not
needed anymore. Tile is handled explicitly.

[...]
--- a/arch/tile/kernel/compat.c
+++ b/arch/tile/kernel/compat.c
COMPAT_SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned int, offset_high,
  #define compat_sys_readahead sys32_readahead
  #define sys_llseek compat_sys_llseek
  
+#define sys_openat compat_sys_openat
+#define sys_open_by_handle_at  compat_sys_open_by_handle_at
+
  /* Call the assembly trampolines where necessary. */
  #define compat_sys_rt_sigreturn _compat_sys_rt_sigreturn
  #define sys_clone _sys_clone


This patch accomplishes two goals that could be completely separated.
It's confusing to have them mixed in the same patch without any
discussion of why they are in the same patch.

First, you want to modify the default asm-generic behavior for
compat syscalls so that the default is sys_openat (etc) rather than
the existing compat_sys_openat, and then use that new behavior for
arm64 ILP32.  This lets you force O_LARGEFILE for arm64 ILP32 to
support having a 64-bit off_t at all times.  To do that, you fix the
asm-generic header, and then make tile have a special override.
This seems reasonable enough.

Second, you introduce ARCH_32BIT_OFF_T basically as a synonym for
"BITS_PER_WORD == 32", so that new 32-bit architectures can choose not
to enable it.  This is fine in the abstract, but I'm a bit troubled by
the fact that you are not actually introducing a new 32-bit
architecture here (just a new 32-bit mode for the arm 64-bit kernel).
Shouldn't this part of the change wait until someone actually has a
new 32-bit kernel to drive this forward?

If you want to push forward the ARCH_32BIT_OFF_T change in the absence
of an architecture that supports it, I would think it would be a lot
less confusing to have these two in separate patches, and make it
clear that the ARCH_32BIT_OFF_T change is just laying groundwork
for some hypothetical future architecture.

The existing commit language itself is also confusing. You write "All
compat ABIs are already turned to use 64-bit off_t, except tile."
First, I'm not sure what you mean by "turned" here.  And, tile is just
one of many compat ABIs that allow O_LARGEFILE not to be part of the
open call: see arm64's AArch32 ABI, MIPS o32, s390 31-bit emulation,
sparc64's 32-bit mode, and of course x86's 32-bit compat mode.
Presumably your point here is that tile is the only pre-existing
architecture that #includes <asm-generic/unistd.h> to create its compat
syscall table, and so I think "all except tile" here is particularly
confusing, since there are no architectures except tile that use the
__SYSCALL_COMPAT functionality in the current tree.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: Ping: [PATCH v15 00/13] support "task_isolation" mode

2016-09-30 Thread Chris Metcalf

On 9/27/2016 10:35 AM, Frederic Weisbecker wrote:

On 8/16/2016 5:19 PM, Chris Metcalf wrote:

Here is a respin of the task-isolation patch set.

Again, I have been getting email asking me when and where this patch
will be upstreamed so folks can start using it.  I had been thinking
the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
thing.  But perhaps it touches enough other subsystems that that
doesn't really make sense?  Andrew, would it make sense to take it
directly via your tree?  Frederic, Ingo, what do you think?

As it seems we are still debating a lot of things in this patchset that
has already reached v15, I think you should split it in smaller steps in
order to move forward, and only get into the next step once the previous
is merged.

You could start with a first batch that introduces the prctl() and does
the best-effort one-shot isolation part, which means the actions that
only need to be performed once, on the prctl() call.


So combining this with my reply a moment ago to Andy about just
disabling all deferrable work creation on task isolation cores, that
means we just need a way of checking that the dyntick is off on return
from the prctl.

We could do this in the prctl() itself, but it feels a bit fragile, since
we could do the check for no dyntick and try to return success,
and then some kind of interrupt and/or schedule event might happen
and by the time we actually got back to userspace the dyntick might
be running again.

I think what we can do is arrange to set a bit in the process state
that says we are returning from prctl, and then right as we are
returning to userspace with interrupts disabled, we can check if
that bit is set, and if so check at that point to see if the dyntick
is enabled, and if it is, force the syscall return value to EAGAIN
(and clear the bit regardless).
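
A sketch of that last step (the flag name below is hypothetical, and the
call assumes the generic syscall_set_return_value() helper):

    /* With interrupts disabled, immediately before entering userspace.
     * TIF_ISOLATION_RETURN is a hypothetical flag set by the prctl()
     * return path.
     */
    if (test_and_clear_tsk_thread_flag(current, TIF_ISOLATION_RETURN)) {
            if (!tick_nohz_tick_stopped())
                    syscall_set_return_value(current, regs, -EAGAIN, 0);
    }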

Within the prctl() code itself, we check for hard prerequisites like being on
a task-isolation cpu, and fail -EINVAL if not.

The upshot is that we end up spinning on a loop through userspace where
we keep retrying the prctl() until the timer quiesces.


Once we get that merged we can focus on what needs to be performed on
every return to userspace, if that's really needed, including possibly
waiting on some completion.


So in NOSIG mode, instead of setting EAGAIN in the return to
userspace path, we arrange to just wait.  We can figure out in a
follow-on patch whether we want to wait by spinning in some way
or by actually waiting on a completion.  For now I'll just include the
remainder of the patch (with spinning) as an RFC just so people
can have the next piece to look ahead to, but I like your idea of
breaking it out of the main patch series entirely.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: Ping: [PATCH v15 00/13] support "task_isolation" mode

2016-09-13 Thread Chris Metcalf

Thanks for your explanation of the TIF_TASK_ISOLATION
flag being needed for x86's _TIF_WORK_SYSCALL_ENTRY.
It makes perfect sense in retrospect :-)

On 9/12/2016 8:20 PM, Francis Giraldeau wrote:

On a side note, the NOSIG mode may be confusing for the users. At first,
I was expecting that NOSIG behaves the same way as the normal task isolation
mode. In the current situation, if the user wants the normal behavior, but
does not care about the signal, the user must register an empty signal handler.


So, "normal behavior" isn't really well defined once we start
allowing empty signal handlers.  In particular, task isolation will
be turned off before invoking your signal handler, and if the
handler is empty, you just lose isolation and that's that.  By
contrast, the NOSIG mode will try to keep you isolated.

I'm definitely open to suggestions about how to frame the API
for NOSIG or equivalent modes.  What were you expecting to
be able to do by suppressing the signal, and how is NOSIG not
the thing you wanted?


However, if I understand correctly, other settings besides NOHZ and
isolcpus are required to support quiet CPUs, such as irq_affinity and
rcu_nocb.  It would be very convenient from the user point of view if
these other settings were configured correctly.


I think it makes sense to move towards a mode where enabling
task_isolation sets up the rcu_nocbs and irq_affinity automatically,
rather than requiring users to understand all the fiddly configuration
and boot argument details.


I can work on that and also write some doc (Documentation/task-isolation.txt ?).


Sure, documentation is always welcome!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-09-12 Thread Chris Metcalf

On 9/12/2016 1:41 PM, Andy Lutomirski wrote:

On Sep 9, 2016 1:40 PM, "Chris Metcalf"  wrote:

On 9/2/2016 1:28 PM, Andy Lutomirski wrote:

On Sep 2, 2016 7:04 AM, "Chris Metcalf"  wrote:

On 8/30/2016 3:50 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf  wrote:

So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.

My concern is that it's not obvious to readers of the code that the
loop ever terminates.  It really ought to, but it's doing something
very odd.  Normally we can loop because we get scheduled out, but
actually blocking in the return-to-userspace path, especially blocking
on a condition that doesn't have a wakeup associated with it, is odd.


True, although, comments :-)

Regardless, though, this doesn't seem at all weird to me in the
context of the vmstat and lru stuff.  It's exactly parallel to
the fact that we loop around on checking need_resched and signal, and
in some cases you could imagine multiple loops around when we schedule
out and get a signal, so loop around again, and then another
reschedule event happens during signal processing so we go around
again, etc.  Eventually it settles down.  It's the same with the
vmstat/lru stuff.

Only kind of.

When we say, effectively, while (need_resched()) schedule();, we're
not waiting for an event or condition per se.  We're runnable (in the
sense that userspace wants to run and we're not blocked on anything)
the entire time -- we're simply yielding to some other thread that is
also runnable.  So if that loop runs forever, it either means that
we're at low priority and we genuinely shouldn't be running or that
there's a scheduler bug.

If, on the other hand, we say while (not quiesced) schedule(); (or
equivalent), we're saying that we're *not* really ready to run and
that we're waiting for some condition to change.  The condition in
question is fairly complicated and won't wake us when we are ready.  I
can also imagine the scheduler getting rather confused, since, as far
as the scheduler knows, we are runnable and we are supposed to be
running.


So, how about a code structure like this?

In the main return-to-userspace loop where we check TIF flags,
we keep the notion of our TIF_TASK_ISOLATION flag that causes
us to invoke a task_isolation_prepare() routine.  This routine
does the following things:

1. As you suggested, set a new TIF bit (or equivalent) that says the
system should no longer create deferred work on this core, and then
flush any necessary already-deferred work (currently, the LRU cache
and the vmstat stuff).  We never have to go flush the deferred work
again during this task's return to userspace.  Note that this bit can
only be set on a core marked for task isolation, so it can't be used
for denial of service type attacks on normal cores that are trying to
multitask normal Linux processes.

I think it can't be a TIF flag unless you can do the whole mess with
preemption off because, if you get preempted, other tasks on the CPU
won't see the flag.  You could do it with a percpu flag, I think.


Yes, a percpu flag - you're right.  I think it will make sense for this to
be a flag declared in linux/isolation.h which can be read by vmstat, LRU, etc.
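
Something along these lines, say (names illustrative, not the patch's):

    /* In linux/isolation.h: */
    DECLARE_PER_CPU(bool, task_isolation_quiesced);

    /* Subsystems like vmstat or the LRU cache could then test: */
    static inline bool task_isolation_no_deferred_work(void)
    {
            return this_cpu_read(task_isolation_quiesced);
    }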


2. Check if the dyntick is stopped, and if not, wait on a completion
that will be set when it does stop.  This means we may schedule out at
this point, but when we return, the deferred work stuff is still safe
since your bit is still set, and in principle the dyn tick is
stopped.

Then, after we disable interrupts and re-read the thread-info flags,
we check to see if the TIF_TASK_ISOLATION flag is the ONLY flag still
set that would keep us in the loop.  This will always end up happening
on each return to userspace, since the only thing that actually clears
the bit is a prctl() call.  When that happens we know we are about to
return to userspace, so we call task_isolation_ready(), which now has
two things to do:

Would it perhaps be more straightforward to do the stuff before the
loop and not check TIF_TASK_ISOLATION in the loop?


We can certainly play around with just not looping in this case, but
in particular I can imagine an isolated task entering the kernel and
then doing something that requires scheduling a kernel task.  We'd
clearly like that other task to run before the isolated task returns to
userspace.  But then, that other task might do something to re-enable
the dyntick.  That's why we'd like to recheck that dyntick is off in
the loop after each potential call to schedule().
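
Schematically, the loop shape we're talking about is something like
this (simplified from the arch exit-to-usermode code; the flag mask and
helper names are illustrative):

    while (ti_flags & EXIT_TO_USERMODE_LOOP_FLAGS) {
            local_irq_enable();
            if (ti_flags & _TIF_NEED_RESCHED)
                    schedule();
            if (ti_flags & _TIF_TASK_ISOLATION)
                    task_isolation_enter();  /* re-checks dyntick, vmstat, etc. */
            local_irq_disable();
            ti_flags = READ_ONCE(current_thread_info()->flags);
    }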


1. We check that the

Re: Ping: [PATCH v15 00/13] support "task_isolation" mode

2016-09-12 Thread Chris Metcalf

On 9/7/2016 5:11 PM, Francis Giraldeau wrote:

On 2016-08-29 12:27 PM, Chris Metcalf wrote:

On 8/16/2016 5:19 PM, Chris Metcalf wrote:

Here is a respin of the task-isolation patch set.

No concerns have been raised yet with the v15 version of the patch series
in the two weeks since I posted it, and I think I have addressed all
previously-raised concerns (or perhaps people have just given up arguing
with me).

There is a header include cycle in the v15 patch set on x86_64 that
causes a compilation error.  You will find the patch (and other fixes)
in the following branch:

 https://github.com/giraldeau/linux/commits/dataplane-x86-fix-inc


Thanks, I fixed the header inclusion loop by converting
task_isolation_set_flags() to a macro, removing one unnecessary
#include, and replacing another with a "struct task_struct;" forward
declaration.  That avoids having to dump too much isolation-related
stuff into the apic.h header (note that you'd also need to include the
empty #define for when isolation is configured off).


In the test file, it is indicated that timer-tick.c should be changed
to disable the 1 Hz tick; is there a reason why the change is not in
the patch set?  I added this trivial change in the branch
dataplane-x86-fix-inc above.


Yes, Frederic prefers that we not allow any way of automatically
disabling the tick for now.  Hopefully we will shortly clean up the
last few things that require it to keep ticking.  But for now it's on
a parallel track to the task isolation stuff.


The syscall test fails on x86:

 $ sudo ./isolation
 [...]
 test_syscall: FAIL (0x100)
 test_syscall (SIGUSR1): FAIL (0x100)


Your next email suggested adding TIF_TASK_ISOLATION to the set of
flags in _TIF_WORK_SYSCALL_ENTRY.  I'm happy to make this change
regardless (it's consistent with Andy's request to add the task
isolation flag to _TIF_ALLWORK_MASK), but I'm puzzled: as far as
I know there is no way for TIF_TASK_ISOLATION to be set unless
TIF_NOHZ is also set.  The context_tracking_init() code forces TIF_NOHZ
on for every task during boot up, and nothing ever clears it, so...


I wanted to debug this problem with gdb and a KVM virtual machine.
However, the TSC clock source is detected as unreliable, even with the
boot parameter tsc=reliable, and therefore prctl(PR_SET_TASK_ISOLATION,
PR_TASK_ISOLATION_ENABLE) always returns EAGAIN.  Is there a trick to
run task isolation in a VM (at least for debugging purposes)?

BTW, this was causing the test to enter an infinite loop.  If the clock
source is not reliable, maybe a different error code should be returned,
because this situation is not transient.


That's a good idea - do you know what the check should be in that
case?  We can just return EINVAL, as you suggest.


In the meantime, I added a check for a maximum of 100 retries in the
test (prctl_safe()).
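
A sketch of such a bounded-retry helper (the real prctl_safe() lives in
Francis's test branch; this version is illustrative and assumes
<errno.h> and <sys/prctl.h> are included):

    static int prctl_safe(int option, unsigned long arg)
    {
            int i;

            for (i = 0; i < 100; i++) {
                    if (prctl(option, arg, 0, 0, 0) == 0)
                            return 0;
                    if (errno != EAGAIN)
                            break;          /* hard failure; don't spin */
            }
            return -1;      /* persistent failure, e.g. unreliable TSC */
    }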


Thanks, that's a good idea.  I'll add your changes to the selftest code for the
next release.


When running only the test_jitter(), the isolation mode is lost:

 [ 6741.566048] isolation/9515: task_isolation mode lost due to irq_work

With ftrace (events/workqueue/workqueue_execute_start), I get a bit more info:

  kworker/1:1-676   [001]   6610.097128: workqueue_execute_start: work struct 8801a784ca20: function dbs_work_handler

The governor was ondemand, so I tried to set the frequency scaling
governor to performance, but that does not solve the issue.  Is there a
way to suppress this irq_work?  Should we run the isolated task with
high real-time priority, such that it never gets preempted?


On the tile platform we don't have the frequency scaling stuff to
contend with, so I don't know much about it.  I'd be very curious to
know what you can figure out on this front.

Thanks a lot for your help!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v14 04/14] task_isolation: add initial support

2016-09-09 Thread Chris Metcalf

On 9/3/2016 11:31 AM, Frederic Weisbecker wrote:

On Tue, Aug 30, 2016 at 02:17:27PM -0400, Chris Metcalf wrote:

On 8/30/2016 1:36 PM, Chris Metcalf wrote:

See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

 set_current_state(TASK_INTERRUPTIBLE);
 schedule();
 __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?

The answer is that the scheduler will keep bringing us back to this
point (either after running another runnable task if there is one,
or else just returning to this point immediately without doing a
context switch), and we will then go around the "prepare exit to
userspace" loop and perhaps discover that enough time has gone
by that the last dyntick interrupt has triggered and the kernel has
quiesced the dynticks.  At that point we stop calling schedule()
over and over and can return normally to userspace.

Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
schedule(), I never get control back.  So I don't want to do that.

I suppose I could do a schedule_timeout() here instead and try
to figure out how long to wait for the next dyntick.  But since we
don't expect anything else running on the core anyway, it seems
like we are probably working too hard at this point.  I don't think
it's worth it just to go into the idle task and (possibly) save some
power for a few microseconds.

The more I think about this, the more I think I am micro-optimizing
by trying to poke the scheduler prior to some external thing setting
need_resched, so I think the thing to do here is in fact, nothing.

Exactly, I fear there is nothing you can do about that.


I won't worry about rescheduling but will just continue going around
the prepare-exit-to-userspace loop until the last dyn tick fires.

You mean waiting in prepare-exit-to-userspace until the last tick fires?
I'm not sure it's a good idea either: this could take ages, and it
might well never happen.


If you don't mind, let's take this to the other thread discussing what to do
at return-to-userspace time:

https://lkml.kernel.org/r/440e20d1-441a-3228-6b37-6e71e9fce...@mellanox.com

In general, I think if your task ends up waiting forever for the dyntick to
stop, with the approach suggested in that thread you will at least be
able to tell more easily, since the core will be running the idle task and
your task will be waiting on a dyntick-specific completion.


I'd rather say that if we are in signal mode, fire the signal, otherwise
just return to userspace.  If there is a tick, it means that the
environment is not suitable for isolation anyway.


True if there is an ongoing tick, but not if the tick is about to stop
and we just need to wait for the last tick to fire.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-09-09 Thread Chris Metcalf

On 9/2/2016 1:28 PM, Andy Lutomirski wrote:

On Sep 2, 2016 7:04 AM, "Chris Metcalf"  wrote:

On 8/30/2016 3:50 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf  wrote:

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:

What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.


This is an interesting idea!

However, there are a number of implementation ideas that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile, that I'm familiar with) have a common
chunk of code that always runs on every entry to the kernel.  It would
not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry.  But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Just use the context tracking entry hook.  It's 100% reliable.  The
relevant x86 function is enter_from_user_mode(), but I would just hook
into user_exit() in the common code.  (This code had better be
reliable, because context tracking depends on it, and, if context
tracking doesn't work on a given arch, then isolation isn't going to
work regardless.)


This looks a lot cleaner than last time I looked at the x86 code.  So
yes, I think we could do an entry-point approach plausibly now.

This is also good for when we want to look at deferring the kernel TLB flush,
since it's the same mechanism that would be required for that.



There's at least one gotcha for the latter: NMIs aren't currently
guaranteed to go through context tracking.  Instead they use their own
RCU hooks.  Deferred TLB flushes can still be made to work, but a bit
more care will be needed.  I would probably approach it with an
additional NMI hook in the same places as rcu_nmi_enter() that does,
more or less:

if (need_tlb_flush) flush();

and then make sure that the normal exit hook looks like:

if (need_tlb_flush) {
   flush();
   barrier(); /* An NMI must not see !need_tlb_flush if the TLB hasn't been flushed */
   need_tlb_flush = false;
}


This is a good point.  For now I will continue not trying to include
the TLB flush in the current patch series, so I will sit on this until
we're ready to do so.


So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.

My concern is that it's not obvious to readers of the code that the
loop ever terminates.  It really ought to, but it's doing something
very odd.  Normally we can loop because we get scheduled out, but
actually blocking in the return-to-userspace path, especially blocking
on a condition that doesn't have a wakeup associated with it, is odd.


True, although, comments :-)

Regardless, though, this doesn't seem at all weird to me in the
context of the vmstat and lru stuff.  It's exactly parallel to
the fact that we loop around on checking need_resched and signal, and
in some cases you could imagine multiple loops around when we schedule
out and get a signal, so loop around again, and then another
reschedule event happens during signal processing so we go around
again, etc.  Eventually it settles down.  It's the same with the
vmstat/lru stuff.

Only kind of.

When we say, effectively, while (need_resched()) schedule();, we're
not waiting for an event or condition per se.  We're runnable (in the
sense that userspace wants to run and we're not blocked on anything)
the entire time -- we're simply yielding to some other thread that is
also runnable.  So if that loop runs forever, it either means that
we're at low priority and we genuinely shouldn't be running or that
there's a scheduler bug.

If, on the other hand, we say while (not quiesced) schedule(); (or
equivalent), we're saying that we're *not* really ready to run and
that we're waiting for some condition to change.  The condition in
question is fairly complicated and won't wake us when we are ready.  I
can also imagine the scheduler getting rather confused, since, as far
as the scheduler knows, we are runnable and we are supposed to be
running.


So, how about a code structure like this?

In the main return-to-user

Re: [PATCH v15 04/13] task_isolation: add initial support

2016-09-02 Thread Chris Metcalf

On 8/30/2016 3:50 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 12:37 PM, Chris Metcalf  wrote:

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:

What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.


This is an interesting idea!

However, there are a number of implementation ideas that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile, that I'm familiar with) have a common
chunk of code that always runs on every entry to the kernel.  It would
not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry.  But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Just use the context tracking entry hook.  It's 100% reliable.  The
relevant x86 function is enter_from_user_mode(), but I would just hook
into user_exit() in the common code.  (This code had better be
reliable, because context tracking depends on it, and, if context
tracking doesn't work on a given arch, then isolation isn't going to
work regardless.)


This looks a lot cleaner than last time I looked at the x86 code.  So
yes, I think we could do an entry-point approach plausibly now.

This is also good for when we want to look at deferring the kernel TLB flush,
since it's the same mechanism that would be required for that.


But it does seem like we are adding noticeable maintenance cost on
the mainline kernel to support task isolation by doing this.  My guess
is that it is easier to support the kind of "are you clean?" / "get
clean" APIs for subsystems, rather than weaving a whole set of "stay
clean" mechanisms into each subsystem.

My intuition is that it's the other way around.  For the mainline
kernel, having a nice clean well-integrated implementation is nicer
than having a bolted-on implementation that interacts in potentially
complicated ways.  Once quiescence support is in mainline, the size of
the diff or the degree to which it's scattered around is irrelevant
because it's not a diff any more.


I'm not concerned with the size of the diff, just with the intrusiveness into
the various subsystems.

That said, code talks, so let me take a swing at doing it the way you suggest
for vmstat/lru and we'll see what it looks like.


So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.

My concern is that it's not obvious to readers of the code that the
loop ever terminates.  It really ought to, but it's doing something
very odd.  Normally we can loop because we get scheduled out, but
actually blocking in the return-to-userspace path, especially blocking
on a condition that doesn't have a wakeup associated with it, is odd.


True, although, comments :-)

Regardless, though, this doesn't seem at all weird to me in the
context of the vmstat and lru stuff.  It's exactly parallel to
the fact that we loop around on checking need_resched and signal, and
in some cases you could imagine multiple loops around when we schedule
out and get a signal, so loop around again, and then another
reschedule event happens during signal processing so we go around
again, etc.  Eventually it settles down.  It's the same with the
vmstat/lru stuff.


Also, this cond_resched stuff doesn't worry me too much at a
fundamental level -- if we're really going quiescent, shouldn't we be
able to arrange that there are no other schedulable tasks on the CPU
in question?

We aren't currently planning to enforce things in the scheduler, so if
the application affinitizes another task on top of an existing task
isolation task, by default the task isolation task just dies. (Unless
it's using NOSIG mode, in which case it just ends up stuck in the
kernel trying to wait out the dyntick until you either kill it, or
re-affinitize the offending task.)  But I'm reluctant to guarantee
every possible way that you might (perhaps briefly) have some
schedulable task, and the current approach seems pretty robust if that
sort of thing happens.

This kind of waiting out the dyntick scares me.  Why is there ever a
dyntick that you

Re: [PATCH v15 04/13] task_isolation: add initial support

2016-09-02 Thread Chris Metcalf

On 9/1/2016 6:06 AM, Peter Zijlstra wrote:

On Tue, Aug 30, 2016 at 11:32:16AM -0400, Chris Metcalf wrote:

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

What !? I really don't get this, what are you waiting for? Why is
rescheduling making things better.

We need to wait for the last dyntick to fire before we can return to
userspace.  There are plenty of options as to what we can do in the
meanwhile.

Why not keep your _TIF_TASK_ISOLATION_FOO flag set and re-enter the
loop?

I really don't see how setting TIF_NEED_RESCHED is helping anything.


Yes, I think I addressed that in an earlier reply to Frederic; and you're right,
I don't think TIF_NEED_RESCHED or schedule() are the way to go.

https://lkml.kernel.org/g/107bd666-dbcf-7fa5-ff9c-f79358899...@mellanox.com

Any thoughts on the question of "just re-enter the loop" vs. schedule_timeout()?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 1:10 PM, Frederic Weisbecker wrote:

On Tue, Aug 30, 2016 at 11:41:36AM -0400, Chris Metcalf wrote:

On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:

On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:

On 8/11/2016 2:50 PM, Christoph Lameter wrote:

On Thu, 11 Aug 2016, Frederic Weisbecker wrote:


Do we need to quiesce vmstat everytime before entering userspace?
I thought that vmstat only need to be offlined once and for all?

Once is sufficient after disabling the tick.

It's true that task_isolation_enter() is called every time before
returning to user space while task isolation is enabled.

But once we enter the kernel again after returning from the initial
prctl() -- assuming we are in NOSIG mode so doing so is legal in the
first place -- almost anything can happen, certainly including
restarting the tick.  Thus, we have to make sure that normal quiescing
happens again before we return to userspace.

Yes but we need to sort out what needs to be called only once on prctl().

Once vmstat is quiesced, it's not going to need quiescing again even if we
restart the tick.

That's true, but I really do like the idea of having a clean structure
where we verify all our prerequisites in task_isolation_ready(), and
have code to try to get things fixed up in task_isolation_enter().
If we start moving some things here and some things there, it gets
harder to manage.  I think by testing "!vmstat_idle()" in
task_isolation_enter() we are avoiding any substantial overhead.

I think that making the code clearer on what needs to be done once for
all on prctl() and what needs to be done on all actual syscall return
is more important for readability.


We don't need to just do it on prctl(), though.  We also need to do
it whenever we have been in the kernel for another reason, which
can happen with NOSIG.  So we need to do this on the common return
to userspace path no matter what, right?  Or am I missing something?
It seems like if, for example, we do mmaps or page faults, then on return
to userspace, some of those counters will have been incremented and
we'll need to run the quiet_vmstat_sync() code.


+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

Again, that won't help

It won't be better than spinning in a loop if there aren't any other
schedulable processes, but it won't be worse either.  If there is
another schedulable process, we at least will schedule it sooner than
if we just sat in a busy loop and waited for the scheduler to kick
us. But there's nothing else we can do anyway if we want to maintain
the guarantee that the dyn tick is stopped before return to userspace.

I don't think it helps either way. If reschedule is pending, the current
task already has TIF_RESCHED set.

See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

 set_current_state(TASK_INTERRUPTIBLE);
 schedule();
 __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


The answer is that the scheduler will keep bringing us back to this
point (either after running another runnable task if there is one,
or else just returning to this point immediately without doing a
context switch), and we will then go around the "prepare exit to
userspace" loop and perhaps discover that enough time has gone
by that the last dyntick interrupt has triggered and the kernel has
quiesced the dynticks.  At that point we stop calling schedule()
over and over and can return normally to userspace.

It's very counter-intuitive to burn cpu time intentionally in the kernel.
I really don't see another way to resolve this, though.  I don't think
it would be safe, for example, to just promote the next dyntick to
running immediately (rather than waiting a few microseconds until
it is scheduled to go off).

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 2:43 PM, Andy Lutomirski wrote:

On Aug 30, 2016 10:02 AM, "Chris Metcalf"  wrote:

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf  wrote:

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.

Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.


What if we did it the other way around: set a percpu flag saying
"going quiescent; disallow new deferred work", then finish all
existing work and return to userspace.  Then, on the next entry, clear
that flag.  With the flag set, vmstat would just flush anything that
it accumulates immediately, nothing would be added to the LRU list,
etc.


This is an interesting idea!

However, there are a number of implementation ideas that make me
worry that it might be a trickier approach overall.

First, "on the next entry" hides a world of hurt in four simple words.
Some platforms (arm64 and tile, that I'm familiar with) have a common
chunk of code that always runs on every entry to the kernel.  It would
not be too hard to poke at the assembly and make those platforms
always run some task-isolation specific code on entry.  But x86 scares
me - there seem to be a whole lot of ways to get into the kernel, and
I'm not convinced there is a lot of shared macrology or whatever that
would make it straightforward to intercept all of them.

Then, there are the two actual subsystems in question.  It looks like
we could intercept LRU reasonably cleanly by hooking pagevec_add()
to return zero when we are in this "going quiescent" mode, and that
would keep the per-cpu vectors empty.  The vmstat stuff is a little
trickier since all the existing code is built around updating the per-cpu
stuff and then only later copying it off to the global state.  I suppose
we could add a test-and-flush at the end of every public API and not
worry about the implementation cost.
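
For the LRU half, the interception could be as small as this (a sketch
only; the percpu flag is the hypothetical "going quiescent" marker
discussed earlier, not an existing kernel symbol):

    /* pagevec_add() returns the space remaining, and callers drain
     * the pagevec when it returns zero, so forcing zero in quiescent
     * mode keeps the per-cpu vectors empty.
     */
    static inline unsigned pagevec_add(struct pagevec *pvec, struct page *page)
    {
            pvec->pages[pvec->nr++] = page;
            if (this_cpu_read(task_isolation_quiesced))
                    return 0;
            return pagevec_space(pvec);
    }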

But it does seem like we are adding noticeable maintenance cost on
the mainline kernel to support task isolation by doing this.  My guess
is that it is easier to support the kind of "are you clean?" / "get
clean" APIs for subsystems, rather than weaving a whole set of "stay
clean" mechanisms into each subsystem.

So to pop up a level, what is your actual concern about the existing
"do it in a loop" model?  The macrology currently in use means there
is zero cost if you don't configure TASK_ISOLATION, and the software
maintenance cost seems low since the idioms used for task isolation
in the loop are generally familiar to people reading that code.


Also, this cond_resched stuff doesn't worry me too much at a
fundamental level -- if we're really going quiescent, shouldn't we be
able to arrange that there are no other schedulable tasks on the CPU
in question?


We aren't currently planning to enforce things in the scheduler, so if
the application affinitizes another task on top of an existing task
isolation task, by default the task isolation task just dies. (Unless
it's using NOSIG mode, in which case it just ends up stuck in the
kernel trying to wait out the dyntick until you either kill it, or
re-affinitize the offending task.)  But I'm reluctant to guarantee
every possible way that you might (perhaps briefly) have some
schedulable task, and the current approach seems pretty robust if

Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 1:36 PM, Chris Metcalf wrote:

See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

 set_current_state(TASK_INTERRUPTIBLE);
 schedule();
 __set_current_state(TASK_RUNNING);

I don't see how that helps. What will wake the thread up except a signal?


The answer is that the scheduler will keep bringing us back to this
point (either after running another runnable task if there is one,
or else just returning to this point immediately without doing a
context switch), and we will then go around the "prepare exit to
userspace" loop and perhaps discover that enough time has gone
by that the last dyntick interrupt has triggered and the kernel has
quiesced the dynticks.  At that point we stop calling schedule()
over and over and can return normally to userspace. 


Oops, you're right that if I set TASK_INTERRUPTIBLE, then if I call
schedule(), I never get control back.  So I don't want to do that.

I suppose I could do a schedule_timeout() here instead and try
to figure out how long to wait for the next dyntick.  But since we
don't expect anything else running on the core anyway, it seems
like we are probably working too hard at this point.  I don't think
it's worth it just to go into the idle task and (possibly) save some
power for a few microseconds.

The more I think about this, the more I think I am micro-optimizing
by trying to poke the scheduler prior to some external thing setting
need_resched, so I think the thing to do here is in fact, nothing.
I won't worry about rescheduling but will just continue going around
the prepare-exit-to-userspace loop until the last dyn tick fires.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 12:30 PM, Andy Lutomirski wrote:

On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf  wrote:

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:

+   /*
+* Request rescheduling unless we are in full dynticks mode.
+* We would eventually get pre-empted without this, and if
+* there's another task waiting, it would run; but by
+* explicitly requesting the reschedule, we may reduce the
+* latency.  We could directly call schedule() here as well,
+* but since our caller is the standard place where schedule()
+* is called, we defer to the caller.
+*
+* A more substantive approach here would be to use a struct
+* completion here explicitly, and complete it when we shut
+* down dynticks, but since we presumably have nothing better
+* to do on this core anyway, just spinning seems plausible.
+*/
+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.

Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?


I'm not sure where this should be spelled out, to be honest.  I guess
I can add some commentary to the commit message explaining this part.

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.

None of the above should need to *loop*, though, AFAIK.


Ordering is a problem, though.

We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched in refresh_cpu_vm_stats), or
just while waiting for that last dyntick interrupt to occur.  I'm also
not sure that even something as simple as draining the per-cpu lru
cache can be done holding interrupts disabled throughout - certainly
there's a !SMP code path there that just re-enables interrupts
unconditionally, which gives me pause.

At any rate at that point you need to retest for signals, resched,
etc, all as usual, and then you need to recheck the task isolation
prerequisites once more.

I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.
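
For the record, a minimal sketch of that integration, loosely modeled
on the x86 exit_to_usermode_loop(); the names follow this series, but
the details are illustrative and EXIT_WORK stands in for the arch's
TIF work mask:

        u32 cached_flags = READ_ONCE(current_thread_info()->flags);

        while (true) {
                local_irq_enable();

                if (cached_flags & _TIF_NEED_RESCHED)
                        schedule();
                if (cached_flags & _TIF_SIGPENDING)
                        do_signal(regs);
                if (cached_flags & _TIF_TASK_ISOLATION)
                        task_isolation_enter();  /* IRQs on, best effort */

                local_irq_disable();
                cached_flags = READ_ONCE(current_thread_info()->flags);
                if (!(cached_flags & EXIT_WORK) && task_isolation_ready())
                        break;  /* IRQs stay off from here to userspace */
        }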

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/29/2016 8:55 PM, Frederic Weisbecker wrote:

On Mon, Aug 15, 2016 at 10:59:55AM -0400, Chris Metcalf wrote:

On 8/11/2016 2:50 PM, Christoph Lameter wrote:

On Thu, 11 Aug 2016, Frederic Weisbecker wrote:


Do we need to quiesce vmstat everytime before entering userspace?
I thought that vmstat only need to be offlined once and for all?

Once is sufficient after disabling the tick.

It's true that task_isolation_enter() is called every time before
returning to user space while task isolation is enabled.

But once we enter the kernel again after returning from the initial
prctl() -- assuming we are in NOSIG mode so doing so is legal in the
first place -- almost anything can happen, certainly including
restarting the tick.  Thus, we have to make sure that normal quiescing
happens again before we return to userspace.

Yes but we need to sort out what needs to be called only once on prctl().

Once vmstat is quiesced, it's not going to need quiescing again even if we
restart the tick.


That's true, but I really do like the idea of having a clean structure
where we verify all our prerequisites in task_isolation_ready(), and
have code to try to get things fixed up in task_isolation_enter().
If we start moving some things here and some things there, it gets
harder to manage.  I think by testing "!vmstat_idle()" in
task_isolation_enter() we are avoiding any substantial overhead.

It might be less confusing to rename task_isolation_enter() to
task_isolation_prepare().

Remember too that in general, we really don't need to think about
return-to-userspace as a hot path for task isolation, unlike how we
think about it all the rest of the time.  So it makes sense to
prioritize keeping things clean from a software development
perspective first, and high-performance only secondarily.


The thing to remember is that this is only relevant if the user has
explicitly requested the NOSIG behavior from task isolation, which we
don't really expect to be the default - we are implicitly encouraging
use of the default semantics of "you can't enter the kernel again
until you turn off isolation".

That's right. Although NOSIG is the only thing we can afford as long as
we drag around the 1 Hz tick.


True enough.  Hopefully we'll finish sorting that out soon enough.


+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);
Again, that won't help

It won't be better than spinning in a loop if there aren't any other
schedulable processes, but it won't be worse either.  If there is
another schedulable process, we at least will schedule it sooner than
if we just sat in a busy loop and waited for the scheduler to kick
us. But there's nothing else we can do anyway if we want to maintain
the guarantee that the dyntick is stopped before return to userspace.

I don't think it helps either way. If reschedule is pending, the current
task already has TIF_RESCHED set.


See the other thread with Peter Z for the longer discussion of this.
At this point I'm leaning towards replacing the set_tsk_need_resched() with

set_current_state(TASK_INTERRUPTIBLE);
schedule();
__set_current_state(TASK_RUNNING);

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-30 Thread Chris Metcalf

On 8/30/2016 3:58 AM, Peter Zijlstra wrote:

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:

+   /*
+* Request rescheduling unless we are in full dynticks mode.
+* We would eventually get pre-empted without this, and if
+* there's another task waiting, it would run; but by
+* explicitly requesting the reschedule, we may reduce the
+* latency.  We could directly call schedule() here as well,
+* but since our caller is the standard place where schedule()
+* is called, we defer to the caller.
+*
+* A more substantive approach here would be to use a struct
+* completion here explicitly, and complete it when we shut
+* down dynticks, but since we presumably have nothing better
+* to do on this core anyway, just spinning seems plausible.
+*/
+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.

Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

So this isn't spelled out anywhere. Why does this need to be in the
return to user path?


I'm not sure where this should be spelled out, to be honest.  I guess
I can add some commentary to the commit message explaining this part.

The basic idea is just that we don't want to be at risk from the
dyntick getting enabled.  Similarly, we don't want to be at risk of a
later global IPI due to lru_add_drain stuff, for example.  And, we may
want to add additional stuff, like catching kernel TLB flushes and
deferring them when a remote core is in userspace.  To do all of this
kind of stuff, we need to run in the return to user path so we are
late enough to guarantee no further kernel things will happen to
perturb our carefully-arranged isolation state that includes dyntick
off, per-cpu lru cache empty, etc etc.


Even without that consideration, we really can't be sure we stay in
dynticks mode if we disable the dynamic tick, but then enable interrupts,
and end up taking an interrupt on the way back to userspace, and
it turns the tick back on.  That's why we do it here, where we know
interrupts will stay disabled until we get to userspace.

But but but.. task_isolation_enter() is explicitly ran with IRQs
_enabled_!! It even WARNs if they're disabled.


Yes, true!  But if you pop up to the caller, the key thing is the
task_isolation_ready() routine where we are invoked with interrupts
disabled, and we confirm that all our criteria are met (including
tick_nohz_tick_stopped), and then leave interrupts disabled as we
return from there onwards to userspace.

The task_isolation_enter() code just does its best-faith attempt to
make sure all these criteria are met, just like all the other TIF_xxx
flag tests do in exit_to_usermode_loop() on x86, like scheduling,
delivering signals, etc.  As you know, we might run that code, go
around the loop, and discover that the TIF flag has been re-set, and
we have to run the code again before all of that stuff has "quiesced".
The isolation code uses that same model; the only difference is that
we clear the TIF flag manually in the loop by checking
task_isolation_ready().


So if we are doing it here, what else can/should we do?  There really
shouldn't be any other tasks waiting to run at this point, so there's
not a heck of a lot else to do on this core.  We could just spin and
check need_resched and signal status manually instead, but that
seems kind of duplicative of code already done in our caller here.

What !? I really don't get this, what are you waiting for? Why is
rescheduling making things better.


We need to wait for the last dyntick to fire before we can return to
userspace.  There are plenty of options as to what we can do in the
meanwhile.

1. Try to schedule().  Good luck with that in practice, since a
userspace process that has enabled task isolation is going to be alone
on its core unless something pretty broken is happening on the system.
But, at least folks understand the idiom of scheduling out while you wait.

2. Another variant of that: set up a wait completion and have the
dynticks code complete it when the tick turns off.  But this adds
complexity to option 1, and really doesn't buy us much in practice
that I can see.

3. Just admit that we are likely alone on the core, and just burn
cycles in a busy loop waiting for that last tick to fire.  Obviously
if we do this we also need to test for signals and resched so the core
remains responsive.  We can either do this in a loop just by spinning
explicitly, 

Ping: [PATCH v15 00/13] support "task_isolation" mode

2016-08-29 Thread Chris Metcalf

On 8/16/2016 5:19 PM, Chris Metcalf wrote:

Here is a respin of the task-isolation patch set.

Again, I have been getting email asking me when and where this patch
will be upstreamed so folks can start using it.  I had been thinking
the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
thing.  But perhaps it touches enough other subsystems that that
doesn't really make sense?  Andrew, would it make sense to take it
directly via your tree?  Frederic, Ingo, what do you think?


Ping!

No concerns have been raised yet with the v15 version of the patch series
in the two weeks since I posted it, and I think I have addressed all
previously-raised concerns (or perhaps people have just given up arguing
with me).

I did add Catalin's Reviewed-by to 08/13 (thanks!) and updated my
kernel.org repo.

Does this feel like something we can merge when the 4.9 merge window opens?
If so, whose tree is best suited for it?  Or should I ask Stephen to put it into
linux-next now and then ask Linus to merge it directly?  I recall Ingo thought
this was a bad idea when I suggested it back in January, but I'm not sure where
we got to in terms of a better approach.

Thanks all!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-29 Thread Chris Metcalf

On 8/29/2016 12:48 PM, Peter Zijlstra wrote:

On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:

+   /*
+* Request rescheduling unless we are in full dynticks mode.
+* We would eventually get pre-empted without this, and if
+* there's another task waiting, it would run; but by
+* explicitly requesting the reschedule, we may reduce the
+* latency.  We could directly call schedule() here as well,
+* but since our caller is the standard place where schedule()
+* is called, we defer to the caller.
+*
+* A more substantive approach here would be to use a struct
+* completion here explicitly, and complete it when we shut
+* down dynticks, but since we presumably have nothing better
+* to do on this core anyway, just spinning seems plausible.
+*/
+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.

Can you elaborate?

Naked use of TIF_NEED_RESCHED like this is busted. There is more state
that needs to be poked to keep things consistent / working.


Would it be cleaner to just replace the set_tsk_need_resched() call
with something like:

set_current_state(TASK_INTERRUPTIBLE);
schedule();
__set_current_state(TASK_RUNNING);

or what would you recommend?

Or, as I said, just doing a busy loop here while testing to see
if need_resched or signal had been set?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v15 04/13] task_isolation: add initial support

2016-08-29 Thread Chris Metcalf

On 8/29/2016 12:33 PM, Peter Zijlstra wrote:

On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:

+   /*
+* Request rescheduling unless we are in full dynticks mode.
+* We would eventually get pre-empted without this, and if
+* there's another task waiting, it would run; but by
+* explicitly requesting the reschedule, we may reduce the
+* latency.  We could directly call schedule() here as well,
+* but since our caller is the standard place where schedule()
+* is called, we defer to the caller.
+*
+* A more substantive approach here would be to use a struct
+* completion here explicitly, and complete it when we shut
+* down dynticks, but since we presumably have nothing better
+* to do on this core anyway, just spinning seems plausible.
+*/
+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);

This is broken.. and it would be really good if you don't actually need
to do this.


Can you elaborate?  We clearly do want to wait until we are in full
dynticks mode before we return to userspace.

We could do it just in the prctl() syscall only, but then we lose the
ability to implement the NOSIG mode, which can be a convenience.

Even without that consideration, we really can't be sure we stay in
dynticks mode if we disable the dynamic tick, but then enable interrupts,
and end up taking an interrupt on the way back to userspace, and
it turns the tick back on.  That's why we do it here, where we know
interrupts will stay disabled until we get to userspace.

So if we are doing it here, what else can/should we do?  There really
shouldn't be any other tasks waiting to run at this point, so there's
not a heck of a lot else to do on this core.  We could just spin and
check need_resched and signal status manually instead, but that
seems kind of duplicative of code already done in our caller here.

So... thoughts?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH] Fix /proc/stat freezes (was [PATCH v15] "task_isolation" mode)

2016-08-20 Thread Chris Metcalf

On 8/17/2016 3:37 PM, Christoph Lameter wrote:

On Tue, 16 Aug 2016, Chris Metcalf wrote:


- Dropped Christoph Lameter's patch to avoid scheduling the
   clocksource watchdog on nohz cores; the recommendation is to just
   boot with tsc=reliable for NOHZ in any case, if necessary.

We also said that there should be a WARN_ON if tsc=reliable is not
specified and processors are put into NOHZ mode. This is something not
obvious causing scheduling events on NOHZ processors.


Yes, I agree.  Frederic said he would queue a patch to do that, so I
didn't want to propose another patch that would conflict.


Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

Here is a potential fix to the problem that /proc/stat values freeze when
processors go into NOHZ busy mode. I'd like to hear what people think
about the approach here. In particular one issue may be that I am
accessing remote tick-sched structures without serialization. But for
top/ps this may be ok. I noticed that other values shown by top/os also
sometime are a bit fuzzy.


This seems pretty plausible to me, but I'm not an expert on what kind
of locking might be required for these data structures.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH v15 00/13] support "task_isolation" mode

2016-08-16 Thread Chris Metcalf
Here is a respin of the task-isolation patch set.

Again, I have been getting email asking me when and where this patch
will be upstreamed so folks can start using it.  I had been thinking
the obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
thing.  But perhaps it touches enough other subsystems that that
doesn't really make sense?  Andrew, would it make sense to take it
directly via your tree?  Frederic, Ingo, what do you think?

Changes since v14:

- Rebased on v4.8-rc2 (so incorporates my NOHZ bugfix vs v4.8-rc1)

- Dropped Christoph Lameter's patch to avoid scheduling the
  clocksource watchdog on nohz cores; the recommendation is to just
  boot with tsc=reliable for NOHZ in any case, if necessary.

- Optimize task_isolation_enter() by checking vmstat_idle() before
  calling quiet_vmstat_sync() [Frederic, Christoph]

- Correct buggy x86 syscall_trace_enter() support [Andy]

- Add _TIF_TASK_ISOLATION to x86 _TIF_ALLWORK_MASK; not technically
  necessary but good for self-documentation [Andy]

- Improve comment for task_isolation_syscall() callsites to clarify
  that we are delivering a signal if we bail out of the syscall [Andy]

- Ran the selftest through checkpatch and cleaned up style issues

The previous (v14) patch series is here:

https://lkml.kernel.org/r/1470774596-17341-1-git-send-email-cmetc...@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86 (plus some volunteer testing on x86 by
Christoph and Francis).

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal
  task_isolation self test

 Documentation/kernel-parameters.txt|  16 +
 arch/arm64/Kconfig |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S  |  12 +-
 arch/arm64/kernel/ptrace.c |  18 +-
 arch/arm64/kernel/signal.c |  42 +-
 arch/arm64/kernel/smp.c|   2 +
 arch/arm64/mm/fault.c  |   8 +-
 arch/tile/Kconfig  |   1 +
 arch/tile/include/asm/thread_info.h|   4 +-
 arch/tile/kernel/process.c |   9 +
 arch/tile/kernel/ptrace.c  |  10 +
 arch/tile/kernel/single_step.c |   7 +
 arch/tile/kernel/smp.c |  26 +-
 arch/tile/kernel/time.c|   1 +
 arch/tile/kernel/unaligned.c   |   4 +
 arch/tile/mm/fault.c   |  13 +-
 arch/tile/mm/homecache.c   |   2 +
 arch/x86/Kconfig   |   1 +
 arch/x86/entry/common.c|  21 +-
 arch/x86/include/asm/thread_info.h |   4 +-
 arch/x86/kernel/smp.c  |   2 +
 arch/x86/kernel/traps.c|   3 +
 arch/x86/mm/fault.c|   5 +
 drivers/base/cpu.c |  18 +
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h  |  73 +++
 include/linux/sched.h  |   3 +
 include/linux/swap.h   |   1 +
 include/linux/tick.h   |   2 +
 include/linux/vmstat.h |   4 +
 include/uapi/linux/prctl.h |  10 +
 init/Kconfig   |  37 ++
 kernel/Makefile|   1 +
 kernel/fork.c  |   3 +
 kernel/irq_work.c  |   5 +-
 kernel/isolation.c | 338 +++
 kernel/sched/core.c|  14 +
 kernel/signal.c   

[PATCH v15 11/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-08-16 Thread Chris Metcalf
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.
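
For example (the CPU list is purely illustrative), without this option
you would boot with something like:

        task_isolation=1-15

whereas with CONFIG_TASK_ISOLATION_ALL=y no boot argument is needed and
every CPU except the boot CPU lands in the task-isolation set.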

Signed-off-by: Chris Metcalf 
---
 init/Kconfig   | 10 ++
 kernel/isolation.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index a95a35a31b46..a9b9c7635de2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -813,6 +813,16 @@ config TASK_ISOLATION
 You should say "N" unless you are intending to run a
 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.
+
 config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index be7e95192e76..3dbb01ac503f 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -43,8 +43,14 @@ int __init task_isolation_init(void)
 {
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+   alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+   cpumask_copy(task_isolation_map, cpu_possible_mask);
+   cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}
 
/*
-- 
2.7.2



[PATCH v15 12/13] task_isolation: add user-settable notification signal

2016-08-16 Thread Chris Metcalf
By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return to
continue execution.

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.
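
A hedged userspace sketch of the two modes described above, with error
handling trimmed (the constants are the ones this patch adds to
<linux/prctl.h>):

        #include <sys/prctl.h>
        #include <signal.h>

        /* Catchable signal: deliver SIGUSR1 on an isolation violation. */
        prctl(PR_SET_TASK_ISOLATION,
              PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_USERSIG |
              PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);

        /* NOSIG: no signal; each kernel exit quiesces again instead. */
        prctl(PR_SET_TASK_ISOLATION,
              PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG, 0, 0, 0);

If the first form fails with EAGAIN, the caller is expected to retry
until the kernel reports that it has quiesced.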

Signed-off-by: Chris Metcalf 
---
 include/uapi/linux/prctl.h |  5 
 kernel/isolation.c | 62 ++
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION  48
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_USERSIG (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+   (PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 3dbb01ac503f..ba643ad9d02b 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+   if (flags & PR_TASK_ISOLATION_USERSIG)
+   return PR_TASK_ISOLATION_GET_SIG(flags);
+   else
+   return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
if (flags != 0) {
+   int sig = task_isolation_sig(flags);
+
if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
!task_isolation_possible(raw_smp_processor_id())) {
/* Invalid task affinity setting. */
-   return -EINVAL;
+   if (sig)
+   return -EINVAL;
+   else
+   pr_warn("%s/%d: enabling non-signalling task 
isolation\n"
+   "and not bound to a single task 
isolation core\n",
+   current->comm, current->pid);
}
-   if (!can_stop_my_full_tick_now()) {
+
+   if (sig && !can_stop_my_full_tick_now()) {
/* System not yet ready for task isolation. */
return -EAGAIN;
}
@@ -161,11 +184,11 @@

[PATCH v15 05/13] task_isolation: track asynchronous interrupts

2016-08-16 Thread Chris Metcalf
This commit adds support for tracking asynchronous interrupts
delivered to task-isolation tasks, e.g. IPIs or IRQs.  Just
as for exceptions and syscalls, when this occurs we arrange to
deliver a signal to the task so that it knows it has been
interrupted.  If the task is interrupted by an NMI, we can't
safely deliver a signal, so we just dump out a console stack.

We also support a new "task_isolation_debug" flag which forces
the console stack to be dumped out regardless.  We try to catch
the original source of the interrupt, e.g. if an IPI is dispatched
to a task-isolation task, we dump the backtrace of the remote
core that is sending the IPI, rather than just dumping out a
trace showing the core received an IPI from somewhere.

Calls to task_isolation_debug() can be placed in the
platform-independent code when that results in fewer lines
of code changes, as for example is true of the users of the
arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as for example
is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.
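
To make the call-site pattern concrete, a sketch of the generic-code
flavor (the wrapper itself is hypothetical; the real hunks are in the
diff below):

        /* Warn/backtrace before sending a function-call IPI at cpus
         * that may be running isolated tasks. */
        void smp_send_call_function_ipi(const struct cpumask *mask)
        {
                task_isolation_debug_cpumask(mask, "IPI");
                arch_send_call_function_ipi_mask(mask);
        }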

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt|  8 
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h  | 13 ++
 kernel/irq_work.c  |  5 ++-
 kernel/isolation.c | 74 ++
 kernel/sched/core.c| 14 +++
 kernel/signal.c|  7 
 kernel/smp.c   |  6 ++-
 kernel/softirq.c   | 33 +++
 9 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7f1336b50dcc..f172cd310cf4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3951,6 +3951,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.
 
+   task_isolation_debug[KNL]
+   In kernels built with CONFIG_TASK_ISOLATION
+   and booted in task_isolation= mode, this
+   setting will generate console backtraces when
+   the kernel is about to interrupt a task that
+   has requested PR_TASK_ISOLATION_ENABLE and is
+   running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+   return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index d9288b85b41f..02728b1f8775 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -46,6 +46,17 @@ extern void _task_isolation_quiet_exception(const char *fmt, ...);
_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
} while (0)
 
+extern void _task_isolation_debug(int cpu, const char *type);
+#define task_isolation_debug(cpu, type)					\
+   do {\
+   if (task_isolation_possible(cpu))   \
+   _task_isolation_debug(cpu, type);   \
+   } while (0)
+
+extern void task_isolation_debug_cpumask(const struct cpumask *,
+const char *type);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p,
+ const char *type);
 #else
 static inline void task_isolation_init(void) { }
 sta

[PATCH v15 04/13] task_isolation: add initial support

2016-08-16 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
with ENOSYS.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.
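
Shape-wise, the syscall-entry test looks roughly like this x86-flavored
sketch (not the exact hunk):

        /* Just before the seccomp check in syscall_trace_enter(). */
        if ((READ_ONCE(current_thread_info()->flags) & _TIF_TASK_ISOLATION) &&
            task_isolation_syscall(regs->orig_ax) == -1)
                return -1L;     /* suppress the syscall; it returns ENOSYS */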

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c  |  18 +++
 include/linux/isolation.h   |  60 ++
 include/linux/sched.h   |   3 +
 include/linux/tick.h|   2 +
 include/uapi/linux/prctl.h  |   5 +
 init/Kconfig|  27 +
 kernel/Makefile |   1 +
 kernel/fork.c   |   3 +
 kernel/isolation.c  | 218 
 kernel/signal.c |   8 ++
 kernel/sys.c|   9 ++
 kernel/time/tick-sched.c|  36 +++---
 13 files changed, 385 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 46c030a49186..7f1336b50dcc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3943,6 +3943,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation= [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION=y, set
+   the specified list of CPUs where cpus will be able
+   to use prctl(PR_SET_TASK_ISOLATION) to set up task
+   isolation mode.  Setting this boot flag implicitly
+   also sets up nohz_full and isolcpus mode for the
+   listed set of cpus.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drive

Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-08-15 Thread Chris Metcalf

On 8/11/2016 7:58 AM, Frederic Weisbecker wrote:

Arguably we should issue a boot time warning if NOHZ_FULL is configured
and the TSC watchdog is running.

That's a very good idea! We do that when tsc is unstable but indeed we can't
seriously run NOHZ_FULL on a non-reliable tsc.

I'll take care of that warning.


Thanks.  So I will drop Christoph's patch to run the TSC watchdog on just
housekeeping cores and we will rely on the "boot time warning" instead.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v14 04/14] task_isolation: add initial support

2016-08-15 Thread Chris Metcalf

On 8/11/2016 2:50 PM, Christoph Lameter wrote:

On Thu, 11 Aug 2016, Frederic Weisbecker wrote:


Do we need to quiesce vmstat everytime before entering userspace?
I thought that vmstat only need to be offlined once and for all?

Once is sufficient after disabling the tick.


It's true that task_isolation_enter() is called every time before
returning to user space while task isolation is enabled.

But once we enter the kernel again after returning from the initial
prctl() -- assuming we are in NOSIG mode so doing so is legal in the
first place -- almost anything can happen, certainly including
restarting the tick.  Thus, we have to make sure that normal quiescing
happens again before we return to userspace.

For vmstat, you're right that it's somewhat heavyweight to do the
quiesce, and if we don't need it, it's wasted time on the return path.
So I will add a guard call to the new vmstat_idle() before invoking
quiet_vmstat_sync().  This slows down the path where it turns out we
do need to quieten vmstat, but not by too much.
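
I.e., something as simple as this sketch:

        /* Skip the relatively expensive sync if vmstat is already quiet. */
        if (!vmstat_idle())
                quiet_vmstat_sync();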

The LRU quiesce is quite light-weight.  We just check pagevec_count()
on a handful of pagevec's, confirm they are all zero, and return
without further work.  So for that one, adding a separate
lru_add_drain_needed() guard test would just be wasted effort.

The thing to remember is that this is only relevant if the user has
explicitly requested the NOSIG behavior from task isolation, which we
don't really expect to be the default - we are implicitly encouraging
use of the default semantics of "you can't enter the kernel again
until you turn off isolation".


+   if (!tick_nohz_tick_stopped())
+   set_tsk_need_resched(current);
Again, that won't help


It won't be better than spinning in a loop if there aren't any other
schedulable processes, but it won't be worse either.  If there is
another schedulable process, we at least will schedule it sooner than
if we just sat in a busy loop and waited for the scheduler to kick
us. But there's nothing else we can do anyway if we want to maintain
the guarantee that the dyntick is stopped before return to userspace.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-08-11 Thread Chris Metcalf

On 8/10/2016 6:16 PM, Frederic Weisbecker wrote:

On Wed, Jul 27, 2016 at 08:55:28AM -0500, Christoph Lameter wrote:

On Mon, 25 Jul 2016, Christoph Lameter wrote:


Guess so. I will have a look at this when I get some time again.

Ok so the problem is the clocksource_watchdog() function in
kernel/time/clocksource.c. This function is active if
CONFIG_CLOCKSOURCE_WATCHDOG is defined. It will check the timesources of
each processor for being within bounds and then reschedule itself on the
next one.

The purpose of the function seems to be to determine *if* a clocksource is
unstable. It does not mean that the clocksource *is* unstable.

The critical piece of code is this:

 /*
  * Cycle through CPUs to check if the CPUs stay synchronized
  * to each other.
  */
 next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
 if (next_cpu >= nr_cpu_ids)
 next_cpu = cpumask_first(cpu_online_mask);
 watchdog_timer.expires += WATCHDOG_INTERVAL;
 add_timer_on(&watchdog_timer, next_cpu);


Should we just cycle through the cpus that are not isolated? Otherwise we
need to have some means to check the clocksources for accuracy remotely
(probably impossible for TSC etc).
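
A sketch of that idea (illustrative; it assumes at least one
non-nohz_full cpu stays online, else the loop would never terminate):

        int next_cpu = raw_smp_processor_id();

        /* Pick the next watchdog cpu, skipping nohz_full cpus. */
        do {
                next_cpu = cpumask_next(next_cpu, cpu_online_mask);
                if (next_cpu >= nr_cpu_ids)
                        next_cpu = cpumask_first(cpu_online_mask);
        } while (tick_nohz_full_cpu(next_cpu));
        watchdog_timer.expires += WATCHDOG_INTERVAL;
        add_timer_on(&watchdog_timer, next_cpu);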

The WATCHDOG_INTERVAL is 1 second so this causes an interrupt every
second.

Note that we are running with the patch that removes the 1 HZ mininum time
tick. With an older kernel code base (redhat) we can keep the kernel quiet
for minutes. The clocksource watchdog causes timers to fire again.

I had similar issues; this seems to happen when the tsc is considered
not reliable (which doesn't necessarily mean unstable - I think it has
to do with some x86 CPU feature flag).

IIRC, this _has_ to execute on all online CPUs because the TSCs of all
running CPUs are concerned.

I personally override that by passing the tsc=reliable kernel
parameter. Of course, use it at your own risk.

But eventually I don't think we can offline that to housekeeping only CPUs.


Maybe the eventual model here is that as task-isolation cores
re-enter the kernel, they catch a hook that tells them to go
call the unreliable-tsc stuff and see what the state of it is.

This would be the same hook that we could use to defer
kernel TLB flushes, also.
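
Something like this - a pure sketch, where every name is hypothetical
rather than taken from this series:

        /* Hypothetical per-cpu flag set by remote cores that wanted to
         * flush our TLB while we were isolated in userspace. */
        static DEFINE_PER_CPU(int, deferred_tlb_flush);

        /* Hypothetical hook run on each kernel entry from an isolated
         * core: fold in the work that was deferred meanwhile. */
        void task_isolation_kernel_enter(void)
        {
                if (this_cpu_xchg(deferred_tlb_flush, 0))
                        local_flush_tlb_all();
                recheck_clocksource_this_cpu();  /* hypothetical */
        }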

The hard part is that on some platforms it may be fairly
intrusive to get all the hooks in.  Arm64 has a nice consistent
set of assembly routines to enter the kernel, which is how they
manage the context_tracking as well, but I fear that x86 may
have a lot more.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH v14 00/14] support "task_isolation" mode

2016-08-09 Thread Chris Metcalf
Here is a respin of the task-isolation patch set.  This primarily
reflects some testing on x86, and a rebase to 4.8.

I have been getting email asking me when and where this patch will be
upstreamed so folks can start using it.  I had been thinking the
obvious path was via Frederic Weisbecker to Ingo as a NOHZ kind of
thing.  But perhaps it touches enough other subsystems that that
doesn't really make sense?  Andrew, would it make sense to take it
directly via your tree?  Frederic, Ingo, what do you think?

Changes since v13:

- Rebased on v4.8-rc1 (and thus uses the standard try_get_task_struct).

- Fixes a bug when using the clocksource watchdog; it is now scheduled
  to run only on the housekeeping cpus [by Christoph Lameter].

- Fixes a bug in x86 syscall_trace_enter() [seen by Francis Giraldeau].

- Includes a selftest.

The previous (v13) patch series is here:

https://lkml.kernel.org/r/1468529299-27929-1-git-send-email-cmetc...@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86 (plus some volunteer testing on x86 by
Christoph and Francis).

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederic, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal
  task_isolation self test

Christoph Lameter (1):
  clocksource: Do not schedule watchdog on isolated or NOHZ cpus

 Documentation/kernel-parameters.txt|  16 +
 arch/arm64/Kconfig |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S  |  12 +-
 arch/arm64/kernel/ptrace.c |  15 +-
 arch/arm64/kernel/signal.c |  42 +-
 arch/arm64/kernel/smp.c|   2 +
 arch/arm64/mm/fault.c  |   8 +-
 arch/tile/Kconfig  |   1 +
 arch/tile/include/asm/thread_info.h|   4 +-
 arch/tile/kernel/process.c |   9 +
 arch/tile/kernel/ptrace.c  |   7 +
 arch/tile/kernel/single_step.c |   7 +
 arch/tile/kernel/smp.c |  26 +-
 arch/tile/kernel/time.c|   1 +
 arch/tile/kernel/unaligned.c   |   4 +
 arch/tile/mm/fault.c   |  13 +-
 arch/tile/mm/homecache.c   |   2 +
 arch/x86/Kconfig   |   1 +
 arch/x86/entry/common.c|  20 +-
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/kernel/smp.c  |   2 +
 arch/x86/kernel/traps.c|   3 +
 arch/x86/mm/fault.c|   5 +
 drivers/base/cpu.c |  18 +
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h  |  73 +++
 include/linux/sched.h  |   3 +
 include/linux/swap.h   |   1 +
 include/linux/tick.h   |   2 +
 include/linux/vmstat.h |   4 +
 include/uapi/linux/prctl.h |  10 +
 init/Kconfig   |  37 ++
 kernel/Makefile|   1 +
 kernel/fork.c  |   3 +
 kernel/irq_work.c  |   5 +-
 kernel/isolation.c | 337 +++
 kernel/sched/core.c|  14 +
 kernel/signal.c|  15 +
 kernel/smp.c   |   6 +-
 kernel/softirq.c   |  33 ++
 kernel/sys.c   |   9 +
 kernel/time/clocksource.c  |  10 +-
 kernel/time/tick-sched.c   |  36 +

[PATCH v14 05/14] task_isolation: track asynchronous interrupts

2016-08-09 Thread Chris Metcalf
This commit adds support for tracking asynchronous interrupts
delivered to task-isolation tasks, e.g. IPIs or IRQs.  Just
as for exceptions and syscalls, when this occurs we arrange to
deliver a signal to the task so that it knows it has been
interrupted.  If the task is interrupted by an NMI, we can't
safely deliver a signal, so we just dump out a console stack.

We also support a new "task_isolation_debug" flag which forces
the console stack to be dumped out regardless.  We try to catch
the original source of the interrupt, e.g. if an IPI is dispatched
to a task-isolation task, we dump the backtrace of the remote
core that is sending the IPI, rather than just dumping out a
trace showing the core received an IPI from somewhere.

Calls to task_isolation_debug() can be placed in the
platform-independent code when that results in fewer lines
of code changes, as for example is true of the users of the
arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as for example
is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt|  8 
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h  | 13 ++
 kernel/irq_work.c  |  5 ++-
 kernel/isolation.c | 74 ++
 kernel/sched/core.c| 14 +++
 kernel/signal.c|  7 
 kernel/smp.c   |  6 ++-
 kernel/softirq.c   | 33 +++
 9 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 7f1336b50dcc..f172cd310cf4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3951,6 +3951,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.
 
+   task_isolation_debug[KNL]
+   In kernels built with CONFIG_TASK_ISOLATION
+   and booted in task_isolation= mode, this
+   setting will generate console backtraces when
+   the kernel is about to interrupt a task that
+   has requested PR_TASK_ISOLATION_ENABLE and is
+   running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+   return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index d9288b85b41f..02728b1f8775 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -46,6 +46,17 @@ extern void _task_isolation_quiet_exception(const char *fmt, ...);
_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
} while (0)
 
+extern void _task_isolation_debug(int cpu, const char *type);
+#define task_isolation_debug(cpu, type)					\
+   do {\
+   if (task_isolation_possible(cpu))   \
+   _task_isolation_debug(cpu, type);   \
+   } while (0)
+
+extern void task_isolation_debug_cpumask(const struct cpumask *,
+const char *type);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p,
+ const char *type);
 #else
 static inline void task_isolation_init(void) { }
 sta

[PATCH v14 13/14] task_isolation: add user-settable notification signal

2016-08-09 Thread Chris Metcalf
By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return to
continue execution.

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.

Signed-off-by: Chris Metcalf 
---
 include/uapi/linux/prctl.h |  5 
 kernel/isolation.c | 62 ++
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION  48
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_USERSIG (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+   (PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index f8ccf5e67e38..d36cb3943c80 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+   if (flags & PR_TASK_ISOLATION_USERSIG)
+   return PR_TASK_ISOLATION_GET_SIG(flags);
+   else
+   return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
if (flags != 0) {
+   int sig = task_isolation_sig(flags);
+
if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
!task_isolation_possible(raw_smp_processor_id())) {
/* Invalid task affinity setting. */
-   return -EINVAL;
+   if (sig)
+   return -EINVAL;
+   else
+   pr_warn("%s/%d: enabling non-signalling task 
isolation\n"
+   "and not bound to a single task 
isolation core\n",
+   current->comm, current->pid);
}
-   if (!can_stop_my_full_tick_now()) {
+
+   if (sig && !can_stop_my_full_tick_now()) {
/* System not yet ready for task isolation. */
return -EAGAIN;
}
@@ -160,11 +183,11 @@

[PATCH v14 04/14] task_isolation: add initial support

2016-08-09 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
with ENOSYS.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.
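
In userspace, the strict semantics boil down to something like this
(an untested sketch; the PR_* values come from the patched prctl.h):

#include <stdlib.h>
#include <sys/prctl.h>

int main(void)
{
	/* Must already be affinitized to a single task_isolation core. */
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) != 0)
		exit(EXIT_FAILURE);	/* EINVAL (affinity) or EAGAIN */

	/* ... pure userspace work: any syscall or page fault here
	 * would now be fatal (SIGKILL) ... */

	prctl(PR_SET_TASK_ISOLATION, 0);	/* the one legal way back out */
	return EXIT_SUCCESS;
}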

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.
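
For example (hypothetical output), on a box booted with task_isolation=1-15:

	$ cat /sys/devices/system/cpu/task_isolation
	1-15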

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c  |  18 +++
 include/linux/isolation.h   |  60 ++
 include/linux/sched.h   |   3 +
 include/linux/tick.h|   2 +
 include/uapi/linux/prctl.h  |   5 +
 init/Kconfig|  27 +
 kernel/Makefile |   1 +
 kernel/fork.c   |   3 +
 kernel/isolation.c  | 217 
 kernel/signal.c |   8 ++
 kernel/sys.c|   9 ++
 kernel/time/tick-sched.c|  36 +++---
 13 files changed, 384 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 46c030a49186..7f1336b50dcc 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3943,6 +3943,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation= [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION=y, set
+   the specified list of CPUs where cpus will be able
+   to use prctl(PR_SET_TASK_ISOLATION) to set up task
+   isolation mode.  Setting this boot flag implicitly
+   also sets up nohz_full and isolcpus mode for the
+   listed set of cpus.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drive

[PATCH v14 12/14] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-08-09 Thread Chris Metcalf
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Signed-off-by: Chris Metcalf 
---
 init/Kconfig   | 10 ++
 kernel/isolation.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 85a4b6dd26f2..2d49c5b78b93 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -813,6 +813,16 @@ config TASK_ISOLATION
 You should say "N" unless you are intending to run a
 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.
+
 config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 7cd57ca95be5..f8ccf5e67e38 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -43,8 +43,14 @@ int __init task_isolation_init(void)
 {
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+   alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+   cpumask_copy(task_isolation_map, cpu_possible_mask);
+   cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}
 
/*
-- 
2.7.2



Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-07-29 Thread Chris Metcalf
iled: rc %d", rc);
goto fail;
}
*statep = 1;

	// Wait for child to come disturb us.

while (*statep == 1) {
gettimeofday(&quiesce_end, NULL);
time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
(quiesce_end.tv_usec - quiesce_start.tv_usec) / 1000000.0;
if (time > 0.1 && *statep == 1)  {
prctl(PR_SET_TASK_ISOLATION, 0);
printf("timed out at %gs in child migrate loop (%d)\n",
   time, *childstate);
char buf[100];
sprintf(buf, "cat /proc/%d/stack", child_pid);
system(buf);
goto fail;
}
}
assert(*statep == 2);

// At this point the child is spinning, so any interrupt will keep us
// in kernel space.  Make a syscall to make sure it happens at least
// once during the second that the child is spinning.
kill(0, 0);
gettimeofday(&quiesce_end, NULL);
prctl(PR_SET_TASK_ISOLATION, 0);
time = (quiesce_end.tv_sec - quiesce_start.tv_sec) +
(quiesce_end.tv_usec - quiesce_start.tv_usec) / 1000000.0;
if (time < 0.4 || time > 0.6) {
printf("expected ~0.5s wait after quiesce: was %g\n", time);
goto fail;
}
kill(child_pid, SIGKILL);
return EXIT_SUCCESS;

fail:
kill(child_pid, SIGKILL);
return EXIT_FAILURE;
}

int main(int argc, char **argv)
{
/* How many seconds to wait after running the other tests? */
double waittime;
if (argc == 1)
waittime = 10;
else if (argc == 2)
waittime = strtof(argv[1], NULL);
else {
printf("syntax: isolation [seconds]\n");
exit(EXIT_FAILURE);
}

/* Test that the /sys device is present and pick a cpu. */
FILE *f = fopen("/sys/devices/system/cpu/task_isolation", "r");
if (f == NULL) {
printf("/sys device: FAIL\n");
exit(EXIT_FAILURE);
}
char buf[100];
char *result = fgets(buf, sizeof(buf), f);
assert(result == buf);
fclose(f);
char *end;
task_isolation_cpu = strtol(buf, &end, 10);
assert(end != buf);
assert(*end == ',' || *end == '-' || *end == '\n');
assert(task_isolation_cpu >= 0);
printf("/sys device : OK\n");

// Test to see if with no mask set, we fail.
if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
errno != EINVAL) {
printf("prctl unaffinitized: FAIL\n");
exit_status = EXIT_FAILURE;
} else {
printf("prctl unaffinitized: OK\n");
}

// Or if affinitized to the wrong cpu.
set_my_cpu(0);
if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) == 0 ||
errno != EINVAL) {
printf("prctl on cpu 0: FAIL\n");
exit_status = EXIT_FAILURE;
} else {
printf("prctl on cpu 0: OK\n");
}

// Run the tests.
test_killed("test_fault", setup_fault, do_fault);
test_killed("test_syscall", NULL, do_syscall);
test_munmap();
test_unaligned();
test_ok("test_off", NULL, do_syscall_off);
test_nosig("test_multi", NULL, do_syscall_multi);
test_nosig("test_quiesce", setup_quiesce, do_quiesce);

// Exit failure if any test failed.
    if (exit_status != EXIT_SUCCESS)
return exit_status;

// Wait for however long was requested on the command line.
// Note that this requires a vDSO implementation of gettimeofday();
// if it's not available, we could just spin a fixed number of
// iterations instead.
struct timeval start, tv;
gettimeofday(&start, NULL);
while (1) {
gettimeofday(&tv, NULL);
double time = (tv.tv_sec - start.tv_sec) +
(tv.tv_usec - start.tv_usec) / 1000000.0;
if (time >= waittime)
break;
}

return EXIT_SUCCESS;
}

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-07-28 Thread Chris Metcalf

On 7/27/2016 9:55 AM, Christoph Lameter wrote:

The critical piece of code is this:

 /*
  * Cycle through CPUs to check if the CPUs stay synchronized
  * to each other.
  */
 next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
 if (next_cpu >= nr_cpu_ids)
 next_cpu = cpumask_first(cpu_online_mask);
 watchdog_timer.expires += WATCHDOG_INTERVAL;
 add_timer_on(&watchdog_timer, next_cpu);


Should we just cycle through the cpus that are not isolated? Otherwise we
need to have some means to check the clocksources for accuracy remotely
(probably impossible for TSC etc).


That sounds like the right idea - use the housekeeping cpu mask instead of the
cpu online mask.  Should be a straightforward patch; do you want to do that
and test it in your configuration, and I'll include it in the next spin of the
patch series?

Thanks for your testing!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-07-27 Thread Chris Metcalf

On 7/27/2016 3:53 PM, Christoph Lameter wrote:

On Wed, 27 Jul 2016, Chris Metcalf wrote:


Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
on purpose?  For now I just took your change, but tweaked it to add the
equivalent diff with cpumask_first_and() there.

Can the watchdog be started on an isolated cpu at all? I would expect that
the code would start a watchdog only on a housekeeping cpu.


The code just starts the watchdog initially on the first online cpu.
In principle you could have configured that as an isolated cpu, so
without any change to that code, you'd interrupt that cpu.

I guess another way to slice it would be to start the watchdog on the
current core.  But just using the same idiom as in clocksource_watchdog()
seems cleanest to me.

I added your patch to the series and pushed it up (along with adding your
Tested-by to the x86 enablement commit).  It's still based on 4.6 so I'll need
to rebase it once the merge window closes.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-07-27 Thread Chris Metcalf

On 7/27/2016 2:56 PM, Christoph Lameter wrote:

On Wed, 27 Jul 2016, Chris Metcalf wrote:


How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?

Ok here is V2:


Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus V2

watchdog checks can only run on housekeeping capable cpus. Otherwise
we will be generating noise that we would like to avoid on the isolated
processors.

Signed-off-by: Christoph Lameter 

Index: linux/kernel/time/clocksource.c
===
--- linux.orig/kernel/time/clocksource.c
+++ linux/kernel/time/clocksource.c
@@ -269,9 +269,10 @@ static void clocksource_watchdog(unsigne
 * Cycle through CPUs to check if the CPUs stay synchronized
 * to each other.
 */
-   next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+   next_cpu = cpumask_next_and(raw_smp_processor_id(), cpu_online_mask, housekeeping_cpumask());
	if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first(cpu_online_mask);
+   next_cpu = cpumask_first_and(cpu_online_mask, housekeeping_cpumask());
+
watchdog_timer.expires += WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer, next_cpu);
  out:


Looks good.  Did you omit the equivalent fix in clocksource_start_watchdog()
on purpose?  For now I just took your change, but tweaked it to add the
equivalent diff with cpumask_first_and() there.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: clocksource_watchdog causing scheduling of timers every second (was [v13] support "task_isolation" mode)

2016-07-27 Thread Chris Metcalf

On 7/27/2016 11:31 AM, Christoph Lameter wrote:

Ok here is a possible patch that explicitly checks for housekeeping cpus:

Subject: clocksource: Do not schedule watchdog on isolated or NOHZ cpus

watchdog checks can only run on housekeeping capable cpus. Otherwise
we will be generating noise that we would like to avoid on the isolated
processors.

Signed-off-by: Christoph Lameter 

Index: linux/kernel/time/clocksource.c
===
--- linux.orig/kernel/time/clocksource.c	2016-07-27 08:41:17.109862517 -0500
+++ linux/kernel/time/clocksource.c 2016-07-27 10:28:31.172447732 -0500
@@ -269,9 +269,12 @@ static void clocksource_watchdog(unsigne
 * Cycle through CPUs to check if the CPUs stay synchronized
 * to each other.
 */
-   next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
-   if (next_cpu >= nr_cpu_ids)
-   next_cpu = cpumask_first(cpu_online_mask);
+   do {
+   next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
+   if (next_cpu >= nr_cpu_ids)
+   next_cpu = cpumask_first(cpu_online_mask);
+   } while (!is_housekeeping_cpu(next_cpu));
+
watchdog_timer.expires += WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer, next_cpu);
  out:


How about using cpumask_next_and(raw_smp_processor_id(), cpu_online_mask,
housekeeping_cpumask()), likewise cpumask_first_and()?  Does that work?

Note that you should also  cpumask_first_and() in clocksource_start_watchdog(),
just to be complete.

Hopefully the init code runs after tick_init().  It seems like that's probably 
true.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v13 00/12] support "task_isolation" mode

2016-07-22 Thread Chris Metcalf

On 7/21/2016 10:20 PM, Christoph Lameter wrote:

On Thu, 21 Jul 2016, Chris Metcalf wrote:

On 7/20/2016 10:04 PM, Christoph Lameter wrote:
unstable, and then scheduling work to safely remove that timer.
I haven't looked at this code before (in kernel/time/clocksource.c
under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
arm64 and tile aren't unstable.  Is it possible to boot your machine
with a stable clocksource?

It already has a stable clocksource. Sorry, but that was one of the criteria
for the server when we ordered them. Could this be clock adjustments?


We probably need to get clock folks to jump in on this thread!

Maybe it's disabling some built-in unstable clock just as part of
falling back to using the better, stable clock that you also have?
So maybe there's a way of just disabling that clocksource from the
get-go instead of having it be marked unstable later.
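
For example (untested, and the exact flags depend on the platform), on
x86 you might try booting with:

	clocksource=tsc tsc=reliable

which tells the kernel to trust the TSC and skip the clocksource
watchdog checks, so nothing ever gets marked unstable in the first place.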

If you run the test again after this storm of unstable marking, does
it all happen again?  Or is it a persistent state in the kernel?
If so, maybe you can just arrange to get to that state before starting
your application's task-isolation code.

Or, if you think it's clock adjustments, perhaps running your test with
ntpd disabled would make it work better?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v13 00/12] support "task_isolation" mode

2016-07-21 Thread Chris Metcalf

On 7/20/2016 10:04 PM, Christoph Lameter wrote:

We are trying to test the patchset on x86 and are getting strange
backtraces and aborts. It seems that the cpu before the cpu we are running
on creates an irq_work event that causes a latency event on the next cpu.

This is weird. Is there a new round robin IPI feature in the kernel that I
am not aware of?


This seems to be from your clocksource declaring itself to be
unstable, and then scheduling work to safely remove that timer.
I haven't looked at this code before (in kernel/time/clocksource.c
under CONFIG_CLOCKSOURCE_WATCHDOG) since the timers on
arm64 and tile aren't unstable.  Is it possible to boot your machine
with a stable clocksource?



Backtraces from dmesg:

[  956.603223] latencytest/7928: task_isolation mode lost due to irq_work
[  956.610817] cpu 12: irq_work violating task isolation for latencytest/7928 on cpu 13
[  956.619985] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
[  956.628765] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
[  956.637642]  0086 ce6735c7b39e7b81 88103e783d00 8134f6ff
[  956.646739]  88102c50d700 000d 88103e783d28 811986f4
[  956.655828]  88102c50d700 88203cf97f80 000d 88103e783d68
[  956.664924] Call Trace:
[  956.667945][] dump_stack+0x63/0x84
[  956.674740]  [] task_isolation_debug_task+0xb4/0xd0
[  956.682229]  [] _task_isolation_debug+0x83/0xc0
[  956.689331]  [] irq_work_queue_on+0x9c/0x120
[  956.696142]  [] tick_nohz_full_kick_cpu+0x44/0x50
[  956.703438]  [] wake_up_nohz_cpu+0x99/0x110
[  956.710150]  [] internal_add_timer+0x71/0xb0
[  956.716959]  [] add_timer_on+0xbb/0x140
[  956.723283]  [] clocksource_watchdog+0x230/0x300
[  956.730480]  [] ? __clocksource_unstable.isra.2+0x40/0x40
[  956.738555]  [] call_timer_fn+0x35/0x120
[  956.744973]  [] ? __clocksource_unstable.isra.2+0x40/0x40
[  956.753046]  [] run_timer_softirq+0x23c/0x2f0
[  956.759952]  [] __do_softirq+0xd7/0x2c5
[  956.766272]  [] irq_exit+0xf5/0x100
[  956.772209]  [] smp_apic_timer_interrupt+0x42/0x50
[  956.779600]  [] apic_timer_interrupt+0x8c/0xa0
[  956.786602][] ? poll_idle+0x40/0x80
[  956.793490]  [] cpuidle_enter_state+0x9c/0x260
[  956.800498]  [] cpuidle_enter+0x17/0x20
[  956.806810]  [] cpu_startup_entry+0x2b7/0x3a0
[  956.813717]  [] start_secondary+0x15c/0x1a0
[ 1036.601758] cpu 12: irq_work violating task isolation for latencytest/8447 on cpu 13
[ 1036.610922] CPU: 12 PID: 0 Comm: swapper/12 Not tainted 4.7.0-rc7-stream1 #1
[ 1036.619692] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 2.0.2 03/15/2016
[ 1036.628551]  0086 ce6735c7b39e7b81 88103e783d00 8134f6ff
[ 1036.637648]  88102dca 000d 88103e783d28 811986f4
[ 1036.646741]  88102dca 88203cf97f80 000d 88103e783d68
[ 1036.655833] Call Trace:
[ 1036.658852][] dump_stack+0x63/0x84
[ 1036.665649]  [] task_isolation_debug_task+0xb4/0xd0
[ 1036.673136]  [] _task_isolation_debug+0x83/0xc0
[ 1036.680237]  [] irq_work_queue_on+0x9c/0x120
[ 1036.687091]  [] tick_nohz_full_kick_cpu+0x44/0x50
[ 1036.694388]  [] wake_up_nohz_cpu+0x99/0x110
[ 1036.701089]  [] internal_add_timer+0x71/0xb0
[ 1036.707896]  [] add_timer_on+0xbb/0x140
[ 1036.714210]  [] clocksource_watchdog+0x230/0x300
[ 1036.721411]  [] ? __clocksource_unstable.isra.2+0x40/0x40
[ 1036.729478]  [] call_timer_fn+0x35/0x120
[ 1036.735899]  [] ? __clocksource_unstable.isra.2+0x40/0x40
[ 1036.743970]  [] run_timer_softirq+0x23c/0x2f0
[ 1036.750878]  [] __do_softirq+0xd7/0x2c5
[ 1036.757199]  [] irq_exit+0xf5/0x100
[ 1036.763132]  [] smp_apic_timer_interrupt+0x42/0x50
[ 1036.770520]  [] apic_timer_interrupt+0x8c/0xa0
[ 1036.777520][] ? poll_idle+0x40/0x80
[ 1036.784410]  [] cpuidle_enter_state+0x9c/0x260
[ 1036.791413]  [] cpuidle_enter+0x17/0x20
[ 1036.797734]  [] cpu_startup_entry+0x2b7/0x3a0
[ 1036.804641]  [] start_secondary+0x15c/0x1a0




--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v13 00/12] support "task_isolation" mode

2016-07-18 Thread Chris Metcalf

On 7/18/2016 6:11 PM, Andy Lutomirski wrote:

As an example, enough vmalloc/vfree activity will eventually cause
flush_tlb_kernel_range to be called and*boom*, there goes your shiny
production dataplane application.


Well, that's actually a refinement that I did not inflict on this patch
series.

Submit it separately, perhaps?

The "kill the process if it goofs" thing while there are known goofs
in the kernel, apparently with patches written but unsent, seems
questionable.


Sure, that's a good idea.

I think what I will plan to do is, once the patch series is accepted into
some tree, return to this piece.  I'll have to go back and look at the internal
Tilera version of this code, since we have diverged quite a ways from that
in the 13 versions of the patch series, but my memory is that the kernel TLB
flush management was the only substantial piece of additional code not in
the initial batch of changes.  The extra requirement is the need to have a
hook very early on in the kernel entry path that you can hook in all paths;
arm64 has the ct_user_exit macro and tile has the finish_interrupt_save macro,
but I'm not sure there's something equivalent on x86 to catch all entries.

It's worth noting, though, that the typical deployment for task isolation
(at least in our experience) is a pretty dedicated machine, with the primary
application running in task isolation mode almost all of the time, and so
you are generally in pretty good control of all aspects of the system,
including whether or not you are generating kernel TLB flushes from your
non-task-isolation cores.  So I would argue the kernel TLB flush management
piece is an improvement to, not a requirement for, the main patch series.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v13 00/12] support "task_isolation" mode

2016-07-14 Thread Chris Metcalf

On 7/14/2016 5:03 PM, Andy Lutomirski wrote:

On Thu, Jul 14, 2016 at 1:48 PM, Chris Metcalf  wrote:

Here is a respin of the task-isolation patch set.  This primarily
reflects feedback from Frederic and Peter Z.

I still think this is the wrong approach, at least at this point.  The
first step should be to instrument things if necessary and fix the
obvious cases where the kernel gets entered asynchronously.


Note, however, that the task_isolation_debug mode is a very convenient
way of discovering what is going on when things do go wrong for task isolation.


Only once
there's a credible reason to believe it can work well should any form
of strictness be applied.


I'm not sure what criteria you need for this, though.  Certainly we've been
shipping our version of task isolation to customers since 2008, and there
are quite a few customer applications in production that are working well.
I'd argue that's a credible reason.


As an example, enough vmalloc/vfree activity will eventually cause
flush_tlb_kernel_range to be called and *boom*, there goes your shiny
production dataplane application.


Well, that's actually a refinement that I did not inflict on this patch series.

In our code base, we have a hook for kernel TLB flushes that defers such
flushes for cores that are running in userspace, because, after all, they
don't yet care about such flushes.  Instead, we atomically set a flag that
is checked on entry to the kernel, and that causes the TLB flush to occur
at that point.
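
A rough sketch of that approach (hypothetical names throughout; the
real Tilera code differs in detail):

/* Defer kernel TLB flushes for cores running isolated in userspace. */
static DEFINE_PER_CPU(atomic_t, kernel_flush_pending);

void kernel_tlb_flush_cpu(int cpu)		/* hypothetical entry point */
{
	if (task_isolation_cpu_in_userspace(cpu))	/* hypothetical test */
		atomic_set(&per_cpu(kernel_flush_pending, cpu), 1);
	else
		smp_call_function_single(cpu, local_flush_fn, NULL, 1);
}

/* Hooked very early on every kernel entry path of an isolated core. */
void task_isolation_kernel_enter(void)
{
	if (atomic_xchg(this_cpu_ptr(&kernel_flush_pending), 0))
		local_flush_tlb_all();	/* perform the deferred flush now */
}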


On very brief inspection, __kmem_cache_shutdown will be a problem on
some workloads as well.


That looks like it should be amenable to a version of the same fix I pushed
upstream in 5fbc461636c32efd ("mm: make lru_add_drain_all() selective").
You would basically check which cores have non-empty caches, and only
interrupt those cores.  For extra credit, you empty the cache on your local cpu
when you are entering task isolation mode.  Now you don't get interrupted.
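
A sketch of that shape, loosely modeled on the lru_add_drain_all()
change (cpu_cache_nonempty() and drain_local_cache() are hypothetical
stand-ins for the real per-subsystem pieces):

static DEFINE_PER_CPU(struct work_struct, drain_work);

static void drain_remote_caches(void)
{
	cpumask_var_t mask;
	int cpu;

	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	/* Only interrupt the cpus that actually have cached work. */
	for_each_online_cpu(cpu) {
		if (cpu_cache_nonempty(cpu)) {
			struct work_struct *w = &per_cpu(drain_work, cpu);

			INIT_WORK(w, drain_local_cache);
			schedule_work_on(cpu, w);
			cpumask_set_cpu(cpu, mask);
		}
	}

	for_each_cpu(cpu, mask)
		flush_work(&per_cpu(drain_work, cpu));

	free_cpumask_var(mask);
}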

To be fair, I've never seen this particular path cause an interruption.  And I
think this speaks to the fact that there really can't be a black and white
decision about when you have removed enough possible interrupt paths.
It really does depend on what else is running on your machine in addition
to the task isolation code, and that will vary from application to application.
And, as the kernel evolves, new ways of interrupting task isolation cores
will get added and need to be dealt with.  There really isn't a perfect time
you can wait for and then declare that all the asynchronous entry cases
have been dealt with and now things are safe for task isolation.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH v13 05/12] task_isolation: track asynchronous interrupts

2016-07-14 Thread Chris Metcalf
This commit adds support for tracking asynchronous interrupts
delivered to task-isolation tasks, e.g. IPIs or IRQs.  Just
as for exceptions and syscalls, when this occurs we arrange to
deliver a signal to the task so that it knows it has been
interrupted.  If the task is interrupted by an NMI, we can't
safely deliver a signal, so we just dump out a console stack.

We also support a new "task_isolation_debug" flag which forces
the console stack to be dumped out regardless.  We try to catch
the original source of the interrupt, e.g. if an IPI is dispatched
to a task-isolation task, we dump the backtrace of the remote
core that is sending the IPI, rather than just dumping out a
trace showing the core received an IPI from somewhere.

Calls to task_isolation_debug() can be placed in the
platform-independent code when that results in fewer lines
of code changes, as for example is true of the users of the
arch_send_call_function_*() APIs.  Or, they can be placed in the
per-architecture code when there are many callers, as for example
is true of the smp_send_reschedule() call.

A further cleanup might be to create an intermediate layer, so that
for example smp_send_reschedule() is a single generic function that
just calls arch_smp_send_reschedule(), allowing generic code to be
called every time smp_send_reschedule() is invoked.  But for now,
we just update either callers or callees as makes most sense.
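
Such a layer might look roughly like this (a sketch only;
arch_smp_send_reschedule() is the hypothetical per-architecture hook):

/* kernel/smp.c: one generic entry point, so the hook runs for every caller. */
void smp_send_reschedule(int cpu)
{
	task_isolation_debug(cpu, "reschedule IPI");
	arch_smp_send_reschedule(cpu);
}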

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt|  8 
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h  | 13 ++
 kernel/irq_work.c  |  5 ++-
 kernel/isolation.c | 74 ++
 kernel/sched/core.c| 42 +++
 kernel/signal.c|  7 
 kernel/smp.c   |  6 ++-
 kernel/softirq.c   | 33 +++
 9 files changed, 192 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 3db9bea08ed6..15fe7f029f8b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3900,6 +3900,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.
 
+   task_isolation_debug[KNL]
+   In kernels built with CONFIG_TASK_ISOLATION
+   and booted in task_isolation= mode, this
+   setting will generate console backtraces when
+   the kernel is about to interrupt a task that
+   has requested PR_TASK_ISOLATION_ENABLE and is
+   running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+   return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index d9288b85b41f..02728b1f8775 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -46,6 +46,17 @@ extern void _task_isolation_quiet_exception(const char *fmt, ...);
_task_isolation_quiet_exception(fmt, ## __VA_ARGS__); \
} while (0)
 
+extern void _task_isolation_debug(int cpu, const char *type);
+#define task_isolation_debug(cpu, type)				\
+   do {\
+   if (task_isolation_possible(cpu))   \
+   _task_isolation_debug(cpu, type);   \
+   } while (0)
+
+extern void task_isolation_debug_cpumask(const struct cpumask *,
+const char *type);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p,
+ const char *type);
 #else
 static inline void task_isolation_init(vo

[PATCH v13 04/12] task_isolation: add initial support

2016-07-14 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.

Once the task has returned to userspace after issuing the prctl(),
if it enters the kernel again via system call, page fault, or any
of a number of other synchronous traps, the kernel will kill it
with SIGKILL.  For system calls, this test is performed immediately
before the SECCOMP test and causes the syscall to return immediately
with ENOSYS.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c  |  18 +++
 include/linux/isolation.h   |  60 ++
 include/linux/sched.h   |   3 +
 include/linux/tick.h|   2 +
 include/uapi/linux/prctl.h  |   5 +
 init/Kconfig|  27 +
 kernel/Makefile |   1 +
 kernel/fork.c   |   3 +
 kernel/isolation.c  | 217 
 kernel/signal.c |   8 ++
 kernel/sys.c|   9 ++
 kernel/time/tick-sched.c|  36 +++---
 13 files changed, 384 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 82b42c958d1c..3db9bea08ed6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3892,6 +3892,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation= [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION=y, set
+   the specified list of CPUs where cpus will be able
+   to use prctl(PR_SET_TASK_ISOLATION) to set up task
+   isolation mode.  Setting this boot flag implicitly
+   also sets up nohz_full and isolcpus mode for the
+   listed set of cpus.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drive

[PATCH v13 11/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-07-14 Thread Chris Metcalf
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Signed-off-by: Chris Metcalf 
---
 init/Kconfig   | 10 ++
 kernel/isolation.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index fc71444f9c30..0b8384c76571 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -810,6 +810,16 @@ config TASK_ISOLATION
 You should say "N" unless you are intending to run a
 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.
+
 config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index a9fd4709825a..5e6cd67dfb0c 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -43,8 +43,14 @@ int __init task_isolation_init(void)
 {
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+   alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+   cpumask_copy(task_isolation_map, cpu_possible_mask);
+   cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}
 
/*
-- 
2.7.2



[PATCH v13 12/12] task_isolation: add user-settable notification signal

2016-07-14 Thread Chris Metcalf
By default, if a task in task isolation mode re-enters the kernel,
it is terminated with SIGKILL.  With this commit, the application
can choose what signal to receive on a task isolation violation
by invoking prctl() with PR_TASK_ISOLATION_ENABLE, or'ing in the
PR_TASK_ISOLATION_USERSIG bit, and setting the specific requested
signal by or'ing in PR_TASK_ISOLATION_SET_SIG(sig).

This mode allows for catching the notification signal; for example,
in a production environment, it might be helpful to log information
to the application logging mechanism before exiting.  Or, the
application might choose to re-enable task isolation and return to
continue execution.
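
For example, a minimal (untested) sketch of that usage:

#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

static void isolation_lost(int sig)
{
	/* Async-signal-safe logging; could also re-enable isolation here. */
	write(2, "task isolation violated\n", 24);
}

int main(void)
{
	signal(SIGUSR1, isolation_lost);
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
	      PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(SIGUSR1));

	/* ... isolated work; a violation now raises SIGUSR1, not SIGKILL ... */
	return 0;
}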

As a special case, the user may set the signal to 0, which means
that no signal will be delivered.  In this mode, the application
may freely enter the kernel for syscalls and synchronous exceptions
such as page faults, but each time it will be held in the kernel
before returning to userspace until the kernel has quiesced timer
ticks or other potential future interruptions, just like it does
on return from the initial prctl() call.  Note that in this mode,
the task can be migrated away from its initial task_isolation core,
and if it is migrated to a non-isolated core it will lose task
isolation until it is migrated back to an isolated core.
In addition, in this mode we no longer require the affinity to
be set correctly on entry (though we warn on the console if it's
not right), and we don't bother to notify the user that the kernel
isn't ready to quiesce either (since we'll presumably be in and
out of the kernel multiple times with task isolation enabled anyway).
The PR_TASK_ISOLATION_NOSIG define is provided as a convenience
wrapper to express this semantic.

Signed-off-by: Chris Metcalf 
---
 include/uapi/linux/prctl.h |  5 
 kernel/isolation.c | 62 ++
 2 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 2a49d0d2940a..7af6eb51c1dc 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,10 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION  48
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_USERSIG (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
+# define PR_TASK_ISOLATION_NOSIG \
+   (PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 5e6cd67dfb0c..aca5de5e2e05 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -85,6 +85,15 @@ static bool can_stop_my_full_tick_now(void)
return ret;
 }
 
+/* Get the signal number that will be sent for a particular set of flag bits. */
+static int task_isolation_sig(int flags)
+{
+   if (flags & PR_TASK_ISOLATION_USERSIG)
+   return PR_TASK_ISOLATION_GET_SIG(flags);
+   else
+   return SIGKILL;
+}
+
 /*
  * This routine controls whether we can enable task-isolation mode.
  * The task must be affinitized to a single task_isolation core, or
@@ -92,16 +101,30 @@ static bool can_stop_my_full_tick_now(void)
  * stop the nohz_full tick (e.g., no other schedulable tasks currently
  * running, no POSIX cpu timers currently set up, etc.); if not, we
  * return EAGAIN.
+ *
+ * If we will not be strictly enforcing kernel re-entry with a signal,
+ * we just generate a warning printk if there is a bad affinity set
+ * on entry (since after all you can always change it again after you
+ * call prctl) and we don't bother failing the prctl with -EAGAIN
+ * since we assume you will go in and out of kernel mode anyway.
  */
 int task_isolation_set(unsigned int flags)
 {
if (flags != 0) {
+   int sig = task_isolation_sig(flags);
+
if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
!task_isolation_possible(raw_smp_processor_id())) {
/* Invalid task affinity setting. */
-   return -EINVAL;
+   if (sig)
+   return -EINVAL;
+   else
+   pr_warn("%s/%d: enabling non-signalling task isolation\n"
+   "and not bound to a single task isolation core\n",
+   current->comm, current->pid);
}
-   if (!can_stop_my_full_tick_now()) {
+
+   if (sig && !can_stop_my_full_tick_now()) {
/* System not yet ready for task isolation. */
return -EAGAIN;
}
@@ -160,11 +183,11 @@

[PATCH v13 00/12] support "task_isolation" mode

2016-07-14 Thread Chris Metcalf
Here is a respin of the task-isolation patch set.  This primarily
reflects feedback from Frederic and Peter Z.

Changes since v12:

- Rebased on v4.7-rc7.

- New default "strict" model for task isolation - tasks exit the
  kernel from the initial prctl() to userspace, and can only legally
  exit by calling prctl() again to turn off isolation.  Any other
  kernel entry results in a SIGKILL by default.

- New optional "relaxed" mode, where the application can receive some
  signal other than SIGKILL, or no signal at all, when it re-enters
  the kernel.  Since by default task isolation is now strict, there is
  no longer an additional "STRICT" mode, but rather a new "NOSIG" mode
  that builds on top of the "USERSIG" support for setting a signal
  other than SIGKILL to be delivered to the process.  The "NOSIG" mode
  also relaxes the required criteria for entering task isolation mode;
  we just issue a warning if the affinity isn't set right, and we
  don't fail with EAGAIN if the kernel isn't ready to stop the tick.

  Running your task-isolation application in this "NOSIG" mode is also
  necessary when debugging, since otherwise hitting breakpoints, etc.,
  will cause a fatal signal to be sent to the process.

  Frederic has suggested we might want to defer this functionality
  until later, but (in addition to the debuggability aspect) there is
  some thought that it might be useful for e.g. HPC, so I have just
  broken out the additional semantics into a single separate patch at
  the end of the series.

- Function naming has been changed and comments have been added to try
  to clarify the role of the task-isolation reporting on kernel
  entries that do NOT cause signals.  This hopefully clarifies why we
  only invoke the renamed task_isolation_quiet_exception() in a few
  places, since all the other places generate signals anyway. [PeterZ]

- The task_isolation_debug() call now has an inline piece that checks
  to see if the target is a task_isolation cpu before actually
  calling. [PeterZ]

- In _task_isolation_debug(), we use the new task_struct_trylock()
  call that is in linux-next now; for now I just have a static copy of
  the function, which I will switch to using the version from
  linux-next in the next rebasing. [PeterZ]

- We now pass a string describing the interrupt up from
  task_isolation_debug() so there is more information on where the
  interrupt came from beyond just the stack backtrace. [PeterZ]

- I added task_isolation_debug() hooks to smp_sched_reschedule() on
  x86, which was missing before, and removed the hooks in the tile
  send_IPI_*() routines, since there were already hooks in the
  callers.  Likewise I moved the hook for arm64 from the generic
  smp_cross_call() routine to the only caller that wasn't already
  hooked, smp_send_reschedule().  The commit message clarifies the
  rationale for where hooks are placed.

- I moved the page fault reporting so that it only reports in the case
  that we are not also sending a SIGSEGV/SIGBUS, for consistency with
  other uses of task_isolation_quiet_exception().

The previous (v12) patch series is here:

https://lkml.kernel.org/g/1459877922-15512-1-git-send-email-cmetc...@mellanox.com

This version of the patch series has been tested on arm64 and tilegx,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.
Frederick, do you have a sense of what is left to be done there?
I can certainly try to contribute to that effort as well.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: track asynchronous interrupts
  arch/x86: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm, tile: turn off timer tick for oneshot_stopped state
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: add user-settable notification signal

 Documentation/kernel-parameters.txt|  16 ++
 arch/arm64/Kconfig |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S  |  12 +-
 arch/arm64/kernel/ptrace.c |  15 +-
 arch/arm64/kernel/signal.c |  42 +++-
 arch/arm64/kernel/smp.c|   2 +
 arch/arm64/mm/fault.c  |   8 +-
 arch/tile/Kconfig  |   1 +
 arch/tile/include/asm/thread_info.h|   4 +-
 arch/tile/kernel/process.c |   9 +
 arch/tile/kernel/ptrace.c   

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-07-01 Thread Chris Metcalf

On 6/29/2016 11:18 AM, Frederic Weisbecker wrote:

On Fri, Jun 03, 2016 at 03:32:04PM -0400, Chris Metcalf wrote:

On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:

  TL;DR: Let's make an explicit decision about whether task isolation
  should be "persistent" or "one-shot".  Both have some advantages.
  =

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers.

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

Good, although that quiescing on kernel return must be an option.


Can you spell out why you think turning it off is helpful?  I'll admit
this is the default mode in the commercial version of task isolation
that we ship, and was also the default in the first LKML patch series.
But on consideration I haven't found scenarios where skipping the
quiescing is helpful.  Admittedly you get out of the kernel faster,
but then you're back in userspace and vulnerable to yet more
unexpected interrupts until the timer quiesces.  If you're asking for
task isolation, this is surely not what you want.


I just feel that quiescing, on the way back to user after an unwanted
interruption, is awkward. The quiescing should work once and for all
on return back from the prctl. If we still get disturbed afterward,
either the quiescing is buggy or incomplete, or something is on the
way that can not be quiesced.


If we are thinking of an initial implementation that doesn't allow any
subsequent kernel entry to be valid, then this all gets much easier,
since any subsequent kernel entry except for a prctl() syscall will
result in a signal, which will turn off task isolation, and we will
never have to worry about additional quiescing.  I think that's where
we got from the discussion at the bottom of this email.

So for your question here, we're really just thinking about future
directions as far as how to handle interrupts, and if in the future we
add support for allowing syscalls and/or exceptions without leaving
task isolation mode, then we have to think about how that interacts
with interrupts.  The problem is that it's hard to tell, as you're
returning to userspace, whether you're returning from an exception or
an interrupt; you typically don't have that information available.  So
from a purely ease-of-implementation perspective, we'd likely want to
handle exceptions and interrupts the same way, and quiesce both.

In general, I think it would also be a better explanation to users of
task isolation to say "every enter/exit to the kernel is either an
error that causes a signal, or it quiesces on return".  It's a simpler
semantic, and I think it also is better for interrupts anyway, since
it potentially avoids multiple interrupts to the application (whatever
interrupted to begin with, plus potential timer interrupts later).

But that said, if we start with "pure strict" mode only, all of this
becomes hypothetical, and we may in fact choose never to allow "safe"
modes of entering the kernel.


I'm not actually sure what
you're recommending we do to avoid exceptions.  Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them.  For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region.  I'd argue it's an application
bug; one should enable "strict" mode to catch and deal with such bugs.

They are not all deterministic. For example a breakpoint, a step, a trap
can be set up by another process. So this is not entirely under the control
of the user.


That's true, but I'd argue the behavior in that case should be that you can
raise that kind of exception validly (so you can debug), and then you should
quiesce on return to userspace so the application doesn't see additional
exceptions.


I don't see how we can quiesce such things.


I'm imagining task A is

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-06-03 Thread Chris Metcalf

On 5/25/2016 9:07 PM, Frederic Weisbecker wrote:

I don't remember how much I answered this email, but I need to finish that :-)


Sorry for the slow response - it's been a busy week.


On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:

   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers.

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

Good, although that quiescing on kernel return must be an option.


Can you spell out why you think turning it off is helpful?  I'll admit
this is the default mode in the commercial version of task isolation
that we ship, and was also the default in the first LKML patch series.
But on consideration I haven't found scenarios where skipping the
quiescing is helpful.  Admittedly you get out of the kernel faster,
but then you're back in userspace and vulnerable to yet more
unexpected interrupts until the timer quiesces.  If you're asking for
task isolation, this is surely not what you want.


If you enable "strict" mode, we disable task isolation mode for that
core and deliver a signal to it.  This lets the application know that
an interrupt occurred, and it can take whatever kind of logging or
debugging action it wants to, re-enable task isolation if it wants to
and continue, or just exit or abort, etc.

Good.


If you don't enable "strict" mode, but you do have
task_isolation_debug enabled as a boot flag, you will at least get a
console dump with a backtrace and whatever other data we have.
(Sometimes the debug info actually includes a backtrace of the
interrupting core, if it's an IPI or TLB flush from another core,
which can be pretty useful.)

Right, I suggest we use trace events btw.


This is probably a good idea, although I wonder if it's worth deferring
until after the main patch series goes in - I'm reluctant to expand the scope
of this patch series and add more reasons for it to get delayed :-)
What do you think?


"One-shot mode": A task requests isolation via prctl(), the kernel
ensures it is isolated on return from the prctl(), but then as soon as
it enters the kernel again, task isolation is switched off until
another prctl is issued.  This is what you recommended in your last
email.

No I think we can issue syscalls for exemple. But asynchronous interruptions
such as exceptions (actually somewhat synchronous but can be unexpected) and
interrupts are what we want to avoid.

Hmm, so I think I'm not really understanding what you are suggesting.

We're certainly in agreement that avoiding interrupts and exceptions
is important.  I'm arguing that the way to deal with them is to
generate appropriate signals/printks, etc.

Yes.


I'm not actually sure what
you're recommending we do to avoid exceptions.  Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them.  For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region.  I'd argue it's an application
bug; one should enable "strict" mode to catch and deal with such bugs.

They are not all deterministic. For example a breakpoint, a step, a trap
can be set up by another process. So this is not entirely under the control
of the user.


That's true, but I'd argue the behavior in that case should be that you can
raise that kind of exception validly (so you can debug), and then you should
quiesce on return to userspace so the application doesn't see additional
exceptions.  There are two ways you could handle debugging:

1. Require the program to set the flag that says it doesn't want a signal
when it is interrupted (so you can interrupt it to debug it, and not kill it);

2. Or have debugging automatically set that flag in the target process.
Similarly,

Re: [PATCH v12 07/13] task_isolation: add debug boot flag

2016-05-19 Thread Chris Metcalf

On 5/19/2016 1:54 PM, Peter Zijlstra wrote:

So the 'simple' thing is:

struct rq *rq = cpu_rq(cpu);
struct task_struct *task;

raw_spin_lock_irq(&rq->lock);
task = rq->curr;
get_task_struct(task);
raw_spin_unlock_irq(&rq->lock);

Because by holding rq->lock, the remote CPU cannot schedule and the
current task_must_  still be valid.


I will plan to use that idiom in the next patch series.  Thanks!


And note; the above can result in a task which already has PF_EXITING
set.


I think that should be benign though; we may generate an unnecessary
warning, but somebody was doing something that could have resulted in
interrupting an isolated task anyway, so warning about it is reasonable.  And
presumably PF_EXITING just means we don't wake any threads and leave
the signal queued, but that gets flushed when the task finally exits.


The complex thing is described in the linked thread and will likely make
your head hurt.


I read the linked thread and was entertained. :-)  I suspect locking the
runqueue may be the more robust solution anyway, and since this is
presumably not a hot path, it seems easier to reason about this way.
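
For concreteness, here is a sketch of what task_isolation_debug()
might then look like (task_isolation_debug_task() being the existing
notification path):

void task_isolation_debug(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *task;

	if (!task_isolation_possible(cpu))
		return;

	raw_spin_lock_irq(&rq->lock);
	task = rq->curr;
	get_task_struct(task);		/* rq->lock pins rq->curr */
	raw_spin_unlock_irq(&rq->lock);

	/* task may already have PF_EXITING set; that's benign here. */
	task_isolation_debug_task(cpu, task);
	put_task_struct(task);
}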

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v12 07/13] task_isolation: add debug boot flag

2016-05-19 Thread Chris Metcalf

(Resending in text/plain.  I just screwed around with my Thunderbird
config some more in hopes of getting it to pay attention to all the
settings that say "use plain text for LKML", but, we'll see.)

On 5/18/2016 1:06 PM, Peter Zijlstra wrote:

On Wed, May 18, 2016 at 12:35:19PM -0400, Chris Metcalf wrote:

On 5/18/2016 9:56 AM, Peter Zijlstra wrote:

On Tue, Apr 05, 2016 at 01:38:36PM -0400, Chris Metcalf wrote:

+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+   struct task_struct *p;
+
+   if (!task_isolation_possible(cpu))
+   return;
+
+   rcu_read_lock();
+   p = cpu_curr(cpu);
+   get_task_struct(p);
+   rcu_read_unlock();
+   task_isolation_debug_task(cpu, p);
+   put_task_struct(p);

This is still broken...

I don't know how or why, though. :-)  Can you give me a better idiom?
This looks to my eye just like how it's done for something like
sched_setaffinity() by one task on another task, and I would have
assumed the risks there of the other task evaporating part way
through would be the same as the risks here.

Because rcu_read_lock() does not stop the task pointed to by
cpu_curr(cpu) from disappearing on you entirely.


So clearly once we have a task struct with an incremented usage count,
we are golden: the task isolation code only touches immediate fields
of task_struct, which are guaranteed to stick around until we
put_task_struct(), and the other path is into send_sig_info(), which
is already robust to the task struct being exited (the ->sighand
becomes NULL and we bail out in __lock_task_sighand, otherwise we're
holding sighand->siglock until we deliver the signal).

So, I think what you're saying is that there is a race between when we
read per_cpu(runqueues, cpu).curr, and when we increment the
p->usage value in the task, and that the RCU read lock doesn't help
with that?  My impression was that by being the ".curr" task, we are
guaranteed that it hasn't gone through do_exit() yet, and thus we
benefit from an RCU guarantee around being able to validly dereference
the pointer, i.e. it hasn't yet been freed and so dereferencing is safe.

I don't see how grabbing the ->curr from the runqueue is any more
fragile from an RCU perspective than grabbing the task from the pid in
kill_pid_info().  And in fact, that code doesn't even bump
task->usage, as far as I can see, just relying on getting the
sighand->siglock.

Anyway, whatever more clarity you can offer me, or suggestions for
APIs to use are welcome.


See also the discussion around:

lkml.kernel.org/r/20160518170218.gy3...@twins.programming.kicks-ass.net


This makes me wonder if I should use rcu_dereference(cpu_curr(cpu))
just for clarity, though I think it's just as correct either way.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode

2016-05-18 Thread Chris Metcalf

On 5/18/2016 9:44 AM, Peter Zijlstra wrote:

On Tue, Apr 05, 2016 at 01:38:35PM -0400, Chris Metcalf wrote:

+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+   siginfo_t info = {};
+   int sig;
+
+   pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+   task->comm, task->pid, buf);
+

So the function name suggests this is called for interrupts, except its
purpose is to deliver a signal.


Fair point.  I'll name it task_isolation_deliver_signal() in the next patch 
series.


Now, in case of exceptions the violation isn't necessarily _by_ the task
itself. You might want to change that to report the exception
type/number instead of the affected task.


Well, we do report whatever exception information we have.  For example
a page fault exception will report the address or whatever other info is
handy; it's easy to tune since it's just a vsnprintf of some varargs from the
architecture layer.

For things like IPIs or TLB invalidations or whatever, the code currently just
reports "interrupt"; I could arrange to pass down more informative varargs
from the caller for that as well.  Let me look into it.
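
For example (hypothetical call sites, just to sketch the kind of
detail the varargs could carry):

	/* synchronous exception path, much as today: */
	task_isolation_exception("page fault at %#lx", address);

	/* remote-interrupt paths, instead of a generic "interrupt": */
	task_isolation_exception("IPI from cpu %d", from_cpu);
	task_isolation_exception("TLB flush from cpu %d", from_cpu);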

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v12 04/13] task_isolation: add initial support

2016-05-18 Thread Chris Metcalf

On 5/18/2016 9:34 AM, Peter Zijlstra wrote:

On Tue, Apr 05, 2016 at 01:38:33PM -0400, Chris Metcalf wrote:

diff --git a/kernel/signal.c b/kernel/signal.c
index aa9bf00749c1..53e4e62f2778 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -34,6 +34,7 @@
  #include 
  #include 
  #include 
+#include <linux/isolation.h>
  
  #define CREATE_TRACE_POINTS

  #include <trace/events/signal.h>
@@ -2213,6 +2214,9 @@ relock:
/* Trace actually delivered signals. */
trace_signal_deliver(signr, &ksig->info, ka);
  
+		/* Disable task isolation when delivering a signal. */
+		task_isolation_set_flags(current, 0);
+

Why !? Changelog is quiet on this.


There are really two reasons.

1. If the task is receiving a signal, it will know it's not isolated
   any more, so we don't need to worry about notifying it explicitly.
   This behavior is easy to document and allows the application to decide
   if the signal is unexpected and it should go straight to its error
   handling path (likely outcome, and in that case you want task isolation
   off anyway) or if it thinks it can plausibly re-enable isolation and
   return to where the signal interrupted it (hard to imagine how this
   would ever make sense, but you could if you wanted to).

2. When we are delivering a signal we may already be holding the lock
   for the signal subsystem, and it gets hard to figure out whether it's
   safe to send another signal to the application as a "task isolation
   broken" notification.  For example, sending a signal to a task on
   another core involves doing an IPI to that core to kick it; the IPI
   normally is a generic point for notifying the remote core of broken
   task isolation and sending a signal - except that at the point where
   we would do that on the signal path we are already holding the lock,
   so we end up deadlocked.  We could no doubt work around that, but it
   seemed cleaner to decouple the existing signal mechanism from the
   signal delivery for task isolation.

I will add more discussion of the rationale to the commit message.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-05-18 Thread Chris Metcalf

On 5/18/2016 9:35 AM, Peter Zijlstra wrote:

On Tue, Apr 05, 2016 at 01:38:34PM -0400, Chris Metcalf wrote:

This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Hurm, we still have that option? I thought we killed it, because random
people set it and 'complain' their system misbehaves.


It's still in, as of 4.6 (and still in linux-next too).  I did receive
feedback saying the option was useful, when setting up a kernel to run
isolation apps on systems that may have a varying number of processors,
since it means you don't need to tweak the boot arguments each time.

A different approach that I'd be happy to pursue would be to provide
a "clipping" version of cpulist_parse() that allows you to pass a boot
argument like "nohz_full=1-1000" and just clip off the impossible cpus.
We could then change "nohz_full=" and "task_isolation=" to use it.
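
A minimal sketch of such a helper (assuming a 1024-bit cap is enough
for any plausible boot argument):

static int cpulist_parse_clipped(const char *buf, struct cpumask *dstp)
{
	DECLARE_BITMAP(tmp, 1024);
	int ret;

	ret = bitmap_parselist(buf, tmp, 1024);
	if (ret)
		return ret;
	/* Silently drop any cpus this kernel can't actually have. */
	bitmap_copy(cpumask_bits(dstp), tmp, nr_cpumask_bits);
	return 0;
}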

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v12 07/13] task_isolation: add debug boot flag

2016-05-18 Thread Chris Metcalf

(Oops, missed one that I should have forced to text/plain. Resending.)

On 5/18/2016 9:56 AM, Peter Zijlstra wrote:

On Tue, Apr 05, 2016 at 01:38:36PM -0400, Chris Metcalf wrote:

+#ifdef CONFIG_TASK_ISOLATION
+void task_isolation_debug(int cpu)
+{
+   struct task_struct *p;
+
+   if (!task_isolation_possible(cpu))
+   return;
+
+   rcu_read_lock();
+   p = cpu_curr(cpu);
+   get_task_struct(p);
+   rcu_read_unlock();
+   task_isolation_debug_task(cpu, p);
+   put_task_struct(p);

This is still broken...


I don't know how or why, though. :-)  Can you give me a better idiom?
This looks to my eye just like how it's done for something like
sched_setaffinity() by one task on another task, and I would have
assumed the risks there of the other task evaporating part way
through would be the same as the risks here.


Also, I really don't like how you sprinkle a call all over the core
kernel. At the very least make an inline fast path for this function to
avoid the call whenever possible.


I can boost the "task_isolation_possible()" test up into a static inline,
and only call in the case where we have a target cpu that is actually
in the "task_isolation=" boot argument set.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v12 00/13] support "task_isolation" mode

2016-05-12 Thread Chris Metcalf

Ping, since the 4.7 merge window is opening soon and I haven't received
too much feedback on this version of the patch series based on 4.6-rc1.

1. Patch 09/13 for timer ticks was acked by Daniel Lezcano and is standalone,
   so could be taken into the relevant trees.  I'm not sure if it should go in
   as two separate patches through the tile and arm architecture trees,
   or through the timer tree as a combined patch.  Catalin/Will, any ideas?

   
http://lkml.kernel.org/g/1459877922-15512-10-git-send-email-cmetc...@mellanox.com

2. Patch 12/13, factoring the work_pending state machine for ARM64 into C,
   should go via the arm64 tree.  Mark Rutland should probably Ack it
   first:

   
http://lkml.kernel.org/g/1459877922-15512-13-git-send-email-cmetc...@mellanox.com

3. Frederic provided some more feedback and I think we are still waiting
   to close the loop on the notion of how strict we should be by default:

   http://lkml.kernel.org/g/571e7fc9.60...@mellanox.com

We have been flogging this patch series along for just over a year now;
v1 of the patch series was sent on May 8, 2015.  Phew!

On 4/5/2016 1:38 PM, Chris Metcalf wrote:

Here is a respin of the task-isolation patch set.  The previous one
came out just before the merge window for 4.6 opened, so I suspect
folks may have been busy merging, since it got few comments.

Frederic, how are you feeling about taking this all via your tree?
And what is your take on the new PR_TASK_ISOLATION_ONE_SHOT mode?
I'm not sure what the right path to upstream for this series is.

Changes since v11:

- Rebased on v4.6-rc1.  This required me to create a
   can_stop_my_full_tick() helper in tick-sched.c, since the underlying
   can_stop_full_tick() now takes a struct tick_sched.

- Added a HAVE_ARCH_TASK_ISOLATION Kconfig flag so that you can't
   try to build with TASK_ISOLATION enabled for an architecture until
   it is explicitly configured to work.  This avoids possible
   allyesconfig build failures for unsupported architectures, or even
   for supported ones when bisecting to the middle of this series.

- Return EAGAIN instead of EINVAL for the enabling prctl() if the task
   is affinitized to a task-isolation core, but things just aren't yet
   right for it (e.g. another task running).  This lets the caller
   differentiate a potentially transient failure from a permanent
   failure, for which we still return EINVAL.

The previous (v11) patch series is here:

https://lkml.kernel.org/r/1457734223-26209-1-git-send-email-cmetc...@mellanox.com

This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.

The series is available at:

   git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
   vmstat: add quiet_vmstat_sync function
   vmstat: add vmstat_idle function
   lru_add_drain_all: factor out lru_add_drain_needed
   task_isolation: add initial support
   task_isolation: support CONFIG_TASK_ISOLATION_ALL
   task_isolation: support PR_TASK_ISOLATION_STRICT mode
   task_isolation: add debug boot flag
   task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag
   arm, tile: turn off timer tick for oneshot_stopped state
   arch/x86: enable task isolation functionality
   arch/tile: enable task isolation functionality
   arm64: factor work_pending state machine to C
   arch/arm64: enable task isolation functionality

  Documentation/kernel-parameters.txt|  16 ++
  arch/arm64/Kconfig |   1 +
  arch/arm64/include/asm/thread_info.h   |   5 +-
  arch/arm64/kernel/entry.S  |  12 +-
  arch/arm64/kernel/ptrace.c |  15 +-
  arch/arm64/kernel/signal.c |  42 -
  arch/arm64/kernel/smp.c|   2 +
  arch/arm64/mm/fault.c  |   4 +
  arch/tile/Kconfig  |   1 +
  arch/tile/include/asm/thread_info.h|   4 +-
  arch/tile/kernel/process.c |   9 +
  arch/tile/kernel/ptrace.c  |   7 +
  arch/tile/kernel/single_step.c |   5 +
  arch/tile/kernel/smp.c |  28 +--
  arch/tile/kernel/time.c|   1 +
  arch/tile/kernel/unaligned.c   |   3 +
  arch/tile/mm/fault.c   |   3 +
  arch/tile/mm/homecache.c   |   2 +
  arch/x86/Kconfig   |   1 +
  arch/x86/entry/common.c|  18 +-
  arch/x86/include/asm/thread_info.h |   2 +
  arch/x86/kernel/traps.c|   2 +
  arch/x86/mm/fault.c|   2 +
  drivers/base/cpu.c |  18 ++
  drivers/clocksource/arm_arch_timer.c   |   2 +
  include/linux/context_tracking_state.h |   6 +
  include/linux/isolation.h  |  63 ++

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-04-25 Thread Chris Metcalf

On 4/22/2016 9:16 AM, Frederic Weisbecker wrote:

On Fri, Apr 08, 2016 at 12:34:48PM -0400, Chris Metcalf wrote:

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:

   TL;DR: Let's make an explicit decision about whether task isolation
   should be "persistent" or "one-shot".  Both have some advantages.
   =

An important high-level issue is how "sticky" task isolation mode is.
We need to choose one of these two options:

"Persistent mode": A task switches state to "task isolation" mode
(kind of a level-triggered analogy) and stays there indefinitely.  It
can make a syscall, take a page fault, etc., if it wants to, but the
kernel protects it from incurring any further asynchronous interrupts.
This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers?

So here I'm taking "interrupt" to mean an external, asynchronous
interrupt, from another core or device, or asynchronously triggered
on the local core, like a timer interrupt.  By contrast I use "exception"
or "fault" to refer to synchronous, locally-triggered interruptions.

Ok.


So for interrupts, the short answer is, it's a bug! :-)

An interrupt could be a kernel bug, in which case we consider it a
"true" bug.  This could be a timer interrupt occurring even after the
task isolation code thought there were none pending, or a hardware
device that incorrectly distributes interrupts to a task-isolation
cpu, or a global IPI that should be sent to fewer cores, or a kernel
TLB flush that could be deferred until the task-isolation task
re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
bug.  I'm sure there are more such bugs that we can continue to fix
going forward; it depends on how arbitrary you want to allow code
running on other cores to be.  For example, can another core unload a
kernel module without interrupting a task-isolation task?  Not right now.

Or, it could be an application bug: the standard example is if you
have an application with task-isolated cores that also does occasional
unmaps on another thread in the same process, on another core.  This
causes TLB flush interrupts under application control.  The
application shouldn't do this, and we tell our customers not to build
their applications this way.  The typical way we encourage our
customers to arrange this kind of "multi-threading" is by having a
pure memory API between the task isolation threads and what are
typically "control" threads running on non-task-isolated cores.  The
two types of threads just both mmap some common, shared memory but run
as different processes.

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

So if we take an interrupt that we didn't expect, we want to wait at the
end of that interrupt for things to quiesce some more?


I think it's actually pretty plausible.

Consider the "application bug" case, where you're running some code that does
packet dispatch to different cores.  If a core seems to back up you stop
dispatching packets to it.

Now, we get a TLB flush.  If handling the flush causes us to restart the tick
(maybe just as a side effect of entering the kernel in the first place) we
really are better off staying in the kernel until the tick is handled and
things are quiesced again.  That way, although we may end up dropping a
bunch of packets that were queued up to that core, we only do so ONCE - we
don't do it again when the tick fires a little bit later on, when the core
has already caught up and is claiming to be able to handle packets again.

Also, pragmatically, we would require a whole bunch of machinery in the
kernel to figure out whether we were returning from a syscall, an exception,
or an interrupt, and only skip the task-isolation work for interrupts.  We
don't actually have that information available to us at the moment we are
returning to userspace right now, so we'd need to add that tracking state
in each platform's code somehow.



That doesn't look right. Things should be quiesced once and for all on
return from the initial prctl() call. We can't even expect to quiesce more
in case of interruptions; the tick can't be forced off anyway.


Yes, things are quiesced once and for all after prctl().  We also need to
be prepared to handle unexpected interrupts, though.  It's true that we can't
force the tick off, but as I suggested above, just waiting for the tick may
well be a

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-04-12 Thread Chris Metcalf

On 4/8/2016 12:34 PM, Chris Metcalf wrote:

However, this makes me wonder if "strict" mode should be the default
for task isolation??  That way task isolation really doesn't conflict
semantically with migration.  And we could provide a "weak" mode, or a
"kernel-friendly" mode, or some such nomenclature, and define the
migration semantics just for that case, where it makes it clear it's a
bit unusual. 


I noodled around with this and decided it was a better default,
so I've made the changes and pushed it up to the branch:

git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Now, by default when you enter task isolation mode, you are in
what I used to call "strict" mode, i.e. you can't use the kernel.

You can select a user-specified signal you want to deliver instead of
the default SIGKILL, and if you select signal 0, then you don't get
a signal at all and instead you get to keep running in task
isolation mode after making a syscall, page fault, etc.

Thus the API now looks like this in :

#define PR_SET_TASK_ISOLATION   48
#define PR_GET_TASK_ISOLATION   49
# define PR_TASK_ISOLATION_ENABLE   (1 << 0)
# define PR_TASK_ISOLATION_USERSIG  (1 << 1)
# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
# define PR_TASK_ISOLATION_NOSIG \
(PR_TASK_ISOLATION_USERSIG | PR_TASK_ISOLATION_SET_SIG(0))

I think this better matches what people should want to do in
their applications, and also matches the expectations people
have about what it means to go into task isolation mode in the
first place.
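
For example, a task that expects to make occasional syscalls and
wants no signal at all would now request (sketch):

	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_NOSIG, 0, 0, 0);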

I got rid of the ONESHOT mode that I added in the v12 series, since
it didn't seem like it was what Frederic had been asking for anyway,
and it didn't seem particularly useful on its own.

Frederic, how does this align with your intuition for this stuff?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v9 04/13] task_isolation: add initial support

2016-04-08 Thread Chris Metcalf

On 4/8/2016 9:56 AM, Frederic Weisbecker wrote:

On Wed, Mar 09, 2016 at 02:39:28PM -0500, Chris Metcalf wrote:
>   TL;DR: Let's make an explicit decision about whether task isolation
>   should be "persistent" or "one-shot".  Both have some advantages.
>   =
>
> An important high-level issue is how "sticky" task isolation mode is.
> We need to choose one of these two options:
>
> "Persistent mode": A task switches state to "task isolation" mode
> (kind of a level-triggered analogy) and stays there indefinitely.  It
> can make a syscall, take a page fault, etc., if it wants to, but the
> kernel protects it from incurring any further asynchronous interrupts.
> This is the model I've been advocating for.

But then in this mode, what happens when an interrupt triggers?


So here I'm taking "interrupt" to mean an external, asynchronous
interrupt, from another core or device, or asynchronously triggered
on the local core, like a timer interrupt.  By contrast I use "exception"
or "fault" to refer to synchronous, locally-triggered interruptions.

So for interrupts, the short answer is, it's a bug! :-)

An interrupt could be a kernel bug, in which case we consider it a
"true" bug.  This could be a timer interrupt occurring even after the
task isolation code thought there were none pending, or a hardware
device that incorrectly distributes interrupts to a task-isolation
cpu, or a global IPI that should be sent to fewer cores, or a kernel
TLB flush that could be deferred until the task-isolation task
re-enters the kernel later, etc.  Regardless, I'd consider it a kernel
bug.  I'm sure there are more such bugs that we can continue to fix
going forward; it depends on how arbitrary you want to allow code
running on other cores to be.  For example, can another core unload a
kernel module without interrupting a task-isolation task?  Not right now.

Or, it could be an application bug: the standard example is if you
have an application with task-isolated cores that also does occasional
unmaps on another thread in the same process, on another core.  This
causes TLB flush interrupts under application control.  The
application shouldn't do this, and we tell our customers not to build
their applications this way.  The typical way we encourage our
customers to arrange this kind of "multi-threading" is by having a
pure memory API between the task isolation threads and what are
typically "control" threads running on non-task-isolated cores.  The
two types of threads just both mmap some common, shared memory but run
as different processes.
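
(That pattern is just plain POSIX shared memory.  As an illustration,
a sketch of the mapping code both sides would run, with a purely
hypothetical ring layout, using <fcntl.h>, <sys/mman.h>, <unistd.h>:

struct ring {				/* illustrative layout */
	volatile unsigned long head;
	volatile unsigned long tail;
	char buf[1 << 20];
};

static struct ring *map_shared_ring(void)
{
	int fd = shm_open("/isol_ring", O_CREAT | O_RDWR, 0600);
	void *p;

	if (fd < 0 || ftruncate(fd, sizeof(struct ring)) < 0)
		return NULL;
	p = mmap(NULL, sizeof(struct ring), PROT_READ | PROT_WRITE,
		 MAP_SHARED, fd, 0);
	return p == MAP_FAILED ? NULL : p;
}

The isolated process should also touch every page of the mapping once
before entering isolation, so it takes no page-fault exceptions later.)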

So what happens if an interrupt does occur?

In the "base" task isolation mode, you just take the interrupt, then
wait to quiesce any further kernel timer ticks, etc., and return to
the process.  This at least limits the damage to being a single
interruption rather than potentially additional ones, if the interrupt
also caused timers to get queued, etc.

If you enable "strict" mode, we disable task isolation mode for that
core and deliver a signal to it.  This lets the application know that
an interrupt occurred, and it can take whatever kind of logging or
debugging action it wants to, re-enable task isolation if it wants to
and continue, or just exit or abort, etc.
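
As a concrete illustration, a minimal userspace sketch of that "log
and re-enable" flow, using the prctl() bits from this series (the
constants are duplicated here for the example):

#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

#define PR_SET_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#define PR_TASK_ISOLATION_STRICT	(1 << 1)
#define PR_TASK_ISOLATION_SET_SIG(sig)	(((sig) & 0x7f) << 8)

static void isol_lost(int sig)
{
	/* The kernel turned isolation off before signalling us. */
	write(2, "task isolation lost\n", 20);

	/* Log or debug as desired, then re-enable and continue. */
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
	      PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
}

int main(void)
{
	signal(SIGUSR1, isol_lost);
	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
	      PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
	for (;;)
		;	/* isolated polling loop */
}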

If you don't enable "strict" mode, but you do have
task_isolation_debug enabled as a boot flag, you will at least get a
console dump with a backtrace and whatever other data we have.
(Sometimes the debug info actually includes a backtrace of the
interrupting core, if it's an IPI or TLB flush from another core,
which can be pretty useful.)


> "One-shot mode": A task requests isolation via prctl(), the kernel
> ensures it is isolated on return from the prctl(), but then as soon as
> it enters the kernel again, task isolation is switched off until
> another prctl is issued.  This is what you recommended in your last
> email.

No, I think we can issue syscalls, for example. But asynchronous interruptions
such as exceptions (actually somewhat synchronous but can be unexpected) and
interrupts are what we want to avoid.


Hmm, so I think I'm not really understanding what you are suggesting.

We're certainly in agreement that avoiding interrupts and exceptions
is important.  I'm arguing that the way to deal with them is to
generate appropriate signals/printks, etc.  I'm not actually sure what
you're recommending we do to avoid exceptions.  Since they're
synchronous and deterministic, we can't really avoid them if the
program wants to issue them.  For example, mmap() some anonymous
memory and then start running, and you'll take exceptions each time
you touch a page in that mapped region.  I'd argue it's an application
bug; one should enable 

[PATCH v12 00/13] support "task_isolation" mode

2016-04-05 Thread Chris Metcalf
Here is a respin of the task-isolation patch set.  The previous one
came out just before the merge window for 4.6 opened, so I suspect
folks may have been busy merging, since it got few comments.

Frederic, how are you feeling about taking this all via your tree?
And what is your take on the new PR_TASK_ISOLATION_ONE_SHOT mode?
I'm not sure what the right path to upstream for this series is.

Changes since v11:

- Rebased on v4.6-rc1.  This required me to create a
  can_stop_my_full_tick() helper in tick-sched.c, since the underlying
  can_stop_full_tick() now takes a struct tick_sched.

- Added a HAVE_ARCH_TASK_ISOLATION Kconfig flag so that you can't
  try to build with TASK_ISOLATION enabled for an architecture until
  it is explicitly configured to work.  This avoids possible
  allyesconfig build failures for unsupported architectures, or even
  for supported ones when bisecting to the middle of this series.

- Return EAGAIN instead of EINVAL for the enabling prctl() if the task
  is affinitized to a task-isolation core, but things just aren't yet
  right for it (e.g. another task running).  This lets the caller
  differentiate a potentially transient failure from a permanent
  failure, for which we still return EINVAL.
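
  A caller can thus retry the transient case; a sketch, using
  <err.h>, <errno.h>, <sched.h>, and <sys/prctl.h>:

	while (prctl(PR_SET_TASK_ISOLATION,
		     PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
		if (errno != EAGAIN)	/* EINVAL: permanently unavailable */
			err(1, "task isolation");
		sched_yield();		/* e.g. wait out a competing task */
	}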

The previous (v11) patch series is here:

https://lkml.kernel.org/r/1457734223-26209-1-git-send-email-cmetc...@mellanox.com

This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag
  arm, tile: turn off timer tick for oneshot_stopped state
  arch/x86: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality

 Documentation/kernel-parameters.txt|  16 ++
 arch/arm64/Kconfig |   1 +
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S  |  12 +-
 arch/arm64/kernel/ptrace.c |  15 +-
 arch/arm64/kernel/signal.c |  42 -
 arch/arm64/kernel/smp.c|   2 +
 arch/arm64/mm/fault.c  |   4 +
 arch/tile/Kconfig  |   1 +
 arch/tile/include/asm/thread_info.h|   4 +-
 arch/tile/kernel/process.c |   9 +
 arch/tile/kernel/ptrace.c  |   7 +
 arch/tile/kernel/single_step.c |   5 +
 arch/tile/kernel/smp.c |  28 +--
 arch/tile/kernel/time.c|   1 +
 arch/tile/kernel/unaligned.c   |   3 +
 arch/tile/mm/fault.c   |   3 +
 arch/tile/mm/homecache.c   |   2 +
 arch/x86/Kconfig   |   1 +
 arch/x86/entry/common.c|  18 +-
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/kernel/traps.c|   2 +
 arch/x86/mm/fault.c|   2 +
 drivers/base/cpu.c |  18 ++
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h  |  63 +++
 include/linux/sched.h  |   3 +
 include/linux/swap.h   |   1 +
 include/linux/tick.h   |   2 +
 include/linux/vmstat.h |   4 +
 include/uapi/linux/prctl.h |   9 +
 init/Kconfig   |  33 
 kernel/Makefile|   1 +
 kernel/fork.c  |   3 +
 kernel/irq_work.c  |   5 +-
 kernel/isolation.c | 316 +
 kernel/sched/core.c|  18 ++
 kernel/signal.c|   8 +
 kernel/smp.c   |   6 +-
 kernel/softirq.c   |  33 
 kernel/sys.c   |   9 +
 kernel/time/tick-sched.c   |  36 ++--
 mm/swap.c  |  15 +-
 mm/vmstat.c|  21 +++
 45 files changed, 743 insertions(+), 54 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

-- 
2.7.2



[PATCH v12 08/13] task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag

2016-04-05 Thread Chris Metcalf
When this flag is set by the initial prctl(), the semantics of task
isolation change to be "one-shot", i.e. as soon as the kernel is
re-entered for any reason, task isolation is turned off.

During application development, use of this flag is best coupled with
STRICT mode, since otherwise any bug (e.g. an munmap from another
thread in the same task causing an IPI TLB flush) could cause the
task to fall out of task isolation mode without being aware of it.

In production it is typically still best to use STRICT mode, with
a signal handler that will report violations of task isolation
up to the application layer.  However, if you are confident the
application will never fall out of task isolation mode, you may
wish to use ONE_SHOT mode to allow switching from userspace task
isolation mode to using the kernel freely, without the small extra
penalty of explicitly invoking prctl() to turn task isolation off
before starting to use kernel services.
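
For illustration, a development-time invocation might then look like
(sketch, using the flag bits defined below plus an application-chosen
signal):

	prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE |
	      PR_TASK_ISOLATION_STRICT | PR_TASK_ISOLATION_ONE_SHOT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);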

Signed-off-by: Chris Metcalf 
---
 include/uapi/linux/prctl.h | 1 +
 kernel/isolation.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a5582ace987f..1e204f1a0f4a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,6 +202,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
 # define PR_TASK_ISOLATION_STRICT  (1 << 1)
+# define PR_TASK_ISOLATION_ONE_SHOT   (1 << 2)
 # define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
 # define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 1c4f320a24a0..d0e94505bfac 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -205,7 +205,11 @@ void _task_isolation_exception(const char *fmt, ...)
va_end(args);
 
task_isolation_interrupt(task, buf);
+   return;
}
+
+   if (task->task_isolation_flags & PR_TASK_ISOLATION_ONE_SHOT)
+   task_isolation_set_flags(task, 0);
 }
 
 /*
@@ -229,6 +233,9 @@ int task_isolation_syscall(int syscall)
return -1;
}
 
+   if (task->task_isolation_flags & PR_TASK_ISOLATION_ONE_SHOT)
+   task_isolation_set_flags(task, 0);
+
return 0;
 }
 
-- 
2.7.2



[PATCH v12 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-04-05 Thread Chris Metcalf
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Signed-off-by: Chris Metcalf 
---
 init/Kconfig   | 10 ++
 kernel/isolation.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 767f37bc3391..b2717e505157 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -805,6 +805,16 @@ config TASK_ISOLATION
 You should say "N" unless you are intending to run a
 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.
+
 config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 282a34ecb22a..b364182dd8e2 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -40,8 +40,14 @@ int __init task_isolation_init(void)
 {
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+   alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+   cpumask_copy(task_isolation_map, cpu_possible_mask);
+   cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}
 
/*
-- 
2.7.2



[PATCH v12 07/13] task_isolation: add debug boot flag

2016-04-05 Thread Chris Metcalf
The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel, and if they do, we either notify the
process (if STRICT mode is set and the interrupt is not an NMI)
or emit a kernel stack dump on the console (otherwise).

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.
Additionally, delivering a signal to the process in STRICT mode
allows applications to report up task isolation failures into their
own application logging framework.
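
For example, one might boot an 8-core system with (illustrative
values):

	task_isolation=1-7 task_isolation_debug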

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt|  8 
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h  |  5 +++
 kernel/irq_work.c  |  5 ++-
 kernel/isolation.c | 77 ++
 kernel/sched/core.c| 18 
 kernel/signal.c|  4 ++
 kernel/smp.c   |  6 ++-
 kernel/softirq.c   | 33 +++
 9 files changed, 160 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9bd5e91357b1..7884e69d08fa 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3816,6 +3816,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.
 
+   task_isolation_debug[KNL]
+   In kernels built with CONFIG_TASK_ISOLATION
+   and booted in task_isolation= mode, this
+   setting will generate console backtraces when
+   the kernel is about to interrupt a task that
+   has requested PR_TASK_ISOLATION_ENABLE and is
+   running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+   return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index eb78175ed811..f04252c51cf1 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -44,6 +44,9 @@ extern void _task_isolation_exception(const char *fmt, ...);
_task_isolation_exception(fmt, ## __VA_ARGS__); \
} while (0)
 
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -53,6 +56,8 @@ extern inline void task_isolation_set_flags(struct task_struct *p,
unsigned int flags) { }
 static inline int task_isolation_syscall(int nr) { return 0; }
 static inline void task_isolation_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/isolation.h>
 #include 
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;
 
-   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+  

[PATCH v12 04/13] task_isolation: add initial support

2016-04-05 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.
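
Schematically, the integration in an arch's return-to-user path looks
like this (a simplified sketch modeled on the x86 patch in this
series; EXIT_TO_USERMODE_LOOP_FLAGS is the usual TIF work mask):

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
	do {
		local_irq_enable();

		if (cached_flags & _TIF_NEED_RESCHED)
			schedule();

		/* quiesce timers, vmstat, LRU caches, etc. */
		if (cached_flags & _TIF_TASK_ISOLATION)
			task_isolation_enter();

		local_irq_disable();
		cached_flags = READ_ONCE(current_thread_info()->flags);

		/* loop until no work remains and isolation is ready */
	} while ((cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) ||
		 ((cached_flags & _TIF_TASK_ISOLATION) &&
		  !task_isolation_ready()));
}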

As a result of these tests on the "return to userspace" path, system
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, tile,
and arm64.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c  |  18 +
 include/linux/isolation.h   |  48 +++
 include/linux/sched.h   |   3 +
 include/linux/tick.h|   2 +
 include/uapi/linux/prctl.h  |   5 ++
 init/Kconfig|  23 ++
 kernel/Makefile |   1 +
 kernel/fork.c   |   3 +
 kernel/isolation.c  | 153 
 kernel/signal.c |   4 +
 kernel/sys.c|   9 +++
 kernel/time/tick-sched.c|  36 ++---
 13 files changed, 300 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ecc74fa4bfde..9bd5e91357b1 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3808,6 +3808,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation= [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION=y, set
+   the specified list of CPUs where cpus will be able
+   to use prctl(PR_SET_TASK_ISOLATION) to set up task
+   isolation mode.  Setting this boot flag implicitly
+   also sets up nohz_full and isolcpus mode for the
+   listed set of cpus.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/isolation.h>
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(noh

[PATCH v12 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode

2016-04-05 Thread Chris Metcalf
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry generates a signal.  For system
calls, this test is performed immediately before the SECCOMP test
and causes the syscall to return immediately with ENOSYS.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.

Signed-off-by: Chris Metcalf 
---
 include/linux/isolation.h  | 10 +++
 include/uapi/linux/prctl.h |  3 ++
 kernel/isolation.c | 73 ++
 3 files changed, 86 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 99b909462e64..eb78175ed811 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -36,6 +36,14 @@ static inline void task_isolation_set_flags(struct task_struct *p,
clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
 }
 
+extern int task_isolation_syscall(int nr);
+extern void _task_isolation_exception(const char *fmt, ...);
+#define task_isolation_exception(fmt, ...) \
+   do {\
+   if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+   _task_isolation_exception(fmt, ## __VA_ARGS__); \
+   } while (0)
+
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -43,6 +51,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 extern inline void task_isolation_set_flags(struct task_struct *p,
unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION  48
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_STRICT  (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index b364182dd8e2..f44e90109472 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -157,3 +159,74 @@ void task_isolation_enter(void)
if (!tick_nohz_tick_stopped())
set_tsk_need_resched(current);
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+   siginfo_t info = {};
+   int sig;
+
+   pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+   task->comm, task->pid, buf);
+
+   /* Get the signal number to use. */
+   sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+   if (sig == 0)
+   sig = SIGKILL;
+   info.si_signo = sig;
+
+   /*
+* Turn off task isolation mode entirely to avoid spamming
+* the process with signals.  It can re-enable task isolation
+* mode in the signal handler if it wants to.
+*/
+   task_isolation_set_flags(task, 0);
+
+   send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_exception(const char *fmt, ...)
+{
+   struct task_struct *task = current;
+
+   /* RCU should have been enabled prior to this point. */
+   RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+   if (task->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+   va_list args;
+   char buf[100];
+
+   va_start(args, fmt);
+   vsnprintf(buf, sizeof(buf), fmt, args);
+   va_end(args);
+
+   task_isolation_interrupt(task, buf);
+   }
+}
+
+/*
+ * Th

[PATCH v11 00/13] support "task_isolation" mode

2016-03-11 Thread Chris Metcalf
Here is a respin of the task-isolation patch set, folding in
comments from Frederic Weisbecker, Will Deacon, Andy Lutomirski,
Kees Cook and others.

Changes since v10:

- In the API, I added a new PR_TASK_ISOLATION_ONE_SHOT flag to
  implement the semantics that Frederic had requested.  It remains to
  be seen whether it makes sense to: leave this as a dynamic flag; back
  out the change and remove the flag and leave the semantics always
  "persistent" (as before); or remove the flag and make the semantics
  always one-shot.  I tend to favor removing the flag and keeping the
  semantics persistent, but having it as a flag provides a specific
  implementation to let us think about the tradeoffs.

- I added a TIF_TASK_ISOLATION flag to clarify and simplify the tests for
  whether task isolation is currently enabled.  We remove the previous
  inline wrappers for task_isolation_ready/enter() and just call the
  real functions unconditionally if TIF_TASK_ISOLATION is set, and
  similarly simplify the task_isolation_syscall/exception() helpers.

- I added a task_isolation_set_flags() helper to set or clear
  TIF_TASK_ISOLATION as needed; it also allows me to get rid of the
  #ifdefs in signal.c and fork.c, which is a nice plus.

- The initial prctl() to enable task isolation now also checks
  can_stop_full_tick() to look for additional potential problems when
  starting up task isolation (other schedulable tasks or POSIX cpu
  timers being the two most obvious examples).  The function is now no
  longer static in kernel/time/tick-sched.c.

- I expanded the existing comment justifying calling
  set_tsk_need_resched() if dynticks are still running when a task
  isolation task wants to enter userspace.  As mentioned in my reply
  to Frederic, I still consider it an open question whether we should
  do some form of struct notification type work here, but on balance I
  think it's overcomplicated to do so.

- We now make sure to clear task isolation when delivering a signal,
  since by definition signals pretty much mean you've lost task
  isolation, it's a well-defined semantic to provide to userspace, and
  it means we can always deliver the signal for STRICT mode saying we
  were interrupted.  Also, doing this is necessary to catch more of the
  cases where we clear task isolation mode for the new ONE_SHOT mode.

- For STRICT mode, I moved the setting of the attempted syscall's return
  value to the generic code via the syscall_set_return_value()
  function.  I also restructured the code slightly to make it
  easier to add ONE_SHOT support in a following patch.  On Kees Cook's
  advice I continue to just support the simple TIF_TASK_ISOLATION check
  in syscall entry that calls out to a few lines of C code, but there is
  an ongoing conversation with Andy Lutomirski about using a proposed
  seccomp() extension to guard syscall entry instead.

- The arch/arm64 patch to factor the work_pending state machine into C was
  updated to include the arch/arm call to trace_hardirqs_off() at the
  top.  Will Deacon noticed that we were missing this support.  I also
  restructured the loop as a do/while at his suggestion, rather than
  copying the x86 while(true)/break idiom.

- Changed the S-O-B lines from ezchip.com to mellanox.com.

The previous (v10) patch series is here:

https://lkml.kernel.org/r/1456949376-4910-1-git-send-email-cmetc...@ezchip.com

This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (13):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag
  arm, tile: turn off timer tick for oneshot_stopped state
  arch/x86: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality

 Documentation/kernel-parameters.txt|  16 ++
 arch/arm64/include/asm/thread_info.h   |   5 +-
 arch/arm64/kernel/entry.S  |  12 +-
 arch/arm64/kernel/ptrace.c |  15 +-
 arch/arm64/kernel/signal.c |  42 -
 arch/arm64/kernel/smp.c|   2 +
 arch/arm64/mm/fault.c  |   4 +
 arch/tile/include/asm/thread_info.h|   4 +-
 arch/tile/kernel/process.c |   9 +
 arch/tile/kernel/ptrace.c  |   7 +
 arch/tile/kernel/single_step.

[PATCH v11 06/13] task_isolation: support PR_TASK_ISOLATION_STRICT mode

2016-03-11 Thread Chris Metcalf
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry generates a signal.  For system
calls, this test is performed immediately before the SECCOMP test
and causes the syscall to return immediately with ENOSYS.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.

To allow the state to be entered and exited, the syscall checking
test ignores the prctl() syscall so that we can clear the bit again
later, and ignores exit/exit_group to allow exiting the task without
a pointless signal killing you as you try to do so.

Signed-off-by: Chris Metcalf 
---
 include/linux/isolation.h  | 10 +++
 include/uapi/linux/prctl.h |  3 ++
 kernel/isolation.c | 73 ++
 3 files changed, 86 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 99b909462e64..9202ced4511c 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -36,6 +36,14 @@ static inline void task_isolation_set_flags(struct task_struct *p,
clear_tsk_thread_flag(p, TIF_TASK_ISOLATION);
 }
 
+extern int task_isolation_syscall(int nr);
+extern void _task_isolation_exception(const char *fmt, ...);
+#define task_isolation_exception(fmt, ...) \
+   do {\
+   if (current_thread_info()->flags & _TIF_TASK_ISOLATION) \
+   _task_isolation_exception(fmt, ## __VA_ARGS__); \
+   } while (0)
+
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -43,6 +51,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 extern inline void task_isolation_set_flags(struct task_struct *p,
unsigned int flags) { }
+static inline int task_isolation_syscall(int nr) { return 0; }
+static inline void task_isolation_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION  48
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_STRICT  (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index f59dcf06ba56..e1d844b5c567 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -154,3 +156,74 @@ void task_isolation_enter(void)
if (!tick_nohz_tick_stopped())
set_tsk_need_resched(current);
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+   siginfo_t info = {};
+   int sig;
+
+   pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+   task->comm, task->pid, buf);
+
+   /* Get the signal number to use. */
+   sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+   if (sig == 0)
+   sig = SIGKILL;
+   info.si_signo = sig;
+
+   /*
+* Turn off task isolation mode entirely to avoid spamming
+* the process with signals.  It can re-enable task isolation
+* mode in the signal handler if it wants to.
+*/
+   task_isolation_set_flags(task, 0);
+
+   send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception that doesn't
+ * otherwise trigger a signal to the user process (e.g. simple page fault).
+ */
+void _task_isolation_exception(const char *fmt, ...)
+{
+   struct task_struct *task = current;
+
+   /* RCU should have been enabled prior to this point. */
+   RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+   if (task->task_isolation_flags & PR_TASK_ISOLATION_STRICT) {
+   va_list args;
+   char buf[100];
+
+   va_start(args, fmt);
+   vsnprintf(buf, sizeof(buf), fmt, args);
+   va_end(args);
+
+   task_isolation_interrupt(task, buf);
+   }
+}
+
+/*
+ * Th

[PATCH v11 05/13] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-03-11 Thread Chris Metcalf
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.

Signed-off-by: Chris Metcalf 
---
 init/Kconfig   | 10 ++
 kernel/isolation.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 6cab348fe454..314b09347fba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -802,6 +802,16 @@ config TASK_ISOLATION
 You should say "N" unless you are intending to run a
 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.
+
 config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index c67f77d217f9..f59dcf06ba56 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -40,8 +40,14 @@ int __init task_isolation_init(void)
 {
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+   alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+   cpumask_copy(task_isolation_map, cpu_possible_mask);
+   cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}
 
/*
-- 
2.7.2



[PATCH v11 04/13] task_isolation: add initial support

2016-03-11 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in
thread_info flags.  When task isolation is enabled for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

A new /sys/devices/system/cpu/task_isolation pseudo-file is added,
parallel to the comparable nohz_full file.

The task_isolation_ready() call is invoked when TIF_TASK_ISOLATION is
set in prepare_exit_to_usermode() or its architectural equivalent,
and forces the loop to retry if the system is not ready.  It is
called with interrupts disabled and inspects the kernel state
to determine if it is safe to return into an isolated state.
In particular, if it sees that the scheduler tick is still enabled,
it reports that it is not yet safe.

Each time through the loop of TIF work to do, if TIF_TASK_ISOLATION
is set, we call the new task_isolation_enter() routine.  This
takes any actions that might avoid a future interrupt to the core,
such as a worker thread being scheduled that could be quiesced now
(e.g. the vmstat worker) or a future IPI to the core to clean up some
state that could be cleaned up now (e.g. the mm lru per-cpu cache).
In addition, it requests rescheduling if the scheduler dyntick is
still running.

As a result of these tests on the "return to userspace" path, sys
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, tile,
and arm64.
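
For reference, basic usage from userspace looks roughly like the
sketch below (not part of the patch; it assumes CPU 1 was listed in
the task_isolation= boot argument, with fallback defines for the
prctl bits):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION    48
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* The task must be affinitized to a single task_isolation core;
	 * CPU 1 here is purely an assumption for the example. */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Returns to userspace only once kernel state is quiesced. */
	if (prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE,
		  0, 0, 0) < 0) {
		perror("prctl");
		return 1;
	}

	/* ... interrupt-free userspace work ... */
	return 0;
}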

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt |   8 ++
 drivers/base/cpu.c  |  18 +
 include/linux/isolation.h   |  48 
 include/linux/sched.h   |   3 +
 include/linux/tick.h|   2 +
 include/uapi/linux/prctl.h  |   5 ++
 init/Kconfig|  20 +
 kernel/Makefile |   1 +
 kernel/fork.c   |   3 +
 kernel/isolation.c  | 150 
 kernel/signal.c |   3 +
 kernel/sys.c|   9 +++
 kernel/time/tick-sched.c|  33 
 13 files changed, 289 insertions(+), 14 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9a53c929f017..c8d0b42d984a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3747,6 +3747,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation= [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION=y, set
+   the specified list of CPUs on which tasks will be able
+   to use prctl(PR_SET_TASK_ISOLATION) to set up task
+   isolation mode.  Setting this boot flag implicitly
+   also sets up nohz_full and isolcpus mode for the
+   listed set of cpus.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(noh

[PATCH v11 08/13] task_isolation: add PR_TASK_ISOLATION_ONE_SHOT flag

2016-03-11 Thread Chris Metcalf
When this flag is set by the initial prctl(), the semantics of task
isolation change to be "one-shot", i.e. as soon as the kernel is
re-entered for any reason, task isolation is turned off.

During application development, use of this flag is best coupled with
STRICT mode, since otherwise any bug (e.g. an munmap from another
thread in the same task causing an IPI TLB flush) could cause the
task to fall out of task isolation mode without being aware of it.

In production it is typically still best to use STRICT mode, with
a signal handler that will report violations of task isolation
up to the application layer.  However, if you are confident the
application will never fall out of task isolation mode, you may
wish to use ONE_SHOT mode to switch from userspace task isolation
to using the kernel freely, without the small extra penalty of
invoking prctl() explicitly to turn task isolation off before
starting to use kernel services.
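
A sketch of that ONE_SHOT flow (illustrative only; the fallback
defines mirror the prctl.h hunk below, and the task is assumed to
already be on a task_isolation core):

#include <sys/prctl.h>
#include <sys/time.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION      48
#define PR_TASK_ISOLATION_ENABLE   (1 << 0)
#define PR_TASK_ISOLATION_ONE_SHOT (1 << 2)
#endif

int main(void)
{
	struct timeval tv;

	if (prctl(PR_SET_TASK_ISOLATION,
		  PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_ONE_SHOT,
		  0, 0, 0) < 0)
		return 1;

	/* ... isolated phase: no kernel entries at all ... */

	/* The first kernel re-entry (this syscall) silently ends
	 * isolation, so no explicit prctl() is needed to turn it off. */
	gettimeofday(&tv, NULL);

	/* ... kernel services may now be used freely ... */
	return 0;
}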

Signed-off-by: Chris Metcalf 
---
 include/uapi/linux/prctl.h | 1 +
 kernel/isolation.c | 7 +++
 2 files changed, 8 insertions(+)

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a5582ace987f..1e204f1a0f4a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -202,6 +202,7 @@ struct prctl_mm_map {
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
 # define PR_TASK_ISOLATION_STRICT  (1 << 1)
+# define PR_TASK_ISOLATION_ONE_SHOT(1 << 2)
 # define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
 # define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
diff --git a/kernel/isolation.c b/kernel/isolation.c
index db281dee7d7e..d94a137e0349 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -202,7 +202,11 @@ void _task_isolation_exception(const char *fmt, ...)
va_end(args);
 
task_isolation_interrupt(task, buf);
+   return;
}
+
+   if (task->task_isolation_flags & PR_TASK_ISOLATION_ONE_SHOT)
+   task_isolation_set_flags(task, 0);
 }
 
 /*
@@ -226,6 +230,9 @@ int task_isolation_syscall(int syscall)
return -1;
}
 
+   if (task->task_isolation_flags & PR_TASK_ISOLATION_ONE_SHOT)
+   task_isolation_set_flags(task, 0);
+
return 0;
 }
 
-- 
2.7.2



[PATCH v11 07/13] task_isolation: add debug boot flag

2016-03-11 Thread Chris Metcalf
The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel, and if they do, we notify either the
process (if STRICT mode is set and the interrupt is not an NMI)
or with a kernel stack dump on the console (otherwise).

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.
Additionally, delivering a signal to the process in STRICT mode
allows applications to report up task isolation failures into their
own application logging framework.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt|  8 
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h  |  5 +++
 kernel/irq_work.c  |  5 ++-
 kernel/isolation.c | 77 ++
 kernel/sched/core.c| 18 
 kernel/signal.c|  5 +++
 kernel/smp.c   |  6 ++-
 kernel/softirq.c   | 33 +++
 9 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c8d0b42d984a..ea0434fa906e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3755,6 +3755,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.
 
+   task_isolation_debug    [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION
+   and booted in task_isolation= mode, this
+   setting will generate console backtraces when
+   the kernel is about to interrupt a task that
+   has requested PR_TASK_ISOLATION_ENABLE and is
+   running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+   return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index 9202ced4511c..560608ae72d0 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -44,6 +44,9 @@ extern void _task_isolation_exception(const char *fmt, ...);
_task_isolation_exception(fmt, ## __VA_ARGS__); \
} while (0)
 
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
@@ -53,6 +56,8 @@ extern inline void task_isolation_set_flags(struct task_struct *p,
unsigned int flags) { }
 static inline int task_isolation_syscall(int nr) { return 0; }
 static inline void task_isolation_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;
 
-   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+  

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-03-09 Thread Chris Metcalf
led, and interrupts are not then re-enabled before
return to userspace.  Anything else is just keeping your fingers
crossed and guessing.


  TL;DR: Returning -EBUSY from prctl() isn't really that helpful.
  =

Frederic wonders if we can test for various things not being ready
(dynticks not off yet, etc) and just return -EBUSY and let userspace
do the spinning.

First, note that this is only possible for one-shot mode.  For
persistent mode, we have the potential to run up against this on
return from any syscall, and we obviously can't add new error returns
to other syscalls.  So it doesn't really make sense to add EBUSY
semantics to prctl if nothing else can use it.

But even in one-shot mode, I'm not really sure what the advantage is
here.  We still need to do something like task_isolation_ready() in
the prepare_exit_to_usermode() loop, since that's where we have
interrupts disabled and can do a final assessment of the state of the
kernel for this core.  So, while you could imagine having that code
just hook in and call syscall_set_return_value() there instead of
causing things to loop back, that doesn't really save us much
complexity in the kernel, and instead pushes complexity back to
userspace, which may well handle it just by busywaiting on the prctl()
anyway.  You might argue that if we just return to userspace, userspace
can sleep briefly and retry, thus avoiding spinning in the scheduler.
But it's relatively easy to do that (or better) in the kernel, so I'm
not sure that's more than a straw man.  See the next point.
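
For concreteness, the userspace side of that -EBUSY proposal would be
roughly the following sketch (hypothetical: the posted series loops in
the kernel instead of returning an error, and the fallback defines are
only for illustration):

#include <errno.h>
#include <time.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION    48
#define PR_TASK_ISOLATION_ENABLE (1 << 0)
#endif

static int enable_isolation(void)
{
	struct timespec ts = { 0, 100 * 1000 };		/* 100 us back-off */

	while (prctl(PR_SET_TASK_ISOLATION,
		     PR_TASK_ISOLATION_ENABLE, 0, 0, 0) < 0) {
		if (errno != EBUSY)
			return -1;		/* a real failure */
		nanosleep(&ts, NULL);		/* sleep briefly, then retry */
	}
	return 0;				/* isolation engaged */
}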


  TL;DR: Should we arrange to actually use a completion in
  task_isolation_enter when dynticks are ticking, and call complete()
  in tick-sched.c when we shut down dynticks, or, just spin in
  schedule() and not worry about burning a little cpu?
  =

One question that keeps getting asked is how useful it is to just call
schedule() while we're waiting for dynticks to shut off, since it
could just be a busy spin going into schedule() over and over.  Even
if another task is ready to run we might not switch to it right away.
So one thing we could think about is arranging so that whenever we
turn off dynticks, we also notify any tasks that were waiting for it
to be turned off; that way we can just sleep in task_isolation_enter()
and wait to be notified, thus guaranteeing any other task that wants
to run can run, or even just waiting in cpu idle for a little while.
Does this seem like it's worth coding up?  My impression has always
been that we wait pretty briefly for dynticks to shut down, so it
doesn't really matter if we spin - and even if we do spin, in
principle we already arranged for this cpu to be dedicated to this
task anyway, so it doesn't really do anything bad except maybe burn a
little bit of extra cpu power.  But I'm willing to be convinced...


  TL;DR: We should turn off task isolation mode for signals.
  =

One thing that occurs to me is that we should arrange so that
any signal delivery turns off task isolation mode.  This is
easily documented semantics even in persistent mode, and it
allows the userspace program to run and discover that something bad
has happened, rather than potentially hanging in the kernel trying to
wait for isolation to be possible before calling the signal handler.
I'll make this change for v11 in any case.

Also, doing this is something of a requirement for the proposed
one-shot mode, since if we have STRICT mode enabled, then any entry
into the kernel is either a syscall, or else ends up causing a signal,
and by hooking the signal mechanism we have a place to catch all the
non-syscall entrypoints, more or less.
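
To make those semantics concrete, a handler along these lines could
report the violation and then opt back in (a sketch only; it assumes
the PR_* bits from the STRICT patch and that SIGUSR1 was chosen via
PR_TASK_ISOLATION_SET_SIG):

#include <signal.h>
#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION          48
#define PR_TASK_ISOLATION_ENABLE       (1 << 0)
#define PR_TASK_ISOLATION_STRICT       (1 << 1)
#define PR_TASK_ISOLATION_SET_SIG(sig) (((sig) & 0x7f) << 8)
#endif

static void isolation_lost(int sig)
{
	/* The kernel cleared our isolation flags before signalling,
	 * so it is safe to log the event and then opt back in. */
	static const char msg[] = "task isolation violated\n";

	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	prctl(PR_SET_TASK_ISOLATION,
	      PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT |
	      PR_TASK_ISOLATION_SET_SIG(SIGUSR1), 0, 0, 0);
}

/* Installed once at startup, e.g.: signal(SIGUSR1, isolation_lost); */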


  TL;DR: Maybe we should use seccomp for STRICT mode syscall detection.
  =

This is being investigated in a separate email thread with Andy
Lutomirski.  Whether it gets included in v11 is still TBD.


  TL;DR: Various minor issues in answer to Frederic's comments :-)
  =

On 03/04/2016 07:56 AM, Frederic Weisbecker wrote:

On Thu, Feb 11, 2016 at 02:24:25PM -0500, Chris Metcalf wrote:

We don't want to wait for preemption points or interrupts, and there are
no other voluntary reschedules in the prepare_exit_to_usermode() loop.

If the other task had been woken up for some completion, then yes we would
already have had TIF_RESCHED set, but if the other runnable task was (for
example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
this point, and thus we might need to call schedule() explicitly.


There can't be another task in the runqueue waiting to be preempted since
we (the current task) are running on the CPU.


My earlier sentence may not have been clear.  By saying "if the other
runnable task was pre-empted on a timer tick", I meant that
TIF_RESCHED wasn't set on our task, and we'd only eventually schedule
to that other task once a timer interrupt fired and ended our
scheduler slice. 

Re: [PATCH v10 05/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-03-03 Thread Chris Metcalf

On 03/03/2016 01:34 PM, Andi Kleen wrote:

Chris Metcalf  writes:
  
+config TASK_ISOLATION_ALL

+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.

That seems like a very dangerous Kconfig option.
"CONFIG_BREAK_EVERYTHING"
If someone sets that by default they will have a lot of trouble.

I wouldn't add that, make it a run time option only.


So you were thinking, allow a special boot syntax "task_isolation=all",
which puts all the cores into task isolation mode except the boot core?

My original argument was that it was so parallel to the existing
CONFIG_NO_HZ_FULL_ALL option that it just made sense to do it,
and some testers complained about having to specify the precise
cpu range, so this seemed like an easy fix.

The commit comment for NO_HZ_FULL_ALL (f98823ac758ba) reads:

    nohz: New option to default all CPUs in full dynticks range

    Provide a new kernel config that defaults all CPUs to be part
    of the full dynticks range, except the boot one for timekeeping.

    This default setting is overridden by the nohz_full= boot option
    if passed by the user.

    This is helpful for those who don't need a finegrained range
    of full dynticks CPU and also for automated testing.

The same arguments would seem to apply to TASK_ISOLATION_ALL;
note that applications don't actually go into task isolation mode
without issuing the appropriate prctl(), so it shouldn't be too
dangerous if users enable it by mistake.  There will be some
extra checks at kernel entry and exit, that's all.

So on balance it still seems like a reasonable choice.  Thoughts?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [PATCH v10 07/12] task_isolation: add debug boot flag

2016-03-02 Thread Chris Metcalf

On 3/2/2016 3:37 PM, Peter Zijlstra wrote:

On Wed, Mar 02, 2016 at 03:09:31PM -0500, Chris Metcalf wrote:

+void task_isolation_debug(int cpu)
+{
+   struct task_struct *p;
+
+   if (!task_isolation_possible(cpu))
+   return;
+
+   rcu_read_lock();
+   p = cpu_curr(cpu);
+   get_task_struct(p);

As I think Oleg keeps reminding me, this is not actually a safe thing to
do.


So what's the right solution?  The fast path in task_isolation_debug_task 
basically
just uses the new "task_isolation_flags", and "pid" and "comm".  I would think 
those
would all have to be safe because of the get_task_struct().

The piece that might be problematic is the eventual call to send_sig_info() 
using the
task_struct pointer (called via task_isolation_debug_task -> 
task_isolation_interrupt).
Clearly this is safe at some level, since that's more or less what sys_kill() 
does and the
process could similarly evaporate half way through sending the signal.

Suggestions?  Thanks!

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



[PATCH v10 07/12] task_isolation: add debug boot flag

2016-03-02 Thread Chris Metcalf
The new "task_isolation_debug" flag simplifies debugging
of TASK_ISOLATION kernels when processes are running in
PR_TASK_ISOLATION_ENABLE mode.  Such processes should get no
interrupts from the kernel, and if they do, when this boot flag is
specified a kernel stack dump on the console is generated.

It's possible to use ftrace to simply detect whether a task_isolation
core has unexpectedly entered the kernel.  But what this boot flag
does is allow the kernel to provide better diagnostics, e.g. by
reporting in the IPI-generating code what remote core and context
is preparing to deliver an interrupt to a task_isolation core.

It may be worth considering other ways to generate useful debugging
output rather than console spew, but for now that is simple and direct.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt|  8 
 include/linux/context_tracking_state.h |  6 +++
 include/linux/isolation.h  |  5 +++
 kernel/irq_work.c  |  5 ++-
 kernel/isolation.c | 77 ++
 kernel/sched/core.c| 18 
 kernel/signal.c|  5 +++
 kernel/smp.c   |  6 ++-
 kernel/softirq.c   | 33 +++
 9 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index c8d0b42d984a..ea0434fa906e 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3755,6 +3755,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
also sets up nohz_full and isolcpus mode for the
listed set of cpus.
 
+   task_isolation_debug    [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION
+   and booted in task_isolation= mode, this
+   setting will generate console backtraces when
+   the kernel is about to interrupt a task that
+   has requested PR_TASK_ISOLATION_ENABLE and is
+   running on a task_isolation core.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
index 1d34fe68f48a..4e2c4b900b82 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,8 +39,14 @@ static inline bool context_tracking_in_user(void)
 {
return __this_cpu_read(context_tracking.state) == CONTEXT_USER;
 }
+
+static inline bool context_tracking_cpu_in_user(int cpu)
+{
+   return per_cpu(context_tracking.state, cpu) == CONTEXT_USER;
+}
 #else
 static inline bool context_tracking_in_user(void) { return false; }
+static inline bool context_tracking_cpu_in_user(int cpu) { return false; }
 static inline bool context_tracking_active(void) { return false; }
 static inline bool context_tracking_is_enabled(void) { return false; }
 static inline bool context_tracking_cpu_is_enabled(void) { return false; }
diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index ba6c4d510db8..f1ae7b663746 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -45,6 +45,9 @@ static inline void task_isolation_enter(void)
 extern bool task_isolation_syscall(int nr);
 extern void task_isolation_exception(const char *fmt, ...);
 extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+extern void task_isolation_debug(int cpu);
+extern void task_isolation_debug_cpumask(const struct cpumask *);
+extern void task_isolation_debug_task(int cpu, struct task_struct *p);
 
 static inline bool task_isolation_strict(void)
 {
@@ -73,6 +76,8 @@ static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
 static inline bool task_isolation_check_syscall(int nr) { return false; }
 static inline void task_isolation_check_exception(const char *fmt, ...) { }
+static inline void task_isolation_debug(int cpu) { }
+#define task_isolation_debug_cpumask(mask) do {} while (0)
 #endif
 
 #endif
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index bcf107ce0854..a9b95ce00667 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 
@@ -75,8 +76,10 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
if (!irq_work_claim(work))
return false;
 
-   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu)))
+   if (llist_add(&work->llnode, &per_cpu(raised_list, cpu))) {
+   task_isolation_debug(cpu);
arch_send_call_function_single_ipi(cpu);
+   }
 
return true;

[PATCH v10 05/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-03-02 Thread Chris Metcalf
This option, similar to NO_HZ_FULL_ALL, simplifies configuring
a system to boot by default with all cores except the boot core
running in task isolation mode.
---
 init/Kconfig   | 10 ++
 kernel/isolation.c |  6 ++
 2 files changed, 16 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 6cab348fe454..314b09347fba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -802,6 +802,16 @@ config TASK_ISOLATION
 You should say "N" unless you are intending to run a
 high-performance userspace driver or similar task.
 
+config TASK_ISOLATION_ALL
+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
+   depends on TASK_ISOLATION
+   help
+If the user doesn't pass the task_isolation boot option to
+define the range of task isolation CPUs, consider that all
+CPUs in the system are task isolation by default.
+Note the boot CPU will still be kept outside the range to
+handle timekeeping duty, etc.
+
 config BUILD_BIN2C
bool
default n
diff --git a/kernel/isolation.c b/kernel/isolation.c
index e954afd8cce8..42ad7a746a1e 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -40,8 +40,14 @@ int __init task_isolation_init(void)
 {
/* For offstack cpumask, ensure we allocate an empty cpumask early. */
if (!saw_boot_arg) {
+#ifdef CONFIG_TASK_ISOLATION_ALL
+   alloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
+   cpumask_copy(task_isolation_map, cpu_possible_mask);
+   cpumask_clear_cpu(smp_processor_id(), task_isolation_map);
+#else
zalloc_cpumask_var(&task_isolation_map, GFP_KERNEL);
return 0;
+#endif
}
 
/*
-- 
2.1.2



[PATCH v10 06/12] task_isolation: support PR_TASK_ISOLATION_STRICT mode

2016-03-02 Thread Chris Metcalf
With task_isolation mode, the task is in principle guaranteed not to
be interrupted by the kernel, but only if it behaves.  In particular,
if it enters the kernel via system call, page fault, or any of a
number of other synchronous traps, it may be unexpectedly exposed
to long latencies.  Add a simple flag that puts the process into
a state where any such kernel entry is fatal; this is defined as
happening immediately before the SECCOMP test.

By default, the task is signalled with SIGKILL, but we add prctl()
bits to support requesting a specific signal instead.

To allow the state to be entered and exited, we ignore the prctl()
syscall so that we can clear the bit again later, and we ignore
exit/exit_group to allow exiting the task without a pointless signal
killing you as you try to do so.

Signed-off-by: Chris Metcalf 
---
 include/linux/isolation.h  | 25 +++
 include/uapi/linux/prctl.h |  3 +++
 kernel/isolation.c | 60 ++
 3 files changed, 88 insertions(+)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index c564cf1886bb..ba6c4d510db8 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -42,12 +42,37 @@ static inline void task_isolation_enter(void)
_task_isolation_enter();
 }
 
+extern bool task_isolation_syscall(int nr);
+extern void task_isolation_exception(const char *fmt, ...);
+extern void task_isolation_interrupt(struct task_struct *, const char *buf);
+
+static inline bool task_isolation_strict(void)
+{
+   return ((current->task_isolation_flags &
+(PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) ==
+   (PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_STRICT)) &&
+   task_isolation_possible(raw_smp_processor_id());
+}
+
+static inline bool task_isolation_check_syscall(int nr)
+{
+   return task_isolation_strict() && task_isolation_syscall(nr);
+}
+
+#define task_isolation_check_exception(fmt, ...)   \
+   do {\
+   if (task_isolation_strict())\
+   task_isolation_exception(fmt, ## __VA_ARGS__);  \
+   } while (0)
+
 #else
 static inline void task_isolation_init(void) { }
 static inline bool task_isolation_possible(int cpu) { return false; }
 static inline bool task_isolation_enabled(void) { return false; }
 static inline bool task_isolation_ready(void) { return true; }
 static inline void task_isolation_enter(void) { }
+static inline bool task_isolation_check_syscall(int nr) { return false; }
+static inline void task_isolation_check_exception(const char *fmt, ...) { }
 #endif
 
 #endif
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 67224df4b559..a5582ace987f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -201,5 +201,8 @@ struct prctl_mm_map {
 #define PR_SET_TASK_ISOLATION  48
 #define PR_GET_TASK_ISOLATION  49
 # define PR_TASK_ISOLATION_ENABLE  (1 << 0)
+# define PR_TASK_ISOLATION_STRICT  (1 << 1)
+# define PR_TASK_ISOLATION_SET_SIG(sig)(((sig) & 0x7f) << 8)
+# define PR_TASK_ISOLATION_GET_SIG(bits) (((bits) >> 8) & 0x7f)
 
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 42ad7a746a1e..5621fdf15b17 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "time/tick-sched.h"
 
 cpumask_var_t task_isolation_map;
@@ -122,3 +123,62 @@ void _task_isolation_enter(void)
if (!tick_nohz_tick_stopped())
set_tsk_need_resched(current);
 }
+
+void task_isolation_interrupt(struct task_struct *task, const char *buf)
+{
+   siginfo_t info = {};
+   int sig;
+
+   pr_warn("%s/%d: task_isolation strict mode violated by %s\n",
+   task->comm, task->pid, buf);
+
+   /* Get the signal number to use before clearing the flags below. */
+   sig = PR_TASK_ISOLATION_GET_SIG(task->task_isolation_flags);
+   if (sig == 0)
+   sig = SIGKILL;
+   info.si_signo = sig;
+
+   /*
+    * Turn off task isolation mode entirely to avoid spamming
+    * the process with signals.  It can re-enable task isolation
+    * mode in the signal handler if it wants to.
+    */
+   task->task_isolation_flags = 0;
+
+   send_sig_info(sig, &info, task);
+}
+
+/*
+ * This routine is called from any userspace exception if the _STRICT
+ * flag is set.
+ */
+void task_isolation_exception(const char *fmt, ...)
+{
+   va_list args;
+   char buf[100];
+
+   /* RCU should have been enabled prior to this point. */
+   RCU_LOCKDEP_WARN(!rcu_is_watching(), "kernel entry without RCU");
+
+   va_start(args, fmt);
+   vsnprintf(buf, sizeof(buf), fmt, args);
+   va_end(args);
+
+   tas

[PATCH v10 04/12] task_isolation: add initial support

2016-03-02 Thread Chris Metcalf
The existing nohz_full mode is designed as a "soft" isolation mode
that makes tradeoffs to minimize userspace interruptions while
still attempting to avoid overheads in the kernel entry/exit path,
to provide 100% kernel semantics, etc.

However, some applications require a "hard" commitment from the
kernel to avoid interruptions, in particular userspace device driver
style applications, such as high-speed networking code.

This change introduces a framework to allow applications
to elect to have the "hard" semantics as needed, specifying
prctl(PR_SET_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so.
Subsequent commits will add additional flags and additional
semantics.

The kernel must be built with the new TASK_ISOLATION Kconfig flag
to enable this mode, and the kernel booted with an appropriate
task_isolation=CPULIST boot argument, which enables nohz_full and
isolcpus as well.  The "task_isolation" state is then indicated by
setting a new task struct field, task_isolation_flag, to the value
passed by prctl().  When the _ENABLE bit is set for a task, and it
is returning to userspace on a task isolation core, it calls the
new task_isolation_ready() / task_isolation_enter() routines to
take additional actions to help the task avoid being interrupted
in the future.

The task_isolation_ready() call plays an equivalent role to the
TIF_xxx flags when returning to userspace, and should be checked
in the loop check of the prepare_exit_to_usermode() routine or its
architecture equivalent.  It is called with interrupts disabled and
inspects the kernel state to determine if it is safe to return into
an isolated state.  In particular, if it sees that the scheduler
tick is still enabled, it reports that it is not yet safe.

Each time through the loop of TIF work to do, we call the new
task_isolation_enter() routine, which takes any actions that might
avoid a future interrupt to the core, such as a worker thread
being scheduled that could be quiesced now (e.g. the vmstat worker)
or a future IPI to the core to clean up some state that could be
cleaned up now (e.g. the mm lru per-cpu cache).  In addition, it
requests rescheduling if the scheduler dyntick is still running.

As a result of these tests on the "return to userspace" path, sys
calls (and page faults, etc.) can be inordinately slow.  However,
this quiescing guarantees that no unexpected interrupts will occur,
even if the application intentionally calls into the kernel.

Separate patches that follow provide these changes for x86, arm64,
and tile.

Signed-off-by: Chris Metcalf 
---
 Documentation/kernel-parameters.txt |   8 +++
 drivers/base/cpu.c  |  18 ++
 include/linux/isolation.h   |  53 
 include/linux/sched.h   |   3 +
 include/linux/tick.h|   1 +
 include/uapi/linux/prctl.h  |   5 ++
 init/Kconfig|  20 ++
 kernel/Makefile |   1 +
 kernel/fork.c   |   3 +
 kernel/isolation.c  | 118 
 kernel/sys.c|   9 +++
 kernel/time/tick-sched.c|  31 ++
 12 files changed, 257 insertions(+), 13 deletions(-)
 create mode 100644 include/linux/isolation.h
 create mode 100644 kernel/isolation.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 9a53c929f017..c8d0b42d984a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3747,6 +3747,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
neutralize any effect of /proc/sys/kernel/sysrq.
Useful for debugging.
 
+   task_isolation= [KNL]
+   In kernels built with CONFIG_TASK_ISOLATION=y, set
+   the specified list of CPUs on which tasks will be able
+   to use prctl(PR_SET_TASK_ISOLATION) to set up task
+   isolation mode.  Setting this boot flag implicitly
+   also sets up nohz_full and isolcpus mode for the
+   listed set of cpus.
+
tcpmhash_entries= [KNL,NET]
Set the number of tcp_metrics_hash slots.
Default value is 8192 or 16384 depending on total
diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index 691eeea2f19a..eaf40f4264ee 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "base.h"
 
@@ -290,6 +291,20 @@ static ssize_t print_cpus_nohz_full(struct device *dev,
 static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
 #endif
 
+#ifdef CONFIG_TASK_ISOLATION
+static ssize_t print_cpus_task_isolation(struct device *dev,
+struct device_attribute *attr,
+   

[PATCH v10 00/12] support "task_isolation" mode

2016-03-02 Thread Chris Metcalf
Here is the latest version of the task-isolation patch set, adopting
various suggestions made about the v9 patch series, including some
feedback from testing on the new EZchip NPS ARC platform (although the
new arch/arc support is not included in this patch series).  All of the
suggestions were relatively non-controversial.

Perhaps we are getting close to being able to merge this. :-)

Changes since v9:

- task_isolation is now set up by adding its required cpus to both the
  nohz_full and isolcpus cpumasks.  This allows users to separately
  specify all three flags, if so desired, and still get reasonably
  sane semantics.  This is done with a new tick_nohz_full_add_cpus()
  method for nohz, and just directly updating the isolcpus cpumask.

- We add a /sys/devices/system/cpu/task_isolation file to parallel
  the equivalent nohz_full file.  (This should have been in v8 since
  once task_isolation isn't equivalent to nohz_full, it needs its
  own way to let userspace know where to run; a sketch of reading
  this file follows the change list below.)

- We add a new Kconfig option, TASK_ISOLATION_ALL, which sets all but
  the boot processor to run in task isolation mode.  This parallels
  the existing NO_HZ_FULL_ALL and works around the fact that you can't
  easily specify a boot argument with the desired semantics.

- For task_isolation_debug, we add a check of the context_tracking
  state of the remote cpu before issuing a warning; if the remote cpu
  is actually in the kernel, we don't need to warn.

- A cloned child of a task_isolation task is not enabled for
  task_isolation, since otherwise they would both fight over who could
  safely return to userspace without requiring scheduling interrupts.

- The quiet_vmstat() function's semantics was changed since the v9
  patch series, so I introduce a quiet_vmstat_sync() for isolation.

- The lru_add_drain_needed() function is updated to know about the new
  lru_deactivate_pvecs variable.

- The arm64 patch factoring assembly into C has been modified based
  on an earlier patch by Mark Rutland.

- I simplified the enabling patch for arm64 by realizing we could just
  test TIF_NOHZ as the only bit for TIF_WORK_MASK for task isolation,
  so I didn't have to renumber all the TIF_xxx bits.

- Small fixes to avoid preemption warnings.

- Rebased on v4.5-rc5
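
As promised in the /sys file bullet above, here is a sketch of how
userspace might consult the new task_isolation file to pick a core
(simplistic parsing, purely illustrative: only the first CPU of a
possible range list is read):

#include <stdio.h>

/* Returns the first CPU listed in the task_isolation cpumask file,
 * or -1 if the file is missing or empty. */
static int first_task_isolation_cpu(void)
{
	FILE *f = fopen("/sys/devices/system/cpu/task_isolation", "r");
	int cpu = -1;

	if (f) {
		if (fscanf(f, "%d", &cpu) != 1)
			cpu = -1;
		fclose(f);
	}
	return cpu;
}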

For changes in earlier versions of the patch series, please see:

http://lkml.kernel.org/r/1451936091-29247-1-git-send-email-cmetc...@ezchip.com

A couple of the tile patches that refactored the context tracking
code were taken into 4.5 so are no longer present in this series.

This version of the patch series has been tested on arm64 and tile,
and build-tested on x86.

It remains true that the 1 Hz tick needs to be disabled for this
patch series to be able to achieve its primary goal of enabling
truly tick-free operation, but that is ongoing orthogonal work.

The series is available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane

Chris Metcalf (12):
  vmstat: add quiet_vmstat_sync function
  vmstat: add vmstat_idle function
  lru_add_drain_all: factor out lru_add_drain_needed
  task_isolation: add initial support
  task_isolation: support CONFIG_TASK_ISOLATION_ALL
  task_isolation: support PR_TASK_ISOLATION_STRICT mode
  task_isolation: add debug boot flag
  arm, tile: turn off timer tick for oneshot_stopped state
  arch/x86: enable task isolation functionality
  arch/tile: enable task isolation functionality
  arm64: factor work_pending state machine to C
  arch/arm64: enable task isolation functionality

 Documentation/kernel-parameters.txt|  16 ++
 arch/arm64/include/asm/thread_info.h   |   8 +-
 arch/arm64/kernel/entry.S  |  12 +-
 arch/arm64/kernel/ptrace.c |  12 +-
 arch/arm64/kernel/signal.c |  34 -
 arch/arm64/kernel/smp.c|   2 +
 arch/arm64/mm/fault.c  |   4 +
 arch/tile/kernel/process.c |   6 +-
 arch/tile/kernel/ptrace.c  |   6 +
 arch/tile/kernel/single_step.c |   5 +
 arch/tile/kernel/smp.c |  28 ++--
 arch/tile/kernel/time.c|   1 +
 arch/tile/kernel/unaligned.c   |   3 +
 arch/tile/mm/fault.c   |   3 +
 arch/tile/mm/homecache.c   |   2 +
 arch/x86/entry/common.c|  18 ++-
 arch/x86/kernel/traps.c|   2 +
 arch/x86/mm/fault.c|   2 +
 drivers/base/cpu.c |  18 +++
 drivers/clocksource/arm_arch_timer.c   |   2 +
 include/linux/context_tracking_state.h |   6 +
 include/linux/isolation.h  |  83 +++
 include/linux/sched.h  |   3 +
 include/linux/swap.h   |   1 +
 include/linux/tick.h   |   1 +
 include/linux/vmstat.h |   4 +
 include/uapi/linux/prctl.h |   8 +
 init/Kconfig   |  30 
 kernel/Makefile 

Re: [PATCH v8 04/14] task_isolation: add initial support

2016-02-11 Thread Chris Metcalf

On 01/28/2016 11:38 AM, Frederic Weisbecker wrote:

On Tue, Oct 27, 2015 at 12:40:29PM -0400, Chris Metcalf wrote:

On 10/21/2015 12:12 PM, Frederic Weisbecker wrote:

On Tue, Oct 20, 2015 at 04:36:02PM -0400, Chris Metcalf wrote:

+/*
+ * This routine controls whether we can enable task-isolation mode.
+ * The task must be affinitized to a single nohz_full core or we will
+ * return EINVAL.  Although the application could later re-affinitize
+ * to a housekeeping core and lose task isolation semantics, this
+ * initial test should catch 99% of bugs with task placement prior to
+ * enabling task isolation.
+ */
+int task_isolation_set(unsigned int flags)
+{
+   if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||

I think you'll have to make sure the task can not be concurrently reaffined
to more CPUs. This may involve setting task_isolation_flags under the runqueue
lock and thus move that tiny part to the scheduler code. And then we must forbid
changing the affinity while the task has the isolation flag, or deactivate the 
flag.

In any case this needs some synchronization.

Well, as the comment says, this is not intended as a hard guarantee.
As written, it might race with a concurrent sched_setaffinity(), but
then again, it also is totally OK as written for sched_setaffinity() to
change it away after the prctl() is complete, so it's not necessary to
do any explicit synchronization.

This harks back again to the whole "polite vs aggressive" issue with
how we envision task isolation.

The "polite" model basically allows you to set up the conditions for
task isolation to be useful, and then if they are useful, great! What
you're suggesting here is a bit more of the "aggressive" model, where
we actually fail sched_setaffinity() either for any cpumask after
task isolation is set, or perhaps just for resetting it to housekeeping
cores.  (Note that we could in principle use PF_NO_SETAFFINITY to
just hard fail all attempts to call sched_setaffinity once we enable
task isolation, so we don't have to add more mechanism on that path.)

I'm a little reluctant to ever fail sched_setaffinity() based on the
task isolation status with the current "polite" model, since an
unprivileged application can set up for task isolation, and then
presumably no one can override it via sched_setaffinity() from another
task.  (I suppose you could do some kind of permissions-based thing
where root can always override it, or some suitable capability, etc.,
but I feel like that gets complicated quickly, for little benefit.)

The alternative you mention is that if the task is re-affinitized, it
loses its task-isolation status, and that also seems like an unfortunate
API, since if you are setting it with prctl(), it's really cleanest just to
only be able to unset it with prctl() as well.

I think given the current "polite" API, the only question is whether in
fact *no* initial test is the best thing, or if an initial test (as
introduced
in the v8 version) is defensible just as a help for catching an obvious
mistake in setting up your task isolation.  I decided the advantage
of catching the mistake were more important than the "API purity"
of being 100% consistent in how we handled the interactions between
affinity and isolation, but I am certainly open to argument on that one.

Meanwhile I think it still feels like the v8 code is the best compromise.

So what is the way to deal with a migration for example? When the task wakes
up on the non-isolated CPU, it gets warned or killed?


Good question!  We can only enable task isolation on an isolcpus core,
so it must be a manual migration, either externally, or by the program
itself calling sched_setaffinity().  So at some level, it's just an
application bug.  In the current code, if you have enabled STRICT mode task
isolation, the process will get killed since it has to go through the kernel
to migrate.  If not in STRICT mode, then it will hang until it is manually
killed since full dynticks will never get turned on once it wakes up on a
non-isolated CPU - unless it is then manually migrated back to a proper
task-isolation cpu.  And, perhaps the intent was to do some cpu offlining
and rearrange the task isolation tasks, and therefore that makes sense?

So, maybe that semantics is good enough!?  I'm not completely sure, but
I think I'm willing to claim that for something this much of a corner
case, it's probably reasonable.


+   /* If the tick is running, request rescheduling; we're not ready. */
+   if (!tick_nohz_tick_stopped()) {

Note that this function tells whether the tick is in dynticks mode, which means
the tick currently only run on-demand. But it's not necessarily completely 
stopped.

I think in fact this is the semantics we want (and that people requested),
e.g. if the user requests an alarm(), we may still be ticking even though
tick_nohz_tick

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-02-11 Thread Chris Metcalf

On 01/30/2016 04:11 PM, Frederic Weisbecker wrote:

On Fri, Jan 29, 2016 at 01:18:05PM -0500, Chris Metcalf wrote:

On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:

On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:

You asked what happens if nohz_full= is given as well, which is a very
good question.  Perhaps the right answer is to have an early_initcall
that suppresses task isolation on any cores that lost their nohz_full
or isolcpus status due to later boot command line arguments (and
generate a console warning, obviously).

I'd rather imagine that the final nohz full cpumask is "nohz_full=" | 
"task_isolation="
That's the easiest way to deal with and both nohz and task isolation can call
a common initializer that takes care of the allocation and add the cpus to the 
mask.

I like it!

And by the same token, the final isolcpus cpumask is "isolcpus=" |
"task_isolation="?
That seems like we'd want to do it to keep things parallel.

We have reverted the patch that made isolcpus |= nohz_full. Too
many people complained about unusable machines with NO_HZ_FULL_ALL

But the user can still set that parameter manually.


Yes.  What I was suggesting is that if the user specifies task_isolation=X-Y
we should add cpus X-Y to both the nohz_full set and the isolcpus set.
I've changed it to work that way for the v10 patch series.



+bool _task_isolation_ready(void)
+{
+   WARN_ON_ONCE(!irqs_disabled());
+
+   /* If we need to drain the LRU cache, we're not ready. */
+   if (lru_add_drain_needed(smp_processor_id()))
+   return false;
+
+   /* If vmstats need updating, we're not ready. */
+   if (!vmstat_idle())
+   return false;
+
+   /* Request rescheduling unless we are in full dynticks mode. */
+   if (!tick_nohz_tick_stopped()) {
+   set_tsk_need_resched(current);

I'm not sure doing this will help getting the tick to get stopped.

Well, I don't know that there is anything else we CAN do, right?  If there's
another task that can run, great - it may be that that's why full dynticks
isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
there's nothing else we can do, in which case we basically spend our time
going around through the scheduler code and back out to the
task_isolation_ready() test, but again, there's really nothing else more
useful we can be doing at this point.  Once the RCU tick fires (or whatever
it was that was preventing full dynticks from engaging), we will pass this
test and return to user space.

There is nothing at all you can do and setting TIF_RESCHED won't help either.
If there is another task that can run, the scheduler takes care of resched
by itself :-)

The problem is that the scheduler will only take care of resched at a
later time, typically when we get a timer interrupt later.

When a task is enqueued, the scheduler sets TIF_RESCHED on the target. If the
target is remote it sends an IPI, if it's local then we wait the next reschedule
point (preemption points, voluntary reschedule, interrupts). There is just 
nothing
you can do to accelerate that.


But that's exactly what I'm saying.  If we're sitting in a loop here waiting
for some short-lived process (maybe kernel thread) to run and get out of
the way, we don't want to just spin sitting in prepare_exit_to_usermode().
We want to call schedule(), get the short-lived process to run, then when
it calls schedule() again, we're back in prepare_exit_to_usermode but now
we can return to userspace.

We don't want to wait for preemption points or interrupts, and there are
no other voluntary reschedules in the prepare_exit_to_usermode() loop.

If the other task had been woken up for some completion, then yes we would
already have had TIF_RESCHED set, but if the other runnable task was (for
example) pre-empted on a timer tick, we wouldn't have TIF_RESCHED set at
this point, and thus we might need to call schedule() explicitly.

Note that the prepare_exit_to_usermode() loop is exactly the point at
which we normally call schedule() if we are in syscall exit, so we are
just encouraging that schedule() to happen if otherwise it might not.


By invoking the scheduler here, we allow any tasks that are ready to run to run
immediately, rather than waiting for an interrupt to wake the scheduler.

Well, in this case here we are interested in the current CPU. And if a task
got awoken and waits for the current CPU, it will have an opportunity to get
schedule on syscall exit.


That's true if TIF_RESCHED was set because a completion occurred that
the other task was waiting for.  But there might not be any such completion
and the task just got preempted earlier and is still ready to run.

My point is that setting TIF_RESCHED is never harmful, and there are
cases like involuntary preemption wh

Re: [PATCH v9 04/13] task_isolation: add initial support

2016-01-29 Thread Chris Metcalf

On 01/27/2016 07:28 PM, Frederic Weisbecker wrote:

On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:

You asked what happens if nohz_full= is given as well, which is a very
good question.  Perhaps the right answer is to have an early_initcall
that suppresses task isolation on any cores that lost their nohz_full
or isolcpus status due to later boot command line arguments (and
generate a console warning, obviously).

I'd rather imagine that the final nohz full cpumask is "nohz_full=" | 
"task_isolation="
That's the easiest way to deal with and both nohz and task isolation can call
a common initializer that takes care of the allocation and add the cpus to the 
mask.


I like it!

And by the same token, the final isolcpus cpumask is "isolcpus=" | 
"task_isolation="?

That seems like we'd want to do it to keep things parallel.


+bool _task_isolation_ready(void)
+{
+   WARN_ON_ONCE(!irqs_disabled());
+
+   /* If we need to drain the LRU cache, we're not ready. */
+   if (lru_add_drain_needed(smp_processor_id()))
+   return false;
+
+   /* If vmstats need updating, we're not ready. */
+   if (!vmstat_idle())
+   return false;
+
+   /* Request rescheduling unless we are in full dynticks mode. */
+   if (!tick_nohz_tick_stopped()) {
+   set_tsk_need_resched(current);

I'm not sure doing this will help getting the tick to get stopped.

Well, I don't know that there is anything else we CAN do, right?  If there's
another task that can run, great - it may be that that's why full dynticks
isn't happening yet.  Or, it might be that we're waiting for an RCU tick and
there's nothing else we can do, in which case we basically spend our time
going around through the scheduler code and back out to the
task_isolation_ready() test, but again, there's really nothing else more
useful we can be doing at this point.  Once the RCU tick fires (or whatever
it was that was preventing full dynticks from engaging), we will pass this
test and return to user space.

There is nothing at all you can do and setting TIF_RESCHED won't help either.
If there is another task that can run, the scheduler takes care of resched
by itself :-)


The problem is that the scheduler will only take care of resched at a
later time, typically when we get a timer interrupt later.  By invoking the
scheduler here, we allow any tasks that are ready to run to run
immediately, rather than waiting for an interrupt to wake the scheduler.
Plenty of places in the kernel just call schedule() directly when they are
waiting.  Since we're waiting here regardless, we might as well
immediately get any other runnable tasks dealt with.

We could also just return "false" in _task_isolation_ready(), and then
check tick_nohz_tick_stopped() in _task_isolation_enter() and if false,
call schedule() explicitly there, but that seems a little more roundabout.
Admittedly it's more usual to see kernel code call schedule() directly
to yield the processor, but in this case I'm not convinced it's cleaner
given we're already in a loop where the caller is checking TIF_RESCHED
and then calling schedule() when it's set.

--
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com
