Re: [EXT] Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs
On 1/29/21 06:23, Marcelo Tosatti wrote: External Email
On Fri, Jan 29, 2021 at 08:55:20AM -0500, Nitesh Narayan Lal wrote:
On 1/28/21 3:01 PM, Thomas Gleixner wrote:
On Thu, Jan 28 2021 at 13:59, Marcelo Tosatti wrote:

The whole pile wants to be reverted. It's simply broken in several ways.

I was asking for your comments on interaction with CPU hotplug :-)

Which I answered in a separate mail :)

So housekeeping_cpumask has multiple meanings. In this case: ...
So as long as the meaning of the flags is respected, seems alright.

Yes. Stuff like the managed interrupts preference for housekeeping CPUs when an affinity mask spans housekeeping and isolated CPUs is perfectly fine. It's well thought out and has no limitations.

Nitesh, is there anything preventing this from being fixed in userspace? (as Thomas suggested previously).

Everything which is not managed can be steered by user space.

Thanks, tglx

So, I think the conclusion here would be to revert the change made in cpumask_local_spread via the patch:
- lib: Restrict cpumask_local_spread to housekeeping CPUs

Also, a similar case can be made for the rps patch that went in with this:
- net: Restrict receive packets queuing to housekeeping CPUs

Yes, this is the userspace solution: https://lkml.org/lkml/2021/1/22/815
Should have a kernel document with this info and examples (the network queue configuration as well). Will send something.
+ net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus

I am not sure about the PCI patch as I don't think we can control that from userspace, or maybe I am wrong?

You mean "lib: Restrict cpumask_local_spread to housekeeping CPUs"?

If we want to do it from userspace, we should have something that triggers it in userspace. Should we use udev for this purpose?

-- Alex
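For reference on the userspace side of the RPS change mentioned above, here is a minimal C sketch that writes a housekeeping cpumask into the rps_cpus file for one receive queue. The interface name "eth0", queue "rx-0" and mask value are assumptions for illustration only; a real tool would enumerate interfaces and queues and derive the mask from the actual housekeeping configuration.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical interface/queue; rps_cpus takes a hex cpumask. */
	const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
	const char *mask = "3\n";	/* CPUs 0-1 assumed to be housekeeping */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, mask, strlen(mask)) < 0)
		perror("write");
	close(fd);
	return 0;
}

Writing an empty mask (or "0") would disable RPS for the queue entirely, which is what the "accept an empty mask" change above is about.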
Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Wed, 2020-12-02 at 14:20 +, Mark Rutland wrote: > External Email > > --- > --- > On Mon, Nov 23, 2020 at 05:58:22PM +0000, Alex Belits wrote: > > From: Yuri Norov > > > > For nohz_full CPUs the desirable behavior is to receive interrupts > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's > > obviously not desirable because it breaks isolation. > > > > This patch adds check for it. > > > > Signed-off-by: Yuri Norov > > [abel...@marvell.com: updated, only exclude CPUs running isolated > > tasks] > > Signed-off-by: Alex Belits > > --- > > kernel/time/tick-sched.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > index a213952541db..6c8679e200f0 100644 > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -20,6 +20,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) > > */ > > void tick_nohz_full_kick_cpu(int cpu) > > { > > - if (!tick_nohz_full_cpu(cpu)) > > + smp_rmb(); > > What does this barrier pair with? The commit message doesn't mention > it, > and it's not clear in-context. With barriers in task_isolation_kernel_enter() and task_isolation_exit_to_user_mode(). -- Alex
Re: [EXT] Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks
On Wed, 2020-12-02 at 14:18 +0000, Mark Rutland wrote:
> External Email
>
> On Mon, Nov 23, 2020 at 05:57:42PM +0000, Alex Belits wrote:
> > Some drivers don't call functions that call
> > task_isolation_kernel_enter() in interrupt handlers. Call it
> > directly.
>
> I don't think putting this in drivers is the right approach. IIUC we
> only need to track user<->kernel transitions, and we can do that within
> the architectural entry code before we ever reach irqchip code. I
> suspect the current approach is an artifact of that being difficult in
> the old structure of the arch code; recent rework should address that,
> and we can restructure things further in future.

I agree completely. This patch only covers irqchip drivers with unusual
entry procedures.

-- Alex
Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
On Wed, 2020-12-02 at 14:02 +, Mark Rutland wrote: > On Tue, Nov 24, 2020 at 05:40:49PM +0000, Alex Belits wrote: > > > > > I am having problems applying the patchset to today's linux-next. > > > > > > Which kernel should I be using ? > > > > The patches are against Linus' tree, in particular, commit > > a349e4c659609fd20e4beea89e5c4a4038e33a95 > > Is there any reason to base on that commit in particular? No specific reason for that particular commit. > Generally it's preferred that a series is based on a tag (so either a > release or an -rc kernel), and that the cover letter explains what > the > base is. If you can do that in future it'll make the series much > easier > to work with. Ok. -- Alex
Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote:
> External Email
>
> Hi Alex,
>
> On Mon, Nov 23, 2020 at 05:58:06PM +0000, Alex Belits wrote:
> > In do_notify_resume(), call task_isolation_before_pending_work_check()
> > first, to report isolation breaking, then after handling all pending
> > work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.
> >
> > Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
> > define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we
> > don't clear _TIF_TASK_ISOLATION in the loop.
> >
> > Early kernel entry code calls task_isolation_kernel_enter(). In
> > particular:
> >
> > Vectors:
> > el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
> > el1_irq -> asm_nmi_enter(), handle_arch_irq()
> > el1_error -> do_serror()
> > el0_sync -> el0_sync_handler()
> > el0_irq -> handle_arch_irq()
> > el0_error -> do_serror()
> > el0_sync_compat -> el0_sync_compat_handler()
> > el0_irq_compat -> handle_arch_irq()
> > el0_error_compat -> do_serror()
> >
> > SDEI entry:
> > __sdei_asm_handler -> __sdei_handler() -> nmi_enter()
>
> As a heads-up, the arm64 entry code is changing, as we found that our
> lockdep, RCU, and context-tracking management wasn't quite right. I have
> a series of patches:
>
> https://lore.kernel.org/r/20201130115950.22492-1-mark.rutl...@arm.com
>
> ... which are queued in the arm64 for-next/fixes branch. I intend to
> have some further rework ready for the next cycle.

Thanks!

> I'd appreciate if you could Cc me on any patches altering the arm64
> entry code, as I have a vested interest.

I will do that.

> That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
> chosen and context tracking was in use (e.g. with
> CONTEXT_TRACKING_FORCE),

I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and
CONTEXT_TRACKING have to be enabled for it to do anything. I will check it
with PROVE_LOCKING and your patches.

The entry code only adds an inline function that, if task isolation is
enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level
operations, and accesses per-CPU variables by offset, so at the very least
it should not add any problems. Even raw_local_irq_save() /
raw_local_irq_restore() probably should be removed, however I wanted to
have something that can be safely called if for whatever reason interrupts
were enabled before the kernel was fully entered.

> so I'm assuming that this series has not been tested in that
> configuration. What sort of testing has this seen?

On various available arm64 hardware, with these options enabled:

CONFIG_TASK_ISOLATION
CONFIG_NO_HZ_FULL
CONFIG_HIGH_RES_TIMERS

and these disabled:

CONFIG_HZ_PERIODIC
CONFIG_NO_HZ_IDLE
CONFIG_NO_HZ

> It would be very helpful for the next posting if you could provide any
> instructions on how to test this series (e.g. with pointers to any test
> suite that you have), since it's very easy to introduce subtle breakage
> in this area without realising it.

I will. Currently libtmc ( https://github.com/abelits/libtmc ) contains
all userspace code used for testing, however I should document the testing
procedures.
> > > Functions called from there: > > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() > > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() > > > > Handlers: > > do_serror() -> nmi_enter() -> task_isolation_kernel_enter() > > or task_isolation_kernel_enter() > > el1_sync_handler() -> task_isolation_kernel_enter() > > el0_sync_handler() -> task_isolation_kernel_enter() > > el0_sync_compat_handler() -> task_isolation_kernel_enter() > > > > handle_arch_irq() is irqchip-specific, most call > > handle_domain_irq() > > There is a separate patch for irqchips that do not follow this > > rule. > > > > handle_domain_irq() -> task_isolation_kernel_enter() > > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant) > > nmi_enter() -> task_isolation_kernel_enter() > > The IRQ cases look very odd to me. With the rework I've just done for > arm64, we'll do the regular context tracking accounting before we > ever > get into handle_domain_irq() or similar, so I suspect that's not > necessary at all? The goal is to call task_isolation_kernel_enter() before anything that depends on a CPU state, including pipeline, that could remain un- synchronized when the rest of the kernel was send
Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> >
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter kernel at any time and run
> > kernel code with no additional preparation. Then the only solution is
> > to always do full broadcast and require all CPUs to process it.
> >
> > What I am trying to introduce is the idea of CPU that is not likely to
> > run kernel code any soon, and can afford to go through an additional
> > synchronization procedure on the next entry into kernel. The
> > synchronization is not skipped, it simply happens later, early in
> > kernel entry code.
>
> Ah I see, this is ordered that way:
>
> ll_isol_flags = ISOLATED
>
>          CPU 0                                CPU 1
>          -----                                -----
>                                               // kernel entry
>     data_to_sync = 1                          ll_isol_flags = ISOLATED_BROKEN
>     smp_mb()                                  smp_mb()
>     if ll_isol_flags(CPU 1) == ISOLATED       READ data_to_sync
>          smp_call(CPU 1)

The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more
complex. In terms of scenarios, on entry from isolation the following can
happen:

1. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens before the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                 CPU 1
         -----                                 -----
                                               // kernel entry
                                               if (ll_isol_flags == ISOLATED) {
                                                   ll_isol_flags = ISOLATED_BROKEN
    data_to_sync = 1                               smp_mb()
    // data_to_sync undetermined
    smp_mb()                                   }
    // ll_isol_flags(CPU 1) updated
    if ll_isol_flags(CPU 1) != ISOLATED        // interrupts enabled
        smp_call(CPU 1)                        // kernel entry again
                                               if (ll_isol_flags == ISOLATED)
                                                   // nothing happens
                                               // explicit or implied barriers
                                               // data_to_sync updated
                                               // kernel exit
    // CPU 0 assumes, CPU 1 will see           READ data_to_sync
    // data_to_sync = 1 when in kernel

2. Kernel entry happens simultaneously with operation that requires
synchronization, kernel entry processing happens after the check for
isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                 CPU 1
         -----                                 -----
    data_to_sync = 1                           // kernel entry
    smp_mb()                                   // data_to_sync undetermined
                                               // should not access data_to_sync here
                                               if (ll_isol_flags == ISOLATED) {
                                                   ll_isol_flags = ISOLATED_BROKEN
    // ll_isol_flags(CPU 1) undetermined           smp_mb()
                                                   // data_to_sync updated
    if ll_isol_flags(CPU 1) != ISOLATED        }
        // possibly nothing happens
    // CPU 0 assumes, CPU 1 will see           READ data_to_sync
    // data_to_sync = 1 when in kernel

3. Kernel entry processing completed before the check for isolation on
the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                                 CPU 1
         -----                                 -----
                                               // kernel entry
                                               if (ll_isol_flags == ISOLATED) {
                                                   ll_isol_flags = ISOLATED_BROKEN
                                                   smp_mb()
                                               }
                                               // interrupts are enabled at some
    data_to_sync = 1                           // point here, data_to_sync value
    smp_mb()                                   // is undetermined, CPU 0 makes no
    // ll_isol_flags(CPU 1) updated            // assumptions about it
    if ll_isol_flags(CPU 1) != ISOLATED
        smp_call(CPU 1)                        // kernel entry again
Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode
On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote: > External Email > > --- > --- > > On 11/23/20 9:42 AM, Alex Belits wrote: > > This is an update of task isolation work that was originally done > > by > > Chris Metcalf and maintained by him until > > November 2017. It is adapted to the current kernel and cleaned up > > to > > implement its functionality in a more complete and cleaner manner. > > I am having problems applying the patchset to today's linux-next. > > Which kernel should I be using ? The patches are against Linus' tree, in particular, commit a349e4c659609fd20e4beea89e5c4a4038e33a95 -- Alex
Re: [EXT] Re: [PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
On Mon, 2020-11-23 at 23:29 +0100, Frederic Weisbecker wrote: > External Email > > --- > --- > On Mon, Nov 23, 2020 at 05:58:42PM +0000, Alex Belits wrote: > > From: Yuri Norov > > > > Make sure that kick_all_cpus_sync() does not call CPUs that are > > running > > isolated tasks. > > > > Signed-off-by: Yuri Norov > > [abel...@marvell.com: use safe task_isolation_cpumask() > > implementation] > > Signed-off-by: Alex Belits > > --- > > kernel/smp.c | 14 +- > > 1 file changed, 13 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/smp.c b/kernel/smp.c > > index 4d17501433be..b2faecf58ed0 100644 > > --- a/kernel/smp.c > > +++ b/kernel/smp.c > > @@ -932,9 +932,21 @@ static void do_nothing(void *unused) > > */ > > void kick_all_cpus_sync(void) > > { > > + struct cpumask mask; > > + > > /* Make sure the change is visible before we kick the cpus */ > > smp_mb(); > > - smp_call_function(do_nothing, NULL, 1); > > + > > + preempt_disable(); > > +#ifdef CONFIG_TASK_ISOLATION > > + cpumask_clear(&mask); > > + task_isolation_cpumask(&mask); > > + cpumask_complement(&mask, &mask); > > +#else > > + cpumask_setall(&mask); > > +#endif > > + smp_call_function_many(&mask, do_nothing, NULL, 1); > > + preempt_enable(); > > Same comment about IPIs here. This is different from timers. The original design was based on the idea that every CPU should be able to enter kernel at any time and run kernel code with no additional preparation. Then the only solution is to always do full broadcast and require all CPUs to process it. What I am trying to introduce is the idea of CPU that is not likely to run kernel code any soon, and can afford to go through an additional synchronization procedure on the next entry into kernel. The synchronization is not skipped, it simply happens later, early in kernel entry code. -- Alex
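As a rough illustration of that deferred synchronization, the two halves of the scheme look roughly like the sketch below. It reuses the ll_isol_flags / task_isolation_on_cpu() / task_isolation_kernel_enter() names from this series, but the flag values and the specific catch-up action are simplified assumptions, not the actual implementation; the real code does this at a lower level in the entry path.

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/types.h>
#include <asm/barrier.h>

/* Per-CPU low-level isolation state; the values are illustrative only. */
#define LL_ISOL_ISOLATED	1
#define LL_ISOL_BROKEN		0

static DEFINE_PER_CPU(int, ll_isol_flags);

/*
 * Sender side: after its smp_mb(), a CPU still marked ISOLATED is left
 * out of the broadcast; it will synchronize itself on its next kernel
 * entry instead.
 */
static inline bool task_isolation_on_cpu(int cpu)
{
	return per_cpu(ll_isol_flags, cpu) == LL_ISOL_ISOLATED;
}

/* Isolated CPU side: called as early as possible on every kernel entry. */
static inline void task_isolation_kernel_enter(void)
{
	if (this_cpu_read(ll_isol_flags) != LL_ISOL_ISOLATED)
		return;
	this_cpu_write(ll_isol_flags, LL_ISOL_BROKEN);
	/* Pairs with the barrier the sender issued before checking the flag. */
	smp_mb();
	/* Catch up with whatever the skipped do_nothing() IPI would have implied. */
	instr_sync();
}

The point of the sketch is only to show where the pairing barriers sit: either the sender sees ISOLATED and skips the CPU (which will then synchronize itself on entry), or it sees ISOLATED_BROKEN and sends the IPI as usual.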
Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Mon, 2020-11-23 at 23:13 +0100, Frederic Weisbecker wrote: > External Email > > --- > --- > Hi Alex, > > On Mon, Nov 23, 2020 at 05:58:22PM +, Alex Belits wrote: > > From: Yuri Norov > > > > For nohz_full CPUs the desirable behavior is to receive interrupts > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's > > obviously not desirable because it breaks isolation. > > > > This patch adds check for it. > > > > Signed-off-by: Yuri Norov > > [abel...@marvell.com: updated, only exclude CPUs running isolated > > tasks] > > Signed-off-by: Alex Belits > > --- > > kernel/time/tick-sched.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > index a213952541db..6c8679e200f0 100644 > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -20,6 +20,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) > > */ > > void tick_nohz_full_kick_cpu(int cpu) > > { > > - if (!tick_nohz_full_cpu(cpu)) > > + smp_rmb(); > > + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) > > return; > > Like I said in subsequent reviews, we are not going to ignore IPIs. > We must fix the sources of these IPIs instead. This is what I am working on right now. This is made with an assumption that CPU running isolated task has no reason to be kicked because nothing else is supposed to be there. Usually this is true and when not true is still safe when everything else is behaving right. For this version I have kept the original implementation with minimal changes to make it possible to use task isolation at all. I agree that it's a much better idea is to determine if the CPU should be kicked. If it really should, that will be a legitimate cause to break isolation there, because CPU running isolated task has no legitimate reason to have timers running. Right now I am trying to determine the origin of timers that _still_ show up as running in the current kernel version, so I think, this is a rather large chunk of work that I have to do separately. -- Alex
[PATCH v5 8/9] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
From: Yuri Norov CPUs running isolated tasks are in userspace, so they don't have to perform ring buffer updates immediately. If ring_buffer_resize() schedules the update on those CPUs, isolation is broken. To prevent that, updates for CPUs running isolated tasks are performed locally, like for offline CPUs. A race condition between this update and isolation breaking is avoided at the cost of disabling per_cpu buffer writing for the time of update when it coincides with isolation breaking. Signed-off-by: Yuri Norov [abel...@marvell.com: updated to prevent race with isolation breaking] Signed-off-by: Alex Belits --- kernel/trace/ring_buffer.c | 63 ++ 1 file changed, 57 insertions(+), 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index dc83b3fa9fe7..9e4fb3ed2af0 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -1939,6 +1940,38 @@ static void update_pages_handler(struct work_struct *work) complete(&cpu_buffer->update_done); } +static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer, + int cpu) +{ + bool rv = false; + + smp_rmb(); + if (task_isolation_on_cpu(cpu)) { + /* +* CPU is running isolated task. Since it may lose +* isolation and re-enter kernel simultaneously with +* this update, disable recording until it's done. +*/ + atomic_inc(&cpu_buffer->record_disabled); + /* Make sure, update is done, and isolation state is current */ + smp_mb(); + if (task_isolation_on_cpu(cpu)) { + /* +* If CPU is still running isolated task, we +* can be sure that breaking isolation will +* happen while recording is disabled, and CPU +* will not touch this buffer until the update +* is done. +*/ + rb_update_pages(cpu_buffer); + cpu_buffer->nr_pages_to_update = 0; + rv = true; + } + atomic_dec(&cpu_buffer->record_disabled); + } + return rv; +} + /** * ring_buffer_resize - resize the ring buffer * @buffer: the buffer to resize. @@ -2028,13 +2061,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, if (!cpu_buffer->nr_pages_to_update) continue; - /* Can't run something on an offline CPU. */ + /* +* Can't run something on an offline CPU. +* +* CPUs running isolated tasks don't have to +* update ring buffers until they exit +* isolation because they are in +* userspace. Use the procedure that prevents +* race condition with isolation breaking. +*/ if (!cpu_online(cpu)) { rb_update_pages(cpu_buffer); cpu_buffer->nr_pages_to_update = 0; } else { - schedule_work_on(cpu, - &cpu_buffer->update_pages_work); + if (!update_if_isolated(cpu_buffer, cpu)) + schedule_work_on(cpu, + &cpu_buffer->update_pages_work); } } @@ -2083,13 +2125,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, get_online_cpus(); - /* Can't run something on an offline CPU. */ + /* +* Can't run something on an offline CPU. +* +* CPUs running isolated tasks don't have to update +* ring buffers until they exit isolation because they +* are in userspace. Use the procedure that prevents +* race condition with isolation breaking. +*/ if (!cpu_online(cpu_id)) rb_update_pages(cpu_buffer); else { - schedule_work_on(cpu_id, + if (!update_if_isolated(cpu_buffer, cpu_id)) + schedule_work_on(cpu_id, &cpu_buffer->update_pages_work); - wait_for_completion(&cpu_buf
[PATCH v5 9/9] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
From: Yuri Norov Make sure that kick_all_cpus_sync() does not call CPUs that are running isolated tasks. Signed-off-by: Yuri Norov [abel...@marvell.com: use safe task_isolation_cpumask() implementation] Signed-off-by: Alex Belits --- kernel/smp.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/smp.c b/kernel/smp.c index 4d17501433be..b2faecf58ed0 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -932,9 +932,21 @@ static void do_nothing(void *unused) */ void kick_all_cpus_sync(void) { + struct cpumask mask; + /* Make sure the change is visible before we kick the cpus */ smp_mb(); - smp_call_function(do_nothing, NULL, 1); + + preempt_disable(); +#ifdef CONFIG_TASK_ISOLATION + cpumask_clear(&mask); + task_isolation_cpumask(&mask); + cpumask_complement(&mask, &mask); +#else + cpumask_setall(&mask); +#endif + smp_call_function_many(&mask, do_nothing, NULL, 1); + preempt_enable(); } EXPORT_SYMBOL_GPL(kick_all_cpus_sync); -- 2.20.1
[PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
From: Yuri Norov For nohz_full CPUs the desirable behavior is to receive interrupts generated by tick_nohz_full_kick_cpu(). But for hard isolation it's obviously not desirable because it breaks isolation. This patch adds check for it. Signed-off-by: Yuri Norov [abel...@marvell.com: updated, only exclude CPUs running isolated tasks] Signed-off-by: Alex Belits --- kernel/time/tick-sched.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index a213952541db..6c8679e200f0 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) */ void tick_nohz_full_kick_cpu(int cpu) { - if (!tick_nohz_full_cpu(cpu)) + smp_rmb(); + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) return; irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu); -- 2.20.1
[PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality
In do_notify_resume(), call task_isolation_before_pending_work_check() first, to report isolation breaking, then after handling all pending work, call task_isolation_start() for TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK, define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we don't clear _TIF_TASK_ISOLATION in the loop. Early kernel entry code calls task_isolation_kernel_enter(). In particular: Vectors: el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter() el1_irq -> asm_nmi_enter(), handle_arch_irq() el1_error -> do_serror() el0_sync -> el0_sync_handler() el0_irq -> handle_arch_irq() el0_error -> do_serror() el0_sync_compat -> el0_sync_compat_handler() el0_irq_compat -> handle_arch_irq() el0_error_compat -> do_serror() SDEI entry: __sdei_asm_handler -> __sdei_handler() -> nmi_enter() Functions called from there: asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() Handlers: do_serror() -> nmi_enter() -> task_isolation_kernel_enter() or task_isolation_kernel_enter() el1_sync_handler() -> task_isolation_kernel_enter() el0_sync_handler() -> task_isolation_kernel_enter() el0_sync_compat_handler() -> task_isolation_kernel_enter() handle_arch_irq() is irqchip-specific, most call handle_domain_irq() There is a separate patch for irqchips that do not follow this rule. handle_domain_irq() -> task_isolation_kernel_enter() do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant) nmi_enter() -> task_isolation_kernel_enter() Signed-off-by: Chris Metcalf [abel...@marvell.com: simplified to match kernel 5.10] Signed-off-by: Alex Belits --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/barrier.h | 1 + arch/arm64/include/asm/thread_info.h | 7 +-- arch/arm64/kernel/entry-common.c | 7 +++ arch/arm64/kernel/ptrace.c | 10 ++ arch/arm64/kernel/signal.c | 13 - arch/arm64/kernel/smp.c | 3 +++ 7 files changed, 39 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 1515f6f153a0..fc958d8d8945 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -141,6 +141,7 @@ config ARM64 select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_STACKLEAK + select HAVE_ARCH_TASK_ISOLATION select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h index c3009b0e5239..ad5a6dd380cf 100644 --- a/arch/arm64/include/asm/barrier.h +++ b/arch/arm64/include/asm/barrier.h @@ -49,6 +49,7 @@ #define dma_rmb() dmb(oshld) #define dma_wmb() dmb(oshst) +#define instr_sync() isb() /* * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz * and 0 otherwise. 
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 1fbab854a51b..3321c69c46fe 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -68,6 +68,7 @@ void arch_release_task_struct(struct task_struct *tsk); #define TIF_UPROBE 4 /* uprobe breakpoint or singlestep */ #define TIF_FSCHECK5 /* Check FS is USER_DS on return */ #define TIF_MTE_ASYNC_FAULT6 /* MTE Asynchronous Tag Check Fault */ +#define TIF_TASK_ISOLATION 7 /* task isolation enabled for task */ #define TIF_SYSCALL_TRACE 8 /* syscall trace active */ #define TIF_SYSCALL_AUDIT 9 /* syscall auditing */ #define TIF_SYSCALL_TRACEPOINT 10 /* syscall tracepoint for ftrace */ @@ -87,6 +88,7 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE) +#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION) #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT) @@ -101,11 +103,12 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ -_TIF_UPROBE | _TIF_FSCHECK | _TIF_MTE_ASYNC_FAULT) +_TIF_UPROBE | _TIF_FSCHECK | \ +_TIF_MTE_ASYNC_FAULT | _TIF_TASK_ISOLATION) #define _TIF_SYSCALL_WORK (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
[PATCH v5 5/9] task_isolation: Add driver-specific hooks
Some drivers don't call functions that call task_isolation_kernel_enter() in interrupt handlers. Call it directly. Signed-off-by: Alex Belits --- drivers/irqchip/irq-armada-370-xp.c | 6 ++ drivers/irqchip/irq-gic-v3.c| 3 +++ drivers/irqchip/irq-gic.c | 3 +++ drivers/s390/cio/cio.c | 3 +++ 4 files changed, 15 insertions(+) diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c index d7eb2e93db8f..4ac7babe1abe 100644 --- a/drivers/irqchip/irq-armada-370-xp.c +++ b/drivers/irqchip/irq-armada-370-xp.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -572,6 +573,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = { static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) { u32 msimask, msinr; + int isol_entered = 0; msimask = readl_relaxed(per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS) @@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) continue; if (is_chained) { + if (!isol_entered) { + task_isolation_kernel_enter(); + isol_entered = 1; + } irq = irq_find_mapping(armada_370_xp_msi_inner_domain, msinr - PCI_MSI_DOORBELL_START); generic_handle_irq(irq); diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index 16fecc0febe8..ded26dd4da0f 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs { u32 irqnr; + task_isolation_kernel_enter(); + irqnr = gic_read_iar(); if (gic_supports_nmi() && diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index 6053245a4754..bb482b4ae218 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) struct gic_chip_data *gic = &gic_data[0]; void __iomem *cpu_base = gic_data_cpu_base(gic); + task_isolation_kernel_enter(); + do { irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK); irqnr = irqstat & GICC_IAR_INT_ID_MASK; diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c index 6d716db2a46a..beab1b6d 100644 --- a/drivers/s390/cio/cio.c +++ b/drivers/s390/cio/cio.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch) struct irb *irb; int irq_context; + task_isolation_kernel_enter(); + irb = this_cpu_ptr(&cio_irb); /* Store interrupt response block to lowcore. */ if (tsch(sch->schid, irb) != 0) -- 2.20.1
[PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code
Kernel entry and exit functions for task isolation are added to context tracking and common entry points. Common handling of pending work on exit to userspace now processes isolation breaking, cleanup and start. Signed-off-by: Chris Metcalf [abel...@marvell.com: adapted for kernel 5.10] Signed-off-by: Alex Belits --- include/linux/hardirq.h | 2 ++ include/linux/sched.h | 2 ++ kernel/context_tracking.c | 5 + kernel/entry/common.c | 10 +- kernel/irq/irqdesc.c | 5 + 5 files changed, 23 insertions(+), 1 deletion(-) diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 754f67ac4326..b9e604ae6a0d 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -7,6 +7,7 @@ #include #include #include +#include #include extern void synchronize_irq(unsigned int irq); @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void); do {\ lockdep_off(); \ arch_nmi_enter(); \ + task_isolation_kernel_enter(); \ printk_nmi_enter(); \ BUG_ON(in_nmi() == NMI_MASK); \ __preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET); \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 5d8b17aa544b..51c2d774250b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -34,6 +34,7 @@ #include #include #include +#include /* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; @@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk); #ifdef CONFIG_SMP static __always_inline void scheduler_ipi(void) { + task_isolation_kernel_enter(); /* * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting * TIF_NEED_RESCHED remotely (for the first time) will also send diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 36a98c48aedc..379a48fd0e65 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -21,6 +21,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state state) __this_cpu_write(context_tracking.state, state); } context_tracking_recursion_exit(); + + task_isolation_exit_to_user_mode(); } EXPORT_SYMBOL_GPL(__context_tracking_enter); @@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state) if (!context_tracking_recursion_enter()) return; + task_isolation_kernel_enter(); + if (__this_cpu_read(context_tracking.state) == state) { if (__this_cpu_read(context_tracking.active)) { /* diff --git a/kernel/entry/common.c b/kernel/entry/common.c index e9e2df3f3f9e..10a520894105 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -4,6 +4,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -183,13 +184,20 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, static void exit_to_user_mode_prepare(struct pt_regs *regs) { - unsigned long ti_work = READ_ONCE(current_thread_info()->flags); + unsigned long ti_work; lockdep_assert_irqs_disabled(); + task_isolation_before_pending_work_check(); + + ti_work = READ_ONCE(current_thread_info()->flags); + if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) ti_work = exit_to_user_mode_loop(regs, ti_work); + if (unlikely(ti_work & _TIF_TASK_ISOLATION)) + task_isolation_start(); + arch_exit_to_user_mode_prepare(regs, ti_work); /* Ensure that the address limit is intact and no locks are held */ diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c index 1a7723604399..b8f0a7574f55 100644 --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -16,6 +16,7 @@ #include #include #include +#include 
#include "internals.h" @@ -669,6 +670,8 @@ int __handle_domain_irq(struct irq_domain *domain, unsigned int hwirq, unsigned int irq = hwirq; int ret = 0; + task_isolation_kernel_enter(); + irq_enter(); #ifdef CONFIG_IRQ_DOMAIN @@ -710,6 +713,8 @@ int handle_domain_nmi(struct irq_domain *domain, unsigned int hwirq, unsigned int irq; int ret = 0; + task_isolation_kernel_enter(); + /* * NMI context needs to be setup earlier in order to deal with tracing. */ -- 2.20.1
[PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) to do so. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate "isolcpus=nohz,domain,CPULIST" boot argument to enable nohz_full and isolcpus. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in the thread_info flags. When the kernel is returning to userspace from the prctl() call and sees TIF_TASK_ISOLATION set, it calls the new task_isolation_start() routine to arrange for the task to avoid being interrupted in the future. With interrupts disabled, task_isolation_start() ensures that kernel subsystems that might cause a future interrupt are quiesced. If it doesn't succeed, it adjusts the syscall return value to indicate that fact, and userspace can retry as desired. In addition to stopping the scheduler tick, the code takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache). The last stage of enabling task isolation happens in task_isolation_exit_to_user_mode() that runs last before returning to userspace and changes ll_isol_flags (see later) to prevent other CPUs from interfering with isolated task. Once the task has returned to userspace after issuing the prctl(), if it enters the kernel again via system call, page fault, or any other exception or irq, the kernel will send it a signal to indicate isolation loss. In addition to sending a signal, the code supports a kernel command-line "task_isolation_debug" flag which causes a stack backtrace to be generated whenever a task loses isolation. To allow the state to be entered and exited, the syscall checking test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can clear the bit again later, and ignores exit/exit_group to allow exiting the task without a pointless signal being delivered. The prctl() API allows for specifying a signal number to use instead of the default SIGKILL, to allow for catching the notification signal; for example, in a production environment, it might be helpful to log information to the application logging mechanism before exiting. Or, the signal handler might choose to reset the program counter back to the code segment intended to be run isolated via prctl() to continue execution. Isolation also disables CPU state synchronization mechanisms that are. normally done by IPI. In the future, more synchronization mechanisms, such as TLB flushes, may be disabled for isolated tasks. This requires careful handling of kernel entry from isolated task -- remote synchronization requests must be re-enabled and synchronization procedure triggered, before anything other than low-level kernel entry code is called. 
Same applies to exiting from kernel to userspace after isolation is enabled. For this purpose, per-CPU low-level flags ll_isol_flags are used to indicate isolation state, and task_isolation_kernel_enter() is used to safely clear them early in kernel entry. CPU mask corresponding to isolation bit in ll_isol_flags is visible to userspace as /sys/devices/system/cpu/isolation_running, and can be used for monitoring. Signed-off-by: Chris Metcalf Signed-off-by: Alex Belits --- .../admin-guide/kernel-parameters.txt | 6 + drivers/base/cpu.c| 23 + include/linux/hrtimer.h | 4 + include/linux/isolation.h | 326 include/linux/sched.h | 5 + include/linux/tick.h | 3 + include/uapi/linux/prctl.h| 6 + init/Kconfig | 27 + kernel/Makefile | 2 + kernel/isolation.c| 714 ++ kernel/signal.c | 2 + kernel/sys.c | 6 + kernel/time/hrtimer.c | 27 + kernel/time/tick-sched.c | 18 + 14 files changed, 1169 insertions(+) create mode 100644 include/linux/isolation.h create mode 100644
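A minimal userspace sketch of the prctl() usage described above follows. The fallback constant values are assumptions for illustration only; the real definitions come from the patched <linux/prctl.h>, and the retry and exit policy is entirely up to the application.

#include <unistd.h>
#include <sys/prctl.h>

#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION		48	/* assumed value, for illustration */
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	/* Retry until the kernel reports that quiescing succeeded. */
	while (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0)
		usleep(1000);

	/* Isolated section: pure userspace work, no syscalls, no page faults. */

	/*
	 * prctl(PR_TASK_ISOLATION) itself is ignored by the isolation-breaking
	 * check, so clearing the flag this way does not raise the signal.
	 */
	prctl(PR_TASK_ISOLATION, 0, 0, 0, 0);
	return 0;
}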
[PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
From: Chris Metcalf This function checks to see if a vmstat worker is not running, and the vmstat diffs don't require an update. The function is called from the task-isolation code to see if we need to actually do some work to quiet vmstat. Signed-off-by: Chris Metcalf Signed-off-by: Alex Belits --- include/linux/vmstat.h | 2 ++ mm/vmstat.c| 10 ++ 2 files changed, 12 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 300ce6648923..24392a957cfc 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -285,6 +285,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); void quiet_vmstat_sync(void); +bool vmstat_idle(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -393,6 +394,7 @@ static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } static inline void quiet_vmstat_sync(void) { } +static inline bool vmstat_idle(void) { return true; } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 43999caf47a4..5b0ad7ed65f7 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1945,6 +1945,16 @@ void quiet_vmstat_sync(void) refresh_cpu_vm_stats(false); } +/* + * Report on whether vmstat processing is quiesced on the core currently: + * no vmstat worker running and no vmstat updates to perform. + */ +bool vmstat_idle(void) +{ + return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) && + !need_update(smp_processor_id()); +} + /* * Shepherd worker thread that checks the * differentials of processors that have their worker -- 2.20.1
[PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
From: Chris Metcalf In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter") the quiet_vmstat() function became asynchronous, in the sense that the vmstat work was still scheduled to run on the core when the function returned. For task isolation, we need a synchronous version of the function that guarantees that the vmstat worker will not run on the core on return from the function. Add a quiet_vmstat_sync() function with that semantic. Signed-off-by: Chris Metcalf Signed-off-by: Alex Belits --- include/linux/vmstat.h | 2 ++ mm/vmstat.c| 9 + 2 files changed, 11 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 322dcbfcc933..300ce6648923 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -284,6 +284,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); +void quiet_vmstat_sync(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -391,6 +392,7 @@ static inline void __dec_node_page_state(struct page *page, static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } +static inline void quiet_vmstat_sync(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 698bc0bc18d1..43999caf47a4 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1936,6 +1936,15 @@ void quiet_vmstat(void) refresh_cpu_vm_stats(false); } +/* + * Synchronously quiet vmstat so the work is guaranteed not to run on return. + */ +void quiet_vmstat_sync(void) +{ + cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work)); + refresh_cpu_vm_stats(false); +} + /* * Shepherd worker thread that checks the * differentials of processors that have their worker -- 2.20.1
[PATCH v5 0/9] "Task_isolation" mode
This is an update of task isolation work that was originally done by Chris
Metcalf and maintained by him until November 2017. It is adapted to the
current kernel and cleaned up to implement its functionality in a more
complete and cleaner manner.

Previous version is at
https://lore.kernel.org/netdev/04be044c1bcd76b7438b7563edc35383417f12c8.ca...@marvell.com/

The last version by Chris Metcalf (now obsolete but may be relevant for
comparison and understanding the origin of the changes) is at
https://lore.kernel.org/lkml/1509728692-10460-1-git-send-email-cmetc...@mellanox.com

Supported architectures

This version includes only architecture-independent code and arm64
support. x86 and arm support, and everything related to virtualization,
will be re-added later, once the new kernel entry/exit implementation is
accommodated. Support for other architectures can be added in a somewhat
modular manner, however it heavily depends on the details of kernel
entry/exit support on any particular architecture. Development of common
entry/exit code and conversion to it should simplify that task. For now,
this is the version that is currently being developed on arm64.

Major changes since v4

The goal was to make isolation-breaking detection as generic as possible,
and to remove everything related to determining _why_ isolation was
broken. Originally, reporting isolation breaking was done with a large
number of hooks in specific code (hardware interrupts, syscalls, IPIs,
page faults, etc.), and it was necessary to cover all possible such events
to have a reliable notification of a task about its isolation being
broken. To avoid such a fragile mechanism, this version relies on the mere
fact of the kernel being entered in isolation mode. As a result, reporting
happens later in kernel code, however it covers everything.

This means that now there is no specific reporting, in the kernel log or
elsewhere, about the reasons for breaking isolation. Information about
that may be valuable at runtime, so a separate mechanism for generic
reporting of "why did the CPU enter the kernel" (with isolation or under
other conditions) may be a good thing. That can be done later, however at
this point it's important that task isolation does not require it, and
such a mechanism will not be developed with the limited purpose of
supporting isolation alone.

General description

This is the result of development and maintenance of task isolation
functionality that originally started based on task isolation patch v15
and was later updated to include v16. It provided a predictable
environment for userspace tasks running on arm64 processors alongside a
full-featured Linux environment. It is intended to provide a reliable,
interruption-free environment from the point when a userspace task enters
isolation until the moment it leaves isolation or receives a signal
intentionally sent to it, and was successfully used for this purpose.

While CPU isolation with nohz provides an environment that is close to
this requirement, the remaining IPIs and other disturbances keep it from
being usable for tasks that require complete predictability of CPU timing.
This set of patches only covers the implementation of task isolation;
additional functionality, such as selective TLB flushes, may be
implemented later to avoid other kinds of disturbances that affect the
latency and performance of isolated tasks.

The userspace support and test program is now at
https://github.com/abelits/libtmc .
It was originally developed for earlier implementation, so it has some checks that may be redundant now but kept for compatibility. My thanks to Chris Metcalf for design and maintenance of the original task isolation patch, Francis Giraldeau and Yuri Norov for various contributions to this work, Frederic Weisbecker for his work on CPU isolation and housekeeping that made possible to remove some less elegant solutions that I had to devise for earlier, <4.17 kernels, and Nitesh Narayan Lal for adapting earlier patches related to interrupt and work distribution in presence of CPU isolation. -- Alex
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Sat, 2020-10-17 at 18:08 +0200, Thomas Gleixner wrote: > On Sat, Oct 17 2020 at 01:08, Alex Belits wrote: > > On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote: > > > On 10/4/20 7:14 PM, Frederic Weisbecker wrote: > > I think that the goal of "finding source of disturbance" interface > > is > > different from what can be accomplished by tracing in two ways: > > > > 1. "Source of disturbance" should provide some useful information > > about > > category of event and it cause as opposed to determining all > > precise > > details about things being called that resulted or could result in > > disturbance. It should not depend on the user's knowledge about > > details > > Tracepoints already give you selectively useful information. Carefully placed tracepoints also can give the user information about failures of open(), write(), execve() or mmap(). However syscalls still provide an error code instead of returning generic failure and letting user debug the cause. -- Alex
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote: > On 10/4/20 7:14 PM, Frederic Weisbecker wrote: > > On Sun, Oct 04, 2020 at 02:44:39PM +0000, Alex Belits wrote: > > > On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote: > > > > External Email > > > > > > > > - > > > > ------ > > > > --- > > > > On Wed, Jul 22, 2020 at 02:49:49PM +, Alex Belits wrote: > > > > > +/* > > > > > + * Description of the last two tasks that ran isolated on a > > > > > given > > > > > CPU. > > > > > + * This is intended only for messages about isolation > > > > > breaking. We > > > > > + * don't want any references to actual task while accessing > > > > > this > > > > > from > > > > > + * CPU that caused isolation breaking -- we know nothing > > > > > about > > > > > timing > > > > > + * and don't want to use locking or RCU. > > > > > + */ > > > > > +struct isol_task_desc { > > > > > + atomic_t curr_index; > > > > > + atomic_t curr_index_wr; > > > > > + boolwarned[2]; > > > > > + pid_t pid[2]; > > > > > + pid_t tgid[2]; > > > > > + charcomm[2][TASK_COMM_LEN]; > > > > > +}; > > > > > +static DEFINE_PER_CPU(struct isol_task_desc, > > > > > isol_task_descs); > > > > So that's quite a huge patch that would have needed to be split > > > > up. > > > > Especially this tracing engine. > > > > > > > > Speaking of which, I agree with Thomas that it's unnecessary. > > > > It's > > > > too much > > > > code and complexity. We can use the existing trace events and > > > > perform > > > > the > > > > analysis from userspace to find the source of the disturbance. > > > The idea behind this is that isolation breaking events are > > > supposed to > > > be known to the applications while applications run normally, and > > > they > > > should not require any analysis or human intervention to be > > > handled. > > Sure but you can use trace events for that. Just trace interrupts, > > workqueues, > > timers, syscalls, exceptions and scheduler events and you get all > > the local > > disturbance. You might want to tune a few filters but that's pretty > > much it. > > > > As for the source of the disturbances, if you really need that > > information, > > you can trace the workqueue and timer queue events and just filter > > those that > > target your isolated CPUs. > > > > I agree that we can do all those things with tracing. > However, IMHO having a simplified logging mechanism to gather the > source of > violation may help in reducing the manual effort. > > Although, I am not sure how easy will it be to maintain such an > interface > over time. I think that the goal of "finding source of disturbance" interface is different from what can be accomplished by tracing in two ways: 1. "Source of disturbance" should provide some useful information about category of event and it cause as opposed to determining all precise details about things being called that resulted or could result in disturbance. It should not depend on the user's knowledge about details of implementations, it should provide some definite answer of what happened (with whatever amount of details can be given in a generic mechanism) even if the user has no idea how those things happen and what part of kernel is responsible for either causing or processing them. Then if the user needs further details, they can be obtained with tracing. 2. It should be usable as a runtime error handling mechanism, so the information it provides should be suitable for application use and logging. 
It should be usable when applications are running on a system in production, and no specific tracing or monitoring mechanism can be in use. If, say, thousands of devices are controlling neutrino detectors on an ocean floor, and in a month of work one of them got one isolation breaking event, it should be able to report that isolation was broken by an interrupt from a network interface, so the users will be able to track it down to some userspace application reconfiguring those interrupts. It will be a good idea to make such mechanism optional and suitable for tracking things on conditions other than "always enabled" and "enabled with task isolation". However in my opinion, there should be something in kernel entry procedure that, if enabled, prepared something to be filled by the cause data, and we know at least one such situation when this kernel entry procedure should be triggered -- when task isolation is on. -- Alex
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Tue, 2020-10-06 at 12:35 +0200, Frederic Weisbecker wrote: > On Mon, Oct 05, 2020 at 02:52:49PM -0400, Nitesh Narayan Lal wrote: > > On 10/4/20 7:14 PM, Frederic Weisbecker wrote: > > > On Sun, Oct 04, 2020 at 02:44:39PM +, Alex Belits wrote: > > > > > > > The idea behind this is that isolation breaking events are > > > > supposed to > > > > be known to the applications while applications run normally, > > > > and they > > > > should not require any analysis or human intervention to be > > > > handled. > > > Sure but you can use trace events for that. Just trace > > > interrupts, workqueues, > > > timers, syscalls, exceptions and scheduler events and you get all > > > the local > > > disturbance. You might want to tune a few filters but that's > > > pretty much it. > > formation, > > > you can trace the workqueue and timer queue events and just > > > filter those that > > > target your isolated CPUs. > > > > > > > I agree that we can do all those things with tracing. > > However, IMHO having a simplified logging mechanism to gather the > > source of > > violation may help in reducing the manual effort. > > > > Although, I am not sure how easy will it be to maintain such an > > interface > > over time. > > The thing is: tracing is your simplified logging mechanism here. You > can achieve > the same in userspace with _way_ less code, no race, and you can do > it in > bash. The idea is that this mechanism should be usable when no one is there to run things in bash, or no information about what might happen. It should be able to report rare events in production when users may not be able to reproduce them. -- Alex
Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Tue, 2020-10-06 at 23:41 +0200, Frederic Weisbecker wrote:
> On Sun, Oct 04, 2020 at 03:22:09PM +0000, Alex Belits wrote:
> > On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote:
> > > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > > >  */
> > > > void tick_nohz_full_kick_cpu(int cpu)
> > > > {
> > > > -	if (!tick_nohz_full_cpu(cpu))
> > > > +	smp_rmb();
> > >
> > > What is it ordering?
> >
> > ll_isol_flags will be read in task_isolation_on_cpu(), that access
> > should be ordered against writing in task_isolation_kernel_enter(),
> > fast_task_isolation_cpu_cleanup() and task_isolation_start().
> >
> > Since task_isolation_on_cpu() is often called for multiple CPUs in a
> > sequence, it would be wasteful to include a barrier inside it.
>
> Then I think you meant a full barrier: smp_mb()

For a read-only operation? task_isolation_on_cpu() is the only place where
the per-cpu ll_isol_flags is accessed, read-only, from multiple CPUs. All
other access to ll_isol_flags is done from the local CPU, and writes are
followed by smp_mb(). There are no other dependencies here, except
operations that depend on the value returned from task_isolation_on_cpu().
If/when more flags are added, those rules will still be followed, because
the intention is to store the state of isolation and the phases of
entering/breaking/reporting it, which can only be updated from the local
CPUs.

> > > > +	if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
> > > > 		return;
> > >
> > > You can't simply ignore an IPI. There is always a reason for a
> > > nohz_full CPU to be kicked. Something triggered a tick dependency.
> > > It can be posix cpu timers for example, or anything.

This was added some time ago, when timers appeared and CPUs were kicked
seemingly out of nowhere. At that point, breaking posix timers when
running tasks that are not supposed to rely on posix timers was the least
problematic solution. From the user's point of view, in this case entering
isolation had an effect on the timer similar to the task exiting while the
timer is running.

Right now, there are still sources of superfluous calls to this, when
tick_nohz_full_kick_all() is used. If I am able to confirm that this is
the only problematic place, I would rather fix the calls to it, and make
this condition produce a warning.

This gives me an idea that if there will be a mechanism specifically for
reporting kernel entry and isolation breaking, maybe it should be possible
to add a distinction between:

1. isolation breaking that already happened upon kernel entry;

2. performing an operation that will immediately and synchronously cause
isolation breaking;

3. operations or conditions that will eventually or asynchronously cause
isolation breaking (having timers running, and possibly sending signals,
should be in the same category).

This will be (2).

I assume that when reporting of isolation breaking is separated from the
isolation implementation, it will be implemented as a runtime error
condition reporting mechanism. Then it can be focused on providing
information about the category of events and their sources, and have
internal logic designed for that purpose, as opposed to being designed
entirely for debugging, providing flexibility and obtaining maximum
details about the internals involved.
> > > > I realize that this is unusual, however the idea is that while the > > task > > is running in isolated mode in userspace, we assume that from this > > CPUs > > point of view whatever is happening in kernel, can wait until CPU > > is > > back in kernel and when it first enters kernel from this mode, it > > should "catch up" with everything that happened in its absence. > > task_isolation_kernel_enter() is supposed to do that, so by the > > time > > anything should be done involving the rest of the kernel, CPU is > > back > > to normal. > > You can't assume that. If something needs the tick, this can't wait. > If the user did something wrong, such as setting a posix cpu timer > to an isolated task, that's his fault and the kernel has to stick > with > correctness and kick that task out of isolation mode. That would be true if not for the multiple "let's just tell all other CPUs that they should check if they have to update something" situations like the above. In the case of timers it's possible that I will be able to eliminate all specific instances where this is done; however, I think that as a general approach we have to establish some distinction between things that must cause an IPI (and
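For illustration of the three-way distinction referenced above, a purely hypothetical sketch; none of these names exist in the patch set:

/* Hypothetical classification of isolation-breaking causes; placeholder
 * names only, not part of the patch set. */
enum isol_break_class {
	ISOL_BREAK_HAPPENED,	/* (1) breaking already occurred on kernel entry */
	ISOL_BREAK_SYNC,	/* (2) operation that immediately and synchronously
				 *     breaks isolation, e.g. the kick discussed here */
	ISOL_BREAK_ASYNC,	/* (3) condition that will eventually or asynchronously
				 *     break isolation: pending timers, queued signals */
};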
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Mon, 2020-10-05 at 01:14 +0200, Frederic Weisbecker wrote: > Speaking of which, I agree with Thomas that it's unnecessary. > > > It's > > > too much > > > code and complexity. We can use the existing trace events and > > > perform > > > the > > > analysis from userspace to find the source of the disturbance. > > > > The idea behind this is that isolation breaking events are supposed > > to > > be known to the applications while applications run normally, and > > they > > should not require any analysis or human intervention to be > > handled. > > Sure but you can use trace events for that. Just trace interrupts, > workqueues, > timers, syscalls, exceptions and scheduler events and you get all the > local > disturbance. You might want to tune a few filters but that's pretty > much it. And keep all tracing enabled all the time, just to be able to figure out that disturbance happened at all? Or do you mean that we can use kernel entry mechanism to reliably determine that isolation breaking event happened (so the isolation- breaking procedure can be triggered as early as possible), yet avoid trying to determine why exactly it happened, and use tracing if we want to know? Original patch did the opposite, it triggered any isolation-breaking procedure only once it was known specifically, what kind of event happened -- a hardware interrupt, IPI, syscall, page fault, or any other kind of exception, possibly something architecture-specific. This, of course, always had a potential problem with coverage -- if handling of something is missing, isolation breaking is not handled at all, and there is no obvious way of finding if we covered everything. This also made the patch large and somewhat ugly. When I have added a mechanism for low-level isolation breaking handling on kernel entry, it also partially improved the problem with completeness. Partially because I have not yet added handling of "unknown cause" before returning to userspace, however that would be a logical thing to do. Then if we entered kernel from isolation, did something, and are returning to userspace still not knowing what kind of isolation-breaking event happened, we can still trigger isolation breaking. Did I get it right, and you mean that we can remove all specific handling of isolation breaking causes, except for syscall that exits isolation, and report isolation breaking instead of normally returning to userspace? Then isolation breaking will be handled reliably without knowing the cause, and we can leave determining the cause to the tracing mechanism (if enabled)? This does make sense. However for me it looks somewhat strange, because I assume isolation breaking to be a kind of runtime error, that userspace software is supposed to get some basic information about -- like, signals distinguishing between, say, SIGSEGV and SIGPIPE, or write() being able to set errno to ENOSPC or EIO. Then userspace receives basic information about the cause of exception or error, and can do some meaningful reporting, or decide if the error should be fatal for the application or handled differently, based on its internal logic. To get those distinctions, application does not have to be aware of anything internal to the kernel. Similarly distinguishing between, say, a page fault, device interrupt and a timer may be important for a logic implemented in userspace, and I think, it may be nice to allow userspace to get this information immediately and without being aware of any additional details of kernel implementation. 
The current patch doesn't do this yet; however, the intention is to implement reliable isolation breaking by checking on userspace re-entry, plus make reporting of causes, if any were found, visible to the userspace in some convenient way. The part that determines the cause can be implemented separately from the isolation breaking mechanism. Then we can have isolation breaking on kernel entry (or potentially some other condition on kernel entry that requires logging the cause) enable reporting; then the reporting mechanism, if it exists, will fill in the blanks, and once either the cause is known or it's time to return to userspace, notification will be done with whatever information is available. For some in-depth analysis, if necessary for debugging the kernel, we can have tracing check if we are in this "suspicious kernel entry" mode, and log things that otherwise would not be. > As for the source of the disturbances, if you really need that > information, > you can trace the workqueue and timer queue events and just filter > those that > target your isolated CPUs. For the purpose of a human debugging the kernel or application, more information is (usually) better, so the only concern here is that now the user is responsible for the completeness of what he is tracing. However, from the application's point of view, or for logging in a production environment, it's usually more important to get the general type of events, so it's possible to, say, confirm that no
Re: [EXT] Re: [PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
On Thu, 2020-10-01 at 16:47 +0200, Frederic Weisbecker wrote: > External Email > > --- > --- > On Wed, Jul 22, 2020 at 02:58:24PM +0000, Alex Belits wrote: > > From: Yuri Norov > > > > If CPU runs isolated task, there's no any backlog on it, and > > so we don't need to flush it. > > What guarantees that we have no backlog on it? I believe the logic was that it is not supposed to have a backlog, because one could not be produced while the CPU was in userspace: one has to enter the kernel to receive (by interrupt) or send (by syscall) anything. Now, looking at this patch, I don't think it can be guaranteed that there was no backlog before it entered userspace. Then backlog processing will be delayed until exit from isolation. It won't be queued, and flush_work() will not wait when no worker is assigned, so there won't be a deadlock; however, this delay may not be such a great idea. So it may be better to flush the backlog before entering isolation, and in flush_all_backlogs(), instead of skipping all CPUs in isolated mode, check if their per-CPU softnet_data->input_pkt_queue and softnet_data->process_queue are empty, and if they are not, flush the backlog anyway. Then, if for whatever reason a backlog appears after flushing (we can't guarantee that nothing preempted us then), it will cause one isolation-breaking event, and if nothing is queued before re-entering isolation, there will be no backlog until exiting isolation. > > > Currently flush_all_backlogs() > > enqueues corresponding work on all CPUs including ones that run > > isolated tasks. It leads to breaking task isolation for nothing. > > > > In this patch, backlog flushing is enqueued only on non-isolated > > CPUs. > > > > Signed-off-by: Yuri Norov > > [abel...@marvell.com: use safe task_isolation_on_cpu() > > implementation] > > Signed-off-by: Alex Belits > > --- > > net/core/dev.c | 7 ++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 90b59fc50dc9..83a282f7453d 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -74,6 +74,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void) > > > > get_online_cpus(); > > > > - for_each_online_cpu(cpu) > > + smp_rmb(); > > What is it ordering? Same as with other calls to task_isolation_on_cpu(cpu), it orders access to ll_isol_flags. > > + for_each_online_cpu(cpu) { > > + if (task_isolation_on_cpu(cpu)) > > + continue; > > queue_work_on(cpu, system_highpri_wq, > > per_cpu_ptr(&flush_works, cpu)); > > + } > > > > for_each_online_cpu(cpu) > > flush_work(per_cpu_ptr(&flush_works, cpu)); > > Thanks.
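The alternative described above could look roughly like the following inside flush_all_backlogs(); this is a minimal sketch under the assumption that a lockless emptiness check of the two softnet_data queues is acceptable here, not a tested patch:

	for_each_online_cpu(cpu) {
		struct softnet_data *sd = &per_cpu(softnet_data, cpu);

		/* Skip an isolated CPU only if it really has nothing queued. */
		if (task_isolation_on_cpu(cpu) &&
		    skb_queue_empty_lockless(&sd->input_pkt_queue) &&
		    skb_queue_empty_lockless(&sd->process_queue))
			continue;

		queue_work_on(cpu, system_highpri_wq,
			      per_cpu_ptr(&flush_works, cpu));
	}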
Re: [EXT] Re: [PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Thu, 2020-10-01 at 16:44 +0200, Frederic Weisbecker wrote: > External Email > > --- > --- > On Wed, Jul 22, 2020 at 02:57:33PM +0000, Alex Belits wrote: > > From: Yuri Norov > > > > For nohz_full CPUs the desirable behavior is to receive interrupts > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's > > obviously not desirable because it breaks isolation. > > > > This patch adds check for it. > > > > Signed-off-by: Yuri Norov > > [abel...@marvell.com: updated, only exclude CPUs running isolated > > tasks] > > Signed-off-by: Alex Belits > > --- > > kernel/time/tick-sched.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c > > index 6e4cd8459f05..2f82a6daf8fc 100644 > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -20,6 +20,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) > > */ > > void tick_nohz_full_kick_cpu(int cpu) > > { > > - if (!tick_nohz_full_cpu(cpu)) > > + smp_rmb(); > > What is it ordering? ll_isol_flags will be read in task_isolation_on_cpu(), that access should be ordered against writing in task_isolation_kernel_enter(), fast_task_isolation_cpu_cleanup() and task_isolation_start(). Since task_isolation_on_cpu() is often called for multiple CPUs in a sequence, it would be wasteful to include a barrier inside it. > > + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) > > return; > > You can't simply ignore an IPI. There is always a reason for a > nohz_full CPU > to be kicked. Something triggered a tick dependency. It can be posix > cpu timers > for example, or anything. I realize that this is unusual; however, the idea is that while the task is running in isolated mode in userspace, we assume that from this CPU's point of view whatever is happening in the kernel can wait until the CPU is back in the kernel, and when it first enters the kernel from this mode, it should "catch up" with everything that happened in its absence. task_isolation_kernel_enter() is supposed to do that, so by the time anything should be done involving the rest of the kernel, the CPU is back to normal. It is the application's responsibility to avoid triggering things that break its isolation, so the application assumes that everything that involves entering the kernel will not be available while it is isolated. If isolation is broken, or the application requests a return from isolation, everything will go back to the normal environment with all functionality available. > > > > irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu); > > -- > > 2.26.2 > >
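A minimal sketch of the access rule described above; this is a simplified illustration of the convention, not a copy of the patch, and the helper names are placeholders:

#include <linux/percpu.h>
#include <linux/smp.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(unsigned long, ll_isol_flags);

/* Writer side: only ever the local CPU, each write followed by a full
 * barrier. */
static void ll_isol_flags_set(unsigned long flags)
{
	this_cpu_write(ll_isol_flags, flags);
	smp_mb();
}

/* Reader side: a remote CPU issues a single smp_rmb() before scanning a
 * batch of CPUs, instead of a barrier per task_isolation_on_cpu() call. */
static void scan_cpus_example(void)
{
	int cpu;

	smp_rmb();
	for_each_online_cpu(cpu) {
		if (per_cpu(ll_isol_flags, cpu) != 0)
			continue;	/* CPU is isolated, skip the kick */
		/* safe to kick or queue work on this CPU here */
	}
}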
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Thu, 2020-10-01 at 16:40 +0200, Frederic Weisbecker wrote: > External Email > > --- > --- > On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote: > > +/** > > + * task_isolation_kernel_enter() - clear low-level task isolation > > flag > > + * > > + * This should be called immediately after entering kernel. > > + */ > > +static inline void task_isolation_kernel_enter(void) > > +{ > > + unsigned long flags; > > + > > + /* > > +* This function runs on a CPU that ran isolated task. > > +* > > +* We don't want this CPU running code from the rest of kernel > > +* until other CPUs know that it is no longer isolated. > > +* When CPU is running isolated task until this point anything > > +* that causes an interrupt on this CPU must end up calling > > this > > +* before touching the rest of kernel. That is, this function > > or > > +* fast_task_isolation_cpu_cleanup() or stop_isolation() > > calling > > +* it. If any interrupt, including scheduling timer, arrives, > > it > > +* will still end up here early after entering kernel. > > +* From this point interrupts are disabled until all CPUs will > > see > > +* that this CPU is no longer running isolated task. > > +* > > +* See also fast_task_isolation_cpu_cleanup(). > > +*/ > > + smp_rmb(); > > I'm a bit confused what this read memory barrier is ordering. Also > against > what it pairs. My bad, I kept it even after all write accesses from other CPUs had been removed. > > > + if((this_cpu_read(ll_isol_flags) & FLAG_LL_TASK_ISOLATION) == > > 0) > > + return; > > + > > + local_irq_save(flags); > > + > > + /* Clear low-level flags */ > > + this_cpu_write(ll_isol_flags, 0); > > + > > + /* > > +* If something happened that requires a barrier that would > > +* otherwise be called from remote CPUs by CPU kick procedure, > > +* this barrier runs instead of it. After this barrier, CPU > > +* kick procedure would see the updated ll_isol_flags, so it > > +* will run its own IPI to trigger a barrier. > > +*/ > > + smp_mb(); > > + /* > > +* Synchronize instructions -- this CPU was not kicked while > > +* in isolated mode, so it might require synchronization. > > +* There might be an IPI if kick procedure happened and > > +* ll_isol_flags was already updated while it assembled a CPU > > +* mask. However if this did not happen, synchronize everything > > +* here. > > +*/ > > + instr_sync(); > > It's the first time I meet an instruction barrier. I should get > information > about that but what is it ordering here? Against barriers in instruction cache flushing (flush_icache_range() and such). > > + local_irq_restore(flags); > > +} > > Thanks.
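For context, the pairing referred to here is with the usual code-patching sequence, whose sending side looks roughly like this (a sketch of existing kernel practice, not new code from the series):

#include <linux/smp.h>
#include <asm/cacheflush.h>

static void patch_and_sync(void *addr, size_t len)
{
	/* ... modify instructions at addr ... */
	flush_icache_range((unsigned long)addr, (unsigned long)addr + len);

	/*
	 * Force every CPU to execute a context-synchronizing instruction.
	 * With patch 13/13 isolated CPUs are excluded from this kick; they
	 * execute instr_sync() (isb() / sync_core()) themselves in
	 * task_isolation_kernel_enter() on their next kernel entry.
	 */
	kick_all_cpus_sync();
}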
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Thu, 2020-10-01 at 15:56 +0200, Frederic Weisbecker wrote: > External Email > > --- > --- > On Wed, Jul 22, 2020 at 02:49:49PM +0000, Alex Belits wrote: > > +/* > > + * Description of the last two tasks that ran isolated on a given > > CPU. > > + * This is intended only for messages about isolation breaking. We > > + * don't want any references to actual task while accessing this > > from > > + * CPU that caused isolation breaking -- we know nothing about > > timing > > + * and don't want to use locking or RCU. > > + */ > > +struct isol_task_desc { > > + atomic_t curr_index; > > + atomic_t curr_index_wr; > > + boolwarned[2]; > > + pid_t pid[2]; > > + pid_t tgid[2]; > > + charcomm[2][TASK_COMM_LEN]; > > +}; > > +static DEFINE_PER_CPU(struct isol_task_desc, isol_task_descs); > > So that's quite a huge patch that would have needed to be split up. > Especially this tracing engine. > > Speaking of which, I agree with Thomas that it's unnecessary. It's > too much > code and complexity. We can use the existing trace events and perform > the > analysis from userspace to find the source of the disturbance. The idea behind this is that isolation breaking events are supposed to be known to the applications while applications run normally, and they should not require any analysis or human intervention to be handled. A process may exit isolation because some leftover delayed work, for example, a timer or a workqueue, is still present on a CPU, or because a page fault or some other exception, normally handled silently, is caused by the task. It is also possible to direct an interrupt to a CPU that is running an isolated task -- currently it's perfectly valid to set interrupt smp affinity to a CPU running isolated task, and then interrupt will cause breaking isolation. While it's probably not the best way of handling interrupts, I would rather not prohibit this explicitly. There is also a matter of avoiding race conditions on entering isolation. Once CPU entered isolation, other CPUs should avoid disturbing it when they know that CPU is running a task in isolated mode. However for a short time after entering isolation other CPUs may be unaware of this, and will still send IPIs to it. Preventing this scenario completely would be very costly in terms of what other CPUs will have to do before notifying others, so similar to how EINTR works, we can simply specify that this is allowed, and task is supposed to re- enter isolation after this. It's still a bad idea to specify that isolation breaking can continue happening while application is running in isolated mode, however allowing some "grace period" after entering is acceptable as long as application is aware of this happening. In libtmc I have moved this handling of isolation breaking into a separate thread, intended to become a separate daemon if necessary. In part it was done because initial implementation of isolation made it very difficult to avoid repeating delayed work on isolated CPUs, so something had to watch for it from non-isolated CPU. It's possible that now, when delayed work does not appear on isolated CPUs out of nowhere, the need in isolation manager thread will disappear, and task itself will be able to handle all isolation breaking, like original implementation by Chris was supposed to. However in either case it's still useful for the task, or isolation manager, to get a description of the isolation-breaking event. This is what those things are intended for. 
Now they only produce log messages, because this is where all descriptions of isolation-breaking events initially went; however, I would prefer to make logging optional but always let applications read those event descriptions, regardless of any tracing mechanism being used. I was more focused on making the reporting mechanism properly detect the cause of isolation breaking, because that functionality was not quite working in the earlier work by Chris and Yuri, so I have kept logging as the only output, but made it suitable for producing events that applications will be able to receive. The application, or the isolation manager, will receive clear and unambiguous reporting, so there will be no need for any additional analysis or guesswork. After adding proper "low-level" isolation flags, I got the idea that we might have an even better reporting mechanism. Early isolation-breaking detection on kernel entry may set a flag that says that isolation breaking happened but its cause is unknown. Or, more likely, only some general information about isolation breaking is available, like a type of exception. Then, once a known isolation-breaking reporting mechanism is called from interrupt, syscall
Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote: > External Email > > --- > --- > Alex Belits writes: > > On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote: > > > 'What does noinstr mean? and why do we have it" -- don't dare > > > touch > > > the > > > entry code until you can answer that. > > > > noinstr disables instrumentation, so there would not be calls and > > dependencies on other parts of the kernel when it's not yet safe to > > call them. Relevant functions already have it, and I add an inline > > call > > to perform flags update and synchronization. Unless something else > > is > > involved, those operations are safe, so I am not adding anything > > that > > can break those. > > Sure. > > 1) That inline function can be put out of line by the compiler and > placed into the regular text section which makes it subject to > instrumentation > > 2) That inline function invokes local_irq_save() which is subject to > instrumentation _before_ the entry state for the instrumentation > mechanisms is established. > > 3) That inline function invokes sync_core() before important state > has > been established, which is especially interesting in NMI like > exceptions. > > As you clearly documented why all of the above is safe and does not > cause any problems, it's just me and Peter being silly, right? > > Try again. I don't think, accusations and mockery are really necessary here. I am trying to do the right thing here. In particular, I am trying to port the code that was developed on platforms that have not yet implemented those useful instrumentation safety features of x86 arch support. For most of the development time I had to figure out, where the synchronization can be safely inserted into kernel entry code on three platforms and tens of interrupt controller drivers, with some of those presenting unusual exceptions (forgive me the pun) from platform- wide conventions. I really appreciate the work you did cleaning up kernel entry procedures, my 5.6 version of this patch had to follow a much more complex and I would say, convoluted entry handling on x86, and now I don't have to do that, thanks to you. Unfortunately, most of my mental effort recently had to be spent on three things: 1. (small): finding a way to safely enable events and synchronize state on kernel entry, so it will not have a race condition between isolation-breaking kernel entry and an event that was disabled while the task was isolated. 2. (big): trying to derive any useful rules applicable to kernel entry in various architectures, finding that there is very little consistency across architectures, and whatever exists, can be broken by interrupt controller drivers that don't all follow the same rules as the rest of the platform. 3. (medium): introducing calls to synchronization on all kernel entry procedures, in places where it is guaranteed to not normally yet have done any calls to parts of the kernel that may be affected by "stale" state, and do it in a manner as consistent and generalized as possible. The current state of kernel entry handling on arm and arm64 architectures has significant differences from x86 and from each other. There is also a matter of interrupt controllers. As can be seen in interrupt controller-specific patch, I had to accommodate some variety of custom interrupt entry code. 
What cannot be seen is that I had to check all other interrupt controller drivers and architecture-specific entry procedures, and find that they _do_ follow some understandable rules -- unfortunately architecture-specific and not documented in any manner. I have no valid reasons for complaining about it. I could not expect that authors of all kernel entry procedures would have any foreknowledge that someone at some point may have a reason to establish any kind of synchronization point for CPU cores. And this is why I had to do my research by manually drawing call trees and sequences, separately for every entry on every supported architecture, and across two or three versions of the kernel, as those were changing along the way. The result of this may not be a "design" per se, but an understanding of how things are implemented, and what rules are being followed, so I could add my code in a manner consistent with what is done, and document the whole thing. Then there will be some written rules to check against when anything of this kind is necessary again (say, with TLB, but considering how much now is done in userspace, possibly to accommodate more exotic CPU features that may have state messed up by userspace). I am afraid this task, kernel entry documentation, would take me some
Re: [PATCH v4 00/13] "Task_isolation" mode
On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote: > > 'What does noinstr mean? and why do we have it" -- don't dare touch > the > entry code until you can answer that. noinstr disables instrumentation, so there would not be calls and dependencies on other parts of the kernel when it's not yet safe to call them. Relevant functions already have it, and I add an inline call to perform flags update and synchronization. Unless something else is involved, those operations are safe, so I am not adding anything that can break those. -- Alex
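As a minimal illustration of that rule (not code from the patch set), a noinstr function has to bracket any instrumentable work explicitly:

/* Runs before the entry state for tracing/lockdep/KASAN is fully
 * established, so by default nothing here may be instrumented. */
static noinstr void example_entry_hook(void)
{
	/* only instrumentation-safe operations up to this point */

	instrumentation_begin();
	trace_hardirqs_off_finish();	/* instrumentable calls are allowed here */
	instrumentation_end();
}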
Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
On Thu, 2020-07-23 at 17:48 +0200, Peter Zijlstra wrote: > On Thu, Jul 23, 2020 at 03:41:46PM +0000, Alex Belits wrote: > > On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote: > > > . > > > > > > This.. as presented it is an absolutely unreviewable pile of > > > junk. It > > > presents code without any coherent problem description and > > > analysis. > > > And > > > the patches are not split sanely either. > > > > There is a more complete and slightly outdated description in the > > previous version of the patch at > > https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/ > > > > Not the point, you're mixing far too many things in one go. You also > have the patches split like 'generic / arch-1 / arch-2' which is > wrong > per definition, as patches should be split per change and not care > about > silly boundaries. This follows the original patch by Chris Metcalf. There is a reason for that -- per-architecture changes are independent from each other and affect not just code but functionality that was implemented per-architecture. To support more architectures, it will be necessary to do it separately for each, and mark them supported with HAVE_ARCH_TASK_ISOLATION. Having only some architectures supported does not break anything for the rest -- architectures that are not covered would not have this functionality. > > Also, if you want generic entry code, there's patches for that here: > > > https://lkml.kernel.org/r/20200722215954.464281930@linutronix.de > > > That looks useful. Why didn't Thomas Gleixner mention it in his criticism of my approach if he already solved that exact problem, at least for x86? -- Alex
Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
On Thu, 2020-07-23 at 16:29 +0200, Peter Zijlstra wrote: > . > > This.. as presented it is an absolutely unreviewable pile of junk. It > presents code without any coherent problem description and analysis. > And > the patches are not split sanely either. There is a more complete and slightly outdated description in the previous version of the patch at https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.camel@marvell.com/ . It allows a userspace application to take a CPU core for itself and run completely isolated, with no disturbances. There is work in progress that also disables and re-enables TLB flushes, and depending on the CPU it may be possible to also pre-allocate cache, so it would not be affected by the rest of the system. Events that cause interaction with the isolated task cause isolation breaking, turning the task into a regular userspace task that can continue running normally and enter the isolated state again if necessary. To make this feature suitable for any practical use, many mechanisms that normally would cause events on a CPU should exclude CPU cores in this state, and synchronization should happen later, at the time of isolation breaking. There are three architectures supported, x86, arm and arm64, and it should be possible to extend it to others. Unfortunately, kernel entry procedures are neither unified nor straightforward, so introducing a new feature to them can look like a mess. -- Alex
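From the application side, the usage model described above looks roughly like the sketch below. The prctl() command and flag names (PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) and the value 48 are placeholders standing in for the interface of the series, and CPU 3 is an arbitrary choice; everything not shown in the patches should be treated as an assumption:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Placeholder names; the real constants come from the patch set. */
#ifndef PR_TASK_ISOLATION
#define PR_TASK_ISOLATION		48
#define PR_TASK_ISOLATION_ENABLE	(1 << 0)
#endif

int main(void)
{
	cpu_set_t set;

	/* Pin to the nohz_full/isolated CPU. */
	CPU_ZERO(&set);
	CPU_SET(3, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* Request isolation; after this, any kernel entry is an
	 * isolation-breaking event. */
	if (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0)) {
		perror("prctl");
		return 1;
	}

	/* Pure userspace loop: poll shared memory or device rings, make no
	 * syscalls. */
	for (;;)
		;
}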
Re: [PATCH v4 00/13] "Task_isolation" mode
On Thu, 2020-07-23 at 15:17 +0200, Thomas Gleixner wrote: > > Without going into details of the individual patches, let me give you a > high level view of this series: > > 1) Entry code handling: > > That's completely broken vs. the careful ordering and instrumentation > protection of the entry code. You can't just slap stuff randomly > into places which you think are safe w/o actually trying to understand > why this code is ordered in the way it is. > > This clearly was never built and tested with any of the relevant > debug options enabled. Both build and boot would have told you. This is intended to avoid a race condition when entry or exit from isolation happens at the same time as an event that requires synchronization. The idea is, it is possible to insulate the core from all events while it is running isolated task in userspace, it will receive those calls normally after breaking isolation and entering kernel, and it will synchronize itself on kernel entry. This has two potential problems that I am trying to solve: 1. Without careful ordering, there will be a race condition with events that happen at the same time as kernel entry or exit. 2. CPU runs some kernel code after entering but before synchronization. This code should be restricted to early entry that is not affected by the "stale" state, similar to how IPI code that receives synchronization events does it normally. I can't say that I am completely happy with the amount of kernel entry handling that had to be added. The problem is, I am trying to introduce a feature that allows CPU cores to go into "de-synchronized" state while running isolated tasks and not receiving synchronization events that normally would reach them. This means, there should be established some point on kernel entry when it is safe for the core to catch up with the rest of kernel. It may be useful for other purposes, however at this point task isolation is the first to need it, so I had to determine where such point is for every supported architecture and method of kernel entry. I have found that each architecture has its own way of handling this, and sometimes individual interrupt controller drivers vary in their sequence of calls on early kernel entry. For x86 I also have an implementation for kernel 5.6, before your changes to IDT macros. That version is much less straightforward, so I am grateful for those relatively recent improvements. Nevertheless, I believe that the goal of finding those points and using them for synchronization is valid. If you can recommend me a better way for at least x86, I will be happy to follow your advice. I have tried to cover kernel entry in a generic way while making the changes least disruptive, and this is why it looks simple and spread over multiple places. I also had to do the same for arm and arm64 (that I use for development), and for each architecture I had to produce sequences of entry points and function calls to determine the correct placement of task_isolation_enter() calls in them. It is not random, however it does reflect the complex nature of kernel entry code. I believe, RCU implementation faced somewhat similar requirements for calls on kernel entry, however it is not completely unified, either > 2) Instruction synchronization > Trying to do instruction synchronization delayed is a clear recipe > for hard to diagnose failures. Just because it blew not up in your > face does not make it correct in any way. 
It's broken by design and > violates _all_ rules of safe instruction patching and introduces a > complete trainwreck in x86 NMI processing. The idea is that just like synchronization events are handled by regular IPI, we already use some code with the assumption that it is safe to be entered in "stale" state before synchronization. I have extended it to allow synchronization points on all kernel entry points. > If you really think that this is correct, then please have at least > the courtesy to come up with a detailed and precise argumentation > why this is a valid approach. > > While writing that up you surely will find out why it is not. > I had to document a sequence of calls for every entry point on three supported architectures, to determine the points for synchronization. It is possible that I have somehow missed something, however I don't see a better approach, save for establishing a kernel-wide infrastructure for this. And even if we did just that, it would be possible to implement this kind of synchronization point calls first, and convert them to something more generic later. > > 3) Debug calls > > Sprinkling debug calls around the codebase randomly is not going to > happen. That's an unmaintainable mess. Those report isolation breaking causes, and are intended for application and system debugging. > > Aside of that none of these dmesg based debug things is necessary. > This can simply be monito
[PATCH 13/13] task_isolation: kick_all_cpus_sync: don't kick isolated cpus
From: Yuri Norov Make sure that kick_all_cpus_sync() does not call CPUs that are running isolated tasks. Signed-off-by: Yuri Norov [abel...@marvell.com: use safe task_isolation_cpumask() implementation] Signed-off-by: Alex Belits --- kernel/smp.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/smp.c b/kernel/smp.c index 6a6849783948..ff0d95db33b3 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -803,9 +803,21 @@ static void do_nothing(void *unused) */ void kick_all_cpus_sync(void) { + struct cpumask mask; + /* Make sure the change is visible before we kick the cpus */ smp_mb(); - smp_call_function(do_nothing, NULL, 1); + + preempt_disable(); +#ifdef CONFIG_TASK_ISOLATION + cpumask_clear(&mask); + task_isolation_cpumask(&mask); + cpumask_complement(&mask, &mask); +#else + cpumask_setall(&mask); +#endif + smp_call_function_many(&mask, do_nothing, NULL, 1); + preempt_enable(); } EXPORT_SYMBOL_GPL(kick_all_cpus_sync); -- 2.26.2
[PATCH v4 12/13] task_isolation: ringbuffer: don't interrupt CPUs running isolated tasks on buffer resize
From: Yuri Norov CPUs running isolated tasks are in userspace, so they don't have to perform ring buffer updates immediately. If ring_buffer_resize() schedules the update on those CPUs, isolation is broken. To prevent that, updates for CPUs running isolated tasks are performed locally, like for offline CPUs. A race condition between this update and isolation breaking is avoided at the cost of disabling per_cpu buffer writing for the time of update when it coincides with isolation breaking. Signed-off-by: Yuri Norov [abel...@marvell.com: updated to prevent race with isolation breaking] Signed-off-by: Alex Belits --- kernel/trace/ring_buffer.c | 63 ++ 1 file changed, 57 insertions(+), 6 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 00867ff82412..22d4731f0def 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -1705,6 +1706,38 @@ static void update_pages_handler(struct work_struct *work) complete(&cpu_buffer->update_done); } +static bool update_if_isolated(struct ring_buffer_per_cpu *cpu_buffer, + int cpu) +{ + bool rv = false; + + smp_rmb(); + if (task_isolation_on_cpu(cpu)) { + /* +* CPU is running isolated task. Since it may lose +* isolation and re-enter kernel simultaneously with +* this update, disable recording until it's done. +*/ + atomic_inc(&cpu_buffer->record_disabled); + /* Make sure, update is done, and isolation state is current */ + smp_mb(); + if (task_isolation_on_cpu(cpu)) { + /* +* If CPU is still running isolated task, we +* can be sure that breaking isolation will +* happen while recording is disabled, and CPU +* will not touch this buffer until the update +* is done. +*/ + rb_update_pages(cpu_buffer); + cpu_buffer->nr_pages_to_update = 0; + rv = true; + } + atomic_dec(&cpu_buffer->record_disabled); + } + return rv; +} + /** * ring_buffer_resize - resize the ring buffer * @buffer: the buffer to resize. @@ -1794,13 +1827,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, if (!cpu_buffer->nr_pages_to_update) continue; - /* Can't run something on an offline CPU. */ + /* +* Can't run something on an offline CPU. +* +* CPUs running isolated tasks don't have to +* update ring buffers until they exit +* isolation because they are in +* userspace. Use the procedure that prevents +* race condition with isolation breaking. +*/ if (!cpu_online(cpu)) { rb_update_pages(cpu_buffer); cpu_buffer->nr_pages_to_update = 0; } else { - schedule_work_on(cpu, - &cpu_buffer->update_pages_work); + if (!update_if_isolated(cpu_buffer, cpu)) + schedule_work_on(cpu, + &cpu_buffer->update_pages_work); } } @@ -1849,13 +1891,22 @@ int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size, get_online_cpus(); - /* Can't run something on an offline CPU. */ + /* +* Can't run something on an offline CPU. +* +* CPUs running isolated tasks don't have to update +* ring buffers until they exit isolation because they +* are in userspace. Use the procedure that prevents +* race condition with isolation breaking. +*/ if (!cpu_online(cpu_id)) rb_update_pages(cpu_buffer); else { - schedule_work_on(cpu_id, + if (!update_if_isolated(cpu_buffer, cpu_id)) + schedule_work_on(cpu_id, &cpu_buffer->update_pages_work); - wait_for_completion(&cpu_buf
[PATCH v4 11/13] task_isolation: net: don't flush backlog on CPUs running isolated tasks
From: Yuri Norov If CPU runs isolated task, there's no any backlog on it, and so we don't need to flush it. Currently flush_all_backlogs() enqueues corresponding work on all CPUs including ones that run isolated tasks. It leads to breaking task isolation for nothing. In this patch, backlog flushing is enqueued only on non-isolated CPUs. Signed-off-by: Yuri Norov [abel...@marvell.com: use safe task_isolation_on_cpu() implementation] Signed-off-by: Alex Belits --- net/core/dev.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 90b59fc50dc9..83a282f7453d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -74,6 +74,7 @@ #include #include #include +#include #include #include #include @@ -5624,9 +5625,13 @@ static void flush_all_backlogs(void) get_online_cpus(); - for_each_online_cpu(cpu) + smp_rmb(); + for_each_online_cpu(cpu) { + if (task_isolation_on_cpu(cpu)) + continue; queue_work_on(cpu, system_highpri_wq, per_cpu_ptr(&flush_works, cpu)); + } for_each_online_cpu(cpu) flush_work(per_cpu_ptr(&flush_works, cpu)); -- 2.26.2
[PATCH v4 09/13] task_isolation: arch/arm: enable task isolation functionality
From: Francis Giraldeau This patch is a port of the task isolation functionality to the arm 32-bit architecture. The task isolation needs an additional thread flag that requires to change the entry assembly code to accept a bitfield larger than one byte. The constants _TIF_SYSCALL_WORK and _TIF_WORK_MASK are now defined in the literal pool. The rest of the patch is straightforward and reflects what is done on other architectures. To avoid problems with the tst instruction in the v7m build, we renumber TIF_SECCOMP to bit 8 and let TIF_TASK_ISOLATION use bit 7. Early kernel entry relies on task_isolation_kernel_enter(). vector_swi to label __sys_trace -> syscall_trace_enter() when task isolation is enabled, -> task_isolation_kernel_enter() nvic_handle_irq() -> handle_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter() __fiq_svc, __fiq_abt __fiq_usr -> handle_fiq_as_nmi() -> uses nmi_enter() / nmi_exit() __irq_svc -> irq_handler __irq_usr -> irq_handler irq_handler -> (handle_arch_irq or (arch_irq_handler_default -> (asm_do_IRQ() -> __handle_domain_irq()) or do_IPI() -> handle_IPI()) asm_do_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter() do_IPI() -> handle_IPI() -> task_isolation_kernel_enter() handle_arch_irq for arm-specific controllers calls (handle_IRQ() -> __handle_domain_irq() -> task_isolation_kernel_enter()) or (handle_domain_irq() -> __handle_domain_irq() -> task_isolation_kernel_enter()) Not covered: __dabt_svc -> dabt_helper __dabt_usr -> dabt_helper dabt_helper -> CPU_DABORT_HANDLER (cpu-specific) -> do_DataAbort or PROCESSOR_DABT_FUNC -> _data_abort (cpu-specific) -> do_DataAbort __pabt_svc -> pabt_helper __pabt_usr -> pabt_helper pabt_helper -> CPU_PABORT_HANDLER (cpu-specific) -> do_PrefetchAbort or PROCESSOR_PABT_FUNC -> _prefetch_abort (cpu-specific) -> do_PrefetchAbort Signed-off-by: Francis Giraldeau Signed-off-by: Chris Metcalf [with modifications] [abel...@marvell.com: modified for kernel 5.6, added isolation cleanup] Signed-off-by: Alex Belits --- arch/arm/Kconfig | 1 + arch/arm/include/asm/barrier.h | 2 ++ arch/arm/include/asm/thread_info.h | 10 +++--- arch/arm/kernel/entry-common.S | 15 ++- arch/arm/kernel/ptrace.c | 12 arch/arm/kernel/signal.c | 13 - arch/arm/kernel/smp.c | 6 ++ arch/arm/mm/fault.c| 8 +++- 8 files changed, 57 insertions(+), 10 deletions(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 2ac74904a3ce..f06d0e0e4fe9 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -67,6 +67,7 @@ config ARM select HAVE_ARCH_KGDB if !CPU_ENDIAN_BE32 && MMU select HAVE_ARCH_MMAP_RND_BITS if MMU select HAVE_ARCH_SECCOMP_FILTER if AEABI && !OABI_COMPAT + select HAVE_ARCH_TASK_ISOLATION select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK select HAVE_ARM_SMCCC if CPU_V7 diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h index 83ae97c049d9..3c603df6c290 100644 --- a/arch/arm/include/asm/barrier.h +++ b/arch/arm/include/asm/barrier.h @@ -66,12 +66,14 @@ extern void arm_heavy_mb(void); #define wmb() __arm_heavy_mb(st) #define dma_rmb() dmb(osh) #define dma_wmb() dmb(oshst) +#define instr_sync() isb() #else #define mb() barrier() #define rmb() barrier() #define wmb() barrier() #define dma_rmb() barrier() #define dma_wmb() barrier() +#define instr_sync() barrier() #endif #define __smp_mb() dmb(ish) diff --git a/arch/arm/include/asm/thread_info.h b/arch/arm/include/asm/thread_info.h index 3609a6980c34..ec0f11e1bb4c 100644 --- a/arch/arm/include/asm/thread_info.h +++ 
b/arch/arm/include/asm/thread_info.h @@ -139,7 +139,8 @@ extern int vfp_restore_user_hwstate(struct user_vfp *, #define TIF_SYSCALL_TRACE 4 /* syscall trace active */ #define TIF_SYSCALL_AUDIT 5 /* syscall auditing active */ #define TIF_SYSCALL_TRACEPOINT 6 /* syscall tracepoint instrumentation */ -#define TIF_SECCOMP7 /* seccomp syscall filtering active */ +#define TIF_TASK_ISOLATION 7 /* task isolation enabled for task */ +#define TIF_SECCOMP8 /* seccomp syscall filtering active */ #define TIF_USING_IWMMXT 17 #define TIF_MEMDIE 18 /* is terminating due to OOM killer */ @@ -152,18 +153,21 @@ extern int vfp_restore_user_hwstate(struct user_vfp *, #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT) +#define _TIF_TASK_ISOLATION(1 << TIF_
[PATCH v4 10/13] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
From: Yuri Norov For nohz_full CPUs the desirable behavior is to receive interrupts generated by tick_nohz_full_kick_cpu(). But for hard isolation it's obviously not desirable because it breaks isolation. This patch adds check for it. Signed-off-by: Yuri Norov [abel...@marvell.com: updated, only exclude CPUs running isolated tasks] Signed-off-by: Alex Belits --- kernel/time/tick-sched.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index 6e4cd8459f05..2f82a6daf8fc 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void) */ void tick_nohz_full_kick_cpu(int cpu) { - if (!tick_nohz_full_cpu(cpu)) + smp_rmb(); + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) return; irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu); -- 2.26.2
[PATCH 08/13] task_isolation: arch/arm64: enable task isolation functionality
From: Chris Metcalf In do_notify_resume(), call task_isolation_start() for TIF_TASK_ISOLATION tasks. Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and define a local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we don't clear _TIF_TASK_ISOLATION in the loop. We instrument the smp_send_reschedule() routine so that it checks for isolated tasks and generates a suitable warning if needed. Finally, report on page faults in task-isolation processes in do_page_faults(). Early kernel entry code calls task_isolation_kernel_enter(). In particular: Vectors: el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter() el1_irq -> asm_nmi_enter(), handle_arch_irq() el1_error -> do_serror() el0_sync -> el0_sync_handler() el0_irq -> handle_arch_irq() el0_error -> do_serror() el0_sync_compat -> el0_sync_compat_handler() el0_irq_compat -> handle_arch_irq() el0_error_compat -> do_serror() SDEI entry: __sdei_asm_handler -> __sdei_handler() -> nmi_enter() Functions called from there: asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter() asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return() Handlers: do_serror() -> nmi_enter() -> task_isolation_kernel_enter() or task_isolation_kernel_enter() el1_sync_handler() -> task_isolation_kernel_enter() el0_sync_handler() -> task_isolation_kernel_enter() el0_sync_compat_handler() -> task_isolation_kernel_enter() handle_arch_irq() is irqchip-specific, most call handle_domain_irq() or handle_IPI() There is a separate patch for irqchips that do not follow this rule. handle_domain_irq() -> task_isolation_kernel_enter() handle_IPI() -> task_isolation_kernel_enter() nmi_enter() -> task_isolation_kernel_enter() Signed-off-by: Chris Metcalf [abel...@marvell.com: simplified to match kernel 5.6] Signed-off-by: Alex Belits --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/barrier.h | 2 ++ arch/arm64/include/asm/thread_info.h | 5 - arch/arm64/kernel/entry-common.c | 7 +++ arch/arm64/kernel/ptrace.c | 16 +++- arch/arm64/kernel/sdei.c | 2 ++ arch/arm64/kernel/signal.c | 13 - arch/arm64/kernel/smp.c | 9 + arch/arm64/mm/fault.c| 5 + 9 files changed, 57 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 66dc41fd49f2..96fefabfa10f 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -137,6 +137,7 @@ config ARM64 select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER select HAVE_ARCH_STACKLEAK + select HAVE_ARCH_TASK_ISOLATION select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_TRACEHOOK select HAVE_ARCH_TRANSPARENT_HUGEPAGE diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h index fb4c27506ef4..bf4a2adabd5b 100644 --- a/arch/arm64/include/asm/barrier.h +++ b/arch/arm64/include/asm/barrier.h @@ -48,6 +48,8 @@ #define dma_rmb() dmb(oshld) #define dma_wmb() dmb(oshst) +#define instr_sync() isb() + /* * Generate a mask for array_index__nospec() that is ~0UL when 0 <= idx < sz * and 0 otherwise. 
diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 5e784e16ee89..73269bb8a57d 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -67,6 +67,7 @@ void arch_release_task_struct(struct task_struct *tsk); #define TIF_FOREIGN_FPSTATE3 /* CPU's FP state is not current's */ #define TIF_UPROBE 4 /* uprobe breakpoint or singlestep */ #define TIF_FSCHECK5 /* Check FS is USER_DS on return */ +#define TIF_TASK_ISOLATION 6 /* task isolation enabled for task */ #define TIF_SYSCALL_TRACE 8 /* syscall trace active */ #define TIF_SYSCALL_AUDIT 9 /* syscall auditing */ #define TIF_SYSCALL_TRACEPOINT 10 /* syscall tracepoint for ftrace */ @@ -86,6 +87,7 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED) #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME) #define _TIF_FOREIGN_FPSTATE (1 << TIF_FOREIGN_FPSTATE) +#define _TIF_TASK_ISOLATION(1 << TIF_TASK_ISOLATION) #define _TIF_SYSCALL_TRACE (1 << TIF_SYSCALL_TRACE) #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT) #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT) @@ -99,7 +101,8 @@ void arch_release_task_struct(struct task_struct *tsk); #define _TIF_WORK_MASK (_TIF_NEED_RESCHED | _TIF_SIGPENDING | \ _TIF_NOTIFY_RESUME | _TIF_FOREIGN_FPSTATE | \ -_TIF_UPROBE | _TIF_FSCHECK) +_TIF_UPROBE | _TIF_FSCHE
[PATCH v4 07/13] task_isolation: arch/x86: enable task isolation functionality
In prepare_exit_to_usermode(), run cleanup for tasks exited fromi isolation and call task_isolation_start() for tasks that entered TIF_TASK_ISOLATION. In syscall_trace_enter(), add the necessary support for reporting syscalls for task-isolation processes. Add task_isolation_remote() calls for the kernel exception types that do not result in signals, namely non-signalling page faults. Add task_isolation_kernel_enter() calls to interrupt and syscall entry handlers. This mechanism relies on calls to functions that call task_isolation_kernel_enter() early after entry into kernel. Those functions are: enter_from_user_mode() called from do_syscall_64(), do_int80_syscall_32(), do_fast_syscall_32(), idtentry_enter_user(), idtentry_enter_cond_rcu() idtentry_enter_cond_rcu() called from non-raw IDT macros and other entry points idtentry_enter_user() nmi_enter() xen_call_function_interrupt() xen_call_function_single_interrupt() xen_irq_work_interrupt() Signed-off-by: Chris Metcalf [abel...@marvell.com: adapted for kernel 5.8] Signed-off-by: Alex Belits --- arch/x86/Kconfig | 1 + arch/x86/entry/common.c| 20 +++- arch/x86/include/asm/barrier.h | 2 ++ arch/x86/include/asm/thread_info.h | 4 +++- arch/x86/kernel/apic/ipi.c | 2 ++ arch/x86/mm/fault.c| 4 arch/x86/xen/smp.c | 3 +++ arch/x86/xen/smp_pv.c | 2 ++ 8 files changed, 36 insertions(+), 2 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 883da0abf779..3a80142f85c8 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -149,6 +149,7 @@ config X86 select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT select HAVE_ARCH_PREL32_RELOCATIONS select HAVE_ARCH_SECCOMP_FILTER + select HAVE_ARCH_TASK_ISOLATION select HAVE_ARCH_THREAD_STRUCT_WHITELIST select HAVE_ARCH_STACKLEAK select HAVE_ARCH_TRACEHOOK diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c index f09288431f28..ab94d90a2bd5 100644 --- a/arch/x86/entry/common.c +++ b/arch/x86/entry/common.c @@ -26,6 +26,7 @@ #include #include #include +#include #ifdef CONFIG_XEN_PV #include @@ -86,6 +87,7 @@ static noinstr void enter_from_user_mode(void) { enum ctx_state state = ct_state(); + task_isolation_kernel_enter(); lockdep_hardirqs_off(CALLER_ADDR0); user_exit_irqoff(); @@ -97,6 +99,7 @@ static noinstr void enter_from_user_mode(void) #else static __always_inline void enter_from_user_mode(void) { + task_isolation_kernel_enter(); lockdep_hardirqs_off(CALLER_ADDR0); instrumentation_begin(); trace_hardirqs_off_finish(); @@ -161,6 +164,15 @@ static long syscall_trace_enter(struct pt_regs *regs) return -1L; } + /* +* In task isolation mode, we may prevent the syscall from +* running, and if so we also deliver a signal to the process. +*/ + if (work & _TIF_TASK_ISOLATION) { + if (task_isolation_syscall(regs->orig_ax) == -1) + return -1L; + work &= ~_TIF_TASK_ISOLATION; + } #ifdef CONFIG_SECCOMP /* * Do seccomp after ptrace, to catch any tracer changes. @@ -263,6 +275,8 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs) lockdep_assert_irqs_disabled(); lockdep_sys_exit(); + task_isolation_check_run_cleanup(); + cached_flags = READ_ONCE(ti->flags); if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS)) @@ -278,6 +292,9 @@ static void __prepare_exit_to_usermode(struct pt_regs *regs) if (unlikely(cached_flags & _TIF_NEED_FPU_LOAD)) switch_fpu_return(); + if (cached_flags & _TIF_TASK_ISOLATION) + task_isolation_start(); + #ifdef CONFIG_COMPAT /* * Compat syscalls set TS_COMPAT. 
Make sure we clear it before @@ -597,7 +614,8 @@ bool noinstr idtentry_enter_cond_rcu(struct pt_regs *regs) check_user_regs(regs); enter_from_user_mode(); return false; - } + } else + task_isolation_kernel_enter(); /* * If this entry hit the idle task invoke rcu_irq_enter() whether diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h index 7f828fe49797..5be6ca0519fc 100644 --- a/arch/x86/include/asm/barrier.h +++ b/arch/x86/include/asm/barrier.h @@ -4,6 +4,7 @@ #include #include +#include /* * Force strict CPU ordering. @@ -53,6 +54,7 @@ static inline unsigned long array_index_mask_nospec(unsigned long index, #define dma_rmb() barrier() #define dma_wmb() barrier() +#define instr_sync() sync_core() #ifdef CONFIG_X86_32 #define __smp_mb() asm volatile("lock; addl $0,-4(%%esp)" ::: "memory", "cc") dif
[PATCH 06/13] task_isolation: Add driver-specific hooks
Some drivers don't call functions that call task_isolation_kernel_enter() in interrupt handlers. Call it directly. Signed-off-by: Alex Belits --- drivers/irqchip/irq-armada-370-xp.c | 6 ++ drivers/irqchip/irq-gic-v3.c| 3 +++ drivers/irqchip/irq-gic.c | 3 +++ drivers/s390/cio/cio.c | 3 +++ 4 files changed, 15 insertions(+) diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c index c9bdc5221b82..df7f2cce3a54 100644 --- a/drivers/irqchip/irq-armada-370-xp.c +++ b/drivers/irqchip/irq-armada-370-xp.c @@ -29,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -473,6 +474,7 @@ static const struct irq_domain_ops armada_370_xp_mpic_irq_ops = { static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) { u32 msimask, msinr; + int isol_entered = 0; msimask = readl_relaxed(per_cpu_int_base + ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS) @@ -489,6 +491,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool is_chained) continue; if (is_chained) { + if (!isol_entered) { + task_isolation_kernel_enter(); + isol_entered = 1; + } irq = irq_find_mapping(armada_370_xp_msi_inner_domain, msinr - PCI_MSI_DOORBELL_START); generic_handle_irq(irq); diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c index cc46bc2d634b..be0e0ffa0fb7 100644 --- a/drivers/irqchip/irq-gic-v3.c +++ b/drivers/irqchip/irq-gic-v3.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -629,6 +630,8 @@ static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs *regs { u32 irqnr; + task_isolation_kernel_enter(); + irqnr = gic_read_iar(); if (gic_supports_nmi() && diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c index c17fabd6741e..fde547a31566 100644 --- a/drivers/irqchip/irq-gic.c +++ b/drivers/irqchip/irq-gic.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -353,6 +354,8 @@ static void __exception_irq_entry gic_handle_irq(struct pt_regs *regs) struct gic_chip_data *gic = &gic_data[0]; void __iomem *cpu_base = gic_data_cpu_base(gic); + task_isolation_kernel_enter(); + do { irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK); irqnr = irqstat & GICC_IAR_INT_ID_MASK; diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c index 6d716db2a46a..beab1b6d 100644 --- a/drivers/s390/cio/cio.c +++ b/drivers/s390/cio/cio.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch) struct irb *irb; int irq_context; + task_isolation_kernel_enter(); + irb = this_cpu_ptr(&cio_irb); /* Store interrupt response block to lowcore. */ if (tsch(sch->schid, irb) != 0) -- 2.26.2
[PATCH v4 05/13] task_isolation: Add xen-specific hook
xen_evtchn_do_upcall() should call task_isolation_kernel_enter() to indicate that isolation is broken and perform synchronization. Signed-off-by: Alex Belits --- drivers/xen/events/events_base.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index 140c7bf33a98..4c16cd58f36b 100644 --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -33,6 +33,7 @@ #include #include #include +#include #ifdef CONFIG_X86 #include @@ -1236,6 +1237,8 @@ void xen_evtchn_do_upcall(struct pt_regs *regs) { struct pt_regs *old_regs = set_irq_regs(regs); + task_isolation_kernel_enter(); + irq_enter(); __xen_evtchn_do_upcall(); -- 2.26.2
[PATCH v4 04/13] task_isolation: Add task isolation hooks to arch-independent code
This commit adds task isolation hooks as follows:

- __handle_domain_irq() and handle_domain_nmi() generate an isolation warning for the local task
- irq_work_queue_on() generates an isolation warning for the remote task being interrupted for irq_work (through __smp_call_single_queue())
- generic_exec_single() generates a remote isolation warning for the remote cpu being IPI'd (through __smp_call_single_queue())
- smp_call_function_many() generates a remote isolation warning for the set of remote cpus being IPI'd (through smp_call_function_many_cond())
- on_each_cpu_cond_mask() generates a remote isolation warning for the set of remote cpus being IPI'd (through smp_call_function_many_cond())
- __ttwu_queue_wakelist() generates a remote isolation warning for the remote cpu being IPI'd (through __smp_call_single_queue())
- nmi_enter(), __context_tracking_exit(), __handle_domain_irq(), handle_domain_nmi() and scheduler_ipi() clear low-level flags and synchronize CPUs by calling task_isolation_kernel_enter()

Calls to task_isolation_remote() or task_isolation_interrupt() can be placed in the platform-independent code like this when doing so results in fewer lines of code changes, as for example is true of the users of the arch_send_call_function_*() APIs. Or, they can be placed in the per-architecture code when there are many callers, as for example is true of the smp_send_reschedule() call. A further cleanup might be to create an intermediate layer, so that for example smp_send_reschedule() is a single generic function that just calls arch_smp_send_reschedule(), allowing generic code to be called every time smp_send_reschedule() is invoked. But for now, we just update either callers or callees as makes the most sense.

Calls to task_isolation_kernel_enter() are intended for early kernel entry code. They may be called in platform-independent or platform-specific code. It may be possible to clean up low-level entry code and somehow organize calls to task_isolation_kernel_enter() to avoid multiple per-architecture or driver-specific calls to it. RCU initialization may be a good reference point for those places in the kernel (task_isolation_kernel_enter() should precede it); however, right now it is not unified between architectures.
Signed-off-by: Chris Metcalf [abel...@marvell.com: adapted for kernel 5.8, added low-level flags handling] Signed-off-by: Alex Belits --- include/linux/hardirq.h | 2 ++ include/linux/sched.h | 2 ++ kernel/context_tracking.c | 4 kernel/irq/irqdesc.c | 13 + kernel/smp.c | 6 +- 5 files changed, 26 insertions(+), 1 deletion(-) diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h index 03c9fece7d43..5aab1d0a580e 100644 --- a/include/linux/hardirq.h +++ b/include/linux/hardirq.h @@ -7,6 +7,7 @@ #include #include #include +#include #include extern void synchronize_irq(unsigned int irq); @@ -114,6 +115,7 @@ extern void rcu_nmi_exit(void); #define nmi_enter()\ do {\ arch_nmi_enter(); \ + task_isolation_kernel_enter(); \ printk_nmi_enter(); \ lockdep_off(); \ BUG_ON(in_nmi() == NMI_MASK); \ diff --git a/include/linux/sched.h b/include/linux/sched.h index 7fb7bb3fddaa..cacfa415dc59 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -32,6 +32,7 @@ #include #include #include +#include /* task_struct member predeclarations (sorted alphabetically): */ struct audit_context; @@ -1743,6 +1744,7 @@ extern char *__get_task_comm(char *to, size_t len, struct task_struct *tsk); #ifdef CONFIG_SMP static __always_inline void scheduler_ipi(void) { + task_isolation_kernel_enter(); /* * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting * TIF_NEED_RESCHED remotely (for the first time) will also send diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c index 36a98c48aedc..481a722ddbce 100644 --- a/kernel/context_tracking.c +++ b/kernel/context_tracking.c @@ -21,6 +21,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -148,6 +149,8 @@ void noinstr __context_tracking_exit(enum ctx_state state) if (!context_tracking_recursion_enter()) return; + task_isolation_kernel_enter(); + if (__this_cpu_read(context_tracking.state) == state) { if (__this_cpu_read(context_tracking.active)) { /* @@ -159,6 +162,7 @@ void noinstr __context_tracking_exit(enum ctx_state state) instrumentation_begin(); vti
[PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
The existing nohz_full mode is designed as a "soft" isolation mode that makes tradeoffs to minimize userspace interruptions while still attempting to avoid overheads in the kernel entry/exit path, to provide 100% kernel semantics, etc. However, some applications require a "hard" commitment from the kernel to avoid interruptions, in particular userspace device driver style applications, such as high-speed networking code. This change introduces a framework to allow applications to elect to have the "hard" semantics as needed, specifying prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE) to do so. The kernel must be built with the new TASK_ISOLATION Kconfig flag to enable this mode, and the kernel booted with an appropriate "isolcpus=nohz,domain,CPULIST" boot argument to enable nohz_full and isolcpus. The "task_isolation" state is then indicated by setting a new task struct field, task_isolation_flag, to the value passed by prctl(), and also setting a TIF_TASK_ISOLATION bit in the thread_info flags. When the kernel is returning to userspace from the prctl() call and sees TIF_TASK_ISOLATION set, it calls the new task_isolation_start() routine to arrange for the task to avoid being interrupted in the future. With interrupts disabled, task_isolation_start() ensures that kernel subsystems that might cause a future interrupt are quiesced. If it doesn't succeed, it adjusts the syscall return value to indicate that fact, and userspace can retry as desired. In addition to stopping the scheduler tick, the code takes any actions that might avoid a future interrupt to the core, such as a worker thread being scheduled that could be quiesced now (e.g. the vmstat worker) or a future IPI to the core to clean up some state that could be cleaned up now (e.g. the mm lru per-cpu cache). Once the task has returned to userspace after issuing the prctl(), if it enters the kernel again via system call, page fault, or any other exception or irq, the kernel will send it a signal to indicate isolation loss. In addition to sending a signal, the code supports a kernel command-line "task_isolation_debug" flag which causes a stack backtrace to be generated whenever a task loses isolation. To allow the state to be entered and exited, the syscall checking test ignores the prctl(PR_TASK_ISOLATION) syscall so that we can clear the bit again later, and ignores exit/exit_group to allow exiting the task without a pointless signal being delivered. The prctl() API allows for specifying a signal number to use instead of the default SIGKILL, to allow for catching the notification signal; for example, in a production environment, it might be helpful to log information to the application logging mechanism before exiting. Or, the signal handler might choose to reset the program counter back to the code segment intended to be run isolated via prctl() to continue execution. In a number of cases we can tell on a remote cpu that we are going to be interrupting the cpu, e.g. via an IPI or a TLB flush. In that case we generate the diagnostic (and optional stack dump) on the remote core to be able to deliver better diagnostics. If the interrupt is not something caught by Linux (e.g. a hypervisor interrupt) we can also request a reschedule IPI to be sent to the remote core so it can be sure to generate a signal to notify the process. Isolation also disables CPU state synchronization mechanisms that are. normally done by IPI. In the future, more synchronization mechanisms, such as TLB flushes, may be disabled for isolated tasks. 
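For illustration, a minimal (untested) userspace sketch of the prctl() flow described above could look like the following. It assumes the PR_TASK_ISOLATION and PR_TASK_ISOLATION_ENABLE definitions from the patched include/uapi/linux/prctl.h (they are not in mainline headers), and the retry bound and error reporting are illustrative choices rather than anything mandated by this patch:

---8<---
/*
 * Sketch only: enable task isolation for the current task and then stay
 * in userspace.  Requires the uapi header from this series.
 */
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/prctl.h>	/* patched header providing PR_TASK_ISOLATION */

int main(void)
{
	int tries = 100;

	/*
	 * The prctl() can fail while the kernel is still quiescing deferred
	 * work (vmstat worker, per-cpu mm lru cache, ...); as described
	 * above, userspace may simply retry.
	 */
	while (prctl(PR_TASK_ISOLATION, PR_TASK_ISOLATION_ENABLE, 0, 0, 0) != 0) {
		if (--tries == 0) {
			perror("prctl(PR_TASK_ISOLATION)");
			return 1;
		}
	}

	/*
	 * From here on the task must stay in userspace: any syscall, page
	 * fault or other kernel entry breaks isolation and delivers the
	 * configured signal (SIGKILL by default).
	 */
	for (;;)
		;	/* isolated work loop, e.g. polling a shared-memory ring */
}
--->8---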
This requires careful handling of kernel entry from an isolated task -- remote synchronization requests must be re-enabled and the synchronization procedure triggered before anything other than low-level kernel entry code is called. The same applies to exiting from the kernel to userspace after isolation is enabled -- either the code should not depend on synchronization, or isolation should be broken.

For this purpose, per-CPU low-level flags ll_isol_flags are used to indicate the isolation state, and task_isolation_kernel_enter() is used to safely clear them early in kernel entry. The CPU mask corresponding to the isolation bit in ll_isol_flags is visible to userspace as /sys/devices/system/cpu/isolation_running, and can be used for monitoring. Separate patches that follow provide these changes for the x86, arm, and arm64 architectures, and for the xen and irqchip drivers.

Signed-off-by: Alex Belits
---
 .../admin-guide/kernel-parameters.txt |   6 +
 drivers/base/cpu.c                    |  23 +
 include/linux/hrtimer.h               |   4 +
 include/linux/isolation.h             | 295 ++
 include/linux/sched.h                 |   5 +
 include/linux/tick.h                  |   3 +
 include/uapi/linux/prctl.h            |   6 +
 init/Kconfig                          |  28 +
 kernel/Makefile
[PATCH v4 02/13] task_isolation: vmstat: add vmstat_idle function
From 7823be8cd3ba2e66308f334a2e47f60ba7829e0b Mon Sep 17 00:00:00 2001 From: Chris Metcalf Date: Sat, 1 Feb 2020 08:05:45 + Subject: [PATCH 02/13] task_isolation: vmstat: add vmstat_idle function This function checks to see if a vmstat worker is not running, and the vmstat diffs don't require an update. The function is called from the task-isolation code to see if we need to actually do some work to quiet vmstat. Signed-off-by: Chris Metcalf Signed-off-by: Alex Belits --- include/linux/vmstat.h | 2 ++ mm/vmstat.c| 10 ++ 2 files changed, 12 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index ded16dfd21fa..97bc9ed92036 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -273,6 +273,7 @@ extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); void quiet_vmstat_sync(void); +bool vmstat_idle(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -376,6 +377,7 @@ static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } static inline void quiet_vmstat_sync(void) { } +static inline bool vmstat_idle(void) { return true; } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 93534f8537ca..f3693ef0a958 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1898,6 +1898,16 @@ void quiet_vmstat_sync(void) refresh_cpu_vm_stats(false); } +/* + * Report on whether vmstat processing is quiesced on the core currently: + * no vmstat worker running and no vmstat updates to perform. + */ +bool vmstat_idle(void) +{ + return !delayed_work_pending(this_cpu_ptr(&vmstat_work)) && + !need_update(smp_processor_id()); +} + /* * Shepherd worker thread that checks the * differentials of processors that have their worker -- 2.26.2
[PATCH v4 01/13] task_isolation: vmstat: add quiet_vmstat_sync function
In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter") the quiet_vmstat() function became asynchronous, in the sense that the vmstat work was still scheduled to run on the core when the function returned. For task isolation, we need a synchronous version of the function that guarantees that the vmstat worker will not run on the core on return from the function. Add a quiet_vmstat_sync() function with that semantic. Signed-off-by: Chris Metcalf Signed-off-by: Alex Belits --- include/linux/vmstat.h | 2 ++ mm/vmstat.c| 9 + 2 files changed, 11 insertions(+) diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index aa961088c551..ded16dfd21fa 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -272,6 +272,7 @@ extern void __dec_zone_state(struct zone *, enum zone_stat_item); extern void __dec_node_state(struct pglist_data *, enum node_stat_item); void quiet_vmstat(void); +void quiet_vmstat_sync(void); void cpu_vm_stats_fold(int cpu); void refresh_zone_stat_thresholds(void); @@ -374,6 +375,7 @@ static inline void __dec_node_page_state(struct page *page, static inline void refresh_zone_stat_thresholds(void) { } static inline void cpu_vm_stats_fold(int cpu) { } static inline void quiet_vmstat(void) { } +static inline void quiet_vmstat_sync(void) { } static inline void drain_zonestat(struct zone *zone, struct per_cpu_pageset *pset) { } diff --git a/mm/vmstat.c b/mm/vmstat.c index 3fb23a21f6dd..93534f8537ca 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1889,6 +1889,15 @@ void quiet_vmstat(void) refresh_cpu_vm_stats(false); } +/* + * Synchronously quiet vmstat so the work is guaranteed not to run on return. + */ +void quiet_vmstat_sync(void) +{ + cancel_delayed_work_sync(this_cpu_ptr(&vmstat_work)); + refresh_cpu_vm_stats(false); +} + /* * Shepherd worker thread that checks the * differentials of processors that have their worker -- 2.26.2
[PATCH v4 00/13] "Task_isolation" mode
This is a new version of the task isolation implementation. The previous version is at https://lore.kernel.org/lkml/07c25c246c55012981ec0296eee23e68c719333a.ca...@marvell.com/

This version mostly covers prevention of race conditions when breaking isolation. Early after kernel entry, task_isolation_enter() is called to update flags visible to other CPU cores and to perform synchronization if necessary. Before this call only "safe" operations happen, as long as CONFIG_TRACE_IRQFLAGS is not enabled. This is also intended for future TLB handling -- the idea is to also isolate those CPU cores from TLB flushes while they are running an isolated task in userspace, and do one flush on exiting, before any code is called that may touch anything updated.

The functionality and interface are unchanged, except that /sys/devices/system/cpu/isolation_running now contains the list of CPUs running isolated tasks. This should be useful for userspace helper libraries.
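For userspace helpers that want to consume this, a trivial (untested) sketch that dumps the new file could look like the one below; the exact format of the CPU list is whatever the kernel side of this series writes there, so the sketch only prints the raw contents:

---8<---
/* Sketch only: print the set of CPUs currently running isolated tasks. */
#include <stdio.h>

int main(void)
{
	char buf[4096];
	FILE *f = fopen("/sys/devices/system/cpu/isolation_running", "r");

	if (!f) {
		perror("isolation_running");	/* kernel without this series */
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("CPUs running isolated tasks: %s", buf);
	fclose(f);
	return 0;
}
--->8---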
[tip: sched/core] lib: Restrict cpumask_local_spread to houskeeping CPUs
The following commit has been merged into the sched/core branch of tip: Commit-ID: 1abdfe706a579a702799fce465bceb9fb01d407c Gitweb: https://git.kernel.org/tip/1abdfe706a579a702799fce465bceb9fb01d407c Author:Alex Belits AuthorDate:Thu, 25 Jun 2020 18:34:41 -04:00 Committer: Peter Zijlstra CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00 lib: Restrict cpumask_local_spread to houskeeping CPUs The current implementation of cpumask_local_spread() does not respect the isolated CPUs, i.e., even if a CPU has been isolated for Real-Time task, it will return it to the caller for pinning of its IRQ threads. Having these unwanted IRQ threads on an isolated CPU adds up to a latency overhead. Restrict the CPUs that are returned for spreading IRQs only to the available housekeeping CPUs. Signed-off-by: Alex Belits Signed-off-by: Nitesh Narayan Lal Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200625223443.2684-2-nit...@redhat.com --- lib/cpumask.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/lib/cpumask.c b/lib/cpumask.c index fb22fb2..85da6ab 100644 --- a/lib/cpumask.c +++ b/lib/cpumask.c @@ -6,6 +6,7 @@ #include #include #include +#include /** * cpumask_next - get the next cpu in a cpumask @@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask) */ unsigned int cpumask_local_spread(unsigned int i, int node) { - int cpu; + int cpu, hk_flags; + const struct cpumask *mask; + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ; + mask = housekeeping_cpumask(hk_flags); /* Wrap: we always want a cpu. */ - i %= num_online_cpus(); + i %= cpumask_weight(mask); if (node == NUMA_NO_NODE) { - for_each_cpu(cpu, cpu_online_mask) + for_each_cpu(cpu, mask) { if (i-- == 0) return cpu; + } } else { /* NUMA first. */ - for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask) + for_each_cpu_and(cpu, cpumask_of_node(node), mask) { if (i-- == 0) return cpu; + } - for_each_cpu(cpu, cpu_online_mask) { + for_each_cpu(cpu, mask) { /* Skip NUMA nodes, done above. */ if (cpumask_test_cpu(cpu, cpumask_of_node(node))) continue;
[tip: sched/core] net: Restrict receive packets queuing to housekeeping CPUs
The following commit has been merged into the sched/core branch of tip: Commit-ID: 07bbecb3410617816a99e76a2df7576507a0c8ad Gitweb: https://git.kernel.org/tip/07bbecb3410617816a99e76a2df7576507a0c8ad Author:Alex Belits AuthorDate:Thu, 25 Jun 2020 18:34:43 -04:00 Committer: Peter Zijlstra CommitterDate: Wed, 08 Jul 2020 11:39:02 +02:00 net: Restrict receive packets queuing to housekeeping CPUs With the existing implementation of store_rps_map(), packets are queued in the receive path on the backlog queues of other CPUs irrespective of whether they are isolated or not. This could add a latency overhead to any RT workload that is running on the same CPU. Ensure that store_rps_map() only uses available housekeeping CPUs for storing the rps_map. Signed-off-by: Alex Belits Signed-off-by: Nitesh Narayan Lal Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20200625223443.2684-4-nit...@redhat.com --- net/core/net-sysfs.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index e353b82..677868f 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -741,7 +742,7 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, { struct rps_map *old_map, *map; cpumask_var_t mask; - int err, cpu, i; + int err, cpu, i, hk_flags; static DEFINE_MUTEX(rps_map_mutex); if (!capable(CAP_NET_ADMIN)) @@ -756,6 +757,13 @@ static ssize_t store_rps_map(struct netdev_rx_queue *queue, return err; } + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; + cpumask_and(mask, mask, housekeeping_cpumask(hk_flags)); + if (cpumask_empty(mask)) { + free_cpumask_var(mask); + return -EINVAL; + } + map = kzalloc(max_t(unsigned int, RPS_MAP_SIZE(cpumask_weight(mask)), L1_CACHE_BYTES), GFP_KERNEL);
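To make the user-visible effect concrete: after this change, writing an rps_cpus mask that intersects no housekeeping CPUs is rejected with EINVAL, because store_rps_map() now ANDs the mask with housekeeping_cpumask() and refuses an empty result. A rough (untested) sketch of such a write from userspace follows; the interface name and mask value are placeholders, and the write needs CAP_NET_ADMIN to succeed:

---8<---
/* Sketch only: write an RPS cpumask and report the kernel's verdict. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
	const char *mask = "c";			/* hex cpumask: CPUs 2-3 */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	if (write(fd, mask, strlen(mask)) < 0)
		/* EINVAL here means the mask contained no housekeeping CPUs */
		fprintf(stderr, "write %s: %s\n", path, strerror(errno));
	close(fd);
	return 0;
}
--->8---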
[tip: sched/core] PCI: Restrict probe functions to housekeeping CPUs
The following commit has been merged into the sched/core branch of tip: Commit-ID: 69a18b18699b59654333651d95f8ca09d01048f8 Gitweb: https://git.kernel.org/tip/69a18b18699b59654333651d95f8ca09d01048f8 Author:Alex Belits AuthorDate:Thu, 25 Jun 2020 18:34:42 -04:00 Committer: Peter Zijlstra CommitterDate: Wed, 08 Jul 2020 11:39:01 +02:00 PCI: Restrict probe functions to housekeeping CPUs pci_call_probe() prevents the nesting of work_on_cpu() for a scenario where a VF device is probed from work_on_cpu() of the PF. Replace the cpumask used in pci_call_probe() from all online CPUs to only housekeeping CPUs. This is to ensure that there are no additional latency overheads caused due to the pinning of jobs on isolated CPUs. Signed-off-by: Alex Belits Signed-off-by: Nitesh Narayan Lal Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Frederic Weisbecker Acked-by: Bjorn Helgaas Link: https://lkml.kernel.org/r/20200625223443.2684-3-nit...@redhat.com --- drivers/pci/pci-driver.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index da6510a..449466f 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -333,6 +334,7 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, const struct pci_device_id *id) { int error, node, cpu; + int hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ; struct drv_dev_and_id ddi = { drv, dev, id }; /* @@ -353,7 +355,8 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev, pci_physfn_is_probed(dev)) cpu = nr_cpu_ids; else - cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask); + cpu = cpumask_any_and(cpumask_of_node(node), + housekeeping_cpumask(hk_flags)); if (cpu < nr_cpu_ids) error = work_on_cpu(cpu, local_pci_probe, &ddi);
Re: how to look for source code in kernel
On Fri, 28 Dec 2012, anish singh wrote:

> > > have source insight. We can use wine in linux but that sucks.
> >
> > Funny you say that! Never heard of cscope, ctags ?
>
> It is not as convenient as source insight or is it?

There is also LXR. If it's not good enough for you, then don't look at it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Âèçû.Ïàñïîðòà.Ãðàæäàíñòâà.Ïðîáëåìû.=?ISO-8859-1?Q?=D0=E5=F8=E5
On Fri, 15 Jun 2001, Dan Hollis wrote: > Received: from [195.161.132.168] ([195.161.132.168]:38150 "HELO 777") > by vger.kernel.org with SMTP id ; > Fri, 15 Jun 2001 17:19:32 -0400 > > inetnum: 195.161.132.0 - 195.161.132.255 > netname: RT-CLNT-MMTEL > descr:Moscow Long Distance and International Telephone > > Anyone want to fire the nuclear larts? Me! 1. It's a spam. 2. It's in the dreaded windows-1251 charset. 3. The text in the header is mis-identified as ISO 8859-1 -- Alex -- Excellent.. now give users the option to cut your hair you hippie! -- Anonymous Coward - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Serial device with very large buffer
On Fri, 9 Feb 2001, Pavel Machek wrote:

> > > I also propose to increase the size of flip buffer to 640 bytes (so the
> > > flipping won't occur every time in the middle of the full buffer), however
> > > I understand that it's a rather drastic change for such a simple goal, and
> > > not everyone will agree that it's worth the trouble:
> >
> > Going to a 1K flip buffer would make sense IMHO for high speed devices too
>
> Actually bigger flipbufs are needed for highspeed serials and
> irda. Tytso received patch to make flipbuf size settable by the
> driver. (Setting it to 1K is not easy, you need to change allocation
> mechanism of buffers.)

The need for changes in the allocation mechanism was the reason why I have
limited the buffer increase to 640 bytes. If the changes already exist, and
there is no hidden overhead associated with them, I am all for it.

Still it's not a replacement for the change in the serial driver that I have
posted -- the assumption that hardware is slower than we are, that it has a
limited buffer in the way, and that it's ok to discard all the data beyond
our buffer's size is, to say the least, silly.
--
Alex
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: Serial device with very large buffer
On Thu, 1 Feb 2001, Joe deBlaquiere wrote: > >>I'm a little confused here... why are we overrunning? This thing is > >> running externally at 19200 at best, even if it does all come in as a > >> packet. > > > > > > Different Merlin -- original Merlin is 19200, "Merlin for Ricochet" is > > 128Kbps (or faster), and uses Metricom/Ricochet network. > > so can you still limit the mru? No. And even if I could, there is no guarantee that it won't fill the whole buffer anyway by attaching head of the second packet after the tail of the first one -- this thing treats interface as asynchronous and ignores PPP packets boundaries. -- Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Serial device with very large buffer
On Thu, 1 Feb 2001, Joe deBlaquiere wrote: > Hi Alex! > > I'm a little confused here... why are we overrunning? This thing is > running externally at 19200 at best, even if it does all come in as a > packet. Different Merlin -- original Merlin is 19200, "Merlin for Ricochet" is 128Kbps (or faster), and uses Metricom/Ricochet network. -- Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Serial device with very large buffer
On Thu, 1 Feb 2001, Alan Cox wrote: > > I also propose to increase the size of flip buffer to 640 bytes (so the > > flipping won't occur every time in the middle of the full buffer), however > > I understand that it's a rather drastic change for such a simple goal, and > > not everyone will agree that it's worth the trouble: > > Going to a 1K flip buffer would make sense IMHO for high speed devices too 1K flip buffer makes the tty_struct exceed 4096 bytes, and I don't think, it's a good idea to change the allocation mechanism for it. -- Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Serial device with very large buffer
Greg Pomerantz <[EMAIL PROTECTED]> and I have found that Novatel Merlin for Ricochet PCMCIA card, while looking like otherwise ordinary serial PCMCIA device, has the receive buffer 576 bytes long. When regular serial driver reads the arrived data, it often runs out of 512-bytes flip buffer and discards the rest of the data with rather disastrous consequences for whatever is expecting it. We made a fix that changes the behavior of the driver, so when it fills the flip buffer while characters are still being read from uart, it flips the buffer if it's possible or if it's impossible, finishes the loop without reading the remaining characters. The patch is: ---8<--- --- linux-2.4.1-orig/drivers/char/serial.c Wed Dec 6 12:06:18 2000 +++ linux/drivers/char/serial.c Thu Feb 1 13:14:05 2001 @@ -569,9 +569,16 @@ icount = &info->state->icount; do { + /* +* Check if flip buffer is full -- if it is, try to flip, +* and if flipping got queued, return immediately +*/ + if (tty->flip.count >= TTY_FLIPBUF_SIZE) { + tty->flip.tqueue.routine((void *) tty); + if (tty->flip.count >= TTY_FLIPBUF_SIZE) + return; + } ch = serial_inp(info, UART_RX); - if (tty->flip.count >= TTY_FLIPBUF_SIZE) - goto ignore_char; *tty->flip.char_buf_ptr = ch; icount->rx++; --->8--- I also propose to increase the size of flip buffer to 640 bytes (so the flipping won't occur every time in the middle of the full buffer), however I understand that it's a rather drastic change for such a simple goal, and not everyone will agree that it's worth the trouble: ---8<--- --- linux-2.4.1-orig/include/linux/tty.hMon Jan 29 23:24:56 2001 +++ linux/include/linux/tty.h Wed Jan 31 13:06:42 2001 @@ -134,7 +134,7 @@ * located in the tty structure, and is used as a high speed interface * between the tty driver and the tty line discipline. */ -#define TTY_FLIPBUF_SIZE 512 +#define TTY_FLIPBUF_SIZE 640 struct tty_flip_buffer { struct tq_struct tqueue; --->8--- -- Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: PPP broken in Kernel 2.4.1?
On Mon, 29 Jan 2001, Michael B. Trausch wrote: > I'm having a weird problem with 2.4.1, and I am *not* having this problem > with 2.4.0. When I attempt to connect to the Internet using Kernel 2.4.1, > I get errors about PPP something-or-another, invalid argument. I've tried Upgrade ppp to 2.4.0b1 or later -- it's documented in Documentation/Changes. -- Alex -- Excellent.. now give users the option to cut your hair you hippie! -- Anonymous Coward - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: Journaling: Surviving or allowing unclean shutdown?
On Thu, 4 Jan 2001, Daniel Phillips wrote:

> > A lot of applications always rely on their file i/o being done in some
> > manner that has atomic (from the application's point of view) operations
> > other than system calls -- heck, even make(1) does that.
>
> Nobody is forcing you to hit the power switch in the middle of a build.
> But now that you mention it, you've provided a good example of a broken
> application. Make with its reliance on timestamps for determining build
> status is both painfully slow and unreliable.

Actually I mean its reliance on files being deleted if a problem or SIGTERM
happened in the middle of building them.

> What happens if you
> adjust your system clock?

Don't adjust the system clock in the middle of the build. Adjusting the
clock backward by more than a second is a much rarer operation than a
shutdown.

> That said, Tux2 can preserve the per-write
> atomicity quite easily, or better, make could take advantage of the new
> journal-oriented transaction api that's being cooked up and specify its
> requirement for atomicity in a precise way.

I have already said that programs don't use syscalls as the only atomic
operations on files -- yes, it may be a good idea to add a transactions API
on top of this (and it will have a lot of uses), but then it should be made
in a way that its use will be easy to add to existing applications.

> Do you have any other examples of programs that would be hurt by sudden
> termination? Certainly we'd consider a desktop gui broken if it failed
> to come up again just because you bailed out with the power switch
> instead of logging out nicely.

Any application that writes multiple times over the same files and has any
data consistency requirements beyond the piece of data in the chunk sent in
one write().
--
Alex
--
Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: Journaling: Surviving or allowing unclean shutdown?
On Wed, 3 Jan 2001, Daniel Phillips wrote: > I don't doubt that if the 'power switch' method of shutdown becomes > popular we will discover some applications that have windows where they > can be hurt by sudden shutdown, even will full filesystem data state > being preserved. Such applications are arguably broken because they > will behave badly in the event of accidental shutdown anyway, and we > should fix them. Well-designed applications are explicitly 'serially > reuseable', in other words, you can interrupt at any point and start > again from the beginning with valid and expected results. I strongly disagree. All valid ways to shut down the system involve sending SIGTERM to running applications -- only broken ones would live long enough after that to be killed by subsequent SIGKILL. A lot of applications always rely on their file i/o being done in some manner that has atomic (from the application's point of view) operations other than system calls -- heck, even make(1) does that. -- Alex -- Excellent.. now give users the option to cut your hair you hippie! -- Anonymous Coward - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: OS & Games Software
On Sat, 23 Dec 2000 [EMAIL PROTECTED] wrote: > Subject: OS & Games Software > > Are you still using an old operating system? Why not upgrade to a > newer and > more reliable version? You'll enjoy greater features and more > stability. > > Microsoft Dos 6.22$15 > Microsoft Windows 3.11$15 > Microsoft Windows 95 $15 > Microsoft Windows 98 SE $20 > Microsoft Windows Millenium $20 > Microsoft Windows 2000 Pro$20 > Microsoft Windows 2000 Server $50 > Microsoft Windows 2000 Advanced Server (25CAL)$65 > Is this a desperate Microsoft's attempt to slow Linux development by insulting developers? ;-)) I mean, what other purpose can this possibly have? Unless, of course, some unintelligent person got linux-kernel address in a list of prepackaged "n millions email addresses for sale" (and then he must be not moron*2, or moron^2, but at least e^moron). -- Alex -- Excellent.. now give users the option to cut your hair you hippie! -- Anonymous Coward - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: The NSA's Security-Enhanced Linux (fwd)
On Fri, 22 Dec 2000, James Lewis Nance wrote: > > benefits from and which may help cut down computer crime beyond government. > > (and which of course actually is part of the NSA's real job) > > I often wonder how many people know that a whole bunch of the Linux > networking code is Copyrighted by the NSA. Not exactly by NSA itself. A bunch of files have in copyright comment: ---8<--- Written 1992-94 by Donald Becker. Copyright 1993 United States Government as represented by the Director, National Security Agency. This software may be used and distributed according to the terms of the GNU Public License, incorporated herein by reference. The author may be reached as [EMAIL PROTECTED], or C/O Center of Excellence in Space Data and Information Sciences Code 930.5, Goddard Space Flight Center, Greenbelt MD 20771 --->8--- ...so this is the result of Becker's employment at NASA and government's legal weirdness (no, I have no idea, why of all possible choices "Director, National Security Agency" must represent US government for copyright purpose). > I'm always waiting to > hear someone come up with a conspiracy theory about it on slashdot, > but I have never heard anyone mention it. Actually I have seen it mentioned there today -- maybe conspiracy theory is being developed right now ;-) -- Alex - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: OS Software?
On Thu, 21 Dec 2000 [EMAIL PROTECTED] wrote: > Are you interested in Office 2000? I am selling perfectly working > copies > of Microsoft Office 2000 SR-1 Premium Edition for a flat price of > $50 USD. > The suite contains 4 discs and includes: > > Word > Excel > Outlook > PowerPoint > Access > FrontPage > Publisher > Small Business Tools > PhotoDraw Is it a new tradition among spammers -- spam linux-kernel ML with offers of software, most hated among the subscribers? Can't they offer something less offensive? -- Alex -- Excellent.. now give users the option to cut your hair you hippie! -- Anonymous Coward - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: uname
On Thu, 23 Nov 2000, J . A . Magallon wrote:

> Little question about 'uname'. Does it read data from kernel, /proc or
> get its data from other source ?

The uname(1) utility calls the uname(2) syscall.
--
Alex
--
Excellent.. now give users the option to cut your hair you hippie!
  -- Anonymous Coward
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
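For completeness, a minimal example of the uname(2) call that uname(1) wraps:

---8<---
#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
	struct utsname u;

	if (uname(&u) != 0) {
		perror("uname");
		return 1;
	}
	/* Roughly what "uname -srm" prints. */
	printf("%s %s %s\n", u.sysname, u.release, u.machine);
	return 0;
}
--->8---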
[PATCH] STRIP support for new Metricom modems
I have made changes in STRIP address handling to accomodate new 128Kbps Ricochet GS "modems" that Metricom makes now. There is no official maintainer of STRIP code (maybe I should become one, however folks at Stanford who work on the original project probably will be more appropriate), so I am only sending it here. The patch was tested on 2.2.17 (original tested with Metricom modem with serial link and patched for USB to test the same modem with USB link), and 2.4.0-test9 (unchanged), in all tests I had old modems (Original Ricochet and Ricochet SE) talking to each other and old modems talking with Ricochet GS. Explanations are at http://phobos.illtel.denver.co.us/~abelits/metricom/ diff -u linux-2.2.17-orig/drivers/net/strip.c linux-2.2.17/drivers/net/strip.c --- linux-2.2.17-orig/drivers/net/strip.c Sun Nov 8 13:48:06 1998 +++ linux-2.2.17/drivers/net/strip.cThu Sep 28 11:06:46 2000 @@ -14,7 +14,7 @@ * for kernel-based devices like TTY. It interfaces between a * raw TTY, and the kernel's INET protocol layers (via DDI). * - * Version:@(#)strip.c 1.3 July 1997 + * Version:@(#)strip.c 1.4 September 2000 * * Author: Stuart Cheshire <[EMAIL PROTECTED]> * @@ -66,12 +66,15 @@ * It is no longer necessarily to manually set the radio's * rate permanently to 115200 -- the driver handles setting * the rate automatically. + * + * v1.4 September 2000 (AB) + * Added support for long serial numbers. */ #ifdef MODULE -static const char StripVersion[] = "1.3-STUART.CHESHIRE-MODULAR"; +static const char StripVersion[] = "1.4-STUART.CHESHIRE-MODULAR"; #else -static const char StripVersion[] = "1.3-STUART.CHESHIRE"; +static const char StripVersion[] = "1.4-STUART.CHESHIRE"; #endif #define TICKLE_TIMERS 0 @@ -897,20 +900,37 @@ * Convert a string to a Metricom Address. 
*/ -#define IS_RADIO_ADDRESS(p) ( \ +#define IS_RADIO_ADDRESS_1(p) ( \ isdigit((p)[0]) && isdigit((p)[1]) && isdigit((p)[2]) && isdigit((p)[3]) && \ (p)[4] == '-' &&\ isdigit((p)[5]) && isdigit((p)[6]) && isdigit((p)[7]) && isdigit((p)[8])) +#define IS_RADIO_ADDRESS_2(p) ( \ + isdigit((p)[0]) && isdigit((p)[1]) && \ + (p)[2] == '-' &&\ + isdigit((p)[3]) && isdigit((p)[4]) && isdigit((p)[5]) && isdigit((p)[6]) && \ + (p)[7] == '-' &&\ + isdigit((p)[8]) && isdigit((p)[9]) && isdigit((p)[10]) && isdigit((p)[11]) ) + static int string_to_radio_address(MetricomAddress *addr, __u8 *p) { -if (!IS_RADIO_ADDRESS(p)) return(1); +if (IS_RADIO_ADDRESS_2(p)) +{ +addr->c[0] = 0; +addr->c[1] = (READHEX(p[0]) << 4 | READHEX(p[1])) ^ 0xFF; +addr->c[2] = READHEX(p[3]) << 4 | READHEX(p[4]); +addr->c[3] = READHEX(p[5]) << 4 | READHEX(p[6]); +addr->c[4] = READHEX(p[8]) << 4 | READHEX(p[9]); +addr->c[5] = READHEX(p[10]) << 4 | READHEX(p[11]); +}else{ +if(!IS_RADIO_ADDRESS_1(p)) return(1); addr->c[0] = 0; addr->c[1] = 0; addr->c[2] = READHEX(p[0]) << 4 | READHEX(p[1]); addr->c[3] = READHEX(p[2]) << 4 | READHEX(p[3]); addr->c[4] = READHEX(p[5]) << 4 | READHEX(p[6]); addr->c[5] = READHEX(p[7]) << 4 | READHEX(p[8]); +} return(0); } @@ -920,6 +940,9 @@ static __u8 *radio_address_to_string(const MetricomAddress *addr, MetricomAddressString *p) { +if(addr->c[1]) +sprintf(p->c, "%02X-%02X%02X-%02X%02X", addr->c[1] ^ 0xFF, addr->c[2], +addr->c[3], addr->c[4], addr->c[5]); +else sprintf(p->c, "%02X%02X-%02X%02X", addr->c[2], addr->c[3], addr->c[4], addr->c[5]); return(p->c); } @@ -1481,6 +1504,12 @@ *ptr++ = 0x0D; *ptr++ = '*'; +if(haddr.c[1]) +{ +*ptr++ = hextable[(haddr.c[1] >> 4) ^ 0xF]; +*ptr++ = hextable[(haddr.c[1] & 0xF) ^ 0xF]; +*ptr++ = '-'; +} *ptr++ = hextable[haddr.c[2] >> 4]; *ptr++ = hextable[haddr.c[2] & 0xF]; *ptr++ = hextable[haddr.c[3] >> 4]; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/