Re: [PATCHv3 1/2] arch: Move CONFIG_DEBUG_RODATA and CONFIG_SET_MODULE_RONX to be common

2017-02-06 Thread Ingo Molnar

* Laura Abbott  wrote:

> 
> There are multiple architectures that support CONFIG_DEBUG_RODATA and
> CONFIG_SET_MODULE_RONX. These options also now have the ability to be
> turned off at runtime. Move these to an architecture independent
> location and make these options def_bool y for almost all of those
> arches.
> 
> Signed-off-by: Laura Abbott 
> ---
> v3: Make these configs selectable for arm. Include some documentation about
> how the setup of the optional Kconfigs work as well. Stop spelling 'kenrel'
> in prompt text.
> ---
>  Documentation/security/self-protection.txt |  6 ++
>  arch/Kconfig   | 34 
> ++
>  arch/arm/Kconfig   |  4 
>  arch/arm/Kconfig.debug | 11 --
>  arch/arm/mm/Kconfig| 12 ---
>  arch/arm64/Kconfig |  5 ++---
>  arch/arm64/Kconfig.debug   | 11 --
>  arch/parisc/Kconfig|  1 +
>  arch/parisc/Kconfig.debug  | 11 --
>  arch/s390/Kconfig  |  5 ++---
>  arch/s390/Kconfig.debug|  3 ---
>  arch/x86/Kconfig   |  5 ++---
>  arch/x86/Kconfig.debug | 11 --
>  13 files changed, 51 insertions(+), 68 deletions(-)

Acked-by: Ingo Molnar 

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 2/2] arch: Rename CONFIG_DEBUG_RODATA and CONFIG_DEBUG_MODULE_RONX

2017-02-19 Thread Ingo Molnar

* Kees Cook  wrote:

> On Thu, Feb 16, 2017 at 2:25 PM, Pavel Machek  wrote:
> > Hi!
> >
> >>
> >> -config DEBUG_RODATA
> >> +config STRICT_KERNEL_RWX
> >>   bool "Make kernel text and rodata read-only" if 
> >> ARCH_OPTIONAL_KERNEL_RWX
> >>   depends on ARCH_HAS_STRICT_KERNEL_RWX
> >>   default !ARCH_OPTIONAL_KERNEL_RWX ||
> >
> > Debug features are expected to have runtime cost, so kconfig help is
> > silent about those. But there are runtime costs, right? It would be
> > nice to mention them in the help text...
> 
> It depends on the architecture. The prior help text for arm said:
> 
>  The tradeoff is that each region is padded to section-size (1MiB)
>  boundaries (because their permissions are different and splitting
>  the 1M pages into 4K ones causes TLB performance problems), which
>  can waste memory.
> 
> parisc (somewhat inaccurately) said:
> 
>  This option may have a slight performance impact because a
>  portion of the kernel code won't be covered by a TLB anymore.
> 
> IIUC, arm64 does what parisc is hinting at: mappings at the end are
> broken down to PAGE_SIZE. On x86, IIUC, there's actually no change to
> TLB performance due to how the mappings are already set up.

BTW., a good strategy with RAM sizes above say 4GB would be to just round up to 
the next large-TLB boundary (2MB) and waste 0-2MB of RAM - which is only 0.05% 
of 
4GB of RAM. On most workloads, especially with SSDs it's probably a positive 
RAM 
vs. performance trade-off.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/3] x86/mm: Adapt MODULES_END based on Fixmap section size

2017-03-16 Thread Ingo Molnar

* Thomas Garnier  wrote:

> This patch aligns MODULES_END to the beginning of the Fixmap section.
> It optimizes the space available for both sections. The address is
> pre-computed based on the number of pages required by the Fixmap
> section.
> 
> It will allow GDT remapping in the Fixmap section. The current
> MODULES_END static address does not provide enough space for the kernel
> to support a large number of processors.
> 
> Signed-off-by: Thomas Garnier 
> ---
> Based on next-20170308
> ---
>  Documentation/x86/x86_64/mm.txt | 5 -
>  arch/x86/include/asm/pgtable_64_types.h | 3 ++-
>  arch/x86/kernel/module.c| 1 +
>  arch/x86/mm/dump_pagetables.c   | 1 +
>  arch/x86/mm/kasan_init_64.c | 1 +
>  mm/vmalloc.c| 1 +

> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "internal.h"

Note that asm/fixmap.h is an x86-ism that isn't present in many other 
architectures, so this hunk will break the build.

To make progress with these patches I've fixed it up with an ugly #ifdef 
CONFIG_X86, but it needs a real solution instead before this can be pushed 
upstream.

Thanks,

Ingo

=>
 mm/vmalloc.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index dabea6a29fad..b7d2a23349f4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -35,7 +35,10 @@
 #include 
 #include 
 #include 
-#include 
+
+#ifdef CONFIG_X86
+# include 
+#endif
 
 #include "internal.h"
 
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-28 Thread Ingo Molnar

* Masami Hiramatsu  wrote:

> @@ -1824,6 +1823,30 @@ void unregister_jprobes(struct jprobe **jps, int num)
>  EXPORT_SYMBOL_GPL(unregister_jprobes);
>  
>  #ifdef CONFIG_KRETPROBES
> +
> +/* Try to use free instance first, if failed, try to allocate new instance */
> +struct kretprobe_instance *kretprobe_alloc_instance(struct kretprobe *rp)
> +{
> + struct kretprobe_instance *ri = NULL;
> + unsigned long flags = 0;
> +
> + raw_spin_lock_irqsave(&rp->lock, flags);
> + if (!hlist_empty(&rp->free_instances)) {
> + ri = hlist_entry(rp->free_instances.first,
> + struct kretprobe_instance, hlist);
> + hlist_del(&ri->hlist);
> + }
> + raw_spin_unlock_irqrestore(&rp->lock, flags);
> +
> + /* Populate max active instance if possible */
> + if (!ri && rp->maxactive < KRETPROBE_MAXACTIVE_ALLOC) {
> + ri = kmalloc(sizeof(*ri) + rp->data_size, GFP_ATOMIC);
> + if (ri)
> + rp->maxactive++;
> + }
> +
> + return ri;
> +}
>  /*
>   * This kprobe pre_handler is registered with every kretprobe. When probe
>   * hits it will set up the return probe.
> @@ -1846,14 +1869,8 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> struct pt_regs *regs)
>   }
>  
>   /* TODO: consider to only swap the RA after the last pre_handler fired 
> */
> - hash = hash_ptr(current, KPROBE_HASH_BITS);
> - raw_spin_lock_irqsave(&rp->lock, flags);
> - if (!hlist_empty(&rp->free_instances)) {
> - ri = hlist_entry(rp->free_instances.first,
> - struct kretprobe_instance, hlist);
> - hlist_del(&ri->hlist);
> - raw_spin_unlock_irqrestore(&rp->lock, flags);
> -
> + ri = kretprobe_alloc_instance(rp);
> + if (ri) {
>   ri->rp = rp;
>   ri->task = current;
>  
> @@ -1868,13 +1885,13 @@ static int pre_handler_kretprobe(struct kprobe *p, 
> struct pt_regs *regs)
>  
>   /* XXX(hch): why is there no hlist_move_head? */
>   INIT_HLIST_NODE(&ri->hlist);
> + hash = hash_ptr(current, KPROBE_HASH_BITS);
>   kretprobe_table_lock(hash, &flags);
>   hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
>   kretprobe_table_unlock(hash, &flags);
> - } else {
> + } else
>   rp->nmissed++;
> - raw_spin_unlock_irqrestore(&rp->lock, flags);
> - }
> +
>   return 0;
>  }
>  NOKPROBE_SYMBOL(pre_handler_kretprobe);

So this is something I missed while the original code was merged, but the 
concept 
looks a bit weird: why do we do any "allocation" while a handler is executing?

That's fundamentally fragile. What's the maximum number of parallel 
'kretprobe_instance' required per kretprobe - one per CPU?

If so then we should preallocate all of them when they are installed and not do 
any alloc/free dance when executing them.

This will also speed them up, and increase robustness all around.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-03-29 Thread Ingo Molnar

* Masami Hiramatsu  wrote:

> > So this is something I missed while the original code was merged, but the 
> > concept 
> > looks a bit weird: why do we do any "allocation" while a handler is 
> > executing?
> > 
> > That's fundamentally fragile. What's the maximum number of parallel 
> > 'kretprobe_instance' required per kretprobe - one per CPU?
> 
> It depends on the place where we put the probe. If the probed function will be
> blocked (yield to other tasks), then we need a same number of threads on
> the system which can invoke the function. So, ultimately, it is same
> as function_graph tracer, we need it for each thread.

So then put it into task_struct (assuming there's no kretprobe-inside-kretprobe 
nesting allowed). There's just no way in hell we should be calling any complex 
kernel function from kernel probes!

I mean, think about it, a kretprobe can be installed in a lot of places, and 
now 
we want to call get_free_pages() from it?? This would add a massive amount of 
fragility.

Instrumentation must be _simple_, every patch that adds more complexity to the 
most fundamental code path of it should raise a red flag ...

So let's make this more robust, ok?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH tip/master 2/3] kprobes: Allocate kretprobe instance if its free list is empty

2017-04-11 Thread Ingo Molnar

* Masami Hiramatsu  wrote:

> On Thu, 30 Mar 2017 08:53:32 +0200
> Ingo Molnar  wrote:
> 
> > 
> > * Masami Hiramatsu  wrote:
> > 
> > > > So this is something I missed while the original code was merged, but 
> > > > the concept 
> > > > looks a bit weird: why do we do any "allocation" while a handler is 
> > > > executing?
> > > > 
> > > > That's fundamentally fragile. What's the maximum number of parallel 
> > > > 'kretprobe_instance' required per kretprobe - one per CPU?
> > > 
> > > It depends on the place where we put the probe. If the probed function 
> > > will be
> > > blocked (yield to other tasks), then we need a same number of threads on
> > > the system which can invoke the function. So, ultimately, it is same
> > > as function_graph tracer, we need it for each thread.
> > 
> > So then put it into task_struct (assuming there's no 
> > kretprobe-inside-kretprobe 
> > nesting allowed).
> 
> No, that is possible to put several kretprobes on same thread, e.g.
> the func1() is called from func2(), user can put kretprobes for each
> function at same time.
> So the possible solution is to allocate new return-stack for each task_struct,
> and that is what the function-graph tracer did.
> 
> Anyway, I'm considering to integrate kretprobe_instance with the ret_stack.
> It will increase memory usage for kretprobes, but can provide safer way
> to allocate kretprobe_instance.

Ok, that sounds good to me.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] kbuild: Cache exploratory calls to the compiler

2017-10-05 Thread Ingo Molnar

* Douglas Anderson  wrote:

> This two-patch series attempts to speed incremental builds of the
> kernel up by a bit.  How much of a speedup you get depends a lot on
> your environment, specifically the speed of your workstation and how
> fast it takes to invoke the compiler.
> 
> In the Chrome OS build environment you get a really big win.  For an
> incremental build (via emerge) I measured a speedup from ~1 minute to
> ~35 seconds.

Very impressive!

> [...]  ...but Chrome OS calls the compiler through a number of wrapper 
> scripts 
> and also calls the kernel make at least twice for an emerge (during compile 
> stage and install stage), so it's a bit of a worst case.

I don't think that's a worst case: incremental builds are very commonly used 
during kernel development and kernel testing. (I'd even argue that the 
performnace 
of incremental builds is one of the most important features of a build system.)

That it's called twice in the Chrome OS build system does not change the 
proportion of the speedup.

> Perhaps a more realistic measure of the speedup others might see is
> running "time make help > /dev/null" outside of the Chrome OS build
> environment on my system.  When I do this I see that it took more than
> 1.0 seconds before and less than 0.2 seconds after.  So presumably
> this has the ability to shave ~0.8 seconds off an incremental build
> for most folks out there.  While 0.8 seconds savings isn't huge, it
> does make incremental builds feel a lot snappier.

This is a huge deal!

FWIIW I have tested your patches and they work fine here. Here's the 
before/after 
performance testing of various styles of build times of the scheduler.

First the true worst case is a full rebuild:

[ before ]

  triton:~/tip> perf stat --null --repeat 3 --pre "make clean 2>/dev/null 2>&1" 
make kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (3 runs):

   4.693974827 seconds time elapsed 
 ( +-  0.05% )

[ after ]

  triton:~/tip> perf stat --null --repeat 3 --pre "make clean 2>/dev/null 2>&1" 
make kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (3 runs):

   4.391769610 seconds time elapsed 
 ( +-  0.21% )

Still a ~6% speedup which is nice to have.

Then the best case, a fully cached rebuild of a specific subsystem - which I 
personally do all the time when I don't remember whether I already built the 
kernel or not:

[ before ]

triton:~/tip> taskset 1 perf stat --null --pre "sync" --repeat 10 make 
kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (10 runs):

   0.439517157 seconds time elapsed 
 ( +-  0.14% )

[ after ]

  triton:~/tip> taskset 1 perf stat --null --pre "sync" --repeat 10 make 
kernel/sched/ >/dev/null

 Performance counter stats for 'make kernel/sched/' (10 runs):

   0.148483807 seconds time elapsed 
 ( +-  0.57% )

A 300% speedup on my system!

So I wholeheartedly endorse the whole concept of caching build environment 
invariants:

  Tested-by: Ingo Molnar 

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll

2017-11-13 Thread Ingo Molnar

* Quan Xu  wrote:

> From: Quan Xu 
> 
> To reduce the cost of poll, we introduce three sysctl to control the
> poll time when running as a virtual machine with paravirt.
> 
> Signed-off-by: Yang Zhang 
> Signed-off-by: Quan Xu 
> ---
>  Documentation/sysctl/kernel.txt |   35 +++
>  arch/x86/kernel/paravirt.c  |4 
>  include/linux/kernel.h  |6 ++
>  kernel/sysctl.c |   34 ++
>  4 files changed, 79 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 694968c..30c25fb 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.
>  
>  ==
>  
> +paravirt_poll_grow: (X86 only)
> +
> +Multiplied value to increase the poll time. This is expected to take
> +effect only when running as a virtual machine with CONFIG_PARAVIRT
> +enabled. This can't bring any benifit on bare mental even with
> +CONFIG_PARAVIRT enabled.
> +
> +By default this value is 2. Possible values to set are in range {2..16}.
> +
> +==
> +
> +paravirt_poll_shrink: (X86 only)
> +
> +Divided value to reduce the poll time. This is expected to take effect
> +only when running as a virtual machine with CONFIG_PARAVIRT enabled.
> +This can't bring any benifit on bare mental even with CONFIG_PARAVIRT
> +enabled.
> +
> +By default this value is 2. Possible values to set are in range {2..16}.
> +
> +==
> +
> +paravirt_poll_threshold_ns: (X86 only)
> +
> +Controls the maximum poll time before entering real idle path. This is
> +expected to take effect only when running as a virtual machine with
> +CONFIG_PARAVIRT enabled. This can't bring any benifit on bare mental
> +even with CONFIG_PARAVIRT enabled.
> +
> +By default, this value is 0 means not to poll. Possible values to set
> +are in range {0..50}. Change the value to non-zero if running
> +latency-bound workloads in a virtual machine.

I absolutely hate it how this hybrid idle loop polling mechanism is not 
self-tuning!

Please make it all work fine by default, and automatically so, instead of 
adding 
three random parameters...

And if it cannot be done automatically then we should rather not do it at all. 
Maybe the next submitter of a similar feature can think of a better approach.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll

2017-11-13 Thread Ingo Molnar

* Quan Xu  wrote:

> 
> 
> On 2017/11/13 23:08, Ingo Molnar wrote:
> > * Quan Xu  wrote:
> > 
> > > From: Quan Xu 
> > > 
> > > To reduce the cost of poll, we introduce three sysctl to control the
> > > poll time when running as a virtual machine with paravirt.
> > > 
> > > Signed-off-by: Yang Zhang 
> > > Signed-off-by: Quan Xu 
> > > ---
> > >   Documentation/sysctl/kernel.txt |   35 
> > > +++
> > >   arch/x86/kernel/paravirt.c  |4 
> > >   include/linux/kernel.h  |6 ++
> > >   kernel/sysctl.c |   34 
> > > ++
> > >   4 files changed, 79 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/Documentation/sysctl/kernel.txt 
> > > b/Documentation/sysctl/kernel.txt
> > > index 694968c..30c25fb 100644
> > > --- a/Documentation/sysctl/kernel.txt
> > > +++ b/Documentation/sysctl/kernel.txt
> > > @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this 
> > > one.
> > >   ==
> > > +paravirt_poll_grow: (X86 only)
> > > +
> > > +Multiplied value to increase the poll time. This is expected to take
> > > +effect only when running as a virtual machine with CONFIG_PARAVIRT
> > > +enabled. This can't bring any benifit on bare mental even with
> > > +CONFIG_PARAVIRT enabled.
> > > +
> > > +By default this value is 2. Possible values to set are in range {2..16}.
> > > +
> > > +==
> > > +
> > > +paravirt_poll_shrink: (X86 only)
> > > +
> > > +Divided value to reduce the poll time. This is expected to take effect
> > > +only when running as a virtual machine with CONFIG_PARAVIRT enabled.
> > > +This can't bring any benifit on bare mental even with CONFIG_PARAVIRT
> > > +enabled.
> > > +
> > > +By default this value is 2. Possible values to set are in range {2..16}.
> > > +
> > > +==
> > > +
> > > +paravirt_poll_threshold_ns: (X86 only)
> > > +
> > > +Controls the maximum poll time before entering real idle path. This is
> > > +expected to take effect only when running as a virtual machine with
> > > +CONFIG_PARAVIRT enabled. This can't bring any benifit on bare mental
> > > +even with CONFIG_PARAVIRT enabled.
> > > +
> > > +By default, this value is 0 means not to poll. Possible values to set
> > > +are in range {0..50}. Change the value to non-zero if running
> > > +latency-bound workloads in a virtual machine.
> > I absolutely hate it how this hybrid idle loop polling mechanism is not
> > self-tuning!
> 
> Ingo, actually it is self-tuning..

Then why the hell does it touch the syscall ABI?

> could I only leave paravirt_poll_threshold_ns parameter (the maximum poll 
> time), 
> which is as similar as "adaptive halt-polling" Wanpeng mentioned.. then user 
> can 
> turn it off, or find an appropriate threshold for some odd scenario..

That way lies utter madness. Maybe add it as a debugfs knob, but exposing it to 
userspace: NAK.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] sched/deadline: fix runtime accounting in documentation

2017-11-16 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Tue, Nov 14, 2017 at 12:19:26PM +0100, Claudio Scordino wrote:
> > Signed-off-by: Claudio Scordino 
> > Signed-off-by: Luca Abeni 
> > Acked-by: Daniel Bristot de Oliveira 
> > CC: Jonathan Corbet 
> > CC: "Peter Zijlstra (Intel)" 
> > CC: Ingo Molnar 
> > CC: linux-doc@vger.kernel.org
> > Cc: Tommaso Cucinotta 
> > Cc: Mathieu Poirier 
> 
> ACK, Ingo can you route this?

Yeah, I've applied it to tip:sched/urgent, should go upstream in a few days.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] acpi, x86: Add SPCR table support

2017-12-07 Thread Ingo Molnar

* Prarit Bhargava  wrote:

> The SPCR (Serial Port Console Redirection) Table provides information
> about the configuration of serial port.  This information can be used
> to configure the early console.

s/about the configuration of serial port
 /about the configuration of the serial port

> SPCR support was added for arm64 and is made available across all arches
> in this patchset.  The first patch adds a weak per-arch configuration function
> and moves the SPCR code into ACPI.  The second patch adds support to x86.
> 
> The existing behaviour on arm64 is maintained.  If the SPCR exists the
> earlycon and console are automatically configured.

s/arm64
 /ARM64

which is easier to read and it's also the prevalent spelling:

 triton:~/tip> for N in $(git grep -ih arm64 arch/arm64/ | sed 's/[[:punct:]]/ 
/g'); do echo $N | grep -iw arm64; done | sort | uniq -c
 412 arm64
   1 Arm64
 854 ARM64

> The existing default behaviour on x86 is also maintained.  If no console or
> earlycon parameter is defined and the SPCR exists , the serial port is not
> configured.  If the earlycon parameter is used both the early console
> and the console are configured using the data from the SPCR.

s/exists , the
 /exists, the

But, the logic to not use the SPCR looks confusing to me.

The SPCR is only present if the user has explicitly configured a serial console 
for that machine, either in the firmware, or remotely via IPMI, correct? I.e. 
SPCR 
will not be spuriously present by default on systems that have a serial console 
but the user never expressed any interest for them, right?

If so then we should pick up that serial console configuration and activate it, 
regardless of any kernel boot options!

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] acpi, x86: Add SPCR table support

2017-12-11 Thread Ingo Molnar

* Prarit Bhargava  wrote:

> If I disable "Serial Port Console Debug" in my BIOS I still see the SPCR 
> configured:
> 
> [root@prarit-lab ~]# dmesg | grep SPCR
> [0.00] ACPI: SPCR 0x69031000 50 (v01
>   )
> 
> AFAICT the SPCR is always enabled on some systems.

Fair enough.

> > If so then we should pick up that serial console configuration and activate 
> > it, 
> > regardless of any kernel boot options!
> 
> I'm worried about someone who doesn't want a console on ttyS0 suddenly ending 
> up
> with one.  The SPCR could contain incorrect data, etc.

Yeah, that's not good.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] acpi, x86: Use SPCR table for earlycon on x86

2017-12-12 Thread Ingo Molnar

* Prarit Bhargava  wrote:

> The ACPI SPCR code has been used to define an earlycon console for ARM64
> and can be used for x86.
> 
> Modify the ACPI SPCR parsing code to account for console behaviour
> differences between ARM64 and x86.  Initialize the SPCR code from
> x86 ACPI initialization code.
> 
> Signed-off-by: Prarit Bhargava 
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux...@vger.kernel.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-ser...@vger.kernel.org
> Cc: Bhupesh Sharma 
> Cc: Lv Zheng 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Jonathan Corbet 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: "Rafael J. Wysocki" 
> Cc: Timur Tabi 
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 3 +++
>  arch/arm64/kernel/acpi.c| 2 +-
>  arch/x86/kernel/acpi/boot.c | 4 
>  drivers/acpi/Kconfig| 2 +-
>  drivers/acpi/spcr.c | 7 +--
>  include/linux/acpi.h    | 7 +--
>  6 files changed, 19 insertions(+), 6 deletions(-)

LGTM from an x86 perspective:

  Acked-by: Ingo Molnar 

(assuming it causes no regressions in linux-next.)

Since patch #1 affects ARM64 significantly, once that patch is acceptable to 
the 
arm64 folks feel free to upstream this second patch via the ARM64 tree as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL 00/16] perf/core improvements and fixes

2016-02-03 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   This is on top of the previously submitted perf-core-for-mingo tag,
> please consider applying,
> 
> - Arnaldo
> 
> The following changes since commit 5ac76283b32b116c58e362e99542182ddcfc8262:
> 
>   perf cpumap: Auto initialize cpu__max_{node,cpu} (2016-01-26 16:08:36 -0300)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo-2
> 
> for you to fetch changes up to 814568db641f6587c1e98a3a85f214cb6a30fe10:
> 
>   perf build: Align the names of the build tests: (2016-01-29 17:51:04 -0300)
> 
> 
> New features:
> 
> - Port 'perf kvm stat' to PowerPC (Hemant Kumar)
> 
> Infrastructure:
> 
> - Use the 'feature-dump' target to do the feature checks just once and then
>   add code to reuse that in the tests/make makefile, speeding up the
>   'make -C tools/perf build-test' target (Wang Nan)
> 
> - Reduce the number of tests the 'build-test' target do to those that don't
>   pollute the source tree (Arnaldo Carvalho de Melo)
> 
> - Improve the output of the build tests a bit by aligning the name of the
>   tests, more can be done to filter out uninteresting info in the output
>   (Arnaldo Carvalho de Melo)
> 
> - Add perf_evlist pointer to *info_priv_size(), more prep work for
>   supporting the coresight architecture (Mathieu Poirier)
> 
> - Improve the 'perf test bp_signal' test (Wang Nan)
> 
> - Check environment before starting the BPF 'perf test', so that we can just
>   'Skip' older kernels instead of 'FAIL'ing them (Wang Nan)
> 
> - Fix cpumode of synthesized buildid event (Wang Nan)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Arnaldo Carvalho de Melo (2):
>   perf tools: Speed up build-tests by reducing the number of builds tested
>   perf build: Align the names of the build tests:
> 
> Hemant Kumar (4):
>   perf kvm/{x86,s390}: Remove dependency on uapi/kvm_perf.h
>   perf kvm/{x86,s390}: Remove const from kvm_events_tp
>   perf kvm/powerpc: Port perf kvm stat to powerpc
>   perf kvm/powerpc: Add support for HCALL reasons
> 
> Jiri Olsa (1):
>   perf build: Fix feature-dump checks, we need to test all features
> 
> Mathieu Poirier (1):
>   perf auxtrace: Add perf_evlist pointer to *info_priv_size()
> 
> Wang Nan (8):
>   tools build: Check basic headers for test-compile feature checker
>   perf build: Remove all condition feature check {C,LD}FLAGS
>   perf build: Use feature dump file for build-test
>   perf buildid: Fix cpumode of buildid event
>   perf test: Check environment before start real BPF test
>   perf test: Improve bp_signal
>   perf tools: Move timestamp creation to util
>   perf record: Use OPT_BOOLEAN_SET for buildid cache related options
> 
>  tools/build/Makefile.feature   |   8 ++
>  tools/build/feature/test-compile.c |   2 +
>  tools/perf/Makefile|  11 +-
>  tools/perf/arch/powerpc/Makefile   |   2 +
>  tools/perf/arch/powerpc/util/Build |   1 +
>  tools/perf/arch/powerpc/util/book3s_hcalls.h   | 123 ++
>  tools/perf/arch/powerpc/util/book3s_hv_exits.h |  33 +
>  tools/perf/arch/powerpc/util/kvm-stat.c| 170 
> +
>  tools/perf/arch/s390/util/kvm-stat.c   |  10 +-
>  tools/perf/arch/x86/util/intel-bts.c   |   4 +-
>  tools/perf/arch/x86/util/intel-pt.c|   4 +-
>  tools/perf/arch/x86/util/kvm-stat.c|  16 ++-
>  tools/perf/builtin-buildid-cache.c |  14 +-
>  tools/perf/builtin-kvm.c   |  38 --
>  tools/perf/builtin-record.c|  12 +-
>  tools/perf/config/Makefile | 101 +++
>  tools/perf/tests/bp_signal.c   | 140 
>  tools/perf/tests/bpf.c |  37 ++
>  tools/perf/tests/make  |  39 +-
>  tools/perf/util/auxtrace.c |   7 +-
>  tools/perf/util/auxtrace.h |   6 +-
>  tools/perf/util/build-id.c |   6 +-
>  tools/perf/util/kvm-stat.h |   8 +-
>  tools/perf/util/util.c |  17 +++
>  tools/perf/util/util.h |   1 +
>  25 files changed, 688 insertions(+), 122 deletions(-)
>  create mode 100644 tools/perf/arch/powerpc/util/book3s_hcalls.h
>  create mode 100644 tools/perf/arch/powerpc/util/book3s_hv_exits.h
>  create mode 100644 tools/perf/arch/powerpc/util/kvm-stat.c

Pulled, thanks a lot Arnaldo!

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-i

Re: [PATCH] hpet: drop stale link

2016-02-03 Thread Ingo Molnar

* Michael S. Tsirkin  wrote:

> Looks like the HPET spec at intel.com got moved.
> It isn't hard to find so drop the link, just mention
> the revision assumed.
> 
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/char/hpet.c   | 2 +-
>  Documentation/timers/hpet.txt | 2 +-
>  arch/x86/Kconfig  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
> index 240b6cf..86c5f4d 100644
> --- a/drivers/char/hpet.c
> +++ b/drivers/char/hpet.c
> @@ -42,7 +42,7 @@
>  /*
>   * The High Precision Event Timer driver.
>   * This driver is closely modelled after the rtc.c driver.
> - * http://www.intel.com/hardwaredesign/hpetspec_1.pdf
> + * 
> http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf
>   */
>  #define  HPET_USER_FREQ  (64)
>  #define  HPET_DRIFT  (500)
> diff --git a/Documentation/timers/hpet.txt b/Documentation/timers/hpet.txt
> index 767392f..f24e978 100644
> --- a/Documentation/timers/hpet.txt
> +++ b/Documentation/timers/hpet.txt
> @@ -3,7 +3,7 @@
>  The High Precision Event Timer (HPET) hardware follows a specification
>  by Intel and Microsoft which can be found at
>  
> - http://www.intel.com/hardwaredesign/hpetspec_1.pdf
> +http://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/software-developers-hpet-spec-1-0a.pdf
>  
>  Each HPET has one fixed-rate counter (at 10+ MHz, hence "High Precision")
>  and up to 32 comparators.  Normally three or more comparators are provided,
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 9af2e63..f32faba 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -778,7 +778,7 @@ config HPET_TIMER
> The HPET provides a stable time base on SMP
> systems, unlike the TSC, but it is more expensive to access,
> as it is off-chip.  You can find the HPET spec at
> -   .
> +   
> .
>  
> You can safely choose Y here.  However, HPET will only be
> activated if the platform and the BIOS support this feature.

In fact we can also remove the link from this Kconfig entry, as it's highly 
unlikely that a user doing kernel configuration would want to start reading a 
spec.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86/intel/quark: Parameterize the kernel's IMR lock logic

2016-02-17 Thread Ingo Molnar

* Bryan O'Donoghue  wrote:

> Currently when setting up an IMR around the kernel's .text area we lock
> that IMR, preventing further modification. While superficially this appears
> to be the right thing to do, in fact this doesn't account for a legitimate
> change in the memory map such as when executing a new kernel via kexec.
> 
> In such a scenario a second kernel can have a different size and location
> to it's predecessor and can view some of the memory occupied by it's
> predecessor as legitimately usable DMA RAM. If this RAM were then
> subsequently allocated to DMA agents within the system it could conceivably
> trigger an IMR violation.
> 
> This patch fixes the this potential situation by keeping the kernel's .text
> section IMR lock bit false by default, thus ensuring that a user of the
> system will have kexec just work without having to pass special parameters
> on the kernel command line to influence the state of the kernel's IMR. If a
> user so desires then it is possible to explicitly set the lock bit to true.
> 
> The new parameter is kernel_imr and this may be set to kernel_imr=locked or
> kernel_imr=unlocked.
> 
> Signed-off-by: Bryan O'Donoghue 
> Reported-by: Andy Shevchenko 
> ---
>  Documentation/kernel-parameters.txt |  7 +++
>  arch/x86/platform/intel-quark/imr.c | 39 
> +++--
>  2 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/kernel-parameters.txt 
> b/Documentation/kernel-parameters.txt
> index 9a53c92..1aad1d2 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1359,6 +1359,13 @@ bytes respectively. Such letter suffixes can also be 
> entirely omitted.
>   hardware thread id mappings.
>   Format: :
>  
> + kernel_imr= [X86, INTEL_IMR] Control the lock bit for the Isolated
> + Memory Region protecting the kernel's .text section on
> + X86 architectures that support IMRs such as Quark X1000.
> + When an IMR lock bit is set it is not possible to unset
> + it without a CPU reset.
> + Format : {locked | unlocked (default) }
> +
>   keep_bootcon[KNL]
>   Do not unregister boot console at start. This is only
>   useful for debugging when something happens in the 
> window
> diff --git a/arch/x86/platform/intel-quark/imr.c 
> b/arch/x86/platform/intel-quark/imr.c
> index c61b6c3..8249f65 100644
> --- a/arch/x86/platform/intel-quark/imr.c
> +++ b/arch/x86/platform/intel-quark/imr.c
> @@ -37,6 +37,7 @@
>  struct imr_device {
>   struct dentry   *file;
>   boolinit;
> + boolkernel_imr_locked;
>   struct mutexlock;
>   int max_imr;
>   int reg_base;
> @@ -568,10 +569,15 @@ static inline int imr_clear(int reg)
>   * BIOS and Grub both setup IMRs around compressed kernel, initrd memory
>   * that need to be removed before the kernel hands out one of the IMR
>   * encased addresses to a downstream DMA agent such as the SD or Ethernet.
> + * Additionally if the current kernel is executing via kexec then we need to
> + * tear down the IMR the previous kernel had setup.
> + *
>   * IMRs on Galileo are setup to immediately reset the system on violation.
>   * As a result if you're running a root filesystem from SD - you'll need
>   * the boot-time IMRs torn down or you'll find seemingly random resets when
> - * using your filesystem.
> + * using your filesystem; similarly if you're doing a kexec boot of Linux 
> then
> + * its required to sanitize the memory map with-respect to the previous IMR
> + * settings.
>   *
>   * @idev:pointer to imr_device structure.
>   * @return:
> @@ -592,14 +598,20 @@ static void __init imr_fixup_memmap(struct imr_device 
> *idev)
>   end = (unsigned long)__end_rodata - 1;
>  
>   /*
> -  * Setup a locked IMR around the physical extent of the kernel
> -  * from the beginning of the .text secton to the end of the
> -  * .rodata section as one physically contiguous block.
> +  * Setup an IMR around the physical extent of the kernel from the
> +  * beginning of the .text section to the end of the .rodata section
> +  * as one physically contiguous block.
>*
>* We don't round up @size since it is already PAGE_SIZE aligned.
>* See vmlinux.lds.S for details.
> +  *
> +  * By default this IMR is unlocked to enable subsequent kernels booted
> +  * by kexec to tear down the IMR we are setting up below. Its possible
> +  * to set this IMR to the locked state by passing kernel_imr=locked on
> +  * the kernel command line.
>*/
> - ret = imr_add_range(base, size, IMR_CPU, IMR_CPU, true);
> + ret = imr_add_range(base, size, IMR_CPU, IMR_CPU,
> + imr_dev.kernel_i

Re: [PATCH] x86/intel/quark: Parameterize the kernel's IMR lock logic

2016-02-18 Thread Ingo Molnar

* Bryan O'Donoghue  wrote:

> On Thu, 2016-02-18 at 08:58 +0100, Ingo Molnar wrote:
> > So why not simply do the patch below? Very few people use boot
> > parameters, and the 
> > complexity does not seem to be worth it.
> > 
> > Furthermore I think an IMR range in itself is safe enough - it's not
> > like such 
> > register state is going to be randomly corrupted, even with the
> > 'lock' bit unset. 
> 
> 
> Hi Ingo.
> 
> I agree - to flip the lock bit you need to be in ring-0 anyway.
> 
> > So it's a perfectly fine protective measure against accidental memory
> > corruption 
> > from the DMA space. It should not try to be more than that.
> > 
> > And once we do this, I suggest we get rid of the 'lock' parameter
> > altogether - 
> > that will further simplify the code.
> > 
> > Thanks,
> > 
> > Ingo
> 
> That was the V1 of this patch
> 
> https://groups.google.com/forum/#!topic/linux.kernel/6ZuVOF3TJow

heh ;-)

> Andriy asked for the boot parameter to control the state of the IMR
> lock bit, I'm just as happy to go back to that version TBH

I really think it's over-engineered - especially considering that with the 
kernel 
lock-down removed there's no other IMR area that is really locked down - so we 
could get rid of the whole 'locked' logic that would simplify the code 
throughout.

Yeah, it's a nice looking hardware feature - but I don't think it's 
particularly 
useful in terms of extra protection.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] x86: PAT: Documentation: rewrite "MTRR effects on PAT / non-PAT systems"

2016-03-05 Thread Ingo Molnar

* Luis R. Rodriguez  wrote:

> The current documentation refers to using set_memory_wc() as a
> possible hole strategy when you have overlapping ioremap() regions,

The whole explanation should talk about virtual aliases over the same physical 
address, not some 'overlapping regions'.

I see where this talk about 'overlap' comes: the memtype rbtree in 
arch/x86/mm/pat_rbtree.c indeed has memtype ranges that may overlap on the 
physical side. But it is highly confusing to call this 'overlapping' on the 
driver 
API documentation level without making it really clear what it's about.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 05/12] task_isolation: support CONFIG_TASK_ISOLATION_ALL

2016-03-05 Thread Ingo Molnar

* Chris Metcalf  wrote:

> On 03/03/2016 01:34 PM, Andi Kleen wrote:
> >Chris Metcalf  writes:
> >>+config TASK_ISOLATION_ALL
> >>+   bool "Provide task isolation on all CPUs by default (except CPU 0)"
> >>+   depends on TASK_ISOLATION
> >>+   help
> >>+If the user doesn't pass the task_isolation boot option to
> >>+define the range of task isolation CPUs, consider that all
> >>+CPUs in the system are task isolation by default.
> >>+Note the boot CPU will still be kept outside the range to
> >>+handle timekeeping duty, etc.
> >That seems like a very dangerous Kconfig option.
> >"CONFIG_BREAK_EVERYTHING"
> >If someone sets that by default they will have a lot of trouble.
> >
> >I wouldn't add that, make it a run time option only.
> 
> So you were thinking, allow a special boot syntax "task_isolation=all",
> which puts all the cores into task isolation mode except the boot core?
> 
> My original argument was that it was so parallel to the existing
> CONFIG_NO_HZ_FULL_ALL option that it just made sense to do it,
> and some testers complained about having to specify the precise
> cpu range, so this seemed like an easy fix.

Yes, it's absolutely legitimate to offer boot options as Kconfig options as 
well - 
in fact that will get things like randconfig bootups stumble upon them and do 
some 
free testing for you. Just ignore Andi's nonsensical objection.

One day we'll have a unified boot parameter/Kconfig/sysctl mechanism, so that 
it 
will be possible to say things like this on the boot command line:

  CONFIG_NO_HZ_FULL_ALL=y

... which will eliminate quite a bit of the current schizm between Kconfig and 
boot time parameters.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-06 Thread Ingo Molnar

* Kees Cook  wrote:

> On Wed, Apr 6, 2016 at 1:56 PM, Linus Torvalds
>  wrote:
> > On Wed, Apr 6, 2016 at 1:17 PM, Pavel Machek  wrote:
> >>
> >> Why is kASLR incompatible with hibernation? We can hibernate have
> >> 4.3 kernel resume hibernation image of 4.2 kernel (on x86-64, and I
> >> have patches for x86). Resuming kernel with different randomization
> >> does not look that much different...
> >
> > Oh, I'd absolutely prefer to just allow kaslr together with
> > hibernation if it actually works.
> >
> > Could the people who piped up to say that they actually use
> > hibernation just try passing in the "kaslr" command line option on
> > their machine, and see if it works for them? We could just remove the
> > "no kaslr with hibername" code - or at least limit it to 32-bit for
> > now..
> >
> > Because that would be lovely.
> 
> This is where our original investigation of having them coexist ended:
> https://lkml.org/lkml/2014/6/15/180
> 
> To quote Rafael Wysocki:
> > We're jumping from the boot kernel into the image kernel.  The virtual 
> > address
> > comes from the image kernel, but the boot kernel has to use it.  The only 
> > way
> > we can ensure that we'll jump to the right place is to pass the physical 
> > address
> > in the header (otherwise we de facto assume that the virtual address of the
> > target page frame will be the same in both the boot and the image kernels).
> >
> > The missing piece is that the code in swsusp_arch_resume() sets up temporary
> > page tables to ensure that they won't be overwritten while copying the last
> > remaining image kernel pages to the right page frames (those page tables
> > have to be stored in page frames that are free from the kernel image 
> > perspective).
> >
> > But if the kernel address space is randomized, set_up_temporary_mappings()
> > really should duplicate the existing layout instead of creating a new one 
> > from
> > scratch.  Otherwise, virtual addresses before set_up_temporary_mappings() 
> > may
> > be different from the ones after it.

So as I suggested it in the previous mail, the right solution would be to pass 
in 
the randomization seed via a new kasl_seed=xyz boot option, and thus have the 
same 
addresses as prior hibernation.

That should make hibernation work as-is, with very little effort.

Two details I can think of:

1) the new option has to be hidden from /proc/cmdline, due to:

  triton:~/tip> ll /proc/cmdline
  -r--r--r-- 1 root root 0 Apr  6 23:45 /proc/cmdline

2)

another detail is that the new boot option has to be checked in 
choose_kernel_location(), to make sure it's done at the right point during 
bootup. 
That's a good place to remove it from the boot options string as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-06 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Kees Cook  wrote:
> 
> > On Wed, Apr 6, 2016 at 1:56 PM, Linus Torvalds
> >  wrote:
> > > On Wed, Apr 6, 2016 at 1:17 PM, Pavel Machek  wrote:
> > >>
> > >> Why is kASLR incompatible with hibernation? We can hibernate have
> > >> 4.3 kernel resume hibernation image of 4.2 kernel (on x86-64, and I
> > >> have patches for x86). Resuming kernel with different randomization
> > >> does not look that much different...
> > >
> > > Oh, I'd absolutely prefer to just allow kaslr together with
> > > hibernation if it actually works.
> > >
> > > Could the people who piped up to say that they actually use
> > > hibernation just try passing in the "kaslr" command line option on
> > > their machine, and see if it works for them? We could just remove the
> > > "no kaslr with hibername" code - or at least limit it to 32-bit for
> > > now..
> > >
> > > Because that would be lovely.
> > 
> > This is where our original investigation of having them coexist ended:
> > https://lkml.org/lkml/2014/6/15/180
> > 
> > To quote Rafael Wysocki:
> > > We're jumping from the boot kernel into the image kernel.  The virtual 
> > > address
> > > comes from the image kernel, but the boot kernel has to use it.  The only 
> > > way
> > > we can ensure that we'll jump to the right place is to pass the physical 
> > > address
> > > in the header (otherwise we de facto assume that the virtual address of 
> > > the
> > > target page frame will be the same in both the boot and the image 
> > > kernels).
> > >
> > > The missing piece is that the code in swsusp_arch_resume() sets up 
> > > temporary
> > > page tables to ensure that they won't be overwritten while copying the 
> > > last
> > > remaining image kernel pages to the right page frames (those page tables
> > > have to be stored in page frames that are free from the kernel image 
> > > perspective).
> > >
> > > But if the kernel address space is randomized, set_up_temporary_mappings()
> > > really should duplicate the existing layout instead of creating a new one 
> > > from
> > > scratch.  Otherwise, virtual addresses before set_up_temporary_mappings() 
> > > may
> > > be different from the ones after it.
> 
> So as I suggested it in the previous mail, the right solution would be to 
> pass in 
> the randomization seed via a new kasl_seed=xyz boot option, and thus have the 
> same 
> addresses as prior hibernation.
> 
> That should make hibernation work as-is, with very little effort.
> 
> Two details I can think of:
> 
> 1) the new option has to be hidden from /proc/cmdline, due to:
> 
>   triton:~/tip> ll /proc/cmdline
>   -r--r--r-- 1 root root 0 Apr  6 23:45 /proc/cmdline
> 
> 2)
> 
> another detail is that the new boot option has to be checked in 
> choose_kernel_location(), to make sure it's done at the right point during 
> bootup. 
> That's a good place to remove it from the boot options string as well.

... and I missed the biggest complication: to solve the chicken and egg problem 
with software_resume() running very late during bootup, software_resume() 
should 
probably kexec() the original kernel image, this time with the kaslr seed set 
in 
the boot parameters.

So it's two bootups...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-06 Thread Ingo Molnar

* Rafael J. Wysocki  wrote:

> On Wed, Apr 6, 2016 at 9:44 PM, Kees Cook  wrote:
> > When building with both CONFIG_HIBERNATION and CONFIG_RANDOMIZE_BASE,
> > one or the other must be chosen at boot-time. Until now, hibernation
> > was selected when no choice was made on the command line.
> >
> > To make the security benefits of kASLR more widely available to end
> > users (since the use of hibernation is becoming more rare and kASLR,
> > already available on x86, will be available on arm64 and MIPS soon),
> > this changes the default to preferring kASLR over hibernation. Users
> > wanting hibernation can turn off kASLR by adding "nokaslr" to the kernel
> > command line.
> >
> > Suggested-by: Linus Torvalds 
> > Signed-off-by: Kees Cook 
> 
> Acked-by: Rafael J. Wysocki 
> 
> Or do you want me to apply it?

I don't think this is a good idea, as it turns off emergency hibernation of 
laptops - many desktop distros support it by default.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-06 Thread Ingo Molnar

* Kees Cook  wrote:

> >> I don't think this is a good idea, as it turns off emergency hibernation 
> >> of 
> >> laptops - many desktop distros support it by default.
> >
> > Right, I forgot about this one.
> 
> When I last checked Ubuntu doesn't enable hibernation by default any more: 
> https://help.ubuntu.com/16.04/ubuntu-help/power-hibernate.html
> 
> And it seems like Fedora either doesn't either, or has a lot of people
> for whom it doesn't work:
> https://bugzilla.redhat.com/show_bug.cgi?id=1206936
> https://bugzilla.redhat.com/show_bug.cgi?id=1224151
> http://blog.kriptonium.com/2015/12/fedora-23-hibernate.html

Ok, that's a relatively recent development, I distinctly remember my laptop 
being 
hibernated in such a fashion fairly recently.

That makes it easier to hack around the kASLR incompatibility by making 
hibernation less useful. Personally I think that conceptually user space 
persistency (CRIU et al) is superior to kernel level hibernation - but
user-space save/restore is nowhere near as complete as kernel hibernation,
so it's still somewhat sad that it doesn't work ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-06 Thread Ingo Molnar

* Rafael J. Wysocki  wrote:

> [...]
> 
> One of the weak points is the final jump, because it has to be done to the 
> physical location of the image kernel's entry point even though the virtual 
> addresses of it may differ between the boot and the image kernels.  The seed 
> is 
> not needed for that, only the physical address of the entry point.  The boot 
> kernel doesn't have it today, though, because the virtual address of that is 
> passed in the image header. That should not be too difficult to change, 
> however.

I didn't realize we jumped to the image kernel as well - I (wrongly) assumed we 
kept the bootup kernel. That should indeed make hibernation mostly 
kASLR-invariant.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] kaslr: allow kASLR to be default over Hibernation

2016-04-15 Thread Ingo Molnar

* Kees Cook  wrote:

> 1) The x86 hibernation and KASLR code don't play well together currently.

Please fix it, don't just work it around ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] Intel Secure Guard Extensions

2016-04-27 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > What new syscalls would be needed for ssh to get all this support?
> 
> This patchset or similar, plus some user code and an enclave to use.
> 
> Sadly, on current CPUs, you also need Intel to bless the enclave.  It looks 
> like 
> new CPUs might relax that requirement.

That looks like a fundamental technical limitation in my book - to an open 
source 
user this is essentially a very similar capability as tboot: it only allows the 
execution of externally blessed static binary blobs...

I don't think we can merge any of this upstream until it's clear that the 
hardware 
owner running open-source user-space can also freely define/start his own 
secure 
enclaves without having to sign the enclave with any external party. I.e. 
self-signed enclaves should be fundamentally supported as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kbuild: deprecate cc-option-align

2017-06-22 Thread Ingo Molnar

* Masahiro Yamada  wrote:

> Documentation/kbuild/makefiles.txt says the change for align options
> occurred at GCC 3.0, and Documentation/process/changes.rst says the
> minimal supported GCC version is 3.2, so it should be safe to hard-code
> -falign* options.
> 
> Fix the only user arch/x86/Makefile_32.cpu and deprecate the
> cc-option-align.
> 
> Signed-off-by: Masahiro Yamada 
> ---
> 
> x86 maintainers,
> 
> If this patch looks OK, could you give me Acked-by?

LGTM in principle, although I haven't tested it:

   Acked-by: Ingo Molnar 

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 00/38] x86: Secure Memory Encryption (AMD)

2017-07-08 Thread Ingo Molnar

* Tom Lendacky  wrote:

> This patch series provides support for AMD's new Secure Memory Encryption 
> (SME)
> feature.

I'm wondering, what's the typical performance hit to DRAM access latency when 
SME 
is enabled?

On that same note, if the performance hit is noticeable I'd expect SME to not 
be 
enabled in native kernels typically - but still it looks like a useful hardware 
feature. Since it's controlled at the page table level, have you considered 
allowing SME-activated vmas via mmap(), even on kernels that are otherwise not 
using encrypted DRAM?

One would think that putting encryption keys into such encrypted RAM regions 
would 
generally improve robustness against various physical space attacks that want 
to 
extract keys but don't have full control of the CPU.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 37/38] compiler-gcc.h: Introduce __nostackp function attribute

2017-07-18 Thread Ingo Molnar

* Tom Lendacky  wrote:

> Create a new function attribute, __nostackp, that can used to turn off
> stack protection on a per function basis.
> 
> Signed-off-by: Tom Lendacky 
> ---
>  include/linux/compiler-gcc.h | 2 ++
>  include/linux/compiler.h | 4 
>  2 files changed, 6 insertions(+)
> 
> diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
> index cd4bbe8..682063b 100644
> --- a/include/linux/compiler-gcc.h
> +++ b/include/linux/compiler-gcc.h
> @@ -166,6 +166,8 @@
>  
>  #if GCC_VERSION >= 40100
>  # define __compiletime_object_size(obj) __builtin_object_size(obj, 0)
> +
> +#define __nostackp   __attribute__((__optimize__("no-stack-protector")))
>  #endif
>  
>  #if GCC_VERSION >= 40300
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 219f82f..63cbca1 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -470,6 +470,10 @@ static __always_inline void __write_once_size(volatile 
> void *p, void *res, int s
>  #define __visible
>  #endif
>  
> +#ifndef __nostackp
> +#define __nostackp
> +#endif

So I changed this from the hard to read and ambiguous "__nostackp" abbreviation 
(does it mean 'no stack pointer?') to "__nostackprotector", plus added this 
detail 
to the changelog:

| ( This is needed by the SME in-place kernel memory encryption feature,
|   which activates encryption in its sme_enable() function and thus changes 
the 
|   visible value of the stack protection cookie on function return. )

Agreed?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] acpi, spcr: Make SPCR available to x86

2018-01-20 Thread Ingo Molnar

* Prarit Bhargava  wrote:

> Note on testing:  I've tested this on several ARM64 and x86 boxes (only
> earlycon, console=ttyS0,115200, and with both options), verified that
> functionality on ARM64 has not changed (ie, CONFIG_ACPI_SPCR_TABLE is
> always =y), and verified functionality when !CONFIG_ACPI_SPCR_TABLE on
> x86.
> 
> P.
> 
> 8<
> 
> SPCR is currently only enabled or ARM64 and x86 can use SPCR to setup an
> early console.
> 
> General fixes include updating Documentation & Kconfig (for x86), updating
> comments, and changing parse_spcr() to acpi_parse_spcr(), and
> earlycon_init_is_deferred to earlycon_acpi_spcr_enable to be more
> descriptive.
> 
> On x86, many systems have a valid SPCR table but the table version is not
> 2 so the table version check must be a warning.
> 
> On ARM64 when the kernel parameter earlycon is used both the early console
> and console are enabled.  On x86, only the earlycon should be enabled by
> by default.  Modify acpi_parse_spcr() to allow options for initializing
> the early console and console separately.
> 
> Signed-off-by: Prarit Bhargava 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux...@vger.kernel.org
> Cc: linux-ser...@vger.kernel.org
> Cc: Bhupesh Sharma 
> Cc: Lv Zheng 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: "H. Peter Anvin" 
> Cc: x...@kernel.org
> Cc: Jonathan Corbet 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: "Rafael J. Wysocki" m...@rjwysocki.net>
> Cc: Timur Tabi 
> Cc: graeme.greg...@linaro.org
> Cc: mark.sal...@redhat.com
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  9 +---
>  arch/arm64/kernel/acpi.c|  4 ++--
>  arch/x86/kernel/acpi/boot.c |  3 +++
>  drivers/acpi/Kconfig|  7 +-
>  drivers/acpi/spcr.c | 29 
> +
>  drivers/tty/serial/earlycon.c   | 15 +
>  include/linux/acpi.h    |  7 --
>  include/linux/serial_core.h |  4 ++--
>  8 files changed, 44 insertions(+), 34 deletions(-)

I'm fine with the x86 aspect of this:

Acked-by: Ingo Molnar 

>   earlycon=   [KNL] Output early console device and options.
>  
> - When used with no options, the early console is
> - determined by the stdout-path property in device
> - tree's chosen node.
> + [ARM64] The early console is determined by the
> + stdout-path property in device tree's chosen node,
> + or determined by the ACPI SPCR table.

s/in device tree's chosen node
 /in the device tree's chosen node

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v12 01/22] selftests/x86: Move protecton key selftest to arch neutral directory

2018-02-21 Thread Ingo Molnar

* Ram Pai  wrote:

> cc: Dave Hansen 
> cc: Florian Weimer 
> Signed-off-by: Ram Pai 
> ---
>  tools/testing/selftests/vm/Makefile   |1 +
>  tools/testing/selftests/vm/pkey-helpers.h |  223 
>  tools/testing/selftests/vm/protection_keys.c  | 1407 
> +
>  tools/testing/selftests/x86/Makefile  |2 +-
>  tools/testing/selftests/x86/pkey-helpers.h|  223 
>  tools/testing/selftests/x86/protection_keys.c | 1407 
> -
>  6 files changed, 1632 insertions(+), 1631 deletions(-)
>  create mode 100644 tools/testing/selftests/vm/pkey-helpers.h
>  create mode 100644 tools/testing/selftests/vm/protection_keys.c
>  delete mode 100644 tools/testing/selftests/x86/pkey-helpers.h
>  delete mode 100644 tools/testing/selftests/x86/protection_keys.c

Acked-by: Ingo Molnar 

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] x86: Add topology_hw_smt_threads() and remove smp_num_siblings

2018-02-24 Thread Ingo Molnar

* Prarit Bhargava  wrote:

> Commit bbb65d2d365e ("x86: use cpuid vector 0xb when available for
> detecting cpu topology") changed the value of smp_num_siblings from the
> active number of threads in a core to the maximum number threads in a
> core.  e.g.) On Intel Haswell and newer systems smp_num_siblings is
> two even if SMT is disabled.
> 
> topology_max_smt_threads() already returns the active number of threads.
> Introduce topology_hw_smt_threads() which returns the maximum number of
> threads.  These are used to fix and replace references to smp_num_siblings.

It's unclear to the reader of this changelog what the patch does:

 - is it a cleanup?
 - does it introduce some new facility to be used in a later patch?
 - ... or does it fix a real bug?

> diff --git a/arch/x86/include/asm/perf_event_p4.h 
> b/arch/x86/include/asm/perf_event_p4.h
> index 94de1a05aeba..11afdadce9c2 100644
> --- a/arch/x86/include/asm/perf_event_p4.h
> +++ b/arch/x86/include/asm/perf_event_p4.h
> @@ -181,7 +181,7 @@ static inline u64 p4_clear_ht_bit(u64 config)
>  static inline int p4_ht_active(void)
>  {
>  #ifdef CONFIG_SMP
> - return smp_num_siblings > 1;
> + return topology_max_smt_threads() > 1;
>  #endif
>   return 0;
>  }
> @@ -189,7 +189,7 @@ static inline int p4_ht_active(void)
>  static inline int p4_ht_thread(int cpu)
>  {
>  #ifdef CONFIG_SMP
> - if (smp_num_siblings == 2)
> + if (topology_max_smt_threads() == 2)
>   return cpu != 
> cpumask_first(this_cpu_cpumask_var_ptr(cpu_sibling_map));

This appears to change functionality - so I guess it fixes a real bug?

> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index c1d2a9892352..b5ff1c784eef 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -116,16 +116,16 @@ extern unsigned int __max_logical_packages;
>  #define topology_max_packages()  (__max_logical_packages)
>  
>  extern int __max_smt_threads;
> -
> -static inline int topology_max_smt_threads(void)
> -{
> - return __max_smt_threads;
> -}
> +#define topology_max_smt_threads()   (__max_smt_threads)
> +extern int __hw_smt_threads;
> +#define topology_hw_smt_threads()(__hw_smt_threads)

I general it's better to use C inline functions than CPP defines.

> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -332,16 +332,14 @@ static void amd_get_topology(struct cpuinfo_x86 *c)
>   cpuid(0x801e, &eax, &ebx, &ecx, &edx);
>  
>   node_id  = ecx & 0xff;
> - smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
> + __hw_smt_threads = ((ebx >> 8) & 0xff) + 1;
>  
>   if (c->x86 == 0x15)
>   c->cu_id = ebx & 0xff;
>  
>   if (c->x86 >= 0x17) {
>   c->cpu_core_id = ebx & 0xff;
> -
> - if (smp_num_siblings > 1)
> - c->x86_max_cores /= smp_num_siblings;
> + c->x86_max_cores /= topology_hw_smt_threads();
>   }

> @@ -1228,6 +1228,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>   /* Init Machine Check Exception if available. */
>   mcheck_cpu_init(c);
>  
> + /* Must be called before select_idle_routine */
> + if (c != &boot_cpu_data)
> + set_cpu_sibling_map(raw_smp_processor_id());
> +
>   select_idle_routine(c);

This appears to be an unexplained change.

> --- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
> @@ -420,7 +420,8 @@ static u32 get_nbc_for_node(int node_id)
>   struct cpuinfo_x86 *c = &boot_cpu_data;
>   u32 cores_per_node;
>  
> - cores_per_node = (c->x86_max_cores * smp_num_siblings) / 
> amd_get_nodes_per_socket();
> + cores_per_node = (c->x86_max_cores * topology_hw_smt_threads()) /
> +  amd_get_nodes_per_socket();

Please don't break lines that are just slightly over col80.

So all of this looks pretty complex, but is poorly explained. Please split it 
up 
into a series of well-explained patches where each patch does one specific 
thing. 
The cleanup and renaming patches should come first, the bug fixing patch(es) 
should come after them.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Question] Documentation/features: More automation/scripting help?

2018-03-30 Thread Ingo Molnar

* Andrea Parri  wrote:

> Hi all,
> 
> The directory (not yet three years old although, I freely admit, I've
> only recently become aware of it) provides arch. support matrices for
> more than 40 generic kernel features that need per-arch. support:
> 
> This is a superb project! ;-)  and not a simple one given that, to be
> effective, this requires the prompt collaboration between (the intere-
> sted) features maintainers/developers, every architecture maintainers,
> and documentation maintainers.
> 
> There currently appear to be some mismatches between such doc and the
> the actual state of the code (e.g., missing architecture, feature not
> existing anymore, status flags out-of-date). Realized this, I started
> to patch the doc, but this process became soon tedious (consider also
> that I barely know what some of these features are about...).
> 
> Hence this post. I am wondering if it would make sense to script some
> of these matrices.  So, rather than (or together with) using the cur-
> rently hard-coded matrices, try to (automatically) generate them from
> the sources/configs. Consider the sketch below (sorry for the raw sh).
> 
> diff --git a/Documentation/features/list-arch.sh 
> b/Documentation/features/list-arch.sh
> index c16b5b5956889..cdec0c1db9db2 100755
> --- a/Documentation/features/list-arch.sh
> +++ b/Documentation/features/list-arch.sh
> @@ -18,7 +18,13 @@ for F in */*/arch-support.txt; do
>C=$(grep -h "^# Kconfig:" $F | cut -c25-)
>D=$(grep -h "^# description:" $F | cut -c25-)
>S=$(grep -hw $ARCH $F | cut -d\| -f3)
> +  myS=$(grep -h $C ../../arch/$ARCH/Kconfig)
> +  if [ -z "$myS" ]; then
> +   myS=" -- "
> +  else
> +   myS=" ok "
> +  fi
>  
> -  printf "%10s/%-22s:%s| %35s # %s\n" "$SUBSYS" "$N" "$S" "$C" "$D"
> +  printf "%10s/%-22s:%s VS. %s| %35s # %s\n" "$SUBSYS" "$N" "$S" "$myS" "$C" 
> "$D"
>  done
>  
> 
> With this diff.,
> 
> andrea@andrea:~$ ./Documentation/features/list-arch.sh riscv > /tmp/riscv.txt
> grep: asm/rwsem.h: No such file or directory
> andrea@andrea:~$ cat /tmp/riscv.txt
> #
> # Kernel feature support matrix of the 'riscv' architecture:
> #
>   core/ BPF-JIT  : VS.  -- |
> HAVE_BPF_JIT #  arch supports BPF JIT optimizations
>   core/ generic-idle-thread  : VS.  ok | 
> GENERIC_SMP_IDLE_THREAD #  arch makes use of the generic SMP idle thread 
> facility
>   core/ jump-labels  : VS.  -- |
> HAVE_ARCH_JUMP_LABEL #  arch supports live patched, high efficiency branches
>   core/ tracehook: VS.  ok | 
> HAVE_ARCH_TRACEHOOK #  arch supports tracehook (ptrace) register handling APIs
>  debug/ gcov-profile-all : VS.  -- |   
> ARCH_HAS_GCOV_PROFILE_ALL #  arch supports whole-kernel GCOV code coverage 
> profiling
>  debug/ KASAN: VS.  -- | 
> HAVE_ARCH_KASAN #  arch supports the KASAN runtime memory checker
>  debug/ kgdb : VS.  -- |  
> HAVE_ARCH_KGDB #  arch supports the kGDB kernel debugger
>  debug/ kprobes  : VS.  ok |
> HAVE_KPROBES #  arch supports live patched kernel probe
>  debug/ kprobes-on-ftrace: VS.  -- |  
> HAVE_KPROBES_ON_FTRACE #  arch supports combined kprobes and ftrace live 
> patching
>  debug/ kretprobes   : VS.  -- | 
> HAVE_KRETPROBES #  arch supports kernel function-return probes
>  debug/ optprobes: VS.  -- |  
> HAVE_OPTPROBES #  arch supports live patched optprobes
>  debug/ stackprotector   : VS.  -- |  
> HAVE_CC_STACKPROTECTOR #  arch supports compiler driven stack overflow 
> protection
>  debug/ uprobes  : VS.  -- |   
> ARCH_SUPPORTS_UPROBES #  arch supports live patched user probes
>  debug/ user-ret-profiler: VS.  -- |   
> HAVE_USER_RETURN_NOTIFIER #  arch supports user-space return from system call 
> profiler
> io/ dma-api-debug: VS.  ok |  
> HAVE_DMA_API_DEBUG #  arch supports DMA debug facilities
> io/ dma-contiguous   : VS.  ok | 
> HAVE_DMA_CONTIGUOUS #  arch supports the DMA CMA (continuous memory allocator)
> io/ sg-chain : VS.  -- |   
> ARCH_HAS_SG_CHAIN #  arch supports chained scatter-gather lists
>lib/ strncasecmp  : VS.  -- | 
> __HAVE_ARCH_STRNCASECMP #  arch provides an optimized strncasecmp() function
>locking/ cmpxchg-local: VS.  -- |  
> HAVE_CMPXCHG_LOCAL #  arch supports the this_cpu_cmpxchg() API
>locking/ lockdep  : VS.  -- | 
> LOCKDEP_SUPPORT #  arch supports the runtime locking correctness debug 
> facility
>locking/ queued-rwlocks   : VS.  -- | 
> ARCH_USE_QUEUED

Re: [RFC PATCH 0/3] Documentation/features: Provide and apply "features-refresh.sh"

2018-04-03 Thread Ingo Molnar

* Andrea Parri  wrote:

> In Ingo's words [1]:
> 
>   "[...]  what should be done instead is to write a script that refreshes
>all the arch-support.txt files in-place. [...]
> 
>It's OK for the script to have various quirks for weirdly implemented
>features and exceptions: i.e. basically whenever it gets a feature wrong,
>we can just tweak the script with quirks to make it all work out of box.
> 
>[...]  But in the end there should only be a single new script:
> 
>  Documentation/features/scripts/features-refresh.sh
> 
>... which operates on the arch-support.txt files and refreshes them in
>place, and which, after all the refreshes have been committed, should
>produce an empty 'git diff' result."
> 
>   "[...]  New features can then be added by basically just creating a
>header-only arch-support.txt file, such as:
> 
>  triton:~/tip/Documentation/features> cat foo/bar/arch-support.txt
>  #
>  # Feature name:  shiny new fubar kernel feature
>  # Kconfig:   ARCH_USE_FUBAR
>  # description:   arch supports the fubar feature
>  #
> 
>And running Documentation/features/scripts/features-refresh.sh would
>auto-generate the arch support matrix. [...]
>  
>This way we soft- decouple the refreshing of the entries from the
>introduction of the features, while still making it all easy to keep
>sync and to extend."
> 
> This RFC presents a first attempt to implement such a feature/script, and
> applies it script on top of Arnd's:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git 
> arch-removal
> 
> Patch 1/3 provides the "features-refresh.sh" script.  Patch 2/3 removes the
> "BPF-JIT" feature file and it creates header-only files for "cBPF-JIT" and
> "eBPF-JIT".  Patch 3/3 presents the results of running the script; this run
> also printed to standard output the following warnings:
> 
>   WARNING: '__HAVE_ARCH_STRNCASECMP' is not a valid Kconfig
>   WARNING: 'Optimized asm/rwsem.h' is not a valid Kconfig
>   WARNING: '!ARCH_USES_GETTIMEOFFSET' is not a valid Kconfig
>   WARNING: '__HAVE_ARCH_PTE_SPECIAL' is not a valid Kconfig
> 
> (I'm sending these patches with empty commit messagges, for early feedback:
>  I'll fill in these messages in subsequent versions if this makes sense...)
> 
> Cheers,
>   Andrea
> 
> Andrea Parri (3):
>   Documentation/features: Add script that refreshes the arch support status 
> files in place
>   Documentation/features/core: Add arch support status files for 'cBPF-JIT' 
> and 'eBPF-JIT'
>   Documentation/features: Refresh and auto-generate the arch support status 
> files in place

Ok, this series is really impressive at its RFC stage already!

Beyond fixing those warnings, I'd also suggest another change: please make the 
new BPF features patch the last one, so that the 'refresh' patch shows how much 
original bit-rot we gathered recently.

The 'new features' patch should then also include the result of also running 
the 
script, i.e. a single patch adding the base fields and the generated parts as 
well. That will be the usual development flow anyway - people won't do two-part 
patches just to show which bits are by hand and which are auto-generated.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] Documentation/features: Provide and apply "features-refresh.sh"

2018-04-05 Thread Ingo Molnar

* Andrea Parri  wrote:

> On Wed, Apr 04, 2018 at 06:56:32AM +0200, Ingo Molnar wrote:
> > 
> > * Andrea Parri  wrote:
> > 
> > > In Ingo's words [1]:
> > > 
> > >   "[...]  what should be done instead is to write a script that refreshes
> > >all the arch-support.txt files in-place. [...]
> > > 
> > >It's OK for the script to have various quirks for weirdly implemented
> > >features and exceptions: i.e. basically whenever it gets a feature 
> > > wrong,
> > >we can just tweak the script with quirks to make it all work out of 
> > > box.
> > > 
> > >[...]  But in the end there should only be a single new script:
> > > 
> > >  Documentation/features/scripts/features-refresh.sh
> > > 
> > >... which operates on the arch-support.txt files and refreshes them in
> > >place, and which, after all the refreshes have been committed, should
> > >produce an empty 'git diff' result."
> > > 
> > >   "[...]  New features can then be added by basically just creating a
> > >header-only arch-support.txt file, such as:
> > > 
> > >  triton:~/tip/Documentation/features> cat foo/bar/arch-support.txt
> > >  #
> > >  # Feature name:  shiny new fubar kernel feature
> > >  # Kconfig:   ARCH_USE_FUBAR
> > >  # description:   arch supports the fubar feature
> > >  #
> > > 
> > >And running Documentation/features/scripts/features-refresh.sh would
> > >auto-generate the arch support matrix. [...]
> > >  
> > >This way we soft- decouple the refreshing of the entries from the
> > >introduction of the features, while still making it all easy to keep
> > >sync and to extend."
> > > 
> > > This RFC presents a first attempt to implement such a feature/script, and
> > > applies it script on top of Arnd's:
> > > 
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git 
> > > arch-removal
> > > 
> > > Patch 1/3 provides the "features-refresh.sh" script.  Patch 2/3 removes 
> > > the
> > > "BPF-JIT" feature file and it creates header-only files for "cBPF-JIT" and
> > > "eBPF-JIT".  Patch 3/3 presents the results of running the script; this 
> > > run
> > > also printed to standard output the following warnings:
> > > 
> > >   WARNING: '__HAVE_ARCH_STRNCASECMP' is not a valid Kconfig
> > >   WARNING: 'Optimized asm/rwsem.h' is not a valid Kconfig
> > >   WARNING: '!ARCH_USES_GETTIMEOFFSET' is not a valid Kconfig
> > >   WARNING: '__HAVE_ARCH_PTE_SPECIAL' is not a valid Kconfig
> > > 
> > > (I'm sending these patches with empty commit messagges, for early 
> > > feedback:
> > >  I'll fill in these messages in subsequent versions if this makes 
> > > sense...)
> > > 
> > > Cheers,
> > >   Andrea
> > > 
> > > Andrea Parri (3):
> > >   Documentation/features: Add script that refreshes the arch support 
> > > status files in place
> > >   Documentation/features/core: Add arch support status files for 
> > > 'cBPF-JIT' and 'eBPF-JIT'
> > >   Documentation/features: Refresh and auto-generate the arch support 
> > > status files in place
> > 
> > Ok, this series is really impressive at its RFC stage already!
> > 
> > Beyond fixing those warnings, I'd also suggest another change: please make 
> > the 
> > new BPF features patch the last one, so that the 'refresh' patch shows how 
> > much 
> > original bit-rot we gathered recently.
> > 
> > The 'new features' patch should then also include the result of also 
> > running the 
> > script, i.e. a single patch adding the base fields and the generated parts 
> > as 
> > well. That will be the usual development flow anyway - people won't do 
> > two-part 
> > patches just to show which bits are by hand and which are auto-generated.
> 
> Yes, I'll do.
> 
> Let me ask some hints about the warnings, as I'm not sure how to 'fix' them;
> we have:
> 
>   a) __HAVE_ARCH_STRNCASECMP
>  __HAVE_ARCH_PTE_SPECIAL
> 
>   b) Optimized asm/rwsem.h
> 
>   c) !ARCH_USES_GETTIMEOFFSET  
> 
> For (c), I see two options:
&g

Re: [RFC PATCH v3 0/6] Documentation/features: Provide and apply 'features-refresh.sh'

2018-05-14 Thread Ingo Molnar

* Jonathan Corbet  wrote:

> On Mon,  7 May 2018 12:43:37 +0200
> Andrea Parri  wrote:
> 
> > This series provides the script 'features-refresh.sh', which operates on
> > the arch support status files, and it applies this script to refresh the
> > status files in place; previous discussions about this series are at [1].
> 
> Looks good, I've applied the set, thanks.

A belated:

  Reviewed-by: Ingo Molnar 

Thanks guys!

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 4/4] usb: doc: add document for USB3 debug port usage

2017-01-19 Thread Ingo Molnar

* Lu Baolu  wrote:

> Add Documentation/usb/usb3-debug-port.rst. This document includes
> the user guide for USB3 debug port.
> 
> Cc: linux-doc@vger.kernel.org
> Signed-off-by: Lu Baolu 
> ---
>  Documentation/usb/usb3-debug-port.rst | 95 
> +++
>  1 file changed, 95 insertions(+)
>  create mode 100644 Documentation/usb/usb3-debug-port.rst
> 
> diff --git a/Documentation/usb/usb3-debug-port.rst 
> b/Documentation/usb/usb3-debug-port.rst
> new file mode 100644
> index 000..70eabe4
> --- /dev/null
> +++ b/Documentation/usb/usb3-debug-port.rst
> @@ -0,0 +1,95 @@
> +===
> +USB3 debug port
> +===
> +
> +:Author: Lu Baolu 
> +:Date: October 2016
> +
> +GENERAL
> +===
> +
> +This is a HOWTO for using USB3 debug port on x86 systems.
> +
> +Before using any kernel debugging functionalities based on USB3

s/debugging functionalities
 /debugging functionality

> +debug port, you need to check 1) whether debug port is supported
> +by the xHCI host, 2) which port is used for debugging purpose

s/debugging purpose
 /debugging purposes

> +On debug target system, you need to customize a debugging kernel

s/On debug target system
  On the debug target system

(There are more typos/grammar errors in the document, please re-read it.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH Documentation/memory-barriers.txt] Clarify limited control-dependency scope

2016-06-17 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> Nothing in the control-dependencies section of memory-barriers.txt
> says that control dependencies don't extend beyond the end of the
> if-statement containing the control dependency.  Worse yet, in many
> situations, they do extend beyond that if-statement.  In particular,
> the compiler cannot destroy the control dependency given proper use of
> READ_ONCE() and WRITE_ONCE().  However, a weakly ordered system having
> a conditional-move instruction provides the control-dependency guarantee
> only to code within the scope of the if-statement itself.
> 
> This commit therefore adds words and an example demonstrating this
> limitation of control dependencies.
> 
> Reported-by: Will Deacon 
> Signed-off-by: Paul E. McKenney 
> Acked-by: Peter Zijlstra (Intel) 
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index 147ae8ec836f..a4d0a99de04d 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -806,6 +806,41 @@ out-guess your code.  More generally, although 
> READ_ONCE() does force
>  the compiler to actually emit code for a given load, it does not force
>  the compiler to use the results.
>  
> +In addition, control dependencies apply only to the then-clause and
> +else-clause of the if-statement in question.  In particular, it does
> +not necessarily apply to code following the if-statement:
> +
> + q = READ_ONCE(a);
> + if (q) {
> + WRITE_ONCE(b, p);
> + } else {
> + WRITE_ONCE(b, r);
> + }
> + WRITE_ONCE(c, 1);  /* BUG: No ordering against the read from "a". */
> +
> +It is tempting to argue that there in fact is ordering because the
> +compiler cannot reorder volatile accesses and also cannot reorder
> +the writes to "b" with the condition.  Unfortunately for this line
> +of reasoning, the compiler might compile the two writes to "b" as
> +conditional-move instructions, as in this fanciful pseudo-assembly
> +language:

While CMOV would be the typical situation, even without CMOV the compiler could 
also internally transform it to:

> + if (q)
> + WRITE_ONCE(b, p);
> + if (!q)
> + WRITE_ONCE(b, r);

... and CPU speculation flow could get past the two branches without seeing any 
ordering constraint with the writes to 'b'.

I.e. conditions are not 'atomic', they can be 'torn' by the compiler just as 
much 
as reads or writes can be torn.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v1 3/4] x86, boot: Implement ASLR for kernel memory sections (x86_64)

2016-06-17 Thread Ingo Molnar

* Thomas Garnier  wrote:

>  arch/x86/include/asm/kaslr.h|  12 +++

Hm, what tree is this patch against? asm/kaslr.h does not exist upstream or in 
the 
x86 tree.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v1 3/4] x86, boot: Implement ASLR for kernel memory sections (x86_64)

2016-06-17 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Thomas Garnier  wrote:
> 
> >  arch/x86/include/asm/kaslr.h|  12 +++
> 
> Hm, what tree is this patch against? asm/kaslr.h does not exist upstream or 
> in the 
> x86 tree.

Ah, never mind, introduced by the first patch.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC v1 3/4] x86, boot: Implement ASLR for kernel memory sections (x86_64)

2016-06-17 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Thomas Garnier  wrote:
> 
> >  arch/x86/include/asm/kaslr.h|  12 +++
> 
> Hm, what tree is this patch against? asm/kaslr.h does not exist upstream or 
> in the 
> x86 tree.

So the problem is that this file gets introduced by:

  [PATCH v5 1/4] x86, boot: Refactor KASLR entropy functions

... but that patch did not get carried over into this series:

  [PATCH v6 0/3] x86/mm: memory area address KASLR

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [linux-next] Doc: x86: Fix typo in x86

2016-07-01 Thread Ingo Molnar

* Masanari Iida  wrote:

> This patch fix some spelling typo found in
> Documentation/x86.
> 
> Signed-off-by: Masanari Iida 
> ---
>  Documentation/x86/intel_mpx.txt   | 6 +++---
>  Documentation/x86/tlb.txt | 4 ++--
>  Documentation/x86/x86_64/machinecheck | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt
> index 1a5a12184a35..4c40a85ae2b1 100644
> --- a/Documentation/x86/intel_mpx.txt
> +++ b/Documentation/x86/intel_mpx.txt
> @@ -45,7 +45,7 @@ is how we expect the compiler, application and kernel to 
> work together.
> MPX-instrumented.
>  3) The kernel detects that the CPU has MPX, allows the new prctl() to
> succeed, and notes the location of the bounds directory. Userspace is
> -   expected to keep the bounds directory at that locationWe note it
> +   expected to keep the bounds directory at that location We note it

So this documentation fix kept the much more obvious typo in place: that 
sentence 
is missing a period ...

I fixed that in the patch, but next time please read the text you are patching 
...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/3] Doc/memory-barriers: Add Korean translation

2016-07-08 Thread Ingo Molnar

* Byungchul Park  wrote:

> On Fri, Jul 08, 2016 at 07:50:39AM +0900, SeongJae Park wrote:
> > > I will add my opinion in korean.
> > 
> > Thank you for kind and faithful review.  I agree with most of your opinions 
> > and
> > suggestions.  Most of your suggestions looks much better than mine.
> > 
> > However, I also have some different opinion.  I want to emphasize the fact 
> > that
> > (1) CPUs 'issue' memory operations to memory system as they want, (2) memory
> > system 'executes' those operations as they want, and (3) CPUs 'perceive' the
> > 'effects' of the operation executions as they want.  I want to emphasize the
> > fact in document because I think most confusion about memory ordering comes
> > from vague understanding about the relation.  To my perspective, few of your
> 
> Right. I agree with you.

Do these vague formulations exist in the English text as well? If yes then 
please 
try to improve the English text first, then do the Korean translation.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 2/2] Doc/memory-barriers: Add Korean translation

2016-07-21 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> On Thu, Jul 21, 2016 at 10:10:58AM +0900, SeongJae Park wrote:
> > This commit adds Korean version of memory-barriers.txt document.  The
> > header is refered to HOWTO Korean version.
> > 
> > The translation has started from Feb, 2016 and using a public git
> > repository[1] to maintain the work.  It's commit history says that it is
> > following upstream changes as well.
> > 
> > [1] https://github.com/sjp38/linux.doc_trans_membarrier
> > 
> > Acked-by: David Howells 
> > Signed-off-by: Paul E. McKenney 
> > Acked-by: Minchan Kim 
> > Signed-off-by: SeongJae Park 
> 
> If Minchan is OK with this version, if Ingo and Jon have no objections,
> and given the small change below, I will take it.

Acked-by: Ingo Molnar 

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kernel-hardening] Re: [PATCH 1/2] security, perf: allow further restriction of perf_event_open

2016-08-03 Thread Ingo Molnar

* Kees Cook  wrote:

> > I see 0 up-sides of this approach and, as per the above, a whole bunch of 
> > very 
> > serious downsides.
> >
> > A global (esp. default inhibited) knob is too coarse and limiting.
> 
> I haven't suggested it be default inhibit in the upstream Kconfig. And
> having this knob already with the 0, 1, and 2 settings seems
> incomplete to me without this highest level of restriction that 3
> would provide. That seems rather arbitrary to me. :)

The default has no impact on the "it's too coarse and limiting" negative 
property 
of this patch, which is the show-stopper aspect. Please fix that aspect instead 
of 
trying to argue around it.

This isn't some narrow debugging mechanism we can turn on/off globally and 
forget 
about, this is a wide scope performance measurement and event logging 
infrastructure that is being utilized not just by developers but by apps and 
runtimes as well.

> Let me take this another way instead. What would be a better way to provide a 
> mechanism for system owners to disable perf without an LSM? (Since far fewer 
> folks run with an enforcing "big" LSM: I'm seeking as wide a coverage as 
> possible.)

Because in practice what will happen is that if the only option is to do 
something 
drastic for sekjurity, IT departments will do it - while if there's a more 
flexible mechanism that does not throw out the baby with the bath water that is 
going to be used.

This is as if 20 years ago you had submitted a patch to the early Linux TCP/IP 
networking code to be on/off via a global sysctl switch and told people that 
"in developer mode you can have networking, talk to your admin".

We'd have told you: "this switch is too coarse and limiting, please implement 
something better, like a list of routes which defines which IP ranges are 
accessible, and a privileged range of listen sockets ports and some flexible 
kernel side filtering mechanism to inhibit outgoing/incoming connections".

Global sysctls are way too coarse.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] x86/mm/doc: Clean up the memory region layout descriptions

2018-10-02 Thread Ingo Molnar


* Baoquan He  wrote:

> In Documentation/x86/x86_64/mm.txt, the style of descritions about
> memory region layout is a little confusing:
> 
>  - mix size in TB with 'bits'
>  - sometimes mention a size in the description and sometimes not
>  - sometimes list holes by address, sometimes only as an 'unused hole' line
> 
> So fix them to make them in consistent style.
> 
> Signed-off-by: Baoquan He 
> ---
>  Documentation/x86/x86_64/mm.txt | 76 
> -
>  1 file changed, 38 insertions(+), 38 deletions(-)
> 
> diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
> index 5432a96d31ff..fc1da95e629d 100644
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -1,52 +1,52 @@
>  
>  Virtual memory map with 4 level page tables:
>  
> - - 7fff (=47 bits) user space, different per mm
> -hole caused by [47:63] sign extension
> -8000 - 87ff (=43 bits) guard hole, reserved for 
> hypervisor
> -8800 - c7ff (=64 TB) direct mapping of all phys. 
> memory
> -c800 - c8ff (=40 bits) hole
> -c900 - e8ff (=45 bits) vmalloc/ioremap space
> -e900 - e9ff (=40 bits) hole
> -ea00 - eaff (=40 bits) virtual memory map (1TB)
> -... unused hole ...
> -ec00 - fbff (=44 bits) kasan shadow memory (16TB)
> -... unused hole ...
> + - 7fff (=47 bits, 128 TB) user space, different 
> per mm
> + hole caused by [47:63] sign extension
> +8000 - 87ff (=43 bits, 8 TB) guard hole, reserved 
> for hypervisor
> +8800 - c7ff (=46 bits, 64 TB) direct mapping of all 
> phys. memory (page_offset_base)
> +c800 - c8ff (=40 bits, 1 TB) unused hole
> +c900 - e8ff (=45 bits, 32 TB) vmalloc/ioremap space 
> (vmalloc_base)
> +e900 - e9ff (=40 bits, 1 TB) unused hole
> +ea00 - eaff (=40 bits, 1 TB) virtual memory map 
> (vmemmap_base)
> +eb00 - ebff (=40 bits, 1 TB) unused hole
> +ec00 - fbff (=44 bits, 16 TB) kasan shadow memory
> +fc00 - fdff (=41 bits, 2 TB) unused hole
>   vaddr_end for KASLR
> -fe00 - fe7f (=39 bits) cpu_entry_area mapping
> -fe80 - feff (=39 bits) LDT remap for PTI
> -ff00 - ff7f (=39 bits) %esp fixup stacks
> -... unused hole ...
> -ffef - fffe (=64 GB) EFI region mapping space
> -... unused hole ...
> -8000 - 9fff (=512 MB)  kernel text mapping, from 
> phys 0
> -a000 - feff (1520 MB) module mapping space
> +fe00 - fe7f (=39 bits, 512 GB) cpu_entry_area mapping
> +fe80 - feff (=39 bits, 512 GB) LDT remap for PTI
> +ff00 - ff7f (=39 bits, 512 GB) %esp fixup stacks
> +ff80 - fffeefff (~39 bits, ~507 GB) unused hole
> +ffef - fffe (=36 bits, 64 GB) EFI region mapping 
> space
> + - 7fff (=31 bits, 2 GB) unused hole
> +8000 - 9fff (=29 bits, 512 MB)  kernel text mapping, 
> from phys 0
> +a000 - feff (~31 bits, 1520 MB) module mapping space
>  [fixmap start]   - ff5f kernel-internal fixmap range
>  ff60 - ff600fff (=4 kB) legacy vsyscall ABI
>  ffe0 -  (=2 MB) unused hole

Looks mostly good now, but could you please make the right side align 
vertically as well, like 
I did in the examples I provided previously?

There's zero reason for it to look this disorganized:

> +ff80 - fffeefff (~39 bits, ~507 GB) unused hole
> +ffef - fffe (=36 bits, 64 GB) EFI region mapping 
> space
> + - 7fff (=31 bits, 2 GB) unused hole
> +8000 - 9fff (=29 bits, 512 MB)  kernel text mapping, 
> from phys 0

I.e. can we do something like:

> +ff80 - fffeefff (~39 bits, ~507 GB) unused hole
> +ffef - fffe (=36 bits,   64 GB) EFI region mapping 
> space
> + - 7fff (=31 bits,2 GB) unused hole
> +8000 - 9fff (=29 bits,  512 MB) kernel text mapping, 
> from phys 0

?

That not only makes it more readable, but makes typos like the whitespace noise 
double space in 
the last line more obvious.

Thanks,

Ingo


Re: [PATCH v2 1/3] x86/KASLR: Update document about KERNEL_IMAGE_SIZE

2018-10-03 Thread Ingo Molnar


* Baoquan He  wrote:

> Currently CONFIG_RANDOMIZE_BASE=y is default set, update the relevant
> document about KERNEL_IMAGE_SIZE.

Suggested wording:

  x86/KASLR: Update KERNEL_IMAGE_SIZE description
  
  Currently CONFIG_RANDOMIZE_BASE=y is set by default, which makes some of the
  old comments above the KERNEL_IMAGE_SIZE definition out of date. Update them
  to the current state of affairs.

> Signed-off-by: Baoquan He 
> ---
>  arch/x86/include/asm/page_64_types.h | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/page_64_types.h 
> b/arch/x86/include/asm/page_64_types.h
> index 6afac386a434..2288ceabdb9c 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -61,9 +61,10 @@
>  /*
>   * Kernel image size is limited to 1GiB due to the fixmap living in the
>   * next 1GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S). Use
> - * 512MiB by default, leaving 1.5GiB for modules once the page tables
> - * are fully set up. If kernel ASLR is configured, it can extend the
> - * kernel page table mapping, reducing the size of the modules area.
> + * 1 GiB by default, leaving 1 GiB for modules once the page tables are
> + * fully set up. If kernel ASLR is not configured, it can shrink the
> + * kernel page table mapping to decrease the size of kernel area to 512
> + * MiB, increase the size of the modules area to 1.5 GiB.
>   */

I've prettified that comment some more:

 /*
  * Maximum kernel image size is limited to 1 GiB, due to the fixmap living
  * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S).
  *
  * On KASLR use 1 GiB by default, leaving 1 GiB for modules once the
  * page tables are fully set up.
  *
  * If KASLR is disabled we can shrink it to 0.5 GiB and increase the size
  * of the modules area to 1.5 GiB.
  */

>  #if defined(CONFIG_RANDOMIZE_BASE)
>  #define KERNEL_IMAGE_SIZE(1024 * 1024 * 1024)

BTW., while at it, shouldn't we make that:

   #ifdef CONFIG_RANDOMIZE_BASE

?

Thanks,

Ingo


Re: [PATCH 0/3] x86/mm/doc: Clean up mm.txt

2018-10-06 Thread Ingo Molnar


* Baoquan He  wrote:

> This clean up is suggested by Ingo.
> 
> It firstly fix the confusions in mm layout tables by unifying
> each memory region description in the consistent style.
> 
> Secondly take the KASLR words out of the mm layout tables to make
> it as a separate section to only list mm layout in non-KASLR case.
> Then add KASLR document at the end of mm.txt.
> 
> Meanwhile update description about KERNEL_IMAGE_SIZE in
> arch/x86/include/asm/page_64_types.h .
> 
> v2->v3:
> Ingo helped to prettify the patch log and code comment, repost them
> after updating accordign to Ingo's suggestions.
> 
> v1->v2:
> 
> Resend v2 since some typo and incorrect descriptions found in v1 post.
> 
> Baoquan He (3):
>   x86/KASLR: Update KERNEL_IMAGE_SIZE description
>   x86/mm/doc: Clean up the memory region layout descriptions
>   x86/doc/kaslr.txt: Create a separate part of document abourt KASLR at
> the end of file

Thanks, patches #1 and #2 are looking good and I'll apply them with some minor 
fixes, and I'll 
comment about patch #3 separately.

I also wrote a larger patch enhancing the descriptions some more, I'll send 
that separately.

Thanks,

Ingo


[PATCH 4/3] x86/mm/doc: Enhance the x86-64 virtual memory layout descriptions

2018-10-06 Thread Ingo Molnar
After the cleanups from Baoquan He, make it even more readable:

 - Remove the 'bits' area size column: it's pretty pointless and was even
   wrong for some of the entries. Given that MB, GB, TB, PT are 10, 20,
   30 and 40 bits, a "8 TB" size description makes it obvious that it's
   43 bits.

 - Introduce an "offset" column:



start addr   | offset | end addr |  size   | VM area 
description

-||--|-|
...
8800 | -120TB | c7ff |   64 TB | direct mapping 
of all physical memory (page_offset_base),
 this is what 
limits max physical memory supported.

   The -120 TB notation makes it obvious where this particular virtual memory
   region starts: 120 TB down from the top of the 64-bit virtual memory space.
   Especially the layout of the kernel mappings is a *lot* more obvious when
   written this way, plus it's much easier to compare it with the size column
   and understand/check/validate and modify the kernel's layout in the future.

 - Mark the part from where the 47-bit and 56-bit kernel layouts are 100% 
identical,
   this starts at the -512 GB offset and the EFI region.

 - Re-shuffle the size desciptions to be continous blocks of sizes, instead of 
the
   often mixed size. I.e. write "0.5 TB" instead of "512 GB" if we are still in
   the TB-granular region of the map.

 - Make the 47-bit and 56-bit descriptions use the *exact* same layout and 
wording,
   and only differ where there's a material difference. This makes it easy to 
compare
   the two tables side by side by switching between two terminal tabs.

 - Plus enhance a lot of other stylistic/typographical details: make the tables
   explicitly tabular, add headers, enhance certain entries, etc. etc.

Note that there are some apparent errors in the tables as well, but I'll fix
them in a separate patch to make it easier to review/validate.

Cc: Andy Lutomirski 
Cc: Baoquan He 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: cor...@lwn.net
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: thgar...@google.com
Signed-off-by: Ingo Molnar 
---
 Documentation/x86/x86_64/mm.txt |  172 
 kernel/sched/core.c |6 +
 2 files changed, 128 insertions(+), 50 deletions(-)

Index: tip/Documentation/x86/x86_64/mm.txt
===
--- tip.orig/Documentation/x86/x86_64/mm.txt
+++ tip/Documentation/x86/x86_64/mm.txt
@@ -1,55 +1,127 @@
 
-Virtual memory map with 4 level page tables:
+
+| Complete virtual memory map with 4-level page tables |
+
 
- - 7fff (=47 bits,   128 TB) user space, different 
per mm
-   hole caused by [47:63] sign extension
-8000 - 87ff (=43 bits, 8 TB) guard hole, reserved 
for hypervisor
-8800 - c7ff (=46 bits,64 TB) direct mapping of all 
phys. memory (page_offset_base)
-c800 - c8ff (=40 bits, 1 TB) unused hole
-c900 - e8ff (=45 bits,32 TB) vmalloc/ioremap space 
(vmalloc_base)
-e900 - e9ff (=40 bits, 1 TB) unused hole
-ea00 - eaff (=40 bits, 1 TB) virtual memory map 
(vmemmap_base)
-eb00 - ebff (=40 bits, 1 TB) unused hole
-ec00 - fbff (=44 bits,16 TB) kasan shadow memory
-fc00 - fdff (=41 bits, 2 TB) unused hole
-   vaddr_end for KASLR
-fe00 - fe7f (=39 bits,   512 GB) cpu_entry_area mapping
-fe80 - feff (=39 bits,   512 GB) LDT remap for PTI
-ff00 - ff7f (=39 bits,   512 GB) %esp fixup stacks
-ff80 - fffeefff (~39 bits,  ~507 GB) unused hole
-ffef - fffe (=36 bits,64 GB) EFI region mapping 
space
- - 7fff (=31 bits, 2 GB) unused hole
-8000 - 9fff (=29 bits,   512 MB) kernel text mapping, 
from phys 0
-a000 - feff (~31 bits,  1520 MB) module mapping space
-[fixmap start]   - ff5f kernel-internal fixmap range
-ff60 - ff600fff ( =4 kB) legacy vsyscall ABI
-ffe0 -  ( =2 MB) unused hole
-
-Virtual memory map w

Re: [PATCH 4/3] x86/mm/doc: Enhance the x86-64 virtual memory layout descriptions

2018-10-06 Thread Ingo Molnar


* Ingo Molnar  wrote:

> +
> +| Complete virtual memory map with 4-level page tables |
> +

> +
> +start addr   | offset | end addr |  size   | VM area 
> description
> +-||--|-|

> +
> +# Identical layout to the 56-bit one from here on:
> +
> +ff80 | -512GB | fffeefff | ~507 GB | ... unused hole
> +ffef |  -68GB | fffe |   64 GB | EFI region 
> mapping space

> +
> +| Complete virtual memory map with 5-level page tables |
> +

> +ff80 |   -0.5  TB | ffee |  444 GB | ... unused hole
> +
> +# Identical layout to the 47-bit one from here on:
> +
> +ffef |  -68GB | fffe |   64 GB | EFI region 
> mapping space

So patch #2 appears to have introduced an error/typo in the 47-bit table. Note 
the weird size 
and discontinuity of the 'unused hole' in the 47-bit table, and compare it with 
56-bit table:

  fffeefff
  ffee

(Note how the incorrect end address was cargo-cult-copied into the 'size' field 
of ~507 GB...)

The correct number is the 56-bit one, and both tables should show the following 
identical 
layout:

  ff80 | -512GB | fffeefff |  444 GB | ... unused hole
  ffef |  -68GB | fffe |   64 GB | EFI region 
mapping space

Agreed?

Thanks,

Ingo


[PATCH 4/3 v2] x86/mm/doc: Enhance the x86-64 virtual memory layout descriptions

2018-10-06 Thread Ingo Molnar


Find a new iteration below, fixed the bug, prettified the table some more.

Thanks,

Ingo

=>
After the cleanups from Baoquan He, make it even more readable:

 - Remove the 'bits' area size column: it's pretty pointless and was even
   wrong for some of the entries. Given that MB, GB, TB, PT are 10, 20,
   30 and 40 bits, a "8 TB" size description makes it obvious that it's
   43 bits.

 - Introduce an "offset" column:



start addr   | offset | end addr |  size   | VM area 
description

-||--|-|
...
8800 | -120TB | c7ff |   64 TB | direct mapping 
of all physical memory (page_offset_base),
 this is what 
limits max physical memory supported.

   The -120 TB notation makes it obvious where this particular virtual memory
   region starts: 120 TB down from the top of the 64-bit virtual memory space.
   Especially the layout of the kernel mappings is a *lot* more obvious when
   written this way, plus it's much easier to compare it with the size column
   and understand/check/validate and modify the kernel's layout in the future.

 - Mark the part from where the 47-bit and 56-bit kernel layouts are 100% 
identical,
   this starts at the -512 GB offset and the EFI region.

 - Re-shuffle the size desciptions to be continous blocks of sizes, instead of 
the
   often mixed size. I.e. write "0.5 TB" instead of "512 GB" if we are still in
   the TB-granular region of the map.

 - Make the 47-bit and 56-bit descriptions use the *exact* same layout and 
wording,
   and only differ where there's a material difference. This makes it easy to 
compare
   the two tables side by side by switching between two terminal tabs.

 - Plus enhance a lot of other stylistic/typographical details: make the tables
   explicitly tabular, add headers, enhance certain entries, etc. etc.

Note that there are some apparent errors in the tables as well, but I'll fix
them in a separate patch to make it easier to review/validate.

Cc: Andy Lutomirski 
Cc: Baoquan He 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: cor...@lwn.net
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: thgar...@google.com
Signed-off-by: Ingo Molnar 
---
 Documentation/x86/x86_64/mm.txt | 171 
 1 file changed, 120 insertions(+), 51 deletions(-)

diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
index b4bc95c9790e..702898633b00 100644
--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -1,55 +1,124 @@
+
+Complete virtual memory map with 4-level page tables
+
 
-Virtual memory map with 4 level page tables:
-
- - 7fff (=47 bits,   128 TB) user space, different 
per mm
-   hole caused by [47:63] sign extension
-8000 - 87ff (=43 bits, 8 TB) guard hole, reserved 
for hypervisor
-8800 - c7ff (=46 bits,64 TB) direct mapping of all 
phys. memory (page_offset_base)
-c800 - c8ff (=40 bits, 1 TB) unused hole
-c900 - e8ff (=45 bits,32 TB) vmalloc/ioremap space 
(vmalloc_base)
-e900 - e9ff (=40 bits, 1 TB) unused hole
-ea00 - eaff (=40 bits, 1 TB) virtual memory map 
(vmemmap_base)
-eb00 - ebff (=40 bits, 1 TB) unused hole
-ec00 - fbff (=44 bits,16 TB) kasan shadow memory
-fc00 - fdff (=41 bits, 2 TB) unused hole
-   vaddr_end for KASLR
-fe00 - fe7f (=39 bits,   512 GB) cpu_entry_area mapping
-fe80 - feff (=39 bits,   512 GB) LDT remap for PTI
-ff00 - ff7f (=39 bits,   512 GB) %esp fixup stacks
-ff80 - fffeefff (~39 bits,  ~507 GB) unused hole
-ffef - fffe (=36 bits,64 GB) EFI region mapping 
space
- - 7fff (=31 bits, 2 GB) unused hole
-8000 - 9fff (=29 bits,   512 MB) kernel text mapping, 
from phys 0
-a000 - feff (~31 bits,  1520 MB) module mapping space
-[fixmap start]   - ff5f kernel-internal fixmap range
-ff60 - ff600fff ( =4 kB) legacy vsyscall ABI
-ffe0 - f

Re: [PATCH 4/3 v2] x86/mm/doc: Enhance the x86-64 virtual memory layout descriptions

2018-10-06 Thread Ingo Molnar


There's one PTI related layout asymmetry I noticed between 4-level and 5-level 
kernels:

  47-bit:
> +|
> +| Kernel-space 
> virtual memory, shared between all processes:
> +|___
> +  ||  | |
> + 8000 | -128TB | 87ff |8 TB | ... guard 
> hole, also reserved for hypervisor
> + 8800 | -120TB | c7ff |   64 TB | direct mapping 
> of all physical memory (page_offset_base)
> + c800 |  -56TB | c8ff |1 TB | ... unused hole
> + c900 |  -55TB | e8ff |   32 TB | 
> vmalloc/ioremap space (vmalloc_base)
> + e900 |  -23TB | e9ff |1 TB | ... unused hole
> + ea00 |  -22TB | eaff |1 TB | virtual memory 
> map (vmemmap_base)
> + eb00 |  -21TB | ebff |1 TB | ... unused hole
> + ec00 |  -20TB | fbff |   16 TB | KASAN shadow 
> memory
> + fc00 |   -4TB | fdff |2 TB | ... unused hole
> +  ||  | | vaddr_end for 
> KASLR
> + fe00 |   -2TB | fe7f |  0.5 TB | cpu_entry_area 
> mapping
> + fe80 |   -1.5  TB | feff |  0.5 TB | LDT remap for 
> PTI
> + ff00 |   -1TB | ff7f |  0.5 TB | %esp fixup 
> stacks
> +__||__|_|
> +|

  56-bit:
> +|
> +| Kernel-space 
> virtual memory, shared between all processes:
> +|___
> +  ||  | |
> + ff00 |  -64PB | ff0f |4 PB | ... guard 
> hole, also reserved for hypervisor
> + ff10 |  -60PB | ff8f |   32 PB | direct mapping 
> of all physical memory (page_offset_base)
> + ff90 |  -28PB | ff9f |4 PB | LDT remap for 
> PTI
> + ffa0 |  -24PB | ffd1 | 12.5 PB | 
> vmalloc/ioremap space (vmalloc_base)
> + ffd2 |  -11.5  PB | ffd3 |  0.5 PB | ... unused hole
> + ffd4 |  -11PB | ffd5 |  0.5 PB | virtual memory 
> map (vmemmap_base)
> + ffd6 |  -10.5  PB | ffde | 2.25 PB | ... unused hole
> + ffdf |   -8.25 PB | fdff |   ~8 PB | KASAN shadow 
> memory
> + fc00 |   -4TB | fdff |2 TB | ... unused hole
> +  ||  | | vaddr_end for 
> KASLR
> + fe00 |   -2TB | fe7f |  0.5 TB | cpu_entry_area 
> mapping
> + fe80 |   -1.5  TB | feff |  0.5 TB | ... unused hole
> + ff00 |   -1TB | ff7f |  0.5 TB | %esp fixup 
> stacks

The two layouts are very similar beyond the shift in the offset and the region 
sizes, except 
one big asymmetry: is the placement of the LDT remap for PTI.

Is there any fundamental reason why the LDT area is mapped into a 4 petabyte 
(!) area on 56-bit 
kernels, instead of being at the -1.5 TB offset like on 47-bit kernels?

The only reason I can see is that this way is that it's currently coded at the 
PGD level only:

static void map_ldt_struct_to_user(struct mm_struct *mm)
{
pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR); 

if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
set_pgd(kernel_to_user_pgdp(pgd), *pgd);
}

( BTW., the 4 petabyte size of the area is misleading: a 5-level PGD entry 
covers 256 TB of 
  virtual memory, i.e 0.25 PB, not 4 PB. So in reality we have a 0.25 PB area 
there, used up
  by the LDT mapping in a single PGD entry, plus a 3.75 PB hole after that. )

... but unless I'm missing something it's not really fundamental for it to be 
at the PGD level 
- it could be two levels lower as well, and it could move back to the same 
place where it's on 
the 47-bit kernel.

The LDT mapping operation is pretty heavy already, and the actual use of the 
LDT is not 
impacted by where it's mapped, as the LDT is per mm so no remapping is required 
on context 
switch.

I.e. could we move the LDT over to the same place? This would make an even 
larger area of the 
address space identical between 47-bit and 56-bit kernels:


Re: [PATCH v5 0/3] x86: make rsdp address accessible via boot params

2018-10-09 Thread Ingo Molnar


* Juergen Gross  wrote:

> In the non-EFI boot path the ACPI RSDP table is currently found via
> either EBDA or by searching through low memory for the RSDP magic.
> This requires the RSDP to be located in the first 1MB of physical
> memory. Xen PVH guests, however, get the RSDP address via the start of
> day information block.
> 
> In order to support an arbitrary RSDP address this patch series adds
> the physical address of the RSDP to the boot params structure filled
> by the boot loader.
> 
> Juergen Gross (3):
>   x86/xen: fix boot loader version reported for pvh guests
>   x86/boot: add acpi rsdp address to setup_header
>   x86/acpi: take rsdp address for boot params if available
> 
>  Documentation/x86/boot.txt| 32 +++-
>  arch/x86/boot/header.S|  6 +-
>  arch/x86/include/asm/acpi.h   |  7 +++
>  arch/x86/include/asm/x86_init.h   |  2 ++
>  arch/x86/include/uapi/asm/bootparam.h |  4 
>  arch/x86/kernel/acpi/boot.c   |  6 ++
>  arch/x86/kernel/head32.c  |  1 +
>  arch/x86/kernel/head64.c  |  2 ++
>  arch/x86/kernel/setup.c   | 17 +
>  arch/x86/kernel/x86_init.c|  3 +--
>  arch/x86/xen/enlighten_pvh.c  |  2 +-
>  11 files changed, 77 insertions(+), 5 deletions(-)

I have some vague memories of an older variant of this feature breaking stuff 
and resulting in 
me involuntarily participating in an overly long bisection session.

If that memory is right and I'm not confusing it with some other patchset, 
could you please 
provide some more context, what that old problem was, how it was resolved, 
whether it is 
expected to trigger on any machines, etc., to create some warm fuzzy feelings 
about this 
patch-set and to reduce my bisectofobia? ;-)

Thanks,

Ingo


Re: [PATCH v5 0/3] x86: make rsdp address accessible via boot params

2018-10-10 Thread Ingo Molnar


* Juergen Gross  wrote:

> You can just dive into the discussion we had back in February:

That was half a year and a thousand commits ago! ;-)

> https://lore.kernel.org/lkml/20180213163244.j2zuxyhs4kbfh...@gmail.com/
> 
> The scheme I have used in V5 of the series is the one you agreed to use
> back then.
> 
> A quick summary of the problem you mentioned:
> 
> There are some downstream variants of grub2 with a patch breaking the
> boot protocol by writing garbage past the end of setup_header. Adding
> a new field at the end of setup_header (here: rsdp_address) resulted in
> those grub2 variants clobbering the preset value of 0.
> 
> The solution is to let grub2 report back the used boot protocol version
> with setting a flag "I am reporting back my version". The kernel now is
> capable to know which fields of setup_header are known to grub2 and can
> act accordingly.
> 
> The related grub2 patch series is under review right now.

Ok, that's reassuring - that's all the context I needed.

Would it help grub2 review+integration if we applied this to -tip and staged
it there until the Grub patches are accepted, or can I consider those changes
as grub2 upstream accepted?

I'd like to help make it happen, let me know what the best route is.

Thanks,

Ingo


Re: [PATCH V5 0/5] KVM: X86: Introducing ROE Protection Kernel Hardening

2018-10-28 Thread Ingo Molnar


* Ahmed Abd El Mawgood  wrote:

> This is the 5th version which is 4th version with minor fixes. ROE is a 
> hypercall that enables host operating system to restrict guest's access to its
> own memory. This will provide a hardening mechanism that can be used to stop
> rootkits from manipulating kernel static data structures and code. Once a 
> memory
> region is protected the guest kernel can't even request undoing the 
> protection.
> 
> Memory protected by ROE should be non-swapable because even if the ROE 
> protected
> page got swapped out, It won't be possible to write anything in its place.
> 
> ROE hypercall should be capable of either protecting a whole memory frame or
> parts of it. With these two, it should be possible for guest kernel to protect
> its memory and all the page table entries for that memory inside the page 
> table.
> I am still not sure whether this should be part of ROE job or the guest's job.
> 
> 
> The reason why it would be better to implement this from inside kvm: instead 
> of
> (host) user space is the need to access SPTEs to modify the permissions, while
> mprotect() from user space can work in theory. It will become a big 
> performance
> hit to vmexit and switch to user space mode on each fault, on the other hand,
> having the permission handled by EPT should make some remarkable performance
> gain.
> 
> Our model assumes that an attacker got full root access to a running guest and
> his goal is to manipulate kernel code/data (hook syscalls, overwrite IDT 
> ..etc).
> 
> There is future work in progress to also put some sort of protection on the 
> page
> table register CR3 and other critical registers that can be intercepted by 
> KVM.
> This way it won't be possible for an attacker to manipulate any part of the
> guests page table.

BTW., transparent detection and trapping of attacks would also be nice: 
if ROE is active and something running on the guest still attempts to 
change the pagetables, the guest should be frozen and a syslog warning on 
the hypervisor side should be printed?

Also, the feature should probably be 'default y' to help spread it on the 
distro side. It's opt-in functionality from the guest side anyway, so 
there's no real cost on the host side other than some minor resident 
memory.

Thanks,

Ingo


[PATCH] Documentation/features: Refresh the features list to v4.20-rc2

2018-11-13 Thread Ingo Molnar


* Palmer Dabbelt  wrote:

> I didn't even know this existed until David submitted a patch set that
> included updates to the documentation as a result of some features he
> added to RISC-V.  It looks like there may be a handful of other people
> who don't know this exists either, so I figured I'd just mail out a
> patch set containing all the updates split out as well as I can.
> 
> This smells like something that sholud be automatic, so if I'm jumping
> the gun here then feel free to drop this.  If nobody says anything then
> I guess I'll submit this as a separate PR to Linus from my personal
> tree, as it's not really a RISC-V thing but it seems like it's worth
> having docs that match the code where it's trivial -- I assume that's
> what this does, I didn't actually read anything but the diff because I
> never trust documentation to be up to date...
> 
> I feel compelled to say something like "maybe this should be part of
> checkpatch?", but I'm definately not looking to learn perl :)

I don't think it should be automated or part of checkpatch, but I 
(obviously) agree with the changes, except that I think it should be a 
single patch (combined patch attached below).

Thanks,

Ingo

Subject: Documentation/features: Refresh the features list to v4.20-rc2
From: Palmer Dabbelt 

Run Documentation/features/scripts/feature-refresh.sh to refresh the 
kernel features support matrix list:

 - The new 'csky' architecture was added
 - s390now supports KASAN
 - powerpc now supports stackprotector
 - xtensa  now supports sg-chain
 - arm64   now supports queued-spinlocks
 - parisc  now supports kprobes-events
 - RISC-V  now supports pte_special

[ mingo: combined the patches and the changelogs. ]

Signed-off-by: Palmer Dabbelt 
Signed-off-by: Ingo Molnar 
---

 Documentation/features/core/cBPF-JIT/arch-support.txt  | 1 +
 Documentation/features/core/eBPF-JIT/arch-support.txt  | 1 +
 Documentation/features/core/generic-idle-thread/arch-support.txt   | 1 +
 Documentation/features/core/jump-labels/arch-support.txt   | 1 +
 Documentation/features/core/tracehook/arch-support.txt | 1 +
 Documentation/features/debug/KASAN/arch-support.txt| 3 ++-
 Documentation/features/debug/gcov-profile-all/arch-support.txt | 1 +
 Documentation/features/debug/kgdb/arch-support.txt | 1 +
 Documentation/features/debug/kprobes-on-ftrace/arch-support.txt| 1 +
 Documentation/features/debug/kprobes/arch-support.txt  | 1 +
 Documentation/features/debug/kretprobes/arch-support.txt   | 1 +
 Documentation/features/debug/optprobes/arch-support.txt| 1 +
 Documentation/features/debug/stackprotector/arch-support.txt   | 3 ++-
 Documentation/features/debug/uprobes/arch-support.txt  | 1 +
 Documentation/features/debug/user-ret-profiler/arch-support.txt| 1 +
 Documentation/features/io/dma-contiguous/arch-support.txt  | 1 +
 Documentation/features/io/sg-chain/arch-support.txt| 3 ++-
 Documentation/features/locking/cmpxchg-local/arch-support.txt  | 1 +
 Documentation/features/locking/lockdep/arch-support.txt| 1 +
 Documentation/features/locking/queued-rwlocks/arch-support.txt | 1 +
 Documentation/features/locking/queued-spinlocks/arch-support.txt   | 3 ++-
 Documentation/features/locking/rwsem-optimized/arch-support.txt| 1 +
 Documentation/features/perf/kprobes-event/arch-support.txt | 3 ++-
 Documentation/features/perf/perf-regs/arch-support.txt | 1 +
 Documentation/features/perf/perf-stackdump/arch-support.txt| 1 +
 Documentation/features/sched/membarrier-sync-core/arch-support.txt | 1 +
 Documentation/features/sched/numa-balancing/arch-support.txt   | 1 +
 Documentation/features/seccomp/seccomp-filter/arch-support.txt | 1 +
 Documentation/features/time/arch-tick-broadcast/arch-support.txt   | 1 +
 Documentation/features/time/clockevents/arch-support.txt   | 1 +
 Documentation/features/time/context-tracking/arch-support.txt  | 1 +
 Documentation/features/time/irq-time-acct/arch-support.txt | 1 +
 Documentation/features/time/modern-timekeeping/arch-support.txt| 1 +
 Documentation/features/time/virt-cpuacct/arch-support.txt  | 1 +
 Documentation/features/vm/ELF-ASLR/arch-support.txt| 1 +
 Documentation/features/vm/PG_uncached/arch-support.txt | 1 +
 Documentation/features/vm/THP/arch-support.txt | 1 +
 Documentation/features/vm/TLB/arch-support.txt | 1 +
 Documentation/features/vm/huge-vmap/arch-support.txt   | 1 +
 Documentation/features/vm/ioremap_prot/arch-support.txt| 1 +
 Documentation/features/vm/numa-memblock/arch-support.txt   | 1 +
 Documentation/features/vm/

Re: [RFC PATCH v6 01/26] Documentation/x86: Add CET description

2018-11-20 Thread Ingo Molnar


* Yu-cheng Yu  wrote:

> +X86 Documentation
> +===
> +
> +Control-flow Enforcement
> +
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   intel_cet
> diff --git a/Documentation/x86/intel_cet.rst b/Documentation/x86/intel_cet.rst
> new file mode 100644
> index ..dac83bbf8a24
> --- /dev/null
> +++ b/Documentation/x86/intel_cet.rst
> @@ -0,0 +1,268 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=
> +Control-flow Enforcement Technology (CET)
> +=
> +
> +[1] Overview
> +
> +
> +Control-flow Enforcement Technology (CET) provides protection against
> +return/jump-oriented programming (ROP) attacks.  It can be setup to
> +protect both the kernel and applications.  In the first phase,
> +only the user-mode protection is implemented in 64-bit mode; 32-bit
> +applications are supported in compatibility mode.
> +
> +CET introduces shadow stack (SHSTK) and indirect branch tracking
> +(IBT).  SHSTK is a secondary stack allocated from memory and cannot
> +be directly modified by applications.  When executing a CALL, the
> +processor pushes a copy of the return address to SHSTK.  Upon
> +function return, the processor pops the SHSTK copy and compares it
> +to the one from the program stack.  If the two copies differ, the
> +processor raises a control-protection exception.  IBT verifies all
> +indirect CALL/JMP targets are intended as marked by the compiler
> +with 'ENDBR' opcodes (see CET instructions below).
> +
> +There are two kernel configuration options:
> +
> +INTEL_X86_SHADOW_STACK_USER, and
> +INTEL_X86_BRANCH_TRACKING_USER.
> +
> +To build a CET-enabled kernel, Binutils v2.31 and GCC v8.1 or later
> +are required.  To build a CET-enabled application, GLIBC v2.28 or
> +later is also required.
> +
> +There are two command-line options for disabling CET features:
> +
> +no_cet_shstk - disables SHSTK, and
> +no_cet_ibt - disables IBT.
> +
> +At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.

What is the rough expected performance impact of CET on average function 
call frequency user applications and the kernel itself?

Thanks,

Ingo


Re: [RFC PATCH v6 01/26] Documentation/x86: Add CET description

2018-11-20 Thread Ingo Molnar


* Yu-cheng Yu  wrote:

> On Tue, 2018-11-20 at 10:52 +0100, Ingo Molnar wrote:
> > * Yu-cheng Yu  wrote:
> > 
> > > +X86 Documentation
> > > [...]
> > > +
> > > +At run time, /proc/cpuinfo shows the availability of SHSTK and IBT.
> > 
> > What is the rough expected performance impact of CET on average function 
> > call frequency user applications and the kernel itself?
> 
> I don't have any conclusive numbers yet; but since currently only user-mode
> protection is implemented, I suspect any impact would be most likely to the
> application.  The kernel would spend some small amount of time on the setup of
> CET.

This is horribly vague.

Thanks,

Ingo


Re: [PATCH v9 01/13] x86/resctrl: Rename and move rdt files to new directory

2018-11-22 Thread Ingo Molnar


* Moger, Babu  wrote:

> New generation of AMD processors start supporting RDT(or QOS)
> features. Together these features will be called as RESCTRL.
> With more than one vendors supporting these features, it seems
> more appropriate to rename these files.
> 
> Create a new directory with the name 'resctrl' and move all the
> intel_rdt files to the new directory. This way all the resctrl
> related code resides inside one directory.
> 
> Suggested-by: Borislav Petkov 
> Signed-off-by: Babu Moger 
> ---
>  .../x86/include/asm/{intel_rdt_sched.h => resctrl_sched.h} | 0
>  arch/x86/kernel/cpu/Makefile   | 5 +
>  arch/x86/kernel/cpu/resctrl/Makefile   | 7 +++
>  arch/x86/kernel/cpu/{intel_rdt.c => resctrl/core.c}| 4 ++--
>  .../cpu/{intel_rdt_ctrlmondata.c => resctrl/ctrlmondata.c} | 2 +-
>  arch/x86/kernel/cpu/{intel_rdt.h => resctrl/internal.h}| 6 +++---
>  .../kernel/cpu/{intel_rdt_monitor.c => resctrl/monitor.c}  | 2 +-
>  .../cpu/{intel_rdt_pseudo_lock.c => resctrl/pseudo_lock.c} | 6 +++---
>  .../pseudo_lock_event.h}   | 2 +-
>  .../cpu/{intel_rdt_rdtgroup.c => resctrl/rdtgroup.c}   | 4 ++--
>  arch/x86/kernel/process_32.c   | 2 +-
>  arch/x86/kernel/process_64.c   | 2 +-
>  12 files changed, 23 insertions(+), 19 deletions(-)
>  rename arch/x86/include/asm/{intel_rdt_sched.h => resctrl_sched.h} (100%)
>  create mode 100644 arch/x86/kernel/cpu/resctrl/Makefile
>  rename arch/x86/kernel/cpu/{intel_rdt.c => resctrl/core.c} (99%)
>  rename arch/x86/kernel/cpu/{intel_rdt_ctrlmondata.c => 
> resctrl/ctrlmondata.c} (99%)
>  rename arch/x86/kernel/cpu/{intel_rdt.h => resctrl/internal.h} (99%)
>  rename arch/x86/kernel/cpu/{intel_rdt_monitor.c => resctrl/monitor.c} (99%)
>  rename arch/x86/kernel/cpu/{intel_rdt_pseudo_lock.c => 
> resctrl/pseudo_lock.c} (99%)
>  rename arch/x86/kernel/cpu/{intel_rdt_pseudo_lock_event.h => 
> resctrl/pseudo_lock_event.h} (95%)
>  rename arch/x86/kernel/cpu/{intel_rdt_rdtgroup.c => resctrl/rdtgroup.c} (99%)

Ugh, violent NAK on this unreadable directory naming: 'resctrl' is an 
ugly double/triple abbreviation that nobody recognizes for what it is to 
begin with, and even the long form 'resource control' is an overly 
generic naming - *everything* the kernel does is in essence 'resource 
control' ...

So please find some better name and standardize the namespace around it. 
A couple of suggestions:

 -'Hardware Quality of Service', i.e. HW_QOS, hw_qos
 - or 'CPU bandwidth control',   i.e. CPU_BW, cpu_bw
 - or 'Hardware Bandwidth Control',  i.e. HW_BW,  hw_bw

etc.

Thanks,

ngo


Re: [PATCH v9 01/13] x86/resctrl: Rename and move rdt files to new directory

2018-11-23 Thread Ingo Molnar


* Borislav Petkov  wrote:

> On Fri, Nov 23, 2018 at 08:28:39AM +0100, Ingo Molnar wrote:
> > Ugh, violent NAK on this unreadable directory naming: 'resctrl' is an 
> > ugly double/triple abbreviation that nobody recognizes for what it is to 
> > begin with, and even the long form 'resource control' is an overly 
> > generic naming - *everything* the kernel does is in essence 'resource 
> > control' ...
> 
> Well, the fs this thing uses is called "resctrl".
> 
> Documentation/x86/resctrl_ui.txt:1075:the resctrl will still mount but cannot 
> create CTRL_MON directories.
> Documentation/x86/resctrl_ui.txt:1082:# mount -t resctrl resctrl 
> /sys/fs/resctrl
> Documentation/x86/resctrl_ui.txt:1083:# cd /sys/fs/resctrl
> 
> Are you saying that the fs should be renamed now too?

Sigh, probably not. I only noticed this naming snafu with the renaming 
commit. The high level name was always RDT-ish - which while an acronym 
is at least is a familiar high level name now with no obvious generic 
namespace collision, while 'resctrl' less so.

> How are those *abbreviations* better? "hw_bw" is especially cryptic and 
> the others are no better.

Those were suggestions - but I'd be fine with 'resource_control':

> "resctrl" to mean "resource control" is much better IMO.

Then at least make the directory name resource_control/, which is only 
marginally longer and a lot more readable.

We really don't have to fit directly names into the 8 character DOS limit 
anymore. ;-)

> [...] And it is different from the "other" resource controlling the 
> kernel does because it is under arch/x86/kernel/cpu/ which tells you it 
> is a *CPU* resource control.

Yeah, so this is not obvious from the filesystem name, nor does it excuse 
the pointless abbreviation.

High level names matter.

Also, the Kconfig space, when it gets extended with the AMD bits, should 
probably follow the same nomenclature: CONFIG_X86_CPU_RESOURCE_CONTROL=y 
or such.

Ingo


Re: [PATCH v9 01/13] x86/resctrl: Rename and move rdt files to new directory

2018-11-23 Thread Ingo Molnar


* Borislav Petkov  wrote:

> On Fri, Nov 23, 2018 at 09:41:17AM +0100, Ingo Molnar wrote:
> > Then at least make the directory name resource_control/, which is only 
> > marginally longer and a lot more readable.
> > 
> > We really don't have to fit directly names into the 8 character DOS limit 
> > anymore. ;-)
> 
> How about
> 
> resource_ctl
> 
> ?

The thing is, thinking about this as a 'CPU resource' is really a 
misnomer on the conceptual level, which is why it's bothering me: RDT is 
not really about 'CPU resources', because registers are CPU resources, 
ioports and iomem are CPU resources, APICs are CPU resources and PMU 
events are resources - none of which is part of RDT.

The key difference in RDT is that they are *shared* resources - caches 
really - where the ad-hoc cache sharing might be causing security and 
scalability problems so there's partitioning and throttling (bandwidth 
control) support in the hardware.

Is there any other resource handled than caches by RDT or by the AMD 
variant?

So how about "cache_control"? It's shorter and a lot closer to what the 
code actually does.

> resource_control/ is kinda long-ish and the other names we have there
> are nice and short, see below.
> 
> BTW, while we're talking renaming, I have a patch which renames the MCE
> pile and am planning to slap it in around -rc6 timeframe since we don't
> have a lot of RAS commits this time around, see also the end of this
> mail. It makes the naming there all nicely regular. :)

That's cool - these IMHO need to be done periodically to keep overall 
namespace complexity low enough. (As long as can be done without breaking 
any ABI that is.)

>  15 files changed, 14 insertions(+), 14 deletions(-)
>  rename arch/x86/kernel/cpu/{mcheck => mce}/Makefile (52%)
>  rename arch/x86/kernel/cpu/{mcheck/mce_amd.c => mce/amd.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck/mce-apei.c => mce/apei.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck/mce.c => mce/core.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck => mce}/dev-mcelog.c (99%)
>  rename arch/x86/kernel/cpu/{mcheck/mce-genpool.c => mce/genpool.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck/mce-inject.c => mce/inject.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck/mce_intel.c => mce/intel.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck/mce-internal.h => mce/internal.h} (100%)
>  rename arch/x86/kernel/cpu/{mcheck => mce}/p5.c (100%)
>  rename arch/x86/kernel/cpu/{mcheck/mce-severity.c => mce/severity.c} (99%)
>  rename arch/x86/kernel/cpu/{mcheck => mce}/therm_throt.c (100%)
>  rename arch/x86/kernel/cpu/{mcheck => mce}/threshold.c (100%)
>  rename arch/x86/kernel/cpu/{mcheck => mce}/winchip.c (100%)

Standardizing around 'MCE' sounds good to me!

Thanks,

Ingo


Re: [PATCH v2] doc: kernel-parameters.txt: fix documentation of nmi_watchdog parameter

2019-05-14 Thread Ingo Molnar


* Zhenzhong Duan  wrote:

> The default behavior of hardlockup depends on the config of
> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
> 
> Fix the description of nmi_watchdog to make it clear.
> 
> Signed-off-by: Zhenzhong Duan 
> Reviewed-by: Joel Fernandes (Google) 
> Cc: Steven Rostedt 
> ---
>  v2: fix description using words suggested by Steven Rostedt
> 
>  Documentation/admin-guide/kernel-parameters.txt | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 08df588..b9d4358 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2805,8 +2805,9 @@
>   0 - turn hardlockup detector in nmi_watchdog off
>   1 - turn hardlockup detector in nmi_watchdog on
>   When panic is specified, panic when an NMI watchdog
> - timeout occurs (or 'nopanic' to override the opposite
> - default). To disable both hard and soft lockup 
> detectors,
> + timeout occurs (or 'nopanic' to not panic on an NMI
> + watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set)
> + To disable both hard and soft lockup detectors,
>   please see 'nowatchdog'.
>   This is useful when you use a panic=... timeout and
>   need the box quickly up again.

Acked-by: Ingo Molnar 

Thanks,

Ingo


Re: [PATCH] doc:lock: remove reference to clever use of read-write lock

2019-09-02 Thread Ingo Molnar


* Federico Vaga  wrote:

> On Saturday, August 31, 2019 4:43:44 PM CEST Jonathan Corbet wrote:
> > On Sat, 31 Aug 2019 15:41:16 +0200
> > 
> > Federico Vaga  wrote:
> > >  several CPU's and you want to use spinlocks you can potentially use
> > > 
> > > -cheaper versions of the spinlocks. IFF you know that the spinlocks are
> > > +cheaper versions of the spinlocks. If you know that the spinlocks are
> > > 
> > >  never used in interrupt handlers, you can use the non-irq versions::
> > I suspect that was not actually a typo; "iff" is a way for the
> > mathematically inclined to say "if and only if".
> > 
> > jon
> 
> I learned something new today :)
> 
> I am not used to the mathematical English jargon. It make sense, but then I 
> would replace it with "If and only if": for clarity.

While it's used in a number of places and it's pretty common wording 
overall in the literature, I agree that we should probably change this in 
locking API user facing documentation.

If you change it, please do it in both places it's used.

Thanks,

Ingo


Re: [PATCH 0/6] Address issues with SPDX requirements and PEP-263

2019-09-09 Thread Ingo Molnar


* Thomas Gleixner  wrote:

> On Sun, 8 Sep 2019, Matthew Wilcox wrote:
> > On Sat, Sep 07, 2019 at 11:17:22PM +0200, Thomas Gleixner wrote:
> > > On Sat, 7 Sep 2019, Markus Heiser wrote:
> > > > Am 07.09.19 um 20:04 schrieb Mauro Carvalho Chehab:
> > > > > No idea. I would actually prefer to just remove the restriction, and 
> > > > > let
> > > > > the SPDX header to be anywhere inside the first comment block inside a
> > > > > file [2].
> > > > > [2] I *suspect* that the restriction was added in order to make
> > > > >  ./scripts/spdxcheck.py to run faster and to avoid false 
> > > > > positives.
> > > > >  Right now, if the maximum limit is removed (or set to a very high
> > > > >  value), there will be one false positive:
> > > 
> > > Nope. The intention was to have a well define place and format instead of
> > > everyone and his dog deciding to put it somewhere. SPDX is not intended to
> > > replace the existing licensing mess with some other randomly placed and
> > > formatted licensing mess.
> > 
> > I find the current style quite unaesthetic:
> > 
> > // SPDX-License-Identifier: GPL-2.0-only
> > /*
> >  *  linux/mm/memory.c
> >  *
> >  *  Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
> >  */
> > 
> > I'd much rather see
> > 
> > /*
> >  * SPDX-License-Identifier: GPL-2.0-only
> >  * Copyright (C) 1991, 1992, 1993, 1994  Linus Torvalds
> >  */
> > 
> > but I appreciate the desire to force it to be on the first line if at all
> > possible.
> 
> That style is inflicted upon you by Penguin Emperor Decree. :)

I'd also say that it's a rather tooling-friendly format which mandates a 
single-line representation, which will be less likely to be morphed into 
a zillion variants like the original boilerplates.

So the Penguin Emperor Decree also makes sense, which helps. ;-)

Thanks,

Ingo


Re: [RFC PATCH] x86/doc/boot_protocol: Correct the description of "reloc"

2019-09-25 Thread Ingo Molnar
* Cao jin  wrote:

> The fields marked with (reloc) actually are not dedicated for writing,
> but communicating info for relocatable kernel with boot loaders. For
> example:
> 
> 
> Field name: pref_address
> Type:   read (reloc)
> Offset/size:0x258/8
> Protocol:   2.10+
> 
> 
> 
> Field name: code32_start
> Type:   modify (optional, reloc)
> Offset/size:0x214/4
> Protocol:   2.00+
> 
> 
> Signed-off-by: Cao jin 
> ---
> Unless I have incorrect non-native understanding for "fill in", I think
> this is inaccurate.
> 
>  Documentation/x86/boot.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/x86/boot.rst b/Documentation/x86/boot.rst
> index 08a2f100c0e6..a611bf04492d 100644
> --- a/Documentation/x86/boot.rst
> +++ b/Documentation/x86/boot.rst
> @@ -243,7 +243,7 @@ bootloader ("modify").
>  
>  All general purpose boot loaders should write the fields marked
>  (obligatory).  Boot loaders who want to load the kernel at a
> -nonstandard address should fill in the fields marked (reloc); other
> +nonstandard address should consult with the fields marked (reloc); other
>  boot loaders can ignore those fields.
>  
>  The byte order of all fields is littleendian (this is x86, after all.)

Well, this documentation is written from the point of view of a 
*bootloader*, not the kernel. So the 'fill in' says that the bootloader 
should write those fields - which is correct, right?

Thanks,

Ingo


Re: [PATCH v7 0/8] efi/firmware/platform-x86: Add EFI embedded fw support

2019-10-07 Thread Ingo Molnar


* Hans de Goede  wrote:

> Hi All,
> 
> Here is v7 of my patch-set to add support for EFI embedded fw to the kernel.
> 
> v6 was posted a long time ago, around the 4.18 days. The long wait was for
> a suitable secure-hash for checking the firmware we find embedded in the EFI
> is the one we expect.
> 
> With 5.4-rc1 we finally have a standalone sha256 lib, so that hurdle for
> this patch-set is now gone.
> 
> I've tried to address all review-remarks against v6 in this new version:
> 
> Changes in v7:
> - Split drivers/firmware/efi and drivers/base/firmware_loader changes into
>   2 patches
> - Use new, standalone, lib/crypto/sha256.c code
> - Address kdoc comments from Randy Dunlap
> - Add new FW_OPT_FALLBACK_PLATFORM flag and firmware_request_platform()
>   _request_firmware() wrapper, as requested by Luis R. Rodriguez
> - Stop using "efi-embedded-firmware" device-property, now that drivers need to
>   use the new firmware_request_platform() to enable fallback to a device fw
>   copy embedded in the platform's main firmware, we no longer need a property
>   on the device to trigger this behavior
> - Use security_kernel_load_data instead of calling
>   security_kernel_read_file with a NULL file pointer argument
> - Move the docs to Documentation/driver-api/firmware/fallback-mechanisms.rst
> - Document the new firmware_request_platform() function in
>   Documentation/driver-api/firmware/request_firmware.rst
> - Add 2 new patches for the silead and chipone-icn8505 touchscreen drivers
>   to use the new firmware_request_platform() method
> - Rebased on top of 5.4-rc1
> 
> I guess this will probably need another round (ot two) of review + fixing,
> but eventually this can hopefully be merged. Since this touches a bunch
> of different subsystems the question is how to merge this? Most of the
> touched files outside of the firmware-loader code do not see a lot of
> churn, so my proposal would be to merge patches 1-6 through the tree
> which carries firmware-loader changes; and then provide an immutable
> branch for the platform/x86 maintainers to merge and then they can merge
> the last 2 patches (as the touchscreen_dmi.c file does see quite a bit
> of changes every release).

So I was looking for a high level 0/ boilerplate description of this 
series, to explain what "EFI embedded fw" is, what problems it solves and 
how it helps the kernel in general - and found this in 2/8:

>> Just like with PCI options ROMs, which we save in the setup_efi_pci*
>> functions from arch/x86/boot/compressed/eboot.c, the EFI code / ROM itself
>> sometimes may contain data which is useful/necessary for peripheral drivers
>> to have access to.
>>
>> Specifically the EFI code may contain an embedded copy of firmware which
>> needs to be (re)loaded into the peripheral. Normally such firmware would be
>> part of linux-firmware, but in some cases this is not feasible, for 2
>> reasons:
>>
>> 1) The firmware is customized for a specific use-case of the chipset / use
>> with a specific hardware model, so we cannot have a single firmware file
>> for the chipset. E.g. touchscreen controller firmwares are compiled
>> specifically for the hardware model they are used with, as they are
>> calibrated for a specific model digitizer.
>>
>> 2) Despite repeated attempts we have failed to get permission to
>> redistribute the firmware. This is especially a problem with customized
>> firmwares, these get created by the chip vendor for a specific ODM and the
>> copyright may partially belong with the ODM, so the chip vendor cannot
>> give a blanket permission to distribute these.
>>
>> This commit adds support for finding peripheral firmware embedded in the
>> EFI code and makes the found firmware available through the new
>> efi_get_embedded_fw() function.
>>
>> Support for loading these firmwares through the standard firmware loading
>> mechanism is added in a follow-up commit in this patch-series.
>>
>> Note we check the EFI_BOOT_SERVICES_CODE for embedded firmware near the end
>> of start_kernel(), just before calling rest_init(), this is on purpose
>> because the typical EFI_BOOT_SERVICES_CODE memory-segment is too large for
>> early_memremap(), so the check must be done after mm_init(). This relies
>> on EFI_BOOT_SERVICES_CODE not being free-ed until efi_free_boot_services()
>> is called, which means that this will only work on x86 for now.
>>
>> Reported-by: Dave Olsthoorn 
>> Suggested-by: Peter Jones 
>> Acked-by: Ard Biesheuvel 
>> Signed-off-by: Hans de Goede 

There's also patch #3, which explains how this is used:

>> This commit adds a new platform fallback mechanism to the firmware loader
>> which will try to lookup a device fw copy embedded in the platform's main
>> firmware if direct filesystem lookup fails.
>>
>> Drivers which need such embedded fw copies can enable this fallback
>> mechanism by using the new firmware_request_platform() function.
>>
>> Note that for now this is only supported on EFI platforms and even on
>> these

Re: [PATCH v7 0/8] efi/firmware/platform-x86: Add EFI embedded fw support

2019-10-08 Thread Ingo Molnar


* Hans de Goede  wrote:

> > So I was looking for a high level 0/ boilerplate description of this
> > series, to explain what "EFI embedded fw" is, what problems it solves and
> > how it helps the kernel in general - and found this in 2/8:
> 
> Sorry you had to dig into the individual patch changelogs for that
> I sorta assumed that everyone involved would still vaguely remember
> what this series is about.

Wasn't *that* hard to do and I intended to read the patches anyway. ;-)

Thanks for the explanation and the answers, this all looks good to me in 
principle and in implementation as well.

Thanks,

Ingo


Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

2023-09-18 Thread Ingo Molnar


* Matteo Rizzo  wrote:

> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
>  wrote:
> >
> > On Fri, 15 Sep 2023, Dave Hansen wrote:
> >
> > > What's the cost?
> >
> > The only thing that I see is 1-2% on kernel compilations (and "more on
> > machines with lots of cores")?
> 
> I used kernel compilation time (wall clock time) as a benchmark while
> preparing the series. Lower is better.
> 
> Intel Skylake, 112 cores:
> 
>   LABEL| COUNT |   MIN   |   MAX   |   MEAN  |  MEDIAN | STDDEV
> ---+---+-+-+-+-+
> SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
>|   | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%
> 
> AMD Milan, 256 cores:
> 
> LABEL  | COUNT |   MIN   |   MAX   |   MEAN  |  MEDIAN | STDDEV
> ---+---+-+-+-+-+
> SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
>|   | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

That's sadly a rather substantial overhead for a compiler/linker workload 
that is dominantly user-space: a kernel build is about 90% user-time and 
10% system-time:

   $ perf stat --null make -j64 vmlinux
   ...

   Performance counter stats for 'make -j64 vmlinux':

59.840704481 seconds time elapsed

  2000.774537000 seconds user
   219.13828 seconds sys

What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between 
user-space execution and kernel-space execution?

Thanks,

Ingo


Re: [RFC PATCH 00/14] Prevent cross-cache attacks in the SLUB allocator

2023-09-20 Thread Ingo Molnar


* Matteo Rizzo  wrote:

> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar  wrote:
> >
> > What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> > user-space execution and kernel-space execution?
> >
> 
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):
> 
> Intel Skylake:
> 
>   LABEL| COUNT |   MIN|   MAX|   MEAN   |  MEDIAN  | STDDEV
> ---+---+--+--+--+--+
> wall clock |   |  |  |  |  |
> SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
>|   | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
> system time|   |  |  |  |  |
> SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
> SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
>|   | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
> user time  |   |  |  |  |  |
> SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
> SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
>|   | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

These Skylake figures are a bit counter-intuitive: how does an increase of 
only +0.08% user-time - which dominates 89.5% of execution, combined with a 
+1.42% increase in system time that consumes only 10.5% of CPU capacity, 
result in a +0.85% increase in wall-clock time?

There might be hidden factors at work in the DMA space, as Linus suggested?

Or perhaps wall-clock time is dominated by the single-threaded final link 
time of the kernel, which phase might be disproportionately hurt by these 
changes?

(Stddev seems low enough for this not to be a measurement artifact.)

The AMD Milan figures are more intuitive:

> AMD Milan:
> 
>   LABEL| COUNT |   MIN|   MAX|   MEAN   |  MEDIAN  | STDDEV
> ---+---+--+--+--+--+
> wall clock |   |  |  |  |  |
> SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
>|   | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
> system time|   |  |  |  |  |
> SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
> SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
>|   | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
> user time  |   |  |  |  |  |
> SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
> SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
>|   | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%
>
> 
> I'm not exactly sure why user time increases by almost 1% on Milan, it 
> could be TLB contention.

The other worrying aspect is the increase of +6.15% of system time ... 
which is roughly in line with what we'd expect from a +1.79% increase in 
wall-clock time.

Thanks,

Ingo