[PATCH 4.17 87/97] x86/speculation: Simplify sysfs report of VMX L1TF vulnerability

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Paolo Bonzini 

commit ea156d192f5257a5bf393d33910d3b481bf8a401 upstream

Three changes to the content of the sysfs file:

 - If EPT is disabled, L1TF cannot be exploited even across threads on the
   same core, and SMT is irrelevant.

 - If mitigation is completely disabled, and SMT is enabled, print "vulnerable"
   instead of "vulnerable, SMT vulnerable"

 - Reorder the two parts so that the main vulnerability state comes first
   and the detail on SMT is second.
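
For illustration, a compact userspace sketch of the resulting report shapes; the
literal strings used here are assumptions (the real L1TF_DEFAULT_MSG and
l1tf_vmx_states[] entries live in bugs.c):

  #include <stdio.h>

  enum vmx_mit { EPT_DISABLED, NEVER, COND };	/* simplified state set */

  static void report(enum vmx_mit mit, int smt_on)
  {
  	/* placeholder strings; the real ones come from bugs.c */
  	const char *base  = "Mitigation: PTE Inversion";
  	const char *state = mit == EPT_DISABLED ? "EPT disabled" :
  			    mit == NEVER ? "vulnerable" : "conditional cache flushes";

  	if (mit == EPT_DISABLED || (mit == NEVER && smt_on))
  		printf("%s; VMX: %s\n", base, state);		/* SMT detail omitted */
  	else
  		printf("%s; VMX: %s, SMT %s\n", base, state,
  		       smt_on ? "vulnerable" : "disabled");
  }

  int main(void)
  {
  	report(EPT_DISABLED, 1);	/* EPT off: SMT is irrelevant */
  	report(NEVER, 1);		/* mitigation off + SMT on: just "vulnerable" */
  	report(COND, 0);		/* normal case: main state first, SMT detail second */
  	return 0;
  }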

Signed-off-by: Paolo Bonzini 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/bugs.c |   12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -738,9 +738,15 @@ static ssize_t l1tf_show_state(char *buf
if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_AUTO)
return sprintf(buf, "%s\n", L1TF_DEFAULT_MSG);
 
-   return sprintf(buf, "%s; VMX: SMT %s, L1D %s\n", L1TF_DEFAULT_MSG,
-  cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled",
-  l1tf_vmx_states[l1tf_vmx_mitigation]);
+   if (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_EPT_DISABLED ||
+   (l1tf_vmx_mitigation == VMENTER_L1D_FLUSH_NEVER &&
+cpu_smt_control == CPU_SMT_ENABLED))
+   return sprintf(buf, "%s; VMX: %s\n", L1TF_DEFAULT_MSG,
+  l1tf_vmx_states[l1tf_vmx_mitigation]);
+
+   return sprintf(buf, "%s; VMX: %s, SMT %s\n", L1TF_DEFAULT_MSG,
+  l1tf_vmx_states[l1tf_vmx_mitigation],
+  cpu_smt_control == CPU_SMT_ENABLED ? "vulnerable" : "disabled");
 }
 #else
 static ssize_t l1tf_show_state(char *buf)




[PATCH 4.17 50/97] Revert "x86/apic: Ignore secondary threads if nosmt=force"

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

commit 506a66f374891ff08e064a058c446b336c5ac760 upstream

Dave Hansen reported, that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.

The reason is that Machine Check Exceptions are broadcast to siblings and
the soft-disabled sibling has CR4.MCE = 0. If an MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:

Because the logical processors within a physical package are tightly
coupled with respect to shared hardware resources, both logical
processors are notified of machine check errors that occur within a
given physical processor. If machine-check exceptions are enabled when
a fatal error is reported, all the logical processors within a physical
package are dispatched to the machine-check exception handler. If
machine-check exceptions are disabled, the logical processors enter the
shutdown state and assert the IERR# signal. When enabling machine-check
exceptions, the MCE flag in control register CR4 should be set for each
logical processor.

Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.

This thoughtfully engineered mechanism also turns the boot process on all
Intel HT enabled systems into an MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:

MCE is enabled on the boot CPU:

[0.244017] mce: CPU supports 22 MCE banks

The corresponding sibling #72 boots:

[1.008005]  node  #0, CPUs:#72

That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shutdown. At least it's a
known safe state.

It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.

Adjust the nosmt kernel parameter documentation as well.

Reverts: 2207def700f9 ("x86/apic: Ignore secondary threads if nosmt=force")
Reported-by: Dave Hansen 
Signed-off-by: Thomas Gleixner 
Tested-by: Tony Luck 
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/admin-guide/kernel-parameters.txt |8 ++--
 arch/x86/include/asm/apic.h |2 --
 arch/x86/kernel/acpi/boot.c |3 +--
 arch/x86/kernel/apic/apic.c |   19 ---
 4 files changed, 3 insertions(+), 29 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2676,12 +2676,8 @@
Equivalent to smt=1.
 
[KNL,x86] Disable symmetric multithreading (SMT).
-   nosmt=force: Force disable SMT, similar to disabling
-it in the BIOS except that some of the
-resource partitioning effects which are
-caused by having SMT enabled in the BIOS
-cannot be undone. Depending on the CPU
-type this might have a performance impact.
+   nosmt=force: Force disable SMT, cannot be undone
+via the sysfs control file.
 
nospectre_v2[X86] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -504,10 +504,8 @@ extern int default_check_phys_apicid_pre
 
 #ifdef CONFIG_SMP
 bool apic_id_is_primary_thread(unsigned int id);
-bool apic_id_disabled(unsigned int id);
 #else
 static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
-static inline bool apic_id_disabled(unsigned int id) { return false; }
 #endif
 
 extern void irq_enter(void);
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -181,8 +181,7 @@ static int acpi_register_lapic(int id, u
}
 
if (!enabled) {
-   if (!apic_id_disabled(id))
-   ++disabled_cpus;
+   ++disabled_cpus;
return -EINVAL;
}
 
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2207,16 +2207,6 @@ bool apic_id_is_primary_thread(unsigned
return !(apicid & mask);
 }
 
-/**
- * apic_id_disabled - Check whether APIC ID is disabled via SMT control
- * @id:APIC ID to check

Re: [PATCH 06/13] coresight: etb10: Handle errors enabling the device

2018-08-14 Thread Mathieu Poirier
Hi Suzuki,

On Mon, Aug 06, 2018 at 02:41:48PM +0100, Suzuki K Poulose wrote:
> Prepare the etb10 driver to return errors in enabling
> the device.
> 
> Cc: Mathieu Poirier 
> Signed-off-by: Suzuki K Poulose 
> ---
>  drivers/hwtracing/coresight/coresight-etb10.c | 18 +-
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/hwtracing/coresight/coresight-etb10.c b/drivers/hwtracing/coresight/coresight-etb10.c
> index 9fd77fd..37d2c88 100644
> --- a/drivers/hwtracing/coresight/coresight-etb10.c
> +++ b/drivers/hwtracing/coresight/coresight-etb10.c
> @@ -107,7 +107,7 @@ static unsigned int etb_get_buffer_depth(struct etb_drvdata *drvdata)
>   return depth;
>  }
>  
> -static void etb_enable_hw(struct etb_drvdata *drvdata)
> +static void __etb_enable_hw(struct etb_drvdata *drvdata)
>  {
>   int i;
>   u32 depth;
> @@ -135,6 +135,12 @@ static void etb_enable_hw(struct etb_drvdata *drvdata)
>   CS_LOCK(drvdata->base);
>  }
>  
> +static int etb_enable_hw(struct etb_drvdata *drvdata)
> +{
> + __etb_enable_hw(drvdata);
> + return 0;
> +}
> +
>  static int etb_enable(struct coresight_device *csdev, u32 mode, void *data)
>  {
>   int ret = 0;
> @@ -150,7 +156,7 @@ static int etb_enable(struct coresight_device *csdev, u32 mode, void *data)
>   if (mode == CS_MODE_PERF) {
>   ret = etb_set_buffer(csdev, (struct perf_output_handle *)data);
>   if (ret)
> - goto out;
> + return ret;
>   }
>  
>   val = local_cmpxchg(&drvdata->mode,
> @@ -172,12 +178,14 @@ static int etb_enable(struct coresight_device *csdev, u32 mode, void *data)
>   goto out;
>  
>   spin_lock_irqsave(&drvdata->spinlock, flags);
> - etb_enable_hw(drvdata);
> + ret = etb_enable_hw(drvdata);
>   spin_unlock_irqrestore(&drvdata->spinlock, flags);
>  
> -out:
> - if (!ret)
> + if (ret)
> + local_cmpxchg(&drvdata->mode, mode, CS_MODE_DISABLED);
> + else

I also have to do hackish things with my work on
CPU-wide trace scenarios because drvdata->mode is of type local_t.  Instead of
living with it I'll send out a patch later today that moves it to a u32 like ETF
and ETR.

Please look at it and if you're happy, add it to this patchset and do your 
modifications on top of it.

Thanks,
Mathieu

>   dev_dbg(drvdata->dev, "ETB enabled\n");
> +
>   return ret;
>  }
>  
> -- 
> 2.7.4
> 


[PATCH 4.14 016/104] fix __legitimize_mnt()/mntput() race

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream.

__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntput() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.

Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped.  Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.

It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there.  The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...
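
Purely as an illustration (C11 atomics, simplified names, not the kernel code),
this is the store-buffering pattern the two smp_mb() calls added below close:

  #include <stdatomic.h>
  #include <stdio.h>

  /* stand-ins for the mount refcount and the mount_lock sequence count */
  static atomic_int mnt_count;
  static atomic_int mount_seq;

  static int legitimize_side(void)
  {
  	atomic_fetch_add(&mnt_count, 1);		/* mnt_add_count(mnt, 1)   */
  	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb()                */
  	return atomic_load(&mount_seq);			/* read_seqretry() check   */
  }

  static int mntput_side(void)
  {
  	atomic_fetch_add(&mount_seq, 1);		/* lock_mount_hash() bumps the seqcount */
  	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb()                */
  	return atomic_load(&mnt_count);			/* mnt_add_count(-1)/mnt_get_count()    */
  }

  int main(void)
  {
  	/* With both fences in place it is impossible for the legitimize side
  	 * to miss the seqcount bump *and* the mntput side to miss the
  	 * refcount increment in the same interleaving. */
  	printf("%d %d\n", legitimize_side(), mntput_side());
  	return 0;
  }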

Reported-by: Oleg Nesterov 
Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/namespace.c |   14 ++
 1 file changed, 14 insertions(+)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -659,12 +659,21 @@ int __legitimize_mnt(struct vfsmount *ba
return 0;
mnt = real_mount(bastard);
mnt_add_count(mnt, 1);
+   smp_mb();   // see mntput_no_expire()
if (likely(!read_seqretry(&mount_lock, seq)))
return 0;
if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
mnt_add_count(mnt, -1);
return 1;
}
+   lock_mount_hash();
+   if (unlikely(bastard->mnt_flags & MNT_DOOMED)) {
+   mnt_add_count(mnt, -1);
+   unlock_mount_hash();
+   return 1;
+   }
+   unlock_mount_hash();
+   /* caller will mntput() */
return -1;
 }
 
@@ -1210,6 +1219,11 @@ static void mntput_no_expire(struct moun
return;
}
lock_mount_hash();
+   /*
+* make sure that if __legitimize_mnt() has not seen us grab
+* mount_lock, we'll see their refcount increment here.
+*/
+   smp_mb();
mnt_add_count(mnt, -1);
if (mnt_get_count(mnt)) {
rcu_read_unlock();




Re: [PATCH 1/2] perf tools: Make check-headers.sh check based on kernel dir

2018-08-14 Thread Arnaldo Carvalho de Melo
Em Tue, Aug 14, 2018 at 09:27:26AM +0200, Jiri Olsa escreveu:
> On Tue, Aug 14, 2018 at 11:47:39AM +1000, Michael Ellerman wrote:
> > Jiri Olsa  writes:
> > > diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
> > > index ea48aa6f8d19..9d466e853aec 100755
> > > --- a/tools/perf/check-headers.sh
> > > +++ b/tools/perf/check-headers.sh
> > > @@ -88,6 +88,8 @@ check () {
> > >  # differences.
> > >  test -d ../../include || exit 0
> > >  
> > > +pushd ../.. > /dev/null
> > > +
> > >  # simple diff check
> > >  for i in $HEADERS; do
> > >check $i -B
> > 
> > This breaks the build when sh is not bash:
> > 
> >   ./check-headers.sh: 91: ./check-headers.sh: pushd: not found
> >   ./check-headers.sh: 107: ./check-headers.sh: popd: not found
> >   Makefile.perf:205: recipe for target 'sub-make' failed
> 
> sry.. Arnaldo, would you change it for simple cd (attached below)
> or should I send the fix?

Nah, I'm folding this in, to keep it bisectable.
 
> thanks,
> jirka
> 
> 
> ---
> diff --git a/tools/perf/check-headers.sh b/tools/perf/check-headers.sh
> index 80bf84803677..466540ee8ea7 100755
> --- a/tools/perf/check-headers.sh
> +++ b/tools/perf/check-headers.sh
> @@ -88,7 +88,7 @@ check () {
>  # differences.
>  test -d ../../include || exit 0
>  
> -pushd ../.. > /dev/null
> +cd ../..
>  
>  # simple diff check
>  for i in $HEADERS; do
> @@ -104,4 +104,4 @@ check include/uapi/linux/mman.h   '-I "^#include <\(uapi/\)*asm/mman.h>"'
>  # diff non-symmetric files
>  check_2 tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
>  
> -popd > /dev/null
> +cd tools/perf


Re: [PATCH] PCI: Equalize hotplug memory for non/occupied slots

2018-08-14 Thread Derrick, Jonathan
It's been a few weeks. Thoughts on this one?

On Wed, 2018-07-25 at 17:02 -0600, Jon Derrick wrote:
> Currently, a hotplug bridge will be given hpmemsize additional memory if
> available, in order to satisfy any future hotplug allocation requirements.
> 
> These calculations don't consider the current memory size of the hotplug
> bridge/slot, so hotplug bridges/slots which have downstream devices will
> get their current allocation in addition to the hpmemsize value.
> 
> This makes for possibly undesirable results with a mix of unoccupied and
> occupied slots (ex, with hpmemsize=2M):
> 
> 02:03.0 PCI bridge: <-- Occupied
>   Memory behind bridge: d620-d64f [size=3M]
> 02:04.0 PCI bridge: <-- Unoccupied
>   Memory behind bridge: d650-d66f [size=2M]
> 
> This change considers the current allocation size when using the
> hpmemsize parameter to make the reservations predictable for the mix of
> unoccupied and occupied slots:
> 
> 02:03.0 PCI bridge: <-- Occupied
>   Memory behind bridge: d620-d63f [size=2M]
> 02:04.0 PCI bridge: <-- Unoccupied
>   Memory behind bridge: d640-d65f [size=2M]
> 
> Signed-off-by: Jon Derrick 
> ---
> Original RFC here:
> https://patchwork.ozlabs.org/patch/945374/
> 
> I split this bit out from the RFC while awaiting the pci string handling
> enhancements to handle per-device settings
> 
> Changed from RFC is a simpler algo
> 
>  drivers/pci/setup-bus.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 79b1824..5ae39e6 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -831,7 +831,8 @@ static resource_size_t calculate_iosize(resource_size_t size,
>  
>  static resource_size_t calculate_memsize(resource_size_t size,
>   resource_size_t min_size,
> - resource_size_t size1,
> + resource_size_t add_size,
> + resource_size_t children_add_size,
>   resource_size_t old_size,
>   resource_size_t align)
>  {
> @@ -841,7 +842,7 @@ static resource_size_t calculate_memsize(resource_size_t size,
>   old_size = 0;
>   if (size < old_size)
>   size = old_size;
> - size = ALIGN(size + size1, align);
> + size = ALIGN(max(size, add_size) + children_add_size, align);
>   return size;
>  }
>  
> @@ -1079,12 +1080,10 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>  
>   min_align = calculate_mem_align(aligns, max_order);
>   min_align = max(min_align, window_alignment(bus, b_res->flags));
> - size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), min_align);
> + size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), min_align);
>   add_align = max(min_align, add_align);
> - if (children_add_size > add_size)
> - add_size = children_add_size;
> - size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
> - calculate_memsize(size, min_size, add_size,
> + size1 = (!realloc_head || (realloc_head && !add_size && !children_add_size)) ? size0 :
> + calculate_memsize(size, min_size, add_size, children_add_size,
>   resource_size(b_res), add_align);
>   if (!size0 && !size1) {
>   if (b_res->start || b_res->end)



[PATCH 4.18 02/79] x86/speculation: Protect against userspace-userspace spectreRSB

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Jiri Kosina 

commit fdf82a7856b32d905c39afc85e34364491e46346 upstream.

The article "Spectre Returns! Speculation Attacks using the Return Stack
Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks,
making use solely of the RSB contents even on CPUs that don't fallback to
BTB on RSB underflow (Skylake+).

Mitigate userspace-userspace attacks by always unconditionally filling RSB on
context switch when the generic spectrev2 mitigation has been enabled.

[1] https://arxiv.org/pdf/1807.07940.pdf

Signed-off-by: Jiri Kosina 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Tim Chen 
Cc: Konrad Rzeszutek Wilk 
Cc: Borislav Petkov 
Cc: David Woodhouse 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.yfh.7.76.1807261308190@cbobk.fhfr.pm
Signed-off-by: Greg Kroah-Hartman 

---
 arch/x86/kernel/cpu/bugs.c |   38 +++---
 1 file changed, 7 insertions(+), 31 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -313,23 +313,6 @@ static enum spectre_v2_mitigation_cmd __
return cmd;
 }
 
-/* Check for Skylake-like CPUs (for RSB handling) */
-static bool __init is_skylake_era(void)
-{
-   if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-   boot_cpu_data.x86 == 6) {
-   switch (boot_cpu_data.x86_model) {
-   case INTEL_FAM6_SKYLAKE_MOBILE:
-   case INTEL_FAM6_SKYLAKE_DESKTOP:
-   case INTEL_FAM6_SKYLAKE_X:
-   case INTEL_FAM6_KABYLAKE_MOBILE:
-   case INTEL_FAM6_KABYLAKE_DESKTOP:
-   return true;
-   }
-   }
-   return false;
-}
-
 static void __init spectre_v2_select_mitigation(void)
 {
enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
@@ -390,22 +373,15 @@ retpoline_auto:
pr_info("%s\n", spectre_v2_strings[mode]);
 
/*
-* If neither SMEP nor PTI are available, there is a risk of
-* hitting userspace addresses in the RSB after a context switch
-* from a shallow call stack to a deeper one. To prevent this fill
-* the entire RSB, even when using IBRS.
+* If spectre v2 protection has been enabled, unconditionally fill
+* RSB during a context switch; this protects against two independent
+* issues:
 *
-* Skylake era CPUs have a separate issue with *underflow* of the
-* RSB, when they will predict 'ret' targets from the generic BTB.
-* The proper mitigation for this is IBRS. If IBRS is not supported
-* or deactivated in favour of retpolines the RSB fill on context
-* switch is required.
+*  - RSB underflow (and switch to BTB) on Skylake+
+*  - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
 */
-   if ((!boot_cpu_has(X86_FEATURE_PTI) &&
-!boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {
-   setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
-   pr_info("Spectre v2 mitigation: Filling RSB on context 
switch\n");
-   }
+   setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
+   pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context 
switch\n");
 
/* Initialize Indirect Branch Prediction Barrier if supported */
if (boot_cpu_has(X86_FEATURE_IBPB)) {




[PATCH 4.18 19/79] cpu/hotplug: Provide knobs to control SMT

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

Provide a command line and a sysfs knob to control SMT.

The command line options are:

 'nosmt':   Enumerate secondary threads, but do not online them

 'nosmt=force': Ignore secondary threads completely during enumeration
via MP table and ACPI/MADT.

The sysfs control file has the following states (read/write):

 'on':   SMT is enabled. Secondary threads can be freely onlined
 'off':  SMT is disabled. Secondary threads, even if enumerated
 cannot be onlined
 'forceoff': SMT is permanently disabled. Writes to the control
 file are rejected.
 'notsupported': SMT is not supported by the CPU

The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.

The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.

When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.

When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.

When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.

When the control status is 'notsupported' then writes to the control file
are rejected.
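
A minimal userspace sketch (not part of the patch) that queries the new
interface; the paths and values are the ones documented in the sysfs ABI hunk
below:

  #include <stdio.h>

  static void show(const char *path)
  {
  	char buf[32];
  	FILE *f = fopen(path, "r");

  	if (!f) {
  		printf("%s: not available\n", path);
  		return;
  	}
  	if (fgets(buf, sizeof(buf), f))
  		printf("%s: %s", path, buf);
  	fclose(f);
  }

  int main(void)
  {
  	/* "on", "off", "forceoff" or "notsupported" */
  	show("/sys/devices/system/cpu/smt/control");
  	/* whether SMT is active (enabled and siblings online) */
  	show("/sys/devices/system/cpu/smt/active");
  	return 0;
  }

Writing "on" or "off" to the control file switches state as described above;
writes are rejected while the state is "forceoff" or "notsupported".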

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |   20 ++
 Documentation/admin-guide/kernel-parameters.txt|8 
 arch/Kconfig   |3 
 arch/x86/Kconfig   |1 
 include/linux/cpu.h|   13 +
 kernel/cpu.c   |  170 +
 6 files changed, 215 insertions(+)

--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -487,3 +487,23 @@ Description:   Information about CPU vulne
"Not affected"CPU is not affected by the vulnerability
"Vulnerable"  CPU is affected and no mitigation in effect
"Mitigation: $M"  CPU is affected and mitigation $M is in effect
+
+What:  /sys/devices/system/cpu/smt
+   /sys/devices/system/cpu/smt/active
+   /sys/devices/system/cpu/smt/control
+Date:  June 2018
+Contact:   Linux kernel mailing list 
+Description:   Control Symmetric Multi Threading (SMT)
+
+   active:  Tells whether SMT is active (enabled and siblings online)
+
+   control: Read/write interface to control SMT. Possible
+values:
+
+"on"   SMT is enabled
+"off"  SMT is disabled
+"forceoff" SMT is force disabled. Cannot be 
changed.
+"notsupported" SMT is not supported by the CPU
+
+If control status is "forceoff" or "notsupported" writes
+are rejected.
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2687,6 +2687,14 @@
nosmt   [KNL,S390] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
 
+   [KNL,x86] Disable symmetric multithreading (SMT).
+   nosmt=force: Force disable SMT, similar to disabling
+it in the BIOS except that some of the
+resource partitioning effects which are
+caused by having SMT enabled in the BIOS
+cannot be undone. Depending on the CPU
+type this might have a performance impact.
+
nospectre_v2[X86] Disable all mitigations for the Spectre variant 2
(indirect branch prediction) vulnerability. System may
allow data leaks with this option, which is equivalent
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -13,6 +13,9 @@ config KEXEC_CORE
 config HAVE_IMA_KEXEC
bool
 
+config HOTPLUG_SMT
+   bool
+
 config OPROFILE
tristate "OProfile system profiling"
depends on PROFILING
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -187,6 +187,7 @@ config X86
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_USER_RETURN_NOTIFIER
+   select HOTPLUG_SMT

[PATCH 4.18 18/79] cpu/hotplug: Split do_cpu_down()

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

Split out the inner workings of do_cpu_down() to allow reuse of that
function for the upcoming SMT disabling mechanism.

No functional change.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 kernel/cpu.c |   17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -906,20 +906,19 @@ out:
return ret;
 }
 
+static int cpu_down_maps_locked(unsigned int cpu, enum cpuhp_state target)
+{
+   if (cpu_hotplug_disabled)
+   return -EBUSY;
+   return _cpu_down(cpu, 0, target);
+}
+
 static int do_cpu_down(unsigned int cpu, enum cpuhp_state target)
 {
int err;
 
cpu_maps_update_begin();
-
-   if (cpu_hotplug_disabled) {
-   err = -EBUSY;
-   goto out;
-   }
-
-   err = _cpu_down(cpu, 0, target);
-
-out:
+   err = cpu_down_maps_locked(cpu, target);
cpu_maps_update_done();
return err;
 }




[PATCH 4.18 14/79] sched/smt: Update sched_smt_present at runtime

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Peter Zijlstra 

The static key sched_smt_present is only updated at boot time when SMT
siblings have been detected. Booting with maxcpus=1 and bringing the
siblings online after boot rebuilds the scheduling domains correctly but
does not update the static key, so the SMT code is not enabled.

Let the key be updated in the scheduler CPU hotplug code to fix this.

Signed-off-by: Peter Zijlstra 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 kernel/sched/core.c |   30 --
 kernel/sched/fair.c |1 +
 2 files changed, 13 insertions(+), 18 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5774,6 +5774,18 @@ int sched_cpu_activate(unsigned int cpu)
struct rq *rq = cpu_rq(cpu);
struct rq_flags rf;
 
+#ifdef CONFIG_SCHED_SMT
+   /*
+* The sched_smt_present static key needs to be evaluated on every
+* hotplug event because at boot time SMT might be disabled when
+* the number of booted CPUs is limited.
+*
+* If then later a sibling gets hotplugged, then the key would stay
+* off and SMT scheduling would never be functional.
+*/
+   if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
+   static_branch_enable_cpuslocked(&sched_smt_present);
+#endif
set_cpu_active(cpu, true);
 
if (sched_smp_initialized) {
@@ -5871,22 +5883,6 @@ int sched_cpu_dying(unsigned int cpu)
 }
 #endif
 
-#ifdef CONFIG_SCHED_SMT
-DEFINE_STATIC_KEY_FALSE(sched_smt_present);
-
-static void sched_init_smt(void)
-{
-   /*
-* We've enumerated all CPUs and will assume that if any CPU
-* has SMT siblings, CPU0 will too.
-*/
-   if (cpumask_weight(cpu_smt_mask(0)) > 1)
-   static_branch_enable(&sched_smt_present);
-}
-#else
-static inline void sched_init_smt(void) { }
-#endif
-
 void __init sched_init_smp(void)
 {
sched_init_numa();
@@ -5908,8 +5904,6 @@ void __init sched_init_smp(void)
init_sched_rt_class();
init_sched_dl_class();
 
-   sched_init_smt();
-
sched_smp_initialized = true;
 }
 
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6237,6 +6237,7 @@ static inline int find_idlest_cpu(struct
 }
 
 #ifdef CONFIG_SCHED_SMT
+DEFINE_STATIC_KEY_FALSE(sched_smt_present);
 
 static inline void set_idle_cores(int cpu, int val)
 {




[PATCH 4.18 12/79] x86/speculation/l1tf: Limit swap file size to MAX_PA/2

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

For the L1TF workaround it's necessary to limit the swap file size to below
MAX_PA/2, so that the inverted higher bits of the swap offset never point
to valid memory.

Add a mechanism for the architecture to override the swap file size check
in swapfile.c and add an x86-specific max swapfile check function that
enforces that limit.

The check is only enabled if the CPU is vulnerable to L1TF.

In VMs with 42bit MAX_PA the typical limit is 2TB now, on a native system
with 46bit PA it is 32TB. The limit is only per individual swap file, so
it's always possible to exceed these limits with multiple swap files or
partitions.
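
The 2TB/32TB figures follow directly from MAX_PA/2; a small arithmetic sketch
(plain userspace C, not kernel code):

  #include <stdio.h>

  int main(void)
  {
  	const unsigned int max_pa_bits[] = { 42, 46 };

  	for (unsigned int i = 0; i < 2; i++) {
  		/* MAX_PA/2 expressed in bytes, printed in TB */
  		unsigned long long limit = 1ULL << (max_pa_bits[i] - 1);
  		printf("%u-bit MAX_PA -> per-swapfile limit %llu TB\n",
  		       max_pa_bits[i], limit >> 40);
  	}
  	return 0;
  }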

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/mm/init.c   |   15 +++
 include/linux/swapfile.h |2 ++
 mm/swapfile.c|   46 ++
 3 files changed, 47 insertions(+), 16 deletions(-)

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -4,6 +4,8 @@
 #include 
 #include 
 #include  /* for max_low_pfn */
+#include 
+#include 
 
 #include 
 #include 
@@ -880,3 +882,16 @@ void update_cache_mode_entry(unsigned en
__cachemode2pte_tbl[cache] = __cm_idx2pte(entry);
__pte2cachemode_tbl[entry] = cache;
 }
+
+unsigned long max_swapfile_size(void)
+{
+   unsigned long pages;
+
+   pages = generic_max_swapfile_size();
+
+   if (boot_cpu_has_bug(X86_BUG_L1TF)) {
+   /* Limit the swap file size to MAX_PA/2 for L1TF workaround */
+   pages = min_t(unsigned long, l1tf_pfn_limit() + 1, pages);
+   }
+   return pages;
+}
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -10,5 +10,7 @@ extern spinlock_t swap_lock;
 extern struct plist_head swap_active_head;
 extern struct swap_info_struct *swap_info[];
 extern int try_to_unuse(unsigned int, bool, unsigned long);
+extern unsigned long generic_max_swapfile_size(void);
+extern unsigned long max_swapfile_size(void);
 
 #endif /* _LINUX_SWAPFILE_H */
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2909,6 +2909,35 @@ static int claim_swapfile(struct swap_in
return 0;
 }
 
+
+/*
+ * Find out how many pages are allowed for a single swap device. There
+ * are two limiting factors:
+ * 1) the number of bits for the swap offset in the swp_entry_t type, and
+ * 2) the number of bits in the swap pte, as defined by the different
+ * architectures.
+ *
+ * In order to find the largest possible bit mask, a swap entry with
+ * swap type 0 and swap offset ~0UL is created, encoded to a swap pte,
+ * decoded to a swp_entry_t again, and finally the swap offset is
+ * extracted.
+ *
+ * This will mask all the bits from the initial ~0UL mask that can't
+ * be encoded in either the swp_entry_t or the architecture definition
+ * of a swap pte.
+ */
+unsigned long generic_max_swapfile_size(void)
+{
+   return swp_offset(pte_to_swp_entry(
+   swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+}
+
+/* Can be overridden by an architecture for additional checks. */
+__weak unsigned long max_swapfile_size(void)
+{
+   return generic_max_swapfile_size();
+}
+
 static unsigned long read_swap_header(struct swap_info_struct *p,
union swap_header *swap_header,
struct inode *inode)
@@ -2944,22 +2973,7 @@ static unsigned long read_swap_header(st
p->cluster_next = 1;
p->cluster_nr = 0;
 
-   /*
-* Find out how many pages are allowed for a single swap
-* device. There are two limiting factors: 1) the number
-* of bits for the swap offset in the swp_entry_t type, and
-* 2) the number of bits in the swap pte as defined by the
-* different architectures. In order to find the
-* largest possible bit mask, a swap entry with swap type 0
-* and swap offset ~0UL is created, encoded to a swap pte,
-* decoded to a swp_entry_t again, and finally the swap
-* offset is extracted. This will mask all the bits from
-* the initial ~0UL mask that can't be encoded in either
-* the swp_entry_t or the architecture definition of a
-* swap pte.
-*/
-   maxpages = swp_offset(pte_to_swp_entry(
-   swp_entry_to_pte(swp_entry(0, ~0UL)))) + 1;
+   maxpages = max_swapfile_size();
last_page = swap_header->info.last_page;
if (!last_page) {
pr_warn("Empty swap-file\n");




[PATCH 4.18 17/79] cpu/hotplug: Make bringup/teardown of smp threads symmetric

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

The asymmetry caused a warning to trigger if the bootup was stopped in state
CPUHP_AP_ONLINE_IDLE. The warning no longer triggers as kthread_park() can
now be invoked on already or still parked threads. But there is still no
reason to have this be asymmetric.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 kernel/cpu.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -754,7 +754,6 @@ static int takedown_cpu(unsigned int cpu
 
/* Park the smpboot threads */
kthread_park(per_cpu_ptr(&cpuhp_state, cpu)->thread);
-   smpboot_park_threads(cpu);
 
/*
 * Prevent irq alloc/free while the dying cpu reorganizes the
@@ -1332,7 +1331,7 @@ static struct cpuhp_step cpuhp_hp_states
[CPUHP_AP_SMPBOOT_THREADS] = {
.name   = "smpboot/threads:online",
.startup.single = smpboot_unpark_threads,
-   .teardown.single= NULL,
+   .teardown.single= smpboot_park_threads,
},
[CPUHP_AP_IRQ_AFFINITY_ONLINE] = {
.name   = "irq/affinity:online",




[PATCH 4.18 13/79] x86/bugs: Move the l1tf function and define pr_fmt properly

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Konrad Rzeszutek Wilk 

The pr_warn in l1tf_select_mitigation would have used the prior pr_fmt
which was defined as "Spectre V2 : ".

Move the function to be past SSBD and also define the pr_fmt.
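
For reference, a userspace analogue (sketch only, with stand-in macros) of how
the new pr_fmt prefix ends up in front of every pr_warn() in this block:

  #include <stdio.h>

  /* userspace stand-ins for the kernel printk macros */
  #define pr_fmt(fmt)		"L1TF: " fmt
  #define pr_warn(fmt, ...)	printf(pr_fmt(fmt), ##__VA_ARGS__)

  int main(void)
  {
  	/* prints "L1TF: System has more than MAX_PA/2 memory. ..." instead
  	 * of picking up the earlier "Spectre V2 : " prefix */
  	pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation not effective.\n");
  	return 0;
  }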

Fixes: 17dbca119312 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Signed-off-by: Konrad Rzeszutek Wilk 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/bugs.c |   55 +++--
 1 file changed, 29 insertions(+), 26 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -211,32 +211,6 @@ static void x86_amd_ssb_disable(void)
wrmsrl(MSR_AMD64_LS_CFG, msrval);
 }
 
-static void __init l1tf_select_mitigation(void)
-{
-   u64 half_pa;
-
-   if (!boot_cpu_has_bug(X86_BUG_L1TF))
-   return;
-
-#if CONFIG_PGTABLE_LEVELS == 2
-   pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
-   return;
-#endif
-
-   /*
-* This is extremely unlikely to happen because almost all
-* systems have far more MAX_PA/2 than RAM can be fit into
-* DIMM slots.
-*/
-   half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
-   if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
-   pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation 
not effective.\n");
-   return;
-   }
-
-   setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
-}
-
 #ifdef RETPOLINE
 static bool spectre_v2_bad_module;
 
@@ -660,6 +634,35 @@ void x86_spec_ctrl_setup_ap(void)
x86_amd_ssb_disable();
 }
 
+#undef pr_fmt
+#define pr_fmt(fmt)"L1TF: " fmt
+static void __init l1tf_select_mitigation(void)
+{
+   u64 half_pa;
+
+   if (!boot_cpu_has_bug(X86_BUG_L1TF))
+   return;
+
+#if CONFIG_PGTABLE_LEVELS == 2
+   pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
+   return;
+#endif
+
+   /*
+* This is extremely unlikely to happen because almost all
+* systems have far more MAX_PA/2 than RAM can be fit into
+* DIMM slots.
+*/
+   half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
+   if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
+   pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation 
not effective.\n");
+   return;
+   }
+
+   setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
+}
+#undef pr_fmt
+
 #ifdef CONFIG_SYSFS
 
 static ssize_t cpu_show_common(struct device *dev, struct device_attribute *attr,




[PATCH 4.18 15/79] x86/smp: Provide topology_is_primary_thread()

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.

This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.

Preparatory patch to support disabling SMT at boot/runtime.
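
A worked example (userspace sketch only) of the mask logic added in
apic_id_is_primary_thread() below; with two siblings the SMT mask is 0x1, so
even APIC IDs denote primary threads:

  #include <stdio.h>

  /* fls()-style mask computation, mirroring the kernel helper below */
  static int is_primary(unsigned int apicid, unsigned int siblings)
  {
  	unsigned int mask;

  	if (siblings == 1)
  		return 1;
  	/* Isolate the SMT bit(s) in the APIC ID and check for 0 */
  	mask = (1U << (31 - __builtin_clz(siblings))) - 1;
  	return !(apicid & mask);
  }

  int main(void)
  {
  	for (unsigned int id = 0; id < 4; id++)
  		printf("APIC ID %u: %s thread\n", id,
  		       is_primary(id, 2) ? "primary" : "secondary");
  	return 0;
  }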

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/apic.h |6 ++
 arch/x86/include/asm/topology.h |4 +++-
 arch/x86/kernel/apic/apic.c |   15 +++
 arch/x86/kernel/smpboot.c   |9 +
 4 files changed, 33 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -502,6 +502,12 @@ extern int default_check_phys_apicid_pre
 
 #endif /* CONFIG_X86_LOCAL_APIC */
 
+#ifdef CONFIG_SMP
+bool apic_id_is_primary_thread(unsigned int id);
+#else
+static inline bool apic_id_is_primary_thread(unsigned int id) { return false; }
+#endif
+
 extern void irq_enter(void);
 extern void irq_exit(void);
 
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -123,13 +123,15 @@ static inline int topology_max_smt_threa
 }
 
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
-extern int topology_phys_to_logical_pkg(unsigned int pkg);
+int topology_phys_to_logical_pkg(unsigned int pkg);
+bool topology_is_primary_thread(unsigned int cpu);
 #else
 #define topology_max_packages()(1)
 static inline int
 topology_update_package_map(unsigned int apicid, unsigned int cpu) { return 0; }
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_max_smt_threads(void) { return 1; }
+static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
 #endif
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2192,6 +2192,21 @@ static int cpuid_to_apicid[] = {
[0 ... NR_CPUS - 1] = -1,
 };
 
+/**
+ * apic_id_is_primary_thread - Check whether APIC ID belongs to a primary thread
+ * @id:APIC ID to check
+ */
+bool apic_id_is_primary_thread(unsigned int apicid)
+{
+   u32 mask;
+
+   if (smp_num_siblings == 1)
+   return true;
+   /* Isolate the SMT bit(s) in the APICID and check for 0 */
+   mask = (1U << (fls(smp_num_siblings) - 1)) - 1;
+   return !(apicid & mask);
+}
+
 /*
  * Should use this API to allocate logical CPU IDs to keep nr_logical_cpuids
  * and cpuid_to_apicid[] synchronized.
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -271,6 +271,15 @@ static void notrace start_secondary(void
 }
 
 /**
+ * topology_is_primary_thread - Check whether CPU is the primary SMT thread
+ * @cpu:   CPU to check
+ */
+bool topology_is_primary_thread(unsigned int cpu)
+{
+   return apic_id_is_primary_thread(per_cpu(x86_cpu_to_apicid, cpu));
+}
+
+/**
  * topology_phys_to_logical_pkg - Map a physical package id to a logical
  *
  * Returns logical package id or -1 if not found




[PATCH 4.18 16/79] x86/topology: Provide topology_smt_supported()

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

Provide information about whether SMT is supported by the CPUs. Preparatory patch
for SMT control mechanism.

Suggested-by: Dave Hansen 
Signed-off-by: Thomas Gleixner 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/topology.h |2 ++
 arch/x86/kernel/smpboot.c   |8 
 2 files changed, 10 insertions(+)

--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -125,6 +125,7 @@ static inline int topology_max_smt_threa
 int topology_update_package_map(unsigned int apicid, unsigned int cpu);
 int topology_phys_to_logical_pkg(unsigned int pkg);
 bool topology_is_primary_thread(unsigned int cpu);
+bool topology_smt_supported(void);
 #else
 #define topology_max_packages()(1)
 static inline int
@@ -132,6 +133,7 @@ topology_update_package_map(unsigned int
 static inline int topology_phys_to_logical_pkg(unsigned int pkg) { return 0; }
 static inline int topology_max_smt_threads(void) { return 1; }
 static inline bool topology_is_primary_thread(unsigned int cpu) { return true; }
+static inline bool topology_smt_supported(void) { return false; }
 #endif
 
 static inline void arch_fix_phys_package_id(int num, u32 slot)
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -280,6 +280,14 @@ bool topology_is_primary_thread(unsigned
 }
 
 /**
+ * topology_smt_supported - Check whether SMT is supported by the CPUs
+ */
+bool topology_smt_supported(void)
+{
+   return smp_num_siblings > 1;
+}
+
+/**
  * topology_phys_to_logical_pkg - Map a physical package id to a logical
  *
  * Returns logical package id or -1 if not found




[PATCH 4.18 20/79] x86/cpu: Remove the pointless CPU printout

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

The value of this printout is dubious at best and there is no point in
having it in two different places along with convoluted ways to reach it.

Remove it completely.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/common.c   |   20 +---
 arch/x86/kernel/cpu/topology.c |   10 --
 2 files changed, 5 insertions(+), 25 deletions(-)

--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -666,13 +666,12 @@ void detect_ht(struct cpuinfo_x86 *c)
 #ifdef CONFIG_SMP
u32 eax, ebx, ecx, edx;
int index_msb, core_bits;
-   static bool printed;
 
if (!cpu_has(c, X86_FEATURE_HT))
return;
 
if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
-   goto out;
+   return;
 
if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
return;
@@ -681,14 +680,14 @@ void detect_ht(struct cpuinfo_x86 *c)
 
smp_num_siblings = (ebx & 0xff) >> 16;
 
+   if (!smp_num_siblings)
+   smp_num_siblings = 1;
+
if (smp_num_siblings == 1) {
pr_info_once("CPU0: Hyper-Threading is disabled\n");
-   goto out;
+   return;
}
 
-   if (smp_num_siblings <= 1)
-   goto out;
-
index_msb = get_count_order(smp_num_siblings);
c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
 
@@ -700,15 +699,6 @@ void detect_ht(struct cpuinfo_x86 *c)
 
c->cpu_core_id = apic->phys_pkg_id(c->initial_apicid, index_msb) &
   ((1 << core_bits) - 1);
-
-out:
-   if (!printed && (c->x86_max_cores * smp_num_siblings) > 1) {
-   pr_info("CPU: Physical Processor ID: %d\n",
-   c->phys_proc_id);
-   pr_info("CPU: Processor Core ID: %d\n",
-   c->cpu_core_id);
-   printed = 1;
-   }
 #endif
 }
 
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -33,7 +33,6 @@ int detect_extended_topology(struct cpui
unsigned int eax, ebx, ecx, edx, sub_index;
unsigned int ht_mask_width, core_plus_mask_width;
unsigned int core_select_mask, core_level_siblings;
-   static bool printed;
 
if (c->cpuid_level < 0xb)
return -1;
@@ -86,15 +85,6 @@ int detect_extended_topology(struct cpui
c->apicid = apic->phys_pkg_id(c->initial_apicid, 0);
 
c->x86_max_cores = (core_level_siblings / smp_num_siblings);
-
-   if (!printed) {
-   pr_info("CPU: Physical Processor ID: %d\n",
-  c->phys_proc_id);
-   if (c->x86_max_cores > 1)
-   pr_info("CPU: Processor Core ID: %d\n",
-  c->cpu_core_id);
-   printed = 1;
-   }
 #endif
return 0;
 }




[PATCH 4.18 21/79] x86/cpu/AMD: Remove the pointless detect_ht() call

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

Real 32bit AMD CPUs do not have SMT and the only value of the call was to
reach the magic printout which got removed.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/amd.c |4 
 1 file changed, 4 deletions(-)

--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -857,10 +857,6 @@ static void init_amd(struct cpuinfo_x86
srat_detect_node(c);
}
 
-#ifdef CONFIG_X86_32
-   detect_ht(c);
-#endif
-
init_amd_cacheinfo(c);
 
if (c->x86 >= 0xf)




[PATCH 4.18 22/79] x86/cpu/common: Provide detect_ht_early()

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

To support force disabling of SMT it's required to know the number of
thread siblings early. detect_ht() cannot be called before the APIC driver
is selected, so split out the part which initializes smp_num_siblings.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/common.c |   24 ++--
 arch/x86/kernel/cpu/cpu.h|1 +
 2 files changed, 15 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -661,32 +661,36 @@ static void cpu_detect_tlb(struct cpuinf
tlb_lld_4m[ENTRIES], tlb_lld_1g[ENTRIES]);
 }
 
-void detect_ht(struct cpuinfo_x86 *c)
+int detect_ht_early(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_SMP
u32 eax, ebx, ecx, edx;
-   int index_msb, core_bits;
 
if (!cpu_has(c, X86_FEATURE_HT))
-   return;
+   return -1;
 
if (cpu_has(c, X86_FEATURE_CMP_LEGACY))
-   return;
+   return -1;
 
if (cpu_has(c, X86_FEATURE_XTOPOLOGY))
-   return;
+   return -1;
 
cpuid(1, &eax, &ebx, &ecx, &edx);
 
smp_num_siblings = (ebx & 0xff) >> 16;
+   if (smp_num_siblings == 1)
+   pr_info_once("CPU0: Hyper-Threading is disabled\n");
+#endif
+   return 0;
+}
 
-   if (!smp_num_siblings)
-   smp_num_siblings = 1;
+void detect_ht(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+   int index_msb, core_bits;
 
-   if (smp_num_siblings == 1) {
-   pr_info_once("CPU0: Hyper-Threading is disabled\n");
+   if (detect_ht_early(c) < 0)
return;
-   }
 
index_msb = get_count_order(smp_num_siblings);
c->phys_proc_id = apic->phys_pkg_id(c->initial_apicid, index_msb);
--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -56,6 +56,7 @@ extern void init_amd_cacheinfo(struct cp
 
 extern void detect_num_cpu_cores(struct cpuinfo_x86 *c);
 extern int detect_extended_topology(struct cpuinfo_x86 *c);
+extern int detect_ht_early(struct cpuinfo_x86 *c);
 extern void detect_ht(struct cpuinfo_x86 *c);
 
 unsigned int aperfmperf_get_khz(int cpu);




[PATCH 4.18 24/79] x86/cpu/intel: Evaluate smp_num_siblings early

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

Make use of the new early detection function to initialize smp_num_siblings
on the boot cpu before the MP-Table or ACPI/MADT scan happens. That's
required for force disabling SMT.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/intel.c |7 +++
 1 file changed, 7 insertions(+)

--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -301,6 +301,13 @@ static void early_init_intel(struct cpui
}
 
check_mpx_erratum(c);
+
+   /*
+* Get the number of SMT siblings early from the extended topology
+* leaf, if available. Otherwise try the legacy SMT detection.
+*/
+   if (detect_extended_topology_early(c) < 0)
+   detect_ht_early(c);
 }
 
 #ifdef CONFIG_X86_32




[PATCH 4.18 23/79] x86/cpu/topology: Provide detect_extended_topology_early()

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

To support force disabling of SMT it's required to know the number of
thread siblings early. detect_extended_topology() cannot be called before
the APIC driver is selected, so split out the part which initializes
smp_num_siblings.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/cpu.h  |1 +
 arch/x86/kernel/cpu/topology.c |   31 ++-
 2 files changed, 23 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/cpu/cpu.h
+++ b/arch/x86/kernel/cpu/cpu.h
@@ -55,6 +55,7 @@ extern void init_intel_cacheinfo(struct
 extern void init_amd_cacheinfo(struct cpuinfo_x86 *c);
 
 extern void detect_num_cpu_cores(struct cpuinfo_x86 *c);
+extern int detect_extended_topology_early(struct cpuinfo_x86 *c);
 extern int detect_extended_topology(struct cpuinfo_x86 *c);
 extern int detect_ht_early(struct cpuinfo_x86 *c);
 extern void detect_ht(struct cpuinfo_x86 *c);
--- a/arch/x86/kernel/cpu/topology.c
+++ b/arch/x86/kernel/cpu/topology.c
@@ -22,17 +22,10 @@
 #define BITS_SHIFT_NEXT_LEVEL(eax) ((eax) & 0x1f)
 #define LEVEL_MAX_SIBLINGS(ebx)((ebx) & 0xffff)
 
-/*
- * Check for extended topology enumeration cpuid leaf 0xb and if it
- * exists, use it for populating initial_apicid and cpu topology
- * detection.
- */
-int detect_extended_topology(struct cpuinfo_x86 *c)
+int detect_extended_topology_early(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_SMP
-   unsigned int eax, ebx, ecx, edx, sub_index;
-   unsigned int ht_mask_width, core_plus_mask_width;
-   unsigned int core_select_mask, core_level_siblings;
+   unsigned int eax, ebx, ecx, edx;
 
if (c->cpuid_level < 0xb)
return -1;
@@ -51,10 +44,30 @@ int detect_extended_topology(struct cpui
 * initial apic id, which also represents 32-bit extended x2apic id.
 */
c->initial_apicid = edx;
+   smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
+#endif
+   return 0;
+}
+
+/*
+ * Check for extended topology enumeration cpuid leaf 0xb and if it
+ * exists, use it for populating initial_apicid and cpu topology
+ * detection.
+ */
+int detect_extended_topology(struct cpuinfo_x86 *c)
+{
+#ifdef CONFIG_SMP
+   unsigned int eax, ebx, ecx, edx, sub_index;
+   unsigned int ht_mask_width, core_plus_mask_width;
+   unsigned int core_select_mask, core_level_siblings;
+
+   if (detect_extended_topology_early(c) < 0)
+   return -1;
 
/*
 * Populate HT related information from sub-leaf level 0.
 */
+   cpuid_count(0xb, SMT_LEVEL, &eax, &ebx, &ecx, &edx);
core_level_siblings = smp_num_siblings = LEVEL_MAX_SIBLINGS(ebx);
core_plus_mask_width = ht_mask_width = BITS_SHIFT_NEXT_LEVEL(eax);
 




[PATCH 4.18 25/79] x86/CPU/AMD: Do not check CPUID max ext level before parsing SMP info

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Borislav Petkov 

Old code used to check whether CPUID ext max level is >= 0x80000008 because
that last leaf contains the number of cores of the physical CPU.  The three
functions called there now do not depend on that leaf anymore so the check
can go.

Signed-off-by: Borislav Petkov 
Signed-off-by: Thomas Gleixner 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/amd.c |9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -850,12 +850,9 @@ static void init_amd(struct cpuinfo_x86
 
cpu_detect_cache_sizes(c);
 
-   /* Multi core CPU? */
-   if (c->extended_cpuid_level >= 0x80000008) {
-   amd_detect_cmp(c);
-   amd_get_topology(c);
-   srat_detect_node(c);
-   }
+   amd_detect_cmp(c);
+   amd_get_topology(c);
+   srat_detect_node(c);
 
init_amd_cacheinfo(c);
 




[PATCH 4.18 28/79] x86/speculation/l1tf: Extend 64bit swap file size limit

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Vlastimil Babka 

The previous patch has limited swap file size so that large offsets cannot
clear bits above MAX_PA/2 in the pte and interfere with L1TF mitigation.

It assumed that offsets are encoded starting with bit 12, same as pfn. But
on x86_64, offsets are encoded starting with bit 9.

Thus the limit can be raised by 3 bits. That means 16TB with 42bit MAX_PA
and 256TB with 46bit MAX_PA.
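
A sketch of the same arithmetic with the three extra bits applied (assuming
PAGE_SHIFT 12 and swap offsets starting at bit 9, as stated above):

  #include <stdio.h>

  int main(void)
  {
  	const unsigned int shift = 12 - 9;	/* PAGE_SHIFT - first offset bit */
  	const unsigned int max_pa_bits[] = { 42, 46 };

  	for (unsigned int i = 0; i < 2; i++) {
  		/* previous MAX_PA/2 limit in TB, then the raised limit */
  		unsigned long long old_tb = (1ULL << (max_pa_bits[i] - 1)) >> 40;
  		printf("%u-bit MAX_PA: %llu TB -> %llu TB\n",
  		       max_pa_bits[i], old_tb, old_tb << shift);
  	}
  	return 0;
  }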

Fixes: 377eeaa8e11f ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Signed-off-by: Vlastimil Babka 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/mm/init.c |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -891,7 +891,15 @@ unsigned long max_swapfile_size(void)
 
if (boot_cpu_has_bug(X86_BUG_L1TF)) {
/* Limit the swap file size to MAX_PA/2 for L1TF workaround */
-   pages = min_t(unsigned long, l1tf_pfn_limit() + 1, pages);
+   unsigned long l1tf_limit = l1tf_pfn_limit() + 1;
+   /*
+* We encode swap offsets also with 3 bits below those for pfn
+* which makes the usable limit higher.
+*/
+#ifdef CONFIG_X86_64
+   l1tf_limit <<= PAGE_SHIFT - SWP_OFFSET_FIRST_BIT;
+#endif
+   pages = min_t(unsigned long, l1tf_limit, pages);
}
return pages;
 }




[PATCH 4.18 26/79] x86/cpu/AMD: Evaluate smp_num_siblings early

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

To support force disabling of SMT it's required to know the number of
thread siblings early. amd_get_topology() cannot be called before the APIC
driver is selected, so split out the part which initializes
smp_num_siblings and invoke it from amd_early_init().

Signed-off-by: Thomas Gleixner 
Acked-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/amd.c |   13 +
 1 file changed, 13 insertions(+)

--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -315,6 +315,17 @@ static void legacy_fixup_core_id(struct
c->cpu_core_id %= cus_per_node;
 }
 
+
+static void amd_get_topology_early(struct cpuinfo_x86 *c)
+{
+   if (boot_cpu_has(X86_FEATURE_TOPOEXT)) {
+   u32 eax, ebx, ecx, edx;
+
+   cpuid(0x8000001e, &eax, &ebx, &ecx, &edx);
+   smp_num_siblings = ((ebx >> 8) & 0xff) + 1;
+   }
+}
+
 /*
  * Fixup core topology information for
  * (1) AMD multi-node processors
@@ -683,6 +694,8 @@ static void early_init_amd(struct cpuinf
set_cpu_bug(c, X86_BUG_AMD_E400);
 
early_detect_mem_encrypt(c);
+
+   amd_get_topology_early(c);
 }
 
 static void init_amd_k8(struct cpuinfo_x86 *c)




Re: [PATCH RFC] Make call_srcu() available during very early boot

2018-08-14 Thread Steven Rostedt
On Tue, 14 Aug 2018 10:06:18 -0700
"Paul E. McKenney"  wrote:


> > >  #define __SRCU_STRUCT_INIT(name, pcpu_name)  
> > > \
> > > - {   \
> > > - .sda = &pcpu_name,  \
> > > - .lock = __SPIN_LOCK_UNLOCKED(name.lock),\
> > > - .srcu_gp_seq_needed = 0 - 1,\
> > > - __SRCU_DEP_MAP_INIT(name)   \
> > > - }
> > > +{
> > > \
> > > + .sda = &pcpu_name,  \
> > > + .lock = __SPIN_LOCK_UNLOCKED(name.lock),\
> > > + .srcu_gp_seq_needed = 0 - 1,\  
> > 
> > Interesting initialization of -1. This was there before, but still
> > interesting none the less.  
> 
> If I recall correctly, this subterfuge suppresses compiler complaints
> about initializing an unsigned long with a negative number.  :-/

Did you try:

.srcu_gp_seq_needed = -1UL,

?

-- Steve



[PATCH 4.17 14/97] make sure that __dentry_kill() always invalidates d_seq, unhashed or not

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab upstream.

RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq.  That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all.  Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.

We could try and be careful in the (few) places where that could
happen.  Or we could just make the final dput() invalidate the damn
thing, unhashed or not.  The latter is much simpler and easier to
backport, so let's do it that way.

Reported-by: "Dae R. Jeong" 
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/dcache.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -358,14 +358,11 @@ static void dentry_unlink_inode(struct d
__releases(dentry->d_inode->i_lock)
 {
struct inode *inode = dentry->d_inode;
-   bool hashed = !d_unhashed(dentry);
 
-   if (hashed)
-   raw_write_seqcount_begin(&dentry->d_seq);
+   raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
hlist_del_init(>d_u.d_alias);
-   if (hashed)
-   raw_write_seqcount_end(&dentry->d_seq);
+   raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
if (!inode->i_nlink)




[PATCH 4.17 13/97] root dentries need RCU-delayed freeing

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream.

Since mountpoint crossing can happen without leaving lazy mode,
root dentries do need the same protection against having their
memory freed without RCU delay as everything else in the tree.

It's partially hidden by RCU delay between detaching from the
mount tree and dropping the vfsmount reference, but the starting
point of pathwalk can be on an already detached mount, in which
case umount-caused RCU delay has already passed by the time the
lazy pathwalk grabs rcu_read_lock().  If the starting point
happens to be at the root of that vfsmount *and* that vfsmount
covers the entire filesystem, we get trouble.

Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/dcache.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1954,10 +1954,12 @@ struct dentry *d_make_root(struct inode
 
if (root_inode) {
res = d_alloc_anon(root_inode->i_sb);
-   if (res)
+   if (res) {
+   res->d_flags |= DCACHE_RCUACCESS;
d_instantiate(res, root_inode);
-   else
+   } else {
iput(root_inode);
+   }
}
return res;
 }




[PATCH 4.17 04/97] stop_machine: Disable preemption after queueing stopper threads

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Isaac J. Manjarres 

commit 2610e88946632afb78aa58e61f11368ac4c0af7b upstream.

This commit:

  9fb8d5dc4b64 ("stop_machine, Disable preemption when waking two stopper 
threads")

does not fully address the race condition that can occur
as follows:

On one CPU, call it CPU 3, thread 1 invokes
cpu_stop_queue_two_works(2, 3,...), and the execution is such
that thread 1 queues the works for migration/2 and migration/3,
and is preempted after releasing the locks for migration/2 and
migration/3, but before waking the threads.

Then, on CPU 2, a kworker, call it thread 2, is running,
and it invokes cpu_stop_queue_two_works(1, 2,...), such that
thread 2 queues the works for migration/1 and migration/2.
Meanwhile, on CPU 3, thread 1 resumes execution, and wakes
migration/2 and migration/3. This means that when CPU 2
releases the locks for migration/1 and migration/2, but before
it wakes those threads, it can be preempted by migration/2.

If thread 2 is preempted by migration/2, then migration/2 will
execute the first work item successfully, since migration/3
was woken up by CPU 3, but when it goes to execute the second
work item, it disables preemption, calls multi_cpu_stop(),
and thus, CPU 2 will wait forever for migration/1, which should
have been woken up by thread 2. However migration/1 cannot be
woken up by thread 2, since it is a kworker, so it is affine to
CPU 2, but CPU 2 is running migration/2 with preemption
disabled, so thread 2 will never run.

Disable preemption after queueing works for stopper threads
to ensure that the operation of queueing the works and waking
the stopper threads is atomic.

Co-Developed-by: Prasad Sodagudi 
Co-Developed-by: Pavankumar Kondeti 
Signed-off-by: Isaac J. Manjarres 
Signed-off-by: Prasad Sodagudi 
Signed-off-by: Pavankumar Kondeti 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: bige...@linutronix.de
Cc: gre...@linuxfoundation.org
Cc: m...@codeblueprint.co.uk
Fixes: 9fb8d5dc4b64 ("stop_machine, Disable preemption when waking two stopper 
threads")
Link: 
http://lkml.kernel.org/r/1531856129-9871-1-git-send-email-isa...@codeaurora.org
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 

---
 kernel/stop_machine.c |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -260,6 +260,15 @@ retry:
err = 0;
__cpu_stop_queue_work(stopper1, work1, &wakeq);
__cpu_stop_queue_work(stopper2, work2, &wakeq);
+   /*
+* The waking up of stopper threads has to happen
+* in the same scheduling context as the queueing.
+* Otherwise, there is a possibility of one of the
+* above stoppers being woken up by another CPU,
+* and preempting us. This will cause us to not
+* wake up the other stopper forever.
+*/
+   preempt_disable();
 unlock:
raw_spin_unlock(&stopper2->lock);
raw_spin_unlock_irq(&stopper1->lock);
@@ -271,7 +280,6 @@ unlock:
}
 
if (!err) {
-   preempt_disable();
wake_up_q(&wakeq);
preempt_enable();
}




[PATCH 4.17 00/97] 4.17.15-stable review

2018-08-14 Thread Greg Kroah-Hartman
This is the start of the stable review cycle for the 4.17.15 release.
There are 97 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Thu Aug 16 17:14:15 UTC 2018.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:

https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.17.15-rc1.gz
or in the git tree and branch at:

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
linux-4.17.y
and the diffstat can be found below.

thanks,

greg k-h

-
Pseudo-Shortlog of commits:

Greg Kroah-Hartman 
Linux 4.17.15-rc1

Josh Poimboeuf 
x86/microcode: Allow late microcode loading with SMT disabled

David Woodhouse 
tools headers: Synchronise x86 cpufeatures.h for L1TF additions

Arnaldo Carvalho de Melo 
tools headers: Synchronize prctl.h ABI header

Andi Kleen 
x86/mm/kmmio: Make the tracer robust against L1TF

Andi Kleen 
x86/mm/pat: Make set_memory_np() L1TF safe

Andi Kleen 
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert

Andi Kleen 
x86/speculation/l1tf: Invert all not present mappings

Thomas Gleixner 
cpu/hotplug: Fix SMT supported evaluation

Paolo Bonzini 
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry

Paolo Bonzini 
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry

Paolo Bonzini 
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability

Thomas Gleixner 
Documentation/l1tf: Remove Yonah processors from not vulnerable list

Nicolai Stange 
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()

Nicolai Stange 
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d

Nicolai Stange 
x86: Don't include linux/irq.h from asm/hardirq.h

Nicolai Stange 
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d

Nicolai Stange 
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16

Nicolai Stange 
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()

Nicolai Stange 
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'

Nicolai Stange 
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()

Josh Poimboeuf 
cpu/hotplug: detect SMT disabled by BIOS

Tony Luck 
Documentation/l1tf: Fix typos

Nicolai Stange 
x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content

Jiri Kosina 
x86/speculation/l1tf: Unbreak !__HAVE_ARCH_PFN_MODIFY_ALLOWED architectures

Thomas Gleixner 
Documentation: Add section about CPU vulnerabilities

Jiri Kosina 
x86/bugs, kvm: Introduce boot-time control of L1TF mitigations

Thomas Gleixner 
cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early

Jiri Kosina 
cpu/hotplug: Expose SMT control init function

Thomas Gleixner 
x86/kvm: Allow runtime control of L1D flush

Thomas Gleixner 
x86/kvm: Serialize L1D flush parameter setter

Thomas Gleixner 
x86/kvm: Add static key for flush always

Thomas Gleixner 
x86/kvm: Move l1tf setup function

Thomas Gleixner 
x86/l1tf: Handle EPT disabled state proper

Thomas Gleixner 
x86/kvm: Drop L1TF MSR list approach

Thomas Gleixner 
x86/litf: Introduce vmx status variable

Thomas Gleixner 
cpu/hotplug: Online siblings when SMT control is turned on

Konrad Rzeszutek Wilk 
x86/KVM/VMX: Use MSR save list for IA32_FLUSH_CMD if required

Konrad Rzeszutek Wilk 
x86/KVM/VMX: Extend add_atomic_switch_msr() to allow VMENTER only MSRs

Konrad Rzeszutek Wilk 
x86/KVM/VMX: Separate the VMX AUTOLOAD guest/host number accounting

Konrad Rzeszutek Wilk 
x86/KVM/VMX: Add find_msr() helper function

Konrad Rzeszutek Wilk 
x86/KVM/VMX: Split the VMX MSR LOAD structures to have an host/guest numbers

Paolo Bonzini 
x86/KVM/VMX: Add L1D flush logic

Paolo Bonzini 
x86/KVM/VMX: Add L1D MSR based flush

Paolo Bonzini 
x86/KVM/VMX: Add L1D flush algorithm

Konrad Rzeszutek Wilk 
x86/KVM/VMX: Add module argument for L1TF mitigation

Konrad Rzeszutek Wilk 
x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present

Thomas Gleixner 
cpu/hotplug: Boot HT siblings at least once

Thomas Gleixner 
Revert "x86/apic: Ignore secondary threads if nosmt=force"

Michal Hocko 
x86/speculation/l1tf: Fix up pte->pfn conversion for PAE

Vlastimil Babka 
x86/speculation/l1tf: Protect PAE swap entries against L1TF

Borislav Petkov 
x86/CPU/AMD: Move TOPOEXT reenablement before reading smp_num_siblings

Konrad Rzeszutek Wilk 
x86/cpufeatures: Add detection of L1D cache flush support.

Vlastimil Babka 
x86/speculation/l1tf: Extend 64bit swap file size limit

Thomas Gleixner 
x86/apic: Ignore secondary threads if nosmt=force

Thomas Gleixner 
x86/cpu/AMD: Evaluate smp_num_siblings early

Borislav Petkov 
x86/CPU/AMD: Do not check 

Re: [f2fs-dev] [PATCH v3] f2fs: fix performance issue observed with multi-thread sequential read

2018-08-14 Thread Jaegeuk Kim
On 08/14, Chao Yu wrote:
> On 2018/8/14 12:04, Jaegeuk Kim wrote:
> > On 08/14, Chao Yu wrote:
> >> On 2018/8/14 4:11, Jaegeuk Kim wrote:
> >>> On 08/13, Chao Yu wrote:
>  Hi Jaegeuk,
> 
>  On 2018/8/11 2:56, Jaegeuk Kim wrote:
> > This reverts the commit - "b93f771 - f2fs: remove writepages lock"
> > to fix the drop in sequential read throughput.
> >
> > Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
> > device: UFS
> >
> > Before -
> > read throughput: 185 MB/s
> > total read requests: 85177 (of these ~8 are 4KB size requests).
> > total write requests: 2546 (of these ~2208 requests are written in 
> > 512KB).
> >
> > After -
> > read throughput: 758 MB/s
> > total read requests: 2417 (of these ~2042 are 512KB reads).
> > total write requests: 2701 (of these ~2034 requests are written in 
> > 512KB).
> 
>  IMO, it only impacts sequential read performance in a large file which
>  may be
>  fragmented during multi-thread writing.
> 
>  In android environment, mostly, the large file should be cold type, such 
>  as apk,
>  mp3, rmvb, jpeg..., so I think we only need to serialize writepages() 
>  for cold
>  data area writer.
> 
>  So how about adding a mount option to serialize writepage() for 
>  different type
>  of log, e.g. in android, using serialize=4; by default, using serialize=7
>  HOT_DATA 1
>  WARM_DATA 2
>  COLD_DATA 4
> >>>
> >>> Well, I don't think we need to give too many mount options for this 
> >>> fragmented
> >>> case. How about doing this for the large files only like this?
> >>
> >> Thread A write 512 pages   Thread B write 8 pages
> >>
> >> - writepages()
> >>  - mutex_lock(&sbi->writepages);
> >>   - writepage();
> >> ...
> >>- writepages()
> >> - writepage()
> >>  
> >>   - writepage();
> >> ...
> >>  - mutex_unlock(&sbi->writepages);
> >>
> >> Above case will also cause fragmentation since we didn't serialize all
> >> concurrent IO with the lock.
> >>
> >> Do we need to consider such case?
> > 
> > We can simply allow 512 and 8 in the same segment, which would not be a big
> > deal,
> > when considering starvation of Thread B.
> 
> Yeah, but in reality, there would be more threads competing in same log 
> header,
> so I worry that the effect of defragmenting will not so good as we expect,
> anyway, for benchmark, it's enough.

Basically, I think this is not a benchmark issue. :) It just reveals the issue
much more easily. Let me think about three cases:
1) WB_SYNC_NONE & WB_SYNC_NONE
 -> can simply use mutex_lock

2) WB_SYNC_ALL & WB_SYNC_NONE
 -> can use mutex_lock on WB_SYNC_ALL having >512 blocks, while WB_SYNC_NONE
will skip writing blocks

3) WB_SYNC_ALL & WB_SYNC_ALL
 -> can use mutex_lock on WB_SYNC_ALL having >512 blocks, in order to avoid
starvation.


I've been testing the below.

if (!S_ISDIR(inode->i_mode) && (wbc->sync_mode != WB_SYNC_ALL ||
get_dirty_pages(inode) <= SM_I(sbi)->min_seq_blocks)) {
mutex_lock(&sbi->writepages);
locked = true;
}

Thanks,

> 
> Thanks,
> 
> > 
> >>
> >> Thanks,
> >>
> >>>
> >>> >From 4fea0b6e4da8512a72dd52afc7a51beb35966ad9 Mon Sep 17 00:00:00 2001
> >>> From: Jaegeuk Kim 
> >>> Date: Thu, 9 Aug 2018 17:53:34 -0700
> >>> Subject: [PATCH] f2fs: fix performance issue observed with multi-thread
> >>>  sequential read
> >>>
> >>> This reverts the commit - "b93f771 - f2fs: remove writepages lock"
> >>> to fix the drop in sequential read throughput.
> >>>
> >>> Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
> >>> device: UFS
> >>>
> >>> Before -
> >>> read throughput: 185 MB/s
> >>> total read requests: 85177 (of these ~8 are 4KB size requests).
> >>> total write requests: 2546 (of these ~2208 requests are written in 512KB).
> >>>
> >>> After -
> >>> read throughput: 758 MB/s
> >>> total read requests: 2417 (of these ~2042 are 512KB reads).
> >>> total write requests: 2701 (of these ~2034 requests are written in 512KB).
> >>>
> >>> Signed-off-by: Sahitya Tummala 
> >>> Signed-off-by: Jaegeuk Kim 
> >>> ---
> >>>  Documentation/ABI/testing/sysfs-fs-f2fs |  8 
> >>>  fs/f2fs/data.c  | 10 ++
> >>>  fs/f2fs/f2fs.h  |  2 ++
> >>>  fs/f2fs/segment.c   |  1 +
> >>>  fs/f2fs/super.c |  1 +
> >>>  fs/f2fs/sysfs.c |  2 ++
> >>>  6 files changed, 24 insertions(+)
> >>>
> >>> diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs 
> >>> b/Documentation/ABI/testing/sysfs-fs-f2fs
> >>> index 9b0123388f18..94a24aedcdb2 100644
> >>> --- a/Documentation/ABI/testing/sysfs-fs-f2fs
> >>> +++ b/Documentation/ABI/testing/sysfs-fs-f2fs
> >>> @@ -51,6 +51,14 @@ 

[PATCH 4.17 17/97] ARM: dts: imx6sx: fix irq for pcie bridge

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Oleksij Rempel 

commit 1bcfe0564044be578841744faea1c2f46adc8178 upstream.

Use the correct IRQ line for the MSI controller in the PCIe host
controller. Apparently a different IRQ line is used compared to other
i.MX6 variants. Without this change MSI IRQs aren't properly propagated
to the upstream interrupt controller.

Signed-off-by: Oleksij Rempel 
Reviewed-by: Lucas Stach 
Fixes: b1d17f68e5c5 ("ARM: dts: imx: add initial imx6sx device tree source")
Signed-off-by: Shawn Guo 
Signed-off-by: Amit Pundir 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/arm/boot/dts/imx6sx.dtsi |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/arm/boot/dts/imx6sx.dtsi
+++ b/arch/arm/boot/dts/imx6sx.dtsi
@@ -1351,7 +1351,7 @@
ranges = <0x81000000 0 0          0x08f80000 0 0x00010000 /* downstream I/O */
  0x82000000 0 0x08000000 0x08000000 0 0x00f00000>; /* non-prefetchable memory */
num-lanes = <1>;
-   interrupts = ;
+   interrupts = ;
interrupt-names = "msi";
#interrupt-cells = <1>;
interrupt-map-mask = <0 0 0 0x7>;




[PATCH 4.17 19/97] x86/speculation: Protect against userspace-userspace spectreRSB

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Jiri Kosina 

commit fdf82a7856b32d905c39afc85e34364491e46346 upstream.

The article "Spectre Returns! Speculation Attacks using the Return Stack
Buffer" [1] describes two new (sub-)variants of spectrev2-like attacks,
making use solely of the RSB contents even on CPUs that don't fallback to
BTB on RSB underflow (Skylake+).

Mitigate userspace-userspace attacks by always unconditionally filling RSB on
context switch when the generic spectrev2 mitigation has been enabled.

[1] https://arxiv.org/pdf/1807.07940.pdf

Signed-off-by: Jiri Kosina 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Tim Chen 
Cc: Konrad Rzeszutek Wilk 
Cc: Borislav Petkov 
Cc: David Woodhouse 
Cc: Peter Zijlstra 
Cc: Linus Torvalds 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/nycvar.yfh.7.76.1807261308190@cbobk.fhfr.pm
Signed-off-by: Greg Kroah-Hartman 

---
 arch/x86/kernel/cpu/bugs.c |   38 +++---
 1 file changed, 7 insertions(+), 31 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -311,23 +311,6 @@ static enum spectre_v2_mitigation_cmd __
return cmd;
 }
 
-/* Check for Skylake-like CPUs (for RSB handling) */
-static bool __init is_skylake_era(void)
-{
-   if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
-   boot_cpu_data.x86 == 6) {
-   switch (boot_cpu_data.x86_model) {
-   case INTEL_FAM6_SKYLAKE_MOBILE:
-   case INTEL_FAM6_SKYLAKE_DESKTOP:
-   case INTEL_FAM6_SKYLAKE_X:
-   case INTEL_FAM6_KABYLAKE_MOBILE:
-   case INTEL_FAM6_KABYLAKE_DESKTOP:
-   return true;
-   }
-   }
-   return false;
-}
-
 static void __init spectre_v2_select_mitigation(void)
 {
enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
@@ -388,22 +371,15 @@ retpoline_auto:
pr_info("%s\n", spectre_v2_strings[mode]);
 
/*
-* If neither SMEP nor PTI are available, there is a risk of
-* hitting userspace addresses in the RSB after a context switch
-* from a shallow call stack to a deeper one. To prevent this fill
-* the entire RSB, even when using IBRS.
+* If spectre v2 protection has been enabled, unconditionally fill
+* RSB during a context switch; this protects against two independent
+* issues:
 *
-* Skylake era CPUs have a separate issue with *underflow* of the
-* RSB, when they will predict 'ret' targets from the generic BTB.
-* The proper mitigation for this is IBRS. If IBRS is not supported
-* or deactivated in favour of retpolines the RSB fill on context
-* switch is required.
+*  - RSB underflow (and switch to BTB) on Skylake+
+*  - SpectreRSB variant of spectre v2 on X86_BUG_SPECTRE_V2 CPUs
 */
-   if ((!boot_cpu_has(X86_FEATURE_PTI) &&
-!boot_cpu_has(X86_FEATURE_SMEP)) || is_skylake_era()) {
-   setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
-   pr_info("Spectre v2 mitigation: Filling RSB on context 
switch\n");
-   }
+   setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
+   pr_info("Spectre v2 / SpectreRSB mitigation: Filling RSB on context 
switch\n");
 
/* Initialize Indirect Branch Prediction Barrier if supported */
if (boot_cpu_has(X86_FEATURE_IBPB)) {




[PATCH 4.17 20/97] kprobes/x86: Fix %p uses in error messages

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Masami Hiramatsu 

commit 0ea063306eecf300fcf06d2f5917474b580f666f upstream.

Remove all %p uses in error messages in kprobes/x86.

Signed-off-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Anil S Keshavamurthy 
Cc: Arnd Bergmann 
Cc: David Howells 
Cc: David S . Miller 
Cc: Heiko Carstens 
Cc: Jon Medhurst 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Thomas Richter 
Cc: Tobin C . Harding 
Cc: Will Deacon 
Cc: a...@kernel.org
Cc: a...@linux-foundation.org
Cc: brueck...@linux.vnet.ibm.com
Cc: linux-a...@vger.kernel.org
Cc: rost...@goodmis.org
Cc: schwidef...@de.ibm.com
Cc: sta...@vger.kernel.org
Link: 
https://lkml.kernel.org/lkml/152491902310.9916.13355297638917767319.stgit@devbox
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/x86/kernel/kprobes/core.c |5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -395,8 +395,6 @@ int __copy_instruction(u8 *dest, u8 *src
  - (u8 *) real;
if ((s64) (s32) newdisp != newdisp) {
pr_err("Kprobes error: new displacement does not fit 
into s32 (%llx)\n", newdisp);
-   pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
-   src, real, insn->displacement.value);
return 0;
}
disp = (u8 *) dest + insn_offset_displacement(insn);
@@ -640,8 +638,7 @@ static int reenter_kprobe(struct kprobe
 * Raise a BUG or we'll continue in an endless reentering loop
 * and eventually a stack overflow.
 */
-   printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
-  p->addr);
+   pr_err("Unrecoverable kprobe detected.\n");
dump_kprobe(p);
BUG();
default:




[PATCH 4.17 15/97] fix mntput/mntput race

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream.

mntput_no_expire() does the calculation of total refcount under mount_lock;
unfortunately, the decrement (as well as all increments) are done outside
of it, leading to false positives in the "are we dropping the last reference"
test.  Consider the following situation:
* mnt is a lazy-umounted mount, kept alive by two opened files.  One
of those files gets closed.  Total refcount of mnt is 2.  On CPU 42
mntput(mnt) (called from __fput()) drops one reference, decrementing component
* After it has looked at component #0, the process on CPU 0 does
mntget(), incrementing component #0, gets preempted and gets to run again -
on CPU 69.  There it does mntput(), which drops the reference (component #69)
and proceeds to spin on mount_lock.
* On CPU 42 our first mntput() finishes counting.  It observes the
decrement of component #69, but not the increment of component #0.  As the
result, the total it gets is not 1 as it should've been - it's 0.  At which
point we decide that vfsmount needs to be killed and proceed to free it and
shut the filesystem down.  However, there's still another opened file
on that filesystem, with reference to (now freed) vfsmount, etc. and we are
screwed.

It's not a wide race, but it can be reproduced with artificial slowdown of
the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.
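
Purely as an illustration (editorial, not part of the patch), here is a
schematic userspace model of the miscount. The per-CPU component array, the
walk order and the interleaving are simplified assumptions; the real counters
live in struct mount.

#include <assert.h>
#include <stdio.h>

#define NR_CPUS 128

static long count[NR_CPUS];		/* stand-in for the per-CPU mnt_count */

/* unlocked walk over the remaining components, as mnt_get_count() does */
static long sum_from(int first)
{
	long total = 0;
	int cpu;

	for (cpu = first; cpu < NR_CPUS; cpu++)
		total += count[cpu];
	return total;
}

int main(void)
{
	long observed, real;

	count[0] = 2;		/* two opened files keep the mount alive      */

	count[42]--;		/* CPU 42: mntput() drops one reference ...    */
	observed = count[0];	/* ... and its counting loop has passed #0     */

	count[0]++;		/* CPU 0: mntget() takes a new reference       */
	count[69]--;		/* same task, now on CPU 69: mntput() drops it */

	observed += sum_from(1);	/* CPU 42 finishes: sees only the decrement */
	real = count[0] + sum_from(1);

	printf("observed %ld, real %ld\n", observed, real);
	assert(observed == 0 && real == 1);	/* false "last reference" */
	return 0;
}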

Fix consists of moving the refcount decrement under mount_lock; the tricky
part is that we want (and can) keep the fast case (i.e. mount that still
has non-NULL ->mnt_ns) entirely out of mount_lock.  All places that zero
mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
before that mntput().  IOW, if mntput() observes (under rcu_read_lock())
a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
be dropped.

Reported-by: Jann Horn 
Tested-by: Jann Horn 
Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/namespace.c |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1195,12 +1195,22 @@ static DECLARE_DELAYED_WORK(delayed_mntp
 static void mntput_no_expire(struct mount *mnt)
 {
rcu_read_lock();
-   mnt_add_count(mnt, -1);
-   if (likely(mnt->mnt_ns)) { /* shouldn't be the last one */
+   if (likely(READ_ONCE(mnt->mnt_ns))) {
+   /*
+* Since we don't do lock_mount_hash() here,
+* ->mnt_ns can change under us.  However, if it's
+* non-NULL, then there's a reference that won't
+* be dropped until after an RCU delay done after
+* turning ->mnt_ns NULL.  So if we observe it
+* non-NULL under rcu_read_lock(), the reference
+* we are dropping is not the final one.
+*/
+   mnt_add_count(mnt, -1);
rcu_read_unlock();
return;
}
lock_mount_hash();
+   mnt_add_count(mnt, -1);
if (mnt_get_count(mnt)) {
rcu_read_unlock();
unlock_mount_hash();




[PATCH 4.17 23/97] x86/speculation/l1tf: Change order of offset/type in swap entry

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Linus Torvalds 

commit bcd11afa7adad8d720e7ba5ef58bdcd9775cf45f upstream

If pages are swapped out, the swap entry is stored in the corresponding
PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
on PTE entries which have the present bit set and would treat the swap
entry as a physical address (PFN). To mitigate that the upper bits of the PTE
must be set so the PTE points to non-existent memory.

The swap entry stores the type and the offset of a swapped out page in the
PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware
ignores the bits beyond the physical address space limit, so to make the
mitigation effective it's required to start 'offset' at the lowest possible
bit so that even large swap offsets do not reach into the physical address
space limit bits.

Move offset to bit 9-58 and type to bit 59-63 which are the bits that
hardware generally doesn't care about.

That, in turn, means that if you are on a desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, there needs to be
30 bits of offset actually *in use* until bit 39 ends up being set, which
means when inverted it will again point into existing memory.

So that's 4 terabyte of swap space (because the offset is counted in pages,
so 30 bits of offset is 42 bits of actual coverage). With bigger physical
addressing, that obviously grows further, until the limit of the offset is
hit (at 50 bits of offset - 62 bits of actual swap file coverage).

This is a preparatory change for the actual swap entry inversion to protect
against L1TF.

[ AK: Updated description and minor tweaks. Split into two parts ]
[ tglx: Massaged changelog ]

Signed-off-by: Linus Torvalds 
Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Tested-by: Andi Kleen 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable_64.h |   31 ---
 1 file changed, 20 insertions(+), 11 deletions(-)

--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) {
  *
  * | ...| 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * | ...|SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | OFFSET (14->63) | TYPE (9-13)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) |  OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -287,19 +287,28 @@ static inline int pgd_large(pgd_t pgd) {
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  */
-#define SWP_TYPE_FIRST_BIT (_PAGE_BIT_PROTNONE + 1)
-#define SWP_TYPE_BITS 5
-/* Place the offset above the type: */
-#define SWP_OFFSET_FIRST_BIT (SWP_TYPE_FIRST_BIT + SWP_TYPE_BITS)
+#define SWP_TYPE_BITS  5
+
+#define SWP_OFFSET_FIRST_BIT   (_PAGE_BIT_PROTNONE + 1)
+
+/* We always extract/encode the offset by shifting it all the way up, and then 
down again */
+#define SWP_OFFSET_SHIFT   (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
 
 #define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
 
-#define __swp_type(x)  (((x).val >> (SWP_TYPE_FIRST_BIT)) \
-& ((1U << SWP_TYPE_BITS) - 1))
-#define __swp_offset(x)((x).val >> 
SWP_OFFSET_FIRST_BIT)
-#define __swp_entry(type, offset)  ((swp_entry_t) { \
-((type) << (SWP_TYPE_FIRST_BIT)) \
-| ((offset) << SWP_OFFSET_FIRST_BIT) })
+/* Extract the high bits for type */
+#define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
+
+/* Shift up (to get rid of type), then down to get value */
+#define __swp_offset(x) ((x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
+
+/*
+ * Shift the offset up "too far" by TYPE bits, then down again
+ */
+#define __swp_entry(type, offset) ((swp_entry_t) { \
+   ((unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+   | ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
+
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val((pte)) 
})
 #define __pmd_to_swp_entry(pmd)((swp_entry_t) { pmd_val((pmd)) 
})
 #define __swp_entry_to_pte(x)  ((pte_t) { .pte = (x).val })




[PATCH 4.17 05/97] sched/deadline: Update rq_clock of later_rq when pushing a task

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Daniel Bristot de Oliveira 

commit 840d719604b0925ca23dde95f1767e4528668369 upstream.

Daniel Casini got this warn while running a DL task here at RetisLab:

  [  461.137582] [ cut here ]
  [  461.137583] rq->clock_update_flags < RQCF_ACT_SKIP
  [  461.137599] WARNING: CPU: 4 PID: 2354 at kernel/sched/sched.h:967 
assert_clock_updated.isra.32.part.33+0x17/0x20
  [a ton of modules]
  [  461.137646] CPU: 4 PID: 2354 Comm: label_image Not tainted 4.18.0-rc4+ #3
  [  461.137647] Hardware name: ASUS All Series/Z87-K, BIOS 0801 09/02/2013
  [  461.137649] RIP: 0010:assert_clock_updated.isra.32.part.33+0x17/0x20
  [  461.137649] Code: ff 48 89 83 08 09 00 00 eb c6 66 0f 1f 84 00 00 00 00 00 
55 48 c7 c7 98 7a 6c a5 c6 05 bc 0d 54 01 01 48 89 e5 e8 a9 84 fb ff <0f> 0b 5d 
c3 0f 1f 44 00 00 0f 1f 44 00 00 83 7e 60 01 74 0a 48 3b
  [  461.137673] RSP: 0018:a77e08cafc68 EFLAGS: 00010082
  [  461.137674] RAX:  RBX: 8b3fc1702d80 RCX: 
0006
  [  461.137674] RDX: 0007 RSI: 0096 RDI: 
8b3fded164b0
  [  461.137675] RBP: a77e08cafc68 R08: 0026 R09: 
0339
  [  461.137676] R10: 8b3fd060d410 R11: 0026 R12: 
a4e14e20
  [  461.137677] R13: 8b3fdec22940 R14: 8b3fc1702da0 R15: 
8b3fdec22940
  [  461.137678] FS:  7efe43ee5700() GS:8b3fded0() 
knlGS:
  [  461.137679] CS:  0010 DS:  ES:  CR0: 80050033
  [  461.137680] CR2: 7efe3010 CR3: 000301744003 CR4: 
001606e0
  [  461.137680] Call Trace:
  [  461.137684]  push_dl_task.part.46+0x3bc/0x460
  [  461.137686]  task_woken_dl+0x60/0x80
  [  461.137689]  ttwu_do_wakeup+0x4f/0x150
  [  461.137690]  ttwu_do_activate+0x77/0x80
  [  461.137692]  try_to_wake_up+0x1d6/0x4c0
  [  461.137693]  wake_up_q+0x32/0x70
  [  461.137696]  do_futex+0x7e7/0xb50
  [  461.137698]  __x64_sys_futex+0x8b/0x180
  [  461.137701]  do_syscall_64+0x5a/0x110
  [  461.137703]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [  461.137705] RIP: 0033:0x7efe4918ca26
  [  461.137705] Code: 00 00 00 74 17 49 8b 48 20 44 8b 59 10 41 83 e3 30 41 83 
fb 20 74 1e be 85 00 00 00 41 ba 01 00 00 00 41 b9 01 00 00 04 0f 05 <48> 3d 01 
f0 ff ff 73 1f 31 c0 c3 be 8c 00 00 00 49 89 c8 4d 31 d2
  [  461.137738] RSP: 002b:7efe43ee4928 EFLAGS: 0283 ORIG_RAX: 
00ca
  [  461.137739] RAX: ffda RBX: 05094df0 RCX: 
7efe4918ca26
  [  461.137740] RDX: 0001 RSI: 0085 RDI: 
05094e24
  [  461.137741] RBP: 7efe43ee49c0 R08: 05094e20 R09: 
0401
  [  461.137741] R10: 0001 R11: 0283 R12: 

  [  461.137742] R13: 05094df8 R14: 0001 R15: 
00448a10
  [  461.137743] ---[ end trace 187df4cad2bf7649 ]---

This warning happened in the push_dl_task(), because
__add_running_bw()->cpufreq_update_util() is getting the rq_clock of
the later_rq before its update, which takes place at activate_task().
The fix then is to update the rq_clock before calling add_running_bw().

To avoid double rq_clock_update() call, we set ENQUEUE_NOCLOCK flag to
activate_task().

Reported-by: Daniel Casini 
Signed-off-by: Daniel Bristot de Oliveira 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Juri Lelli 
Cc: Clark Williams 
Cc: Linus Torvalds 
Cc: Luca Abeni 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Cc: Tommaso Cucinotta 
Fixes: e0367b12674b sched/deadline: Move CPU frequency selection triggering 
points
Link: 
http://lkml.kernel.org/r/ca31d073a4788acf0684a8b255f14fea775ccf20.1532077269.git.bris...@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman 

---
 kernel/sched/deadline.c |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -2090,8 +2090,14 @@ retry:
sub_rq_bw(&next_task->dl, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
add_rq_bw(&next_task->dl, &later_rq->dl);
+
+   /*
+* Update the later_rq clock here, because the clock is used
+* by the cpufreq_update_util() inside __add_running_bw().
+*/
+   update_rq_clock(later_rq);
add_running_bw(&next_task->dl, &later_rq->dl);
-   activate_task(later_rq, next_task, 0);
+   activate_task(later_rq, next_task, ENQUEUE_NOCLOCK);
ret = 1;
 
resched_curr(later_rq);




[PATCH 4.17 18/97] x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Peter Zijlstra 

commit 5800dc5c19f34e6e03b5adab1282535cb102fafd upstream.

Nadav reported that on guests we're failing to rewrite the indirect
calls to CALLEE_SAVE paravirt functions. In particular the
pv_queued_spin_unlock() call is left unpatched and that is all over the
place. This obviously wrecks Spectre-v2 mitigation (for paravirt
guests) which relies on not actually having indirect calls around.

The reason is an incorrect clobber test in paravirt_patch_call(); this
function rewrites an indirect call with a direct call to the _SAME_
function, there is no possible way the clobbers can be different
because of this.

Therefore remove this clobber check. Also put WARNs on the other patch
failure case (not enough room for the instruction) which I've not seen
trigger in my (limited) testing.

Three live kernel image disassemblies for lock_sock_nested (as a small
function that illustrates the problem nicely). PRE is the current
situation for guests, POST is with this patch applied and NATIVE is with
or without the patch for !guests.

PRE:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0x817be970 <+0>: push   %rbp
   0x817be971 <+1>: mov%rdi,%rbp
   0x817be974 <+4>: push   %rbx
   0x817be975 <+5>: lea0x88(%rbp),%rbx
   0x817be97c <+12>:callq  0x819f7160 <_cond_resched>
   0x817be981 <+17>:mov%rbx,%rdi
   0x817be984 <+20>:callq  0x819fbb00 <_raw_spin_lock_bh>
   0x817be989 <+25>:mov0x8c(%rbp),%eax
   0x817be98f <+31>:test   %eax,%eax
   0x817be991 <+33>:jne0x817be9ba 
   0x817be993 <+35>:movl   $0x1,0x8c(%rbp)
   0x817be99d <+45>:mov%rbx,%rdi
   0x817be9a0 <+48>:callq  *0x822299e8
   0x817be9a7 <+55>:pop%rbx
   0x817be9a8 <+56>:pop%rbp
   0x817be9a9 <+57>:mov$0x200,%esi
   0x817be9ae <+62>:mov$0x817be993,%rdi
   0x817be9b5 <+69>:jmpq   0x81063ae0 <__local_bh_enable_ip>
   0x817be9ba <+74>:mov%rbp,%rdi
   0x817be9bd <+77>:callq  0x817be8c0 <__lock_sock>
   0x817be9c2 <+82>:jmp0x817be993 
End of assembler dump.

POST:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0x817be970 <+0>: push   %rbp
   0x817be971 <+1>: mov%rdi,%rbp
   0x817be974 <+4>: push   %rbx
   0x817be975 <+5>: lea0x88(%rbp),%rbx
   0x817be97c <+12>:callq  0x819f7160 <_cond_resched>
   0x817be981 <+17>:mov%rbx,%rdi
   0x817be984 <+20>:callq  0x819fbb00 <_raw_spin_lock_bh>
   0x817be989 <+25>:mov0x8c(%rbp),%eax
   0x817be98f <+31>:test   %eax,%eax
   0x817be991 <+33>:jne0x817be9ba 
   0x817be993 <+35>:movl   $0x1,0x8c(%rbp)
   0x817be99d <+45>:mov%rbx,%rdi
   0x817be9a0 <+48>:callq  0x810a0c20 
<__raw_callee_save___pv_queued_spin_unlock>
   0x817be9a5 <+53>:xchg   %ax,%ax
   0x817be9a7 <+55>:pop%rbx
   0x817be9a8 <+56>:pop%rbp
   0x817be9a9 <+57>:mov$0x200,%esi
   0x817be9ae <+62>:mov$0x817be993,%rdi
   0x817be9b5 <+69>:jmpq   0x81063aa0 <__local_bh_enable_ip>
   0x817be9ba <+74>:mov%rbp,%rdi
   0x817be9bd <+77>:callq  0x817be8c0 <__lock_sock>
   0x817be9c2 <+82>:jmp0x817be993 
End of assembler dump.

NATIVE:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0x817be970 <+0>: push   %rbp
   0x817be971 <+1>: mov%rdi,%rbp
   0x817be974 <+4>: push   %rbx
   0x817be975 <+5>: lea0x88(%rbp),%rbx
   0x817be97c <+12>:callq  0x819f7160 <_cond_resched>
   0x817be981 <+17>:mov%rbx,%rdi
   0x817be984 <+20>:callq  0x819fbb00 <_raw_spin_lock_bh>
   0x817be989 <+25>:mov0x8c(%rbp),%eax
   0x817be98f <+31>:test   %eax,%eax
   0x817be991 <+33>:jne0x817be9ba 
   0x817be993 <+35>:movl   $0x1,0x8c(%rbp)
   0x817be99d <+45>:mov%rbx,%rdi
   0x817be9a0 <+48>:movb   $0x0,(%rdi)
   0x817be9a3 <+51>:nopl   0x0(%rax)
   0x817be9a7 <+55>:pop%rbx
   0x817be9a8 <+56>:pop%rbp
   0x817be9a9 <+57>:mov$0x200,%esi
   0x817be9ae <+62>:mov$0x817be993,%rdi
   0x817be9b5 <+69>:jmpq   0x81063ae0 <__local_bh_enable_ip>
   

[PATCH 4.17 26/97] x86/speculation/l1tf: Make sure the first page is always reserved

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 10a70416e1f067f6c4efda6ffd8ea96002ac4223 upstream

The L1TF workaround doesn't make any attempt to mitigate speculate accesses
to the first physical page for zeroed PTEs. Normally it only contains some
data from the early real mode BIOS.

It's not entirely clear that the first page is reserved in all
configurations, so add an extra reservation call to make sure it is really
reserved. In most configurations (e.g.  with the standard reservations)
it's likely a nop.

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/setup.c |6 ++
 1 file changed, 6 insertions(+)

--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -823,6 +823,12 @@ void __init setup_arch(char **cmdline_p)
memblock_reserve(__pa_symbol(_text),
 (unsigned long)__bss_stop - (unsigned long)_text);
 
+   /*
+* Make sure page 0 is always reserved because on systems with
+* L1TF its contents can be leaked to user processes.
+*/
+   memblock_reserve(0, PAGE_SIZE);
+
early_reserve_initrd();
 
/*




[PATCH 4.17 21/97] x86/irqflags: Provide a declaration for native_save_fl

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Nick Desaulniers 

commit 208cbb32558907f68b3b2a081ca2337ac3744794 upstream.

It was reported that the commit d0a8d9378d16 is causing users of gcc < 4.9
to observe -Werror=missing-prototypes errors.

Indeed, it seems that:
extern inline unsigned long native_save_fl(void) { return 0; }

compiled with -Werror=missing-prototypes produces this warning in gcc <
4.9, but not gcc >= 4.9.

Fixes: d0a8d9378d16 ("x86/paravirt: Make native_save_fl() extern inline").
Reported-by: David Laight 
Reported-by: Jean Delvare 
Signed-off-by: Nick Desaulniers 
Signed-off-by: Thomas Gleixner 
Cc: h...@zytor.com
Cc: jgr...@suse.com
Cc: kstew...@linuxfoundation.org
Cc: gre...@linuxfoundation.org
Cc: boris.ostrov...@oracle.com
Cc: astrac...@google.com
Cc: m...@chromium.org
Cc: a...@arndb.de
Cc: tstel...@redhat.com
Cc: sedat.di...@gmail.com
Cc: david.lai...@aculab.com
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulni...@google.com
Signed-off-by: Greg Kroah-Hartman 

---
 arch/x86/include/asm/irqflags.h |2 ++
 1 file changed, 2 insertions(+)

--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -13,6 +13,8 @@
  * Interrupt control:
  */
 
+/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
+extern inline unsigned long native_save_fl(void);
 extern inline unsigned long native_save_fl(void)
 {
unsigned long flags;




[PATCH 4.17 24/97] x86/speculation/l1tf: Protect swap entries against L1TF

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Linus Torvalds 

commit 2f22b4cd45b67b3496f4aa4c7180a1271c6452f6 upstream

With L1 terminal fault the CPU speculates into unmapped PTEs, and the
resulting side effects allow reading the memory the PTE is pointing to, if its
values are still in the L1 cache.

For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
them.

To protect against L1TF it must be ensured that the swap entry is not
pointing to valid memory, which requires setting higher bits (between bit
36 and bit 45) that are inside the CPUs physical address space, but outside
any real memory.

To do this invert the offset to make sure the higher bits are always set,
as long as the swap file is not too big.

Note there is no workaround for 32bit !PAE, or on systems which have more
than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on
real systems.
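
For illustration only (not part of the patch): the same userspace model as in
the layout patch, now with the offset inverted. Bit positions again assume
_PAGE_BIT_PROTNONE == 8; the macros are hand-copied, not the real headers.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

static uint64_t swp_entry(uint64_t type, uint64_t offset)
{
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

static uint64_t swp_offset(uint64_t val)
{
	return ~val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	uint64_t e = swp_entry(2, 1000);	/* a small, realistic swap offset */

	/* decoding still recovers the original offset ... */
	assert(swp_offset(e) == 1000);

	/*
	 * ... but the bits stored in the PTE are now mostly ones (e.g. bit 45
	 * is set), so the "PFN" seen by the speculating CPU lies far above
	 * any populated memory as long as the swap file is not huge.
	 */
	assert(e & (1ULL << 45));
	printf("raw inverted entry: %#llx\n", (unsigned long long)e);
	return 0;
}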

[AK: updated description and minor tweaks. Split out from the original
 patch ]

Signed-off-by: Linus Torvalds 
Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Tested-by: Andi Kleen 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable_64.h |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -273,7 +273,7 @@ static inline int pgd_large(pgd_t pgd) {
  *
  * | ...| 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * | ...|SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) |  OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -286,6 +286,9 @@ static inline int pgd_large(pgd_t pgd) {
  *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
+ *
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
 #define SWP_TYPE_BITS  5
 
@@ -300,13 +303,15 @@ static inline int pgd_large(pgd_t pgd) {
 #define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
 
 /* Shift up (to get rid of type), then down to get value */
-#define __swp_offset(x) ((x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
+#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
 
 /*
  * Shift the offset up "too far" by TYPE bits, then down again
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
 #define __swp_entry(type, offset) ((swp_entry_t) { \
-   ((unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+   (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
| ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val((pte)) 
})




[PATCH 4.17 25/97] x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 6b28baca9b1f0d4a42b865da7a05b1c81424bd5c upstream

When PTEs are set to PROT_NONE the kernel just clears the Present bit and
preserves the PFN, which creates attack surface for L1TF speculation
attacks.

This is important inside guests, because L1TF speculation bypasses physical
page remapping. While the host has its own mitigations preventing leaking
data from other VMs into the guest, this would still risk leaking the wrong
page inside the current guest.

This uses the same technique as Linus' swap entry patch: while an entry is
in PROTNONE state invert the complete PFN part of it. This ensures
that the highest bit will point to non-existing memory.

The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
pte/pmd/pud_pfn undo it.

This assumes that no code path touches the PFN part of a PTE directly
without using these primitives.
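
For illustration only (not part of the patch): a userspace model of the
helpers added below in pgtable-invert.h. _PAGE_PRESENT, _PAGE_PROTNONE and
the PFN mask are simplified example values, not the real pgtable_types.h
definitions.

#include <assert.h>
#include <stdint.h>

#define _PAGE_PRESENT	(1ULL << 0)
#define _PAGE_PROTNONE	(1ULL << 8)	/* aliases the Global bit on x86 */
#define PTE_PFN_MASK	0x000ffffffffff000ULL

static int pte_needs_invert(uint64_t val)
{
	return (val & (_PAGE_PRESENT | _PAGE_PROTNONE)) == _PAGE_PROTNONE;
}

static uint64_t protnone_mask(uint64_t val)
{
	return pte_needs_invert(val) ? ~0ULL : 0;
}

static uint64_t flip_protnone_guard(uint64_t oldval, uint64_t val, uint64_t mask)
{
	/* invert the PFN part when an entry changes to or from PROT_NONE */
	if (pte_needs_invert(oldval) != pte_needs_invert(val))
		val = (val & ~mask) | (~val & mask);
	return val;
}

int main(void)
{
	uint64_t pte = (0x1234ULL << 12) | _PAGE_PRESENT;	/* pfn 0x1234 */
	uint64_t numa;

	/* a pte_modify()-style transition to PROT_NONE inverts the PFN bits */
	numa = flip_protnone_guard(pte,
			(pte & ~_PAGE_PRESENT) | _PAGE_PROTNONE, PTE_PFN_MASK);
	assert((numa & PTE_PFN_MASK) != (0x1234ULL << 12));

	/* pte_pfn()-style readers undo the inversion transparently */
	assert((((numa ^ protnone_mask(numa)) & PTE_PFN_MASK) >> 12) == 0x1234);
	return 0;
}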

This doesn't handle the case that MMIO is on the top of the CPU physical
memory. If such an MMIO region was exposed by an unprivileged driver for
mmap it would be possible to attack some real memory.  However this
situation is all rather unlikely.

For 32bit non PAE the inversion is not done because there are really not
enough bits to protect anything.

Q: Why does the guest need to be protected when the HyperVisor already has
   L1TF mitigations?

A: Here's an example:

   Physical pages 1 2 get mapped into a guest as
   GPA 1 -> PA 2
   GPA 2 -> PA 1
   through EPT.

   The L1TF speculation ignores the EPT remapping.

   Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
   they belong to different users and should be isolated.

   A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
   gets read access to the underlying physical page. Which in this case
   points to PA 2, so it can read process B's data, if it happened to be in
   L1, so isolation inside the guest is broken.

   There's nothing the hypervisor can do about this. This mitigation has to
   be done in the guest itself.

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable-2level.h |   17 +
 arch/x86/include/asm/pgtable-3level.h |2 +
 arch/x86/include/asm/pgtable-invert.h |   32 
 arch/x86/include/asm/pgtable.h|   44 +++---
 arch/x86/include/asm/pgtable_64.h |2 +
 5 files changed, 84 insertions(+), 13 deletions(-)
 create mode 100644 arch/x86/include/asm/pgtable-invert.h

--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -95,4 +95,21 @@ static inline unsigned long pte_bitop(un
 #define __pte_to_swp_entry(pte)((swp_entry_t) { (pte).pte_low 
})
 #define __swp_entry_to_pte(x)  ((pte_t) { .pte = (x).val })
 
+/* No inverted PFNs on 2 level page tables */
+
+static inline u64 protnone_mask(u64 val)
+{
+   return 0;
+}
+
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
+{
+   return val;
+}
+
+static inline bool __pte_needs_invert(u64 val)
+{
+   return false;
+}
+
 #endif /* _ASM_X86_PGTABLE_2LEVEL_H */
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -295,4 +295,6 @@ static inline pte_t gup_get_pte(pte_t *p
return pte;
 }
 
+#include <asm/pgtable-invert.h>
+
 #endif /* _ASM_X86_PGTABLE_3LEVEL_H */
--- /dev/null
+++ b/arch/x86/include/asm/pgtable-invert.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_PGTABLE_INVERT_H
+#define _ASM_PGTABLE_INVERT_H 1
+
+#ifndef __ASSEMBLY__
+
+static inline bool __pte_needs_invert(u64 val)
+{
+   return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE;
+}
+
+/* Get a mask to xor with the page table entry to get the correct pfn. */
+static inline u64 protnone_mask(u64 val)
+{
+   return __pte_needs_invert(val) ?  ~0ull : 0;
+}
+
+static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask)
+{
+   /*
+* When a PTE transitions from NONE to !NONE or vice-versa
+* invert the PFN part to stop speculation.
+* pte_pfn undoes this when needed.
+*/
+   if (__pte_needs_invert(oldval) != __pte_needs_invert(val))
+   val = (val & ~mask) | (~val & mask);
+   return val;
+}
+
+#endif /* __ASSEMBLY__ */
+
+#endif
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -185,19 +185,29 @@ static inline int pte_special(pte_t pte)
return pte_flags(pte) & _PAGE_SPECIAL;
 }
 
+/* Entries that were set to PROT_NONE are inverted */
+
+static inline u64 protnone_mask(u64 val);
+
 static inline unsigned long pte_pfn(pte_t pte)
 {
-   return (pte_val(pte) & 

[PATCH 4.17 16/97] fix __legitimize_mnt()/mntput() race

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream.

__legitimize_mnt() has two problems - one is that in case of success
the check of mount_lock is not ordered wrt preceding increment of
refcount, making it possible to have successful __legitimize_mnt()
on one CPU just before the otherwise final mntput() on another,
with __legitimize_mnt() not seeing mntput() taking the lock and
mntput() not seeing the increment done by __legitimize_mnt().
Solved by a pair of barriers.

Another is that failure of __legitimize_mnt() on the second
read_seqretry() leaves us with reference that'll need to be
dropped by caller; however, if that races with final mntput()
we can end up with caller dropping rcu_read_lock() and doing
mntput() to release that reference - with the first mntput()
having freed the damn thing just as rcu_read_lock() had been
dropped.  Solution: in "do mntput() yourself" failure case
grab mount_lock, check if MNT_DOOMED has been set by racing
final mntput() that has missed our increment and if it has -
undo the increment and treat that as "failure, caller doesn't
need to drop anything" case.

It's not easy to hit - the final mntput() has to come right
after the first read_seqretry() in __legitimize_mnt() *and*
manage to miss the increment done by __legitimize_mnt() before
the second read_seqretry() in there.  The things that are almost
impossible to hit on bare hardware are not impossible on SMP
KVM, though...

Reported-by: Oleg Nesterov 
Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/namespace.c |   14 ++
 1 file changed, 14 insertions(+)

--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -659,12 +659,21 @@ int __legitimize_mnt(struct vfsmount *ba
return 0;
mnt = real_mount(bastard);
mnt_add_count(mnt, 1);
+   smp_mb();   // see mntput_no_expire()
if (likely(!read_seqretry(&mount_lock, seq)))
return 0;
if (bastard->mnt_flags & MNT_SYNC_UMOUNT) {
mnt_add_count(mnt, -1);
return 1;
}
+   lock_mount_hash();
+   if (unlikely(bastard->mnt_flags & MNT_DOOMED)) {
+   mnt_add_count(mnt, -1);
+   unlock_mount_hash();
+   return 1;
+   }
+   unlock_mount_hash();
+   /* caller will mntput() */
return -1;
 }
 
@@ -1210,6 +1219,11 @@ static void mntput_no_expire(struct moun
return;
}
lock_mount_hash();
+   /*
+* make sure that if __legitimize_mnt() has not seen us grab
+* mount_lock, we'll see their refcount increment here.
+*/
+   smp_mb();
mnt_add_count(mnt, -1);
if (mnt_get_count(mnt)) {
rcu_read_unlock();




[PATCH 4.17 22/97] x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 50896e180c6aa3a9c61a26ced99e15d602666a4c upstream

L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
speculates on PTE entries which do not have the PRESENT bit set, if the
content of the resulting physical address is available in the L1D cache.

The OS side mitigation makes sure that a !PRESENT PTE entry points to a
physical address outside the actually existing and cachable memory
space. This is achieved by inverting the upper bits of the PTE. Due to the
address space limitations this only works for 64bit and 32bit PAE kernels,
but not for 32bit non PAE.

This mitigation applies to both host and guest kernels, but in case of a
64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
the PAE address space (44bit) is not enough if the host has more than 43
bits of populated memory address space, because the speculation treats the
PTE content as a physical host address bypassing EPT.

The host (hypervisor) protects itself against the guest by flushing L1D as
needed, but pages inside the guest are not protected against attacks from
other processes inside the same guest.

For the guest the inverted PTE mask has to match the host to provide the
full protection for all pages the host could possibly map into the
guest. The hosts populated address space is not known to the guest, so the
mask must cover the possible maximal host address space, i.e. 52 bit.

On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
the mask to be below what the host could possibly use for physical pages.

The L1TF PROT_NONE protection code uses the PTE masks to determine which
bits to invert to make sure the higher bits are set for unmapped entries to
prevent L1TF speculation attacks against EPT inside guests.

In order to invert all bits that could be used by the host, increase
__PHYSICAL_PAGE_SHIFT to 52 to match 64bit.

The real limit for a 32bit PAE kernel is still 44 bits because all Linux
PTEs are created from unsigned long PFNs, so they cannot be higher than 44
bits on a 32bit kernel. So these extra PFN bits should never be set. The
only users of this macro are using it to look at PTEs, so it's safe.
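
As an editorial sketch only (not part of the patch), the mask arithmetic in
userspace C; PAGE_SHIFT and the way the PFN mask is derived mirror the x86
headers in spirit only.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12

static uint64_t pte_pfn_mask(unsigned int physical_mask_shift)
{
	uint64_t physical_mask = (1ULL << physical_mask_shift) - 1;

	return physical_mask & ~((1ULL << PAGE_SHIFT) - 1);
}

int main(void)
{
	/* old 32bit PAE value vs. the new host-covering value */
	printf("shift 44: PFN mask %#llx (bits 12-43)\n",
	       (unsigned long long)pte_pfn_mask(44));
	printf("shift 52: PFN mask %#llx (bits 12-51)\n",
	       (unsigned long long)pte_pfn_mask(52));
	return 0;
}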

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/page_32_types.h |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -29,8 +29,13 @@
 #define N_EXCEPTION_STACKS 1
 
 #ifdef CONFIG_X86_PAE
-/* 44=32+12, the limit we can fit into an unsigned long pfn */
-#define __PHYSICAL_MASK_SHIFT  44
+/*
+ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
+ * but we need the full mask to make sure inverted PROT_NONE
+ * entries have all the host bits set in a guest.
+ * The real limit is still 44 bits.
+ */
+#define __PHYSICAL_MASK_SHIFT  52
 #define __VIRTUAL_MASK_SHIFT   32
 
 #else  /* !CONFIG_X86_PAE */




[PATCH 4.17 27/97] x86/speculation/l1tf: Add sysfs reporting for l1tf

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 17dbca119312b4e8173d4e25ff64262119fcef38 upstream

L1TF core kernel workarounds are cheap and normally always enabled. However,
they should still be reported in sysfs if the system is vulnerable or
mitigated. Add the necessary CPU feature/bug bits.

- Extend the existing checks for Meltdowns to determine if the system is
  vulnerable. All CPUs which are not vulnerable to Meltdown are also not
  vulnerable to L1TF

- Check for 32bit non PAE and emit a warning as there is no practical way
  for mitigation due to the limited physical address bits

- If the system has more than MAX_PA/2 physical memory the invert page
  workarounds don't protect the system against the L1TF attack anymore,
  because an inverted physical address will also point to valid
  memory. Print a warning in this case and report that the system is
  vulnerable.

Add a function which returns the PFN limit for the L1TF mitigation, which
will be used in follow up patches for sanity and range checks.
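
For illustration only (not part of the patch): a userspace model of the limit.
The 46-bit value of x86_phys_bits is an arbitrary example, not taken from the
patch.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12

static uint64_t l1tf_pfn_limit(unsigned int x86_phys_bits)
{
	return (1ULL << (x86_phys_bits - 1 - PAGE_SHIFT)) - 1;
}

int main(void)
{
	unsigned int phys_bits = 46;	/* example value only */
	uint64_t half_pa = l1tf_pfn_limit(phys_bits) << PAGE_SHIFT;

	/* RAM mapped at or above MAX_PA/2 defeats the PTE inversion trick */
	printf("%u physical bits -> MAX_PA/2 is %llu GiB (just under 32 TiB)\n",
	       phys_bits, (unsigned long long)(half_pa >> 30));
	return 0;
}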

[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/cpufeatures.h |2 +
 arch/x86/include/asm/processor.h   |5 
 arch/x86/kernel/cpu/bugs.c |   40 +
 arch/x86/kernel/cpu/common.c   |   20 ++
 drivers/base/cpu.c |8 +++
 include/linux/cpu.h|2 +
 6 files changed, 77 insertions(+)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
 #define X86_FEATURE_IBPB   ( 7*32+26) /* Indirect Branch 
Prediction Barrier */
 #define X86_FEATURE_STIBP  ( 7*32+27) /* Single Thread Indirect 
Branch Predictors */
 #define X86_FEATURE_ZEN( 7*32+28) /* "" CPU is AMD 
family 0x17 (Zen) */
+#define X86_FEATURE_L1TF_PTEINV( 7*32+29) /* "" L1TF 
workaround PTE inversion */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
@@ -371,5 +372,6 @@
 #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by 
Spectre variant 1 attack with conditional branches */
 #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by 
Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS  X86_BUG(17) /* CPU is affected by 
speculative store bypass attack */
+#define X86_BUG_L1TF   X86_BUG(18) /* CPU is affected by L1 
Terminal Fault */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -181,6 +181,11 @@ extern const struct seq_operations cpuin
 
 extern void cpu_detect(struct cpuinfo_x86 *c);
 
+static inline unsigned long l1tf_pfn_limit(void)
+{
+   return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
+}
+
 extern void early_cpu_init(void);
 extern void identify_boot_cpu(void);
 extern void identify_secondary_cpu(struct cpuinfo_x86 *);
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -27,9 +27,11 @@
 #include <asm/pgtable.h>
 #include <asm/set_memory.h>
 #include <asm/intel-family.h>
+#include <asm/e820/api.h>
 
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
+static void __init l1tf_select_mitigation(void);
 
 /*
  * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
@@ -81,6 +83,8 @@ void __init check_bugs(void)
 */
ssb_select_mitigation();
 
+   l1tf_select_mitigation();
+
 #ifdef CONFIG_X86_32
/*
 * Check whether we are able to run this kernel safely on SMP.
@@ -205,6 +209,32 @@ static void x86_amd_ssb_disable(void)
wrmsrl(MSR_AMD64_LS_CFG, msrval);
 }
 
+static void __init l1tf_select_mitigation(void)
+{
+   u64 half_pa;
+
+   if (!boot_cpu_has_bug(X86_BUG_L1TF))
+   return;
+
+#if CONFIG_PGTABLE_LEVELS == 2
+   pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
+   return;
+#endif
+
+   /*
+* This is extremely unlikely to happen because on almost all
+* systems MAX_PA/2 is far larger than the amount of RAM that
+* can be fitted into DIMM slots.
+*/
+   half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
+   if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
+   pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation 
not effective.\n");
+   return;
+   }
+
+   setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
+}
+
 #ifdef RETPOLINE
 static bool spectre_v2_bad_module;
 
@@ -657,6 +687,11 @@ static ssize_t cpu_show_common(struct de
case X86_BUG_SPEC_STORE_BYPASS:
return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
 
+ 

[PATCH 4.14 080/104] cpu/hotplug: detect SMT disabled by BIOS

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Josh Poimboeuf 

commit 73d5e2b472640b1fcdb61ae8be389912ef211bda upstream

If SMT is disabled in BIOS, the CPU code doesn't properly detect it.
The /sys/devices/system/cpu/smt/control file shows 'on', and the 'l1tf'
vulnerabilities file shows SMT as vulnerable.

Fix it by forcing 'cpu_smt_control' to CPU_SMT_NOT_SUPPORTED in such a
case.  Unfortunately the detection can only be done after bringing all
the CPUs online, so we have to overwrite any previous writes to the
variable.
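
For reference, a small user-space check of what the control file reports
(a sketch only; it assumes a kernel built with CONFIG_HOTPLUG_SMT, and the
listed values are the ones the sysfs interface is documented to expose):

  #include <stdio.h>

  int main(void)
  {
          /* Typical values: "on", "off", "forceoff" or "notsupported". */
          FILE *f = fopen("/sys/devices/system/cpu/smt/control", "r");
          char buf[32];

          if (!f || !fgets(buf, sizeof(buf), f)) {
                  perror("smt/control");
                  return 1;
          }
          printf("SMT control: %s", buf);
          fclose(f);
          return 0;
  }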

Reported-by: Joe Mario 
Tested-by: Jiri Kosina 
Fixes: f048c399e0f7 ("x86/topology: Provide topology_smt_supported()")
Signed-off-by: Josh Poimboeuf 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 kernel/cpu.c |9 +
 1 file changed, 9 insertions(+)

--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2137,6 +2137,15 @@ static const struct attribute_group cpuh
 
 static int __init cpu_smt_state_init(void)
 {
+   /*
+* If SMT was disabled by BIOS, detect it here, after the CPUs have
+* been brought online.  This ensures the smt/l1tf sysfs entries are
+* consistent with reality.  Note this may overwrite cpu_smt_control's
+* previous setting.
+*/
+   if (topology_max_smt_threads() == 1)
+   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
+
	return sysfs_create_group(&cpu_subsys.dev_root->kobj,
				  &cpuhp_smt_attr_group);
 }




[PATCH 4.14 078/104] x86/KVM/VMX: Initialize the vmx_l1d_flush_pages content

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 288d152c23dcf3c09da46c5c481903ca10ebfef7 upstream

The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
to evict the L1d cache.

However, these pages are never cleared and, in theory, their data could be
leaked.

More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.

Fix this by initializing the individual vmx_l1d_flush_pages with a
different pattern each.
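
Purely as an illustration of the fix (a user-space sketch, not KVM code;
L1D_CACHE_ORDER is assumed to be 4 here, i.e. 16 pages of 4 KiB), giving
every page its own byte pattern guarantees that no two pages are identical
and therefore cannot be merged:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define PAGE_SIZE        4096
  #define L1D_CACHE_ORDER  4   /* assumed: 16 pages, 64 KiB in total */

  int main(void)
  {
          unsigned int i, nr_pages = 1u << L1D_CACHE_ORDER;
          unsigned char *flush_pages = malloc((size_t)nr_pages * PAGE_SIZE);

          if (!flush_pages)
                  return 1;

          /* Distinct patterns defeat same-content page merging
           * (KSM-style deduplication). */
          for (i = 0; i < nr_pages; i++)
                  memset(flush_pages + i * PAGE_SIZE, i + 1, PAGE_SIZE);

          printf("first byte of page 3: %#x\n", flush_pages[3 * PAGE_SIZE]);
          free(flush_pages);
          return 0;
  }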

Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
"flush_pages" to reflect this change.

Fixes: a47dd5f06714 ("x86/KVM/VMX: Add L1D flush algorithm")
Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kvm/vmx.c |   17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -214,6 +214,7 @@ static void *vmx_l1d_flush_pages;
 static int vmx_setup_l1d_flush(enum vmx_l1d_flush_state l1tf)
 {
struct page *page;
+   unsigned int i;
 
if (!enable_ept) {
l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_EPT_DISABLED;
@@ -246,6 +247,16 @@ static int vmx_setup_l1d_flush(enum vmx_
if (!page)
return -ENOMEM;
vmx_l1d_flush_pages = page_address(page);
+
+   /*
+* Initialize each page with a different pattern in
+* order to protect against KSM in the nested
+* virtualization case.
+*/
+   for (i = 0; i < 1u << L1D_CACHE_ORDER; ++i) {
+   memset(vmx_l1d_flush_pages + i * PAGE_SIZE, i + 1,
+  PAGE_SIZE);
+   }
}
 
l1tf_vmx_mitigation = l1tf;
@@ -9176,7 +9187,7 @@ static void vmx_l1d_flush(struct kvm_vcp
/* First ensure the pages are in the TLB */
"xorl   %%eax, %%eax\n"
".Lpopulate_tlb:\n\t"
-   "movzbl (%[empty_zp], %%" _ASM_AX "), %%ecx\n\t"
+   "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
"addl   $4096, %%eax\n\t"
"cmpl   %%eax, %[size]\n\t"
"jne.Lpopulate_tlb\n\t"
@@ -9185,12 +9196,12 @@ static void vmx_l1d_flush(struct kvm_vcp
/* Now fill the cache */
"xorl   %%eax, %%eax\n"
".Lfill_cache:\n"
-   "movzbl (%[empty_zp], %%" _ASM_AX "), %%ecx\n\t"
+   "movzbl (%[flush_pages], %%" _ASM_AX "), %%ecx\n\t"
"addl   $64, %%eax\n\t"
"cmpl   %%eax, %[size]\n\t"
"jne.Lfill_cache\n\t"
"lfence\n"
-   :: [empty_zp] "r" (vmx_l1d_flush_pages),
+   :: [flush_pages] "r" (vmx_l1d_flush_pages),
[size] "r" (size)
: "eax", "ebx", "ecx", "edx");
 }




[PATCH 4.14 081/104] x86/KVM/VMX: Dont set l1tf_flush_l1d to true from vmx_l1d_flush()

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 379fd0c7e6a391e5565336a646f19f218fb98c6c upstream

vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
point in setting l1tf_flush_l1d to true from there again.

Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kvm/vmx.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9166,15 +9166,15 @@ static void vmx_l1d_flush(struct kvm_vcp
/*
 * This code is only executed when the flush mode is 'cond' or
 * 'always'
-*
-* If 'flush always', keep the flush bit set, otherwise clear
-* it. The flush bit gets set again either from vcpu_run() or from
-* one of the unsafe VMEXIT handlers.
 */
-   if (static_branch_unlikely(&vmx_l1d_flush_always))
-   vcpu->arch.l1tf_flush_l1d = true;
-   else
+   if (!static_branch_unlikely(&vmx_l1d_flush_always)) {
+   /*
+* Clear the flush bit, it gets set again either from
+* vcpu_run() or from one of the unsafe VMEXIT
+* handlers.
+*/
vcpu->arch.l1tf_flush_l1d = false;
+   }
 
vcpu->stat.l1d_flush++;
 




[PATCH 4.14 103/104] tools headers: Synchronise x86 cpufeatures.h for L1TF additions

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: David Woodhouse 

commit e24f14b0ff985f3e09e573ba1134bfdf42987e05 upstream

Signed-off-by: David Woodhouse 
Signed-off-by: Greg Kroah-Hartman 
---
 tools/arch/x86/include/asm/cpufeatures.h |3 +++
 1 file changed, 3 insertions(+)

--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
 #define X86_FEATURE_IBPB   ( 7*32+26) /* Indirect Branch 
Prediction Barrier */
 #define X86_FEATURE_STIBP  ( 7*32+27) /* Single Thread Indirect 
Branch Predictors */
 #define X86_FEATURE_ZEN( 7*32+28) /* "" CPU is AMD 
family 0x17 (Zen) */
+#define X86_FEATURE_L1TF_PTEINV( 7*32+29) /* "" L1TF 
workaround PTE inversion */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
@@ -338,6 +339,7 @@
 #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL  (18*32+26) /* "" Speculation Control 
(IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP(18*32+27) /* "" Single Thread 
Indirect Branch Predictors */
+#define X86_FEATURE_FLUSH_L1D  (18*32+28) /* Flush L1D cache */
 #define X86_FEATURE_ARCH_CAPABILITIES  (18*32+29) /* IA32_ARCH_CAPABILITIES 
MSR (Intel) */
 #define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store 
Bypass Disable */
 
@@ -370,5 +372,6 @@
 #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by 
Spectre variant 1 attack with conditional branches */
 #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by 
Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS  X86_BUG(17) /* CPU is affected by 
speculative store bypass attack */
+#define X86_BUG_L1TF   X86_BUG(18) /* CPU is affected by L1 
Terminal Fault */
 
 #endif /* _ASM_X86_CPUFEATURES_H */




[PATCH 4.14 104/104] x86/microcode: Allow late microcode loading with SMT disabled

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Josh Poimboeuf 

commit 07d981ad4cf1e78361c6db1c28ee5ba105f96cc1 upstream

The kernel unnecessarily prevents late microcode loading when SMT is
disabled.  It should be safe to allow it if all the primary threads are
online.

Signed-off-by: Josh Poimboeuf 
Acked-by: Borislav Petkov 
Signed-off-by: David Woodhouse 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/microcode/core.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -509,12 +509,20 @@ static struct platform_device *microcode
 
 static int check_online_cpus(void)
 {
-   if (num_online_cpus() == num_present_cpus())
-   return 0;
+   unsigned int cpu;
 
-   pr_err("Not all CPUs online, aborting microcode update.\n");
+   /*
+* Make sure all CPUs are online.  It's fine for SMT to be disabled if
+* all the primary threads are still online.
+*/
+   for_each_present_cpu(cpu) {
+   if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {
+   pr_err("Not all CPUs online, aborting microcode 
update.\n");
+   return -EINVAL;
+   }
+   }
 
-   return -EINVAL;
+   return 0;
 }
 
 static atomic_t late_cpus_in;




[PATCH 4.14 100/104] x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 0768f91530ff46683e0b372df14fd79fe8d156e5 upstream

Some cases in THP like:
  - MADV_FREE
  - mprotect
  - split

temporarily mark the PMD non-present to prevent races. The window for
an L1TF attack in these contexts is very small, but it should be fixed
for correctness' sake.

Use the proper low level functions for pmd/pud_mknotpresent() to address
this.

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable.h |   22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -410,11 +410,6 @@ static inline pmd_t pmd_mkwrite(pmd_t pm
return pmd_set_flags(pmd, _PAGE_RW);
 }
 
-static inline pmd_t pmd_mknotpresent(pmd_t pmd)
-{
-   return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
-}
-
 static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
 {
pudval_t v = native_pud_val(pud);
@@ -469,11 +464,6 @@ static inline pud_t pud_mkwrite(pud_t pu
return pud_set_flags(pud, _PAGE_RW);
 }
 
-static inline pud_t pud_mknotpresent(pud_t pud)
-{
-   return pud_clear_flags(pud, _PAGE_PRESENT | _PAGE_PROTNONE);
-}
-
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline int pte_soft_dirty(pte_t pte)
 {
@@ -560,6 +550,18 @@ static inline pud_t pfn_pud(unsigned lon
return __pud(pfn | massage_pgprot(pgprot));
 }
 
+static inline pmd_t pmd_mknotpresent(pmd_t pmd)
+{
+   return pfn_pmd(pmd_pfn(pmd),
+ __pgprot(pmd_flags(pmd) & 
~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
+static inline pud_t pud_mknotpresent(pud_t pud)
+{
+   return pfn_pud(pud_pfn(pud),
+ __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
+}
+
 static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
 
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)




[PATCH 4.14 077/104] Documentation: Add section about CPU vulnerabilities

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

commit 3ec8ce5d866ec6a08a9cfab82b62acf4a830b35f upstream

Add documentation for the L1TF vulnerability and the mitigation mechanisms:

  - Explain the problem and risks
  - Document the mitigation mechanisms
  - Document the command line controls
  - Document the sysfs files

Signed-off-by: Thomas Gleixner 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Josh Poimboeuf 
Acked-by: Linus Torvalds 
Link: https://lkml.kernel.org/r/20180713142323.287429...@linutronix.de
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/admin-guide/index.rst |9 
 Documentation/admin-guide/l1tf.rst  |  591 
 2 files changed, 600 insertions(+)
 create mode 100644 Documentation/admin-guide/l1tf.rst

--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -17,6 +17,15 @@ etc.
kernel-parameters
devices
 
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+   :maxdepth: 1
+
+   l1tf
+
 Here is a set of documents aimed at users who are trying to track down
 problems and bugs in particular.
 
--- /dev/null
+++ b/Documentation/admin-guide/l1tf.rst
@@ -0,0 +1,591 @@
+L1TF - L1 Terminal Fault
+
+
+L1 Terminal Fault is a hardware vulnerability which allows unprivileged
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual address, which is used
+for the access, has the Present bit cleared or other reserved bits set.
+
+Affected processors
+---
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Processors from AMD, Centaur and other non Intel vendors
+
+   - Older processor models, where the CPU family is < 6
+
+   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
+ Penwell, Pineview, Silvermont, Airmont, Merrifield)
+
+   - The Intel Core Duo Yonah variants (2006 - 2008)
+
+   - The Intel XEON PHI family
+
+   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
+ by the Meltdown vulnerability either. These CPUs should become
+ available by end of 2018.
+
+Whether a processor is affected or not can be read out from the L1TF
+vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
+
+Related CVEs
+
+
+The following CVE entries are related to the L1TF vulnerability:
+
+   =  =  ==
+   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
+   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
+   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
+   =  =  ==
+
+Problem
+---
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the Present bit cleared or other reserved bits set,
+then speculative execution ignores the invalid PTE and loads the referenced
+data if it is present in the Level 1 Data Cache, as if the page referenced
+by the address bits in the PTE was still present and accessible.
+
+While this is a purely speculative mechanism and the instruction will raise
+a page fault when it is retired eventually, the pure act of loading the
+data and making it available to other speculative instructions opens up the
+opportunity for side channel attacks to unprivileged malicious code,
+similar to the Meltdown attack.
+
+While Meltdown breaks the user space to kernel space protection, L1TF
+allows attacking any physical memory address in the system, and the attack
+works across all protection domains. It allows an attack on SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+
+Attack scenarios
+
+
+1. Malicious user space
+^^^
+
+   Operating Systems store arbitrary information in the address bits of a
+   PTE which is marked non present. This allows a malicious user space
+   application to attack the physical memory to which these PTEs resolve.
+   In some cases user-space can maliciously influence the information
+   encoded in the address bits of the PTE, thus making attacks more
+   deterministic and more practical.
+
+   The Linux kernel contains a mitigation for this attack vector, PTE
+   inversion, which is permanently enabled and has no performance
+   impact. The kernel ensures that the address bits of PTEs, which are not
+   marked present, never point to cacheable physical memory space.
+
+   A system with an up to date kernel is protected against attacks from
+   malicious 

[PATCH 4.14 098/104] cpu/hotplug: Fix SMT supported evaluation

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

commit bc2d8d262cba5736332cbc866acb11b1c5748aa9 upstream

Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wrecks the state and
makes the sysfs interface unusable.

Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitly right after the initial SMP bringup has finished.

SMT evaluation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.

Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: Josh Poimboeuf 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/bugs.c |2 +-
 include/linux/cpu.h|2 ++
 kernel/cpu.c   |   41 -
 kernel/smp.c   |2 ++
 4 files changed, 33 insertions(+), 14 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -62,7 +62,7 @@ void __init check_bugs(void)
 * identify_boot_cpu() initialized SMT support information, let the
 * core code know.
 */
-   cpu_smt_check_topology();
+   cpu_smt_check_topology_early();
 
if (!IS_ENABLED(CONFIG_SMP)) {
pr_info("CPU: ");
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -188,10 +188,12 @@ enum cpuhp_smt_control {
 #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
 extern enum cpuhp_smt_control cpu_smt_control;
 extern void cpu_smt_disable(bool force);
+extern void cpu_smt_check_topology_early(void);
 extern void cpu_smt_check_topology(void);
 #else
 # define cpu_smt_control   (CPU_SMT_ENABLED)
 static inline void cpu_smt_disable(bool force) { }
+static inline void cpu_smt_check_topology_early(void) { }
 static inline void cpu_smt_check_topology(void) { }
 #endif
 
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -351,6 +351,8 @@ EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
 EXPORT_SYMBOL_GPL(cpu_smt_control);
 
+static bool cpu_smt_available __read_mostly;
+
 void __init cpu_smt_disable(bool force)
 {
if (cpu_smt_control == CPU_SMT_FORCE_DISABLED ||
@@ -367,14 +369,28 @@ void __init cpu_smt_disable(bool force)
 
 /*
  * The decision whether SMT is supported can only be done after the full
- * CPU identification. Called from architecture code.
+ * CPU identification. Called from architecture code before non boot CPUs
+ * are brought up.
  */
-void __init cpu_smt_check_topology(void)
+void __init cpu_smt_check_topology_early(void)
 {
if (!topology_smt_supported())
cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
 }
 
+/*
+ * If SMT was disabled by BIOS, detect it here, after the CPUs have been
+ * brought online. This ensures the smt/l1tf sysfs entries are consistent
+ * with reality. cpu_smt_available is set to true during the bringup of non
+ * boot CPUs when a SMT sibling is detected. Note, this may overwrite
+ * cpu_smt_control's previous setting.
+ */
+void __init cpu_smt_check_topology(void)
+{
+   if (!cpu_smt_available)
+   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
+}
+
 static int __init smt_cmdline_disable(char *str)
 {
cpu_smt_disable(str && !strcmp(str, "force"));
@@ -384,10 +400,18 @@ early_param("nosmt", smt_cmdline_disable
 
 static inline bool cpu_smt_allowed(unsigned int cpu)
 {
-   if (cpu_smt_control == CPU_SMT_ENABLED)
+   if (topology_is_primary_thread(cpu))
return true;
 
-   if (topology_is_primary_thread(cpu))
+   /*
+* If the CPU is not a 'primary' thread and the booted_once bit is
+* set then the processor has SMT support. Store this information
+* for the late check of SMT support in cpu_smt_check_topology().
+*/
+   if (per_cpu(cpuhp_state, cpu).booted_once)
+   cpu_smt_available = true;
+
+   if (cpu_smt_control == CPU_SMT_ENABLED)
return true;
 
/*
@@ -2137,15 +2161,6 @@ static const struct attribute_group cpuh
 
 static int __init cpu_smt_state_init(void)
 {
-   /*
-* If SMT was disabled by BIOS, detect it here, after the CPUs have
-* been brought online.  This ensures the smt/l1tf sysfs entries are
-* consistent with reality.  Note this may overwrite cpu_smt_control's
-* previous setting.
-*/
-   if (topology_max_smt_threads() == 1)
-   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
-
return 

[PATCH 4.14 084/104] x86/irq: Demote irq_cpustat_t::__softirq_pending to u16

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 9aee5f8a7e30330d0a8f4c626dc924ca5590aba5 upstream

An upcoming patch will extend KVM's L1TF mitigation in conditional mode
to also cover interrupts after VMEXITs. For tracking those, stores to a
new per-cpu flag from interrupt handlers will become necessary.

In order to improve cache locality, this new flag will be added to x86's
irq_cpustat_t.

Make some space available there by shrinking the ->softirq_pending bitfield
from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
i.e. 10.
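
A stand-alone compile-time check capturing that assumption (illustrative
only, plain C11 rather than the kernel's BUILD_BUG_ON, and not part of
the patch):

  /* NR_SOFTIRQS is 10 in current kernels, so 16 bits are plenty. */
  #define NR_SOFTIRQS 10

  _Static_assert(NR_SOFTIRQS <= 16,
                 "softirq pending bits must fit into a u16");

  int main(void) { return 0; }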

Suggested-by: Paolo Bonzini 
Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/hardirq.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -6,7 +6,7 @@
 #include 
 
 typedef struct {
-   unsigned int __softirq_pending;
+   u16  __softirq_pending;
unsigned int __nmi_count;   /* arch dependent */
 #ifdef CONFIG_X86_LOCAL_APIC
unsigned int apic_timer_irqs;   /* arch dependent */




[PATCH 4.14 101/104] x86/mm/pat: Make set_memory_np() L1TF safe

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 958f79b9ee55dfaf00c8106ed1c22a2919e0028b upstream

set_memory_np() is used to mark kernel mappings not present, but it has
its own open-coded mechanism which does not have the L1TF protection of
inverting the address bits.

Replace the open coded PTE manipulation with the L1TF protecting low level
PTE routines.

Passes the CPA self test.

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/mm/pageattr.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -1006,8 +1006,8 @@ static long populate_pmd(struct cpa_data
 
pmd = pmd_offset(pud, start);
 
-   set_pmd(pmd, __pmd(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
-  massage_pgprot(pmd_pgprot)));
+   set_pmd(pmd, pmd_mkhuge(pfn_pmd(cpa->pfn,
+   canon_pgprot(pmd_pgprot))));
 
start += PMD_SIZE;
cpa->pfn  += PMD_SIZE >> PAGE_SHIFT;
@@ -1079,8 +1079,8 @@ static int populate_pud(struct cpa_data
 * Map everything starting from the Gb boundary, possibly with 1G pages
 */
while (boot_cpu_has(X86_FEATURE_GBPAGES) && end - start >= PUD_SIZE) {
-   set_pud(pud, __pud(cpa->pfn << PAGE_SHIFT | _PAGE_PSE |
-  massage_pgprot(pud_pgprot)));
+   set_pud(pud, pud_mkhuge(pfn_pud(cpa->pfn,
+  canon_pgprot(pud_pgprot))));
 
start += PUD_SIZE;
cpa->pfn  += PUD_SIZE >> PAGE_SHIFT;




[PATCH 4.14 097/104] KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Paolo Bonzini 

commit 5b76a3cff011df2dcb6186c965a2e4d809a05ad4 upstream

When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry.  Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.

Add the relevant Documentation.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/admin-guide/l1tf.rst |   21 +
 arch/x86/include/asm/kvm_host.h|1 +
 arch/x86/kvm/vmx.c |3 +--
 arch/x86/kvm/x86.c |   26 +-
 4 files changed, 48 insertions(+), 3 deletions(-)

--- a/Documentation/admin-guide/l1tf.rst
+++ b/Documentation/admin-guide/l1tf.rst
@@ -546,6 +546,27 @@ available:
 EPT can be disabled in the hypervisor via the 'kvm-intel.ept'
 parameter.
 
+3.4. Nested virtual machines
+
+
+When nested virtualization is in use, three operating systems are involved:
+the bare metal hypervisor, the nested hypervisor and the nested virtual
+machine.  VMENTER operations from the nested hypervisor into the nested
+guest will always be processed by the bare metal hypervisor. If KVM is the
+bare metal hypervisor it will:
+
+ - Flush the L1D cache on every switch from the nested hypervisor to the
+   nested virtual machine, so that the nested hypervisor's secrets are not
+   exposed to the nested virtual machine;
+
+ - Flush the L1D cache on every switch from the nested virtual machine to
+   the nested hypervisor; this is a complex operation, and flushing the L1D
+   cache prevents the bare metal hypervisor's secrets from being exposed to the
+   nested virtual machine;
+
+ - Instruct the nested hypervisor to not perform any L1D cache flush. This
+   is an optimization to avoid double L1D flushing.
+
 
 .. _default_mitigations:
 
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1374,6 +1374,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcp
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu);
 
+u64 kvm_get_arch_capabilities(void);
 void kvm_define_shared_msr(unsigned index, u32 msr);
 int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
 
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5910,8 +5910,7 @@ static int vmx_vcpu_setup(struct vcpu_vm
++vmx->nmsrs;
}
 
-   if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
-   rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
+   vmx->arch_capabilities = kvm_get_arch_capabilities();
 
vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
 
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1054,11 +1054,35 @@ static u32 msr_based_features[] = {
 
 static unsigned int num_msr_based_features;
 
+u64 kvm_get_arch_capabilities(void)
+{
+   u64 data;
+
+   rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
+
+   /*
+* If we're doing cache flushes (either "always" or "cond")
+* we will do one whenever the guest does a vmlaunch/vmresume.
+* If an outer hypervisor is doing the cache flush for us
+* (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
+* capability to the guest too, and if EPT is disabled we're not
+* vulnerable.  Overall, only VMENTER_L1D_FLUSH_NEVER will
+* require a nested hypervisor to do a flush of its own.
+*/
+   if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
+   data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
+
+   return data;
+}
+EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
+
 static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
 {
switch (msr->index) {
-   case MSR_IA32_UCODE_REV:
case MSR_IA32_ARCH_CAPABILITIES:
+   msr->data = kvm_get_arch_capabilities();
+   break;
+   case MSR_IA32_UCODE_REV:
+   rdmsrl_safe(msr->index, &msr->data);
break;
default:




[PATCH 4.14 102/104] x86/mm/kmmio: Make the tracer robust against L1TF

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 1063711b57393c1999248cccb57bebfaf16739e7 upstream

The mmio tracer sets io mapping PTEs and PMDs to non present when enabled
without inverting the address bits, which makes the PTE entry vulnerable
for L1TF.

Make it use the right low level macros to actually invert the address bits
to protect against L1TF.

In principle this could be avoided because MMIO tracing is not likely to be
enabled on production machines, but the fix is straightforward and for
consistency's sake it's better to get rid of the open-coded PTE manipulation.

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/mm/kmmio.c |   25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/arch/x86/mm/kmmio.c
+++ b/arch/x86/mm/kmmio.c
@@ -126,24 +126,29 @@ static struct kmmio_fault_page *get_kmmi
 
 static void clear_pmd_presence(pmd_t *pmd, bool clear, pmdval_t *old)
 {
+   pmd_t new_pmd;
pmdval_t v = pmd_val(*pmd);
if (clear) {
-   *old = v & _PAGE_PRESENT;
-   v &= ~_PAGE_PRESENT;
-   } else  /* presume this has been called with clear==true previously */
-   v |= *old;
-   set_pmd(pmd, __pmd(v));
+   *old = v;
+   new_pmd = pmd_mknotpresent(*pmd);
+   } else {
+   /* Presume this has been called with clear==true previously */
+   new_pmd = __pmd(*old);
+   }
+   set_pmd(pmd, new_pmd);
 }
 
 static void clear_pte_presence(pte_t *pte, bool clear, pteval_t *old)
 {
pteval_t v = pte_val(*pte);
if (clear) {
-   *old = v & _PAGE_PRESENT;
-   v &= ~_PAGE_PRESENT;
-   } else  /* presume this has been called with clear==true previously */
-   v |= *old;
-   set_pte_atomic(pte, __pte(v));
+   *old = v;
+   /* Nothing should care about address */
+   pte_clear(&init_mm, 0, pte);
+   } else {
+   /* Presume this has been called with clear==true previously */
+   set_pte_atomic(pte, __pte(*old));
+   }
 }
 
 static int clear_page_presence(struct kmmio_fault_page *f, bool clear)




[PATCH 4.14 075/104] cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

commit fee0aede6f4739c87179eca76136f83210953b86 upstream

The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support
SMT) when the sysfs SMT control file is initialized.

That was fine so far as this was only required to make the output of the
control file correct and to prevent writes in that case.

With the upcoming l1tf command line parameter, this needs to be set up
before the L1TF mitigation selection and command line parsing happens.

Signed-off-by: Thomas Gleixner 
Tested-by: Jiri Kosina 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Josh Poimboeuf 
Link: https://lkml.kernel.org/r/20180713142323.121795...@linutronix.de
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/bugs.c |6 ++
 include/linux/cpu.h|2 ++
 kernel/cpu.c   |   13 ++---
 3 files changed, 18 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -58,6 +58,12 @@ void __init check_bugs(void)
 {
identify_boot_cpu();
 
+   /*
+* identify_boot_cpu() initialized SMT support information, let the
+* core code know.
+*/
+   cpu_smt_check_topology();
+
if (!IS_ENABLED(CONFIG_SMP)) {
pr_info("CPU: ");
		print_cpu_info(&boot_cpu_data);
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -188,9 +188,11 @@ enum cpuhp_smt_control {
 #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
 extern enum cpuhp_smt_control cpu_smt_control;
 extern void cpu_smt_disable(bool force);
+extern void cpu_smt_check_topology(void);
 #else
 # define cpu_smt_control   (CPU_SMT_ENABLED)
 static inline void cpu_smt_disable(bool force) { }
+static inline void cpu_smt_check_topology(void) { }
 #endif
 
 #endif /* _LINUX_CPU_H_ */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -365,6 +365,16 @@ void __init cpu_smt_disable(bool force)
}
 }
 
+/*
+ * The decision whether SMT is supported can only be done after the full
+ * CPU identification. Called from architecture code.
+ */
+void __init cpu_smt_check_topology(void)
+{
+   if (!topology_smt_supported())
+   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
+}
+
 static int __init smt_cmdline_disable(char *str)
 {
cpu_smt_disable(str && !strcmp(str, "force"));
@@ -2127,9 +2137,6 @@ static const struct attribute_group cpuh
 
 static int __init cpu_smt_state_init(void)
 {
-   if (!topology_smt_supported())
-   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
-
	return sysfs_create_group(&cpu_subsys.dev_root->kobj,
				  &cpuhp_smt_attr_group);
 }




[PATCH 4.14 094/104] KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Paolo Bonzini 

commit cd28325249a1ca0d771557ce823e0308ad629f98 upstream

This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all
requested features are available on the host.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kvm/x86.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1049,6 +1049,7 @@ static unsigned num_emulated_msrs;
 static u32 msr_based_features[] = {
MSR_F10H_DECFG,
MSR_IA32_UCODE_REV,
+   MSR_IA32_ARCH_CAPABILITIES,
 };
 
 static unsigned int num_msr_based_features;
@@ -1057,7 +1058,8 @@ static int kvm_get_msr_feature(struct kv
 {
switch (msr->index) {
case MSR_IA32_UCODE_REV:
-   rdmsrl(msr->index, msr->data);
+   case MSR_IA32_ARCH_CAPABILITIES:
+   rdmsrl_safe(msr->index, &msr->data);
break;
default:
if (kvm_x86_ops->get_msr_feature(msr))




[PATCH 4.14 076/104] x86/bugs, kvm: Introduce boot-time control of L1TF mitigations

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Jiri Kosina 

commit d90a7a0ec83fb86622cd7dae23255d3c50a99ec8 upstream

Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.

  full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.

  flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.

  flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.

  flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.

  off
Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.
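
A stand-alone sketch of how the documented strings might map onto
mitigation states (hypothetical enum and function names; the kernel's
actual parser and enum are not part of this excerpt):

  #include <stdio.h>
  #include <string.h>

  enum l1tf_mitigations {
          L1TF_MITIGATION_OFF,
          L1TF_MITIGATION_FLUSH_NOWARN,
          L1TF_MITIGATION_FLUSH,
          L1TF_MITIGATION_FLUSH_NOSMT,
          L1TF_MITIGATION_FULL,
          L1TF_MITIGATION_FULL_FORCE,
  };

  static enum l1tf_mitigations parse_l1tf(const char *s)
  {
          if (!strcmp(s, "off"))          return L1TF_MITIGATION_OFF;
          if (!strcmp(s, "flush,nowarn")) return L1TF_MITIGATION_FLUSH_NOWARN;
          if (!strcmp(s, "flush,nosmt"))  return L1TF_MITIGATION_FLUSH_NOSMT;
          if (!strcmp(s, "full,force"))   return L1TF_MITIGATION_FULL_FORCE;
          if (!strcmp(s, "full"))         return L1TF_MITIGATION_FULL;
          /* Anything else falls back to the default, "flush". */
          return L1TF_MITIGATION_FLUSH;
  }

  int main(void)
  {
          printf("l1tf=full,force -> %d\n", parse_l1tf("full,force"));
          printf("l1tf=bogus      -> %d (flush)\n", parse_l1tf("bogus"));
          return 0;
  }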

Let KVM adhere to these semantics, which means:

  - 'l1tf=full,force'   : Perform L1D flushes. No runtime control
  possible.

  - 'l1tf=full'
  - 'l1tf=flush'
  - 'l1tf=flush,nosmt'  : Perform L1D flushes and warn on VM start if
  SMT has been runtime enabled or L1D flushing
  has been runtime disabled

  - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.

  - 'l1tf=off'  : L1D flushes are not performed and no warnings
  are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when l1tf=full,force is set.

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.

Signed-off-by: Jiri Kosina 
Signed-off-by: Thomas Gleixner 
Tested-by: Jiri Kosina 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Josh Poimboeuf 
Link: https://lkml.kernel.org/r/20180713142323.202758...@linutronix.de
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |4 +
 Documentation/admin-guide/kernel-parameters.txt|   68 +++--
 arch/x86/include/asm/processor.h   |   12 +++
 arch/x86/kernel/cpu/bugs.c |   44 +
 arch/x86/kvm/vmx.c |   56 +
 5 files changed, 165 insertions(+), 19 deletions(-)

--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -379,6 +379,7 @@ What:   /sys/devices/system/cpu/vulnerabi
/sys/devices/system/cpu/vulnerabilities/spectre_v1
/sys/devices/system/cpu/vulnerabilities/spectre_v2
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
+   /sys/devices/system/cpu/vulnerabilities/l1tf
 Date:  January 2018
 Contact:   Linux kernel mailing list 
 Description:   Information about CPU vulnerabilities
@@ -391,6 +392,9 @@ Description:Information about CPU vulne
"Vulnerable"  CPU is affected and no mitigation in effect
"Mitigation: $M"  CPU is affected and mitigation $M is in effect
 
+   Details about the l1tf file can be found in
+   Documentation/admin-guide/l1tf.rst
+
 What:  /sys/devices/system/cpu/smt
/sys/devices/system/cpu/smt/active
/sys/devices/system/cpu/smt/control
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1867,12 +1867,6 @@
[KVM,ARM] Trap guest accesses to GICv3 common
system registers
 
-   kvm-intel.nosmt=[KVM,Intel] If the L1TF CPU bug is present 
(CVE-2018-3620)
-   and the system has SMT (aka Hyper-Threading) enabled 
then
-   don't allow guests to be created.
-
-   Default is 0 (allow guests to be created).
-
kvm-intel.ept=  [KVM,Intel] 

[PATCH 4.14 082/104] x86/KVM/VMX: Replace vmx_l1d_flush_always with vmx_l1d_flush_cond

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 427362a142441f08051369db6fbe7f61c73b3dca upstream

The vmx_l1d_flush_always static key is only ever evaluated if
vmx_l1d_should_flush is enabled. In that case however, there are only two
L1d flushing modes possible: "always" and "conditional".

The "conditional" mode's implementation tends to require more sophisticated
logic than the "always" mode.

Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
with a 'vmx_l1d_flush_cond' one.

There is no change in functionality.

Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kvm/vmx.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -192,7 +192,7 @@ module_param(ple_window_max, int, S_IRUG
 extern const ulong vmx_return;
 
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_always);
+static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
@@ -266,10 +266,10 @@ static int vmx_setup_l1d_flush(enum vmx_
else
		static_branch_disable(&vmx_l1d_should_flush);
 
-   if (l1tf == VMENTER_L1D_FLUSH_ALWAYS)
-   static_branch_enable(&vmx_l1d_flush_always);
+   if (l1tf == VMENTER_L1D_FLUSH_COND)
+   static_branch_enable(&vmx_l1d_flush_cond);
else
-   static_branch_disable(&vmx_l1d_flush_always);
+   static_branch_disable(&vmx_l1d_flush_cond);
return 0;
 }
 
@@ -9167,7 +9167,7 @@ static void vmx_l1d_flush(struct kvm_vcp
 * This code is only executed when the flush mode is 'cond' or
 * 'always'
 */
-   if (!static_branch_unlikely(&vmx_l1d_flush_always)) {
+   if (static_branch_likely(&vmx_l1d_flush_cond)) {
/*
 * Clear the flush bit, it gets set again either from
 * vcpu_run() or from one of the unsafe VMEXIT




[PATCH 4.14 099/104] x86/speculation/l1tf: Invert all not present mappings

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit f22cc87f6c1f771b57c407555cfefd811cdd9507 upstream

For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitly checks for !PRESENT and
PROT_NONE.

Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.
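
For orientation, a stand-alone sketch of the inversion idea (simplified
and user-space; the real helpers live in pgtable-invert.h and differ in
detail, and the 46-bit physical address width is an assumption): a
not-present entry gets its address bits XORed with an all-ones mask, so
the stored value no longer points at cacheable memory:

  #include <stdio.h>
  #include <stdint.h>

  #define PAGE_SHIFT     12
  #define _PAGE_PRESENT  0x001ULL
  #define PHYS_BITS      46    /* assumed physical address width */
  #define PFN_MASK       (((1ULL << (PHYS_BITS - PAGE_SHIFT)) - 1) << PAGE_SHIFT)

  /* Any not-present entry needs its address bits inverted. */
  static int pte_needs_invert(uint64_t val)
  {
          return !(val & _PAGE_PRESENT);
  }

  int main(void)
  {
          uint64_t pte = 0x1234000ULL;  /* not present, encodes PFN 0x1234 */
          uint64_t mask = pte_needs_invert(pte) ? PFN_MASK : 0;
          uint64_t stored = pte ^ mask;

          /* The stored address bits now sit near the top of the physical
           * address space instead of pointing at PFN 0x1234. */
          printf("raw entry    : %#llx\n", (unsigned long long)pte);
          printf("stored entry : %#llx\n", (unsigned long long)stored);
          return 0;
  }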

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable-invert.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/include/asm/pgtable-invert.h
+++ b/arch/x86/include/asm/pgtable-invert.h
@@ -6,7 +6,7 @@
 
 static inline bool __pte_needs_invert(u64 val)
 {
-   return (val & (_PAGE_PRESENT|_PAGE_PROTNONE)) == _PAGE_PROTNONE;
+   return !(val & _PAGE_PRESENT);
 }
 
 /* Get a mask to xor with the page table entry to get the correct pfn. */




[PATCH 4.14 085/104] x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 45b575c00d8e72d69d75dd8c112f044b7b01b069 upstream

Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
VMENTRY.

L1D flushes are costly and two modes of operations are provided to users:
"always" and the more selective "conditional" mode.

If operating in the latter, the cache would get flushed only if a host side
code path considered unconfined had been traversed. "Unconfined" in this
context means that it might have pulled in sensitive data like user data
or kernel crypto keys.

The need for L1D flushes is tracked by means of the per-vcpu flag
l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
L1d flush based on its value and reset that flag again.

Currently, interrupts delivered "normally" while in root operation between
VMEXIT and VMENTER are not taken into account. Part of the reason is that
these don't leave any traces and thus the vmx code is unable to tell if
any such interrupt has happened.

As proposed by Paolo Bonzini, prepare for tracking all interrupts by
introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It is a close
analogue of the per-vcpu ->l1tf_flush_l1d.

A later patch will make interrupt handlers set it.

For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
per-cpu irq_cpustat_t as suggested by Peter Zijlstra.

Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
trivial or non-existent, respectively, for !CONFIG_KVM_INTEL.

Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
l1tf_flush_l1d.

Suggested-by: Paolo Bonzini 
Suggested-by: Peter Zijlstra 
Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/hardirq.h |   23 +++
 arch/x86/kvm/vmx.c |   17 +
 2 files changed, 36 insertions(+), 4 deletions(-)

--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -7,6 +7,9 @@
 
 typedef struct {
u16  __softirq_pending;
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+   u8   kvm_cpu_l1tf_flush_l1d;
+#endif
unsigned int __nmi_count;   /* arch dependent */
 #ifdef CONFIG_X86_LOCAL_APIC
unsigned int apic_timer_irqs;   /* arch dependent */
@@ -62,4 +65,24 @@ extern u64 arch_irq_stat_cpu(unsigned in
 extern u64 arch_irq_stat(void);
 #define arch_irq_stat  arch_irq_stat
 
+
+#if IS_ENABLED(CONFIG_KVM_INTEL)
+static inline void kvm_set_cpu_l1tf_flush_l1d(void)
+{
+   __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
+}
+
+static inline void kvm_clear_cpu_l1tf_flush_l1d(void)
+{
+   __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 0);
+}
+
+static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
+{
+   return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
+}
+#else /* !IS_ENABLED(CONFIG_KVM_INTEL) */
+static inline void kvm_set_cpu_l1tf_flush_l1d(void) { }
+#endif /* IS_ENABLED(CONFIG_KVM_INTEL) */
+
 #endif /* _ASM_X86_HARDIRQ_H */
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -9168,14 +9168,23 @@ static void vmx_l1d_flush(struct kvm_vcp
 * 'always'
 */
	if (static_branch_likely(&vmx_l1d_flush_cond)) {
-   bool flush_l1d = vcpu->arch.l1tf_flush_l1d;
+   bool flush_l1d;
 
/*
-* Clear the flush bit, it gets set again either from
-* vcpu_run() or from one of the unsafe VMEXIT
-* handlers.
+* Clear the per-vcpu flush bit, it gets set again
+* either from vcpu_run() or from one of the unsafe
+* VMEXIT handlers.
 */
+   flush_l1d = vcpu->arch.l1tf_flush_l1d;
vcpu->arch.l1tf_flush_l1d = false;
+
+   /*
+* Clear the per-cpu flush bit, it gets set again from
+* the interrupt handlers.
+*/
+   flush_l1d |= kvm_get_cpu_l1tf_flush_l1d();
+   kvm_clear_cpu_l1tf_flush_l1d();
+
if (!flush_l1d)
return;
}




Re: [PATCH RFC] Make call_srcu() available during very early boot

2018-08-14 Thread Paul E. McKenney
On Tue, Aug 14, 2018 at 01:24:53PM -0400, Steven Rostedt wrote:
> On Tue, 14 Aug 2018 10:06:18 -0700
> "Paul E. McKenney"  wrote:
> 
> 
> > > >  #define __SRCU_STRUCT_INIT(name, pcpu_name)
> > > > \
> > > > -   {   
> > > > \
> > > > -   .sda = _name,  
> > > > \
> > > > -   .lock = __SPIN_LOCK_UNLOCKED(name.lock),
> > > > \
> > > > -   .srcu_gp_seq_needed = 0 - 1,
> > > > \
> > > > -   __SRCU_DEP_MAP_INIT(name)   
> > > > \
> > > > -   }
> > > > +{  
> > > > \
> > > > +   .sda = _name,  
> > > > \
> > > > +   .lock = __SPIN_LOCK_UNLOCKED(name.lock),
> > > > \
> > > > +   .srcu_gp_seq_needed = 0 - 1,
> > > > \  
> > > 
> > > Interesting initialization of -1. This was there before, but still
> > > interesting none the less.  
> > 
> > If I recall correctly, this subterfuge suppresses compiler complaints
> > about initializing an unsigned long with a negative number.  :-/
> 
> Did you try:
> 
>   .srcu_gp_seq_needed = -1UL,
> 
> ?

Works for my compiler, not sure what set of complaints pushed me in that
direction.

Thanx, Paul



[PATCH 4.17 53/97] x86/KVM/VMX: Add module argument for L1TF mitigation

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Konrad Rzeszutek Wilk 

commit a399477e52c17e148746d3ce9a483f681c2aa9a0 upstream

Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
L1 terminal fault. The valid arguments are:

 - "always" L1D cache flush on every VMENTER.
 - "cond"   Conditional L1D cache flush, explained below
 - "never"  Disable the L1D cache flush mitigation

"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might be exploited.

[ tglx: Split out from a larger patch ]

Signed-off-by: Konrad Rzeszutek Wilk 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/admin-guide/kernel-parameters.txt |   12 
 arch/x86/kvm/vmx.c  |   59 
 2 files changed, 71 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1964,6 +1964,18 @@
(virtualized real and unpaged mode) on capable
Intel chips. Default is 1 (enabled)
 
+   kvm-intel.vmentry_l1d_flush=[KVM,Intel] Mitigation for L1 Terminal Fault
+   CVE-2018-3620.
+
+   Valid arguments: never, cond, always
+
+   always: L1D cache flush on every VMENTER.
+   cond:   Flush L1D on VMENTER only when the code between
+   VMEXIT and VMENTER can leak host memory.
+   never:  Disables the mitigation
+
+   Default is cond (do L1 cache flush in specific 
instances)
+
kvm-intel.vpid= [KVM,Intel] Disable Virtual Processor Identification
feature (tagged TLBs) on capable Intel chips.
Default is 1 (enabled)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -191,6 +191,54 @@ module_param(ple_window_max, uint, 0444)
 
 extern const ulong vmx_return;
 
+static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
+
+/* These MUST be in sync with vmentry_l1d_param order. */
+enum vmx_l1d_flush_state {
+   VMENTER_L1D_FLUSH_NEVER,
+   VMENTER_L1D_FLUSH_COND,
+   VMENTER_L1D_FLUSH_ALWAYS,
+};
+
+static enum vmx_l1d_flush_state __read_mostly vmentry_l1d_flush = 
VMENTER_L1D_FLUSH_COND;
+
+static const struct {
+   const char *option;
+   enum vmx_l1d_flush_state cmd;
+} vmentry_l1d_param[] = {
+   {"never",   VMENTER_L1D_FLUSH_NEVER},
+   {"cond",VMENTER_L1D_FLUSH_COND},
+   {"always",  VMENTER_L1D_FLUSH_ALWAYS},
+};
+
+static int vmentry_l1d_flush_set(const char *s, const struct kernel_param *kp)
+{
+   unsigned int i;
+
+   if (!s)
+   return -EINVAL;
+
+   for (i = 0; i < ARRAY_SIZE(vmentry_l1d_param); i++) {
+   if (!strcmp(s, vmentry_l1d_param[i].option)) {
+   vmentry_l1d_flush = vmentry_l1d_param[i].cmd;
+   return 0;
+   }
+   }
+
+   return -EINVAL;
+}
+
+static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
+{
+   return sprintf(s, "%s\n", vmentry_l1d_param[vmentry_l1d_flush].option);
+}
+
+static const struct kernel_param_ops vmentry_l1d_flush_ops = {
+   .set = vmentry_l1d_flush_set,
+   .get = vmentry_l1d_flush_get,
+};
+module_param_cb(vmentry_l1d_flush, &vmentry_l1d_flush_ops, &vmentry_l1d_flush, S_IRUGO);
+
 struct kvm_vmx {
struct kvm kvm;
 
@@ -12881,6 +12929,15 @@ static struct kvm_x86_ops vmx_x86_ops __
.enable_smi_window = enable_smi_window,
 };
 
+static void __init vmx_setup_l1d_flush(void)
+{
+   if (vmentry_l1d_flush == VMENTER_L1D_FLUSH_NEVER ||
+   !boot_cpu_has_bug(X86_BUG_L1TF))
+   return;
+
+   static_branch_enable(&vmx_l1d_should_flush);
+}
+
 static int __init vmx_init(void)
 {
int r;
@@ -12914,6 +12971,8 @@ static int __init vmx_init(void)
}
 #endif
 
+   vmx_setup_l1d_flush();
+
	r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
  __alignof__(struct vcpu_vmx), THIS_MODULE);
if (r)




[PATCH 4.17 73/97] Documentation: Add section about CPU vulnerabilities

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

commit 3ec8ce5d866ec6a08a9cfab82b62acf4a830b35f upstream

Add documentation for the L1TF vulnerability and the mitigation mechanisms:

  - Explain the problem and risks
  - Document the mitigation mechanisms
  - Document the command line controls
  - Document the sysfs files

Signed-off-by: Thomas Gleixner 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Josh Poimboeuf 
Acked-by: Linus Torvalds 
Link: https://lkml.kernel.org/r/20180713142323.287429...@linutronix.de
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/admin-guide/index.rst |9 
 Documentation/admin-guide/l1tf.rst  |  591 
 2 files changed, 600 insertions(+)
 create mode 100644 Documentation/admin-guide/l1tf.rst

--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -17,6 +17,15 @@ etc.
kernel-parameters
devices
 
+This section describes CPU vulnerabilities and provides an overview of the
+possible mitigations along with guidance for selecting mitigations if they
+are configurable at compile, boot or run time.
+
+.. toctree::
+   :maxdepth: 1
+
+   l1tf
+
 Here is a set of documents aimed at users who are trying to track down
 problems and bugs in particular.
 
--- /dev/null
+++ b/Documentation/admin-guide/l1tf.rst
@@ -0,0 +1,591 @@
+L1TF - L1 Terminal Fault
+
+
+L1 Terminal Fault is a hardware vulnerability which allows unprivileged
+speculative access to data which is available in the Level 1 Data Cache
+when the page table entry controlling the virtual address, which is used
+for the access, has the Present bit cleared or other reserved bits set.
+
+Affected processors
+---
+
+This vulnerability affects a wide range of Intel processors. The
+vulnerability is not present on:
+
+   - Processors from AMD, Centaur and other non Intel vendors
+
+   - Older processor models, where the CPU family is < 6
+
+   - A range of Intel ATOM processors (Cedarview, Cloverview, Lincroft,
+ Penwell, Pineview, Silvermont, Airmont, Merrifield)
+
+   - The Intel Core Duo Yonah variants (2006 - 2008)
+
+   - The Intel XEON PHI family
+
+   - Intel processors which have the ARCH_CAP_RDCL_NO bit set in the
+ IA32_ARCH_CAPABILITIES MSR. If the bit is set the CPU is not affected
+ by the Meltdown vulnerability either. These CPUs should become
+ available by end of 2018.
+
+Whether a processor is affected or not can be read out from the L1TF
+vulnerability file in sysfs. See :ref:`l1tf_sys_info`.
+
+Related CVEs
+
+
+The following CVE entries are related to the L1TF vulnerability:
+
+   =  =  ==
+   CVE-2018-3615  L1 Terminal Fault  SGX related aspects
+   CVE-2018-3620  L1 Terminal Fault  OS, SMM related aspects
+   CVE-2018-3646  L1 Terminal Fault  Virtualization related aspects
+   =  =  ==
+
+Problem
+---
+
+If an instruction accesses a virtual address for which the relevant page
+table entry (PTE) has the Present bit cleared or other reserved bits set,
+then speculative execution ignores the invalid PTE and loads the referenced
+data if it is present in the Level 1 Data Cache, as if the page referenced
+by the address bits in the PTE was still present and accessible.
+
+While this is a purely speculative mechanism and the instruction will raise
+a page fault when it is retired eventually, the pure act of loading the
+data and making it available to other speculative instructions opens up the
+opportunity for side channel attacks to unprivileged malicious code,
+similar to the Meltdown attack.
+
+While Meltdown breaks the user space to kernel space protection, L1TF
+allows attacking any physical memory address in the system, and the attack
+works across all protection domains. It allows an attack on SGX and also
+works from inside virtual machines because the speculation bypasses the
+extended page table (EPT) protection mechanism.
+
+
+Attack scenarios
+
+
+1. Malicious user space
+^^^
+
+   Operating Systems store arbitrary information in the address bits of a
+   PTE which is marked non present. This allows a malicious user space
+   application to attack the physical memory to which these PTEs resolve.
+   In some cases user-space can maliciously influence the information
+   encoded in the address bits of the PTE, thus making attacks more
+   deterministic and more practical.
+
+   The Linux kernel contains a mitigation for this attack vector, PTE
+   inversion, which is permanently enabled and has no performance
+   impact. The kernel ensures that the address bits of PTEs, which are not
+   marked present, never point to cacheable physical memory space.
+
+   A system with an up to date kernel is protected against attacks from
+   malicious 

[PATCH 4.17 97/97] x86/microcode: Allow late microcode loading with SMT disabled

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Josh Poimboeuf 

commit 07d981ad4cf1e78361c6db1c28ee5ba105f96cc1 upstream

The kernel unnecessarily prevents late microcode loading when SMT is
disabled.  It should be safe to allow it if all the primary threads are
online.

Signed-off-by: Josh Poimboeuf 
Acked-by: Borislav Petkov 
Signed-off-by: David Woodhouse 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/microcode/core.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/cpu/microcode/core.c
+++ b/arch/x86/kernel/cpu/microcode/core.c
@@ -509,12 +509,20 @@ static struct platform_device *microcode
 
 static int check_online_cpus(void)
 {
-   if (num_online_cpus() == num_present_cpus())
-   return 0;
+   unsigned int cpu;
 
-   pr_err("Not all CPUs online, aborting microcode update.\n");
+   /*
+* Make sure all CPUs are online.  It's fine for SMT to be disabled if
+* all the primary threads are still online.
+*/
+   for_each_present_cpu(cpu) {
+   if (topology_is_primary_thread(cpu) && !cpu_online(cpu)) {
+   pr_err("Not all CPUs online, aborting microcode 
update.\n");
+   return -EINVAL;
+   }
+   }
 
-   return -EINVAL;
+   return 0;
 }
 
 static atomic_t late_cpus_in;




[PATCH 4.14 001/104] parisc: Enable CONFIG_MLONGCALLS by default

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Helge Deller 

commit 66509a276c8c1d19ee3f661a41b418d101c57d29 upstream.

Enable the -mlong-calls compiler option by default, because otherwise in most
cases linking the vmlinux binary fails due to truncations of R_PARISC_PCREL22F
relocations. This fixes building the 64-bit defconfig.

Cc: sta...@vger.kernel.org # 4.0+
Signed-off-by: Helge Deller 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/parisc/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -201,7 +201,7 @@ config PREFETCH
 
 config MLONGCALLS
bool "Enable the -mlong-calls compiler option for big kernels"
-   def_bool y if (!MODULES)
+   default y
depends on PA8X00
help
  If you configure the kernel to include many drivers built-in instead




[PATCH 4.17 88/97] x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Paolo Bonzini 

commit 8e0b2b916662e09dd4d09e5271cdf214c6b80e62 upstream

Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
not needed.  Add a new value to enum vmx_l1d_flush_state, which is used
either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/msr-index.h |1 +
 arch/x86/include/asm/vmx.h   |1 +
 arch/x86/kernel/cpu/bugs.c   |1 +
 arch/x86/kvm/vmx.c   |   10 ++
 4 files changed, 13 insertions(+)

--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -70,6 +70,7 @@
 #define MSR_IA32_ARCH_CAPABILITIES 0x010a
 #define ARCH_CAP_RDCL_NO   (1 << 0)   /* Not susceptible to Meltdown */
 #define ARCH_CAP_IBRS_ALL  (1 << 1)   /* Enhanced IBRS support */
+#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1 << 3)   /* Skip L1D flush on vmentry */
 #define ARCH_CAP_SSB_NO    (1 << 4)   /*
                                        * Not susceptible to Speculative Store Bypass
                                        * attack, so no Speculative Store Bypass
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -580,6 +580,7 @@ enum vmx_l1d_flush_state {
VMENTER_L1D_FLUSH_COND,
VMENTER_L1D_FLUSH_ALWAYS,
VMENTER_L1D_FLUSH_EPT_DISABLED,
+   VMENTER_L1D_FLUSH_NOT_REQUIRED,
 };
 
 extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -731,6 +731,7 @@ static const char *l1tf_vmx_states[] = {
[VMENTER_L1D_FLUSH_COND]= "conditional cache flushes",
[VMENTER_L1D_FLUSH_ALWAYS]  = "cache flushes",
[VMENTER_L1D_FLUSH_EPT_DISABLED]= "EPT disabled",
+   [VMENTER_L1D_FLUSH_NOT_REQUIRED]= "flush not necessary"
 };
 
 static ssize_t l1tf_show_state(char *buf)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -218,6 +218,16 @@ static int vmx_setup_l1d_flush(enum vmx_
return 0;
}
 
+   if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES)) {
+  u64 msr;
+
+  rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
+  if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
+  l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
+  return 0;
+  }
+   }
+
/* If set to auto use the default l1tf mitigation method */
if (l1tf == VMENTER_L1D_FLUSH_AUTO) {
switch (l1tf_mitigation) {
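
The bit tested by the hunk above can also be inspected from user space through the msr driver. A hedged sketch, assuming the msr module is loaded and the program runs as root; it is not part of the patch:

/*
 * Sketch: read IA32_ARCH_CAPABILITIES (MSR 0x10a) via the msr driver
 * (/dev/cpu/0/msr, needs 'modprobe msr' and root) and test bit 3, which
 * the patch above maps to VMENTER_L1D_FLUSH_NOT_REQUIRED.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_IA32_ARCH_CAPABILITIES	0x10a
#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH	(1ULL << 3)

int main(void)
{
	uint64_t val;
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr (is the msr module loaded?)");
		return 1;
	}
	/* the MSR number is the file offset for the msr device */
	if (pread(fd, &val, sizeof(val), MSR_IA32_ARCH_CAPABILITIES) != sizeof(val)) {
		perror("pread");
		close(fd);
		return 1;
	}
	close(fd);
	printf("ARCH_CAPABILITIES = %#llx, skip-L1D-flush bit %s\n",
	       (unsigned long long)val,
	       (val & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) ? "set" : "clear");
	return 0;
}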




[PATCH 4.17 79/97] x86/KVM/VMX: Replace vmx_l1d_flush_always with vmx_l1d_flush_cond

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 427362a142441f08051369db6fbe7f61c73b3dca upstream

The vmx_l1d_flush_always static key is only ever evaluated if
vmx_l1d_should_flush is enabled. In that case however, there are only two
L1d flushing modes possible: "always" and "conditional".

The "conditional" mode's implementation tends to require more sophisticated
logic than the "always" mode.

Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
with a 'vmx_l1d_flush_cond' one.

There is no change in functionality.

Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kvm/vmx.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -189,7 +189,7 @@ module_param(ple_window_max, uint, 0444)
 extern const ulong vmx_return;
 
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
-static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_always);
+static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
 static DEFINE_MUTEX(vmx_l1d_flush_mutex);
 
 /* Storage for pre module init parameter parsing */
@@ -263,10 +263,10 @@ static int vmx_setup_l1d_flush(enum vmx_
else
static_branch_disable(&vmx_l1d_should_flush);
 
-   if (l1tf == VMENTER_L1D_FLUSH_ALWAYS)
-   static_branch_enable(&vmx_l1d_flush_always);
+   if (l1tf == VMENTER_L1D_FLUSH_COND)
+   static_branch_enable(&vmx_l1d_flush_cond);
else
-   static_branch_disable(&vmx_l1d_flush_always);
+   static_branch_disable(&vmx_l1d_flush_cond);
return 0;
 }
 
@@ -9462,7 +9462,7 @@ static void vmx_l1d_flush(struct kvm_vcp
 * This code is only executed when the the flush mode is 'cond' or
 * 'always'
 */
-   if (!static_branch_unlikely(&vmx_l1d_flush_always)) {
+   if (static_branch_likely(&vmx_l1d_flush_cond)) {
/*
 * Clear the flush bit, it gets set again either from
 * vcpu_run() or from one of the unsafe VMEXIT
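
The static key pattern this patch switches to looks roughly as follows in any kernel code. The sketch below is a made-up minimal module using the jump_label API, not KVM code:

/*
 * Minimal module-style sketch of the static key pattern used above
 * (DEFINE_STATIC_KEY_FALSE + static_branch_likely). Illustration only;
 * the key name and parameter are invented for the example.
 */
#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/kernel.h>
#include <linux/module.h>

static DEFINE_STATIC_KEY_FALSE(demo_flush_cond);

static bool demo_param;
module_param(demo_param, bool, 0444);

static void demo_hot_path(void)
{
	/* Compiles to a patched no-op/jump, not a memory load plus test. */
	if (static_branch_likely(&demo_flush_cond))
		pr_info("conditional path taken\n");
	else
		pr_info("unconditional path taken\n");
}

static int __init demo_init(void)
{
	if (demo_param)
		static_branch_enable(&demo_flush_cond);
	else
		static_branch_disable(&demo_flush_cond);
	demo_hot_path();
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");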




[PATCH 4.17 96/97] tools headers: Synchronise x86 cpufeatures.h for L1TF additions

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: David Woodhouse 

commit e24f14b0ff985f3e09e573ba1134bfdf42987e05 upstream

[ ... and some older changes in the 4.17.y backport too ...]
Signed-off-by: David Woodhouse 
Signed-off-by: Greg Kroah-Hartman 
---
 tools/arch/x86/include/asm/cpufeatures.h |   23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -198,7 +198,6 @@
 #define X86_FEATURE_CAT_L2 ( 7*32+ 5) /* Cache Allocation 
Technology L2 */
 #define X86_FEATURE_CDP_L3 ( 7*32+ 6) /* Code and Data 
Prioritization L3 */
 #define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && 
CR4.PCIDE=1 */
-
 #define X86_FEATURE_HW_PSTATE  ( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK  ( 7*32+ 9) /* AMD ProcFeedbackInterface 
*/
 #define X86_FEATURE_SME( 7*32+10) /* AMD Secure Memory 
Encryption */
@@ -207,13 +206,20 @@
 #define X86_FEATURE_RETPOLINE_AMD  ( 7*32+13) /* "" AMD Retpoline 
mitigation for Spectre variant 2 */
 #define X86_FEATURE_INTEL_PPIN ( 7*32+14) /* Intel Processor Inventory 
Number */
 #define X86_FEATURE_CDP_L2 ( 7*32+15) /* Code and Data 
Prioritization L2 */
-
+#define X86_FEATURE_MSR_SPEC_CTRL  ( 7*32+16) /* "" MSR SPEC_CTRL is 
implemented */
+#define X86_FEATURE_SSBD   ( 7*32+17) /* Speculative Store Bypass 
Disable */
 #define X86_FEATURE_MBA( 7*32+18) /* Memory Bandwidth 
Allocation */
 #define X86_FEATURE_RSB_CTXSW  ( 7*32+19) /* "" Fill RSB on context 
switches */
 #define X86_FEATURE_SEV( 7*32+20) /* AMD Secure 
Encrypted Virtualization */
-
 #define X86_FEATURE_USE_IBPB   ( 7*32+21) /* "" Indirect Branch 
Prediction Barrier enabled */
 #define X86_FEATURE_USE_IBRS_FW( 7*32+22) /* "" Use IBRS 
during runtime firmware calls */
+#define X86_FEATURE_SPEC_STORE_BYPASS_DISABLE  ( 7*32+23) /* "" Disable 
Speculative Store Bypass. */
+#define X86_FEATURE_LS_CFG_SSBD( 7*32+24)  /* "" AMD SSBD 
implementation via LS_CFG MSR */
+#define X86_FEATURE_IBRS   ( 7*32+25) /* Indirect Branch 
Restricted Speculation */
+#define X86_FEATURE_IBPB   ( 7*32+26) /* Indirect Branch 
Prediction Barrier */
+#define X86_FEATURE_STIBP  ( 7*32+27) /* Single Thread Indirect 
Branch Predictors */
+#define X86_FEATURE_ZEN( 7*32+28) /* "" CPU is AMD 
family 0x17 (Zen) */
+#define X86_FEATURE_L1TF_PTEINV( 7*32+29) /* "" L1TF 
workaround PTE inversion */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
@@ -274,9 +280,10 @@
 #define X86_FEATURE_CLZERO (13*32+ 0) /* CLZERO instruction */
 #define X86_FEATURE_IRPERF (13*32+ 1) /* Instructions Retired 
Count */
 #define X86_FEATURE_XSAVEERPTR (13*32+ 2) /* Always save/restore FP 
error pointers */
-#define X86_FEATURE_IBPB   (13*32+12) /* Indirect Branch 
Prediction Barrier */
-#define X86_FEATURE_IBRS   (13*32+14) /* Indirect Branch 
Restricted Speculation */
-#define X86_FEATURE_STIBP  (13*32+15) /* Single Thread Indirect 
Branch Predictors */
+#define X86_FEATURE_AMD_IBPB   (13*32+12) /* "" Indirect Branch 
Prediction Barrier */
+#define X86_FEATURE_AMD_IBRS   (13*32+14) /* "" Indirect Branch 
Restricted Speculation */
+#define X86_FEATURE_AMD_STIBP  (13*32+15) /* "" Single Thread Indirect 
Branch Predictors */
+#define X86_FEATURE_VIRT_SSBD  (13*32+25) /* Virtualized Speculative 
Store Bypass Disable */
 
 /* Thermal and Power Management Leaf, CPUID level 0x0006 (EAX), word 14 */
 #define X86_FEATURE_DTHERM (14*32+ 0) /* Digital Thermal Sensor */
@@ -333,7 +340,9 @@
 #define X86_FEATURE_PCONFIG(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_SPEC_CTRL  (18*32+26) /* "" Speculation Control 
(IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP(18*32+27) /* "" Single Thread 
Indirect Branch Predictors */
+#define X86_FEATURE_FLUSH_L1D  (18*32+28) /* Flush L1D cache */
 #define X86_FEATURE_ARCH_CAPABILITIES  (18*32+29) /* IA32_ARCH_CAPABILITIES 
MSR (Intel) */
+#define X86_FEATURE_SPEC_CTRL_SSBD (18*32+31) /* "" Speculative Store 
Bypass Disable */
 
 /*
  * BUG word(s)
@@ -363,5 +372,7 @@
 #define X86_BUG_CPU_MELTDOWN   X86_BUG(14) /* CPU is affected by 
meltdown attack and needs kernel page table isolation */
 #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by 
Spectre variant 1 attack with conditional branches */
 #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by 
Spectre variant 2 attack with indirect 
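
The word 18 flags synchronised above map to CPUID.(EAX=7,ECX=0):EDX. A small illustrative user-space check of the two L1TF-related bits, assuming a compiler that ships <cpuid.h> (GCC or clang):

/*
 * The word-18 feature bits synchronised above correspond to
 * CPUID.(EAX=7,ECX=0):EDX. Illustrative check of the two L1TF-related
 * ones; bit numbers match the (18*32+n) defines in the hunk above.
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
		fprintf(stderr, "CPUID leaf 7 not supported\n");
		return 1;
	}
	printf("FLUSH_L1D (EDX bit 28):         %s\n",
	       (edx & (1u << 28)) ? "yes" : "no");
	printf("ARCH_CAPABILITIES (EDX bit 29): %s\n",
	       (edx & (1u << 29)) ? "yes" : "no");
	return 0;
}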

[PATCH 4.9 015/107] proc/sysctl: prune stale dentries during unregistering

2018-08-14 Thread Greg Kroah-Hartman
4.9-stable review patch.  If anyone has any objections, please let me know.

--

From: Konstantin Khlebnikov 

commit d6cffbbe9a7e51eb705182965a189457c17ba8a3 upstream.

Currently unregistering sysctl table does not prune its dentries.
Stale dentries could slowdown sysctl operations significantly.

For example, command:

 # for i in {1..10} ; do unshare -n -- sysctl -a &> /dev/null ; done
 creates millions of stale dentries around sysctls of the loopback interface:

 # sysctl fs.dentry-state
 fs.dentry-state = 25812579  2472413545  0   0   0

 All of them have matching names, thus lookups have to scan through the whole
 hash chain and call d_compare (proc_sys_compare), which checks them
 under a system-wide spinlock (sysctl_lock).

 # time sysctl -a > /dev/null
 real1m12.806s
 user0m0.016s
 sys 1m12.400s

Currently only the memory reclaimer can remove this garbage,
but without significant memory pressure this never happens.

This patch collects sysctl inodes into list on sysctl table header and
prunes all their dentries once that table unregisters.

Konstantin Khlebnikov  writes:
> On 10.02.2017 10:47, Al Viro wrote:
>> how about >> the matching stats *after* that patch?
>
> dcache size doesn't grow endlessly, so stats are fine
>
> # sysctl fs.dentry-state
> fs.dentry-state = 92712   58376   45  0   0   0
>
> # time sysctl -a &>/dev/null
>
> real  0m0.013s
> user  0m0.004s
> sys   0m0.008s

Signed-off-by: Konstantin Khlebnikov 
Suggested-by: Al Viro 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/proc/inode.c|3 +-
 fs/proc/internal.h |7 -
 fs/proc/proc_sysctl.c  |   59 +++--
 include/linux/sysctl.h |1 
 4 files changed, 51 insertions(+), 19 deletions(-)

--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -43,10 +43,11 @@ static void proc_evict_inode(struct inod
de = PDE(inode);
if (de)
pde_put(de);
+
head = PROC_I(inode)->sysctl;
if (head) {
RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
-   sysctl_head_put(head);
+   proc_sys_evict_inode(inode, head);
}
 }
 
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -65,6 +65,7 @@ struct proc_inode {
struct proc_dir_entry *pde;
struct ctl_table_header *sysctl;
struct ctl_table *sysctl_entry;
+   struct list_head sysctl_inodes;
const struct proc_ns_operations *ns_ops;
struct inode vfs_inode;
 };
@@ -249,10 +250,12 @@ extern void proc_thread_self_init(void);
  */
 #ifdef CONFIG_PROC_SYSCTL
 extern int proc_sys_init(void);
-extern void sysctl_head_put(struct ctl_table_header *);
+extern void proc_sys_evict_inode(struct inode *inode,
+struct ctl_table_header *head);
 #else
 static inline void proc_sys_init(void) { }
-static inline void sysctl_head_put(struct ctl_table_header *head) { }
+static inline void proc_sys_evict_inode(struct  inode *inode,
+   struct ctl_table_header *head) { }
 #endif
 
 /*
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -190,6 +190,7 @@ static void init_header(struct ctl_table
head->set = set;
head->parent = NULL;
head->node = node;
+   INIT_LIST_HEAD(&head->inodes);
if (node) {
struct ctl_table *entry;
for (entry = table; entry->procname; entry++, node++)
@@ -259,6 +260,29 @@ static void unuse_table(struct ctl_table
complete(p->unregistering);
 }
 
+/* called under sysctl_lock */
+static void proc_sys_prune_dcache(struct ctl_table_header *head)
+{
+   struct inode *inode, *prev = NULL;
+   struct proc_inode *ei;
+
+   list_for_each_entry(ei, &head->inodes, sysctl_inodes) {
+   inode = igrab(&ei->vfs_inode);
+   if (inode) {
+   spin_unlock(&sysctl_lock);
+   iput(prev);
+   prev = inode;
+   d_prune_aliases(inode);
+   spin_lock(&sysctl_lock);
+   }
+   }
+   if (prev) {
+   spin_unlock(&sysctl_lock);
+   iput(prev);
+   spin_lock(&sysctl_lock);
+   }
+}
+
 /* called under sysctl_lock, will reacquire if has to wait */
 static void start_unregistering(struct ctl_table_header *p)
 {
@@ -278,27 +302,17 @@ static void start_unregistering(struct c
p->unregistering = ERR_PTR(-EINVAL);
}
/*
+* Prune dentries for unregistered sysctls: namespaced sysctls
+* can have duplicate names and contaminate dcache very badly.
+*/
+   proc_sys_prune_dcache(p);
+   /*
 * do not remove from the list until nobody holds it; walking the
 * list in do_sysctl() relies on that.
 */
erase_header(p);
 }
 
-static void sysctl_head_get(struct ctl_table_header *head)
-{
- 

[PATCH 4.17 52/97] x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Konrad Rzeszutek Wilk 

commit 26acfb666a473d960f0fd971fe68f3e3ad16c70b upstream

If the L1TF CPU bug is present we allow the KVM module to be loaded, as the
majority of users that use Linux and KVM have trusted guests and do not want
a broken setup.

Cloud vendors are the ones that are uncomfortable with CVE 2018-3620 and as
such they are the ones that should set nosmt to one.

Setting 'nosmt' means that the system administrator also needs to disable
SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit
05736e4ac13c ("cpu/hotplug: Provide knobs to control SMT").

Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guests vCPUs are running
on sibling threads.

Signed-off-by: Konrad Rzeszutek Wilk 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/admin-guide/kernel-parameters.txt |6 ++
 arch/x86/kvm/vmx.c  |   13 +
 kernel/cpu.c|1 +
 3 files changed, 20 insertions(+)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1937,6 +1937,12 @@
[KVM,ARM] Allow use of GICv4 for direct injection of
LPIs.
 
+   kvm-intel.nosmt=[KVM,Intel] If the L1TF CPU bug is present (CVE-2018-3620)
+   and the system has SMT (aka Hyper-Threading) enabled then
+   don't allow guests to be created.
+
+   Default is 0 (allow guests to be created).
+
kvm-intel.ept=  [KVM,Intel] Disable extended page tables
(virtualized MMU) support on capable Intel chips.
Default is 1 (enabled)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -71,6 +71,9 @@ static const struct x86_cpu_id vmx_cpu_i
 };
 MODULE_DEVICE_TABLE(x86cpu, vmx_cpu_id);
 
+static bool __read_mostly nosmt;
+module_param(nosmt, bool, S_IRUGO);
+
 static bool __read_mostly enable_vpid = 1;
 module_param_named(vpid, enable_vpid, bool, 0444);
 
@@ -10142,10 +10145,20 @@ free_vcpu:
return ERR_PTR(err);
 }
 
+#define L1TF_MSG "SMT enabled with L1TF CPU bug present. Refer to CVE-2018-3620 for details.\n"
+
 static int vmx_vm_init(struct kvm *kvm)
 {
if (!ple_gap)
kvm->arch.pause_in_guest = true;
+
+   if (boot_cpu_has(X86_BUG_L1TF) && cpu_smt_control == CPU_SMT_ENABLED) {
+   if (nosmt) {
+   pr_err(L1TF_MSG);
+   return -EOPNOTSUPP;
+   }
+   pr_warn(L1TF_MSG);
+   }
return 0;
 }
 
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -345,6 +345,7 @@ EXPORT_SYMBOL_GPL(cpu_hotplug_enable);
 
 #ifdef CONFIG_HOTPLUG_SMT
 enum cpuhp_smt_control cpu_smt_control __read_mostly = CPU_SMT_ENABLED;
+EXPORT_SYMBOL_GPL(cpu_smt_control);
 
 static int __init smt_cmdline_disable(char *str)
 {




[PATCH 4.14 010/104] scsi: sr: Avoid that opening a CD-ROM hangs with runtime power management enabled

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Bart Van Assche 

commit 1214fd7b497400d200e3f4e64e2338b303a20949 upstream.

Surround scsi_execute() calls with scsi_autopm_get_device() and
scsi_autopm_put_device(). Note: removing sr_mutex protection from the
scsi_cd_get() and scsi_cd_put() calls is safe because the purpose of
sr_mutex is to serialize cdrom_*() calls.

This patch avoids that complaints similar to the following appear in the
kernel log if runtime power management is enabled:

INFO: task systemd-udevd:650 blocked for more than 120 seconds.
 Not tainted 4.18.0-rc7-dbg+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
systemd-udevd   D28176   650513 0x0104
Call Trace:
__schedule+0x444/0xfe0
schedule+0x4e/0xe0
schedule_preempt_disabled+0x18/0x30
__mutex_lock+0x41c/0xc70
mutex_lock_nested+0x1b/0x20
__blkdev_get+0x106/0x970
blkdev_get+0x22c/0x5a0
blkdev_open+0xe9/0x100
do_dentry_open.isra.19+0x33e/0x570
vfs_open+0x7c/0xd0
path_openat+0x6e3/0x1120
do_filp_open+0x11c/0x1c0
do_sys_open+0x208/0x2d0
__x64_sys_openat+0x59/0x70
do_syscall_64+0x77/0x230
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Bart Van Assche 
Cc: Maurizio Lombardi 
Cc: Johannes Thumshirn 
Cc: Alan Stern 
Cc: 
Tested-by: Johannes Thumshirn 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman 

---
 drivers/scsi/sr.c |   29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -523,18 +523,26 @@ static int sr_init_command(struct scsi_c
 static int sr_block_open(struct block_device *bdev, fmode_t mode)
 {
struct scsi_cd *cd;
+   struct scsi_device *sdev;
int ret = -ENXIO;
 
+   cd = scsi_cd_get(bdev->bd_disk);
+   if (!cd)
+   goto out;
+
+   sdev = cd->device;
+   scsi_autopm_get_device(sdev);
check_disk_change(bdev);
 
mutex_lock(&sr_mutex);
-   cd = scsi_cd_get(bdev->bd_disk);
-   if (cd) {
-   ret = cdrom_open(&cd->cdi, bdev, mode);
-   if (ret)
-   scsi_cd_put(cd);
-   }
+   ret = cdrom_open(&cd->cdi, bdev, mode);
mutex_unlock(&sr_mutex);
+
+   scsi_autopm_put_device(sdev);
+   if (ret)
+   scsi_cd_put(cd);
+
+out:
return ret;
 }
 
@@ -562,6 +570,8 @@ static int sr_block_ioctl(struct block_d
if (ret)
goto out;
 
+   scsi_autopm_get_device(sdev);
+
/*
 * Send SCSI addressing ioctls directly to mid level, send other
 * ioctls to cdrom/block level.
@@ -570,15 +580,18 @@ static int sr_block_ioctl(struct block_d
case SCSI_IOCTL_GET_IDLUN:
case SCSI_IOCTL_GET_BUS_NUMBER:
ret = scsi_ioctl(sdev, cmd, argp);
-   goto out;
+   goto put;
}
 
ret = cdrom_ioctl(&cd->cdi, bdev, mode, cmd, arg);
if (ret != -ENOSYS)
-   goto out;
+   goto put;
 
ret = scsi_ioctl(sdev, cmd, argp);
 
+put:
+   scsi_autopm_put_device(sdev);
+
 out:
mutex_unlock(&sr_mutex);
return ret;




[PATCH 4.17 83/97] x86: Dont include linux/irq.h from asm/hardirq.h

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Nicolai Stange 

commit 447ae316670230d7d29430e2cbf1f5db4f49d14c upstream

The next patch in this series will have to make the definition of
irq_cpustat_t available to entering_irq().

Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
dependencies like

  asm/smp.h
asm/apic.h
  asm/hardirq.h
linux/irq.h
  linux/topology.h
linux/smp.h
  asm/smp.h

or

  linux/gfp.h
linux/mmzone.h
  asm/mmzone.h
asm/mmzone_64.h
  asm/smp.h
asm/apic.h
  asm/hardirq.h
linux/irq.h
  linux/irqdesc.h
linux/kobject.h
  linux/sysfs.h
linux/kernfs.h
  linux/idr.h
linux/gfp.h

and others.

This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.

A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.

However, this wouldn't solve the real problem, namely asm/hardirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.

Remove the linux/irq.h #include from x86' asm/hardirq.h.

Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.

Note that some of these *.c files could be cleaned up a bit with respect to their
set of #includes, but that should better be done from separate patches, if
at all.

Signed-off-by: Nicolai Stange 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/dmi.h   |2 +-
 arch/x86/include/asm/hardirq.h   |1 -
 arch/x86/include/asm/kvm_host.h  |1 +
 arch/x86/kernel/apic/apic.c  |1 +
 arch/x86/kernel/apic/io_apic.c   |1 +
 arch/x86/kernel/apic/msi.c   |1 +
 arch/x86/kernel/apic/vector.c|1 +
 arch/x86/kernel/fpu/core.c   |1 +
 arch/x86/kernel/hpet.c   |1 +
 arch/x86/kernel/i8259.c  |1 +
 arch/x86/kernel/idt.c|1 +
 arch/x86/kernel/irq.c|1 +
 arch/x86/kernel/irq_32.c |1 +
 arch/x86/kernel/irq_64.c |1 +
 arch/x86/kernel/irqinit.c|1 +
 arch/x86/kernel/smpboot.c|1 +
 arch/x86/kernel/time.c   |1 +
 arch/x86/mm/pti.c|1 +
 arch/x86/platform/intel-mid/device_libs/platform_mrfld_wdt.c |1 +
 arch/x86/xen/enlighten.c |1 +
 drivers/gpu/drm/i915/i915_pmu.c  |1 +
 drivers/gpu/drm/i915/intel_lpe_audio.c   |1 +
 drivers/pci/host/pci-hyperv.c|2 ++
 23 files changed, 23 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/dmi.h
+++ b/arch/x86/include/asm/dmi.h
@@ -4,8 +4,8 @@
 
 #include <linux/compiler.h>
 #include <linux/init.h>
+#include <linux/io.h>
 
-#include <asm/io.h>
 #include <asm/setup.h>
 
 static __always_inline __init void *dmi_alloc(unsigned len)
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -3,7 +3,6 @@
 #define _ASM_X86_HARDIRQ_H
 
 #include <linux/threads.h>
-#include <linux/irq.h>
 
 typedef struct {
u16  __softirq_pending;
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 
 unsigned int num_processors;
 
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -33,6 +33,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -12,6 +12,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -11,6 +11,7 @@
  * published by the Free Software Foundation.
  */
 #include 
+#include 
 #include 
 #include 
 #include 
--- a/arch/x86/kernel/fpu/core.c
+++ 

[PATCH 3/7] evmtest: test kernel module loading

2018-08-14 Thread David Jacobson
The Linux kernel supports two methods of loading kernel modules -
init_module and finit_module syscalls. This test verifies loading kernel
modules with both syscalls, first without an IMA policy, and
subsequently with an IMA policy (that restricts module loading to signed
modules).

This test requires the kernel to be configured with the
"CONFIG_MODULE_SIG" option, but not with "CONFIG_MODULE_SIG_FORCE".  For
this reason, the test requires that  "module.sig_enforce=1" is supplied
as a boot option to the kernel.

Signed-off-by: David Jacobson 

Changelog:
---
 evmtest/Makefile.am |  11 +-
 evmtest/files/policies/kernel_module_policy |   2 +
 evmtest/functions/r_kmod_sig.sh | 225 
 evmtest/src/Makefile|   5 +
 evmtest/src/basic_mod.c |  36 
 evmtest/src/kern_mod_loader.c   | 131 
 6 files changed, 407 insertions(+), 3 deletions(-)
 create mode 100644 evmtest/files/policies/kernel_module_policy
 create mode 100755 evmtest/functions/r_kmod_sig.sh
 create mode 100644 evmtest/src/Makefile
 create mode 100644 evmtest/src/basic_mod.c
 create mode 100644 evmtest/src/kern_mod_loader.c

diff --git a/evmtest/Makefile.am b/evmtest/Makefile.am
index b537e78..6be0596 100644
--- a/evmtest/Makefile.am
+++ b/evmtest/Makefile.am
@@ -3,7 +3,7 @@ datarootdir=@datarootdir@
 exec_prefix=@exec_prefix@
 bindir=@bindir@
 
-all: evmtest.1
+all: src evmtest.1
 
 evmtest.1:
asciidoc -d manpage -b docbook -o evmtest.1.xsl README
@@ -11,7 +11,10 @@ evmtest.1:
xsltproc --nonet -o $@ $(MANPAGE_DOCBOOK_XSL) evmtest.1.xsl
asciidoc -o evmtest.html README
rm -f evmtest.1.xsl
-install:
+src:
+   cd src && make
+
+install: src
install -m 755 evmtest $(bindir)
install -d $(datarootdir)/evmtest/files/
install -d $(datarootdir)/evmtest/files/policies
@@ -19,7 +22,9 @@ install:
install -D $$(find ./files/ -not -type d -not -path "./files/policies/*")  $(datarootdir)/evmtest/files/
install -D ./functions/* $(datarootdir)/evmtest/functions/
install -D ./files/policies/* $(datarootdir)/evmtest/files/policies/
+   cp ./src/basic_mod.ko $(datarootdir)/evmtest/files/
+   cp ./src/kern_mod_loader $(datarootdir)/evmtest/files
cp evmtest.1 $(datarootdir)/man/man1
mandb -q
 
-.PHONY: install evmtest.1
+.PHONY: src install evmtest.1
diff --git a/evmtest/files/policies/kernel_module_policy 
b/evmtest/files/policies/kernel_module_policy
new file mode 100644
index 000..8096e18
--- /dev/null
+++ b/evmtest/files/policies/kernel_module_policy
@@ -0,0 +1,2 @@
+measure func=MODULE_CHECK
+appraise func=MODULE_CHECK appraise_type=imasig
diff --git a/evmtest/functions/r_kmod_sig.sh b/evmtest/functions/r_kmod_sig.sh
new file mode 100755
index 000..43ab9df
--- /dev/null
+++ b/evmtest/functions/r_kmod_sig.sh
@@ -0,0 +1,225 @@
+#!/bin/bash
+# Author: David Jacobson 
+TEST="r_kmod_sig"
+BUILD_DIR=""
+ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )/.."
+source $ROOT/files/common.sh
+
+VERBOSE=0
+# This test validates that IMA prevents the loading of unsigned
+# kernel modules
+# There is no way to tell how the kernel was compiled, so the boot command:
+# module.sig_enforce=1 is required for this test. This is equivalent to
+# compiling with CONFIG_MODULE_SIG_FORCE
+
+SIG_ENFORCE_CMD="module.sig_enforce=1"
+POLICY_LOAD=$ROOT/files/load_policy.sh
+
+usage(){
+   echo ""
+   echo "kmod_sig [-b kernel_build_directory] -k  [-v]"
+   echo "  This test verifies that IMA prevents the loading of an"
+   echo "  unsigned kernel module with a policy appraising MODULE_CHECK"
+   echo ""
+   echo "  This test must be run as root"
+   echo ""
+   echo "  -b,--kernel_build_directory The path to a kernel build dir"
+   echo "  -k,--keyIMA key"
+   echo "  -v,--verboseVerbose logging"
+   echo "  -h,--help   Display this help message"
+}
+
+TEMP=`getopt -o 'b:k:hv' -l 'kernel_build_directory:,key:,help,verbose'\
+   -n 'r_kmod_sig' -- "$@"`
+eval set -- "$TEMP"
+
+while true ; do
+   case "$1" in
+   -h|--help) usage; exit 0 ;;
+   -b|--kernel_build_directory) BUILD_DIR=$2; shift 2;;
+   -k|--key) IMA_KEY=$2; shift 2;;
+   -v|--verbose)   VERBOSE=1; shift;;
+   --) shift; break;;
+   *) echo "[*] Unrecognized option $1"; exit 1 ;;
+   esac
+done
+
+if [[ -z $IMA_KEY ]]; then
+   echo "[!] Please provide an IMA key."
+   usage
+   exit 1
+fi
+
+if [[ -z $BUILD_DIR ]]; then
+   v_out "No build directory provided, searching..."
+   BUILD_DIR="/lib/modules/`uname -r`/build"
+   if [[ ! -e $BUILD_DIR ]]; then
+   echo "[!] Could not find build tree. Please specify with -b"
+   exit 1
+   else
+   v_out "Found - using: `readlink -f 
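
The two load paths the test exercises come down to the init_module and finit_module system calls. The sketch below is a rough stand-in for a loader helper; it is not the patch's kern_mod_loader.c, whose full source is not shown here:

/*
 * Minimal sketch of the two module-load syscalls the test above
 * exercises. Roughly what a loader helper would do; not a copy of the
 * patch's kern_mod_loader.c. Needs root and a .ko path as argv[1].
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long ret;
	int fd;

	if (argc != 3 || (strcmp(argv[2], "init") && strcmp(argv[2], "finit"))) {
		fprintf(stderr, "usage: %s <module.ko> init|finit\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (!strcmp(argv[2], "finit")) {
		/* finit_module(fd, param_values, flags) */
		ret = syscall(SYS_finit_module, fd, "", 0);
	} else {
		/* init_module(image, len, param_values): read the file in first */
		struct stat st;
		void *img;

		fstat(fd, &st);
		img = malloc(st.st_size);
		if (!img || pread(fd, img, st.st_size, 0) != st.st_size) {
			perror("read");
			return 1;
		}
		ret = syscall(SYS_init_module, img, (unsigned long)st.st_size, "");
		free(img);
	}
	close(fd);

	if (ret != 0)
		perror("module load rejected (expected for unsigned modules under IMA)");
	else
		printf("module loaded\n");
	return ret ? 1 : 0;
}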

[PATCH 7/7] emvtest: Add ability to run all tests

2018-08-14 Thread David Jacobson
evmtest tests functionality of different IMA-Appraisal policies.

To simplify testing, this patch defines an evmtest config file.  This
allows for running all tests at once, rather than invoking each test
individually. Variables can be set once rather than specifying
parameters at runtime on the command line.

Signed-off-by: David Jacobson 
---
 evmtest/README   | 19 +++--
 evmtest/evmtest  | 51 +++-
 evmtest/example.conf | 14 
 3 files changed, 81 insertions(+), 3 deletions(-)
 create mode 100644 evmtest/example.conf

diff --git a/evmtest/README b/evmtest/README
index ac0c175..6f1c5c8 100644
--- a/evmtest/README
+++ b/evmtest/README
@@ -20,8 +20,8 @@ used to check a kernel's configuration and validate 
compatibility with IMA.
 COMMANDS
 
 
- runtest  - Run a specific test
- runall  - Run all tests
+ runtest - Run a specific test
+ runall   - Run all tests
 
 OPTIONS
 ---
@@ -34,7 +34,21 @@ OPTIONS
  --vm  Validate compatibility with a virtual machine
 
 
+CONFIGURATION FILE
+--
+
+The `example.conf` provides a skeleton configuration file, where the only
+variable that *must* be defined is `IMA_KEY`.
+
+* `IMA_KEY` - The private key for the certificate on the IMA Trusted Keyring
 
+* `KBUILD_DIR` - Should point to a kernel build tree. If not provided, the test
+will use `/lib/modules/$(uname -r)/build`.
+
+* `KERN_IMAGE` - Should point towards an unsigned kernel image. If not provided,
+the test will attempt to use the running kernel.
+
+* `VERBOSE` - If set to 1, will add -v to all tests run
 
 BACKGROUND
 --
@@ -42,6 +56,7 @@ The Linux kernel needs to be configured properly with a key 
embedded into the
 kernel and loaded onto the `.builtin_trusted_keys` keyring at boot in order to
 run evmtest.
 
+
 === 1. Confirming the kernel is properly configured with IMA enabled.
 
 A number of Kconfig options need to be configured to enable IMA and permit
diff --git a/evmtest/evmtest b/evmtest/evmtest
index dfe39a9..74c829c 100755
--- a/evmtest/evmtest
+++ b/evmtest/evmtest
@@ -17,7 +17,7 @@ fi
 
 source $EVMDIR/files/common.sh
 usage (){
-   echo "Usage: evmtest [[runtest] ] [options]"
+   echo "Usage: evmtest [[runtest|runall] ] [options]"
echo ""
echo "Tests may be called directly by cd'ing to the evmtest directory:"
echo "$ cd $EVMDIR/functions"
@@ -69,6 +69,55 @@ elif [[ "$1" == "runtest" ]]; then
runtest $@
exit $?
fi
+elif [[ "$1" == "runall" ]]; then
+   if [[ -z $2 || ! -e $2 ]]; then
+   echo "evmtest runall "
+   echo "[!] Please provide a config file"
+   exit 1
+   fi
+   source $2 # Load in config
+   if [[ $VERBOSE -eq 1 ]]; then
+   V="-v"
+   fi
+
+   # Key is not optional
+   if [[ -z $IMA_KEY ]]; then
+   echo "[*] Please correct your config file"
+   exit 1
+   fi
+
+   EVMTEST_require_root
+   FAIL=0
+   echo "[*] Running tests..."
+   # 1
+   $EVMDIR/functions/r_env_validate.sh -r $V
+
+   # 2
+   if [[ -z $KERN_IMAGE ]]; then
+   $EVMDIR/functions/r_kexec_sig.sh -k $IMA_KEY $V
+   else
+   $EVMDIR/functions/r_kexec_sig.sh -k $IMA_KEY -i $KERN_IMAGE $V
+   fi
+   FAIL=$((FAIL+$?))
+   # 3
+   if [[ -z $KBUILD_DIR ]]; then
+   $EVMDIR/functions/r_kmod_sig.sh -k $IMA_KEY $V
+   else
+   $EVMDIR/functions/r_kmod_sig.sh -b $KBUILD_DIR -k $IMA_KEY $V
+   fi
+   FAIL=$((FAIL+$?))
+   # 4
+   $EVMDIR/functions/r_policy_sig.sh -k $IMA_KEY $V
+   FAIL=$((FAIL+$?))
+   # 5
+   $EVMDIR/functions/r_validate_boot_record.sh $V
+   FAIL=$((FAIL+$?))
+   # 6
+   $EVMDIR/functions/r_xattr_preserve.sh $V
+   FAIL=$((FAIL+$?))
+   echo "..."
+   echo "[*] TESTS PASSED: $((6-FAIL))"
+   echo "[*] TESTS FAILED: $FAIL"
 else
usage
 fi
diff --git a/evmtest/example.conf b/evmtest/example.conf
new file mode 100644
index 000..fd1c8fe
--- /dev/null
+++ b/evmtest/example.conf
@@ -0,0 +1,14 @@
+# This is an example config file
+# There are three variables that can be set when using evmtest runall
+
+#Set this to 1 for verbose output
+VERBOSE=0
+# Path to the private key for the IMA Trusted Keyring
+# This is required
+IMA_KEY=/path/to/your/ima_key
+
+# If this is not provided, tests will run but attempt to copy the running 
kernel
+KERN_IMAGE=/path/to/unsigned/kernel_image
+
+# If this is not defined, tests will try to find build tree
+KBUILD_DIR=/path/to/kernel/build/tree
-- 
2.17.1



[PATCH 6/7] evmtest: test the preservation of extended attributes

2018-08-14 Thread David Jacobson
IMA supports file signatures by storing information in a security.ima
extended file attribute. This test ensures that the attribute is
preserved when a file is copied.  This test requires root because only
root can write "security." xattrs to files.

Signed-off-by: David Jacobson 
---
 evmtest/functions/r_xattr_preserve.sh | 74 +++
 1 file changed, 74 insertions(+)
 create mode 100755 evmtest/functions/r_xattr_preserve.sh

diff --git a/evmtest/functions/r_xattr_preserve.sh 
b/evmtest/functions/r_xattr_preserve.sh
new file mode 100755
index 000..e7f0e2a
--- /dev/null
+++ b/evmtest/functions/r_xattr_preserve.sh
@@ -0,0 +1,74 @@
+#!/bin/bash
+# Author: David Jacobson 
+TEST="r_xattr_preserve"
+ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )/.."
+source $ROOT/files/common.sh
+
+VERBOSE=0
+# This test ensures that extended file attributes are preserved when a file is
+# moved with the correct flag
+
+usage (){
+   echo ""
+   echo "xattr_preserve [-hv]"
+   echo ""
+   echo "This test must be run as root"
+   echo ""
+   echo "  This test ensures that extended file attributes (specifically"
+   echo "  security.ima labels) are preserved when copying"
+   echo "Options"
+   echo "  -h,--help   Display this help message"
+   echo "  -v,--verboseVerbose logging"
+}
+
+TEMP=`getopt -o 'hv' -l 'help,verbose' -n 'r_xattr_preserve' -- "$@"`
+eval set -- "$TEMP"
+
+while true ; do
+   case "$1" in
+   -h|--help) usage; exit; shift;;
+   -v|--verbose) VERBOSE=1; shift;;
+   --) shift; break;;
+   *) echo "[*] Unrecognized option $1"; exit 1;;
+   esac
+done
+
+EVMTEST_require_root
+
+begin
+
+LOCATION_1=`mktemp`
+LOCATION_2=`mktemp -u` # Doesn't create the file
+v_out "Labeling file..."
+
+evmctl ima_hash $LOCATION_1
+initial_ima_label=`getfattr -m ^security.ima -e hex \
+   --dump $LOCATION_1 2> /dev/null`
+
+initial_hash=`echo $initial_ima_label | awk -F '=' '{print $2}'`
+
+if [[ $initial_ima_label = *"security.ima"* ]]; then
+   v_out "Found hash on initial file... "
+else
+   fail "Hash not found on initial file"
+fi
+
+initial_hash=`echo $initial_ima_label | awk -F '=' '{print $2}'`
+
+v_out "Copying file..."
+cp --preserve=xattr $LOCATION_1 $LOCATION_2
+v_out "Checking if extended attribute has been preserved..."
+
+
+second_ima_label=`getfattr -m ^security.ima -e hex \
+   --dump $LOCATION_2 2> /dev/null`
+second_hash=`echo $second_ima_label | awk -F '=' '{print $2}'`
+if [[ "$initial_hash" != "$second_hash" ]]; then
+   fail "security.ima xattr was not preserved!"
+else
+   v_out "Extended attribute was preserved during copy"
+fi
+v_out "Cleaning up..."
+rm $LOCATION_1 $LOCATION_2
+
+passed
-- 
2.17.1
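
The getfattr calls in the test have a direct programmatic counterpart in getxattr(2). An illustrative sketch, independent of the test script:

/*
 * Illustrative counterpart to the getfattr calls above: read the
 * security.ima xattr of two files and compare them. Reading this xattr
 * may require root, just like the test itself.
 */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

static ssize_t read_ima(const char *path, unsigned char *buf, size_t len)
{
	ssize_t n = getxattr(path, "security.ima", buf, len);

	if (n < 0)
		perror(path);
	return n;
}

int main(int argc, char **argv)
{
	unsigned char a[1024], b[1024];
	ssize_t na, nb;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <orig> <copy>\n", argv[0]);
		return 1;
	}
	na = read_ima(argv[1], a, sizeof(a));
	nb = read_ima(argv[2], b, sizeof(b));
	if (na < 0 || nb < 0)
		return 1;
	if (na == nb && !memcmp(a, b, na))
		printf("security.ima preserved (%zd bytes)\n", na);
	else
		printf("security.ima differs\n");
	return 0;
}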



[PATCH 2/7] evmtest: test appraisal on policy loading with signature

2018-08-14 Thread David Jacobson
IMA can be configured to require signatures on policies before loading
them. This test verifies that IMA correctly validates signatures, and
rejects policies that lack signatures or have been signed by an
unauthorized party (i.e. certificate is not on the appropriate keyring).

This test requires root privileges in order to write to securityfs
files.

Signed-off-by: David Jacobson 
---
 evmtest/Makefile.am  |  4 +-
 evmtest/files/Notes  | 25 ++
 evmtest/files/bad_privkey_ima.pem| 16 
 evmtest/files/policies/signed_policy |  2 +
 evmtest/files/policies/unknown_signed_policy |  1 +
 evmtest/files/policies/unsigned_policy   |  1 +
 evmtest/functions/r_policy_sig.sh| 93 
 7 files changed, 141 insertions(+), 1 deletion(-)
 create mode 100644 evmtest/files/Notes
 create mode 100644 evmtest/files/bad_privkey_ima.pem
 create mode 100644 evmtest/files/policies/signed_policy
 create mode 100644 evmtest/files/policies/unknown_signed_policy
 create mode 100644 evmtest/files/policies/unsigned_policy
 create mode 100755 evmtest/functions/r_policy_sig.sh

diff --git a/evmtest/Makefile.am b/evmtest/Makefile.am
index 388ead1..b537e78 100644
--- a/evmtest/Makefile.am
+++ b/evmtest/Makefile.am
@@ -14,9 +14,11 @@ evmtest.1:
 install:
install -m 755 evmtest $(bindir)
install -d $(datarootdir)/evmtest/files/
+   install -d $(datarootdir)/evmtest/files/policies
install -d $(datarootdir)/evmtest/functions/
-   install -D $$(find ./files/ -not -type d)  $(datarootdir)/evmtest/files/
+   install -D $$(find ./files/ -not -type d -not -path "./files/policies/*")  $(datarootdir)/evmtest/files/
install -D ./functions/* $(datarootdir)/evmtest/functions/
+   install -D ./files/policies/* $(datarootdir)/evmtest/files/policies/
cp evmtest.1 $(datarootdir)/man/man1
mandb -q
 
diff --git a/evmtest/files/Notes b/evmtest/files/Notes
new file mode 100644
index 000..6b75263
--- /dev/null
+++ b/evmtest/files/Notes
@@ -0,0 +1,25 @@
+This file contains a description of the contents of this directory.
+
+1. bad_privkey_ima.pem
+
+This file was generated such that its corresponding public key could be placed
+on the IMA Trusted Keyring; however, it has not been. Therefore, any policy
+(or file) signed by this key cannot be verified, and is untrusted.
+
+2. basic_mod.ko
+
+This is a kernel module that logs (to dmesg) the syscall that was used to load
+it.
+
+3. common.sh
+
+This file contains useful functions and variables for evmtest scripts.
+
+4. load_policy.sh
+
+This is a script to load policies. The first time this is called, it will
+replace the existing policy. Subsequent calls will append to the running policy.
+
+5. policies/
+
+This is a directory that contains IMA policies with self explanatory names.
diff --git a/evmtest/files/bad_privkey_ima.pem 
b/evmtest/files/bad_privkey_ima.pem
new file mode 100644
index 000..dcc0e24
--- /dev/null
+++ b/evmtest/files/bad_privkey_ima.pem
@@ -0,0 +1,16 @@
+-BEGIN PRIVATE KEY-
+MIICdgIBADANBgkqhkiG9w0BAQEFAASCAmAwggJcAgEAAoGBAMOnki6OKMHExpH1
+IWgUlPWWSbsDpW1lpqXMj0/ZWo9xU5W2xZC53TVArUGOImQ5PcMNkw1VcHhKbFKO
+jYT0gEE0Sv+VbePiEnhUheFOWUxNNFE3DVQaOpBN0OzsUCSGX9RKIIwkIAwJkvWA
+MHzR4ZPQGGM9hMJKhEvlTG4PP96LAgMBAAECgYBKVKVCrptpUhqmZNx2MCuPSbNl
+KzNz5kRzhM2FZmvzRvicTj2siBA0JQgteZQzQ1PlgIi3bhg2ev/ANYwqUMFQWZv9
+zm5d4P7Zsdyle15MDTSrQIaroeb1nbfNvaB0L4D4Inv0p6ksyIFp7TR5MLVenC5k
+bxfESVWVPDseiAFKUQJBAPQ/x3LmnT0RiMeX6quCGAON7DGpV5KFwL97luWO6vH+
+qZ2W1/J0UxTbruv7rA+tj3ZXpdNOxfmq+JStY0jrJV0CQQDNEUqomnA183rX0dv8
+MWyOPmX0Z9SMSTRvflNRW85Bzbosq68uLTq3qOBj+td9zUlopsLpJlfF0Vc+moff
+uq0HAkEAi/Sz47oTZXfTqZL6TBZ6jibXrck8PeBYhyBZYebX55ymMn/J88sGBFCx
+VdVbTYyFRSmKAqADv0FhuUf1OUZMnQJAOayjUsgcxw+zfP+I32UHIvppslOBc/Mi
+zDi7Niab2+YAdo/StSoDWaQld/kUok0aWFSOfQRLq1c1MmZD0KiwAQJANY0LopqG
+pxACc4/QawxtBoV1a8j5Zui8LZPRtKwjkA30Nq8fOufzMuBeJIlLap45uD1xC7St
+bsPWG5+uz18e5w==
+-END PRIVATE KEY-
diff --git a/evmtest/files/policies/signed_policy 
b/evmtest/files/policies/signed_policy
new file mode 100644
index 000..87828f0
--- /dev/null
+++ b/evmtest/files/policies/signed_policy
@@ -0,0 +1,2 @@
+measure func=POLICY_CHECK
+appraise func=POLICY_CHECK appraise_type=imasig
diff --git a/evmtest/files/policies/unknown_signed_policy 
b/evmtest/files/policies/unknown_signed_policy
new file mode 100644
index 000..1f8f8f4
--- /dev/null
+++ b/evmtest/files/policies/unknown_signed_policy
@@ -0,0 +1 @@
+audit func=POLICY_CHECK
diff --git a/evmtest/files/policies/unsigned_policy 
b/evmtest/files/policies/unsigned_policy
new file mode 100644
index 000..1f8f8f4
--- /dev/null
+++ b/evmtest/files/policies/unsigned_policy
@@ -0,0 +1 @@
+audit func=POLICY_CHECK
diff --git a/evmtest/functions/r_policy_sig.sh 
b/evmtest/functions/r_policy_sig.sh
new file mode 100755
index 000..7462c0a
--- /dev/null
+++ b/evmtest/functions/r_policy_sig.sh
@@ -0,0 +1,93 @@

[PATCH 1/7] evmtest: Regression testing Integrity Subsystem

2018-08-14 Thread David Jacobson
As the existing IMA/EVM features of the kernel mature, and new features are
being added, the number of kernel configuration options (Kconfig) and
methods for loading policies have been increasing. Rigorous testing of
the various IMA/EVM features is needed to ensure correct behavior and to
help avoid regressions.

Currently, only IMA-measurement can be tested (via LTP), a feature
introduced in Linux 2.6.30. Since then, IMA has grown to support
IMA-appraisal and IMA-audit. There are no LTP modules to test either of
these features.

This patchset introduces evmtest — a stand alone tool for regression
testing IMA. evmtest can be used to validate individual behaviors by
exercising execve, kexec, module load, and other LSM hooks.  evmtest
uses a combination of invalid signatures, invalid hashes, and unsigned
files to check that IMA-Appraisal catches all cases a running policy has
set. evmtest can also be used to validate that the kernel is properly
configured. For example, there are a number of Kconfig options that need
to be set, in addition to a local CA certificate being built-in or
memory reserved for embedding the certificate post-build.  evmtest
output is consistent. Consistent output allows evmtest to be plugged
into a testing framework/harness. Testing frameworks (such as xfstests)
require deterministic output. xfstests runs a test and compares its
output to a predefined value, created by running the test script under
conditions where it passes. evmtest provides output that can easily be
integrated with xfstests. All tests have a verbose mode (-v) that
outputs more information for debugging purposes.

New tests can be defined by placing them in the functions/ directory.
An "example_test.sh" script is provided for reference.

Example 1:
Successful example test output
$ evmtest runtest example_test -e /bin/bash
[*] Starting test: example_test
[*] TEST: PASSED

Example 1a: successful verbose example test output
$ evmtest runtest example_test -e /bin/bash -v
[*] Starting test: example_test
[*] Example file exists
[*] TEST: PASSED

Example 1b: failed verbose example test output
$ evmtest runtest example_test -e /bin/foo -v
[*] Starting test: example_test
[!] Example file does not exist
[!] TEST: FAILED

SYNOPSIS:
evmtest [runtest|help] test_name [test options]
options:
-h,--help   Displays a help message
-v,--verboseVerbose logging for debugging

Signed-off-by: David Jacobson 
---
 Makefile.am |   5 +-
 configure.ac|   1 +
 evmtest/INSTALL |  11 ++
 evmtest/Makefile.am |  23 
 evmtest/README  | 190 ++
 evmtest/evmtest |  74 ++
 evmtest/files/common.sh |  49 +++
 evmtest/files/load_policy.sh|  38 ++
 evmtest/functions/example_test.sh   |  75 ++
 evmtest/functions/r_env_validate.sh | 205 
 10 files changed, 670 insertions(+), 1 deletion(-)
 create mode 100644 evmtest/INSTALL
 create mode 100644 evmtest/Makefile.am
 create mode 100644 evmtest/README
 create mode 100755 evmtest/evmtest
 create mode 100755 evmtest/files/common.sh
 create mode 100755 evmtest/files/load_policy.sh
 create mode 100755 evmtest/functions/example_test.sh
 create mode 100755 evmtest/functions/r_env_validate.sh

diff --git a/Makefile.am b/Makefile.am
index dba408d..0cb4111 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -36,4 +36,7 @@ rmman:
 
 doc: evmctl.1.html rmman evmctl.1
 
-.PHONY: $(tarname)
+evmtest:
+   $(MAKE) -C evmtest
+
+.PHONY: $(tarname) evmtest
diff --git a/configure.ac b/configure.ac
index a5b4288..59ec1d1 100644
--- a/configure.ac
+++ b/configure.ac
@@ -52,6 +52,7 @@ EVMCTL_MANPAGE_DOCBOOK_XSL
 AC_CONFIG_FILES([Makefile
src/Makefile
packaging/ima-evm-utils.spec
+   evmtest/Makefile
])
 AC_OUTPUT
 
diff --git a/evmtest/INSTALL b/evmtest/INSTALL
new file mode 100644
index 000..699853e
--- /dev/null
+++ b/evmtest/INSTALL
@@ -0,0 +1,11 @@
+Installation Instructions
+-
+
+Basic Installation
+--
+
+From the root directory of ima-evm-utils, run the commands: `./autogen.sh`
+followed by `./configure`. `cd` to the evmtest directory, execute `make`, and
+`sudo make install`.
+For details on how to use `evmtest`, see the installed manpage or the README.
+There is an evmtest.html provided as well.
diff --git a/evmtest/Makefile.am b/evmtest/Makefile.am
new file mode 100644
index 000..388ead1
--- /dev/null
+++ b/evmtest/Makefile.am
@@ -0,0 +1,23 @@
+prefix=@prefix@
+datarootdir=@datarootdir@
+exec_prefix=@exec_prefix@
+bindir=@bindir@
+
+all: evmtest.1
+
+evmtest.1:
+   asciidoc -d manpage -b docbook -o evmtest.1.xsl README
+   asciidoc INSTALL
+   xsltproc --nonet -o $@ $(MANPAGE_DOCBOOK_XSL) evmtest.1.xsl
+   asciidoc -o evmtest.html README
+   rm -f evmtest.1.xsl

[PATCH 4/7] evmtest: test kexec signature policy

2018-08-14 Thread David Jacobson
With secure boot enabled, the bootloader verifies the kernel image's
signature before transferring control to it. With Linux as the
bootloader running with secure boot enabled, kexec needs to verify the
kernel image's signature.

This patch defines a new test named "kexec_sig", which first attempts to
kexec an unsigned kernel image with an IMA policy that requires
signatures on any kernel image. Then, the test attempts to kexec the
signed kernel image, which should succeed.

Signed-off-by: David Jacobson 
---
 evmtest/files/policies/kexec_policy |   3 +
 evmtest/functions/r_kexec_sig.sh| 156 
 2 files changed, 159 insertions(+)
 create mode 100644 evmtest/files/policies/kexec_policy
 create mode 100755 evmtest/functions/r_kexec_sig.sh

diff --git a/evmtest/files/policies/kexec_policy 
b/evmtest/files/policies/kexec_policy
new file mode 100644
index 000..dc00fa7
--- /dev/null
+++ b/evmtest/files/policies/kexec_policy
@@ -0,0 +1,3 @@
+appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig
+measure func=KEXEC_KERNEL_CHECK
+audit func=KEXEC_KERNEL_CHECK
diff --git a/evmtest/functions/r_kexec_sig.sh b/evmtest/functions/r_kexec_sig.sh
new file mode 100755
index 000..e1295b9
--- /dev/null
+++ b/evmtest/functions/r_kexec_sig.sh
@@ -0,0 +1,156 @@
+#!/bin/bash
+# Author: David Jacobson 
+TEST="r_kexec_sig"
+ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )/.."
+source $ROOT/files/common.sh
+VERBOSE=0
+POLICY_LOAD=$ROOT/files/load_policy.sh
+
+# This test validates that IMA measures and appraises signatures on kernel
+# images when trying to kexec, if the current policy requires that.
+usage() {
+   echo ""
+   echo "kexec_sig -k  [-i > /dev/null
+
+if [[ $? != 0 ]]; then
+   fail "Could not update policy - verify keys"
+fi
+
+v_out "Testing kexec (using kexec_file_load) on unsigned image..."
+# -s uses the kexec_file_load syscall
+kexec -s -l $KERNEL_IMAGE &>> /dev/null
+loaded_unsigned=$?
+if [[ $loaded_unsigned != 0 ]]; then # Permission denied (IMA)
+   v_out "Correctly prevented kexec of an unsigned image"
+else
+   kexec -s -u
+   fail "kexec loaded instead of rejecting. Unloading and exiting."
+fi
+
+v_out "Testing kexec (using kexec_load) on unsigned image..."
+kexec -l $KERNEL_IMAGE &>> /dev/null
+if [[ $? == 0 ]]; then
+   kexec -u
+   fail "Kexec loaded unsigned image - unloading"
+else
+   v_out "Correctly prevented kexec of an unsigned image"
+fi
+
+# On some systems this prevents resigning the kernel image
+
+#v_out "Signing image with invalid key..."
+#evmctl ima_sign -f $KERNEL_IMAGE -k $ROOT/files/bad_privkey_ima.pem
+#kexec -s -l $KERNEL_IMAGE &>> /dev/null
+#loaded_bad_signature=$?
+
+#if [[ $loaded_bad_signature == 0 ]]; then
+#  kexec -u
+#  fail "Kernel image signed by invalid party was allowed to load.\
+#  Unloaded"
+#fi
+
+#v_out "Correctly prevented loading of kernel signed by unknown key"
+
+v_out "Signing kernel image with provided key..."
+evmctl ima_sign -f $KERNEL_IMAGE -k $IMA_KEY
+
+v_out "Attempting to kexec signed image using kexec_file_load..."
+kexec -s -l $KERNEL_IMAGE &>> /dev/null
+
+loaded_signed=$?
+if [[ $loaded_signed != 0 ]]; then
+   fail "kexec rejected a signed image - possibly due to PECOFF signature"
+else
+   v_out "kexec correctly loaded signed image...unloading"
+fi
+
+kexec -s -u
+
+v_out "Attempting kexec_load on signed kernel... [should fail]"
+kexec -l $KERNEL_IMAGE &>> /dev/null
+
+if [[ $? == 0 ]]; then
+   kexec -u
+   fail "Signed image was allowed to load without file descriptor for\
+   appraisal. Unloading."
+fi
+
+v_out "Correctly prevented loading"
+
+v_out "Cleaning up..."
+if [[ ! -z $TEMP_LOCATION ]]; then
+   rm $TEMP_LOCATION
+fi
+
+passed
-- 
2.17.1



[PATCH 5/7] evmtest: validate boot record

2018-08-14 Thread David Jacobson
The first record in the IMA runtime measurement list is the boot
aggregate - a hash of PCRs 0-7. This test calculates the boot aggregate
based on the PCRs and compares it to IMA's boot aggregate.

Dependencies: a TPM, IBMTSS2.

Signed-off-by: David Jacobson 
---
 evmtest/functions/r_validate_boot_record.sh | 140 
 1 file changed, 140 insertions(+)
 create mode 100755 evmtest/functions/r_validate_boot_record.sh

diff --git a/evmtest/functions/r_validate_boot_record.sh 
b/evmtest/functions/r_validate_boot_record.sh
new file mode 100755
index 000..421cbf1
--- /dev/null
+++ b/evmtest/functions/r_validate_boot_record.sh
@@ -0,0 +1,140 @@
+#!/bin/bash
+# Author: David Jacobson 
+TEST="r_validate_boot_record"
+
+ROOT="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )/.."
+source $ROOT/files/common.sh
+
+TPM_VERSION="2.0" # DEFAULT
+VERBOSE=0
+TSS_DIR=`locate ibmtpm20tss | head -1`
+EVENT_EXTEND=$TSS_DIR/utils12/eventextend
+LD_LIBRARY_PATH=$TSS_DIR/utils:$TSS_DIR/utils12
+MEASUREMENT_FILE=$EVMTEST_SECFS/tpm0/binary_bios_measurements
+# This test validates the eventlog against the hardware PCRs in the TPM, and
+# the boot aggregate against IMA.
+
+usage (){
+   echo "r_validate_boot_record [-hv]"
+   echo ""
+   echo "  This test must be run as root"
+   echo ""
+   echo "  This test will attempt to validate PCRs 0-7 in the TPM"
+   echo "  It will also validate the boot_aggregate based on those PCRs"
+   echo "  against what IMA has recorded"
+   echo ""
+   echo "  -h,--help   Display this help message"
+   echo "  -v,--verboseVerbose logging"
+}
+
+
+TEMP=`getopt -o 'hv' -l 'help,verbose' -n 'r_validate_boot_record' -- "$@"`
+eval set -- "$TEMP"
+
+while true ; do
+   case "$1" in
+   -h|--help) usage; exit; shift;;
+   -v|--verbose) VERBOSE=1; shift;;
+   --) shift; break;;
+   *) echo "[*] Unrecognized option $1"; exit 1 ;;
+   esac
+done
+
+EVMTEST_require_root
+
+echo "[*] Starting test: $TEST"
+
+v_out "Checking if securityfs is mounted..."
+if [[ -z $EVMTEST_SECFS_EXISTS ]]; then
+   fail "securityfs not found..."
+fi
+
+v_out "Verifying TPM is present..."
+if [[ ! -d $EVMTEST_SECFS/tpm0 ]]; then
+   fail "Could not locate TPM in $EVMTEST_SECFS"
+fi
+
+v_out "TPM found..."
+
+v_out "Checking if system supports reading event log..."
+
+if [[ ! -f $EVMTEST_SECFS/tpm0/binary_bios_measurements ]]; then
+   fail "Kernel does not support reading BIOS measurements,
+   please update to at least 4.16.0"
+fi
+
+
+
+v_out "Verifying TPM Version"
+if [[ -e /sys/class/tpm/tpm0/device/caps ]]; then
+   contains_12=`grep 'TCG version: 1.2' /sys/class/tpm/tpm0/device/caps`
+   if [[ -n $contains_12 ]]; then
+   v_out "TPM 1.2"
+   TPM_VERSION="1.2"
+   fi
+else
+   v_out "TPM 2.0"
+fi
+
+v_out "Checking if system supports reading PCRs..."
+
+if [[ ! -d $TSS_DIR ]]; then
+   fail "Could not find TSS2, please install using the package and
+try again"
+fi
+
+v_out "Grabbing PCR values..."
+pcrs=() # array to store the Hardware PCR values
+sim_pcrs=() # What PCRs should be according to the event log
+halg=$(grep boot_aggregate $EVMTEST_SECFS/ima/ascii_runtime_measurements|\
+   sed -n 's/.*\(sha[^:]*\):.*/\1/p')
+
+for ((i=0; i<=7; i++)); do
+   if [[ $TPM_VERSION == "1.2" ]]; then
+   pcrs[i]=`TPM_INTERFACE_TYPE=dev $TSS_DIR/utils12/pcrread \
+   -ha $i -ns`
+   else
+   pcrs[i]=`TPM_INTERFACE_TYPE=dev $TSS_DIR/utils/pcrread \
+   -ha $i -halg $halg -ns`
+   fi
+done
+
+tss_out=`LD_LIBRARY_PATH=$LD_LIBRARY_PATH $EVENT_EXTEND -if \
+   $MEASUREMENT_FILE -sim -ns`
+for ((y=2; y<=9; y++)); do
+   # Parse TSS output - first strip away PCR, then split on :, then
+   # remove leading whitespace
+   x=`echo $tss_out | awk -v y=$y -F 'PCR' '{print $y}'`
+   x=`echo "$x" | awk -F ":" '{print $2}' | sed -e 's/^[ \t]*//'`
+   index=$((y-2))
+   sim_pcrs[$index]=$x
+done
+
+v_out "Validating PCRs.."
+for ((i=0; i<=7; i++)); do
+   v_out "SIM PCR [$i]: ${sim_pcrs[$i]}"
+   v_out "TPM PCR [$i]: ${pcrs[$i]}"
+   if [[ "${pcrs[$i]}" != "${sim_pcrs[$i]}" ]]; then
+   v_out "PCRs are incorrect..."
+   fail "Mismatch at PCR "$i" "
+   else
+   v_out "PCR $i validated..."
+   fi
+done
+
+
+v_out "Validating Boot Aggregate..."
+tss_boot_agg=`echo $tss_out | awk -F "boot aggregate:" '{print $2}'| tr -d " "`
+ima_boot_agg=`grep boot_aggregate \
+$EVMTEST_SECFS/ima/ascii_runtime_measurements|cut -d ":" -f2|cut -d " " -f1`
+v_out "TSS BOOT AGG: $tss_boot_agg"
+v_out "IMA BOOT AGG: $ima_boot_agg"
+
+if [ "$tss_boot_agg" != "$ima_boot_agg" ]; then
+   fail "Boot Aggregate is inconsistent"
+else
+   v_out "Boot Aggregate validated"
+fi
+
+echo "[*] TEST: PASSED"

[PATCH 4.14 012/104] init: rename and re-order boot_cpu_state_init()

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Linus Torvalds 

commit b5b1404d0815894de0690de8a1ab58269e56eae6 upstream.

This is purely a preparatory patch for upcoming changes during the 4.19
merge window.

We have a function called "boot_cpu_state_init()" that isn't really
about the bootup cpu state: that is done much earlier by the similarly
named "boot_cpu_init()" (note lack of "state" in name).

This function initializes some hotplug CPU state, and needs to run after
the percpu data has been properly initialized.  It even has a comment to
that effect.

Except it _doesn't_ actually run after the percpu data has been properly
initialized.  On x86 it happens to do that, but on at least arm and
arm64, the percpu base pointers are initialized by the arch-specific
'smp_prepare_boot_cpu()' hook, which ran _after_ boot_cpu_state_init().

This had some unexpected results, and in particular we have a patch
pending for the merge window that did the obvious cleanup of using
'this_cpu_write()' in the cpu hotplug init code:

  -   per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
  +   this_cpu_write(cpuhp_state.state, CPUHP_ONLINE);

which is obviously the right thing to do.  Except because of the
ordering issue, it actually failed miserably and unexpectedly on arm64.

So this just fixes the ordering, and changes the name of the function to
be 'boot_cpu_hotplug_init()' to make it obvious that it's about cpu
hotplug state, because the core CPU state was supposed to have already
been done earlier.

Marked for stable, since the (not yet merged) patch that will show this
problem is marked for stable.

Reported-by: Vlastimil Babka 
Reported-by: Mian Yousaf Kaukab 
Suggested-by: Catalin Marinas 
Acked-by: Thomas Gleixner 
Cc: Will Deacon 
Cc: sta...@kernel.org
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman 

---
 include/linux/cpu.h |2 +-
 init/main.c |2 +-
 kernel/cpu.c|2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -30,7 +30,7 @@ struct cpu {
 };
 
 extern void boot_cpu_init(void);
-extern void boot_cpu_state_init(void);
+extern void boot_cpu_hotplug_init(void);
 extern void cpu_init(void);
 extern void trap_init(void);
 
--- a/init/main.c
+++ b/init/main.c
@@ -543,8 +543,8 @@ asmlinkage __visible void __init start_k
setup_command_line(command_line);
setup_nr_cpu_ids();
setup_per_cpu_areas();
-   boot_cpu_state_init();
smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */
+   boot_cpu_hotplug_init();
 
build_all_zonelists(NULL);
page_alloc_init();
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -2022,7 +2022,7 @@ void __init boot_cpu_init(void)
 /*
  * Must be called _AFTER_ setting up the per_cpu areas
  */
-void __init boot_cpu_state_init(void)
+void __init boot_cpu_hotplug_init(void)
 {
per_cpu_ptr(&cpuhp_state, smp_processor_id())->state = CPUHP_ONLINE;
 }




[PATCH 4.14 014/104] make sure that __dentry_kill() always invalidates d_seq, unhashed or not

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab upstream.

RCU pathwalk relies upon the assumption that anything that changes
->d_inode of a dentry will invalidate its ->d_seq.  That's almost
true - the one exception is that the final dput() of already unhashed
dentry does *not* touch ->d_seq at all.  Unhashing does, though,
so for anything we'd found by RCU dcache lookup we are fine.
Unfortunately, we can *start* with an unhashed dentry or jump into
it.

We could try and be careful in the (few) places where that could
happen.  Or we could just make the final dput() invalidate the damn
thing, unhashed or not.  The latter is much simpler and easier to
backport, so let's do it that way.

Reported-by: "Dae R. Jeong" 
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/dcache.c |7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -357,14 +357,11 @@ static void dentry_unlink_inode(struct d
__releases(dentry->d_inode->i_lock)
 {
struct inode *inode = dentry->d_inode;
-   bool hashed = !d_unhashed(dentry);
 
-   if (hashed)
-   raw_write_seqcount_begin(&dentry->d_seq);
+   raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
	hlist_del_init(&dentry->d_u.d_alias);
-   if (hashed)
-   raw_write_seqcount_end(&dentry->d_seq);
+   raw_write_seqcount_end(&dentry->d_seq);
	spin_unlock(&dentry->d_lock);
	spin_unlock(&inode->i_lock);
if (!inode->i_nlink)




[PATCH 4.17 72/97] x86/bugs, kvm: Introduce boot-time control of L1TF mitigations

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Jiri Kosina 

commit d90a7a0ec83fb86622cd7dae23255d3c50a99ec8 upstream

Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.

  full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.

  flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.

  flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.

  flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.

  off
Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.
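
For illustration only (not part of the original commit message): a minimal
userspace model of how such an option string could map onto a mitigation
mode. The enum names follow the l1tf_mitigations values this patch adds to
arch/x86/include/asm/processor.h; the real parser in
arch/x86/kernel/cpu/bugs.c differs in detail.

#include <stdio.h>
#include <string.h>

enum l1tf_mitigations {
	L1TF_MITIGATION_OFF,
	L1TF_MITIGATION_FLUSH_NOWARN,
	L1TF_MITIGATION_FLUSH,
	L1TF_MITIGATION_FLUSH_NOSMT,
	L1TF_MITIGATION_FULL,
	L1TF_MITIGATION_FULL_FORCE,
};

/* Map an "l1tf=" argument onto a mode; unknown strings keep the default. */
static enum l1tf_mitigations parse_l1tf(const char *str)
{
	if (!strcmp(str, "off"))
		return L1TF_MITIGATION_OFF;
	if (!strcmp(str, "flush,nowarn"))
		return L1TF_MITIGATION_FLUSH_NOWARN;
	if (!strcmp(str, "flush"))
		return L1TF_MITIGATION_FLUSH;
	if (!strcmp(str, "flush,nosmt"))
		return L1TF_MITIGATION_FLUSH_NOSMT;
	if (!strcmp(str, "full"))
		return L1TF_MITIGATION_FULL;
	if (!strcmp(str, "full,force"))
		return L1TF_MITIGATION_FULL_FORCE;
	return L1TF_MITIGATION_FLUSH;		/* documented default */
}

int main(void)
{
	printf("%d\n", parse_l1tf("flush,nosmt"));	/* prints 3 */
	return 0;
}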

Let KVM adhere to these semantics, which means:

  - 'l1tf=full,force'   : Perform L1D flushes. No runtime control
  possible.

  - 'l1tf=full'
  - 'l1tf=flush'
  - 'l1tf=flush,nosmt'  : Perform L1D flushes and warn on VM start if
  SMT has been runtime enabled or L1D flushing
  has been run-time disabled

  - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.

  - 'l1tf=off'  : L1D flushes are not performed and no warnings
  are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when l1tf=full,force is set.

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.

Signed-off-by: Jiri Kosina 
Signed-off-by: Thomas Gleixner 
Tested-by: Jiri Kosina 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Josh Poimboeuf 
Link: https://lkml.kernel.org/r/20180713142323.202758...@linutronix.de
Signed-off-by: Greg Kroah-Hartman 
---
 Documentation/ABI/testing/sysfs-devices-system-cpu |4 +
 Documentation/admin-guide/kernel-parameters.txt|   68 +++--
 arch/x86/include/asm/processor.h   |   12 +++
 arch/x86/kernel/cpu/bugs.c |   44 +
 arch/x86/kvm/vmx.c |   56 +
 5 files changed, 165 insertions(+), 19 deletions(-)

--- a/Documentation/ABI/testing/sysfs-devices-system-cpu
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -479,6 +479,7 @@ What:   /sys/devices/system/cpu/vulnerabi
/sys/devices/system/cpu/vulnerabilities/spectre_v1
/sys/devices/system/cpu/vulnerabilities/spectre_v2
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass
+   /sys/devices/system/cpu/vulnerabilities/l1tf
 Date:  January 2018
 Contact:   Linux kernel mailing list 
 Description:   Information about CPU vulnerabilities
@@ -491,6 +492,9 @@ Description:Information about CPU vulne
"Vulnerable"  CPU is affected and no mitigation in effect
"Mitigation: $M"  CPU is affected and mitigation $M is in effect
 
+   Details about the l1tf file can be found in
+   Documentation/admin-guide/l1tf.rst
+
 What:  /sys/devices/system/cpu/smt
/sys/devices/system/cpu/smt/active
/sys/devices/system/cpu/smt/control
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1937,12 +1937,6 @@
[KVM,ARM] Allow use of GICv4 for direct injection of
LPIs.
 
-   kvm-intel.nosmt=[KVM,Intel] If the L1TF CPU bug is present (CVE-2018-3620)
-   and the system has SMT (aka Hyper-Threading) enabled then
-   don't allow guests to be created.
-
-   Default is 0 (allow guests to be created).
-
kvm-intel.ept=  [KVM,Intel] 

[PATCH 4.17 71/97] cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early

2018-08-14 Thread Greg Kroah-Hartman
4.17-stable review patch.  If anyone has any objections, please let me know.

--

From: Thomas Gleixner 

commit fee0aede6f4739c87179eca76136f83210953b86 upstream

The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support
SMT) when the sysfs SMT control file is initialized.

That was fine so far as this was only required to make the output of the
control file correct and to prevent writes in that case.

With the upcoming l1tf command line parameter, this needs to be set up
before the L1TF mitigation selection and command line parsing happens.

Signed-off-by: Thomas Gleixner 
Tested-by: Jiri Kosina 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Josh Poimboeuf 
Link: https://lkml.kernel.org/r/20180713142323.121795...@linutronix.de
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/kernel/cpu/bugs.c |6 ++
 include/linux/cpu.h|2 ++
 kernel/cpu.c   |   13 ++---
 3 files changed, 18 insertions(+), 3 deletions(-)

--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -58,6 +58,12 @@ void __init check_bugs(void)
 {
identify_boot_cpu();
 
+   /*
+* identify_boot_cpu() initialized SMT support information, let the
+* core code know.
+*/
+   cpu_smt_check_topology();
+
if (!IS_ENABLED(CONFIG_SMP)) {
pr_info("CPU: ");
		print_cpu_info(&boot_cpu_data);
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -178,9 +178,11 @@ enum cpuhp_smt_control {
 #if defined(CONFIG_SMP) && defined(CONFIG_HOTPLUG_SMT)
 extern enum cpuhp_smt_control cpu_smt_control;
 extern void cpu_smt_disable(bool force);
+extern void cpu_smt_check_topology(void);
 #else
 # define cpu_smt_control   (CPU_SMT_ENABLED)
 static inline void cpu_smt_disable(bool force) { }
+static inline void cpu_smt_check_topology(void) { }
 #endif
 
 #endif /* _LINUX_CPU_H_ */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -361,6 +361,16 @@ void __init cpu_smt_disable(bool force)
}
 }
 
+/*
+ * The decision whether SMT is supported can only be done after the full
+ * CPU identification. Called from architecture code.
+ */
+void __init cpu_smt_check_topology(void)
+{
+   if (!topology_smt_supported())
+   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
+}
+
 static int __init smt_cmdline_disable(char *str)
 {
cpu_smt_disable(str && !strcmp(str, "force"));
@@ -2115,9 +2125,6 @@ static const struct attribute_group cpuh
 
 static int __init cpu_smt_state_init(void)
 {
-   if (!topology_smt_supported())
-   cpu_smt_control = CPU_SMT_NOT_SUPPORTED;
-
	return sysfs_create_group(&cpu_subsys.dev_root->kobj,
				  &cpuhp_smt_attr_group);
 }




[PATCH 4.14 013/104] root dentries need RCU-delayed freeing

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Al Viro 

commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream.

Since mountpoint crossing can happen without leaving lazy mode,
root dentries do need the same protection against having their
memory freed without RCU delay as everything else in the tree.

It's partially hidden by RCU delay between detaching from the
mount tree and dropping the vfsmount reference, but the starting
point of pathwalk can be on an already detached mount, in which
case umount-caused RCU delay has already passed by the time the
lazy pathwalk grabs rcu_read_lock().  If the starting point
happens to be at the root of that vfsmount *and* that vfsmount
covers the entire filesystem, we get trouble.

Fixes: 48a066e72d97 ("RCU'd vsfmounts")
Cc: sta...@vger.kernel.org
Signed-off-by: Al Viro 
Signed-off-by: Greg Kroah-Hartman 

---
 fs/dcache.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1922,10 +1922,12 @@ struct dentry *d_make_root(struct inode
 
if (root_inode) {
res = __d_alloc(root_inode->i_sb, NULL);
-   if (res)
+   if (res) {
+   res->d_flags |= DCACHE_RCUACCESS;
d_instantiate(res, root_inode);
-   else
+   } else {
iput(root_inode);
+   }
}
return res;
 }




[PATCH 4.14 017/104] mtd: nand: qcom: Add a NULL check for devm_kasprintf()

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Fabio Estevam 

commit 069f05346d01e7298939f16533953cdf52370be3 upstream.

devm_kasprintf() may fail, so we should better add a NULL check
and propagate an error on failure.

Signed-off-by: Fabio Estevam 
Signed-off-by: Boris Brezillon 
Signed-off-by: Amit Pundir 
Signed-off-by: Greg Kroah-Hartman 

---
 drivers/mtd/nand/qcom_nandc.c |3 +++
 1 file changed, 3 insertions(+)

--- a/drivers/mtd/nand/qcom_nandc.c
+++ b/drivers/mtd/nand/qcom_nandc.c
@@ -2544,6 +2544,9 @@ static int qcom_nand_host_init(struct qc
 
nand_set_flash_node(chip, dn);
mtd->name = devm_kasprintf(dev, GFP_KERNEL, "qcom_nand.%d", host->cs);
+   if (!mtd->name)
+   return -ENOMEM;
+
mtd->owner = THIS_MODULE;
mtd->dev.parent = dev;
 




Re: [PATCH] PCI: Equalize hotplug memory for non/occupied slots

2018-08-14 Thread Bjorn Helgaas
On Wed, Jul 25, 2018 at 05:02:59PM -0600, Jon Derrick wrote:
> Currently, a hotplug bridge will be given hpmemsize additional memory if
> available, in order to satisfy any future hotplug allocation
> requirements.
> 
> These calculations don't consider the current memory size of the hotplug
> bridge/slot, so hotplug bridges/slots which have downstream devices will
> get their current allocation in addition to the hpmemsize value.
> 
> This makes for possibly undesirable results with a mix of unoccupied and
> occupied slots (ex, with hpmemsize=2M):
> 
> 02:03.0 PCI bridge: <-- Occupied
>   Memory behind bridge: d620-d64f [size=3M]
> 02:04.0 PCI bridge: <-- Unoccupied
>   Memory behind bridge: d650-d66f [size=2M]
> 
> This change considers the current allocation size when using the
> hpmemsize parameter to make the reservations predictable for the mix of
> unoccupied and occupied slots:
> 
> 02:03.0 PCI bridge: <-- Occupied
>   Memory behind bridge: d620-d63f [size=2M]
> 02:04.0 PCI bridge: <-- Unoccupied
>   Memory behind bridge: d640-d65f [size=2M]

The I/O sizing code (pbus_size_io() and calculate_iosize()) is essentially
identical to the mem sizing code you're updating.  I assume the same
considerations would apply there?  If not, please include a note in the
changelog about why you changed the mem code but not the I/O code.

> Signed-off-by: Jon Derrick 
> ---
> Original RFC here:
> https://patchwork.ozlabs.org/patch/945374/
> 
> I split this bit out from the RFC while awaiting the pci string handling
> enhancements to handle per-device settings
> 
> Changed from the RFC: a simpler algorithm
> 
>  drivers/pci/setup-bus.c | 13 ++---
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 79b1824..5ae39e6 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -831,7 +831,8 @@ static resource_size_t calculate_iosize(resource_size_t 
> size,
>  
>  static resource_size_t calculate_memsize(resource_size_t size,
>   resource_size_t min_size,
> - resource_size_t size1,
> + resource_size_t add_size,
> + resource_size_t children_add_size,
>   resource_size_t old_size,
>   resource_size_t align)
>  {
> @@ -841,7 +842,7 @@ static resource_size_t calculate_memsize(resource_size_t 
> size,
>   old_size = 0;
>   if (size < old_size)
>   size = old_size;
> - size = ALIGN(size + size1, align);
> + size = ALIGN(max(size, add_size) + children_add_size, align);
>   return size;
>  }
>  
> @@ -1079,12 +1080,10 @@ static int pbus_size_mem(struct pci_bus *bus, 
> unsigned long mask,
>  
>   min_align = calculate_mem_align(aligns, max_order);
>   min_align = max(min_align, window_alignment(bus, b_res->flags));
> - size0 = calculate_memsize(size, min_size, 0, resource_size(b_res), 
> min_align);
> + size0 = calculate_memsize(size, min_size, 0, 0, resource_size(b_res), 
> min_align);
>   add_align = max(min_align, add_align);
> - if (children_add_size > add_size)
> - add_size = children_add_size;
> - size1 = (!realloc_head || (realloc_head && !add_size)) ? size0 :
> - calculate_memsize(size, min_size, add_size,
> + size1 = (!realloc_head || (realloc_head && !add_size && 
> !children_add_size)) ? size0 :
> + calculate_memsize(size, min_size, add_size, children_add_size,
>   resource_size(b_res), add_align);
>   if (!size0 && !size1) {
>   if (b_res->start || b_res->end)
> -- 
> 1.8.3.1
> 
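
For illustration only: a small userspace model of the reworked size
combination in calculate_memsize() quoted above. Only the final
ALIGN(max(size, add_size) + children_add_size, align) step is modeled; the
min_size/old_size clamping and the surrounding pbus_size_mem() logic are
left out.

#include <stdio.h>

typedef unsigned long long resource_size_t;

#define ALIGN(x, a)	(((x) + (a) - 1) & ~((resource_size_t)(a) - 1))
#define MB		(1ULL << 20)

/* The combination step of the reworked calculate_memsize(). */
static resource_size_t combine(resource_size_t size, resource_size_t add_size,
			       resource_size_t children_add_size,
			       resource_size_t align)
{
	return ALIGN((size > add_size ? size : add_size) + children_add_size,
		     align);
}

int main(void)
{
	/* Occupied slot already using 1M, hpmemsize asks for 2M headroom. */
	printf("%llu MB\n", combine(1 * MB, 2 * MB, 0, 1 * MB) / MB);	/* 2 */
	/* Unoccupied slot, only the 2M hotplug reservation. */
	printf("%llu MB\n", combine(0, 2 * MB, 0, 1 * MB) / MB);	/* 2 */
	return 0;
}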


Re: [PATCH 2/2] bcache: add undef for macro in function

2018-08-14 Thread Coly Li
On 2018/8/14 10:59 PM, cdb...@163.com wrote:
> Hi Coly,
> The three macros are only used locally in the function "__cached_dev", so I
> think they should be undefined before leaving the function.
> 

Hi Dongbo,

It would be worth doing this if there were a potential conflict. But they
are defined in sysfs.c and not exported anywhere else, so I don't expect
any symbol conflict, and we don't need to worry about this.

Coly Li


> ---Original---
> *From:* "Coly Li"
> *Date:* Tue, Aug 14, 2018 20:16 PM
> *To:* "Dongbo Cao";
> *Cc:*
> "linux-kernel";"linux-bcache";"kent.overstreet";
> *Subject:* Re: [PATCH 2/2] bcache: add undef for macro in function
> 
> On 2018/8/14 4:16 PM, Dongbo Cao wrote:
>> add undef for macro d_strtoul,d_strtoul_nonzero and d_strtoi_h
>> 
>> Signed-off-by: Dongbo Cao 
>> ---
>>  drivers/md/bcache/sysfs.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/md/bcache/sysfs.c b/drivers/md/bcache/sysfs.c
>> index 225b15aa..ed67a290 100644
>> --- a/drivers/md/bcache/sysfs.c
>> +++ b/drivers/md/bcache/sysfs.c
>> @@ -349,6 +349,9 @@ STORE(__cached_dev)
>>   if (attr == &sysfs_stop)
>>   bcache_device_stop(&dc->disk);
>>  
>> +#undef d_strtoul
>> +#undef d_strtoul_nonzero
>> +#undef d_strtoi_h
>>   return size;
>>  }
>>  
>> 
> 
> Hi Dongbo,
> 
> Could you please to provide the motivation why you make this change ?
> 
> Thanks.
> 
> Coly Li



Re: [PATCH] Handle clock_gettime(CLOCK_TAI) in VDSO

2018-08-14 Thread David Woodhouse
On Tue, 2018-08-14 at 07:20 -0700, Andy Lutomirski wrote:
> > +   /* Doubled switch statement to work around kernel Makefile error */
> > +   /* See: 
> > https://www.mail-archive.com/gcc-bugs@gcc.gnu.org/msg567499.html */
> 
> NAK.
> 
> The issue here (after reading that thread) is that, with our current
> compile options, gcc generates a jump table once the switch statement
> hits five entries.  And it uses retpolines for it, and somehow it
> generates the relocations in such a way that the vDSO build fails. 
> We
> need to address this so that the vDSO build is reliable, but there's
> an important question here:
> 
> Should the vDSO be built with retpolines, or should it be built with
> indirect branches?  Or should we go out of our way to make sure that
> the vDSO contains neither retpolines nor indirect branches?
> 
> We could accomplish the latter (sort of) by manually converting the
> switch into the appropriate if statements, but that's rather ugly.
> 
> (Hmm.  We should add exports to directly read each clock source.
> They'll be noticeably faster, especially when
> cache-and-predictor-code.)

Surely it's kind of expected that the vDSO can't find an externally
provided __x86_indirect_thunk_rax symbol, since we only provide one as
part of the kernel image.

Building the vDSO with -mindirect-branch=thunk(|-inline) should fix
that, if we want retpolines in the vDSO.

There's also -fno-jump-tables.
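
For illustration only: a hedged userspace sketch of the "convert the switch
into if statements" idea. The do_*() helpers are placeholders rather than
the real vDSO readers; compiling the real switch with -fno-jump-tables would
have the same effect of avoiding a jump table and its indirect branch.

#include <stdio.h>
#include <time.h>

#ifndef CLOCK_BOOTTIME
#define CLOCK_BOOTTIME	7
#endif
#ifndef CLOCK_TAI
#define CLOCK_TAI	11
#endif

static long do_realtime(void)  { return 1; }
static long do_monotonic(void) { return 2; }
static long do_raw(void)       { return 3; }
static long do_boottime(void)  { return 4; }
static long do_tai(void)       { return 5; }

/* An if-chain compiles to conditional branches only, never a jump table. */
static long clock_dispatch(clockid_t clock)
{
	if (clock == CLOCK_REALTIME)
		return do_realtime();
	if (clock == CLOCK_MONOTONIC)
		return do_monotonic();
	if (clock == CLOCK_MONOTONIC_RAW)
		return do_raw();
	if (clock == CLOCK_BOOTTIME)
		return do_boottime();
	if (clock == CLOCK_TAI)
		return do_tai();
	return -1;
}

int main(void)
{
	printf("%ld\n", clock_dispatch(CLOCK_TAI));	/* 5 */
	return 0;
}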


smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)

2018-08-14 Thread Song Liu
On Mon, Aug 13, 2018 at 9:37 PM, Ravi Bangoria
 wrote:
> Hi Song,
>
> On 08/13/2018 10:42 PM, Song Liu wrote:
>> On Mon, Aug 13, 2018 at 6:17 AM, Oleg Nesterov  wrote:
>>> On 08/13, Ravi Bangoria wrote:

> But damn, process creation (exec) is trivial. We could add a new 
> uprobe_exec()
> hook and avoid delayed_uprobe_install() in uprobe_mmap().

 I'm sorry. I didn't get this.
>>>
>>> Sorry for confusion...
>>>
>>> I meant, if only exec*( could race with _register(), we could add another 
>>> uprobe
>>> hook which updates all (delayed) counters before return to user-mode.
>>>
> Afaics, the really problematic case is dlopen() which can race with 
> _register()
> too, right?

 dlopen() should internally use mmap() right? So what is the problem here? 
 Can
 you please elaborate.
>>>
>>> What I tried to say is that we can't avoid 
>>> uprobe_mmap()->delayed_uprobe_install()
>>> because dlopen() can race with _register() too, just like exec.
>>>
>>> Oleg.
>>>
>>
>> How about we do delayed_uprobe_install() per file? Say we keep a list
>> of delayed_uprobe
>> in load_elf_binary(). Then we can install delayed_uprobe after loading
>> all sections of the
>> file.
>
> I'm not sure if I totally understood the idea. But how this approach can
> solve dlopen() race with _register()?
>
> Rather, making delayed_uprobe_list an mm field seems simple and effective
> idea to me. The only overhead will be list_empty(mm->delayed_list) check.
>
> Please let me know if I misunderstood you.
>
> Thanks,
> Ravi

I misunderstood the problem here. I guess mm->delayed_list is the
easiest solution of the race condition.

Thanks,
Song


RE: [PATCH net-next 5/9] net: hns3: Fix for vf vlan delete failed problem

2018-08-14 Thread Salil Mehta
Hi Dave,

> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Monday, August 13, 2018 4:57 PM
> To: Salil Mehta 
> Cc: Zhuangyuzeng (Yisen) ; lipeng (Y)
> ; mehta.salil@gmail.com;
> net...@vger.kernel.org; linux-kernel@vger.kernel.org; Linuxarm
> ; linyunsheng 
> Subject: Re: [PATCH net-next 5/9] net: hns3: Fix for vf vlan delete
> failed problem
> 
> From: Salil Mehta 
> Date: Sun, 12 Aug 2018 10:47:34 +0100
> 
> > Fixes: 9dba194574e3 ("{topost} net: hns3: fix for vlan table
> problem")
> 
> This commit ID doesn't exist.

Thanks for catching this. This commit ID was from our internal branch
and ideally should have been from net-next - I should have caught this
earlier, sorry for this!

I have dropped this patch from the series for now, as there is another
related patch (the one referred to in the Fixes string) that would need to
be merged together with this patch before sending to net-next. I will
therefore resubmit this patch along with the other related patch in the
next cycle.

> 
> Also, I really don't think the string "{topost}" would be in the commit
> header line.

Yes, this is a stray string and will be removed when this patch is sent next.

Thank you
Salil


[PATCH] x86, asm: Use CC_SET()/CC_OUT() in arch/x86/include/asm/signal.h

2018-08-14 Thread Uros Bizjak
Remove open-coded uses of set instructions to use CC_SET()/CC_OUT() in 
arch/x86/include/asm/signal.h.

Signed-off-by: Uros Bizjak 
---
 arch/x86/include/asm/signal.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/signal.h b/arch/x86/include/asm/signal.h
index 5f9012ff52ed..33d3c88a7225 100644
--- a/arch/x86/include/asm/signal.h
+++ b/arch/x86/include/asm/signal.h
@@ -39,6 +39,7 @@ extern void do_signal(struct pt_regs *regs);
 
 #define __ARCH_HAS_SA_RESTORER
 
+#include <asm/asm.h>
 #include 
 
 #ifdef __i386__
@@ -86,9 +87,9 @@ static inline int __const_sigismember(sigset_t *set, int _sig)
 
 static inline int __gen_sigismember(sigset_t *set, int _sig)
 {
-   unsigned char ret;
-   asm("btl %2,%1\n\tsetc %0"
-   : "=qm"(ret) : "m"(*set), "Ir"(_sig-1) : "cc");
+   bool ret;
+   asm("btl %2,%1" CC_SET(c)
+   : CC_OUT(c) (ret) : "m"(*set), "Ir"(_sig-1));
return ret;
 }
 
-- 
2.17.1
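
For illustration only: a standalone userspace example of what
CC_SET()/CC_OUT() reduce to on a compiler with flag-output support
(GCC >= 6 or recent clang, x86 only). This is not the kernel macro
expansion itself, just the same idea applied to the btl test in the hunk
above.

#include <stdbool.h>
#include <stdio.h>

/* Return the carry flag of "bt" directly via an "=@ccc" flag output. */
static bool test_bit_carry(const unsigned long *set, int bit)
{
	bool ret;

	asm("btl %2, %1" : "=@ccc" (ret) : "m" (*set), "Ir" (bit));
	return ret;
}

int main(void)
{
	unsigned long word = 0x12;	/* bits 1 and 4 set */

	printf("%d %d %d\n", test_bit_carry(&word, 1),
	       test_bit_carry(&word, 2), test_bit_carry(&word, 4));	/* 1 0 1 */
	return 0;
}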



Re: [PATCH v2 1/3] mfd: cros: add charger port count command definition

2018-08-14 Thread Enric Balletbo Serra
Hi,
Missatge de Fabien Parent  del dia dv., 10 d’ag.
2018 a les 15:17:
>
> A new command has been added to the ChromeOS embedded controller
> that allows getting the charger port count. Unlike
> EC_CMD_USB_PD_PORTS, this new command also includes the dedicated
> port if present.
>
> This command will be used to expose the dedicated charger port
> in the ChromeOS charger driver.
>
> Signed-off-by: Fabien Parent 

It seems you missed my Reviewed-by tag, not a problem :). In any case,
adding it again in case it helps the patch land.

Reviewed-by: Enric Balletbo i Serra 

> Acked-for-MFD-by: Lee Jones 
> ---
> V1 -> V2:
>   * No change
> ---
>  include/linux/mfd/cros_ec_commands.h | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/mfd/cros_ec_commands.h 
> b/include/linux/mfd/cros_ec_commands.h
> index 0d926492ac3a..e3187f8bdb7e 100644
> --- a/include/linux/mfd/cros_ec_commands.h
> +++ b/include/linux/mfd/cros_ec_commands.h
> @@ -3005,6 +3005,16 @@ struct ec_params_usb_pd_info_request {
> uint8_t port;
>  } __packed;
>
> +/*
> + * This command will return the number of USB PD charge port + the number
> + * of dedicated port present.
> + * EC_CMD_USB_PD_PORTS does NOT include the dedicated ports
> + */
> +#define EC_CMD_CHARGE_PORT_COUNT 0x0105
> +struct ec_response_charge_port_count {
> +   uint8_t port_count;
> +} __packed;
> +
>  /* Read USB-PD Device discovery info */
>  #define EC_CMD_USB_PD_DISCOVERY 0x0113
>  struct ec_params_usb_pd_discovery_entry {
> --
> 2.18.0
>
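
For illustration only: a fragmentary, hedged sketch (not buildable on its
own) of how a charger driver might query the new command. It assumes the
cros_ec_cmd_xfer_status() helper and the struct cros_ec_command layout used
by the cros_ec code of this era; the actual driver change lands in a later
patch of this series.

static int cros_charger_get_port_count(struct cros_ec_device *ec_dev)
{
	struct ec_response_charge_port_count resp;
	struct cros_ec_command *msg;
	int ret;

	msg = kzalloc(sizeof(*msg) + sizeof(resp), GFP_KERNEL);
	if (!msg)
		return -ENOMEM;

	msg->command = EC_CMD_CHARGE_PORT_COUNT;
	msg->insize = sizeof(resp);	/* no request payload, reply only */

	ret = cros_ec_cmd_xfer_status(ec_dev, msg);
	if (ret >= 0) {
		memcpy(&resp, msg->data, sizeof(resp));
		ret = resp.port_count;
	}

	kfree(msg);
	return ret;
}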


[PATCH 4.18 01/79] x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Peter Zijlstra 

commit 5800dc5c19f34e6e03b5adab1282535cb102fafd upstream.

Nadav reported that on guests we're failing to rewrite the indirect
calls to CALLEE_SAVE paravirt functions. In particular the
pv_queued_spin_unlock() call is left unpatched and that is all over the
place. This obviously wrecks Spectre-v2 mitigation (for paravirt
guests) which relies on not actually having indirect calls around.

The reason is an incorrect clobber test in paravirt_patch_call(); this
function rewrites an indirect call with a direct call to the _SAME_
function, so there is no possible way the clobbers can be different
because of this.

Therefore remove this clobber check. Also put WARNs on the other patch
failure case (not enough room for the instruction) which I've not seen
trigger in my (limited) testing.

Three live kernel image disassemblies for lock_sock_nested (as a small
function that illustrates the problem nicely). PRE is the current
situation for guests, POST is with this patch applied and NATIVE is with
or without the patch for !guests.

PRE:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0x817be970 <+0>: push   %rbp
   0x817be971 <+1>: mov    %rdi,%rbp
   0x817be974 <+4>: push   %rbx
   0x817be975 <+5>: lea    0x88(%rbp),%rbx
   0x817be97c <+12>:callq  0x819f7160 <_cond_resched>
   0x817be981 <+17>:mov    %rbx,%rdi
   0x817be984 <+20>:callq  0x819fbb00 <_raw_spin_lock_bh>
   0x817be989 <+25>:mov    0x8c(%rbp),%eax
   0x817be98f <+31>:test   %eax,%eax
   0x817be991 <+33>:jne    0x817be9ba 
   0x817be993 <+35>:movl   $0x1,0x8c(%rbp)
   0x817be99d <+45>:mov    %rbx,%rdi
   0x817be9a0 <+48>:callq  *0x822299e8
   0x817be9a7 <+55>:pop    %rbx
   0x817be9a8 <+56>:pop    %rbp
   0x817be9a9 <+57>:mov    $0x200,%esi
   0x817be9ae <+62>:mov    $0x817be993,%rdi
   0x817be9b5 <+69>:jmpq   0x81063ae0 <__local_bh_enable_ip>
   0x817be9ba <+74>:mov    %rbp,%rdi
   0x817be9bd <+77>:callq  0x817be8c0 <__lock_sock>
   0x817be9c2 <+82>:jmp    0x817be993 
End of assembler dump.

POST:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0x817be970 <+0>: push   %rbp
   0x817be971 <+1>: mov    %rdi,%rbp
   0x817be974 <+4>: push   %rbx
   0x817be975 <+5>: lea    0x88(%rbp),%rbx
   0x817be97c <+12>:callq  0x819f7160 <_cond_resched>
   0x817be981 <+17>:mov    %rbx,%rdi
   0x817be984 <+20>:callq  0x819fbb00 <_raw_spin_lock_bh>
   0x817be989 <+25>:mov    0x8c(%rbp),%eax
   0x817be98f <+31>:test   %eax,%eax
   0x817be991 <+33>:jne    0x817be9ba 
   0x817be993 <+35>:movl   $0x1,0x8c(%rbp)
   0x817be99d <+45>:mov    %rbx,%rdi
   0x817be9a0 <+48>:callq  0x810a0c20 <__raw_callee_save___pv_queued_spin_unlock>
   0x817be9a5 <+53>:xchg   %ax,%ax
   0x817be9a7 <+55>:pop    %rbx
   0x817be9a8 <+56>:pop    %rbp
   0x817be9a9 <+57>:mov    $0x200,%esi
   0x817be9ae <+62>:mov    $0x817be993,%rdi
   0x817be9b5 <+69>:jmpq   0x81063aa0 <__local_bh_enable_ip>
   0x817be9ba <+74>:mov    %rbp,%rdi
   0x817be9bd <+77>:callq  0x817be8c0 <__lock_sock>
   0x817be9c2 <+82>:jmp    0x817be993 
End of assembler dump.

NATIVE:

(gdb) disassemble lock_sock_nested
Dump of assembler code for function lock_sock_nested:
   0x817be970 <+0>: push   %rbp
   0x817be971 <+1>: mov    %rdi,%rbp
   0x817be974 <+4>: push   %rbx
   0x817be975 <+5>: lea    0x88(%rbp),%rbx
   0x817be97c <+12>:callq  0x819f7160 <_cond_resched>
   0x817be981 <+17>:mov    %rbx,%rdi
   0x817be984 <+20>:callq  0x819fbb00 <_raw_spin_lock_bh>
   0x817be989 <+25>:mov    0x8c(%rbp),%eax
   0x817be98f <+31>:test   %eax,%eax
   0x817be991 <+33>:jne    0x817be9ba 
   0x817be993 <+35>:movl   $0x1,0x8c(%rbp)
   0x817be99d <+45>:mov    %rbx,%rdi
   0x817be9a0 <+48>:movb   $0x0,(%rdi)
   0x817be9a3 <+51>:nopl   0x0(%rax)
   0x817be9a7 <+55>:pop    %rbx
   0x817be9a8 <+56>:pop    %rbp
   0x817be9a9 <+57>:mov    $0x200,%esi
   0x817be9ae <+62>:mov    $0x817be993,%rdi
   0x817be9b5 <+69>:jmpq   0x81063ae0 <__local_bh_enable_ip>
   

[PATCH 4.18 10/79] x86/speculation/l1tf: Add sysfs reporting for l1tf

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

L1TF core kernel workarounds are cheap and normally always enabled. However,
they should still be reported in sysfs if the system is vulnerable or
mitigated. Add the necessary CPU feature/bug bits.

- Extend the existing checks for Meltdowns to determine if the system is
  vulnerable. All CPUs which are not vulnerable to Meltdown are also not
  vulnerable to L1TF

- Check for 32bit non PAE and emit a warning as there is no practical way
  for mitigation due to the limited physical address bits

- If the system has more than MAX_PA/2 physical memory the invert page
  workarounds don't protect the system against the L1TF attack anymore,
  because an inverted physical address will also point to valid
  memory. Print a warning in this case and report that the system is
  vulnerable.

Add a function which returns the PFN limit for the L1TF mitigation, which
will be used in follow up patches for sanity and range checks.
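
For illustration only (not part of the original commit message): a userspace
arithmetic check of the limit, mirroring l1tf_pfn_limit() and the MAX_PA/2
test from the patch below for an assumed CPU with 46 physical address bits
and 4K pages.

#include <stdio.h>

#define BIT(n)		(1ULL << (n))
#define PAGE_SHIFT	12

/* Mirrors l1tf_pfn_limit() for a given number of physical address bits. */
static unsigned long long pfn_limit(int x86_phys_bits)
{
	return BIT(x86_phys_bits - 1 - PAGE_SHIFT) - 1;
}

int main(void)
{
	int phys_bits = 46;	/* assumption: a typical server CPU */
	unsigned long long half_pa = pfn_limit(phys_bits) << PAGE_SHIFT;

	/* RAM mapped at or above half_pa defeats the PTE inversion trick. */
	printf("pfn limit %#llx, MAX_PA/2 = %#llx (just under 32 TiB)\n",
	       pfn_limit(phys_bits), half_pa);
	return 0;
}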

[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/cpufeatures.h |2 +
 arch/x86/include/asm/processor.h   |5 
 arch/x86/kernel/cpu/bugs.c |   40 +
 arch/x86/kernel/cpu/common.c   |   20 ++
 drivers/base/cpu.c |8 +++
 include/linux/cpu.h|2 +
 6 files changed, 77 insertions(+)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -219,6 +219,7 @@
 #define X86_FEATURE_IBPB   ( 7*32+26) /* Indirect Branch Prediction Barrier */
 #define X86_FEATURE_STIBP  ( 7*32+27) /* Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_ZEN( 7*32+28) /* "" CPU is AMD family 0x17 (Zen) */
+#define X86_FEATURE_L1TF_PTEINV( 7*32+29) /* "" L1TF workaround PTE inversion */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW ( 8*32+ 0) /* Intel TPR Shadow */
@@ -373,5 +374,6 @@
 #define X86_BUG_SPECTRE_V1 X86_BUG(15) /* CPU is affected by Spectre variant 1 attack with conditional branches */
 #define X86_BUG_SPECTRE_V2 X86_BUG(16) /* CPU is affected by Spectre variant 2 attack with indirect branches */
 #define X86_BUG_SPEC_STORE_BYPASS  X86_BUG(17) /* CPU is affected by speculative store bypass attack */
+#define X86_BUG_L1TF   X86_BUG(18) /* CPU is affected by L1 Terminal Fault */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -181,6 +181,11 @@ extern const struct seq_operations cpuin
 
 extern void cpu_detect(struct cpuinfo_x86 *c);
 
+static inline unsigned long l1tf_pfn_limit(void)
+{
+   return BIT(boot_cpu_data.x86_phys_bits - 1 - PAGE_SHIFT) - 1;
+}
+
 extern void early_cpu_init(void);
 extern void identify_boot_cpu(void);
 extern void identify_secondary_cpu(struct cpuinfo_x86 *);
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -28,9 +28,11 @@
 #include 
 #include 
 #include 
+#include <asm/e820/api.h>
 
 static void __init spectre_v2_select_mitigation(void);
 static void __init ssb_select_mitigation(void);
+static void __init l1tf_select_mitigation(void);
 
 /*
  * Our boot-time value of the SPEC_CTRL MSR. We read it once so that any
@@ -82,6 +84,8 @@ void __init check_bugs(void)
 */
ssb_select_mitigation();
 
+   l1tf_select_mitigation();
+
 #ifdef CONFIG_X86_32
/*
 * Check whether we are able to run this kernel safely on SMP.
@@ -207,6 +211,32 @@ static void x86_amd_ssb_disable(void)
wrmsrl(MSR_AMD64_LS_CFG, msrval);
 }
 
+static void __init l1tf_select_mitigation(void)
+{
+   u64 half_pa;
+
+   if (!boot_cpu_has_bug(X86_BUG_L1TF))
+   return;
+
+#if CONFIG_PGTABLE_LEVELS == 2
+   pr_warn("Kernel not compiled for PAE. No mitigation for L1TF\n");
+   return;
+#endif
+
+   /*
+* This is extremely unlikely to happen because almost all
+* systems have far more MAX_PA/2 than RAM can be fit into
+* DIMM slots.
+*/
+   half_pa = (u64)l1tf_pfn_limit() << PAGE_SHIFT;
+   if (e820__mapped_any(half_pa, ULLONG_MAX - half_pa, E820_TYPE_RAM)) {
+   pr_warn("System has more than MAX_PA/2 memory. L1TF mitigation 
not effective.\n");
+   return;
+   }
+
+   setup_force_cpu_cap(X86_FEATURE_L1TF_PTEINV);
+}
+
 #ifdef RETPOLINE
 static bool spectre_v2_bad_module;
 
@@ -660,6 +690,11 @@ static ssize_t cpu_show_common(struct de
case X86_BUG_SPEC_STORE_BYPASS:
return sprintf(buf, "%s\n", ssb_strings[ssb_mode]);
 
+   case X86_BUG_L1TF:
+   if 

[PATCH 4.18 11/79] x86/speculation/l1tf: Disallow non privileged high MMIO PROT_NONE mappings

2018-08-14 Thread Greg Kroah-Hartman
4.18-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure that an unmapped entry does not point to valid cached memory.

Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such a high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.

To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.

Valid page mappings are allowed because the system is then unsafe anyways.

It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.

For mmaps this is straightforward and can be handled in vm_insert_pfn and
in remap_pfn_range().

For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is an uncommon case, use a separate early page
table walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
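
For illustration only (not part of the original commit message): a
pure-function userspace model of the pfn_modify_allowed() policy added
below, with the kernel predicates (pfn_valid(), capable(),
__pte_needs_invert()) replaced by explicit booleans so the decision table
can be exercised directly.

#include <stdbool.h>
#include <stdio.h>

static bool pfn_modify_allowed(bool cpu_has_l1tf, bool pte_needs_invert,
			       bool pfn_is_real_ram, bool pfn_above_limit,
			       bool is_root)
{
	if (!cpu_has_l1tf)
		return true;		/* CPU not affected: no restriction */
	if (!pte_needs_invert)
		return true;		/* valid mapping: unsafe anyway */
	if (pfn_is_real_ram)
		return true;		/* points at real memory: allow */
	if (pfn_above_limit && !is_root)
		return false;		/* high MMIO PROT_NONE, unprivileged */
	return true;
}

int main(void)
{
	/* Unprivileged PROT_NONE on high MMIO is the one rejected case. */
	printf("%d\n", pfn_modify_allowed(true, true, false, true, false)); /* 0 */
	printf("%d\n", pfn_modify_allowed(true, true, false, true, true));  /* 1 */
	return 0;
}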

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable.h |8 ++
 arch/x86/mm/mmap.c |   21 +
 include/asm-generic/pgtable.h  |   12 ++
 mm/memory.c|   37 ++
 mm/mprotect.c  |   49 +
 5 files changed, 117 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1338,6 +1338,14 @@ static inline bool pud_access_permitted(
return __pte_access_permitted(pud_val(pud), write);
 }
 
+#define __HAVE_ARCH_PFN_MODIFY_ALLOWED 1
+extern bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot);
+
+static inline bool arch_has_pfn_modify_check(void)
+{
+   return boot_cpu_has_bug(X86_BUG_L1TF);
+}
+
 #include <asm-generic/pgtable.h>
 #endif /* __ASSEMBLY__ */
 
--- a/arch/x86/mm/mmap.c
+++ b/arch/x86/mm/mmap.c
@@ -240,3 +240,24 @@ int valid_mmap_phys_addr_range(unsigned
 
return phys_addr_valid(addr + count - 1);
 }
+
+/*
+ * Only allow root to set high MMIO mappings to PROT_NONE.
+ * This prevents an unpriv. user to set them to PROT_NONE and invert
+ * them, then pointing to valid memory for L1TF speculation.
+ *
+ * Note: for locked down kernels may want to disable the root override.
+ */
+bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
+{
+   if (!boot_cpu_has_bug(X86_BUG_L1TF))
+   return true;
+   if (!__pte_needs_invert(pgprot_val(prot)))
+   return true;
+   /* If it's real memory always allow */
+   if (pfn_valid(pfn))
+   return true;
+   if (pfn > l1tf_pfn_limit() && !capable(CAP_SYS_ADMIN))
+   return false;
+   return true;
+}
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1097,4 +1097,16 @@ static inline void init_espfix_bsp(void)
 #endif
 #endif
 
+#ifndef __HAVE_ARCH_PFN_MODIFY_ALLOWED
+static inline bool pfn_modify_allowed(unsigned long pfn, pgprot_t prot)
+{
+   return true;
+}
+
+static inline bool arch_has_pfn_modify_check(void)
+{
+   return false;
+}
+#endif
+
 #endif /* _ASM_GENERIC_PGTABLE_H */
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1884,6 +1884,9 @@ int vm_insert_pfn_prot(struct vm_area_st
if (addr < vma->vm_start || addr >= vma->vm_end)
return -EFAULT;
 
+   if (!pfn_modify_allowed(pfn, pgprot))
+   return -EACCES;
+
	track_pfn_insert(vma, &pgprot, __pfn_to_pfn_t(pfn, PFN_DEV));
 
ret = insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot,
@@ -1919,6 +1922,9 @@ static int __vm_insert_mixed(struct vm_a
 
	track_pfn_insert(vma, &pgprot, pfn);
 
+   if (!pfn_modify_allowed(pfn_t_to_pfn(pfn), pgprot))
+   return -EACCES;
+
/*
 * If we don't have pte special, then we have to use the pfn_valid()
 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
@@ -1980,6 +1986,7 @@ static int remap_pte_range(struct mm_str
 {
pte_t *pte;
spinlock_t *ptl;
+   int err = 0;
 
	pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
if (!pte)
@@ -1987,12 +1994,16 @@ static int remap_pte_range(struct mm_str
arch_enter_lazy_mmu_mode();
do {
BUG_ON(!pte_none(*pte));
+   if (!pfn_modify_allowed(pfn, prot)) {
+  

[PATCH 4.14 025/104] x86/irqflags: Provide a declaration for native_save_fl

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Nick Desaulniers 

commit 208cbb32558907f68b3b2a081ca2337ac3744794 upstream.

It was reported that the commit d0a8d9378d16 is causing users of gcc < 4.9
to observe -Werror=missing-prototypes errors.

Indeed, it seems that:
extern inline unsigned long native_save_fl(void) { return 0; }

compiled with -Werror=missing-prototypes produces this warning in gcc <
4.9, but not gcc >= 4.9.
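
For illustration only (not part of the original commit message): a minimal,
compile-only reproduction of the warning being silenced. With gnu89 inline
semantics, as used by the kernel, older gcc wants a prior declaration for an
extern inline definition when -Wmissing-prototypes is active. The function
below merely mirrors the pushf/pop pattern of native_save_fl() and is
x86-only.

/* Compile-only demo:  gcc -std=gnu89 -O2 -Wmissing-prototypes -c fl.c */

/* Declaration required for gcc < 4.9 to avoid -Wmissing-prototypes */
extern inline unsigned long my_save_fl(void);
extern inline unsigned long my_save_fl(void)
{
	unsigned long flags;

	asm volatile("pushf ; pop %0" : "=rm" (flags) : /* no input */ : "memory");
	return flags;
}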

Fixes: d0a8d9378d16 ("x86/paravirt: Make native_save_fl() extern inline").
Reported-by: David Laight 
Reported-by: Jean Delvare 
Signed-off-by: Nick Desaulniers 
Signed-off-by: Thomas Gleixner 
Cc: h...@zytor.com
Cc: jgr...@suse.com
Cc: kstew...@linuxfoundation.org
Cc: gre...@linuxfoundation.org
Cc: boris.ostrov...@oracle.com
Cc: astrac...@google.com
Cc: m...@chromium.org
Cc: a...@arndb.de
Cc: tstel...@redhat.com
Cc: sedat.di...@gmail.com
Cc: david.lai...@aculab.com
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulni...@google.com
Signed-off-by: Greg Kroah-Hartman 

---
 arch/x86/include/asm/irqflags.h |2 ++
 1 file changed, 2 insertions(+)

--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -13,6 +13,8 @@
  * Interrupt control:
  */
 
+/* Declaration required for gcc < 4.9 to prevent -Werror=missing-prototypes */
+extern inline unsigned long native_save_fl(void);
 extern inline unsigned long native_save_fl(void)
 {
unsigned long flags;




[PATCH 4.14 028/104] x86/speculation/l1tf: Protect swap entries against L1TF

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Linus Torvalds 

commit 2f22b4cd45b67b3496f4aa4c7180a1271c6452f6 upstream

With L1 terminal fault the CPU speculates into unmapped PTEs, and the resulting
side effects allow reading the memory the PTE is pointing to, if its
values are still in the L1 cache.

For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
them.

To protect against L1TF it must be ensured that the swap entry is not
pointing to valid memory, which requires setting higher bits (between bit
36 and bit 45) that are inside the CPUs physical address space, but outside
any real memory.

To do this invert the offset to make sure the higher bits are always set,
as long as the swap file is not too big.
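
For illustration only (not part of the original commit message): a userspace
model of the inverted encoding in the hunk below. The shift constants (5
type bits at the top, offset stored from bit 9 upwards) follow the comment
block in the diff but are assumptions for this demo, and a 64-bit unsigned
long is assumed; the point is that storing ~offset keeps the high physical
bits set for any realistically sized swap offset.

#include <stdio.h>

#define SWP_TYPE_BITS		5
#define SWP_OFFSET_FIRST_BIT	9	/* assumption, per the comment block */
#define SWP_OFFSET_SHIFT	(SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)

/* Same shape as the __swp_entry()/__swp_offset() macros in the hunk. */
static unsigned long swp_entry(unsigned long type, unsigned long offset)
{
	return (~offset << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) |
	       (type << (64 - SWP_TYPE_BITS));
}

static unsigned long swp_offset(unsigned long entry)
{
	return ~entry << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT;
}

int main(void)
{
	unsigned long e = swp_entry(2, 0x1234);

	/* The offset round-trips and the stored field is mostly 1-bits. */
	printf("entry %#lx, offset back %#lx\n", e, swp_offset(e));
	return 0;
}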

Note there is no workaround for 32bit !PAE, or on systems which have more
than MAX_PA/2 worth of memory. The latter case is very unlikely to happen on
real systems.

[AK: updated description and minor tweaks by. Split out from the original
 patch ]

Signed-off-by: Linus Torvalds 
Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Tested-by: Andi Kleen 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/pgtable_64.h |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -276,7 +276,7 @@ static inline int pgd_large(pgd_t pgd) {
  *
  * | ...| 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * | ...|SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) |  OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|X|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -289,6 +289,9 @@ static inline int pgd_large(pgd_t pgd) {
  *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
+ *
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
 #define SWP_TYPE_BITS  5
 
@@ -303,13 +306,15 @@ static inline int pgd_large(pgd_t pgd) {
 #define __swp_type(x) ((x).val >> (64 - SWP_TYPE_BITS))
 
 /* Shift up (to get rid of type), then down to get value */
-#define __swp_offset(x) ((x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
+#define __swp_offset(x) (~(x).val << SWP_TYPE_BITS >> SWP_OFFSET_SHIFT)
 
 /*
  * Shift the offset up "too far" by TYPE bits, then down again
+ * The offset is inverted by a binary not operation to make the high
+ * physical bits set.
  */
 #define __swp_entry(type, offset) ((swp_entry_t) { \
-   ((unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
+   (~(unsigned long)(offset) << SWP_OFFSET_SHIFT >> SWP_TYPE_BITS) \
| ((unsigned long)(type) << (64-SWP_TYPE_BITS)) })
 
 #define __pte_to_swp_entry(pte)((swp_entry_t) { pte_val((pte)) })




[PATCH 4.14 003/104] scsi: hpsa: fix selection of reply queue

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Ming Lei 

commit 8b834bff1b73dce46f4e9f5e84af6f73fed8b0ef upstream.

Since commit 84676c1f21e8 ("genirq/affinity: assign vectors to all
possible CPUs") we could end up with an MSI-X vector that did not have
any online CPUs mapped. This would lead to I/O hangs since there was no
CPU to receive the completion.

Retrieve IRQ affinity information using pci_irq_get_affinity() and use
this mapping to choose a reply queue.
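
For illustration only (not part of the original commit message): a userspace
model of the CPU to reply-queue mapping that hpsa_setup_reply_map() below
builds from the IRQ affinity masks. Affinity masks are plain bitmasks here
instead of struct cpumask, and the fallback path for a missing mask is
omitted.

#include <stdio.h>

#define NR_CPUS		8
#define NR_QUEUES	2

int main(void)
{
	/* assumed affinity: queue 0 -> CPUs 0-3, queue 1 -> CPUs 4-7 */
	unsigned int affinity[NR_QUEUES] = { 0x0f, 0xf0 };
	unsigned int reply_map[NR_CPUS];
	unsigned int queue, cpu;

	/* Each CPU takes the queue whose affinity mask covers it. */
	for (queue = 0; queue < NR_QUEUES; queue++)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (affinity[queue] & (1u << cpu))
				reply_map[cpu] = queue;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %u -> reply queue %u\n", cpu, reply_map[cpu]);
	return 0;
}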

[mkp: tweaked commit desc]

Cc: Hannes Reinecke 
Cc: "Martin K. Petersen" ,
Cc: James Bottomley ,
Cc: Christoph Hellwig ,
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Laurence Oberman 
Cc: Meelis Roos 
Cc: Artem Bityutskiy 
Cc: Mike Snitzer 
Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
Signed-off-by: Ming Lei 
Tested-by: Laurence Oberman 
Tested-by: Don Brace 
Tested-by: Artem Bityutskiy 
Acked-by: Don Brace 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Martin K. Petersen 
Signed-off-by: Greg Kroah-Hartman 

---
 drivers/scsi/hpsa.c |   73 ++--
 drivers/scsi/hpsa.h |1 
 2 files changed, 55 insertions(+), 19 deletions(-)

--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -1040,11 +1040,7 @@ static void set_performant_mode(struct c
c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
if (unlikely(!h->msix_vectors))
return;
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   c->Header.ReplyQueue =
-   raw_smp_processor_id() % h->nreply_queues;
-   else
-   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
+   c->Header.ReplyQueue = reply_queue;
}
 }
 
@@ -1058,10 +1054,7 @@ static void set_ioaccel1_performant_mode
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->ReplyQueue = reply_queue % h->nreply_queues;
+   cp->ReplyQueue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit (bit 0)
@@ -1082,10 +1075,7 @@ static void set_ioaccel2_tmf_performant_
/* Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/* Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
 *  - pull count (bits 0-3)
@@ -1104,10 +1094,7 @@ static void set_ioaccel2_performant_mode
 * Tell the controller to post the reply to the queue for this
 * processor.  This seems to give the best I/O throughput.
 */
-   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
-   cp->reply_queue = smp_processor_id() % h->nreply_queues;
-   else
-   cp->reply_queue = reply_queue % h->nreply_queues;
+   cp->reply_queue = reply_queue;
/*
 * Set the bits in the address sent down to include:
 *  - performant mode bit not used in ioaccel mode 2
@@ -1152,6 +1139,8 @@ static void __enqueue_cmd_and_start_io(s
 {
dial_down_lockup_detection_during_fw_flash(h, c);
	atomic_inc(&h->commands_outstanding);
+
+   reply_queue = h->reply_map[raw_smp_processor_id()];
switch (c->cmd_type) {
case CMD_IOACCEL1:
set_ioaccel1_performant_mode(h, c, reply_queue);
@@ -7244,6 +7233,26 @@ static void hpsa_disable_interrupt_mode(
h->msix_vectors = 0;
 }
 
+static void hpsa_setup_reply_map(struct ctlr_info *h)
+{
+   const struct cpumask *mask;
+   unsigned int queue, cpu;
+
+   for (queue = 0; queue < h->msix_vectors; queue++) {
+   mask = pci_irq_get_affinity(h->pdev, queue);
+   if (!mask)
+   goto fallback;
+
+   for_each_cpu(cpu, mask)
+   h->reply_map[cpu] = queue;
+   }
+   return;
+
+fallback:
+   for_each_possible_cpu(cpu)
+   h->reply_map[cpu] = 0;
+}
+
 /* If MSI/MSI-X is supported by the kernel we will try to enable it on
  * controllers that are capable. If not, we use legacy INTx mode.
  */
@@ -7639,6 +7648,10 @@ static int hpsa_pci_init(struct ctlr_inf
err = hpsa_interrupt_mode(h);
if (err)
goto clean1;
+
+   /* setup mapping between CPU and reply queue */
+   

[PATCH 4.14 026/104] x86/speculation/l1tf: Increase 32bit PAE __PHYSICAL_PAGE_SHIFT

2018-08-14 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

From: Andi Kleen 

commit 50896e180c6aa3a9c61a26ced99e15d602666a4c upstream

L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
speculates on PTE entries which do not have the PRESENT bit set, if the
content of the resulting physical address is available in the L1D cache.

The OS side mitigation makes sure that a !PRESENT PTE entry points to a
physical address outside the actually existing and cachable memory
space. This is achieved by inverting the upper bits of the PTE. Due to the
address space limitations this only works for 64bit and 32bit PAE kernels,
but not for 32bit non PAE.

This mitigation applies to both host and guest kernels, but in case of a
64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
the PAE address space (44bit) is not enough if the host has more than 43
bits of populated memory address space, because the speculation treats the
PTE content as a physical host address bypassing EPT.

The host (hypervisor) protects itself against the guest by flushing L1D as
needed, but pages inside the guest are not protected against attacks from
other processes inside the same guest.

For the guest the inverted PTE mask has to match the host to provide the
full protection for all pages the host could possibly map into the
guest. The hosts populated address space is not known to the guest, so the
mask must cover the possible maximal host address space, i.e. 52 bit.

On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
the mask to be below what the host could possible use for physical pages.

The L1TF PROT_NONE protection code uses the PTE masks to determine which
bits to invert to make sure the higher bits are set for unmapped entries to
prevent L1TF speculation attacks against EPT inside guests.

In order to invert all bits that could be used by the host, increase
__PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
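
For illustration only (not part of the original commit message): a quick
arithmetic check of what the wider mask buys. The bits that get inverted are
the ones covered by the physical mask, so a 44-bit mask leaves host physical
bits 44-51 untouched while a 52-bit mask covers them.

#include <stdio.h>

static unsigned long long physical_mask(int shift)
{
	return (1ULL << shift) - 1;
}

int main(void)
{
	unsigned long long mask44 = physical_mask(44);
	unsigned long long mask52 = physical_mask(52);

	/* Host bits 44..51 are only covered by the wider mask. */
	printf("mask44 %#llx\nmask52 %#llx\nextra bits %#llx\n",
	       mask44, mask52, mask52 & ~mask44);
	return 0;
}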

The real limit for a 32bit PAE kernel is still 44 bits because all Linux
PTEs are created from unsigned long PFNs, so they cannot be higher than 44
bits on a 32bit kernel. So these extra PFN bits should be never set. The
only users of this macro are using it to look at PTEs, so it's safe.

[ tglx: Massaged changelog ]

Signed-off-by: Andi Kleen 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Josh Poimboeuf 
Acked-by: Michal Hocko 
Acked-by: Dave Hansen 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/include/asm/page_32_types.h |9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -29,8 +29,13 @@
 #define N_EXCEPTION_STACKS 1
 
 #ifdef CONFIG_X86_PAE
-/* 44=32+12, the limit we can fit into an unsigned long pfn */
-#define __PHYSICAL_MASK_SHIFT  44
+/*
+ * This is beyond the 44 bit limit imposed by the 32bit long pfns,
+ * but we need the full mask to make sure inverted PROT_NONE
+ * entries have all the host bits set in a guest.
+ * The real limit is still 44 bits.
+ */
+#define __PHYSICAL_MASK_SHIFT  52
 #define __VIRTUAL_MASK_SHIFT   32
 
 #else  /* !CONFIG_X86_PAE */



