Re: [PATCH] crypto: vmx/xts - use fallback for ciphertext stealing

2019-08-21 Thread Herbert Xu
On Fri, Aug 16, 2019 at 05:06:24PM +0300, Ard Biesheuvel wrote:
> For correctness and compliance with the XTS-AES specification, we are
> adding support for ciphertext stealing to XTS implementations, even
> though no use cases are known that will be enabled by this.
> 
> Since the Power8 implementation already has a fallback skcipher standby
> for other purposes, let's use it for this purpose as well. If ciphertext
> stealing use cases ever become a bottleneck, we can always revisit this.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  drivers/crypto/vmx/aes_xts.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[linux-next][PPC][bisected c7d8b7][gcc 6.4.1] build error at drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1471

2019-08-21 Thread Abdul Haleem
Greetings,

Today's linux-next kernel 5.3.0-rc5-next-20190820 failed to build on my
powerpc machine

Build errors:
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c: In function amdgpu_exit:
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1471:2: error: implicit
declaration of function mmu_notifier_synchronize
[-Werror=implicit-function-declaration]
  mmu_notifier_synchronize();
  ^~~~ 
cc1: some warnings being treated as errors
make[4]: *** [drivers/gpu/drm/amd/amdgpu/amdgpu_drv.o] Error 1
make[3]: *** [drivers/gpu/drm/amd/amdgpu] Error 2

It was introduced with commit c7d8b7 (hmm: use mmu_notifier_get/put for
'struct hmm')

The error disappears when the commit is reverted.

-- 
Regards,

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH v3 1/2] powerpc/powernv: Enhance opal message read interface

2019-08-21 Thread Oliver O'Halloran
On Wed, 2019-08-21 at 13:43 +0530, Vasant Hegde wrote:
> Use "opal-msg-size" device tree property to allocate memory for "opal_msg".
> 
> Cc: Mahesh Salgaonkar 
> Cc: Jeremy Kerr 
> Signed-off-by: Vasant Hegde 
> ---
> Changes in v3:
>   - Call BUG_ON, if we fail to allocate memory during init.
> 
> -Vasant
> 
>  arch/powerpc/platforms/powernv/opal.c | 29 ++-
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index aba443be7daa..4f1f68f568bf 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -58,6 +58,8 @@ static DEFINE_SPINLOCK(opal_write_lock);
>  static struct atomic_notifier_head opal_msg_notifier_head[OPAL_MSG_TYPE_MAX];
>  static uint32_t opal_heartbeat;
>  static struct task_struct *kopald_tsk;
> +static struct opal_msg *opal_msg;
> +static uint64_t opal_msg_size;
>  
>  void opal_configure_cores(void)
>  {
> @@ -271,14 +273,9 @@ static void opal_message_do_notify(uint32_t msg_type, 
> void *msg)
>  static void opal_handle_message(void)
>  {
>   s64 ret;
> - /*
> -  * TODO: pre-allocate a message buffer depending on opal-msg-size
> -  * value in /proc/device-tree.
> -  */
> - static struct opal_msg msg;
>   u32 type;
>  
> - ret = opal_get_msg(__pa(&msg), sizeof(msg));
> + ret = opal_get_msg(__pa(opal_msg), opal_msg_size);
>   /* No opal message pending. */
>   if (ret == OPAL_RESOURCE)
>   return;
> @@ -290,14 +287,14 @@ static void opal_handle_message(void)
>   return;
>   }
>  
> - type = be32_to_cpu(msg.msg_type);
> + type = be32_to_cpu(opal_msg->msg_type);
>  
>   /* Sanity check */
>   if (type >= OPAL_MSG_TYPE_MAX) {
>   pr_warn_once("%s: Unknown message type: %u\n", __func__, type);
>   return;
>   }
> - opal_message_do_notify(type, (void *)&msg);
> + opal_message_do_notify(type, (void *)opal_msg);
>  }
>  
>  static irqreturn_t opal_message_notify(int irq, void *data)
> @@ -306,9 +303,21 @@ static irqreturn_t opal_message_notify(int irq, void 
> *data)
>   return IRQ_HANDLED;
>  }
>  
> -static int __init opal_message_init(void)
> +static int __init opal_message_init(struct device_node *opal_node)
>  {
>   int ret, i, irq;

> + const __be32 *val;
> +
> + val = of_get_property(opal_node, "opal-msg-size", NULL);
> + if (val)
> + opal_msg_size = be32_to_cpup(val);

Use of_property_read_u32()
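
A rough sketch of what that could look like (untested; property and
variable names follow the patch above):

	u32 msg_size = 0;

	/* of_property_read_u32() does the be32 conversion for us */
	if (of_property_read_u32(opal_node, "opal-msg-size", &msg_size) ||
	    !msg_size)
		msg_size = sizeof(struct opal_msg);
	opal_msg_size = msg_size;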

> +
> + /* If opal-msg-size property is not available then use default size */
> + if (!opal_msg_size)
> + opal_msg_size = sizeof(struct opal_msg);
> +
> + opal_msg = kmalloc(opal_msg_size, GFP_KERNEL);

> + BUG_ON(opal_msg == NULL);

Seems questionable. Why not fall back to using a statically allocated
struct opal_msg? Or retry the allocation with the size limited to
sizeof(struct opal_msg)?
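
Something along these lines would keep boot alive without trusting an
arbitrarily large opal-msg-size (sketch only):

	opal_msg = kmalloc(opal_msg_size, GFP_KERNEL);
	if (!opal_msg) {
		/* retry with the architectural minimum before giving up */
		opal_msg_size = sizeof(struct opal_msg);
		opal_msg = kmalloc(opal_msg_size, GFP_KERNEL);
		BUG_ON(!opal_msg);
	}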

>   for (i = 0; i < OPAL_MSG_TYPE_MAX; i++)
>   ATOMIC_INIT_NOTIFIER_HEAD(&opal_msg_notifier_head[i]);
> @@ -910,7 +919,7 @@ static int __init opal_init(void)
>   }
>  
>   /* Initialise OPAL messaging system */
> - opal_message_init();
> + opal_message_init(opal_node);
>  
>   /* Initialise OPAL asynchronous completion interface */
>   opal_async_comp_init();



Re: [PATCH 1/3] powerpc/sriov: Remove VF eeh_dev state when disabling SR-IOV

2019-08-21 Thread Sam Bobroff
On Wed, Aug 21, 2019 at 04:26:53PM +1000, Oliver O'Halloran wrote:
> When disabling virtual functions on an SR-IOV adapter we currently do not
> correctly remove the EEH state for the now-dead virtual functions. When
> removing the pci_dn that was created for the VF when SR-IOV was enabled
> we free the corresponding eeh_dev without removing it from the child device
> list of the eeh_pe that contained it. This can result in crashes due to the
> use-after-free.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
> No Fixes: here since I'm not sure if the commit that added this actually
> introduced the bug. EEH is amazing.

Yep.

> I suspect backporting this would cause more problems than it solves since
> reliably replicating the crash required enabling memory poisoning and
> hacking a device driver to remove the PCI error handling callbacks so
> the EEH fallback path (which removes and re-probes PCI devices)
> would be used.

I gave this a quick test with some added instrumentation, and I can see
that the new code is used during VF removal and it doesn't cause any new
problems. I agree that even if it's difficult to trigger, it was
definitely a bug.

Reviewed-by: Sam Bobroff 
Tested-by: Sam Bobroff 
> ---
>  arch/powerpc/kernel/pci_dn.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index 6556b57..795c4e3 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -244,9 +244,22 @@ void remove_dev_pci_data(struct pci_dev *pdev)
>   continue;
>  
>  #ifdef CONFIG_EEH
> - /* Release EEH device for the VF */
> + /*
> +  * Release EEH state for this VF. The PCI core
> +  * has already torn down the pci_dev for this VF, but
> +  * we're responsible for removing the eeh_dev since it
> +  * has the same lifetime as the pci_dn that spawned it.
> +  */
>   edev = pdn_to_eeh_dev(pdn);
>   if (edev) {
> + /*
> +  * We allocate pci_dn's for the totalvfs count,
> +  * but only the vfs that were activated
> +  * have a configured PE.
> +  */
> + if (edev->pe)
> + eeh_rmv_from_parent_pe(edev);
> +
>   pdn->edev = NULL;
>   kfree(edev);
>   }
> -- 
> 2.9.5
> 


signature.asc
Description: PGP signature


RE: [RFC PATCH] powerpc: Convert ____flush_dcache_icache_phys() to C

2019-08-21 Thread Alastair D'Silva
On Thu, 2019-08-22 at 07:06 +0200, Christophe Leroy wrote:
> 
> Le 22/08/2019 à 02:27, Alastair D'Silva a écrit :
> > On Wed, 2019-08-21 at 22:27 +0200, Christophe Leroy wrote:
> > > Le 20/08/2019 à 06:36, Alastair D'Silva a écrit :
> > > > On Fri, 2019-08-16 at 15:52 +, Christophe Leroy wrote:
> > > 
> > > [...]
> > > 
> > > > Thanks Christophe,
> > > > 
> > > > I'm trying a somewhat different approach that requires less
> > > > knowledge
> > > > of assembler. Handling of CPU_FTR_COHERENT_ICACHE is outside
> > > > this
> > > > function. The code below is not a patch as my tree is a bit
> > > > messy,
> > > > sorry:
> > > 
> > > Can we be 100% sure that GCC won't add any code accessing some
> > > global
> > > data or stack while the Data MMU is OFF ?
> > > 
> > > Christophe
> > > 
> > 
> > +mpe
> > 
> > I'm not sure how we would go about making such a guarantee, but I've
> > tied every variable used to a register and addr is passed in a
> > register, so there is no stack usage, and every call in there only
> > operates on its operands.
> > 
> > The calls to the inline cache helpers (for the PPC32 case) are all
> > constants, so I can't see a reasonable scenario where a function call
> > would be emitted and reordered to after the DR bit is turned off, but
> > I guess if we want to be paranoid, we could always add an mb() call
> > before the DR bit is manipulated to prevent the compiler from
> > reordering across the section where the data MMU is disabled.
> > 
> > 
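
For readers following along, the register-tying pattern being described
looks roughly like the sketch below (illustrative only, hypothetical
function name; whether the compiler can still schedule a stack or global
access between the MSR writes is exactly the open question in this
thread). It assumes the usual kernel helpers from asm/reg.h and
asm/synch.h:

	/* sketch, not the actual patch */
	static void flush_phys_sketch(unsigned long physaddr)
	{
		register unsigned long addr asm("r3") = physaddr;
		register unsigned long msr asm("r10") = mfmsr();

		barrier();			/* fence before translation goes off */
		mtmsr(msr & ~MSR_DR);		/* data translation off */
		isync();

		/* ... dcbst/icbi loops operating only on 'addr' ... */

		mtmsr(msr);			/* data translation back on */
		isync();
		barrier();
	}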
> 
> Anyway, I think the benefit of converting that function to C is
> pretty 
> small. flush_dcache_range() and friends were converted to C mainly
> in 
> order to inline them. But this __flush_dcache_icache_phys() is too
> big 
> to be worth inlining, yet small and stable enough to remain in
> assembly 
> for the time being.
> 
I disagree on this point. After converting it to C, using
44x/currituck.defconfig, the compiler definitely will inline it (noting
that there is only one caller of it):

0134 :
 134:   94 21 ff f0     stwu    r1,-16(r1)
 138:   3d 20 00 00     lis     r9,0
 13c:   81 29 00 00     lwz     r9,0(r9)
 140:   7c 08 02 a6     mflr    r0
 144:   38 81 00 0c     addi    r4,r1,12
 148:   90 01 00 14     stw     r0,20(r1)
 14c:   91 21 00 0c     stw     r9,12(r1)
 150:   48 00 00 01     bl      150
 154:   39 00 00 20     li      r8,32
 158:   39 43 10 00     addi    r10,r3,4096
 15c:   7c 69 1b 78     mr      r9,r3
 160:   7d 09 03 a6     mtctr   r8
 164:   7c 00 48 6c     dcbst   0,r9
 168:   39 29 00 80     addi    r9,r9,128
 16c:   42 00 ff f8     bdnz    164
 170:   7c 00 04 ac     hwsync
 174:   7c 69 1b 78     mr      r9,r3
 178:   7c 00 4f ac     icbi    0,r9
 17c:   39 29 00 80     addi    r9,r9,128
 180:   7f 8a 48 40     cmplw   cr7,r10,r9
 184:   40 9e ff f4     bne     cr7,178
 188:   7c 00 04 ac     hwsync
 18c:   4c 00 01 2c     isync
 190:   80 01 00 14     lwz     r0,20(r1)
 194:   38 21 00 10     addi    r1,r1,16
 198:   7c 08 03 a6     mtlr    r0
 19c:   48 00 00 00     b       19c


> So I suggest you keep it aside from your series for now, and just move
> PURGE_PREFETCHED_INS inside it directly as it will be the only
> remaining user of it.
> 
> Christophe

-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819




Re: [PATCH v2 1/4] powerpc/powernv: Add OPAL API interface to access secure variable

2019-08-21 Thread Oliver O'Halloran
On Thu, Aug 22, 2019 at 3:02 PM Oliver O'Halloran  wrote:
>
> On Wed, 2019-08-21 at 11:08 -0400, Nayna Jain wrote:
> > diff --git a/arch/powerpc/platforms/powernv/opal.c 
> > b/arch/powerpc/platforms/powernv/opal.c
> > index aba443be7daa..ffe6f1cf0830 100644
> > --- a/arch/powerpc/platforms/powernv/opal.c
> > +++ b/arch/powerpc/platforms/powernv/opal.c
> > @@ -32,6 +32,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >
> >  #include "powernv.h"
> >
> > @@ -988,6 +990,9 @@ static int __init opal_init(void)
> >   /* Initialise OPAL Power control interface */
> >   opal_power_control_init();
> >
> > + if (is_powerpc_secvar_supported())
> > + secvar_init();
> > +
>
> The usual pattern here is to have the init function check for support
> internally.
>
> Also, is_powerpc_secvar_supported() doesn't appear to be defined
> anywhere. Is that supposed to be is_opal_secvar_supported()? Or is this
> series supposed to be applied on top of another series?

To answer my own question, yes it depends on the series at [1] which
adds IMA support. Turns out actually reading the cover letter helps,
who knew.

That said, I'm still not entirely sure about this. The implementation
of is_powerpc_secvar_supported() in [2] parses the DT and seems to
assume the DT bindings that OPAL produces. Are those common with the
DT bindings produced by OF when running on pseries?

[1] http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=125961
[2] http://patchwork.ozlabs.org/patch/1149257/

>
> >   return 0;
> >  }
> >  machine_subsys_initcall(powernv, opal_init);
>


Re: [PATCH v2 2/4] powerpc: expose secure variables to userspace via sysfs

2019-08-21 Thread Oliver O'Halloran
On Wed, 2019-08-21 at 11:08 -0400, Nayna Jain wrote:
> PowerNV secure variables, which store the keys used for OS kernel
> verification, are managed by the firmware. These secure variables need to
> be accessed by userspace for addition/deletion of the certificates.
> 
> This patch adds the sysfs interface to expose secure variables for PowerNV
> secureboot. The users shall use this interface for manipulating
> the keys stored in the secure variables.
> 
> Signed-off-by: Nayna Jain 
> ---
>  Documentation/ABI/testing/sysfs-secvar |  27 
>  arch/powerpc/Kconfig   |   9 ++
>  arch/powerpc/kernel/Makefile   |   1 +
>  arch/powerpc/kernel/secvar-sysfs.c | 210 +
>  4 files changed, 247 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-secvar
>  create mode 100644 arch/powerpc/kernel/secvar-sysfs.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-secvar 
> b/Documentation/ABI/testing/sysfs-secvar
> new file mode 100644
> index ..68f0e03d873d
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-secvar
> @@ -0,0 +1,27 @@
> +What:           /sys/firmware/secvar
> +Date:           August 2019
> +Contact:        Nayna Jain
> +Description:
> + This directory exposes interfaces for interacting with
> + the secure variables managed by OPAL firmware.
> +
> + This is only for the powerpc/powernv platform.
> +
> + Directory:
> + vars:   This directory lists all the variables that
> + are supported by the OPAL. The variables are
> + represented in the form of directories with
> + their variable names. The variable name is
> + unique and is in ASCII representation. The data
> + and size can be determined by reading their
> + respective attribute files.
> +
> + Each variable directory has the following files:
> + name:   An ASCII representation of the variable name
> + data:   A read-only file containing the value of the
> + variable
> + size:   An integer representation of the size of the
> + content of the variable. In other words, it
> + represents the size of the data
> + update: A write-only file that is used to submit the new
> + value for the variable.
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 42109682b727..b4bdf77837b2 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -925,6 +925,15 @@ config PPC_SECURE_BOOT
> allows user to enable OS Secure Boot on PowerPC systems that
> have firmware secure boot support.
>  
> +config SECVAR_SYSFS
> +tristate "Enable sysfs interface for POWER secure variables"
> +depends on PPC_SECURE_BOOT
> +help
> +  POWER secure variables are managed and controlled by firmware.
> +  These variables are exposed to userspace via sysfs to enable
> +  read/write operations on these variables. Say Y if you have
> +   secure boot enabled and want to expose variables to userspace.
> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 9041563f1c74..4ea7b738c3a3 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -158,6 +158,7 @@ obj-$(CONFIG_EPAPR_PARAVIRT)  += epapr_paravirt.o 
> epapr_hcalls.o
>  obj-$(CONFIG_KVM_GUEST)  += kvm.o kvm_emul.o
>  
>  obj-$(CONFIG_PPC_SECURE_BOOT)+= secboot.o ima_arch.o secvar-ops.o
> +obj-$(CONFIG_SECVAR_SYSFS) += secvar-sysfs.o
>  
>  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
>  GCOV_PROFILE_prom_init.o := n
> diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
> b/arch/powerpc/kernel/secvar-sysfs.c
> new file mode 100644
> index ..e46986bb29a0
> --- /dev/null
> +++ b/arch/powerpc/kernel/secvar-sysfs.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2019 IBM Corporation 
> + *
> + * This code exposes secure variables to user via sysfs
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +//Approximating it for now, it is bound to change.
> +#define VARIABLE_MAX_SIZE  32000

this needs to be communicated from the secvar backend, maybe via a
field in the ops structure?
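
For example (a sketch, not part of this patch), the limit could travel
with the rest of the backend callbacks:

	struct secvar_operations {
		int (*get_variable)(const char *key, unsigned long key_len,
				    u8 *data, unsigned long *data_size);
		int (*get_next_variable)(const char *key, unsigned long *key_len,
					 unsigned long keysize);
		int (*set_variable)(const char *key, unsigned long key_len,
				    u8 *data, unsigned long data_size);
		unsigned long max_size;	/* largest variable the backend accepts */
	};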

> +
> +static struct kobject *powerpc_kobj;
Call it secvar_kobj or something.

> +static struct secvar_operations *secvarops;
Ah, the old I-can't-believe-it's-not-global trick.

> +struct kset *secvar_kset;
shouldn't that be static too?
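
Taken together, those three nits amount to something like (sketch):

	static struct kobject *secvar_kobj;	/* clearer than powerpc_kobj */
	static struct secvar_operations *secvarops;
	static struct kset *secvar_kset;	/* file-local, so static */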

> +
> +static ssize_t name_show(struct kobject *kobj, struct kobj_attribute *attr,
> + 

Re: [RFC PATCH] powerpc: Convert ____flush_dcache_icache_phys() to C

2019-08-21 Thread Christophe Leroy




Le 22/08/2019 à 02:27, Alastair D'Silva a écrit :

On Wed, 2019-08-21 at 22:27 +0200, Christophe Leroy wrote:


Le 20/08/2019 à 06:36, Alastair D'Silva a écrit :

On Fri, 2019-08-16 at 15:52 +, Christophe Leroy wrote:


[...]



Thanks Christophe,

I'm trying a somewhat different approach that requires less
knowledge
of assembler. Handling of CPU_FTR_COHERENT_ICACHE is outside this
function. The code below is not a patch as my tree is a bit messy,
sorry:


Can we be 100% sure that GCC won't add any code accessing some
global
data or stack while the Data MMU is OFF ?

Christophe



+mpe

I'm not sure how we would go about making such a guarantee, but I've
tied every variable used to a register and addr is passed in a
register, so there is no stack usage, and every call in there only
operates on its operands.

The calls to the inline cache helpers (for the PPC32 case) are all
constants, so I can't see a reasonable scenario where a function call
would be emitted and reordered to after the DR bit is turned off, but I
guess if we want to be paranoid, we could always add an mb() call
before the DR bit is manipulated to prevent the compiler from
reordering across the section where the data MMU is disabled.




Anyway, I think the benefit of converting that function to C is pretty 
small. flush_dcache_range() and friends were converted to C mainly in 
order to inline them. But this __flush_dcache_icache_phys() is too big 
to be worth inlining, yet small and stable enough to remain in assembly 
for the time being.


So I suggest you keep it aside from your series for now, and just move
PURGE_PREFETCHED_INS inside it directly as it will be the only remaining
user of it.


Christophe


Re: [PATCH v2 1/4] powerpc/powernv: Add OPAL API interface to access secure variable

2019-08-21 Thread Oliver O'Halloran
On Wed, 2019-08-21 at 11:08 -0400, Nayna Jain wrote:
> The X.509 certificates trusted by the platform and required to secure boot
> the OS kernel are wrapped in secure variables, which are controlled by
> OPAL.
> 
> This patch adds firmware/kernel interface to read and write OPAL secure
> variables based on the unique key.
> 
> This support can be enabled using CONFIG_OPAL_SECVAR.
> 
> Signed-off-by: Claudio Carvalho 
> Signed-off-by: Nayna Jain 
> ---
>  arch/powerpc/include/asm/opal-api.h  |   5 +-
>  arch/powerpc/include/asm/opal.h  |   6 ++
>  arch/powerpc/include/asm/secvar.h|  55 ++
>  arch/powerpc/kernel/Makefile |   2 +-
>  arch/powerpc/kernel/secvar-ops.c |  25 +
>  arch/powerpc/platforms/powernv/Kconfig   |   6 ++
>  arch/powerpc/platforms/powernv/Makefile  |   1 +
>  arch/powerpc/platforms/powernv/opal-call.c   |   3 +
>  arch/powerpc/platforms/powernv/opal-secvar.c | 102 +++
>  arch/powerpc/platforms/powernv/opal.c|   5 +
>  10 files changed, 208 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/secvar.h
>  create mode 100644 arch/powerpc/kernel/secvar-ops.c
>  create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 383242eb0dea..b238b4f26c5b 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -208,7 +208,10 @@
>  #define OPAL_HANDLE_HMI2 166
>  #define  OPAL_NX_COPROC_INIT 167
>  #define OPAL_XIVE_GET_VP_STATE   170
> -#define OPAL_LAST170
> +#define OPAL_SECVAR_GET 173
> +#define OPAL_SECVAR_GET_NEXT174
> +#define OPAL_SECVAR_ENQUEUE_UPDATE  175
> +#define OPAL_LAST   175
>  
>  #define QUIESCE_HOLD 1 /* Spin all calls at entry */
>  #define QUIESCE_REJECT   2 /* Fail all calls with 
> OPAL_BUSY */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 57bd029c715e..247adec2375f 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -388,6 +388,12 @@ void opal_powercap_init(void);
>  void opal_psr_init(void);
>  void opal_sensor_groups_init(void);

Put these with the rest of the OPAL API wrapper prototypes
(i.e. just after opal_nx_coproc_init()) rather than with the
internal functions.
 
> > +extern int opal_secvar_get(uint64_t k_key, uint64_t k_key_len,
> +uint64_t k_data, uint64_t k_data_size);
> +extern int opal_secvar_get_next(uint64_t k_key, uint64_t k_key_len,
> + uint64_t k_key_size);
> +extern int opal_secvar_enqueue_update(uint64_t k_key, uint64_t k_key_len,
> +   uint64_t k_data, uint64_t k_data_size);

Everything in asm/opal.h is intended for consumption by the kernel, so
use a useful kernel type (or annotation) rather than blank uint64_t for
the parameters that are actually pointers. You should also ditch the k_
prefix since it doesn't make much sense having it inside the kernel.

As a general comment, don't use extern on function prototypes. They're
extern by default and, more importantly, it's contrary to the normal
kernel style.
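
i.e. something closer to the following sketch (exact argument types to
taste):

	int opal_secvar_get(const char *key, u64 key_len, u8 *data,
			    u64 *data_size);
	int opal_secvar_get_next(const char *key, u64 *key_len,
				 u64 key_buf_size);
	int opal_secvar_enqueue_update(const char *key, u64 key_len,
				       u8 *data, u64 data_size);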

>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* _ASM_POWERPC_OPAL_H */
> diff --git a/arch/powerpc/include/asm/secvar.h 
> b/arch/powerpc/include/asm/secvar.h
> new file mode 100644
> index ..645654456265
> --- /dev/null
> +++ b/arch/powerpc/include/asm/secvar.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PowerPC secure variable operations.
> + *
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain 
> + *
> + */
> +#ifndef SECVAR_OPS_H
> +#define SECVAR_OPS_H
> +
> +#include
> +#include
missing space

> +
> +struct secvar_operations {
> + int (*get_variable)(const char *key, unsigned long key_len, u8 *data,
> + unsigned long *data_size);
> + int (*get_next_variable)(const char *key, unsigned long *key_len,
> +  unsigned long keysize);
> + int (*set_variable)(const char *key, unsigned long key_len, u8 *data,
> + unsigned long data_size);
> +};


Calling them requires writing code like:

secvar_ops->get_variable(blah);

Why not shorten it to:
secvar_ops->get(blah);


> +#ifdef CONFIG_PPC_SECURE_BOOT
> +
> +extern void set_secvar_ops(struct secvar_operations *ops);
> +extern struct secvar_operations *get_secvar_ops(void);
> +
> +#else
> +
> +static inline void set_secvar_ops(struct secvar_operations *ops)
> +{
> +}
> +
> +static inline struct secvar_operations *get_secvar_ops(void)
> +{
> + return NULL;
> +}
> +
> +#endif
> +#ifdef CONFIG_OPAL_SECVAR
> 

Re: [PATCH 1/2] net/ibmvnic: unlock rtnl_lock in reset so linkwatch_event can run

2019-08-21 Thread David Miller
From: Juliet Kim 
Date: Tue, 20 Aug 2019 17:31:19 -0400

> Commit a5681e20b541 ("net/ibmnvic: Fix deadlock problem in reset") 
> made the change to hold the RTNL lock during a reset to avoid deadlock 
> but linkwatch_event is fired during the reset and needs the RTNL lock.  
> That keeps linkwatch_event process from proceeding until the reset 
> is complete. The reset process cannot tolerate the linkwatch_event 
> processing after reset completes, so release the RTNL lock during the 
> process to allow a chance for linkwatch_event to run during reset. 
> This does not guarantee that the linkwatch_event will be processed as 
> soon as link state changes, but is an improvement over the current code
> where linkwatch_event processing is always delayed, which prevents 
> transmissions on the device from being deactivated leading transmit 
> watchdog timer to time-out. 
> 
> Release the RTNL lock before link state change and re-acquire after 
> the link state change to allow linkwatch_event to grab the RTNL lock 
> and run during the reset.
> 
> Fixes: a5681e20b541 ("net/ibmnvic: Fix deadlock problem in reset")
> Signed-off-by: Juliet Kim 

Conditional locking, especially such extensive use of conditional
locking as is being done here, is strongly discouraged and is always
indicative of bad design.

Please try to rework this change such that the code paths that want
to lock things a certain way are 100% segregated functionally into
different code paths and functions.

Or feel free to find a cleaner way to fix this.
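
For illustration, that kind of segregation looks roughly like the sketch
below (hypothetical helper names, not a concrete proposal for this
driver):

	/* __do_reset() is a stand-in for the common reset body. */

	/* Caller already holds rtnl_lock and keeps holding it. */
	static int do_reset_rtnl_held(struct ibmvnic_adapter *adapter)
	{
		ASSERT_RTNL();
		return __do_reset(adapter);
	}

	/* Caller does not hold rtnl_lock; take and release it here. */
	static int do_reset_rtnl_unlocked(struct ibmvnic_adapter *adapter)
	{
		int rc;

		rtnl_lock();
		rc = __do_reset(adapter);
		rtnl_unlock();
		return rc;
	}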


[PATCH v6 7/7] powerpc/kvm: Use UV_RETURN ucall to return to ultravisor

2019-08-21 Thread Claudio Carvalho
From: Sukadev Bhattiprolu 

When an SVM makes an hypercall or incurs some other exception, the
Ultravisor usually forwards (a.k.a. reflects) the exceptions to the
Hypervisor. After processing the exception, Hypervisor uses the
UV_RETURN ultracall to return control back to the SVM.

The expected register state on entry to this ultracall is:

* Non-volatile registers are restored to their original values.
* If returning from an hypercall, register R0 contains the return value
  (unlike other ultracalls) and, registers R4 through R12 contain any
  output values of the hypercall.
* R3 contains the ultracall number, i.e UV_RETURN.
* If returning with a synthesized interrupt, R2 contains the
  synthesized interrupt number.

Thanks to input from Paul Mackerras, Ram Pai and Mike Anderson.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Claudio Carvalho 
---
 arch/powerpc/include/asm/kvm_host.h   |  1 +
 arch/powerpc/include/asm/ultravisor-api.h |  1 +
 arch/powerpc/kernel/asm-offsets.c |  1 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 39 +++
 4 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index e6e5f59aaa97..4bb552d639b8 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -283,6 +283,7 @@ struct kvm_arch {
cpumask_t cpu_in_guest;
u8 radix;
u8 fwnmi_enabled;
+   u8 secure_guest;
bool threads_indep;
bool nested_enable;
pgd_t *pgtable;
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 8cd49abff4f3..6a0f9c74f959 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -24,5 +24,6 @@
 
 /* opcodes */
 #define UV_WRITE_PATE  0xF104
+#define UV_RETURN  0xF11C
 
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 4ccb6b3a7fbd..484f54dab247 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -506,6 +506,7 @@ int main(void)
OFFSET(KVM_VRMA_SLB_V, kvm, arch.vrma_slb_v);
OFFSET(KVM_RADIX, kvm, arch.radix);
OFFSET(KVM_FWNMI, kvm, arch.fwnmi_enabled);
+   OFFSET(KVM_SECURE_GUEST, kvm, arch.secure_guest);
OFFSET(VCPU_DSISR, kvm_vcpu, arch.shregs.dsisr);
OFFSET(VCPU_DAR, kvm_vcpu, arch.shregs.dar);
OFFSET(VCPU_VPA, kvm_vcpu, arch.vpa.pinned_addr);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 07181d0dfcb7..9a05b0d932ef 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Sign-extend HDEC if not on POWER9 */
 #define EXTEND_HDEC(reg)   \
@@ -1085,16 +1086,10 @@ BEGIN_FTR_SECTION
 END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 
ld  r5, VCPU_LR(r4)
-   ld  r6, VCPU_CR(r4)
mtlr    r5
-   mtcr    r6
 
ld  r1, VCPU_GPR(R1)(r4)
-   ld  r2, VCPU_GPR(R2)(r4)
-   ld  r3, VCPU_GPR(R3)(r4)
ld  r5, VCPU_GPR(R5)(r4)
-   ld  r6, VCPU_GPR(R6)(r4)
-   ld  r7, VCPU_GPR(R7)(r4)
ld  r8, VCPU_GPR(R8)(r4)
ld  r9, VCPU_GPR(R9)(r4)
ld  r10, VCPU_GPR(R10)(r4)
@@ -1112,10 +1107,42 @@ BEGIN_FTR_SECTION
mtspr   SPRN_HDSISR, r0
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 
+   ld  r6, VCPU_KVM(r4)
+   lbz r7, KVM_SECURE_GUEST(r6)
+   cmpdi   r7, 0
+   ld  r6, VCPU_GPR(R6)(r4)
+   ld  r7, VCPU_GPR(R7)(r4)
+   bne ret_to_ultra
+
+   lwz r0, VCPU_CR(r4)
+   mtcr    r0
+
ld  r0, VCPU_GPR(R0)(r4)
+   ld  r2, VCPU_GPR(R2)(r4)
+   ld  r3, VCPU_GPR(R3)(r4)
ld  r4, VCPU_GPR(R4)(r4)
HRFI_TO_GUEST
b   .
+/*
+ * Use UV_RETURN ultracall to return control back to the Ultravisor after
+ * processing an hypercall or interrupt that was forwarded (a.k.a. reflected)
+ * to the Hypervisor.
+ *
+ * All registers have already been loaded, except:
+ *   R0 = hcall result
+ *   R2 = SRR1, so UV can detect a synthesized interrupt (if any)
+ *   R3 = UV_RETURN
+ */
+ret_to_ultra:
+   lwz r0, VCPU_CR(r4)
+   mtcr    r0
+
+   ld  r0, VCPU_GPR(R3)(r4)
+   mfspr   r2, SPRN_SRR1
+   li  r3, 0
+   ori r3, r3, UV_RETURN
+   ld  r4, VCPU_GPR(R4)(r4)
+   sc  2
 
 /*
  * Enter the guest on a P9 or later system where we have exactly
-- 
2.20.1



[PATCH v6 6/7] powerpc/powernv: Access LDBAR only if ultravisor disabled

2019-08-21 Thread Claudio Carvalho
LDBAR is a per-thread SPR populated and used by the thread-imc pmu
driver to dump the data counter into memory. It contains a memory address
along with a few other configuration bits. LDBAR is populated and enabled only
when any of the thread imc pmu events are monitored.

In ultravisor enabled systems, LDBAR becomes ultravisor privileged and
an attempt to write to it will cause a Hypervisor Emulation Assistance
interrupt.

In ultravisor enabled systems, the ultravisor is responsible to maintain
the LDBAR (e.g. save and restore it).

This restricts LDBAR access to only when ultravisor is disabled.

Signed-off-by: Claudio Carvalho 
Reviewed-by: Ram Pai 
Reviewed-by: Ryan Grimm 
---
 arch/powerpc/platforms/powernv/idle.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 09f49eed7fb8..78599bca66c2 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -675,7 +675,8 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
sprs.ptcr   = mfspr(SPRN_PTCR);
sprs.rpr= mfspr(SPRN_RPR);
sprs.tscr   = mfspr(SPRN_TSCR);
-   sprs.ldbar  = mfspr(SPRN_LDBAR);
+   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
+   sprs.ldbar = mfspr(SPRN_LDBAR);
 
sprs_saved = true;
 
@@ -789,7 +790,8 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
mtspr(SPRN_MMCR0,   sprs.mmcr0);
mtspr(SPRN_MMCR1,   sprs.mmcr1);
mtspr(SPRN_MMCR2,   sprs.mmcr2);
-   mtspr(SPRN_LDBAR,   sprs.ldbar);
+   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
+   mtspr(SPRN_LDBAR, sprs.ldbar);
 
mtspr(SPRN_SPRG3,   local_paca->sprg_vdso);
 
-- 
2.20.1



[PATCH v6 5/7] powerpc/mm: Write to PTCR only if ultravisor disabled

2019-08-21 Thread Claudio Carvalho
In ultravisor enabled systems, PTCR becomes ultravisor privileged only
for writing and an attempt to write to it will cause a Hypervisor
Emulation Assistance interrupt.

This patch uses the set_ptcr_when_no_uv() function to restrict PTCR
writing to only when ultravisor is disabled.

Signed-off-by: Claudio Carvalho 
---
 arch/powerpc/include/asm/ultravisor.h| 12 
 arch/powerpc/mm/book3s64/hash_utils.c|  5 +++--
 arch/powerpc/mm/book3s64/pgtable.c   |  2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |  8 +---
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index 6fe1f365dec8..d7aa97aa7834 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -10,10 +10,22 @@
 
 #include 
 #include 
+#include 
 
 int early_init_dt_scan_ultravisor(unsigned long node, const char *uname,
  int depth, void *data);
 
+/*
+ * In ultravisor enabled systems, PTCR becomes ultravisor privileged only for
+ * writing and an attempt to write to it will cause a Hypervisor Emulation
+ * Assistance interrupt.
+ */
+static inline void set_ptcr_when_no_uv(u64 val)
+{
+   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
+   mtspr(SPRN_PTCR, val);
+}
+
 static inline int uv_register_pate(u64 lpid, u64 dw0, u64 dw1)
 {
return ucall_norets(UV_WRITE_PATE, lpid, dw0, dw1);
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index c3bfef08dcf8..4d0bb194ed70 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -62,6 +62,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1076,8 +1077,8 @@ void hash__early_init_mmu_secondary(void)
if (!cpu_has_feature(CPU_FTR_ARCH_300))
mtspr(SPRN_SDR1, _SDR1);
else
-   mtspr(SPRN_PTCR,
- __pa(partition_tb) | (PATB_SIZE_SHIFT - 12));
+   set_ptcr_when_no_uv(__pa(partition_tb) |
+   (PATB_SIZE_SHIFT - 12));
}
/* Initialize SLB */
slb_initialize();
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 4173f6931009..01a7570c10e0 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -207,7 +207,7 @@ void __init mmu_partition_table_init(void)
 * 64 K size.
 */
ptcr = __pa(partition_tb) | (PATB_SIZE_SHIFT - 12);
-   mtspr(SPRN_PTCR, ptcr);
+   set_ptcr_when_no_uv(ptcr);
powernv_set_nmmu_ptcr(ptcr);
 }
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 2204d8eeb784..ec3560b96177 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -650,8 +651,9 @@ void radix__early_init_mmu_secondary(void)
lpcr = mfspr(SPRN_LPCR);
mtspr(SPRN_LPCR, lpcr | LPCR_UPRT | LPCR_HR);
 
-   mtspr(SPRN_PTCR,
- __pa(partition_tb) | (PATB_SIZE_SHIFT - 12));
+   set_ptcr_when_no_uv(__pa(partition_tb) |
+   (PATB_SIZE_SHIFT - 12));
+
radix_init_amor();
}
 
@@ -667,7 +669,7 @@ void radix__mmu_cleanup_all(void)
if (!firmware_has_feature(FW_FEATURE_LPAR)) {
lpcr = mfspr(SPRN_LPCR);
mtspr(SPRN_LPCR, lpcr & ~LPCR_UPRT);
-   mtspr(SPRN_PTCR, 0);
+   set_ptcr_when_no_uv(0);
powernv_set_nmmu_ptcr(0);
radix__flush_tlb_all();
}
-- 
2.20.1



[PATCH v6 4/7] powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE

2019-08-21 Thread Claudio Carvalho
From: Michael Anderson 

When Ultravisor (UV) is enabled, the partition table is stored in secure
memory and can only be accessed via the UV. The Hypervisor (HV) however
maintains a copy of the partition table in normal memory to allow Nest MMU
translations to occur (for normal VMs). The HV copy includes partition
table entries (PATEs) for secure VMs which would currently be unused
(Nest MMU translations cannot access secure memory) but they would be
needed as we add functionality.

This patch adds the UV_WRITE_PATE ucall which is used to update the PATE
for a VM (both normal and secure) when Ultravisor is enabled.

Signed-off-by: Michael Anderson 
Signed-off-by: Madhavan Srinivasan 
Signed-off-by: Ram Pai 
[ cclaudio: Write the PATE in HV's table before doing that in UV's ]
Signed-off-by: Claudio Carvalho 
Reviewed-by: Ryan Grimm 
---
 arch/powerpc/include/asm/ultravisor-api.h |  5 +++
 arch/powerpc/include/asm/ultravisor.h |  8 
 arch/powerpc/mm/book3s64/pgtable.c| 50 +--
 3 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
index 88ffa78f9d61..8cd49abff4f3 100644
--- a/arch/powerpc/include/asm/ultravisor-api.h
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -11,6 +11,7 @@
 #include 
 
 /* Return codes */
+#define U_BUSY H_BUSY
 #define U_FUNCTION H_FUNCTION
 #define U_NOT_AVAILABLEH_NOT_AVAILABLE
 #define U_P2   H_P2
@@ -18,6 +19,10 @@
 #define U_P4   H_P4
 #define U_P5   H_P5
 #define U_PARAMETERH_PARAMETER
+#define U_PERMISSION   H_PERMISSION
 #define U_SUCCESS  H_SUCCESS
 
+/* opcodes */
+#define UV_WRITE_PATE  0xF104
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
index dc6e1ea198f2..6fe1f365dec8 100644
--- a/arch/powerpc/include/asm/ultravisor.h
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -8,7 +8,15 @@
 #ifndef _ASM_POWERPC_ULTRAVISOR_H
 #define _ASM_POWERPC_ULTRAVISOR_H
 
+#include 
+#include 
+
 int early_init_dt_scan_ultravisor(unsigned long node, const char *uname,
  int depth, void *data);
 
+static inline int uv_register_pate(u64 lpid, u64 dw0, u64 dw1)
+{
+   return ucall_norets(UV_WRITE_PATE, lpid, dw0, dw1);
+}
+
 #endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 7d0e0d0d22c4..4173f6931009 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -12,6 +12,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -209,21 +211,10 @@ void __init mmu_partition_table_init(void)
powernv_set_nmmu_ptcr(ptcr);
 }
 
-void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
-  unsigned long dw1)
+static void flush_partition(unsigned int lpid, bool radix)
 {
-   unsigned long old = be64_to_cpu(partition_tb[lpid].patb0);
-
-   partition_tb[lpid].patb0 = cpu_to_be64(dw0);
-   partition_tb[lpid].patb1 = cpu_to_be64(dw1);
-
-   /*
-* Global flush of TLBs and partition table caches for this lpid.
-* The type of flush (hash or radix) depends on what the previous
-* use of this partition ID was, not the new use.
-*/
asm volatile("ptesync" : : : "memory");
-   if (old & PATB_HR) {
+   if (radix) {
asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
 "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
@@ -237,6 +228,39 @@ void mmu_partition_table_set_entry(unsigned int lpid, 
unsigned long dw0,
/* do we need fixup here ?*/
asm volatile("eieio; tlbsync; ptesync" : : : "memory");
 }
+
+void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
+ unsigned long dw1)
+{
+   unsigned long old = be64_to_cpu(partition_tb[lpid].patb0);
+
+   /*
+* When ultravisor is enabled, the partition table is stored in secure
+* memory and can only be accessed doing an ultravisor call. However, we
+* maintain a copy of the partition table in normal memory to allow Nest
+* MMU translations to occur (for normal VMs).
+*
+* Therefore, here we always update partition_tb, regardless of whether
+* we are running under an ultravisor or not.
+*/
+   partition_tb[lpid].patb0 = cpu_to_be64(dw0);
+   partition_tb[lpid].patb1 = cpu_to_be64(dw1);
+
+   /*
+* If ultravisor is enabled, we do an ultravisor call to register the
+* partition table entry (PATE), which also do a global flush of TLBs
+* and partition table caches for the lpid. 

[PATCH v6 3/7] powerpc/powernv: Introduce FW_FEATURE_ULTRAVISOR

2019-08-21 Thread Claudio Carvalho
In PEF enabled systems, some of the resources which were previously
hypervisor privileged are now ultravisor privileged and controlled by
the ultravisor firmware.

This adds FW_FEATURE_ULTRAVISOR to indicate if PEF is enabled.

The host kernel can use FW_FEATURE_ULTRAVISOR, for instance, to skip
accessing resources (e.g. PTCR and LDBAR) in case PEF is enabled.

Signed-off-by: Claudio Carvalho 
[ andmike: Device node name to "ibm,ultravisor" ]
Signed-off-by: Michael Anderson 
---
 arch/powerpc/include/asm/firmware.h |  5 +++--
 arch/powerpc/include/asm/ultravisor.h   | 14 
 arch/powerpc/kernel/prom.c  |  4 
 arch/powerpc/platforms/powernv/Makefile |  1 +
 arch/powerpc/platforms/powernv/ultravisor.c | 24 +
 5 files changed, 46 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/ultravisor.h
 create mode 100644 arch/powerpc/platforms/powernv/ultravisor.c

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index faeca8b76c8c..b3e214a97f3a 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -50,6 +50,7 @@
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0008)
 #define FW_FEATURE_BLOCK_REMOVE ASM_CONST(0x0010)
 #define FW_FEATURE_PAPR_SCMASM_CONST(0x0020)
+#define FW_FEATURE_ULTRAVISOR  ASM_CONST(0x0040)
 
 #ifndef __ASSEMBLY__
 
@@ -68,9 +69,9 @@ enum {
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
FW_FEATURE_DRC_INFO | FW_FEATURE_BLOCK_REMOVE |
-   FW_FEATURE_PAPR_SCM,
+   FW_FEATURE_PAPR_SCM | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_PSERIES_ALWAYS = 0,
-   FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
+   FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_ULTRAVISOR,
FW_FEATURE_POWERNV_ALWAYS = 0,
FW_FEATURE_PS3_POSSIBLE = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
FW_FEATURE_PS3_ALWAYS = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
diff --git a/arch/powerpc/include/asm/ultravisor.h 
b/arch/powerpc/include/asm/ultravisor.h
new file mode 100644
index ..dc6e1ea198f2
--- /dev/null
+++ b/arch/powerpc/include/asm/ultravisor.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Ultravisor definitions
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#ifndef _ASM_POWERPC_ULTRAVISOR_H
+#define _ASM_POWERPC_ULTRAVISOR_H
+
+int early_init_dt_scan_ultravisor(unsigned long node, const char *uname,
+ int depth, void *data);
+
+#endif /* _ASM_POWERPC_ULTRAVISOR_H */
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 7159e791a70d..5828f1c81dc9 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -702,6 +703,9 @@ void __init early_init_devtree(void *params)
 #ifdef CONFIG_PPC_POWERNV
/* Some machines might need OPAL info for debugging, grab it now. */
of_scan_flat_dt(early_init_dt_scan_opal, NULL);
+
+   /* Scan tree for ultravisor feature */
+   of_scan_flat_dt(early_init_dt_scan_ultravisor, NULL);
 #endif
 
 #ifdef CONFIG_FA_DUMP
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index 69a3aefa905b..49186f0985a4 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -4,6 +4,7 @@ obj-y   += idle.o opal-rtc.o opal-nvram.o 
opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
 obj-y  += opal-kmsg.o opal-powercap.o opal-psr.o 
opal-sensor-groups.o
+obj-y  += ultravisor.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
diff --git a/arch/powerpc/platforms/powernv/ultravisor.c 
b/arch/powerpc/platforms/powernv/ultravisor.c
new file mode 100644
index ..02ac57b4bded
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/ultravisor.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Ultravisor high level interfaces
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+int __init early_init_dt_scan_ultravisor(unsigned long node, const char *uname,
+int depth, void *data)
+{
+   if (!of_flat_dt_is_compatible(node, "ibm,ultravisor"))
+   return 0;
+
+   powerpc_firmware_features |= FW_FEATURE_ULTRAVISOR;
+   pr_debug("Ultravisor detected!\n");
+   return 1;
+}
-- 
2.20.1



[PATCH v6 2/7] powerpc/kernel: Add ucall_norets() ultravisor call handler

2019-08-21 Thread Claudio Carvalho
The ultracalls (ucalls for short) allow the Secure Virtual Machines
(SVMs) and the hypervisor to request services from the ultravisor such as
accessing a register or memory region that can only be accessed when
running in ultravisor-privileged mode.

This patch adds the ucall_norets() ultravisor call handler.

The specific service needed from a ucall is specified in register
R3 (the first parameter to the ucall). Other parameters to the
ucall, if any, are specified in registers R4 through R12.

Return value of all ucalls is in register R3. Other output values
from the ucall, if any, are returned in registers R4 through R12.

Each ucall returns specific error codes, applicable in the context
of the ucall. However, like with the PowerPC Architecture Platform
Reference (PAPR), if no specific error code is defined for a particular
situation, then the ucall will fall back to an error code based on the
parameter position, i.e. U_PARAMETER, U_P2, U_P3, etc., depending
on the ucall parameter that may have caused the error.

Every host kernel (powernv) needs to be able to do ucalls in case it
ends up being run in a machine with ultravisor enabled. Otherwise, the
kernel may crash early in boot trying to access ultravisor resources,
for instance, trying to set the partition table entry 0. Secure guests
also need to be able to do ucalls and its kernel may not have
CONFIG_PPC_POWERNV=y. For that reason, the ucall.S file is placed under
arch/powerpc/kernel.

If ultravisor is not enabled, the ucalls will be redirected to the
hypervisor which must handle/fail the call.

Thanks to inputs from Ram Pai and Michael Anderson.

Signed-off-by: Claudio Carvalho 

---
Ultravisor call support for secure guests is being proposed as part of
the patchset "Secure Virtual Machine Enablement" posted by Thiago
Bauermann.
---
 arch/powerpc/include/asm/asm-prototypes.h | 11 +++
 arch/powerpc/include/asm/ultravisor-api.h | 23 +++
 arch/powerpc/kernel/Makefile  |  1 +
 arch/powerpc/kernel/ucall.S   | 14 ++
 4 files changed, 49 insertions(+)
 create mode 100644 arch/powerpc/include/asm/ultravisor-api.h
 create mode 100644 arch/powerpc/kernel/ucall.S

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index ec1c97a8e8cb..e698f48cbc6d 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -34,6 +35,16 @@ extern struct static_key hcall_tracepoint_key;
 void __trace_hcall_entry(unsigned long opcode, unsigned long *args);
 void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf);
 
+/* Ultravisor */
+#ifdef CONFIG_PPC_POWERNV
+long ucall_norets(unsigned long opcode, ...);
+#else
+static inline long ucall_norets(unsigned long opcode, ...)
+{
+   return U_NOT_AVAILABLE;
+}
+#endif
+
 /* OPAL */
 int64_t __opal_call(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
int64_t a4, int64_t a5, int64_t a6, int64_t a7,
diff --git a/arch/powerpc/include/asm/ultravisor-api.h 
b/arch/powerpc/include/asm/ultravisor-api.h
new file mode 100644
index ..88ffa78f9d61
--- /dev/null
+++ b/arch/powerpc/include/asm/ultravisor-api.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Ultravisor API.
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#ifndef _ASM_POWERPC_ULTRAVISOR_API_H
+#define _ASM_POWERPC_ULTRAVISOR_API_H
+
+#include 
+
+/* Return codes */
+#define U_FUNCTION H_FUNCTION
+#define U_NOT_AVAILABLEH_NOT_AVAILABLE
+#define U_P2   H_P2
+#define U_P3   H_P3
+#define U_P4   H_P4
+#define U_P5   H_P5
+#define U_PARAMETERH_PARAMETER
+#define U_SUCCESS  H_SUCCESS
+
+#endif /* _ASM_POWERPC_ULTRAVISOR_API_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index ea0c69236789..c6c4ea240b2a 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -156,6 +156,7 @@ endif
 
 obj-$(CONFIG_EPAPR_PARAVIRT)   += epapr_paravirt.o epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
+obj-$(CONFIG_PPC_POWERNV)  += ucall.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/ucall.S b/arch/powerpc/kernel/ucall.S
new file mode 100644
index ..07296bc39166
--- /dev/null
+++ b/arch/powerpc/kernel/ucall.S
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Generic code to perform an ultravisor call.
+ *
+ * Copyright 2019, IBM Corporation.
+ *
+ */
+#include 
+#include 
+
+_GLOBAL(ucall_norets)
+EXPORT_SYMBOL_GPL(ucall_norets)
+   sc  2   /* Invoke the ultravisor */
+   blr /* Return r3 = status */
-- 
2.20.1



[PATCH v6 1/7] Documentation/powerpc: Ultravisor API

2019-08-21 Thread Claudio Carvalho
From: Sukadev Bhattiprolu 

Protected Execution Facility (PEF) is an architectural change for
POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
PEF adds a new higher privileged mode, called Ultravisor mode, to POWER
architecture. Along with the new mode there is new firmware called the
Protected Execution Ultravisor (or Ultravisor for short).

POWER 9 DD2.3 chips (PVR=0x004e1203) or greater will be PEF-capable.

Attached documentation provides an overview of PEF and defines the API
for various interfaces that must be implemented in the Ultravisor
firmware as well as in the KVM Hypervisor.

Based on input from Mike Anderson, Thiago Bauermann, Claudio Carvalho,
Ben Herrenschmidt, Guerney Hunt, Paul Mackerras.

Signed-off-by: Sukadev Bhattiprolu 
Signed-off-by: Ram Pai 
Signed-off-by: Guerney Hunt 
Reviewed-by: Claudio Carvalho 
Reviewed-by: Michael Anderson 
Reviewed-by: Thiago Bauermann 
Signed-off-by: Claudio Carvalho 
---
 Documentation/powerpc/ultravisor.rst | 1057 ++
 1 file changed, 1057 insertions(+)
 create mode 100644 Documentation/powerpc/ultravisor.rst

diff --git a/Documentation/powerpc/ultravisor.rst 
b/Documentation/powerpc/ultravisor.rst
new file mode 100644
index ..94a149f34ec3
--- /dev/null
+++ b/Documentation/powerpc/ultravisor.rst
@@ -0,0 +1,1057 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _ultravisor:
+
+
+Protected Execution Facility
+
+
+.. contents::
+:depth: 3
+
+.. sectnum::
+:depth: 3
+
+Protected Execution Facility
+
+
+Protected Execution Facility (PEF) is an architectural change for
+POWER 9 that enables Secure Virtual Machines (SVMs). DD2.3 chips
+(PVR=0x004e1203) or greater will be PEF-capable. A new ISA release
+will include the PEF RFC02487 changes.
+
+When enabled, PEF adds a new higher privileged mode, called Ultravisor
+mode, to POWER architecture. Along with the new mode there is new
+firmware called the Protected Execution Ultravisor (or Ultravisor
+for short). Ultravisor mode is the highest privileged mode in POWER
+architecture.
+
+   +--+
+   | Privilege States |
+   +==+
+   |  Problem |
+   +--+
+   |  Supervisor  |
+   +--+
+   |  Hypervisor  |
+   +--+
+   |  Ultravisor  |
+   +--+
+
+PEF protects SVMs from the hypervisor, privileged users, and other
+VMs in the system. SVMs are protected while at rest and can only be
+executed by an authorized machine. All virtual machines utilize
+hypervisor services. The Ultravisor filters calls between the SVMs
+and the hypervisor to assure that information does not accidentally
+leak. All hypercalls except H_RANDOM are reflected to the hypervisor.
+H_RANDOM is not reflected to prevent the hypervisor from influencing
+random values in the SVM.
+
+To support this there is a refactoring of the ownership of resources
+in the CPU. Some of the resources which were previously hypervisor
+privileged are now ultravisor privileged.
+
+Hardware
+
+
+The hardware changes include the following:
+
+* There is a new bit in the MSR that determines whether the current
+  process is running in secure mode, MSR(S) bit 41. MSR(S)=1, process
+  is in secure mode, MSR(S)=0, process is in normal mode.
+
+* The MSR(S) bit can only be set by the Ultravisor.
+
+* HRFID cannot be used to set the MSR(S) bit. If the hypervisor needs
+  to return to a SVM it must use an ultracall. It can determine if
+  the VM it is returning to is secure.
+
+* There is a new Ultravisor privileged register, SMFCTRL, which has an
+  enable/disable bit SMFCTRL(E).
+
+* The privilege of a process is now determined by three MSR bits,
+  MSR(S, HV, PR). In each of the tables below the modes are listed
+  from least privilege to highest privilege. The higher privilege
+  modes can access all the resources of the lower privilege modes.
+
+  **Secure Mode MSR Settings**
+
+  +---+---+---+---+
+  | S | HV| PR|Privilege  |
+  +===+===+===+===+
+  | 1 | 0 | 1 | Problem   |
+  +---+---+---+---+
+  | 1 | 0 | 0 | Privileged(OS)|
+  +---+---+---+---+
+  | 1 | 1 | 0 | Ultravisor|
+  +---+---+---+---+
+  | 1 | 1 | 1 | Reserved  |
+  +---+---+---+---+
+
+  **Normal Mode MSR Settings**
+
+  +---+---+---+---+
+  | S | HV| PR|Privilege  |
+  +===+===+===+===+
+  | 0 | 0 | 1 | Problem   |
+  +---+---+---+---+
+  | 0 | 0 | 0 | Privileged(OS)|
+  +---+---+---+---+
+  | 0 | 1 | 0 | Hypervisor|
+  

[PATCH v6 0/7] kvmppc: Paravirtualize KVM to support ultravisor

2019-08-21 Thread Claudio Carvalho
Protected Execution Facility (PEF) is an architectural change for POWER 9
that enables Secure Virtual Machines (SVMs). When enabled, PEF adds a new
higher privileged mode, called Ultravisor mode, to POWER architecture.
Along with the new mode there is new firmware called the Protected
Execution Ultravisor (or Ultravisor for short). Ultravisor mode is the
highest privileged mode in POWER architecture.

The Ultravisor calls allow the SVMs and Hypervisor to request services from
the Ultravisor such as accessing a register or memory region that can only
be accessed when running in Ultravisor-privileged mode.

This patch set adds support for Ultravisor calls and do some preparation
for running secure guests.

---
Changelog:
---

v5->v6:

- Patch "Documentation/powerpc: Ultravisor API"
  - Added chip version and PEF RFC number as suggested by mpe.
  - Applied Fabiano Rosas' suggestions.
  - Updated the commit message.

- Patch "powerpc/kernel: Add ucall_norets() ultravisor call handler"
  - Dropped CR save/restore from ucall_norets.
  - Updated the commit message.

- Patch "powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE"
  - Replaced the "old_patb0" arg in flush_partition() by "bool radix".
  - Better documented mmu_partition_table_set_entry() to address mpe's
questions about the copy of the partition table.
  - Updated the commit message.

- Patch "powerpc/mm: Write to PTCR only if ultravisor disabled"
  - Moved reg.h changes to ultravisor.h.
  - Replaced the try_set_ptcr macro by static inline set_ptcr_when_no_uv().
  - Updated the commit message.

- Rebased to powerpc/next

v4->v5:

- New patch "Documentation/powerpc: Ultravisor API"

- Patch "v4: KVM: PPC: Ultravisor: Add generic ultravisor call handler":
  - Made global the ucall_norets symbol without adding it to the TOC.
  - Implemented ucall_norets() rather than ucall().
  - Defined the ucall_norets in "asm/asm-prototypes.h" for symbol
versioning.
  - Renamed to "powerpc/kernel: Add ucall_norets() ultravisor call
handler".
  - Updated the commit message.

- Patch "v4: powerpc: Introduce FW_FEATURE_ULTRAVISOR":
  - Changed to scan for a node that is compatible with "ibm,ultravisor"
  - Renamed to "powerpc/powernv: Introduce FW_FEATURE_ULTRAVISOR".
  - Updated the commit message.

- Patch "v4: KVM: PPC: Ultravisor: Restrict flush of the partition tlb
  cache":
  - Merged into "v4: ... Use UV_WRITE_PATE ucall to register a PATE".

- Patch "v4: KVM: PPC: Ultravisor: Use UV_WRITE_PATE ucall to register a
  PATE":
  - Added back the missing "ptesync" instruction in flush_partition().
  - Updated source code comments for the partition table creation.
  - Factored out "powerpc/mm: Write to PTCR only if ultravisor disabled".
  - Cleaned up the code a bit.
  - Renamed to "powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE".
  - Updated the commit message.

- Patch "v4: KVM: PPC: Ultravisor: Restrict LDBAR access":
  - Dropped the change that skips loading the IMC driver if ultravisor
enabled because skiboot will remove the IMC devtree nodes if
ultravisor enabled.
  - Dropped the BEGIN_{END_}FW_FTR_SECTION_NESTED in power8 code.
  - Renamed to "powerpc/powernv: Access LDBAR only if ultravisor
disabled".
  - Updated the commit message.

- Patch "v4: KVM: PPC: Ultravisor: Enter a secure guest":
  - Open-coded "LOAD_REG_IMMEDIATE(r3, UV_RETURN)" to save instructions
  - Used R2, rather than R11, to pass synthesized interrupts in
UV_RETURN ucall.
  - Dropped the change that preserves the MSR[S] bit in
"kvmppc_msr_interrupt" because that is done by the ultravisor.
  - Hoisted up the load of R6 and R7 to before "bne ret_to_ultra".
  - Cleaned up the code a bit.
  - Renamed to "powerpc/kvm: Use UV_RETURN ucall to return to ultravisor".
  - Updated the commit message.

- Patch "v4: KVM: PPC: Ultravisor: Check for MSR_S during hv_reset_msr":
  - Dropped from the patch set because "kvm_arch->secure_guest" rather
than MSR[S] is used to determine if we need to return to the
ultravisor.

- Patch "v4: KVM: PPC: Ultravisor: Introduce the MSR_S bit":
  - Moved to the patch set "Secure Virtual Machine Enablement" posted by
Thiago Bauermann. MSR[S] is no longer needed in this patch set.

- Rebased to powerpc/next

v3->v4:

- Patch "KVM: PPC: Ultravisor: Add PPC_UV config option":
  - Moved to the patchset "kvmppc: HMM driver to manage pages of secure
guest" v5 that will be posted by Bharata Rao.

- Patch "powerpc: Introduce FW_FEATURE_ULTRAVISOR":
  - Changed to depend only on CONFIG_PPC_POWERNV.

- Patch "KVM: PPC: Ultravisor: Add generic ultravisor call handler":
  - Fixed whitespaces in ucall.S and in ultravisor-api.h.
  - Changed to depend only on CONFIG_PPC_POWERNV.
  - Changed the ucall wrapper to pass the ucall number in R3.

- Patch "KVM: PPC: Ultravisor: Use UV_WRITE_PATE ucall to register a
  PATE:
  - Changed to depend only on CONFIG_PPC_POWERNV.

- Patch "KVM: PPC: Ultravisor: Restrict LDBAR access":
  - Fixed 

Re: [PATCH v6 1/7] kvmppc: Driver to manage pages of secure guest

2019-08-21 Thread Bharata B Rao
On Tue, Aug 20, 2019 at 12:04:33AM -0300, Thiago Jung Bauermann wrote:
> 
> Hello Bharata,
> 
> I have just a couple of small comments.
> 
> Bharata B Rao  writes:
> 
> > +/*
> > + * Get a free device PFN from the pool
> > + *
> > + * Called when a normal page is moved to secure memory (UV_PAGE_IN). Device
> > + * PFN will be used to keep track of the secure page on HV side.
> > + *
> > + * @rmap here is the slot in the rmap array that corresponds to @gpa.
> > + * Thus a non-zero rmap entry indicates that the corresonding guest
> 
> Typo: corresponding
> 
> > +static u64 kvmppc_get_secmem_size(void)
> > +{
> > +   struct device_node *np;
> > +   int i, len;
> > +   const __be32 *prop;
> > +   u64 size = 0;
> > +
> > +   np = of_find_node_by_path("/ibm,ultravisor/ibm,uv-firmware");
> > +   if (!np)
> > +   goto out;
> 
> I believe that in general we try to avoid hard-coding the path when a
> node is accessed and searched instead via its compatible property.

Thanks, will switch to of_find_compatible_node().
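
I.e. something along these lines (assuming the firmware node carries an
"ibm,uv-firmware" compatible string):

	np = of_find_compatible_node(NULL, NULL, "ibm,uv-firmware");
	if (!np)
		goto out;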

Regards,
Bharata.



Re: [PATCH v5 4/7] powerpc/mm: Use UV_WRITE_PATE ucall to register a PATE

2019-08-21 Thread Claudio Carvalho


On 8/14/19 8:33 AM, Michael Ellerman wrote:
> Hi Claudio,
>
> Claudio Carvalho  writes:
>> From: Michael Anderson 
>>
>> In ultravisor enabled systems, the ultravisor creates and maintains the
>> partition table in secure memory where the hypervisor cannot access, and
>^
>which?
>
>> therefore, the hypervisor have to do the UV_WRITE_PATE ucall whenever it
> ^  ^
> hasa
>> wants to set a partition table entry (PATE).
>>
>> This patch adds the UV_WRITE_PATE ucall and uses it to set a PATE if
>> ultravisor is enabled. Additionally, this also also keeps a copy of the
>> partition table because the nestMMU does not have access to secure
>> memory. Such copy has entries for nonsecure and hypervisor partition.
> I'm having trouble parsing the last sentence there.
>
> Or at least it doesn't seem to match the code, or I don't understand
> either the code or the comment. More below.
>
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
>> b/arch/powerpc/mm/book3s64/pgtable.c
>> index 85bc81abd286..033731f5dbaa 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -213,34 +223,50 @@ void __init mmu_partition_table_init(void)
>>  powernv_set_nmmu_ptcr(ptcr);
>>  }
>>  
>> -void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
>> -   unsigned long dw1)
>> +/*
>> + * Global flush of TLBs and partition table caches for this lpid. The type 
>> of
>> + * flush (hash or radix) depends on what the previous use of this partition 
>> ID
>> + * was, not the new use.
>> + */
>> +static void flush_partition(unsigned int lpid, unsigned long old_patb0)
> A nicer API would be for the 2nd param to be a "bool radix", and have
> the caller worry about the fact that it comes from (patb0 & PATB_HR).

Yes, I agree. I applied that to next patchset version.


>
>>  {
>> -unsigned long old = be64_to_cpu(partition_tb[lpid].patb0);
>> -
>> -partition_tb[lpid].patb0 = cpu_to_be64(dw0);
>> -partition_tb[lpid].patb1 = cpu_to_be64(dw1);
>> -
>> -/*
>> - * Global flush of TLBs and partition table caches for this lpid.
>> - * The type of flush (hash or radix) depends on what the previous
>> - * use of this partition ID was, not the new use.
>> - */
>>  asm volatile("ptesync" : : : "memory");
>> -if (old & PATB_HR) {
>> -asm volatile(PPC_TLBIE_5(%0,%1,2,0,1) : :
>> +if (old_patb0 & PATB_HR) {
>> +asm volatile(PPC_TLBIE_5(%0, %1, 2, 0, 1) : :
>>   "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
>> -asm volatile(PPC_TLBIE_5(%0,%1,2,1,1) : :
>> +asm volatile(PPC_TLBIE_5(%0, %1, 2, 1, 1) : :
> That looks like an unrelated whitespace change.
>
>>   "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
>>  trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 1);
>>  } else {
>> -asm volatile(PPC_TLBIE_5(%0,%1,2,0,0) : :
>> +asm volatile(PPC_TLBIE_5(%0, %1, 2, 0, 0) : :
> Ditto.

True. I dropped the two changes above in the next patchset version.

Thanks,
Claudio


>
>>   "r" (TLBIEL_INVAL_SET_LPID), "r" (lpid));
>>  trace_tlbie(lpid, 0, TLBIEL_INVAL_SET_LPID, lpid, 2, 0, 0);
>>  }
>>  /* do we need fixup here ?*/
>>  asm volatile("eieio; tlbsync; ptesync" : : : "memory");
>>  }
>> +
>> +void mmu_partition_table_set_entry(unsigned int lpid, unsigned long dw0,
>> +  unsigned long dw1)
>> +{
>> +unsigned long old = be64_to_cpu(partition_tb[lpid].patb0);
>> +
>> +partition_tb[lpid].patb0 = cpu_to_be64(dw0);
>> +partition_tb[lpid].patb1 = cpu_to_be64(dw1);
> ie. here we always update the copy of the partition table, regardless of
> whether we're running under an ultravisor or not. So the copy is a
> complete copy isn't it?
>
>> +/*
>> + * In ultravisor enabled systems, the ultravisor maintains the partition
>> + * table in secure memory where we don't have access, therefore, we have
>> + * to do a ucall to set an entry.
>> + */
>> +if (firmware_has_feature(FW_FEATURE_ULTRAVISOR)) {
>> +uv_register_pate(lpid, dw0, dw1);
>> +pr_info("PATE registered by ultravisor: dw0 = 0x%lx, dw1 = 
>> 0x%lx\n",
>> +dw0, dw1);
>> +} else {
>> +flush_partition(lpid, old);
>> +}
> What is different is whether we flush or not.
>
> And don't we still need to do the flush for the nestMMU? I assume we're
> saying the ultravisor will broadcast a flush for us, which will also
> handle the nestMMU case?
>
> cheers
>


Re: [PATCH v5 2/7] powerpc/kernel: Add ucall_norets() ultravisor call handler

2019-08-21 Thread Claudio Carvalho


On 8/14/19 3:34 PM, Segher Boessenkool wrote:
> On Wed, Aug 14, 2019 at 08:46:15PM +1000, Michael Ellerman wrote:
>> Claudio Carvalho  writes:
>>> +_GLOBAL(ucall_norets)
>>> +EXPORT_SYMBOL_GPL(ucall_norets)
>>> +   mfcrr0
>>> +   stw r0,8(r1)
>>> +
>>> +   sc  2   /* Invoke the ultravisor */
>>> +
>>> +   lwz r0,8(r1)
>>> +   mtcrf   0xff,r0
>>> +   blr /* Return r3 = status */
>> Paulus points that we shouldn't need to save CR here. Our caller will
>> have already saved it if it needed to, and we don't use CR in this
>> function so we don't need to save it.
>>
>> That's assuming the Ultravisor follows the hcall ABI in which CR2-4 are
>> non-volatile (PAPR § 14.5.3).
> And assuming the ultravisor already clears (or sets, or whatever) all CR
> fields it does not want to leak the contents of (which it also should,
> of course).

Thanks Segher. We are working on that in the ultravisor source code.

Claudio.


>
>> I know plpar_hcall_norets() does save CR, but it shouldn't need to, that
>> seems to be historical. aka. no one knows why it does it but it always
>> has.
>
> Segher
>


Re: [PATCH v5 2/7] powerpc/kernel: Add ucall_norets() ultravisor call handler

2019-08-21 Thread Claudio Carvalho


On 8/14/19 7:46 AM, Michael Ellerman wrote:
> Claudio Carvalho  writes:
>> diff --git a/arch/powerpc/kernel/ucall.S b/arch/powerpc/kernel/ucall.S
>> new file mode 100644
>> index ..de9133e45d21
>> --- /dev/null
>> +++ b/arch/powerpc/kernel/ucall.S
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * Generic code to perform an ultravisor call.
>> + *
>> + * Copyright 2019, IBM Corporation.
>> + *
>> + */
>> +#include 
>> +#include 
>> +
>> +_GLOBAL(ucall_norets)
>> +EXPORT_SYMBOL_GPL(ucall_norets)
>> +mfcrr0
>> +stw r0,8(r1)
>> +
>> +sc  2   /* Invoke the ultravisor */
>> +
>> +lwz r0,8(r1)
>> +mtcrf   0xff,r0
>> +blr /* Return r3 = status */
> Paulus points that we shouldn't need to save CR here. Our caller will
> have already saved it if it needed to, and we don't use CR in this
> function so we don't need to save it.

Dropped the CR save/restore in the next patchset version:

_GLOBAL(ucall_norets)
EXPORT_SYMBOL_GPL(ucall_norets)
    sc  2   /* Invoke the ultravisor */
    blr /* Return r3 = status */


Thanks,
Claudio


>
> That's assuming the Ultravisor follows the hcall ABI in which CR2-4 are
> non-volatile (PAPR § 14.5.3).
>
> I know plpar_hcall_norets() does save CR, but it shouldn't need to, that
> seems to be historical. aka. no one knows why it does it but it always
> has.
>
> cheers
>


[PATCH] powerpc/64s/exception: most SRR type interrupts need only test KVM for PR-KVM

2019-08-21 Thread Nicholas Piggin
Apart from SRESET, MCE, and syscall (hcall variant), SRR (EXC_STD) type
interrupts are not escalated to hypervisor mode, so they are delivered to the OS.

When running PR-KVM, the OS is the hypervisor, and the guest runs with
MSR[PR]=1, so these interrupts must test if a guest was running when
interrupted. These tests are done at the real-mode entry points only
because PR-KVM runs with LPCR[AIL]=0, so no virt-mode interrupt entry.

In HV KVM and nested HV KVM, the guest always receives these interrupts,
so there is no need for the host to make this test.

Therefore, remove these tests if PR KVM is not configured.

Improve the KVM interrupt comments, which explain this change and other
KVM interrupt delivery issues.

Signed-off-by: Nicholas Piggin 
---

 arch/powerpc/kernel/exceptions-64s.S | 76 ++--
 1 file changed, 61 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 2963b46f9580..b4d39330789d 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -215,23 +215,37 @@ do_define_int n
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 /*
- * All interrupts which set HSRR registers (EXC_HV) as well as SRESET and
- * MCE and syscall when invoked with "sc 1" switch to the hypervisor to
- * be taken, so all generally need to test whether they were taken in guest
+ * All interrupts which set HSRR registers (EXC_HV) as well as SRESET and MCE
+ * and syscall when invoked with "sc 1" switch to MSR[HV]=1 (HVMODE) to be
+ * taken, so they all generally need to test whether they were taken in guest
  * context.
  *
- * SRESET and MCE may also be sent to the guest by the hypervisor.
- *
- * Interrupts which set SRR registers except SRESET and MCE and sc 1 are not
- * elevated to the hypervisor, though many can be taken when running in
- * hypervisor mode (e.g., bare metal kernel and userspace). These generally
- * need to test whether taken in guest context for PR KVM guests. PR KVM
- * does not enable AIL interrupts, so always takes them in real mode, which
- * is why these generally only need test the real-mode case.
- *
- * If hv is possible, interrupts come into to the hv version
- * of the kvmppc_interrupt code, which then jumps to the PR handler,
- * kvmppc_interrupt_pr, if the guest is a PR guest.
+ * Note: SRESET and MCE may also be sent to the guest by the hypervisor, and be
+ * taken with MSR[HV]=0.
+ *
+ * Interrupts which set SRR registers (with the above exceptions) do not
+ * elevate to MSR[HV]=1 mode, though most can be taken when running with
+ * MSR[HV]=1  (e.g., bare metal kernel and userspace). So these interrupts do
+ * not need to test whether a guest is running because they get delivered to
+ * the guest directly, including nested HV KVM guests.
+ *
+ * The exception is PR KVM, where the guest runs with MSR[PR]=1 and the host
+ * runs with MSR[HV]=0, so the host takes all interrupts on behalf of the
+ * guest. PR KVM runs with LPCR[AIL]=0 which causes interrupts to always be
+ * delivered to the real-mode entry point, therefore such interrupts only test
+ * KVM in their real mode handlers, and only when PR KVM is possible.
+ *
+ * Interrupts that are taken in MSR[HV]=0 and escalate to MSR[HV]=1 are always
+ * delivered in real-mode when the MMU is in hash mode because the MMU
+ * registers are not set appropriately to translate host addresses. In nested
+ * radix mode these can be delivered in virt-mode as the host translations are
+ * used implicitly (see: effective LPID, effective PID).
+ */
+
+/*
+ * If an interrupt is taken while a guest is running, it is immediately routed
+ * to KVM to handle. If both HV and PR KVM are possible, KVM interrupts go first
+ * to kvmppc_interrupt_hv, which handles the PR guest case.
  */
 #define kvmppc_interrupt kvmppc_interrupt_hv
 #else
@@ -1277,8 +1291,10 @@ INT_DEFINE_BEGIN(data_access)
IAREA=PACA_EXGEN
IDAR=1
IDSISR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_SKIP=1
IKVM_REAL=1
+#endif
 INT_DEFINE_END(data_access)
 
 EXC_REAL_BEGIN(data_access, 0x300, 0x80)
@@ -1326,8 +1342,10 @@ INT_DEFINE_BEGIN(data_access_slb)
IAREA=PACA_EXSLB
IRECONCILE=0
IDAR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_SKIP=1
IKVM_REAL=1
+#endif
 INT_DEFINE_END(data_access_slb)
 
 EXC_REAL_BEGIN(data_access_slb, 0x380, 0x80)
@@ -1379,7 +1397,9 @@ INT_DEFINE_BEGIN(instruction_access)
IISIDE=1
IDAR=1
IDSISR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(instruction_access)
 
 EXC_REAL_BEGIN(instruction_access, 0x400, 0x80)
@@ -1419,7 +1439,9 @@ INT_DEFINE_BEGIN(instruction_access_slb)
IRECONCILE=0
IISIDE=1
IDAR=1
+#ifdef CONFIG_KVM_BOOK3S_PR_POSSIBLE
IKVM_REAL=1
+#endif
 INT_DEFINE_END(instruction_access_slb)
 
 EXC_REAL_BEGIN(instruction_access_slb, 0x480, 

RE: [RFC PATCH] powerpc: Convert ____flush_dcache_icache_phys() to C

2019-08-21 Thread Alastair D'Silva
On Wed, 2019-08-21 at 22:27 +0200, Christophe Leroy wrote:
> 
> Le 20/08/2019 à 06:36, Alastair D'Silva a écrit :
> > On Fri, 2019-08-16 at 15:52 +, Christophe Leroy wrote:
> 
> [...]
> 
> > 
> > Thanks Christophe,
> > 
> > I'm trying a somewhat different approach that requires less
> > knowledge
> > of assembler. Handling of CPU_FTR_COHERENT_ICACHE is outside this
> > function. The code below is not a patch as my tree is a bit messy,
> > sorry:
> 
> Can we be 100% sure that GCC won't add any code accessing some
> global 
> data or stack while the Data MMU is OFF ?
> 
> Christophe
> 

+mpe

I'm not sure how we would go about making such a guarantee, but I've
tied every variable used to a register and addr is passed in a
register, so there is no stack usage, and every call in there operates
only on its operands.

The calls to the inline cache helpers (for the PPC32 case) all take
constants, so I can't see a reasonable scenario where a function call
would be generated and reordered to after the DR bit is turned off, but
I guess if we want to be paranoid, we could always add an mb() call
before the DR bit is manipulated to prevent the compiler from
reordering across the section where the data MMU is disabled.
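
Something along these lines (sketch only, reusing the helpers from the
function quoted below):

	msr = mfmsr();
	mb();			/* keep earlier accesses before DR is cleared */
	mtmsr(msr & ~(MSR_DR));

	/* ... dcbst/icbi loops on the physical address ... */

	mtmsr(msr);
	mb();			/* keep later accesses after DR is restored */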


> 
> > /**
> >   * flush_dcache_icache_phys() - Flush a page by it's physical
> > address
> >   * @addr: the physical address of the page
> >   */
> > static void flush_dcache_icache_phys(unsigned long addr)
> > {
> > register unsigned long msr;
> > register unsigned long dlines = PAGE_SIZE >> l1_dcache_shift();
> > register unsigned long dbytes = l1_dcache_bytes();
> > register unsigned long ilines = PAGE_SIZE >> l1_icache_shift();
> > register unsigned long ibytes = l1_icache_bytes();
> > register unsigned long i;
> > register unsigned long address = addr;
> > 
> > /*
> >  * Clear the DR bit so that we operate on physical
> >  * rather than virtual addresses
> >  */
> > msr = mfmsr();
> > mtmsr(msr & ~(MSR_DR));
> > 
> > /* Write out the data cache */
> > for (i = 0; i < dlines; i++, address += dbytes)
> > dcbst((void *)address);
> > 
> > /* Invalidate the instruction cache */
> > address = addr;
> > for (i = 0; i < ilines; i++, address += ibytes)
> > icbi((void *)address);
> > 
> > mtmsr(msr);
> > }
> > 
> > void test_flush_phys(unsigned long addr)
> > {
> > flush_dcache_icache_phys(addr);
> > }
> > 
> > 
> > This gives the following assembler (using pmac32_defconfig):
> > 03cc :
> >   3cc:   94 21 ff f0 stwur1,-16(r1)
> >   3d0:   7d 00 00 a6 mfmsr   r8
> >   3d4:   55 09 07 34 rlwinm  r9,r8,0,28,26
> >   3d8:   7d 20 01 24 mtmsr   r9
> >   3dc:   39 20 00 80 li  r9,128
> >   3e0:   7d 29 03 a6 mtctr   r9
> >   3e4:   39 43 10 00 addir10,r3,4096
> >   3e8:   7c 69 1b 78 mr  r9,r3
> >   3ec:   7c 00 48 6c dcbst   0,r9
> >   3f0:   39 29 00 20 addir9,r9,32
> >   3f4:   42 00 ff f8 bdnz3ec 
> >   3f8:   7c 00 1f ac icbi0,r3
> >   3fc:   38 63 00 20 addir3,r3,32
> >   400:   7f 8a 18 40 cmplw   cr7,r10,r3
> >   404:   40 9e ff f4 bne cr7,3f8 
> >   408:   7d 00 01 24 mtmsr   r8
> >   40c:   38 21 00 10 addir1,r1,16
> >   410:   4e 80 00 20 blr
> > 
> > 
-- 
Alastair D'Silva
Open Source Developer
Linux Technology Centre, IBM Australia
mob: 0423 762 819



Re: [PATCH v5 1/7] Documentation/powerpc: Ultravisor API

2019-08-21 Thread Claudio Carvalho


On 8/9/19 9:45 AM, Michael Ellerman wrote:
> Claudio Carvalho  writes:
>> From: Sukadev Bhattiprolu 
>>
>> POWER9 processor includes support for Protected Execution Facility (PEF).

>> Which POWER9? Please be more precise.
>>
>> It's public knowledge that some versions of Power9 don't have PEF (or
>> have it broken / fused off).
>>
>> People are going to try and test this on various chip revisions that are
>> out in the wild, we need to make it clear where it's expected to work
>> and where it's not.

Updated this part of the commit message to:

    Protected Execution Facility (PEF) is an architectural change for
    POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
    PEF adds a new higher privileged mode, called Ultravisor mode, to POWER
    architecture. Along with the new mode there is new firmware called the
    Protected Execution Ultravisor (or Ultravisor for short).
   
    POWER 9 DD2.3 chips (PVR=0x004e1203) or greater will be PEF-capable.

    Attached documentation ...


>
>> Attached documentation provides an overview of PEF and defines the API
>> for various interfaces that must be implemented in the Ultravisor
>> firmware as well as in the KVM Hypervisor.
>>
>> Based on input from Mike Anderson, Thiago Bauermann, Claudio Carvalho,
>> Ben Herrenschmidt, Guerney Hunt, Paul Mackerras.
>>
>> Signed-off-by: Sukadev Bhattiprolu 
>> Signed-off-by: Ram Pai 
>> Signed-off-by: Guerney Hunt 
>> Reviewed-by: Claudio Carvalho 
>> Reviewed-by: Michael Anderson 
>> Reviewed-by: Thiago Bauermann 
>> Signed-off-by: Claudio Carvalho 
>> ---
>>  Documentation/powerpc/ultravisor.rst | 1055 ++
>>  1 file changed, 1055 insertions(+)
>>  create mode 100644 Documentation/powerpc/ultravisor.rst
>>
>> diff --git a/Documentation/powerpc/ultravisor.rst 
>> b/Documentation/powerpc/ultravisor.rst
>> new file mode 100644
>> index ..8d5246585b66
>> --- /dev/null
>> +++ b/Documentation/powerpc/ultravisor.rst
>> @@ -0,0 +1,1055 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +.. _ultravisor:
>> +
>> +
>> +Protected Execution Facility
>> +
>> +
>> +.. contents::
>> +:depth: 3
>> +
>> +.. sectnum::
>> +:depth: 3
>> +
>> +Protected Execution Facility
>> +
>> +
>> +Protected Execution Facility (PEF) is an architectural change for
>> +POWER 9 that enables Secure Virtual Machines (SVMs). When enabled,
> Ditto here.
>
> Also you don't mention which ISA version PEF is (will be) documented in.
> Do we know? Or can we at least reference the RFC number so folks can
> find it.
>
> Otherwise this looks really good. I'll try and find time to proof read
> it thoroughly.
>
> cheers
>
>> +PEF adds a new higher privileged mode, called Ultravisor mode, to
>> +POWER architecture. Along with the new mode there is new firmware
>> +called the Protected Execution Ultravisor (or Ultravisor for short).
>> +Ultravisor mode is the highest privileged mode in POWER architecture.

Updated this part to:

    Protected Execution Facility (PEF) is an architectural change for
    POWER 9 that enables Secure Virtual Machines (SVMs). DD2.3 chips
    (PVR=0x004e1203) or greater will be PEF-capable. A new ISA release
    will include the PEF RFC02487 changes.

    When enabled, PEF adds a new higher privileged mode, called Ultravisor
    mode, to POWER architecture. Along with the new mode there is new
    firmware called the Protected Execution Ultravisor (or Ultravisor
    for short). Ultravisor mode is the highest privileged mode in POWER
    architecture.

Followed by the original table below.

Thanks Michael

Claudio.

>> +
>> ++--+
>> +| Privilege States |
>> ++==+
>> +|  Problem |
>> ++--+
>> +|  Supervisor  |
>> ++--+
>> +|  Hypervisor  |
>> ++--+
>> +|  Ultravisor  |
>> ++--+
>> +
>> +PEF protects SVMs from the hypervisor, privileged users, and other
>> +VMs in the system. SVMs are protected while at rest and can only be
>> +executed by an authorized machine. All virtual machines utilize
>> +hypervisor services. The Ultravisor filters calls between the SVMs
>> +and the hypervisor to assure that information does not accidentally
>> +leak. All hypercalls except H_RANDOM are reflected to the hypervisor.
>> +H_RANDOM is not reflected to prevent the hypervisor from influencing
>> +random values in the SVM.
>> +
>> +To support this there is a refactoring of the ownership of resources
>> +in the CPU. Some of the resources which were previously hypervisor
>> +privileged are now ultravisor privileged.
>> +
>> +Hardware
>> +
>> +
>> +The hardware changes include the following:
>> +
>> +* There is a new bit in the MSR that determines whether the current
>> +  process 

Re: [PATCH v5 1/7] Documentation/powerpc: Ultravisor API

2019-08-21 Thread Claudio Carvalho


On 8/12/19 12:58 PM, Fabiano Rosas wrote:
> Claudio Carvalho  writes:
>
> Some small suggestions below:
>
>> +
>> +* The privilege of a process is now determined by three MSR bits,
>> +  MSR(S, HV, PR). In each of the tables below the modes are listed
>> +  from least privilege to highest privilege. The higher privilege
>> +  modes can access all the resources of the lower privilege modes.
>> +
>> +  **Secure Mode MSR Settings**
>> +
>> +  +---+---+---+---+
>> +  | S | HV| PR|Privilege  |
>> +  +===+===+===+===+
>> +  | 1 | 0 | 1 | Problem   |
>> +  +---+---+---+---+
>> +  | 1 | 0 | 0 | Privileged(OS)|
>> +  +---+---+---+---+
>> +  | 1 | 1 | 0 | Ultravisor|
>> +  +---+---+---+---+
>> +  | 1 | 1 | 1 | Reserved  |
>> +  +---+---+---+---+
>> +
>> +  **Normal Mode MSR Settings**
>> +
>> +  +---+---+---+---+
>> +  | S | HV| PR|Privilege  |
>> +  +===+===+===+===+
>> +  | 0 | 0 | 1 | Problem   |
>> +  +---+---+---+---+
>> +  | 0 | 0 | 0 | Privileged(OS)|
>> +  +---+---+---+---+
>> +  | 0 | 1 | 0 | Hypervisor|
>> +  +---+---+---+---+
>> +  | 0 | 1 | 1 | Problem (HV)  |
>> +  +---+---+---+---+
> I find the use of '(HV)' in this last line a bit ambiguous. Since we are
> already using 'HV' to refer to MSR(HV). I'd suggest using '(Host)' or
> simply leaving it out.
>
>> +
>> +* Memory is partitioned into secure and normal memory. Only processes
>> +  that are running in secure mode can access secure memory.
>> +
>> +* The hardware does not allow anything that is not running secure to
>> +  access secure memory. This means that the Hypervisor cannot access
>> +  the memory of the SVM without using an ultracall (asking the
>> +  Ultravisor). The Ultravisor will only allow the hypervisor to see
>> +  the SVM memory encrypted.
>> +
>> +* I/O systems are not allowed to directly address secure memory. This
>> +  limits the SVMs to virtual I/O only.
>> +
>> +* The architecture allows the SVM to share pages of memory with the
>> +  hypervisor that are not protected with encryption. However, this
>> +  sharing must be initiated by the SVM.
>> +
>> +* When a process is running in secure mode all hypercalls
>> +  (syscall lev=1) are reflected to the Ultravisor.
> Here 'reflected' refers to the Ultravisor. Later on, it is used as
> meaning the Ultravisor reflects hypercalls/interrupts to the
> Hypervisor. I suggest we use this term to mean the latter only.
>
>> +
>> +* When a process is in secure mode all interrupts go to the
>> +  Ultravisor.
>> +
>> +* The following resources have become Ultravisor privileged and
>> +  require an Ultravisor interface to manipulate:
>> +
>> +  * Processor configurations registers (SCOMs).
>> +
>> +  * Stop state information.
>> +
>> +  * The debug registers CIABR, DAWR, and DAWRX become Ultravisor
>> +resources when SMFCTRL(D) is set. If SMFCTRL(D) is not set they do
> It looks like you could go without "become Ultravisor resources" since
> it is already mentioned in the parent item above (The following...).
>
>> +not work in secure mode. When set, reading and writing requires
>> +an Ultravisor call, otherwise that will cause a Hypervisor Emulation
>> +Assistance interrupt.
>> +
>> +  * PTCR and partition table entries (partition table is in secure
>> +memory). An attempt to write to PTCR will cause a Hypervisor
>> +Emulation Assitance interrupt.
>> +
>> +  * LDBAR (LD Base Address Register) and IMC (In-Memory Collection)
>> +non-architected registers. An attempt to write to them will cause a
>> +Hypervisor Emulation Assistance interrupt.
>> +
>> +  * Paging for an SVM, sharing of memory with Hypervisor for an SVM.
>> +(Including Virtual Processor Area (VPA) and virtual I/O).
>> +
>> +
>> +Software/Microcode
>> +==
>> +
>> +The software changes include:
>> +
>> +* SVMs are created from normal VM using (open source) tooling supplied
>> +  by IBM.
>> +
>> +* All SVMs start as normal VMs and utilize an ultracall, UV_ESM
>> +  (Enter Secure Mode), to make the transition.
>> +
>> +* When the UV_ESM ultracall is made the Ultravisor copies the VM into
>> +  secure memory, decrypts the verification information, and checks the
>> +  integrity of the SVM. If the integrity check passes the Ultravisor
>> +  passes control in secure mode.
>> +
>> +* For external interrupts the Ultravisor saves the state of the SVM,
>> +  and reflects the interrupt to the hypervisor for processing.
>> +  For hypercalls, the Ultravisor inserts neutral state into all
>> +  registers not needed for the hypercall 

Re: [PATCH] bpf: handle 32-bit zext during constant blinding

2019-08-21 Thread Jiong Wang


Naveen N. Rao writes:

> Since BPF constant blinding is performed after the verifier pass, the
> ALU32 instructions inserted for doubleword immediate loads don't have a
> corresponding zext instruction. This is causing a kernel oops on powerpc
> and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
>
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
>
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to 
> analysis result")
> Reported-by: Michael Ellerman 
> Signed-off-by: Naveen N. Rao 

Thanks for the fix.

Reviewed-by: Jiong Wang 

Just two other comments during review in case I am wrong on somewhere.

  - Using verifier_zext instead of bpf_jit_needs_zext() seems better, even
    though the latter could have avoided extending the function arguments.

    Because JIT back-ends look at verifier_zext: true means the zext was
    inserted by the verifier, so JITs won't do the code-gen themselves.

    Using verifier_zext sort of keeps JIT blinding in the same behaviour
    as the verifier; even though blinding doesn't belong to the verifier,
    for such insn patching it could be seen as an extension of the
    verifier, therefore using verifier_zext seems better than
    bpf_jit_needs_zext() to me.

  - JIT blinding also escapes the HI32 randomization which happens
    inside the verifier, otherwise the x86-64 regression run should have
    caught this issue.

Regards,
Jiong

> ---
> Changes since RFC:
> - Removed changes to ALU32 and JMP32 ops since those don't alter program 
>   execution, and the verifier would have already accounted for them.  
>
>
>  kernel/bpf/core.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..66088a9e9b9e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -890,7 +890,8 @@ int bpf_jit_get_func_addr(const struct bpf_prog *prog,
>  
>  static int bpf_jit_blind_insn(const struct bpf_insn *from,
> const struct bpf_insn *aux,
> -   struct bpf_insn *to_buff)
> +   struct bpf_insn *to_buff,
> +   bool emit_zext)
>  {
>   struct bpf_insn *to = to_buff;
>   u32 imm_rnd = get_random_int();
> @@ -1005,6 +1006,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn 
> *from,
>   case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
>   *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ 
> aux[0].imm);
>   *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
> + if (emit_zext)
> + *to++ = BPF_ZEXT_REG(BPF_REG_AX);
>   *to++ = BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX);
>   break;
>  
> @@ -1088,7 +1091,8 @@ struct bpf_prog *bpf_jit_blind_constants(struct 
> bpf_prog *prog)
>   insn[1].code == 0)
>   memcpy(aux, insn, sizeof(aux));
>  
> - rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
> + rewritten = bpf_jit_blind_insn(insn, aux, insn_buff,
> + clone->aux->verifier_zext);
>   if (!rewritten)
>   continue;



Re: [RFC PATCH] powerpc: Convert ____flush_dcache_icache_phys() to C

2019-08-21 Thread Christophe Leroy




Le 20/08/2019 à 06:36, Alastair D'Silva a écrit :

On Fri, 2019-08-16 at 15:52 +, Christophe Leroy wrote:


[...]




Thanks Christophe,

I'm trying a somewhat different approach that requires less knowledge
of assembler. Handling of CPU_FTR_COHERENT_ICACHE is outside this
function. The code below is not a patch as my tree is a bit messy,
sorry:


Can we be 100% sure that GCC won't add any code accessing some global 
data or stack while the Data MMU is OFF ?


Christophe




/**
  * flush_dcache_icache_phys() - Flush a page by it's physical address
  * @addr: the physical address of the page
  */
static void flush_dcache_icache_phys(unsigned long addr)
{
register unsigned long msr;
register unsigned long dlines = PAGE_SIZE >> l1_dcache_shift();
register unsigned long dbytes = l1_dcache_bytes();
register unsigned long ilines = PAGE_SIZE >> l1_icache_shift();
register unsigned long ibytes = l1_icache_bytes();
register unsigned long i;
register unsigned long address = addr;

/*
 * Clear the DR bit so that we operate on physical
 * rather than virtual addresses
 */
msr = mfmsr();
mtmsr(msr & ~(MSR_DR));

/* Write out the data cache */
for (i = 0; i < dlines; i++, address += dbytes)
dcbst((void *)address);

/* Invalidate the instruction cache */
address = addr;
for (i = 0; i < ilines; i++, address += ibytes)
icbi((void *)address);

mtmsr(msr);
}

void test_flush_phys(unsigned long addr)
{
flush_dcache_icache_phys(addr);
}


This gives the following assembler (using pmac32_defconfig):
03cc :
  3cc:   94 21 ff f0 stwur1,-16(r1)
  3d0:   7d 00 00 a6 mfmsr   r8
  3d4:   55 09 07 34 rlwinm  r9,r8,0,28,26
  3d8:   7d 20 01 24 mtmsr   r9
  3dc:   39 20 00 80 li  r9,128
  3e0:   7d 29 03 a6 mtctr   r9
  3e4:   39 43 10 00 addir10,r3,4096
  3e8:   7c 69 1b 78 mr  r9,r3
  3ec:   7c 00 48 6c dcbst   0,r9
  3f0:   39 29 00 20 addir9,r9,32
  3f4:   42 00 ff f8 bdnz3ec 
  3f8:   7c 00 1f ac icbi0,r3
  3fc:   38 63 00 20 addir3,r3,32
  400:   7f 8a 18 40 cmplw   cr7,r10,r3
  404:   40 9e ff f4 bne cr7,3f8 
  408:   7d 00 01 24 mtmsr   r8
  40c:   38 21 00 10 addir1,r1,16
  410:   4e 80 00 20 blr




Re: [PATCH v4 1/7] kexec: add KEXEC_ELF

2019-08-21 Thread Thiago Jung Bauermann


Thiago Jung Bauermann  writes:

>> diff --git a/arch/powerpc/kernel/kexec_elf_64.c b/kernel/kexec_elf.c
>> similarity index 71%
>> copy from arch/powerpc/kernel/kexec_elf_64.c
>> copy to kernel/kexec_elf.c
>> index ba4f18a43ee8..6e9f52171ede 100644
>> --- a/arch/powerpc/kernel/kexec_elf_64.c
>> +++ b/kernel/kexec_elf.c
>> @@ -1,33 +1,10 @@
>> -/*
>> - * Load ELF vmlinux file for the kexec_file_load syscall.
>> - *
>> - * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
>> - * Copyright (C) 2004  IBM Corp.
>> - * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
>> - * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
>> - * Copyright (C) 2016  IBM Corporation
>> - *
>> - * Based on kexec-tools' kexec-elf-exec.c and kexec-elf-ppc64.c.
>> - * Heavily modified for the kernel by
>> - * Thiago Jung Bauermann .
>> - *
>> - * This program is free software; you can redistribute it and/or modify
>> - * it under the terms of the GNU General Public License as published by
>> - * the Free Software Foundation (version 2 of the License).
>> - *
>> - * This program is distributed in the hope that it will be useful,
>> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> - * GNU General Public License for more details.
>> - */
>> +// SPDX-License-Identifier: GPL-2.0-only
>
> I may be wrong, but my understanding of the SPDX license identifier is
> that it substitutes the license text (i.e., the last two paragraphs
> above), but not the copyright statements. Is it ok to have a file with a
> SPDX license identifier but no copyright statement?

Answering my own question: I just came across commit b24413180f56
("License cleanup: add SPDX GPL-2.0 license identifier to files with no
license") which adds SPDX license identifiers to a lot of files without
any copyright statement so I conclude that it is indeed ok to not have
copyright statements in a file.

In this instance the new file is heavily based on the old one though, so
IMHO it makes sense for it to inherit the copyright statements from the
original file.

--
Thiago Jung Bauermann
IBM Linux Technology Center


[PATCH] powerpc/8xx: drop unused self-modifying code alternative to FixupDAR.

2019-08-21 Thread Christophe Leroy
The code which fixes up the DAR on TLB errors for dcbX instructions
has a self-modifying code alternative that has never been used.

Drop it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 24 
 1 file changed, 24 deletions(-)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index b8ca5b43e587..19f583e18402 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -575,8 +575,6 @@ InstructionBreakpoint:
  * by decoding the registers used by the dcbx instruction and adding them.
  * DAR is set to the calculated address.
  */
- /* define if you don't want to use self modifying code */
-#define NO_SELF_MODIFYING_CODE
 FixupDAR:/* Entry point for dcbx workaround. */
mtspr   SPRN_M_TW, r10
/* fetch instruction from memory. */
@@ -640,27 +638,6 @@ FixupDAR:/* Entry point for dcbx workaround. */
rlwinm  r10, r10,0,7,5  /* Clear store bit for buggy dcbst insn */
mtspr   SPRN_DSISR, r10
 142:   /* continue, it was a dcbx, dcbi instruction. */
-#ifndef NO_SELF_MODIFYING_CODE
-   andis.  r10,r11,0x1f/* test if reg RA is r0 */
-   li  r10,modified_instr@l
-   dcbtst  r0,r10  /* touch for store */
-   rlwinm  r11,r11,0,0,20  /* Zero lower 10 bits */
-   orisr11,r11,640 /* Transform instr. to a "add r10,RA,RB" */
-   ori r11,r11,532
-   stw r11,0(r10)  /* store add/and instruction */
-   dcbf0,r10   /* flush new instr. to memory. */
-   icbi0,r10   /* invalidate instr. cache line */
-   mfspr   r11, SPRN_SPRG_SCRATCH1 /* restore r11 */
-   mfspr   r10, SPRN_SPRG_SCRATCH0 /* restore r10 */
-   isync   /* Wait until new instr is loaded from memory */
-modified_instr:
-   .space  4   /* this is where the add instr. is stored */
-   bne+143f
-   subfr10,r0,r10  /* r10=r10-r0, only if reg RA is r0 */
-143:   mtdar   r10 /* store faulting EA in DAR */
-   mfspr   r10,SPRN_M_TW
-   b   DARFixed/* Go back to normal TLB handling */
-#else
mfctr   r10
mtdar   r10 /* save ctr reg in DAR */
rlwinm  r10, r11, 24, 24, 28/* offset into jump table for reg RB */
@@ -724,7 +701,6 @@ modified_instr:
add r10, r10, r11   /* add it */
mfctr   r11 /* restore r11 */
b   151b
-#endif
 
 /*
  * This is where the main kernel code starts.
-- 
2.13.3



Re: [PATCHv3 1/3] dt-bindings: pci: layerscape-pci: add compatible strings "fsl,ls1028a-pcie"

2019-08-21 Thread Rob Herring
On Tue,  6 Aug 2019 14:15:51 +0800, Xiaowei Bao wrote:
> Add the PCIe compatible string for LS1028A
> 
> Signed-off-by: Xiaowei Bao 
> Signed-off-by: Hou Zhiqiang 
> ---
> v2:
>  - no change.
> v3:
>  - no change.
> 
>  Documentation/devicetree/bindings/pci/layerscape-pci.txt | 1 +
>  1 file changed, 1 insertion(+)
> 

Reviewed-by: Rob Herring 


[PATCH] bpf: handle 32-bit zext during constant blinding

2019-08-21 Thread Naveen N. Rao
Since BPF constant blinding is performed after the verifier pass, the
ALU32 instructions inserted for doubleword immediate loads don't have a
corresponding zext instruction. This is causing a kernel oops on powerpc
and can be reproduced by running 'test_cgroup_storage' with
bpf_jit_harden=2.

Fix this by emitting BPF_ZEXT during constant blinding if
prog->aux->verifier_zext is set.

Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to 
analysis result")
Reported-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
Changes since RFC:
- Removed changes to ALU32 and JMP32 ops since those don't alter program 
  execution, and the verifier would have already accounted for them.  


 kernel/bpf/core.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 8191a7db2777..66088a9e9b9e 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -890,7 +890,8 @@ int bpf_jit_get_func_addr(const struct bpf_prog *prog,
 
 static int bpf_jit_blind_insn(const struct bpf_insn *from,
  const struct bpf_insn *aux,
- struct bpf_insn *to_buff)
+ struct bpf_insn *to_buff,
+ bool emit_zext)
 {
struct bpf_insn *to = to_buff;
u32 imm_rnd = get_random_int();
@@ -1005,6 +1006,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
*to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ 
aux[0].imm);
*to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
+   if (emit_zext)
+   *to++ = BPF_ZEXT_REG(BPF_REG_AX);
*to++ = BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX);
break;
 
@@ -1088,7 +1091,8 @@ struct bpf_prog *bpf_jit_blind_constants(struct bpf_prog 
*prog)
insn[1].code == 0)
memcpy(aux, insn, sizeof(aux));
 
-   rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
+   rewritten = bpf_jit_blind_insn(insn, aux, insn_buff,
+   clone->aux->verifier_zext);
if (!rewritten)
continue;
 
-- 
2.22.0



Re: Regression fix for bpf in v5.3 (was Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding)

2019-08-21 Thread Naveen N. Rao

Jiong Wang wrote:


Michael Ellerman writes:


"Naveen N. Rao"  writes:

Since BPF constant blinding is performed after the verifier pass, there
are certain ALU32 instructions inserted which don't have a corresponding
zext instruction inserted after. This is causing a kernel oops on
powerpc and can be reproduced by running 'test_cgroup_storage' with
bpf_jit_harden=2.

Fix this by emitting BPF_ZEXT during constant blinding if
prog->aux->verifier_zext is set.

Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis 
result")
Reported-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
This approach (the location where zext is being introduced below, in 
particular) works for powerpc, but I am not entirely sure if this is 
sufficient for other architectures as well. This is broken on v5.3-rc4.


Any comment on this?


Have commented on https://marc.info/?l=linux-netdev=156637836024743=2

The fix looks correct to me on "BPF_LD | BPF_IMM | BPF_DW", but looks
unnecessary on two other places. It would be great if you or Naveen could
confirm it.


Jiong,
Thanks for the review. I can now see why the other two changes are not 
necessary. I will post a follow-on patch.


Thanks!
- Naveen



Re: [PATCH v2 4/4] powerpc: load firmware trusted keys into kernel keyring

2019-08-21 Thread Greg Kroah-Hartman
On Wed, Aug 21, 2019 at 11:08:23AM -0400, Nayna Jain wrote:
> The keys used to verify the Host OS kernel are managed by OPAL as secure
> variables. This patch loads the verification keys into the .platform
> keyring and revocation keys into .blacklist keyring. This enables
> verification and loading of the kernels signed by the boot time keys which
> are trusted by firmware.
> 
> Signed-off-by: Nayna Jain 
> ---
>  security/integrity/Kconfig|  9 ++
>  security/integrity/Makefile   |  3 +
>  .../integrity/platform_certs/load_powerpc.c   | 94 +++
>  3 files changed, 106 insertions(+)
>  create mode 100644 security/integrity/platform_certs/load_powerpc.c
> 
> diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig
> index 0bae6adb63a9..2b4109c157e2 100644
> --- a/security/integrity/Kconfig
> +++ b/security/integrity/Kconfig
> @@ -72,6 +72,15 @@ config LOAD_IPL_KEYS
> depends on S390
> def_bool y
>  
> +config LOAD_PPC_KEYS
> + bool "Enable loading of platform and revocation keys for POWER"
> + depends on INTEGRITY_PLATFORM_KEYRING
> + depends on PPC_SECURE_BOOT
> + def_bool y

def_bool y is only for things where the system will not boot if they are
not enabled because you added a new feature.  Otherwise just do not set
the default.
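
I.e. something like (sketch):

	config LOAD_PPC_KEYS
		bool "Enable loading of platform and revocation keys for POWER"
		depends on INTEGRITY_PLATFORM_KEYRING
		depends on PPC_SECURE_BOOT
		help
		  ...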

> + help
> +   Enable loading of db keys to the .platform keyring and dbx keys to
> +   the .blacklist keyring for powerpc based platforms.
> +
>  config INTEGRITY_AUDIT
>   bool "Enables integrity auditing support "
>   depends on AUDIT
> diff --git a/security/integrity/Makefile b/security/integrity/Makefile
> index 525bf1d6e0db..9eeb6b053de3 100644
> --- a/security/integrity/Makefile
> +++ b/security/integrity/Makefile
> @@ -14,6 +14,9 @@ integrity-$(CONFIG_LOAD_UEFI_KEYS) += 
> platform_certs/efi_parser.o \
> platform_certs/load_uefi.o \
> platform_certs/keyring_handler.o
>  integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
> +integrity-$(CONFIG_LOAD_PPC_KEYS) += platform_certs/efi_parser.o \
> +  platform_certs/load_powerpc.o \
> +  platform_certs/keyring_handler.o
>  $(obj)/load_uefi.o: KBUILD_CFLAGS += -fshort-wchar
  
>  subdir-$(CONFIG_IMA) += ima
> diff --git a/security/integrity/platform_certs/load_powerpc.c 
> b/security/integrity/platform_certs/load_powerpc.c
> new file mode 100644
> index ..f4d869171062
> --- /dev/null
> +++ b/security/integrity/platform_certs/load_powerpc.c
> @@ -0,0 +1,94 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2019 IBM Corporation
> + * Author: Nayna Jain 
> + *
> + * load_powernv.c

That's not the name of this file :(

And the perfect example of why you NEVER have the name of the file in
the file itself, as it's not needed and easy to get wrong :)

thanks,

greg k-h


Re: [PATCH v2 3/4] x86/efi: move common keyring handler functions to new file

2019-08-21 Thread Greg Kroah-Hartman
On Wed, Aug 21, 2019 at 11:08:22AM -0400, Nayna Jain wrote:
> This patch moves the common code to keyring_handler.c

That says _what_ you are doing, but not _why_ you are doing it.  We have
no idea :(



Re: [PATCH v2 2/4] powerpc: expose secure variables to userspace via sysfs

2019-08-21 Thread Greg Kroah-Hartman
On Wed, Aug 21, 2019 at 11:08:21AM -0400, Nayna Jain wrote:
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-secvar
> @@ -0,0 +1,27 @@
> +What:/sys/firmware/secvar
> +Date:August 2019
> +Contact: Nayna Jain 
> +Description:
> + This directory exposes interfaces for interacting with
> + the secure variables managed by OPAL firmware.
> +
> + This is only for the powerpc/powernv platform.
> +
> + Directory:
> + vars:   This directory lists all the variables that
> + are supported by the OPAL. The variables are
> + represented in the form of directories with
> + their variable names. The variable name is
> + unique and is in ASCII representation. The data
> + and size can be determined by reading their
> + respective attribute files.
> +
> + Each variable directory has the following files:
> + name:   An ASCII representation of the variable name
> + data:   A read-only file containing the value of the
> + variable
> + size:   An integer representation of the size of the
> + content of the variable. In other words, it
> + represents the size of the data
> + update: A write-only file that is used to submit the new
> + value for the variable.

Can you break this out into one-entry-per-file like most other entries
are defined?  That makes it easier for tools to parse (specifically the
tool in the tree right now...)
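
I.e. one stanza per file, along the lines of (sketch only, same
descriptions as above):

	What:		/sys/firmware/secvar/vars/<variable_name>/data
	Date:		August 2019
	Contact:	Nayna Jain
	Description:	A read-only file containing the value of the
			variable.

and similarly for the name, size and update files.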



> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 42109682b727..b4bdf77837b2 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -925,6 +925,15 @@ config PPC_SECURE_BOOT
> allows user to enable OS Secure Boot on PowerPC systems that
> have firmware secure boot support.
>  
> +config SECVAR_SYSFS
> +tristate "Enable sysfs interface for POWER secure variables"
> +depends on PPC_SECURE_BOOT

No depends on SYSFS?

> +help
> +  POWER secure variables are managed and controlled by firmware.
> +  These variables are exposed to userspace via sysfs to enable
> +  read/write operations on these variables. Say Y if you have
> +   secure boot enabled and want to expose variables to userspace.

Mix of tabs and spaces :(

> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
> index 9041563f1c74..4ea7b738c3a3 100644
> --- a/arch/powerpc/kernel/Makefile
> +++ b/arch/powerpc/kernel/Makefile
> @@ -158,6 +158,7 @@ obj-$(CONFIG_EPAPR_PARAVIRT)  += epapr_paravirt.o 
> epapr_hcalls.o
>  obj-$(CONFIG_KVM_GUEST)  += kvm.o kvm_emul.o
>  
>  obj-$(CONFIG_PPC_SECURE_BOOT)+= secboot.o ima_arch.o secvar-ops.o
> +obj-$(CONFIG_SECVAR_SYSFS) += secvar-sysfs.o

No tab?

>  
>  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
>  GCOV_PROFILE_prom_init.o := n
> diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
> b/arch/powerpc/kernel/secvar-sysfs.c
> new file mode 100644
> index ..e46986bb29a0
> --- /dev/null
> +++ b/arch/powerpc/kernel/secvar-sysfs.c
> @@ -0,0 +1,210 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2019 IBM Corporation 
> + *
> + * This code exposes secure variables to user via sysfs
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +//Approximating it for now, it is bound to change.

" " before "A" here please.

> +#define VARIABLE_MAX_SIZE  32000
> +
> +static struct kobject *powerpc_kobj;
> +static struct secvar_operations *secvarops;
> +struct kset *secvar_kset;
> +
> +static ssize_t name_show(struct kobject *kobj, struct kobj_attribute *attr,
> +  char *buf)
> +{
> + return sprintf(buf, "%s", kobj->name);
> +}

Why do you need this entry as it is the directory name?  Userspace
already "knows" it if they can open this file.


> +
> +static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
> +  char *buf)
> +{
> + unsigned long dsize;
> + int rc;
> +
> + rc = secvarops->get_variable(kobj->name, strlen(kobj->name) + 1, NULL,
> +  );
> + if (rc) {
> + pr_err("Error retrieving variable size %d\n", rc);
> + return rc;
> + }
> +
> + rc = sprintf(buf, "%ld", dsize);
> +
> + return rc;
> +}
> +
> +static ssize_t data_read(struct file *filep, struct kobject *kobj,
> +  struct bin_attribute *attr, char *buf, loff_t off,
> +  size_t count)
> +{
> + unsigned long dsize;
> + int rc;
> 

Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Christophe Leroy




Le 21/08/2019 à 14:15, Segher Boessenkool a écrit :

On Wed, Aug 21, 2019 at 01:50:52PM +0200, Christophe Leroy wrote:

Do you have any idea on how to avoid that bcl/mflr stuff ?


Do a load from some fixed address?  Maybe an absolute address, even?
lwz r3,-12344(0)  or similar (that address is in kernel space...)

There aren't many options, and certainly not many *good* options!



IIUC, the VDSO is seen by apps the same way as a dynamic lib. Couldn't 
the relocation be done only once when the app loads the VDSO as for a 
regular .so lib ?


It looks like it is what others do, at least x86 and arm64, unless I 
misunderstood their code.


Christophe


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Nathan Lynch
Christophe Leroy  writes:
> Le 21/08/2019 à 11:29, Santosh Sivaraj a écrit :
>> __get_datapage() is only a few instructions to retrieve the
>> address of the page where the kernel stores data to the VDSO.
>> 
>> By inlining this function into its users, a bl/blr pair and
>> a mflr/mtlr pair is avoided, plus a few reg moves.
>> 
>> clock-gettime-monotonic: syscall: 514 nsec/call  396 nsec/call
>> clock-gettime-monotonic:libc: 25 nsec/call   24 nsec/call
>> clock-gettime-monotonic:vdso: 20 nsec/call   20 nsec/call
>> clock-getres-monotonic: syscall: 347 nsec/call   372 nsec/call
>> clock-getres-monotonic:libc: 19 nsec/call19 nsec/call
>> clock-getres-monotonic:vdso: 10 nsec/call10 nsec/call
>> clock-gettime-monotonic-coarse: syscall: 511 nsec/call   396 nsec/call
>> clock-gettime-monotonic-coarse:libc: 23 nsec/call21 nsec/call
>> clock-gettime-monotonic-coarse:vdso: 15 nsec/call13 nsec/call
>> clock-gettime-realtime: syscall: 526 nsec/call   405 nsec/call
>> clock-gettime-realtime:libc: 24 nsec/call23 nsec/call
>> clock-gettime-realtime:vdso: 18 nsec/call18 nsec/call
>> clock-getres-realtime: syscall: 342 nsec/call372 nsec/call
>> clock-getres-realtime:libc: 19 nsec/call 19 nsec/call
>> clock-getres-realtime:vdso: 10 nsec/call 10 nsec/call
>> clock-gettime-realtime-coarse: syscall: 515 nsec/call373 nsec/call
>> clock-gettime-realtime-coarse:libc: 23 nsec/call 22 nsec/call
>> clock-gettime-realtime-coarse:vdso: 14 nsec/call 13 nsec/call
>
> I think you should only put the measurements on vdso calls, and only the 
> ones that are impacted by the change. For exemple, getres function 
> doesn't use __get_datapage so showing it here is pointless.

I agree with this point, but also, I would caution against using
vdsotest's benchmark function for anything like rigorous performance
analysis. The intention was to roughly confirm the VDSO's relative
performance vs the in-kernel implementations. Not to compare one VDSO
implementation of (say) clock_gettime to another.

I suggest using perf to confirm the expected effects of the change, if
possible.


[PATCH v1 5/5] mm/memory_hotplug: Remove zone parameter from __remove_pages()

2019-08-21 Thread David Hildenbrand
No longer in use, let's drop it. We no longer access the zone of
possibly never-onlined memory (and therefore don't read garbage in
these scenarios).

Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: x...@kernel.org
Cc: Andrew Morton 
Cc: Mark Rutland 
Cc: Steve Capper 
Cc: Mike Rapoport 
Cc: Anshuman Khandual 
Cc: Yu Zhao 
Cc: Jun Yao 
Cc: Robin Murphy 
Cc: Michal Hocko 
Cc: Oscar Salvador 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Christophe Leroy 
Cc: "Aneesh Kumar K.V" 
Cc: Pavel Tatashin 
Cc: Gerald Schaefer 
Cc: Halil Pasic 
Cc: Tom Lendacky 
Cc: Greg Kroah-Hartman 
Cc: Masahiro Yamada 
Cc: Dan Williams 
Cc: Wei Yang 
Cc: Qian Cai 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Ira Weiny 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: David Hildenbrand 
---
 arch/arm64/mm/mmu.c| 4 +---
 arch/ia64/mm/init.c| 4 +---
 arch/powerpc/mm/mem.c  | 3 +--
 arch/s390/mm/init.c| 4 +---
 arch/sh/mm/init.c  | 4 +---
 arch/x86/mm/init_32.c  | 4 +---
 arch/x86/mm/init_64.c  | 4 +---
 include/linux/memory_hotplug.h | 4 ++--
 mm/memory_hotplug.c| 6 +++---
 mm/memremap.c  | 3 +--
 10 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index e67bab4d613e..b3843aff12bf 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1080,7 +1080,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
 
/*
 * FIXME: Cleanup page tables (also in arch_add_memory() in case
@@ -1089,7 +1088,6 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
 * unlocked yet.
 */
-   zone = page_zone(pfn_to_page(start_pfn));
-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index bf9df2625bc8..a6dd80a2c939 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
 
-   zone = page_zone(pfn_to_page(start_pfn));
-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9191a66b3bc5..7351c44c435a 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap);
int ret;
 
-   __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+   __remove_pages(start_pfn, nr_pages, altmap);
 
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index 20340a03ad90..6f13eb66e375 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -296,10 +296,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
 
-   zone = page_zone(pfn_to_page(start_pfn));
-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(start_pfn, nr_pages, altmap);
vmem_remove_mapping(start, size);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index dfdbaa50946e..d1b1ff2be17a 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size,
 {
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
-   struct zone *zone;
 
-   zone = page_zone(pfn_to_page(start_pfn));
-   __remove_pages(zone, start_pfn, nr_pages, altmap);
+   __remove_pages(start_pfn, nr_pages, altmap);
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 4068abb9427f..9d036be27aaa 100644
--- a/arch/x86/mm/init_32.c
+++ 

[PATCH] powerpc/mm: tell if a bad page fault on data is read or write.

2019-08-21 Thread Christophe Leroy
DSISR has a bit to tell if the fault is due to a read or a write.

Display it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/fault.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8432c281de92..b5047f9b5dec 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -645,6 +645,7 @@ NOKPROBE_SYMBOL(do_page_fault);
 void bad_page_fault(struct pt_regs *regs, unsigned long address, int sig)
 {
const struct exception_table_entry *entry;
+   int is_write = page_fault_is_write(regs->dsisr);
 
/* Are we prepared to handle this fault?  */
if ((entry = search_exception_tables(regs->nip)) != NULL) {
@@ -658,9 +659,10 @@ void bad_page_fault(struct pt_regs *regs, unsigned long 
address, int sig)
case 0x300:
case 0x380:
case 0xe00:
-   pr_alert("BUG: %s at 0x%08lx\n",
+   pr_alert("BUG: %s on %s at 0x%08lx\n",
 regs->dar < PAGE_SIZE ? "Kernel NULL pointer 
dereference" :
-"Unable to handle kernel data access", regs->dar);
+"Unable to handle kernel data access",
+is_write ? "write" : "read", regs->dar);
break;
case 0x400:
case 0x480:
-- 
2.13.3



[PATCH v2 4/4] powerpc: load firmware trusted keys into kernel keyring

2019-08-21 Thread Nayna Jain
The keys used to verify the Host OS kernel are managed by OPAL as secure
variables. This patch loads the verification keys into the .platform
keyring and revocation keys into the .blacklist keyring. This enables
verification and loading of kernels signed by the boot-time keys that
are trusted by firmware.

Signed-off-by: Nayna Jain 
---
 security/integrity/Kconfig|  9 ++
 security/integrity/Makefile   |  3 +
 .../integrity/platform_certs/load_powerpc.c   | 94 +++
 3 files changed, 106 insertions(+)
 create mode 100644 security/integrity/platform_certs/load_powerpc.c

diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig
index 0bae6adb63a9..2b4109c157e2 100644
--- a/security/integrity/Kconfig
+++ b/security/integrity/Kconfig
@@ -72,6 +72,15 @@ config LOAD_IPL_KEYS
depends on S390
def_bool y
 
+config LOAD_PPC_KEYS
+   bool "Enable loading of platform and revocation keys for POWER"
+   depends on INTEGRITY_PLATFORM_KEYRING
+   depends on PPC_SECURE_BOOT
+   def_bool y
+   help
+ Enable loading of db keys to the .platform keyring and dbx keys to
+ the .blacklist keyring for powerpc based platforms.
+
 config INTEGRITY_AUDIT
bool "Enables integrity auditing support "
depends on AUDIT
diff --git a/security/integrity/Makefile b/security/integrity/Makefile
index 525bf1d6e0db..9eeb6b053de3 100644
--- a/security/integrity/Makefile
+++ b/security/integrity/Makefile
@@ -14,6 +14,9 @@ integrity-$(CONFIG_LOAD_UEFI_KEYS) += 
platform_certs/efi_parser.o \
  platform_certs/load_uefi.o \
  platform_certs/keyring_handler.o
 integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
+integrity-$(CONFIG_LOAD_PPC_KEYS) += platform_certs/efi_parser.o \
+platform_certs/load_powerpc.o \
+platform_certs/keyring_handler.o
 $(obj)/load_uefi.o: KBUILD_CFLAGS += -fshort-wchar
 
 subdir-$(CONFIG_IMA)   += ima
diff --git a/security/integrity/platform_certs/load_powerpc.c 
b/security/integrity/platform_certs/load_powerpc.c
new file mode 100644
index ..f4d869171062
--- /dev/null
+++ b/security/integrity/platform_certs/load_powerpc.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ * load_powernv.c
+ *  - loads keys and certs stored and controlled
+ *  by the firmware.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "keyring_handler.h"
+
+static struct secvar_operations *secvarops;
+
+/*
+ * Get a certificate list blob from the named EFI variable.
+ */
+static __init void *get_cert_list(u8 *key, unsigned long keylen,
+ unsigned long *size)
+{
+   int rc;
+   void *db;
+
+   rc = secvarops->get_variable(key, keylen, NULL, size);
+   if (rc) {
+   pr_err("Couldn't get size: %d\n", rc);
+   return NULL;
+   }
+
+   db = kmalloc(*size, GFP_KERNEL);
+   if (!db)
+   return NULL;
+
+   rc = secvarops->get_variable(key, keylen, db, size);
+   if (rc) {
+   kfree(db);
+   pr_err("Error reading db var: %d\n", rc);
+   return NULL;
+   }
+
+   return db;
+}
+
+/*
+ * Load the certs contained in the UEFI databases into the platform trusted
+ * keyring and the UEFI blacklisted X.509 cert SHA256 hashes into the blacklist
+ * keyring.
+ */
+static int __init load_powerpc_certs(void)
+{
+   void *db = NULL, *dbx = NULL;
+   unsigned long dbsize = 0, dbxsize = 0;
+   int rc = 0;
+
+   secvarops = get_secvar_ops();
+   if (!secvarops)
+   return -ENOENT;
+
+   /* Get db, and dbx.  They might not exist, so it isn't
+* an error if we can't get them.
+*/
+   db = get_cert_list("db", 3, &dbsize);
+   if (!db) {
+   pr_err("Couldn't get db list from OPAL\n");
+   } else {
+   rc = parse_efi_signature_list("OPAL:db",
+   db, dbsize, get_handler_for_db);
+   if (rc)
+   pr_err("Couldn't parse db signatures: %d\n",
+   rc);
+   kfree(db);
+   }
+
+   dbx = get_cert_list("dbx", 3, &dbxsize);
+   if (!dbx) {
+   pr_info("Couldn't get dbx list from OPAL\n");
+   } else {
+   rc = parse_efi_signature_list("OPAL:dbx",
+   dbx, dbxsize,
+   get_handler_for_dbx);
+   if (rc)
+   pr_err("Couldn't parse dbx signatures: %d\n", rc);
+   kfree(dbx);
+   }
+
+   return rc;
+}
+late_initcall(load_powerpc_certs);
-- 
2.20.1



[PATCH v2 3/4] x86/efi: move common keyring handler functions to new file

2019-08-21 Thread Nayna Jain
This patch moves the common keyring handler code to keyring_handler.c.

Signed-off-by: Nayna Jain 
---
 security/integrity/Makefile   |  3 +-
 .../platform_certs/keyring_handler.c  | 80 +++
 .../platform_certs/keyring_handler.h  | 35 
 security/integrity/platform_certs/load_uefi.c | 67 +---
 4 files changed, 118 insertions(+), 67 deletions(-)
 create mode 100644 security/integrity/platform_certs/keyring_handler.c
 create mode 100644 security/integrity/platform_certs/keyring_handler.h

diff --git a/security/integrity/Makefile b/security/integrity/Makefile
index 19faace69644..525bf1d6e0db 100644
--- a/security/integrity/Makefile
+++ b/security/integrity/Makefile
@@ -11,7 +11,8 @@ integrity-$(CONFIG_INTEGRITY_SIGNATURE) += digsig.o
 integrity-$(CONFIG_INTEGRITY_ASYMMETRIC_KEYS) += digsig_asymmetric.o
 integrity-$(CONFIG_INTEGRITY_PLATFORM_KEYRING) += 
platform_certs/platform_keyring.o
 integrity-$(CONFIG_LOAD_UEFI_KEYS) += platform_certs/efi_parser.o \
-   platform_certs/load_uefi.o
+ platform_certs/load_uefi.o \
+ platform_certs/keyring_handler.o
 integrity-$(CONFIG_LOAD_IPL_KEYS) += platform_certs/load_ipl_s390.o
 $(obj)/load_uefi.o: KBUILD_CFLAGS += -fshort-wchar
 
diff --git a/security/integrity/platform_certs/keyring_handler.c 
b/security/integrity/platform_certs/keyring_handler.c
new file mode 100644
index ..c5ba695c10e3
--- /dev/null
+++ b/security/integrity/platform_certs/keyring_handler.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "../integrity.h"
+
+static efi_guid_t efi_cert_x509_guid __initdata = EFI_CERT_X509_GUID;
+static efi_guid_t efi_cert_x509_sha256_guid __initdata =
+   EFI_CERT_X509_SHA256_GUID;
+static efi_guid_t efi_cert_sha256_guid __initdata = EFI_CERT_SHA256_GUID;
+
+/*
+ * Blacklist a hash.
+ */
+static __init void uefi_blacklist_hash(const char *source, const void *data,
+  size_t len, const char *type,
+  size_t type_len)
+{
+   char *hash, *p;
+
+   hash = kmalloc(type_len + len * 2 + 1, GFP_KERNEL);
+   if (!hash)
+   return;
+   p = memcpy(hash, type, type_len);
+   p += type_len;
+   bin2hex(p, data, len);
+   p += len * 2;
+   *p = 0;
+
+   mark_hash_blacklisted(hash);
+   kfree(hash);
+}
+
+/*
+ * Blacklist an X509 TBS hash.
+ */
+static __init void uefi_blacklist_x509_tbs(const char *source,
+  const void *data, size_t len)
+{
+   uefi_blacklist_hash(source, data, len, "tbs:", 4);
+}
+
+/*
+ * Blacklist the hash of an executable.
+ */
+static __init void uefi_blacklist_binary(const char *source,
+const void *data, size_t len)
+{
+   uefi_blacklist_hash(source, data, len, "bin:", 4);
+}
+
+/*
+ * Return the appropriate handler for particular signature list types found in
+ * the UEFI db and MokListRT tables.
+ */
+__init efi_element_handler_t get_handler_for_db(const efi_guid_t *sig_type)
+{
+   if (efi_guidcmp(*sig_type, efi_cert_x509_guid) == 0)
+   return add_to_platform_keyring;
+   return 0;
+}
+
+/*
+ * Return the appropriate handler for particular signature list types found in
+ * the UEFI dbx and MokListXRT tables.
+ */
+__init efi_element_handler_t get_handler_for_dbx(const efi_guid_t *sig_type)
+{
+   if (efi_guidcmp(*sig_type, efi_cert_x509_sha256_guid) == 0)
+   return uefi_blacklist_x509_tbs;
+   if (efi_guidcmp(*sig_type, efi_cert_sha256_guid) == 0)
+   return uefi_blacklist_binary;
+   return 0;
+}
diff --git a/security/integrity/platform_certs/keyring_handler.h 
b/security/integrity/platform_certs/keyring_handler.h
new file mode 100644
index ..829a14b95218
--- /dev/null
+++ b/security/integrity/platform_certs/keyring_handler.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ */
+#ifndef PLATFORM_CERTS_INTERNAL_H
+#define PLATFORM_CERTS_INTERNAL_H
+
+#include 
+
+void blacklist_hash(const char *source, const void *data,
+   size_t len, const char *type,
+   size_t type_len);
+
+/*
+ * Blacklist an X509 TBS hash.
+ */
+void blacklist_x509_tbs(const char *source, const void *data, size_t len);
+
+/*
+ * Blacklist the hash of an executable.
+ */
+void blacklist_binary(const char *source, const void *data, size_t len);
+
+/*
+ * Return the handler for particular signature list types found in the db.
+ */
+efi_element_handler_t get_handler_for_db(const efi_guid_t *sig_type);
+
+/*
+ * Return the handler for particular signature list types found in the dbx.
+ */
+efi_element_handler_t 

[PATCH v2 2/4] powerpc: expose secure variables to userspace via sysfs

2019-08-21 Thread Nayna Jain
PowerNV secure variables, which store the keys used for OS kernel
verification, are managed by the firmware. These secure variables need to
be accessible from userspace for addition/deletion of certificates.

This patch adds a sysfs interface to expose secure variables for PowerNV
secureboot. Users can use this interface to manipulate the keys stored
in the secure variables.

Signed-off-by: Nayna Jain 
---
 Documentation/ABI/testing/sysfs-secvar |  27 
 arch/powerpc/Kconfig   |   9 ++
 arch/powerpc/kernel/Makefile   |   1 +
 arch/powerpc/kernel/secvar-sysfs.c | 210 +
 4 files changed, 247 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-secvar
 create mode 100644 arch/powerpc/kernel/secvar-sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-secvar 
b/Documentation/ABI/testing/sysfs-secvar
new file mode 100644
index ..68f0e03d873d
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-secvar
@@ -0,0 +1,27 @@
+What:  /sys/firmware/secvar
+Date:  August 2019
+Contact:   Nayna Jain 
+Description:
+   This directory exposes interfaces for interacting with
+   the secure variables managed by OPAL firmware.
+
+   This is only for the powerpc/powernv platform.
+
+   Directory:
+   vars:   This directory lists all the variables that
+   are supported by the OPAL. The variables are
+   represented in the form of directories with
+   their variable names. The variable name is
+   unique and is in ASCII representation. The data
+   and size can be determined by reading their
+   respective attribute files.
+
+   Each variable directory has the following files:
+   name:   An ASCII representation of the variable name
+   data:   A read-only file containing the value of the
+   variable
+   size:   An integer representation of the size of the
+   content of the variable. In other words, it
+   represents the size of the data
+   update: A write-only file that is used to submit the new
+   value for the variable.
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 42109682b727..b4bdf77837b2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -925,6 +925,15 @@ config PPC_SECURE_BOOT
  allows user to enable OS Secure Boot on PowerPC systems that
  have firmware secure boot support.
 
+config SECVAR_SYSFS
+tristate "Enable sysfs interface for POWER secure variables"
+depends on PPC_SECURE_BOOT
+help
+  POWER secure variables are managed and controlled by firmware.
+  These variables are exposed to userspace via sysfs to enable
+  read/write operations on these variables. Say Y if you have
+ secure boot enabled and want to expose variables to userspace.
+
 endmenu
 
 config ISA_DMA_API
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 9041563f1c74..4ea7b738c3a3 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -158,6 +158,7 @@ obj-$(CONFIG_EPAPR_PARAVIRT)+= epapr_paravirt.o 
epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
 
 obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o ima_arch.o secvar-ops.o
+obj-$(CONFIG_SECVAR_SYSFS) += secvar-sysfs.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
new file mode 100644
index ..e46986bb29a0
--- /dev/null
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 IBM Corporation 
+ *
+ * This code exposes secure variables to user via sysfs
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+//Approximating it for now, it is bound to change.
+#define VARIABLE_MAX_SIZE  32000
+
+static struct kobject *powerpc_kobj;
+static struct secvar_operations *secvarops;
+struct kset *secvar_kset;
+
+static ssize_t name_show(struct kobject *kobj, struct kobj_attribute *attr,
+char *buf)
+{
+   return sprintf(buf, "%s", kobj->name);
+}
+
+static ssize_t size_show(struct kobject *kobj, struct kobj_attribute *attr,
+char *buf)
+{
+   unsigned long dsize;
+   int rc;
+
+   rc = secvarops->get_variable(kobj->name, strlen(kobj->name) + 1, NULL,
+&dsize);
+   if (rc) {
+   pr_err("Error retrieving variable size %d\n", rc);
+   

[PATCH v2 1/4] powerpc/powernv: Add OPAL API interface to access secure variable

2019-08-21 Thread Nayna Jain
The X.509 certificates trusted by the platform and required to secure boot
the OS kernel are wrapped in secure variables, which are controlled by
OPAL.

This patch adds firmware/kernel interface to read and write OPAL secure
variables based on the unique key.

This support can be enabled using CONFIG_OPAL_SECVAR.

Signed-off-by: Claudio Carvalho 
Signed-off-by: Nayna Jain 
---
 arch/powerpc/include/asm/opal-api.h  |   5 +-
 arch/powerpc/include/asm/opal.h  |   6 ++
 arch/powerpc/include/asm/secvar.h|  55 ++
 arch/powerpc/kernel/Makefile |   2 +-
 arch/powerpc/kernel/secvar-ops.c |  25 +
 arch/powerpc/platforms/powernv/Kconfig   |   6 ++
 arch/powerpc/platforms/powernv/Makefile  |   1 +
 arch/powerpc/platforms/powernv/opal-call.c   |   3 +
 arch/powerpc/platforms/powernv/opal-secvar.c | 102 +++
 arch/powerpc/platforms/powernv/opal.c|   5 +
 10 files changed, 208 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/secvar.h
 create mode 100644 arch/powerpc/kernel/secvar-ops.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 383242eb0dea..b238b4f26c5b 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -208,7 +208,10 @@
 #define OPAL_HANDLE_HMI2   166
 #defineOPAL_NX_COPROC_INIT 167
 #define OPAL_XIVE_GET_VP_STATE 170
-#define OPAL_LAST  170
+#define OPAL_SECVAR_GET 173
+#define OPAL_SECVAR_GET_NEXT174
+#define OPAL_SECVAR_ENQUEUE_UPDATE  175
+#define OPAL_LAST   175
 
 #define QUIESCE_HOLD   1 /* Spin all calls at entry */
 #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 57bd029c715e..247adec2375f 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -388,6 +388,12 @@ void opal_powercap_init(void);
 void opal_psr_init(void);
 void opal_sensor_groups_init(void);
 
+extern int opal_secvar_get(uint64_t k_key, uint64_t k_key_len,
+  uint64_t k_data, uint64_t k_data_size);
+extern int opal_secvar_get_next(uint64_t k_key, uint64_t k_key_len,
+   uint64_t k_key_size);
+extern int opal_secvar_enqueue_update(uint64_t k_key, uint64_t k_key_len,
+ uint64_t k_data, uint64_t k_data_size);
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_OPAL_H */
diff --git a/arch/powerpc/include/asm/secvar.h 
b/arch/powerpc/include/asm/secvar.h
new file mode 100644
index ..645654456265
--- /dev/null
+++ b/arch/powerpc/include/asm/secvar.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PowerPC secure variable operations.
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ */
+#ifndef SECVAR_OPS_H
+#define SECVAR_OPS_H
+
+#include
+#include
+
+struct secvar_operations {
+   int (*get_variable)(const char *key, unsigned long key_len, u8 *data,
+   unsigned long *data_size);
+   int (*get_next_variable)(const char *key, unsigned long *key_len,
+unsigned long keysize);
+   int (*set_variable)(const char *key, unsigned long key_len, u8 *data,
+   unsigned long data_size);
+};
+
+#ifdef CONFIG_PPC_SECURE_BOOT
+
+extern void set_secvar_ops(struct secvar_operations *ops);
+extern struct secvar_operations *get_secvar_ops(void);
+
+#else
+
+static inline void set_secvar_ops(struct secvar_operations *ops)
+{
+}
+
+static inline struct secvar_operations *get_secvar_ops(void)
+{
+   return NULL;
+}
+
+#endif
+
+#ifdef CONFIG_OPAL_SECVAR
+
+extern int secvar_init(void);
+
+#else
+
+static inline int secvar_init(void)
+{
+   return -EINVAL;
+}
+
+#endif
+
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 520b1c814197..9041563f1c74 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -157,7 +157,7 @@ endif
 obj-$(CONFIG_EPAPR_PARAVIRT)   += epapr_paravirt.o epapr_hcalls.o
 obj-$(CONFIG_KVM_GUEST)+= kvm.o kvm_emul.o
 
-obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o ima_arch.o
+obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o ima_arch.o secvar-ops.o
 
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/secvar-ops.c b/arch/powerpc/kernel/secvar-ops.c
new file mode 100644
index ..198222499848
--- /dev/null
+++ b/arch/powerpc/kernel/secvar-ops.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna 

[PATCH v2 0/4] powerpc: expose secure variables to the kernel and userspace

2019-08-21 Thread Nayna Jain
In order to verify the OS kernel on PowerNV systems, secure boot requires
X.509 certificates trusted by the platform. These are stored in secure
variables controlled by OPAL, called OPAL secure variables. In order to
enable users to manage the keys, the secure variables need to be exposed
to userspace.

OPAL provides the runtime services for the kernel to be able to access the
secure variables[1]. This patchset defines the kernel interface for the
OPAL APIs. These APIs are used by the hooks, which load these variables
into the keyring and expose them to userspace for reading/writing.

The previous version[2] of the patchset added support only for the sysfs
interface. This version adds two more patches that load the firmware
trusted keys into the kernel keyring. This patchset depends on the base
CONFIG PPC_SECURE_BOOT added by the IMA arch-specific patches for
POWER[3].

Overall, this patchset adds the following support:

* expose secure variables to the kernel via OPAL Runtime API interface
* expose secure variables to userspace via a kernel sysfs interface
* load kernel verification and revocation keys into the .platform and
.blacklist keyrings, respectively.

The secure variables can be read/written using simple linux utilities
cat/hexdump.

For example:
Path to the secure variables is:
/sys/firmware/secvar/vars

Each secure variable is listed as a directory.
$ ls -l
total 0
drwxr-xr-x. 2 root root 0 Aug 20 21:20 db
drwxr-xr-x. 2 root root 0 Aug 20 21:20 KEK
drwxr-xr-x. 2 root root 0 Aug 20 21:20 PK

The attributes of each of the secure variables are(for example: PK):
[PK]$ ls -l
total 0
-r--r--r--. 1 root root 32000 Aug 21 08:28 data
-r--r--r--. 1 root root 65536 Aug 21 08:28 name
-r--r--r--. 1 root root 65536 Aug 21 08:28 size
--w---. 1 root root 32000 Aug 21 08:28 update

The "data" is used to read the existing variable value using hexdump. The
data is stored in ESL format.
The "update" is used to write a new value using cat. The update is
to be submitted as AUTH file.
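
As a purely illustrative sketch (not part of this series), a userspace
consumer could also read a variable's ESL data through this interface with
plain file I/O; the path and the "PK" variable name below are just examples:

#include <stdio.h>

int main(void)
{
	/* Hypothetical example path; "PK" is just one of the listed variables. */
	const char *path = "/sys/firmware/secvar/vars/PK/data";
	unsigned char buf[4096];
	size_t n;
	FILE *f = fopen(path, "rb");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* Dump the ESL blob as hex, similar to what hexdump would show. */
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
		for (size_t i = 0; i < n; i++)
			printf("%02x%s", buf[i], (i % 16 == 15) ? "\n" : " ");
	}
	printf("\n");
	fclose(f);
	return 0;
}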

[1] Depends on skiboot OPAL API changes which remove metadata from
the API. The new version with the changes is going to be posted soon.
[2] https://lkml.org/lkml/2019/6/13/1644
[3] https://lkml.org/lkml/2019/8/19/402

Changelog:

v2:
* removes complete efi-sms from the sysfs implementation and is simplified
* includes Greg's and Oliver's feedback:
 * adds sysfs documentation
 * moves sysfs code to arch/powerpc
 * other code-related feedback.
* adds two new patches to load keys to .platform and .blacklist keyring.
These patches are added to this series as they are also dependent on
OPAL APIs.

Nayna Jain (4):
  powerpc/powernv: Add OPAL API interface to access secure variable
  powerpc: expose secure variables to userspace via sysfs
  x86/efi: move common keyring handler functions to new file
  powerpc: load firmware trusted keys into kernel keyring

 Documentation/ABI/testing/sysfs-secvar|  27 +++
 arch/powerpc/Kconfig  |   9 +
 arch/powerpc/include/asm/opal-api.h   |   5 +-
 arch/powerpc/include/asm/opal.h   |   6 +
 arch/powerpc/include/asm/secvar.h |  55 +
 arch/powerpc/kernel/Makefile  |   3 +-
 arch/powerpc/kernel/secvar-ops.c  |  25 +++
 arch/powerpc/kernel/secvar-sysfs.c| 210 ++
 arch/powerpc/platforms/powernv/Kconfig|   6 +
 arch/powerpc/platforms/powernv/Makefile   |   1 +
 arch/powerpc/platforms/powernv/opal-call.c|   3 +
 arch/powerpc/platforms/powernv/opal-secvar.c  | 102 +
 arch/powerpc/platforms/powernv/opal.c |   5 +
 security/integrity/Kconfig|   9 +
 security/integrity/Makefile   |   6 +-
 .../platform_certs/keyring_handler.c  |  80 +++
 .../platform_certs/keyring_handler.h  |  35 +++
 .../integrity/platform_certs/load_powerpc.c   |  94 
 security/integrity/platform_certs/load_uefi.c |  67 +-
 19 files changed, 679 insertions(+), 69 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-secvar
 create mode 100644 arch/powerpc/include/asm/secvar.h
 create mode 100644 arch/powerpc/kernel/secvar-ops.c
 create mode 100644 arch/powerpc/kernel/secvar-sysfs.c
 create mode 100644 arch/powerpc/platforms/powernv/opal-secvar.c
 create mode 100644 security/integrity/platform_certs/keyring_handler.c
 create mode 100644 security/integrity/platform_certs/keyring_handler.h
 create mode 100644 security/integrity/platform_certs/load_powerpc.c

-- 
2.20.1



[PATCH v2] btrfs: fix allocation of bitmap pages.

2019-08-21 Thread Christophe Leroy
Various notifications of type "BUG kmalloc-4096 () : Redzone
overwritten" have been observed recently in various parts of
the kernel. After some time, a correlation with the use of the
BTRFS filesystem was established.

[   22.809700] BUG kmalloc-4096 (Tainted: GW): Redzone 
overwritten
[   22.809971] 
-

[   22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[   22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] 
age=22 cpu=0 pid=224
[   22.811193]  __slab_alloc.constprop.26+0x44/0x70
[   22.811345]  kmem_cache_alloc_trace+0xf0/0x2ec
[   22.811588]  __load_free_space_cache+0x588/0x780 [btrfs]
[   22.811848]  load_free_space_cache+0xf4/0x1b0 [btrfs]
[   22.812090]  cache_block_group+0x1d0/0x3d0 [btrfs]
[   22.812321]  find_free_extent+0x680/0x12a4 [btrfs]
[   22.812549]  btrfs_reserve_extent+0xec/0x220 [btrfs]
[   22.812785]  btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[   22.813032]  __btrfs_cow_block+0x150/0x5d4 [btrfs]
[   22.813262]  btrfs_cow_block+0x194/0x298 [btrfs]
[   22.813484]  commit_cowonly_roots+0x44/0x294 [btrfs]
[   22.813718]  btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[   22.813973]  close_ctree+0xf8/0x2a4 [btrfs]
[   22.814107]  generic_shutdown_super+0x80/0x110
[   22.814250]  kill_anon_super+0x18/0x30
[   22.814437]  btrfs_kill_super+0x18/0x90 [btrfs]
[   22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[   22.814841]  proc_cgroup_show+0xc0/0x248
[   22.814967]  proc_single_show+0x54/0x98
[   22.815086]  seq_read+0x278/0x45c
[   22.815190]  __vfs_read+0x28/0x17c
[   22.815289]  vfs_read+0xa8/0x14c
[   22.815381]  ksys_read+0x50/0x94
[   22.815475]  ret_from_syscall+0x0/0x38

Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead
of memcpy") changed the way bitmap blocks are copied. But although
bitmaps have the size of a page, they were allocated with kzalloc().

Most of the time, kzalloc() allocates aligned blocks of memory, so
copy_page() can be used. But when some debug options like SLAB_DEBUG
are activated, kzalloc() may return an unaligned pointer.

On powerpc, memcpy(), copy_page() and other copying functions use the
'dcbz' instruction, which provides an entire zeroed cacheline to avoid a
memory read when the intention is to overwrite a full line. Functions
like memcpy() are written to care about partial cachelines at the start
and end of the destination, but copy_page() assumes it gets pages. As
pages are naturally cache aligned, copy_page() doesn't care about
partial lines. This means that when copy_page() is called with a
misaligned pointer, a few leading bytes are zeroed.

To fix it, allocate bitmaps through a kmem_cache instead of using kzalloc().
The cache pool is created with a PAGE_SIZE alignment constraint.
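
A minimal sketch of what the cache creation could look like, assuming a
plain kmem_cache_create() with PAGE_SIZE object size and alignment; the
cache name string and flags below are illustrative, not the literal hunk
from this patch:

	/*
	 * Illustrative only: create the bitmap cache with PAGE_SIZE size
	 * *and* PAGE_SIZE alignment so copy_page() always operates on a
	 * page-aligned buffer, even with slab debugging (redzones) enabled.
	 */
	btrfs_bitmap_cachep = kmem_cache_create("btrfs_free_space_bitmap",
						PAGE_SIZE, PAGE_SIZE,
						0, NULL);
	if (!btrfs_bitmap_cachep)
		return -ENOMEM;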

Reported-by: Erhard F. 
Link: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy 
---
v2: Using kmem_cache instead of get_zeroed_page() in order to benefit from SLAB 
debugging features like redzone.
---
 fs/btrfs/ctree.h|  1 +
 fs/btrfs/free-space-cache.c | 17 ++---
 fs/btrfs/inode.c|  7 +++
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 299e11e6c554..26abb95becc9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -43,6 +43,7 @@ extern struct kmem_cache *btrfs_trans_handle_cachep;
 extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
 extern struct kmem_cache *btrfs_free_space_cachep;
+extern struct kmem_cache *btrfs_bitmap_cachep;
 struct btrfs_ordered_sum;
 struct btrfs_ref;
 
diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 062be9dde4c6..9a708e7920a0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -764,7 +764,8 @@ static int __load_free_space_cache(struct btrfs_root *root, 
struct inode *inode,
} else {
ASSERT(num_bitmaps);
num_bitmaps--;
-   e->bitmap = kzalloc(PAGE_SIZE, GFP_NOFS);
+   e->bitmap = kmem_cache_zalloc(btrfs_bitmap_cachep,
+ GFP_NOFS);
if (!e->bitmap) {
kmem_cache_free(
btrfs_free_space_cachep, e);
@@ -1881,7 +1882,7 @@ static void free_bitmap(struct btrfs_free_space_ctl *ctl,
struct btrfs_free_space *bitmap_info)
 {
unlink_free_space(ctl, bitmap_info);
-   kfree(bitmap_info->bitmap);
+   kmem_cache_free(btrfs_bitmap_cachep, bitmap_info->bitmap);
kmem_cache_free(btrfs_free_space_cachep, bitmap_info);
ctl->total_bitmaps--;

Re: [PATCH] powerpc: Add barrier_nospec to raw_copy_in_user()

2019-08-21 Thread Michal Suchánek
On Wed,  6 Mar 2019 12:10:38 +1100
Suraj Jitindar Singh  wrote:

> Commit ddf35cf3764b ("powerpc: Use barrier_nospec in copy_from_user()")
> Added barrier_nospec before loading from user-controller pointers.
> The intention was to order the load from the potentially user-controlled
> pointer vs a previous branch based on an access_ok() check or similar.
> 
> In order to achieve the same result, add a barrier_nospec to the
> raw_copy_in_user() function before loading from such a user-controlled
> pointer.
> 
> Signed-off-by: Suraj Jitindar Singh 
> ---
>  arch/powerpc/include/asm/uaccess.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/include/asm/uaccess.h 
> b/arch/powerpc/include/asm/uaccess.h
> index e3a731793ea2..bb615592d5bb 100644
> --- a/arch/powerpc/include/asm/uaccess.h
> +++ b/arch/powerpc/include/asm/uaccess.h
> @@ -306,6 +306,7 @@ extern unsigned long __copy_tofrom_user(void __user *to,
>  static inline unsigned long
>  raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
>  {
> + barrier_nospec();
>   return __copy_tofrom_user(to, from, n);
>  }
>  #endif /* __powerpc64__ */

Hello,

AFAICT this is not needed. The data cannot be used in the kernel without
subsequent copy_from_user which already has the nospec barrier.

Thanks

Michal


Re: Regression fix for bpf in v5.3 (was Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding)

2019-08-21 Thread Jiong Wang


Michael Ellerman writes:

> "Naveen N. Rao"  writes:
>> Since BPF constant blinding is performed after the verifier pass, there
>> are certain ALU32 instructions inserted which don't have a corresponding
>> zext instruction inserted after. This is causing a kernel oops on
>> powerpc and can be reproduced by running 'test_cgroup_storage' with
>> bpf_jit_harden=2.
>>
>> Fix this by emitting BPF_ZEXT during constant blinding if
>> prog->aux->verifier_zext is set.
>>
>> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to 
>> analysis result")
>> Reported-by: Michael Ellerman 
>> Signed-off-by: Naveen N. Rao 
>> ---
>> This approach (the location where zext is being introduced below, in 
>> particular) works for powerpc, but I am not entirely sure if this is 
>> sufficient for other architectures as well. This is broken on v5.3-rc4.
>
> Any comment on this?

Have commented on https://marc.info/?l=linux-netdev=156637836024743=2

The fix looks correct to me for "BPF_LD | BPF_IMM | BPF_DW", but looks
unnecessary in the two other places. It would be great if you or Naveen could
confirm it.

Thanks.

Regards,
Jiong

> This is a regression in v5.3, which results in a kernel crash, it would
> be nice to get it fixed before the release please?
>
> cheers
>
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 8191a7db2777..d84146e6fd9e 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -890,7 +890,8 @@ int bpf_jit_get_func_addr(const struct bpf_prog *prog,
>>  
>>  static int bpf_jit_blind_insn(const struct bpf_insn *from,
>>const struct bpf_insn *aux,
>> -  struct bpf_insn *to_buff)
>> +  struct bpf_insn *to_buff,
>> +  bool emit_zext)
>>  {
>>  struct bpf_insn *to = to_buff;
>>  u32 imm_rnd = get_random_int();
>> @@ -939,6 +940,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn 
>> *from,
>>  *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
>>  *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
>>  *to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
>> +if (emit_zext)
>> +*to++ = BPF_ZEXT_REG(from->dst_reg);
>>  break;
>>  
>>  case BPF_ALU64 | BPF_ADD | BPF_K:
>> @@ -992,6 +995,10 @@ static int bpf_jit_blind_insn(const struct bpf_insn 
>> *from,
>>  off -= 2;
>>  *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
>>  *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
>> +if (emit_zext) {
>> +*to++ = BPF_ZEXT_REG(BPF_REG_AX);
>> +off--;
>> +}
>>  *to++ = BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX,
>>off);
>>  break;
>> @@ -1005,6 +1012,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn 
>> *from,
>>  case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
>>  *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ 
>> aux[0].imm);
>>  *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
>> +if (emit_zext)
>> +*to++ = BPF_ZEXT_REG(BPF_REG_AX);
>>  *to++ = BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX);
>>  break;
>>  
>> @@ -1088,7 +1097,8 @@ struct bpf_prog *bpf_jit_blind_constants(struct 
>> bpf_prog *prog)
>>  insn[1].code == 0)
>>  memcpy(aux, insn, sizeof(aux));
>>  
>> -rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
>> +rewritten = bpf_jit_blind_insn(insn, aux, insn_buff,
>> +clone->aux->verifier_zext);
>>  if (!rewritten)
>>  continue;
>>  
>> -- 
>> 2.22.0



Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding

2019-08-21 Thread Jiong Wang


Naveen N. Rao writes:

> Naveen N. Rao wrote:
>> Since BPF constant blinding is performed after the verifier pass, there
>> are certain ALU32 instructions inserted which don't have a corresponding
>> zext instruction inserted after. This is causing a kernel oops on
>> powerpc and can be reproduced by running 'test_cgroup_storage' with
>> bpf_jit_harden=2.
>> 
>> Fix this by emitting BPF_ZEXT during constant blinding if
>> prog->aux->verifier_zext is set.
>> 
>> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to 
>> analysis result")
>> Reported-by: Michael Ellerman 
>> Signed-off-by: Naveen N. Rao 
>> ---
>> This approach (the location where zext is being introduced below, in 
>> particular) works for powerpc, but I am not entirely sure if this is 
>> sufficient for other architectures as well. This is broken on v5.3-rc4.
>
> Alexie, Daniel, Jiong,
> Any feedback on this?

The fix for BPF_LD | BPF_IMM | BPF_DW looks correct to me, but the two other
places look unnecessary to me, as those destinations are externally visible
and, if they are used as 64-bit values, zext will already have been inserted
for them.

Have you verified whether the bug still occurs with those two fixes removed?

Regards,
Jiong

>
> - Naveen



[PATCH 10/10] powerpc/64s/exception: add more comments for interrupt handlers

2019-08-21 Thread Nicholas Piggin
A few of the non-standard handlers are left uncommented, and more
description could still be added to some of them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 408 ---
 1 file changed, 369 insertions(+), 39 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 87bce1fd7e11..2963b46f9580 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -126,25 +126,25 @@ name:
 /*
  * Interrupt code generation macros
  */
-#define IVEC   .L_IVEC_\name\()
-#define IHSRR  .L_IHSRR_\name\()
-#define IAREA  .L_IAREA_\name\()
-#define IVIRT  .L_IVIRT_\name\()
-#define IISIDE .L_IISIDE_\name\()
-#define IDAR   .L_IDAR_\name\()
-#define IDSISR .L_IDSISR_\name\()
-#define ISET_RI.L_ISET_RI_\name\()
-#define IBRANCH_COMMON .L_IBRANCH_COMMON_\name\()
-#define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\()
-#define IMASK  .L_IMASK_\name\()
-#define IKVM_SKIP  .L_IKVM_SKIP_\name\()
-#define IKVM_REAL  .L_IKVM_REAL_\name\()
+#define IVEC   .L_IVEC_\name\()/* Interrupt vector address */
+#define IHSRR  .L_IHSRR_\name\()   /* Sets SRR or HSRR registers */
+#define IAREA  .L_IAREA_\name\()   /* PACA save area */
+#define IVIRT  .L_IVIRT_\name\()   /* Has virt mode entry point */
+#define IISIDE .L_IISIDE_\name\()  /* Uses SRR0/1 not DAR/DSISR */
+#define IDAR   .L_IDAR_\name\()/* Uses DAR (or SRR0) */
+#define IDSISR .L_IDSISR_\name\()  /* Uses DSISR (or SRR1) */
+#define ISET_RI.L_ISET_RI_\name\() /* Run common code w/ 
MSR[RI]=1 */
+#define IBRANCH_COMMON .L_IBRANCH_COMMON_\name\() /* ENTRY branch to common */
+#define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\() /* Common runs in 
realmode */
+#define IMASK  .L_IMASK_\name\()   /* IRQ soft-mask bit */
+#define IKVM_SKIP  .L_IKVM_SKIP_\name\()   /* Generate KVM skip handler */
+#define IKVM_REAL  .L_IKVM_REAL_\name\()   /* Real entry tests KVM */
 #define __IKVM_REAL(name)  .L_IKVM_REAL_ ## name
-#define IKVM_VIRT  .L_IKVM_VIRT_\name\()
-#define ISTACK .L_ISTACK_\name\()
+#define IKVM_VIRT  .L_IKVM_VIRT_\name\()   /* Virt entry tests KVM */
+#define ISTACK .L_ISTACK_\name\()  /* Set regular kernel stack */
 #define __ISTACK(name) .L_ISTACK_ ## name
-#define IRECONCILE .L_IRECONCILE_\name\()
-#define IKUAP  .L_IKUAP_\name\()
+#define IRECONCILE .L_IRECONCILE_\name\()  /* Do RECONCILE_IRQ_STATE */
+#define IKUAP  .L_IKUAP_\name\()   /* Do KUAP lock */
 
 #define INT_DEFINE_BEGIN(n)\
 .macro int_define_ ## n name
@@ -215,6 +215,20 @@ do_define_int n
 #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 /*
+ * All interrupts which set HSRR registers (EXC_HV) as well as SRESET and
+ * MCE and syscall when invoked with "sc 1" switch to the hypervisor to
+ * be taken, so all generally need to test whether they were taken in guest
+ * context.
+ *
+ * SRESET and MCE may also be sent to the guest by the hypervisor.
+ *
+ * Interrupts which set SRR registers except SRESET and MCE and sc 1 are not
+ * elevated to the hypervisor, though many can be taken when running in
+ * hypervisor mode (e.g., bare metal kernel and userspace). These generally
+ * need to test whether taken in guest context for PR KVM guests. PR KVM
+ * does not enable AIL interrupts, so always takes them in real mode, which
+ * is why these generally only need test the real-mode case.
+ *
  * If hv is possible, interrupts come into to the hv version
  * of the kvmppc_interrupt code, which then jumps to the PR handler,
  * kvmppc_interrupt_pr, if the guest is a PR guest.
@@ -758,6 +772,39 @@ __start_interrupts:
 EXC_VIRT_NONE(0x4000, 0x100)
 
 
+/**
+ * Interrupt 0x100 - System Reset Interrupt (SRESET aka NMI).
+ * This is a non-maskable, asynchronous interrupt always taken in real-mode.
+ * It is caused by:
+ * - Wake from power-saving state, on powernv.
+ * - An NMI from another CPU, triggered by firmware or hypercall.
+ * - As crash/debug signal injected from BMC, firmware or hypervisor.
+ *
+ * Handling:
+ * Power-save wakeup is the only performance critical path, so this is
+ * determined as quickly as possible first. In this case volatile registers
+ * can be discarded and SPRs like CFAR don't need to be read.
+ *
+ * If not a powersave wakeup, then it's run as a regular interrupt, however
+ * it uses its own stack and PACA save area to preserve the regular kernel
+ * environment for debugging.
+ *
+ * This interrupt is not maskable, so triggering it when MSR[RI] is clear,
+ * or SCRATCH0 is in use, etc. may cause a crash. It's also not entirely
+ * correct to switch to virtual mode to run the regular interrupt handler
+ * because 

[PATCH 09/10] powerpc/64s/exception: re-inline some handlers

2019-08-21 Thread Nicholas Piggin
The reduction in interrupt entry size allows some handlers to be
re-inlined.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 3f4b7dfa800b..87bce1fd7e11 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1187,7 +1187,7 @@ INT_DEFINE_BEGIN(data_access)
 INT_DEFINE_END(data_access)
 
 EXC_REAL_BEGIN(data_access, 0x300, 0x80)
-   GEN_INT_ENTRY data_access, GEN_REAL, GEN_OOL
+   GEN_INT_ENTRY data_access, GEN_REAL
 EXC_REAL_END(data_access, 0x300, 0x80)
 EXC_VIRT_BEGIN(data_access, 0x4300, 0x80)
GEN_INT_ENTRY data_access, GEN_VIRT
@@ -1218,7 +1218,7 @@ INT_DEFINE_BEGIN(data_access_slb)
 INT_DEFINE_END(data_access_slb)
 
 EXC_REAL_BEGIN(data_access_slb, 0x380, 0x80)
-   GEN_INT_ENTRY data_access_slb, GEN_REAL, GEN_OOL
+   GEN_INT_ENTRY data_access_slb, GEN_REAL
 EXC_REAL_END(data_access_slb, 0x380, 0x80)
 EXC_VIRT_BEGIN(data_access_slb, 0x4380, 0x80)
GEN_INT_ENTRY data_access_slb, GEN_VIRT
@@ -1485,7 +1485,7 @@ INT_DEFINE_BEGIN(decrementer)
 INT_DEFINE_END(decrementer)
 
 EXC_REAL_BEGIN(decrementer, 0x900, 0x80)
-   GEN_INT_ENTRY decrementer, GEN_REAL, GEN_OOL
+   GEN_INT_ENTRY decrementer, GEN_REAL
 EXC_REAL_END(decrementer, 0x900, 0x80)
 EXC_VIRT_BEGIN(decrementer, 0x4900, 0x80)
GEN_INT_ENTRY decrementer, GEN_VIRT
-- 
2.22.0



[PATCH 08/10] powerpc/64s/exception: hdecrementer avoid touching the stack

2019-08-21 Thread Nicholas Piggin
The hdec interrupt handler is reported to sometimes fire in Linux if
KVM leaves it pending after a guest exits. This is harmless, so there
is a no-op handler for it.

The interrupt handler currently uses the regular kernel stack. Change
this to avoid touching the stack entirely.

This should be the last place where the regular Linux stack can be
accessed with asynchronous interrupts (including PMI) soft-masked.
It might be possible to take advantage of this invariant, e.g., to
context switch the kernel stack SLB entry without clearing MSR[EE].

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/time.h  |  1 -
 arch/powerpc/kernel/exceptions-64s.S | 25 -
 arch/powerpc/kernel/time.c   |  9 -
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 54f4ec1f9fab..0e17c41c4f46 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -24,7 +24,6 @@ extern struct clock_event_device decrementer_clockevent;
 
 
 extern void generic_calibrate_decr(void);
-extern void hdec_interrupt(struct pt_regs *regs);
 
 /* Some sane defaults: 125 MHz timebase, 1GHz processor */
 extern unsigned long ppc_proc_freq;
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index a4ceb88c53c4..3f4b7dfa800b 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1505,6 +1505,8 @@ INT_DEFINE_BEGIN(hdecrementer)
IVEC=0x980
IHSRR=EXC_HV
IAREA=PACA_EXGEN
+   ISTACK=0
+   IRECONCILE=0
IKVM_REAL=1
IKVM_VIRT=1
 INT_DEFINE_END(hdecrementer)
@@ -1516,11 +1518,24 @@ EXC_VIRT_BEGIN(hdecrementer, 0x4980, 0x80)
GEN_INT_ENTRY hdecrementer, GEN_VIRT
 EXC_VIRT_END(hdecrementer, 0x4980, 0x80)
 EXC_COMMON_BEGIN(hdecrementer_common)
-   GEN_COMMON hdecrementer
-   bl  save_nvgprs
-   addir3,r1,STACK_FRAME_OVERHEAD
-   bl  hdec_interrupt
-   b   ret_from_except
+   __GEN_COMMON_ENTRY hdecrementer
+   /*
+* Hypervisor decrementer interrupts not caught by the KVM test
+* shouldn't occur but are sometimes left pending on exit from a KVM
+* guest.  We don't need to do anything to clear them, as they are
+* edge-triggered.
+*
+* Be careful to avoid touching the kernel stack.
+*/
+   ld  r10,PACA_EXGEN+EX_CTR(r13)
+   mtctr   r10
+   mtcrf   0x80,r9
+   ld  r9,PACA_EXGEN+EX_R9(r13)
+   ld  r10,PACA_EXGEN+EX_R10(r13)
+   ld  r11,PACA_EXGEN+EX_R11(r13)
+   ld  r12,PACA_EXGEN+EX_R12(r13)
+   ld  r13,PACA_EXGEN+EX_R13(r13)
+   HRFI_TO_KERNEL
 
GEN_KVM hdecrementer
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 694522308cd5..bebc8c440289 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -663,15 +663,6 @@ void timer_broadcast_interrupt(void)
 }
 #endif
 
-/*
- * Hypervisor decrementer interrupts shouldn't occur but are sometimes
- * left pending on exit from a KVM guest.  We don't need to do anything
- * to clear them, as they are edge-triggered.
- */
-void hdec_interrupt(struct pt_regs *regs)
-{
-}
-
 #ifdef CONFIG_SUSPEND
 static void generic_suspend_disable_irqs(void)
 {
-- 
2.22.0



[PATCH 07/10] powerpc/64s/exception: trim unused arguments from KVMTEST macro

2019-08-21 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 14351e1b27df..a4ceb88c53c4 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -224,7 +224,7 @@ do_define_int n
 #define kvmppc_interrupt kvmppc_interrupt_pr
 #endif
 
-.macro KVMTEST name, hsrr, n
+.macro KVMTEST name
lbz r10,HSTATE_IN_GUEST(r13)
cmpwi   r10,0
bne \name\()_kvm
@@ -293,7 +293,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 .endm
 
 #else
-.macro KVMTEST name, hsrr, n
+.macro KVMTEST name
 .endm
 .macro GEN_KVM name
 .endm
@@ -440,7 +440,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 DEFINE_FIXED_SYMBOL(\name\()_common_real)
 \name\()_common_real:
.if IKVM_REAL
-   KVMTEST \name IHSRR IVEC
+   KVMTEST \name
.endif
 
ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
@@ -455,7 +455,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
 DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 \name\()_common_virt:
.if IKVM_VIRT
-   KVMTEST \name IHSRR IVEC
+   KVMTEST \name
.endif
 
.if ISET_RI
@@ -1615,7 +1615,7 @@ INT_DEFINE_END(system_call)
GET_PACA(r13)
std r10,PACA_EXGEN+EX_R10(r13)
INTERRUPT_TO_KERNEL
-   KVMTEST system_call EXC_STD 0xc00 /* uses r10, branch to 
system_call_kvm */
+   KVMTEST system_call /* uses r10, branch to system_call_kvm */
mfctr   r9
 #else
mr  r9,r13
-- 
2.22.0



[PATCH 06/10] powerpc/64s/exception: remove the SPR saving patch code macros

2019-08-21 Thread Nicholas Piggin
These are used infrequently enough that they don't provide much help, so
inline them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 82 ++--
 1 file changed, 28 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 705cca95e03c..14351e1b27df 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -110,46 +110,6 @@ name:
 #define EXC_HV 1
 #define EXC_STD0
 
-/*
- * PPR save/restore macros used in exceptions-64s.S
- * Used for P7 or later processors
- */
-#define SAVE_PPR(area, ra) \
-BEGIN_FTR_SECTION_NESTED(940)  \
-   ld  ra,area+EX_PPR(r13);/* Read PPR from paca */\
-   std ra,_PPR(r1);\
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
-
-#define RESTORE_PPR_PACA(area, ra) \
-BEGIN_FTR_SECTION_NESTED(941)  \
-   ld  ra,area+EX_PPR(r13);\
-   mtspr   SPRN_PPR,ra;\
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,941)
-
-/*
- * Get an SPR into a register if the CPU has the given feature
- */
-#define OPT_GET_SPR(ra, spr, ftr)  \
-BEGIN_FTR_SECTION_NESTED(943)  \
-   mfspr   ra,spr; \
-END_FTR_SECTION_NESTED(ftr,ftr,943)
-
-/*
- * Set an SPR from a register if the CPU has the given feature
- */
-#define OPT_SET_SPR(ra, spr, ftr)  \
-BEGIN_FTR_SECTION_NESTED(943)  \
-   mtspr   spr,ra; \
-END_FTR_SECTION_NESTED(ftr,ftr,943)
-
-/*
- * Save a register to the PACA if the CPU has the given feature
- */
-#define OPT_SAVE_REG_TO_PACA(offset, ra, ftr)  \
-BEGIN_FTR_SECTION_NESTED(943)  \
-   std ra,offset(r13); \
-END_FTR_SECTION_NESTED(ftr,ftr,943)
-
 /*
  * Branch to label using its 0xC000 address. This results in instruction
  * address suitable for MSR[IR]=0 or 1, which allows relocation to be turned
@@ -278,18 +238,18 @@ do_define_int n
cmpwi   r10,KVM_GUEST_MODE_SKIP
beq 89f
.else
-BEGIN_FTR_SECTION_NESTED(947)
+BEGIN_FTR_SECTION
ld  r10,IAREA+EX_CFAR(r13)
std r10,HSTATE_CFAR(r13)
-END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947)
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
.endif
 
ld  r10,PACA_EXGEN+EX_CTR(r13)
mtctr   r10
-BEGIN_FTR_SECTION_NESTED(948)
+BEGIN_FTR_SECTION
ld  r10,IAREA+EX_PPR(r13)
std r10,HSTATE_PPR(r13)
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
ld  r11,IAREA+EX_R11(r13)
ld  r12,IAREA+EX_R12(r13)
std r12,HSTATE_SCRATCH0(r13)
@@ -389,10 +349,14 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
SET_SCRATCH0(r13)   /* save r13 */
GET_PACA(r13)
std r9,IAREA+EX_R9(r13) /* save r9 */
-   OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR)
+BEGIN_FTR_SECTION
+   mfspr   r9,SPRN_PPR
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
HMT_MEDIUM
std r10,IAREA+EX_R10(r13)   /* save r10 - r12 */
-   OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR)
+BEGIN_FTR_SECTION
+   mfspr   r10,SPRN_CFAR
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
.if \ool
.if !\virt
b   tramp_real_\name
@@ -405,8 +369,12 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
.endif
 
-   OPT_SAVE_REG_TO_PACA(IAREA+EX_PPR, r9, CPU_FTR_HAS_PPR)
-   OPT_SAVE_REG_TO_PACA(IAREA+EX_CFAR, r10, CPU_FTR_CFAR)
+BEGIN_FTR_SECTION
+   std r9,IAREA+EX_PPR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+BEGIN_FTR_SECTION
+   std r10,IAREA+EX_CFAR(r13)
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
INTERRUPT_TO_KERNEL
mfctr   r10
std r10,IAREA+EX_CTR(r13)
@@ -553,7 +521,10 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
.endif
beq 101f/* if from kernel mode  */
ACCOUNT_CPU_USER_ENTRY(r13, r9, r10)
-   SAVE_PPR(IAREA, r9)
+BEGIN_FTR_SECTION
+   ld  r9,IAREA+EX_PPR(r13)/* Read PPR from paca   */
+   std r9,_PPR(r1)
+END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 101:
.else
.if IKUAP
@@ -593,10 +564,10 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
std r10,_DSISR(r1)
.endif
 

[PATCH 05/10] powerpc/64s/exception: remove confusing IEARLY option

2019-08-21 Thread Nicholas Piggin
Replace IEARLY=1 and IEARLY=2 with IBRANCH_COMMON, which controls whether
the entry code branches to a common handler, and IREALMODE_COMMON,
which controls whether the common handler should remain in real mode.

These special cases no longer avoid loading the SRR registers, there
is no point as most of them load the registers immediately anyway.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 48 ++--
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index e5d5d81d6107..705cca95e03c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -174,7 +174,8 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IDAR   .L_IDAR_\name\()
 #define IDSISR .L_IDSISR_\name\()
 #define ISET_RI.L_ISET_RI_\name\()
-#define IEARLY .L_IEARLY_\name\()
+#define IBRANCH_COMMON .L_IBRANCH_COMMON_\name\()
+#define IREALMODE_COMMON   .L_IREALMODE_COMMON_\name\()
 #define IMASK  .L_IMASK_\name\()
 #define IKVM_SKIP  .L_IKVM_SKIP_\name\()
 #define IKVM_REAL  .L_IKVM_REAL_\name\()
@@ -218,8 +219,15 @@ do_define_int n
.ifndef ISET_RI
ISET_RI=1
.endif
-   .ifndef IEARLY
-   IEARLY=0
+   .ifndef IBRANCH_COMMON
+   IBRANCH_COMMON=1
+   .endif
+   .ifndef IREALMODE_COMMON
+   IREALMODE_COMMON=0
+   .else
+   .if ! IBRANCH_COMMON
+   .error "IREALMODE_COMMON=1 but IBRANCH_COMMON=0"
+   .endif
.endif
.ifndef IMASK
IMASK=0
@@ -356,6 +364,11 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 #define GEN_OOL1
 
 .macro GEN_BRANCH_TO_COMMON name virt
+   .if IREALMODE_COMMON
+   LOAD_HANDLER(r10, \name\()_common)
+   mtctr   r10
+   bctr
+   .else
.if \virt
 #ifndef CONFIG_RELOCATABLE
b   \name\()_common_virt
@@ -369,6 +382,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
mtctr   r10
bctr
.endif
+   .endif
 .endm
 
 .macro GEN_INT_ENTRY name virt ool=0
@@ -424,11 +438,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
stw r10,IAREA+EX_DSISR(r13)
.endif
 
-   .if IEARLY == 2
-   /* nothing more */
-   .elseif IEARLY
-   BRANCH_TO_C000(r11, \name\()_common)
-   .else
.if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
@@ -444,6 +453,8 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
mfspr   r11,SPRN_SRR0   /* save SRR0 */
mfspr   r12,SPRN_SRR1   /* and SRR1 */
.endif
+
+   .if IBRANCH_COMMON
GEN_BRANCH_TO_COMMON \name \virt
.endif
 
@@ -923,6 +934,7 @@ INT_DEFINE_BEGIN(machine_check_early)
IHSRR=EXC_STD
IAREA=PACA_EXMC
IVIRT=0 /* no virt entry point */
+   IREALMODE_COMMON=1
/*
 * MSR_RI is not enabled, because PACA_EXMC is being used, so a
 * nested machine check corrupts it. machine_check_common enables
@@ -930,7 +942,6 @@ INT_DEFINE_BEGIN(machine_check_early)
 */
ISET_RI=0
ISTACK=0
-   IEARLY=1
IDAR=1
IDSISR=1
IRECONCILE=0
@@ -971,9 +982,6 @@ TRAMP_REAL_BEGIN(machine_check_fwnmi)
EXCEPTION_RESTORE_REGS EXC_STD
 
 EXC_COMMON_BEGIN(machine_check_early_common)
-   mfspr   r11,SPRN_SRR0
-   mfspr   r12,SPRN_SRR1
-
/*
 * Switch to mc_emergency stack and handle re-entrancy (we limit
 * the nested MCE upto level 4 to avoid stack overflow).
@@ -1849,7 +1857,7 @@ INT_DEFINE_BEGIN(hmi_exception_early)
IHSRR=EXC_HV
IAREA=PACA_EXGEN
IVIRT=0 /* no virt entry point */
-   IEARLY=1
+   IREALMODE_COMMON=1
ISTACK=0
IRECONCILE=0
IKUAP=0 /* We don't touch AMR here, we never go to virtual mode */
@@ -1871,8 +1879,6 @@ EXC_REAL_END(hmi_exception, 0xe60, 0x20)
 EXC_VIRT_NONE(0x4e60, 0x20)
 
 EXC_COMMON_BEGIN(hmi_exception_early_common)
-   mfspr   r11,SPRN_HSRR0  /* Save HSRR0 */
-   mfspr   r12,SPRN_HSRR1  /* Save HSRR1 */
mr  r10,r1  /* Save r1 */
ld  r1,PACAEMERGSP(r13) /* Use emergency stack for realmode */
subir1,r1,INT_FRAME_SIZE/* alloc stack frame*/
@@ -2213,29 +2219,23 @@ INT_DEFINE_BEGIN(denorm_exception)
IVEC=0x1500
IHSRR=EXC_HV
IAREA=PACA_EXGEN
-   IEARLY=2
+   IBRANCH_COMMON=0
IKVM_REAL=1
 INT_DEFINE_END(denorm_exception)
 
 EXC_REAL_BEGIN(denorm_exception, 0x1500, 0x100)
GEN_INT_ENTRY denorm_exception, GEN_REAL
 #ifdef CONFIG_PPC_DENORMALISATION
-   mfspr   r10,SPRN_HSRR1
-   andis.  r10,r10,(HSRR1_DENORM)@h /* 

[PATCH 04/10] powerpc/64s/exception: move KVM test to common code

2019-08-21 Thread Nicholas Piggin
This allows more code to be moved out of unrelocated regions. The system
call KVMTEST is changed to be open-coded and remain in the tramp area to
avoid having to move it to entry_64.S. The custom nature of the system
call entry code means the hcall case can be made more streamlined than
regular interrupt handlers.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S| 235 
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  11 --
 arch/powerpc/kvm/book3s_segment.S   |   7 -
 3 files changed, 115 insertions(+), 138 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index b78167ce6355..e5d5d81d6107 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -44,7 +44,6 @@
  * EXC_VIRT_BEGIN/END  - virt (AIL), unrelocated exception vectors
  * TRAMP_REAL_BEGIN- real, unrelocated helpers (virt may call these)
  * TRAMP_VIRT_BEGIN- virt, unreloc helpers (in practice, real can use)
- * TRAMP_KVM_BEGIN - KVM handlers, these are put into real, unrelocated
  * EXC_COMMON  - After switching to virtual, relocated mode.
  */
 
@@ -74,13 +73,6 @@ name:
 #define TRAMP_VIRT_BEGIN(name) \
FIXED_SECTION_ENTRY_BEGIN(virt_trampolines, name)
 
-#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
-#define TRAMP_KVM_BEGIN(name)  \
-   TRAMP_VIRT_BEGIN(name)
-#else
-#define TRAMP_KVM_BEGIN(name)
-#endif
-
 #define EXC_REAL_NONE(start, size) \
FIXED_SECTION_ENTRY_BEGIN_LOCATION(real_vectors, 
exc_real_##start##_##unused, start, size); \
FIXED_SECTION_ENTRY_END_LOCATION(real_vectors, 
exc_real_##start##_##unused, start, size)
@@ -271,6 +263,9 @@ do_define_int n
 .endm
 
 .macro GEN_KVM name
+   .balign IFETCH_ALIGN_BYTES
+\name\()_kvm:
+
.if IKVM_SKIP
cmpwi   r10,KVM_GUEST_MODE_SKIP
beq 89f
@@ -281,13 +276,18 @@ BEGIN_FTR_SECTION_NESTED(947)
 END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947)
.endif
 
+   ld  r10,PACA_EXGEN+EX_CTR(r13)
+   mtctr   r10
 BEGIN_FTR_SECTION_NESTED(948)
ld  r10,IAREA+EX_PPR(r13)
std r10,HSTATE_PPR(r13)
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
-   ld  r10,IAREA+EX_R10(r13)
+   ld  r11,IAREA+EX_R11(r13)
+   ld  r12,IAREA+EX_R12(r13)
std r12,HSTATE_SCRATCH0(r13)
sldir12,r9,32
+   ld  r9,IAREA+EX_R9(r13)
+   ld  r10,IAREA+EX_R10(r13)
/* HSRR variants have the 0x2 bit added to their trap number */
.if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
@@ -300,29 +300,16 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.else
ori r12,r12,(IVEC)
.endif
-
-#ifdef CONFIG_RELOCATABLE
-   /*
-* KVM requires __LOAD_FAR_HANDLER beause kvmppc_interrupt lives
-* outside the head section. CONFIG_RELOCATABLE KVM expects CTR
-* to be saved in HSTATE_SCRATCH1.
-*/
-   ld  r9,IAREA+EX_CTR(r13)
-   std r9,HSTATE_SCRATCH1(r13)
-   __LOAD_FAR_HANDLER(r9, kvmppc_interrupt)
-   mtctr   r9
-   ld  r9,IAREA+EX_R9(r13)
-   bctr
-#else
-   ld  r9,IAREA+EX_R9(r13)
b   kvmppc_interrupt
-#endif
-
 
.if IKVM_SKIP
 89:mtocrf  0x80,r9
+   ld  r10,PACA_EXGEN+EX_CTR(r13)
+   mtctr   r10
ld  r9,IAREA+EX_R9(r13)
ld  r10,IAREA+EX_R10(r13)
+   ld  r11,IAREA+EX_R11(r13)
+   ld  r12,IAREA+EX_R12(r13)
.if IHSRR == EXC_HV_OR_STD
BEGIN_FTR_SECTION
b   kvmppc_skip_Hinterrupt
@@ -410,11 +397,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
mfctr   r10
std r10,IAREA+EX_CTR(r13)
mfcrr9
-
-   .if (!\virt && IKVM_REAL) || (\virt && IKVM_VIRT)
-   KVMTEST \name IHSRR IVEC
-   .endif
-
std r11,IAREA+EX_R11(r13)
std r12,IAREA+EX_R12(r13)
 
@@ -478,6 +460,10 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 .macro __GEN_COMMON_ENTRY name
 DEFINE_FIXED_SYMBOL(\name\()_common_real)
 \name\()_common_real:
+   .if IKVM_REAL
+   KVMTEST \name IHSRR IVEC
+   .endif
+
ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
.if ! ISET_RI
xorir10,r10,MSR_RI  /* Clear MSR_RI */
@@ -489,6 +475,10 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real)
.balign IFETCH_ALIGN_BYTES
 DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 \name\()_common_virt:
+   .if IKVM_VIRT
+   KVMTEST \name IHSRR IVEC
+   .endif
+
.if ISET_RI
li  r10,MSR_RI
mtmsrd  r10,1   /* Set MSR_RI */
@@ -848,8 +838,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 */
 EXC_REAL_END(system_reset, 0x100, 0x100)
 

[PATCH 03/10] powerpc/64s/exception: move soft-mask test to common code

2019-08-21 Thread Nicholas Piggin
As well as moving code out of the unrelocated vectors, this allows the
masked handlers to be moved to common code, and allows the soft_nmi
handler to be generated more like a regular handler.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 108 +--
 1 file changed, 51 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 6ea6176de00a..b78167ce6355 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -414,36 +414,6 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.if (!\virt && IKVM_REAL) || (\virt && IKVM_VIRT)
KVMTEST \name IHSRR IVEC
.endif
-   .if IMASK
-   lbz r10,PACAIRQSOFTMASK(r13)
-   andi.   r10,r10,IMASK
-   /* Associate vector numbers with bits in paca->irq_happened */
-   .if IVEC == 0x500 || IVEC == 0xea0
-   li  r10,PACA_IRQ_EE
-   .elseif IVEC == 0x900
-   li  r10,PACA_IRQ_DEC
-   .elseif IVEC == 0xa00 || IVEC == 0xe80
-   li  r10,PACA_IRQ_DBELL
-   .elseif IVEC == 0xe60
-   li  r10,PACA_IRQ_HMI
-   .elseif IVEC == 0xf00
-   li  r10,PACA_IRQ_PMI
-   .else
-   .abort "Bad maskable vector"
-   .endif
-
-   .if IHSRR == EXC_HV_OR_STD
-   BEGIN_FTR_SECTION
-   bne masked_Hinterrupt
-   FTR_SECTION_ELSE
-   bne masked_interrupt
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif IHSRR
-   bne masked_Hinterrupt
-   .else
-   bne masked_interrupt
-   .endif
-   .endif
 
std r11,IAREA+EX_R11(r13)
std r12,IAREA+EX_R12(r13)
@@ -528,6 +498,37 @@ DEFINE_FIXED_SYMBOL(\name\()_common_virt)
 .endm
 
 .macro __GEN_COMMON_BODY name
+   .if IMASK
+   lbz r10,PACAIRQSOFTMASK(r13)
+   andi.   r10,r10,IMASK
+   /* Associate vector numbers with bits in paca->irq_happened */
+   .if IVEC == 0x500 || IVEC == 0xea0
+   li  r10,PACA_IRQ_EE
+   .elseif IVEC == 0x900
+   li  r10,PACA_IRQ_DEC
+   .elseif IVEC == 0xa00 || IVEC == 0xe80
+   li  r10,PACA_IRQ_DBELL
+   .elseif IVEC == 0xe60
+   li  r10,PACA_IRQ_HMI
+   .elseif IVEC == 0xf00
+   li  r10,PACA_IRQ_PMI
+   .else
+   .abort "Bad maskable vector"
+   .endif
+
+   .if IHSRR == EXC_HV_OR_STD
+   BEGIN_FTR_SECTION
+   bne masked_Hinterrupt
+   FTR_SECTION_ELSE
+   bne masked_interrupt
+   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
+   .elseif IHSRR
+   bne masked_Hinterrupt
+   .else
+   bne masked_interrupt
+   .endif
+   .endif
+
.if ISTACK
andi.   r10,r12,MSR_PR  /* See if coming from user  */
mr  r10,r1  /* Save r1  */
@@ -2395,18 +2396,12 @@ EXC_VIRT_NONE(0x5800, 0x100)
 
 #ifdef CONFIG_PPC_WATCHDOG
 
-#define MASKED_DEC_HANDLER_LABEL 3f
-
-#define MASKED_DEC_HANDLER(_H) \
-3: /* soft-nmi */  \
-   std r12,PACA_EXGEN+EX_R12(r13); \
-   GET_SCRATCH0(r10);  \
-   std r10,PACA_EXGEN+EX_R13(r13); \
-   mfspr   r11,SPRN_SRR0;  /* save SRR0 */ \
-   mfspr   r12,SPRN_SRR1;  /* and SRR1 */  \
-   LOAD_HANDLER(r10, soft_nmi_common); \
-   mtctr   r10;\
-   bctr
+INT_DEFINE_BEGIN(soft_nmi)
+   IVEC=0x900
+   IHSRR=EXC_STD
+   IAREA=PACA_EXGEN
+   ISTACK=0
+INT_DEFINE_END(soft_nmi)
 
 /*
  * Branch to soft_nmi_interrupt using the emergency stack. The emergency
@@ -2418,19 +2413,16 @@ EXC_VIRT_NONE(0x5800, 0x100)
  * and run it entirely with interrupts hard disabled.
  */
 EXC_COMMON_BEGIN(soft_nmi_common)
+   mfspr   r11,SPRN_SRR0
mr  r10,r1
ld  r1,PACAEMERGSP(r13)
subir1,r1,INT_FRAME_SIZE
-   __ISTACK(decrementer)=0
-   __GEN_COMMON_BODY decrementer
+   __GEN_COMMON_BODY soft_nmi
bl  save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
bl  soft_nmi_interrupt
b   ret_from_except
 
-#else /* CONFIG_PPC_WATCHDOG */
-#define MASKED_DEC_HANDLER_LABEL 2f /* normal return */
-#define MASKED_DEC_HANDLER(_H)
 #endif /* CONFIG_PPC_WATCHDOG */
 
 /*
@@ -2449,7 +2441,6 @@ 

[PATCH 02/10] powerpc/64s/exception: move real->virt switch into the common handler

2019-08-21 Thread Nicholas Piggin
The real mode interrupt entry points currently use rfid to branch to
the common handler in virtual mode. This is a significant amount of
code, and forces other code (notably the KVM test) to live in the
real mode handler.

In the interest of minimising the amount of code that runs unrelocated,
move the switch to virt mode into the common code, and do it with
mtmsrd, which avoids clobbering SRRs (although the post-KVMTEST
performance of real-mode interrupt handlers is not a big concern these
days).

This requires CTR to always be saved (real-mode needs to reach 0xc...)
but that's not a huge impact these days. It could be optimized away in
future.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/exception-64s.h |   4 -
 arch/powerpc/kernel/exceptions-64s.S | 248 ++-
 2 files changed, 106 insertions(+), 146 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 33f4f72eb035..47bd4ea0837d 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -33,11 +33,7 @@
 #include 
 
 /* PACA save area size in u64 units (exgen, exmc, etc) */
-#if defined(CONFIG_RELOCATABLE)
 #define EX_SIZE10
-#else
-#define EX_SIZE9
-#endif
 
 /*
  * maximum recursive depth of MCE exceptions
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 2d2c6f19eec1..6ea6176de00a 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -32,16 +32,10 @@
 #define EX_CCR 52
 #define EX_CFAR56
 #define EX_PPR 64
-#if defined(CONFIG_RELOCATABLE)
 #define EX_CTR 72
 .if EX_SIZE != 10
.error "EX_SIZE is wrong"
 .endif
-#else
-.if EX_SIZE != 9
-   .error "EX_SIZE is wrong"
-.endif
-#endif
 
 /*
  * Following are fixed section helper macros.
@@ -124,22 +118,6 @@ name:
 #define EXC_HV 1
 #define EXC_STD0
 
-#if defined(CONFIG_RELOCATABLE)
-/*
- * If we support interrupts with relocation on AND we're a relocatable kernel,
- * we need to use CTR to get to the 2nd level handler.  So, save/restore it
- * when required.
- */
-#define SAVE_CTR(reg, area)mfctr   reg ;   std reg,area+EX_CTR(r13)
-#define GET_CTR(reg, area) ld  reg,area+EX_CTR(r13)
-#define RESTORE_CTR(reg, area) ld  reg,area+EX_CTR(r13) ; mtctr reg
-#else
-/* ...else CTR is unused and in register. */
-#define SAVE_CTR(reg, area)
-#define GET_CTR(reg, area) mfctr   reg
-#define RESTORE_CTR(reg, area)
-#endif
-
 /*
  * PPR save/restore macros used in exceptions-64s.S
  * Used for P7 or later processors
@@ -199,6 +177,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IVEC   .L_IVEC_\name\()
 #define IHSRR  .L_IHSRR_\name\()
 #define IAREA  .L_IAREA_\name\()
+#define IVIRT  .L_IVIRT_\name\()
 #define IISIDE .L_IISIDE_\name\()
 #define IDAR   .L_IDAR_\name\()
 #define IDSISR .L_IDSISR_\name\()
@@ -232,6 +211,9 @@ do_define_int n
.ifndef IAREA
.error "IAREA not defined"
.endif
+   .ifndef IVIRT
+   IVIRT=1
+   .endif
.ifndef IISIDE
IISIDE=0
.endif
@@ -325,7 +307,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 * outside the head section. CONFIG_RELOCATABLE KVM expects CTR
 * to be saved in HSTATE_SCRATCH1.
 */
-   mfctr   r9
+   ld  r9,IAREA+EX_CTR(r13)
std r9,HSTATE_SCRATCH1(r13)
__LOAD_FAR_HANDLER(r9, kvmppc_interrupt)
mtctr   r9
@@ -362,101 +344,6 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 .endm
 #endif
 
-.macro INT_SAVE_SRR_AND_JUMP label, hsrr, set_ri
-   ld  r10,PACAKMSR(r13)   /* get MSR value for kernel */
-   .if ! \set_ri
-   xorir10,r10,MSR_RI  /* Clear MSR_RI */
-   .endif
-   .if \hsrr == EXC_HV_OR_STD
-   BEGIN_FTR_SECTION
-   mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
-   mfspr   r12,SPRN_HSRR1  /* and HSRR1 */
-   mtspr   SPRN_HSRR1,r10
-   FTR_SECTION_ELSE
-   mfspr   r11,SPRN_SRR0   /* save SRR0 */
-   mfspr   r12,SPRN_SRR1   /* and SRR1 */
-   mtspr   SPRN_SRR1,r10
-   ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
-   .elseif \hsrr
-   mfspr   r11,SPRN_HSRR0  /* save HSRR0 */
-   mfspr   r12,SPRN_HSRR1  /* and HSRR1 */
-   mtspr   SPRN_HSRR1,r10
-   .else
-   mfspr   r11,SPRN_SRR0   /* save SRR0 */
-   mfspr   r12,SPRN_SRR1   /* and SRR1 */
-   mtspr   SPRN_SRR1,r10
-   .endif
-   LOAD_HANDLER(r10, \label\())
-   .if \hsrr == EXC_HV_OR_STD
-   BEGIN_FTR_SECTION
-   mtspr   SPRN_HSRR0,r10
-   HRFI_TO_KERNEL
-   FTR_SECTION_ELSE
-   mtspr   SPRN_SRR0,r10
-   

[PATCH 00/10] powerpc/64s/exception: initial reworking of

2019-08-21 Thread Nicholas Piggin
This applies on top of the previous exception series.

Until now the focus was on making the code generation macros sane,
ironing out irregularities and needless differences in handlers,
and sharing code. Some change of generated code was required, but
minimised as much as possible.

This series begins to change generated code in a bigger way. It is
an assortment of changes; one of the big goals is to move code,
especially branches, logic and conditional compilation,
out of "unrelocated" interrupt entry points, with the hope of making
KASLR more effective when it is implemented.

After this is a more significant rework of the instruction sequences
with the aim of improving performance, but I will stop here for the
next merge window.

Thanks,
Nick

Nicholas Piggin (10):
  powerpc/64s/exception: Add ISIDE option
  powerpc/64s/exception: move real->virt switch into the common handler
  powerpc/64s/exception: move soft-mask test to common code
  powerpc/64s/exception: move KVM test to common code
  powerpc/64s/exception: remove confusing IEARLY option
  powerpc/64s/exception: remove the SPR saving patch code macros
  powerpc/64s/exception: trim unused arguments from KVMTEST macro
  powerpc/64s/exception: hdecrementer avoid touching the stack
  powerpc/64s/exception: re-inline some handlers
  powerpc/64s/exception: add more comments for interrupt handlers

 arch/powerpc/include/asm/exception-64s.h |4 -
 arch/powerpc/include/asm/time.h  |1 -
 arch/powerpc/kernel/exceptions-64s.S | 1151 ++
 arch/powerpc/kernel/time.c   |9 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   11 -
 arch/powerpc/kvm/book3s_segment.S|7 -
 6 files changed, 716 insertions(+), 467 deletions(-)

-- 
2.22.0



[PATCH 01/10] powerpc/64s/exception: Add ISIDE option

2019-08-21 Thread Nicholas Piggin
Rather than using DAR=2 to select the i-side registers, add an
explicit option.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 23 ---
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 5d9fdf9dce55..2d2c6f19eec1 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -199,6 +199,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 #define IVEC   .L_IVEC_\name\()
 #define IHSRR  .L_IHSRR_\name\()
 #define IAREA  .L_IAREA_\name\()
+#define IISIDE .L_IISIDE_\name\()
 #define IDAR   .L_IDAR_\name\()
 #define IDSISR .L_IDSISR_\name\()
 #define ISET_RI.L_ISET_RI_\name\()
@@ -231,6 +232,9 @@ do_define_int n
.ifndef IAREA
.error "IAREA not defined"
.endif
+   .ifndef IISIDE
+   IISIDE=0
+   .endif
.ifndef IDAR
IDAR=0
.endif
@@ -546,7 +550,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
 */
GET_SCRATCH0(r10)
std r10,IAREA+EX_R13(r13)
-   .if IDAR == 1
+   .if IDAR && !IISIDE
.if IHSRR
mfspr   r10,SPRN_HDAR
.else
@@ -554,7 +558,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
std r10,IAREA+EX_DAR(r13)
.endif
-   .if IDSISR == 1
+   .if IDSISR && !IISIDE
.if IHSRR
mfspr   r10,SPRN_HDSISR
.else
@@ -629,16 +633,18 @@ 
END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
std r9,GPR11(r1)
std r10,GPR12(r1)
std r11,GPR13(r1)
+
.if IDAR
-   .if IDAR == 2
+   .if IISIDE
ld  r10,_NIP(r1)
.else
ld  r10,IAREA+EX_DAR(r13)
.endif
std r10,_DAR(r1)
.endif
+
.if IDSISR
-   .if IDSISR == 2
+   .if IISIDE
ld  r10,_MSR(r1)
lis r11,DSISR_SRR1_MATCH_64S@h
and r10,r10,r11
@@ -647,6 +653,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,948)
.endif
std r10,_DSISR(r1)
.endif
+
 BEGIN_FTR_SECTION_NESTED(66)
ld  r10,IAREA+EX_CFAR(r13)
std r10,ORIG_GPR3(r1)
@@ -1324,8 +1331,9 @@ INT_DEFINE_BEGIN(instruction_access)
IVEC=0x400
IHSRR=EXC_STD
IAREA=PACA_EXGEN
-   IDAR=2
-   IDSISR=2
+   IISIDE=1
+   IDAR=1
+   IDSISR=1
IKVM_REAL=1
 INT_DEFINE_END(instruction_access)
 
@@ -1355,7 +1363,8 @@ INT_DEFINE_BEGIN(instruction_access_slb)
IHSRR=EXC_STD
IAREA=PACA_EXSLB
IRECONCILE=0
-   IDAR=2
+   IISIDE=1
+   IDAR=1
IKVM_REAL=1
 INT_DEFINE_END(instruction_access_slb)
 
-- 
2.22.0



Re: [PATCH v2 21/44] powerpc/64s/exception: remove 0xb00 handler

2019-08-21 Thread Michael Ellerman
Nicholas Piggin  writes:
> This vector is not used by any supported processor, and has been
> implemented as an unknown exception going back to 2.6. There is
> nothing special about 0xb00, so remove it like other unused
> vectors.

Actually it goes back to the original ppc64 submission.

See (takes a while to load):
  
https://github.com/mpe/linux-fullhistory/commit/c3aa9878533e724f639852c3d951e6a169e04081#diff-c7b0adae374819e9003279ff5f69226fR340


That commit had handlers for all the vectors from 0x100 through 0xf00,
with stubs for 0xa00, 0xb00 and 0xe00. But it's not at all clear why it
needed the stubs, possibly it was just being verbose.

0xa00 eventually became doorbell_super and 0xe00 became h_data_storage.
Leaving just 0xb00 as the lone relic.



Patch looks good.

cheers


> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> index 723c37f3da17..9c407392774c 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -1563,10 +1563,8 @@ EXC_COMMON_ASYNC(doorbell_super_common, 0xa00, 
> unknown_exception)
>  #endif
>  
>  
> -EXC_REAL(trap_0b, 0xb00, 0x100)
> -EXC_VIRT(trap_0b, 0x4b00, 0x100, 0xb00)
> -TRAMP_KVM(PACA_EXGEN, 0xb00)
> -EXC_COMMON(trap_0b_common, 0xb00, unknown_exception)
> +EXC_REAL_NONE(0xb00, 0x100)
> +EXC_VIRT_NONE(0x4b00, 0x100)
>  
>  /*
>   * system call / hypercall (0xc00, 0x4c00)
> -- 
> 2.22.0


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Segher Boessenkool
On Wed, Aug 21, 2019 at 01:50:52PM +0200, Christophe Leroy wrote:
> Le 21/08/2019 à 13:44, Segher Boessenkool a écrit :
> >Calls are cheap, in principle...  It is the LR stuff that can make it
> >slower on some cores, and a lot of calling sequence stuff may have
> >considerable overhead of course.
> 
> On an 8xx, a taken branch is 2 cycles and a non-taken branch is 1 cycle 
> (+ the refetch if that was not the anticipated branch).

Yup.  And on the big cores they are all 0 cycles, if correctly predicted.
(Taken branches end your fetch group, of course, there are small
inefficiencies everywhere, but that's about the gist of it).

> >>+.macro get_datapage ptr, tmp
> >>+   bcl 20,31,888f
> >>+888:
> >>+   mflr\ptr
> >>+   addi\ptr, \ptr, __kernel_datapage_offset - 888b
> >>+   lwz \tmp, 0(\ptr)
> >>+   add \ptr, \tmp, \ptr
> >>+.endm
> >
> >(You can just write that as
> > bcl 20,31,$+4
> > mflr \ptr
> >etc.  Useless labels are useless :-) )
> 
> Nice trick. Will use that.

Or .+4 if you like that syntax better...  It's all the same thing.

> >One thing you might want to do to improve performance is to do this without
> >the bcl etc., because you cannot really hide the LR latency of that.  But
> >that isn't very many ns either...  Superscalar helps, OoO helps, but it is
> >mostly just that >100MHz helps ;-)
> 
> Good idea. Did you have a look at my vdso32 similar patch ? 
> https://patchwork.ozlabs.org/patch/1148274/

Yes, I saw it.

> Do you have any idea on how to avoid that bcl/mflr stuff ?

Do a load from some fixed address?  Maybe an absolute address, even?
lwz r3,-12344(0)  or similar (that address is in kernel space...)

There aren't many options, and certainly not many *good* options!
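
Purely as an illustration (assuming the datapage pointer were published at
some fixed virtual address, which the kernel does not do today;
FIXED_DATAPAGE_PTR below is a made-up symbol), the whole dance would collapse
to a dependent load with no LR traffic at all:

	.macro get_datapage ptr
	lis	\ptr, FIXED_DATAPAGE_PTR@ha	/* hypothetical fixed address */
	lwz	\ptr, FIXED_DATAPAGE_PTR@l(\ptr)	/* load the datapage pointer */
	.endm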


Segher


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Christophe Leroy




Le 21/08/2019 à 13:44, Segher Boessenkool a écrit :

Hi!

On Wed, Aug 21, 2019 at 02:59:59PM +0530, Santosh Sivaraj wrote:

except for a couple of calls (1 or 2 nsec reduction), there are no
improvements in the call times. Or is 10 nsec the minimum granularity??

So I don't know if it's even worth updating vdso64 except to keep vdso32 and
vdso64 equal.


Calls are cheap, in principle...  It is the LR stuff that can make it
slower on some cores, and a lot of calling sequence stuff may have
considerable overhead of course.


On an 8xx, a taken branch is 2 cycles and a non-taken branch is 1 cycle 
(+ the refetch if that was not the anticipated branch).





+.macro get_datapage ptr, tmp
+   bcl 20,31,888f
+888:
+   mflr\ptr
+   addi\ptr, \ptr, __kernel_datapage_offset - 888b
+   lwz \tmp, 0(\ptr)
+   add \ptr, \tmp, \ptr
+.endm


(You can just write that as
bcl 20,31,$+4
mflr \ptr
etc.  Useless labels are useless :-) )


Nice trick. Will use that.



One thing you might want to do to improve performance is to do this without
the bcl etc., because you cannot really hide the LR latency of that.  But
that isn't very many ns either...  Superscalar helps, OoO helps, but it is
mostly just that >100MHz helps ;-)


Good idea. Did you have a look at my vdso32 similar patch ? 
https://patchwork.ozlabs.org/patch/1148274/


Do you have any idea on how to avoid that bcl/mflr stuff ?

Christophe


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Segher Boessenkool
Hi!

On Wed, Aug 21, 2019 at 02:59:59PM +0530, Santosh Sivaraj wrote:
> except for a couple of calls (1 or 2 nsec reduction), there are no
> improvements in the call times. Or is 10 nsec the minimum granularity??
> 
> So I don't know if it's even worth updating vdso64 except to keep vdso32 and
> vdso64 equal.

Calls are cheap, in principle...  It is the LR stuff that can make it
slower on some cores, and a lot of calling sequence stuff may have
considerable overhead of course.

> +.macro get_datapage ptr, tmp
> + bcl 20,31,888f
> +888:
> + mflr\ptr
> + addi\ptr, \ptr, __kernel_datapage_offset - 888b
> + lwz \tmp, 0(\ptr)
> + add \ptr, \tmp, \ptr
> +.endm

(You can just write that as
bcl 20,31,$+4
mflr \ptr
etc.  Useless labels are useless :-) )
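
To make that concrete, a minimal untested sketch of the same macro with the
throw-away branch-target label dropped (using the `.`-relative form; a local
label is kept only as the reference point for the addi):

	.macro get_datapage ptr, tmp
	bcl	20,31,.+4
999:	mflr	\ptr
	addi	\ptr, \ptr, __kernel_datapage_offset - 999b
	lwz	\tmp, 0(\ptr)
	add	\ptr, \tmp, \ptr
	.endm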

One thing you might want to do to improve performance is to do this without
the bcl etc., because you cannot really hide the LR latency of that.  But
that isn't very many ns either...  Superscalar helps, OoO helps, but it is
mostly just that >100MHz helps ;-)


Segher


Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Santosh Sivaraj
Christophe Leroy  writes:

> Le 21/08/2019 à 11:29, Santosh Sivaraj a écrit :
>> __get_datapage() is only a few instructions to retrieve the
>> address of the page where the kernel stores data to the VDSO.
>> 
>> By inlining this function into its users, a bl/blr pair and
>> a mflr/mtlr pair is avoided, plus a few reg moves.
>> 
>> clock-gettime-monotonic: syscall: 514 nsec/call  396 nsec/call
>> clock-gettime-monotonic:libc: 25 nsec/call   24 nsec/call
>> clock-gettime-monotonic:vdso: 20 nsec/call   20 nsec/call
>> clock-getres-monotonic: syscall: 347 nsec/call   372 nsec/call
>> clock-getres-monotonic:libc: 19 nsec/call19 nsec/call
>> clock-getres-monotonic:vdso: 10 nsec/call10 nsec/call
>> clock-gettime-monotonic-coarse: syscall: 511 nsec/call   396 nsec/call
>> clock-gettime-monotonic-coarse:libc: 23 nsec/call21 nsec/call
>> clock-gettime-monotonic-coarse:vdso: 15 nsec/call13 nsec/call
>> clock-gettime-realtime: syscall: 526 nsec/call   405 nsec/call
>> clock-gettime-realtime:libc: 24 nsec/call23 nsec/call
>> clock-gettime-realtime:vdso: 18 nsec/call18 nsec/call
>> clock-getres-realtime: syscall: 342 nsec/call372 nsec/call
>> clock-getres-realtime:libc: 19 nsec/call 19 nsec/call
>> clock-getres-realtime:vdso: 10 nsec/call 10 nsec/call
>> clock-gettime-realtime-coarse: syscall: 515 nsec/call373 nsec/call
>> clock-gettime-realtime-coarse:libc: 23 nsec/call 22 nsec/call
>> clock-gettime-realtime-coarse:vdso: 14 nsec/call 13 nsec/call
>
> I think you should only put the measurements on vdso calls, and only the 
> ones that are impacted by the change. For example, getres function 
> doesn't use __get_datapage so showing it here is pointless.
>
> gettimeofday should be shown there as it uses __get_datapage()
>
>
>> 
>> Based on the patch by Christophe Leroy  for vdso32.
>> 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>> 
>> except for a couple of calls (1 or 2 nsec reduction), there are no
>> improvements in the call times. Or is 10 nsec the minimum granularity??
>
> Maybe the ones that show no improvements are the ones that don't use 
> __get_datapage() at all ...

Yes makes sense.

>
>> 
>> So I don't know if it's even worth updating vdso64 except to keep vdso32 and
>> vdso64 equal.
>
> 2ns on a 15ns call is 13% so it is worth it I think.

True. Since datapage.h is the same for both 32 and 64, maybe we should put
it in include/asm.

Thanks,
Santosh
>
> Christophe
>
>
>> 
>> 
>>   arch/powerpc/kernel/vdso64/cacheflush.S   | 10 
>>   arch/powerpc/kernel/vdso64/datapage.S | 29 ---
>>   arch/powerpc/kernel/vdso64/datapage.h | 10 
>>   arch/powerpc/kernel/vdso64/gettimeofday.S |  8 ---
>>   4 files changed, 24 insertions(+), 33 deletions(-)
>>   create mode 100644 arch/powerpc/kernel/vdso64/datapage.h
>> 
>> diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S 
>> b/arch/powerpc/kernel/vdso64/cacheflush.S
>> index 3f92561a64c4..30e8b0d29bea 100644
>> --- a/arch/powerpc/kernel/vdso64/cacheflush.S
>> +++ b/arch/powerpc/kernel/vdso64/cacheflush.S
>> @@ -10,6 +10,8 @@
>>   #include 
>>   #include 
>>   
>> +#include "datapage.h"
>> +
>>  .text
>>   
>>   /*
>> @@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>> .cfi_startproc
>>  mflrr12
>> .cfi_register lr,r12
>> -mr  r11,r3
>> -bl  V_LOCAL_FUNC(__get_datapage)
>> +get_datapager11, r0
>>  mtlrr12
>> -mr  r10,r3
>>   
>>  lwz r7,CFG_DCACHE_BLOCKSZ(r10)
>>  addir5,r7,-1
>> -andcr6,r11,r5   /* round low to line bdy */
>> +andcr6,r3,r5/* round low to line bdy */
>>  subfr8,r6,r4/* compute length */
>>  add r8,r8,r5/* ensure we get enough */
>>  lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
>> @@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>>   
>>  lwz r7,CFG_ICACHE_BLOCKSZ(r10)
>>  addir5,r7,-1
>> -andcr6,r11,r5   /* round low to line bdy */
>> +andcr6,r3,r5/* round low to line bdy */
>>  subfr8,r6,r4/* compute length */
>>  add r8,r8,r5
>>  lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
>> diff --git a/arch/powerpc/kernel/vdso64/datapage.S 
>> b/arch/powerpc/kernel/vdso64/datapage.S
>> index dc84f5ae3802..8712f57c931c 100644
>> --- a/arch/powerpc/kernel/vdso64/datapage.S
>> +++ b/arch/powerpc/kernel/vdso64/datapage.S
>> @@ -11,34 +11,13 @@
>>   #include 
>>   #include 
>>   
>> +#include "datapage.h"
>> +
>>  .text
>>   .global__kernel_datapage_offset;
>>   __kernel_datapage_offset:
>>  .long   0
>>   
>> -V_FUNCTION_BEGIN(__get_datapage)
>> -  .cfi_startproc
>> -/* We don't want that exposed or overridable as we want other objects
>> - * to be able to bl directly to here
>> - */
>> -.protected __get_datapage
>> -.hidden 

Regression fix for bpf in v5.3 (was Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding)

2019-08-21 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> Since BPF constant blinding is performed after the verifier pass, there
> are certain ALU32 instructions inserted which don't have a corresponding
> zext instruction inserted after. This is causing a kernel oops on
> powerpc and can be reproduced by running 'test_cgroup_storage' with
> bpf_jit_harden=2.
>
> Fix this by emitting BPF_ZEXT during constant blinding if
> prog->aux->verifier_zext is set.
>
> Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to 
> analysis result")
> Reported-by: Michael Ellerman 
> Signed-off-by: Naveen N. Rao 
> ---
> This approach (the location where zext is being introduced below, in 
> particular) works for powerpc, but I am not entirely sure if this is 
> sufficient for other architectures as well. This is broken on v5.3-rc4.

Any comment on this?

This is a regression in v5.3 which results in a kernel crash; it would
be nice to get it fixed before the release, please?

cheers

> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 8191a7db2777..d84146e6fd9e 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -890,7 +890,8 @@ int bpf_jit_get_func_addr(const struct bpf_prog *prog,
>  
>  static int bpf_jit_blind_insn(const struct bpf_insn *from,
> const struct bpf_insn *aux,
> -   struct bpf_insn *to_buff)
> +   struct bpf_insn *to_buff,
> +   bool emit_zext)
>  {
>   struct bpf_insn *to = to_buff;
>   u32 imm_rnd = get_random_int();
> @@ -939,6 +940,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn *from,
>   *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
>   *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
>   *to++ = BPF_ALU32_REG(from->code, from->dst_reg, BPF_REG_AX);
> + if (emit_zext)
> + *to++ = BPF_ZEXT_REG(from->dst_reg);
>   break;
>  
>   case BPF_ALU64 | BPF_ADD | BPF_K:
> @@ -992,6 +995,10 @@ static int bpf_jit_blind_insn(const struct bpf_insn 
> *from,
>   off -= 2;
>   *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ from->imm);
>   *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
> + if (emit_zext) {
> + *to++ = BPF_ZEXT_REG(BPF_REG_AX);
> + off--;
> + }
>   *to++ = BPF_JMP32_REG(from->code, from->dst_reg, BPF_REG_AX,
> off);
>   break;
> @@ -1005,6 +1012,8 @@ static int bpf_jit_blind_insn(const struct bpf_insn 
> *from,
>   case 0: /* Part 2 of BPF_LD | BPF_IMM | BPF_DW. */
>   *to++ = BPF_ALU32_IMM(BPF_MOV, BPF_REG_AX, imm_rnd ^ 
> aux[0].imm);
>   *to++ = BPF_ALU32_IMM(BPF_XOR, BPF_REG_AX, imm_rnd);
> + if (emit_zext)
> + *to++ = BPF_ZEXT_REG(BPF_REG_AX);
>   *to++ = BPF_ALU64_REG(BPF_OR,  aux[0].dst_reg, BPF_REG_AX);
>   break;
>  
> @@ -1088,7 +1097,8 @@ struct bpf_prog *bpf_jit_blind_constants(struct 
> bpf_prog *prog)
>   insn[1].code == 0)
>   memcpy(aux, insn, sizeof(aux));
>  
> - rewritten = bpf_jit_blind_insn(insn, aux, insn_buff);
> + rewritten = bpf_jit_blind_insn(insn, aux, insn_buff,
> + clone->aux->verifier_zext);
>   if (!rewritten)
>   continue;
>  
> -- 
> 2.22.0


[PATCH] powerpc/8xx: set STACK_END_MAGIC earlier on the init_stack

2019-08-21 Thread Christophe Leroy
Today, STACK_END_MAGIC is set on init_stack in start_kernel().

To avoid a false 'Thread overran stack, or stack corrupted' message
on early Oopses, set up STACK_END_MAGIC as soon as possible.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_8xx.S | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 5ab9178c2347..b8ca5b43e587 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -15,6 +15,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -741,6 +742,9 @@ start_here:
/* stack */
lis r1,init_thread_union@ha
addir1,r1,init_thread_union@l
+   lis r0, STACK_END_MAGIC@h
+   ori r0, r0, STACK_END_MAGIC@l
+   stw r0, 0(r1)
li  r0,0
stwur0,THREAD_SIZE-STACK_FRAME_OVERHEAD(r1)
 
-- 
2.13.3



[PATCH 2/2] powerpc/mm: Fix an Oops in kasan_mmu_init()

2019-08-21 Thread Christophe Leroy
   Uncompressing Kernel Image ... OK
   Loading Device Tree to 01ff7000, end 01fff74f ... OK
[0.00] printk: bootconsole [udbg0] enabled
[0.00] BUG: Unable to handle kernel data access at 0xf818c000
[0.00] Faulting instruction address: 0xc0013c7c
[0.00] Thread overran stack, or stack corrupted
[0.00] Oops: Kernel access of bad area, sig: 11 [#1]
[0.00] BE PAGE_SIZE=16K PREEMPT
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty #2080
[0.00] NIP:  c0013c7c LR: c0013310 CTR: 
[0.00] REGS: c0c5ff38 TRAP: 0300   Not tainted  
(5.3.0-rc4-s3k-dev-00743-g5abe4a3e8fd3-dirty)
[0.00] MSR:  1032   CR: 99033955  XER: 80002100
[0.00] DAR: f818c000 DSISR: 8200
[0.00] GPR00: c0013310 c0c5fff0 c0ad6ac0 c0c600c0 f818c031 8200 
 
[0.00] GPR08:  f1f1f1f1 c0013c2c c0013304 99033955 0048 
 07ff9598
[0.00] GPR16:  07ffb94c     
 f818cfb2
[0.00] GPR24:   1000   c07dbf80 
 f818c000
[0.00] NIP [c0013c7c] do_page_fault+0x50/0x904
[0.00] LR [c0013310] handle_page_fault+0xc/0x38
[0.00] Call Trace:
[0.00] Instruction dump:
[0.00] be010080 91410014 553fe8fe 3d40c001 3d20f1f1 7d800026 394a3c2c 
3fffe000
[0.00] 6129f1f1 900100c4 9181007c 91410018 <913f> 3d2001f4 6129f4f4 
913f0004

Don't map the early shadow page read-only yet when creating the new
page tables for the real shadow memory; otherwise the memblock
allocations that immediately follow, to create the real shadow pages
that are about to replace the early shadow page, trigger a page fault
if they fall into the region being worked on at the moment.

Signed-off-by: Christophe Leroy 
Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support")
Cc: sta...@vger.kernel.org
---
 arch/powerpc/mm/kasan/kasan_init_32.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index e8ab3cc5f6e4..0e6ed4413eea 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -34,7 +34,7 @@ static int __ref kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned l
 {
pmd_t *pmd;
unsigned long k_cur, k_next;
-   pgprot_t prot = kasan_prot_ro();
+   pgprot_t prot = slab_is_available() ? kasan_prot_ro() : PAGE_KERNEL;
 
pmd = pmd_offset(pud_offset(pgd_offset_k(k_start), k_start), k_start);
 
@@ -110,9 +110,22 @@ static int __ref kasan_init_region(void *start, size_t 
size)
 static void __init kasan_remap_early_shadow_ro(void)
 {
pgprot_t prot = kasan_prot_ro();
+   unsigned long k_start = KASAN_SHADOW_START;
+   unsigned long k_end = KASAN_SHADOW_END;
+   unsigned long k_cur;
+   phys_addr_t pa = __pa(kasan_early_shadow_page);
 
kasan_populate_pte(kasan_early_shadow_pte, prot);
 
+   for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) {
+   pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), 
k_cur);
+   pte_t *ptep = pte_offset_kernel(pmd, k_cur);
+
+   if ((pte_val(*ptep) & PTE_RPN_MASK) != pa)
+   continue;
+
+		__set_pte_at(&init_mm, k_cur, ptep, pfn_pte(PHYS_PFN(pa), prot), 0);
+   }
flush_tlb_kernel_range(KASAN_SHADOW_START, KASAN_SHADOW_END);
 }
 
-- 
2.13.3



[PATCH 1/2] powerpc/mm: Add a helper to select PAGE_KERNEL_RO or PAGE_READONLY

2019-08-21 Thread Christophe Leroy
In a couple of places there is a need to select whether read-only
protection of shadow pages is performed with PAGE_KERNEL_RO or with
PAGE_READONLY.

Add a helper to avoid duplicating the choice.

Signed-off-by: Christophe Leroy 
Cc: sta...@vger.kernel.org

---
This small series applies on top of patch "powerpc/kasan: fix parallele
loading of modules" (https://patchwork.ozlabs.org/patch/1144757/).
Otherwise a manual merge has to be done (the modifications are
independent but close to each other in the file).
---
 arch/powerpc/mm/kasan/kasan_init_32.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c 
b/arch/powerpc/mm/kasan/kasan_init_32.c
index 802387b231ad..e8ab3cc5f6e4 100644
--- a/arch/powerpc/mm/kasan/kasan_init_32.c
+++ b/arch/powerpc/mm/kasan/kasan_init_32.c
@@ -12,6 +12,14 @@
 #include 
 #include 
 
+static pgprot_t kasan_prot_ro(void)
+{
+   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
+   return PAGE_READONLY;
+
+   return PAGE_KERNEL_RO;
+}
+
 static void kasan_populate_pte(pte_t *ptep, pgprot_t prot)
 {
unsigned long va = (unsigned long)kasan_early_shadow_page;
@@ -26,6 +34,7 @@ static int __ref kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned l
 {
pmd_t *pmd;
unsigned long k_cur, k_next;
+   pgprot_t prot = kasan_prot_ro();
 
pmd = pmd_offset(pud_offset(pgd_offset_k(k_start), k_start), k_start);
 
@@ -43,10 +52,7 @@ static int __ref kasan_init_shadow_page_tables(unsigned long 
k_start, unsigned l
 
if (!new)
return -ENOMEM;
-   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
-   kasan_populate_pte(new, PAGE_READONLY);
-   else
-   kasan_populate_pte(new, PAGE_KERNEL_RO);
+   kasan_populate_pte(new, prot);
 
smp_wmb(); /* See comment in __pte_alloc */
 
@@ -103,10 +109,9 @@ static int __ref kasan_init_region(void *start, size_t 
size)
 
 static void __init kasan_remap_early_shadow_ro(void)
 {
-   if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
-   kasan_populate_pte(kasan_early_shadow_pte, PAGE_READONLY);
-   else
-   kasan_populate_pte(kasan_early_shadow_pte, PAGE_KERNEL_RO);
+   pgprot_t prot = kasan_prot_ro();
+
+   kasan_populate_pte(kasan_early_shadow_pte, prot);
 
flush_tlb_kernel_range(KASAN_SHADOW_START, KASAN_SHADOW_END);
 }
-- 
2.13.3



[PATCH] powerpc/mm: drop #ifdef CONFIG_MMU in is_ioremap_addr()

2019-08-21 Thread Christophe Leroy
powerpc always selects CONFIG_MMU and CONFIG_MMU is not checked
anywhere else in powerpc code.

Drop the #ifdef and the alternative part of is_ioremap_addr()

Fixes: 9bd3bb6703d8 ("mm/nvdimm: add is_ioremap_addr and use that to check 
ioremap address")
Signed-off-by: Christophe Leroy 
Cc: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index c58ba7963688..fe77129c5055 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -168,13 +168,9 @@ static inline bool pgd_is_leaf(pgd_t pgd)
 #define is_ioremap_addr is_ioremap_addr
 static inline bool is_ioremap_addr(const void *x)
 {
-#ifdef CONFIG_MMU
unsigned long addr = (unsigned long)x;
 
return addr >= IOREMAP_BASE && addr < IOREMAP_END;
-#else
-   return false;
-#endif
 }
 #endif /* CONFIG_PPC64 */
 
-- 
2.13.3



Re: [PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Christophe Leroy




Le 21/08/2019 à 11:29, Santosh Sivaraj a écrit :

__get_datapage() is only a few instructions to retrieve the
address of the page where the kernel stores data to the VDSO.

By inlining this function into its users, a bl/blr pair and
a mflr/mtlr pair is avoided, plus a few reg moves.

clock-gettime-monotonic: syscall: 514 nsec/call  396 nsec/call
clock-gettime-monotonic:libc: 25 nsec/call   24 nsec/call
clock-gettime-monotonic:vdso: 20 nsec/call   20 nsec/call
clock-getres-monotonic: syscall: 347 nsec/call   372 nsec/call
clock-getres-monotonic:libc: 19 nsec/call19 nsec/call
clock-getres-monotonic:vdso: 10 nsec/call10 nsec/call
clock-gettime-monotonic-coarse: syscall: 511 nsec/call   396 nsec/call
clock-gettime-monotonic-coarse:libc: 23 nsec/call21 nsec/call
clock-gettime-monotonic-coarse:vdso: 15 nsec/call13 nsec/call
clock-gettime-realtime: syscall: 526 nsec/call   405 nsec/call
clock-gettime-realtime:libc: 24 nsec/call23 nsec/call
clock-gettime-realtime:vdso: 18 nsec/call18 nsec/call
clock-getres-realtime: syscall: 342 nsec/call372 nsec/call
clock-getres-realtime:libc: 19 nsec/call 19 nsec/call
clock-getres-realtime:vdso: 10 nsec/call 10 nsec/call
clock-gettime-realtime-coarse: syscall: 515 nsec/call373 nsec/call
clock-gettime-realtime-coarse:libc: 23 nsec/call 22 nsec/call
clock-gettime-realtime-coarse:vdso: 14 nsec/call 13 nsec/call


I think you should only put the measurements on vdso calls, and only the 
ones that are impacted by the change. For example, getres function 
doesn't use __get_datapage so showing it here is pointless.


gettimeofday should be shown there as it uses __get_datapage()




Based on the patch by Christophe Leroy  for vdso32.

Signed-off-by: Santosh Sivaraj 
---

except for a couple of calls (1 or 2 nsec reduction), there are no
improvements in the call times. Or is 10 nsec the minimum granularity??


Maybe the ones that show no improvements are the ones that don't use 
__get_datapage() at all ...




So I don't know if it's even worth updating vdso64 except to keep vdso32 and
vdso64 equal.


2ns on a 15ns call is 13% so it is worth it I think.

Christophe





  arch/powerpc/kernel/vdso64/cacheflush.S   | 10 
  arch/powerpc/kernel/vdso64/datapage.S | 29 ---
  arch/powerpc/kernel/vdso64/datapage.h | 10 
  arch/powerpc/kernel/vdso64/gettimeofday.S |  8 ---
  4 files changed, 24 insertions(+), 33 deletions(-)
  create mode 100644 arch/powerpc/kernel/vdso64/datapage.h

diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S 
b/arch/powerpc/kernel/vdso64/cacheflush.S
index 3f92561a64c4..30e8b0d29bea 100644
--- a/arch/powerpc/kernel/vdso64/cacheflush.S
+++ b/arch/powerpc/kernel/vdso64/cacheflush.S
@@ -10,6 +10,8 @@
  #include 
  #include 
  
+#include "datapage.h"

+
.text
  
  /*

@@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
.cfi_startproc
mflrr12
.cfi_register lr,r12
-   mr  r11,r3
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager11, r0
mtlrr12
-   mr  r10,r3
  
  	lwz	r7,CFG_DCACHE_BLOCKSZ(r10)

addir5,r7,-1
-   andcr6,r11,r5   /* round low to line bdy */
+   andcr6,r3,r5/* round low to line bdy */
subfr8,r6,r4/* compute length */
add r8,r8,r5/* ensure we get enough */
lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
@@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
  
  	lwz	r7,CFG_ICACHE_BLOCKSZ(r10)

addir5,r7,-1
-   andcr6,r11,r5   /* round low to line bdy */
+   andcr6,r3,r5/* round low to line bdy */
subfr8,r6,r4/* compute length */
add r8,r8,r5
lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
diff --git a/arch/powerpc/kernel/vdso64/datapage.S 
b/arch/powerpc/kernel/vdso64/datapage.S
index dc84f5ae3802..8712f57c931c 100644
--- a/arch/powerpc/kernel/vdso64/datapage.S
+++ b/arch/powerpc/kernel/vdso64/datapage.S
@@ -11,34 +11,13 @@
  #include 
  #include 
  
+#include "datapage.h"

+
.text
  .global   __kernel_datapage_offset;
  __kernel_datapage_offset:
.long   0
  
-V_FUNCTION_BEGIN(__get_datapage)

-  .cfi_startproc
-   /* We don't want that exposed or overridable as we want other objects
-* to be able to bl directly to here
-*/
-   .protected __get_datapage
-   .hidden __get_datapage
-
-   mflrr0
-  .cfi_register lr,r0
-
-   bcl 20,31,data_page_branch
-data_page_branch:
-   mflrr3
-   mtlrr0
-   addir3, r3, __kernel_datapage_offset-data_page_branch
-   lwz r0,0(r3)
-  .cfi_restore lr
-   add r3,r0,r3
-   blr
-  .cfi_endproc
-V_FUNCTION_END(__get_datapage)
-
  /*
   * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;

[PATCH] powerpc/vdso64: inline __get_datapage()

2019-08-21 Thread Santosh Sivaraj
__get_datapage() is only a few instructions to retrieve the
address of the page where the kernel stores data to the VDSO.

By inlining this function into its users, a bl/blr pair and
a mflr/mtlr pair is avoided, plus a few reg moves.

clock-gettime-monotonic: syscall: 514 nsec/call  396 nsec/call
clock-gettime-monotonic:libc: 25 nsec/call   24 nsec/call
clock-gettime-monotonic:vdso: 20 nsec/call   20 nsec/call
clock-getres-monotonic: syscall: 347 nsec/call   372 nsec/call
clock-getres-monotonic:libc: 19 nsec/call19 nsec/call
clock-getres-monotonic:vdso: 10 nsec/call10 nsec/call
clock-gettime-monotonic-coarse: syscall: 511 nsec/call   396 nsec/call
clock-gettime-monotonic-coarse:libc: 23 nsec/call21 nsec/call
clock-gettime-monotonic-coarse:vdso: 15 nsec/call13 nsec/call
clock-gettime-realtime: syscall: 526 nsec/call   405 nsec/call
clock-gettime-realtime:libc: 24 nsec/call23 nsec/call
clock-gettime-realtime:vdso: 18 nsec/call18 nsec/call
clock-getres-realtime: syscall: 342 nsec/call372 nsec/call
clock-getres-realtime:libc: 19 nsec/call 19 nsec/call
clock-getres-realtime:vdso: 10 nsec/call 10 nsec/call
clock-gettime-realtime-coarse: syscall: 515 nsec/call373 nsec/call
clock-gettime-realtime-coarse:libc: 23 nsec/call 22 nsec/call
clock-gettime-realtime-coarse:vdso: 14 nsec/call 13 nsec/call

Based on the patch by Christophe Leroy  for vdso32.

Signed-off-by: Santosh Sivaraj 
---

except for a couple of calls (1 or 2 nsec reduction), there are no
improvements in the call times. Or is 10 nsec the minimum granularity??

So I don't know if it's even worth updating vdso64 except to keep vdso32 and
vdso64 equal.


 arch/powerpc/kernel/vdso64/cacheflush.S   | 10 
 arch/powerpc/kernel/vdso64/datapage.S | 29 ---
 arch/powerpc/kernel/vdso64/datapage.h | 10 
 arch/powerpc/kernel/vdso64/gettimeofday.S |  8 ---
 4 files changed, 24 insertions(+), 33 deletions(-)
 create mode 100644 arch/powerpc/kernel/vdso64/datapage.h

diff --git a/arch/powerpc/kernel/vdso64/cacheflush.S 
b/arch/powerpc/kernel/vdso64/cacheflush.S
index 3f92561a64c4..30e8b0d29bea 100644
--- a/arch/powerpc/kernel/vdso64/cacheflush.S
+++ b/arch/powerpc/kernel/vdso64/cacheflush.S
@@ -10,6 +10,8 @@
 #include 
 #include 
 
+#include "datapage.h"
+
.text
 
 /*
@@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
   .cfi_startproc
mflrr12
   .cfi_register lr,r12
-   mr  r11,r3
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager11, r0
mtlrr12
-   mr  r10,r3
 
lwz r7,CFG_DCACHE_BLOCKSZ(r10)
addir5,r7,-1
-   andcr6,r11,r5   /* round low to line bdy */
+   andcr6,r3,r5/* round low to line bdy */
subfr8,r6,r4/* compute length */
add r8,r8,r5/* ensure we get enough */
lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
@@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
 
lwz r7,CFG_ICACHE_BLOCKSZ(r10)
addir5,r7,-1
-   andcr6,r11,r5   /* round low to line bdy */
+   andcr6,r3,r5/* round low to line bdy */
subfr8,r6,r4/* compute length */
add r8,r8,r5
lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
diff --git a/arch/powerpc/kernel/vdso64/datapage.S 
b/arch/powerpc/kernel/vdso64/datapage.S
index dc84f5ae3802..8712f57c931c 100644
--- a/arch/powerpc/kernel/vdso64/datapage.S
+++ b/arch/powerpc/kernel/vdso64/datapage.S
@@ -11,34 +11,13 @@
 #include 
 #include 
 
+#include "datapage.h"
+
.text
 .global__kernel_datapage_offset;
 __kernel_datapage_offset:
.long   0
 
-V_FUNCTION_BEGIN(__get_datapage)
-  .cfi_startproc
-   /* We don't want that exposed or overridable as we want other objects
-* to be able to bl directly to here
-*/
-   .protected __get_datapage
-   .hidden __get_datapage
-
-   mflrr0
-  .cfi_register lr,r0
-
-   bcl 20,31,data_page_branch
-data_page_branch:
-   mflrr3
-   mtlrr0
-   addir3, r3, __kernel_datapage_offset-data_page_branch
-   lwz r0,0(r3)
-  .cfi_restore lr
-   add r3,r0,r3
-   blr
-  .cfi_endproc
-V_FUNCTION_END(__get_datapage)
-
 /*
  * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;
  *
@@ -53,7 +32,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map)
mflrr12
   .cfi_register lr,r12
mr  r4,r3
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager3, r0
mtlrr12
addir3,r3,CFG_SYSCALL_MAP64
cmpldi  cr0,r4,0
@@ -75,7 +54,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq)
   .cfi_startproc
mflrr12
   .cfi_register lr,r12
-   bl  V_LOCAL_FUNC(__get_datapage)
+   get_datapager3, r0
ld  

Re: [RFC PATCH] bpf: handle 32-bit zext during constant blinding

2019-08-21 Thread Naveen N. Rao

Naveen N. Rao wrote:

Since BPF constant blinding is performed after the verifier pass, there
are certain ALU32 instructions inserted which don't have a corresponding
zext instruction inserted after. This is causing a kernel oops on
powerpc and can be reproduced by running 'test_cgroup_storage' with
bpf_jit_harden=2.

Fix this by emitting BPF_ZEXT during constant blinding if
prog->aux->verifier_zext is set.

Fixes: a4b1d3c1ddf6cb ("bpf: verifier: insert zero extension according to analysis 
result")
Reported-by: Michael Ellerman 
Signed-off-by: Naveen N. Rao 
---
This approach (the location where zext is being introduced below, in 
particular) works for powerpc, but I am not entirely sure if this is 
sufficient for other architectures as well. This is broken on v5.3-rc4.


Alexei, Daniel, Jiong,
Any feedback on this?

- Naveen



[PATCH v3 2/2] powerpc/powernv: Add new opal message type

2019-08-21 Thread Vasant Hegde
We have the OPAL_MSG_PRD message type to pass prd-related messages from OPAL
to `opal-prd`. It can handle messages up to 64 bytes. We have a requirement
to send more than 64 bytes of data from OPAL to `opal-prd`. Let's add a new
message type (OPAL_MSG_PRD2) to pass the bigger data.

Cc: Jeremy Kerr 
Signed-off-by: Vasant Hegde 
---
 arch/powerpc/include/asm/opal-api.h   | 1 +
 arch/powerpc/platforms/powernv/opal-prd.c | 9 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index 383242eb0dea..1cad413e1e0e 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -453,6 +453,7 @@ enum opal_msg_type {
OPAL_MSG_DPO= 5,
OPAL_MSG_PRD= 6,
OPAL_MSG_OCC= 7,
+   OPAL_MSG_PRD2   = 8,
OPAL_MSG_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index e072bf157d62..50a735d77192 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -342,7 +342,7 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
int msg_size, item_size;
unsigned long flags;
 
-   if (msg_type != OPAL_MSG_PRD)
+   if (msg_type != OPAL_MSG_PRD && msg_type != OPAL_MSG_PRD2)
return 0;
 
/* Calculate total size of the message and item we need to store. The
@@ -393,6 +393,13 @@ static int opal_prd_probe(struct platform_device *pdev)
return rc;
}
 
+   rc = opal_message_notifier_register(OPAL_MSG_PRD2, &opal_prd_event_nb);
+   if (rc) {
+   pr_err("%s: Couldn't register event notifier (%d)\n",
+  __func__, OPAL_MSG_PRD2);
+   return rc;
+   }
+
rc = misc_register(_prd_dev);
if (rc) {
pr_err("failed to register miscdev\n");
-- 
2.21.0



[PATCH v3 1/2] powerpc/powernv: Enhance opal message read interface

2019-08-21 Thread Vasant Hegde
Use "opal-msg-size" device tree property to allocate memory for "opal_msg".

Cc: Mahesh Salgaonkar 
Cc: Jeremy Kerr 
Signed-off-by: Vasant Hegde 
---
Changes in v3:
  - Call BUG_ON() if we fail to allocate memory during init.

-Vasant

 arch/powerpc/platforms/powernv/opal.c | 29 ++-
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index aba443be7daa..4f1f68f568bf 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -58,6 +58,8 @@ static DEFINE_SPINLOCK(opal_write_lock);
 static struct atomic_notifier_head opal_msg_notifier_head[OPAL_MSG_TYPE_MAX];
 static uint32_t opal_heartbeat;
 static struct task_struct *kopald_tsk;
+static struct opal_msg *opal_msg;
+static uint64_t opal_msg_size;
 
 void opal_configure_cores(void)
 {
@@ -271,14 +273,9 @@ static void opal_message_do_notify(uint32_t msg_type, void 
*msg)
 static void opal_handle_message(void)
 {
s64 ret;
-   /*
-* TODO: pre-allocate a message buffer depending on opal-msg-size
-* value in /proc/device-tree.
-*/
-   static struct opal_msg msg;
u32 type;
 
-   ret = opal_get_msg(__pa(&msg), sizeof(msg));
+   ret = opal_get_msg(__pa(opal_msg), opal_msg_size);
/* No opal message pending. */
if (ret == OPAL_RESOURCE)
return;
@@ -290,14 +287,14 @@ static void opal_handle_message(void)
return;
}
 
-   type = be32_to_cpu(msg.msg_type);
+   type = be32_to_cpu(opal_msg->msg_type);
 
/* Sanity check */
if (type >= OPAL_MSG_TYPE_MAX) {
pr_warn_once("%s: Unknown message type: %u\n", __func__, type);
return;
}
-   opal_message_do_notify(type, (void *)&msg);
+   opal_message_do_notify(type, (void *)opal_msg);
 }
 
 static irqreturn_t opal_message_notify(int irq, void *data)
@@ -306,9 +303,21 @@ static irqreturn_t opal_message_notify(int irq, void *data)
return IRQ_HANDLED;
 }
 
-static int __init opal_message_init(void)
+static int __init opal_message_init(struct device_node *opal_node)
 {
int ret, i, irq;
+   const __be32 *val;
+
+   val = of_get_property(opal_node, "opal-msg-size", NULL);
+   if (val)
+   opal_msg_size = be32_to_cpup(val);
+
+   /* If opal-msg-size property is not available then use default size */
+   if (!opal_msg_size)
+   opal_msg_size = sizeof(struct opal_msg);
+
+   opal_msg = kmalloc(opal_msg_size, GFP_KERNEL);
+   BUG_ON(opal_msg == NULL);
 
for (i = 0; i < OPAL_MSG_TYPE_MAX; i++)
		ATOMIC_INIT_NOTIFIER_HEAD(&opal_msg_notifier_head[i]);
@@ -910,7 +919,7 @@ static int __init opal_init(void)
}
 
/* Initialise OPAL messaging system */
-   opal_message_init();
+   opal_message_init(opal_node);
 
/* Initialise OPAL asynchronous completion interface */
opal_async_comp_init();
-- 
2.21.0



[PATCH] soc/fsl/qbman: fix return value error in bm_shutdown_pool()

2019-08-21 Thread Jason Yan
Commit 0505d00c8dba ("soc/fsl/qbman: Cleanup buffer pools if BMan was
initialized prior to bootup") defined a new variable to store the return
error, but forgot to return this value at the end of the function.

Fixes: 0505d00c8dba ("soc/fsl/qbman: Cleanup buffer pools if BMan was 
initialized prior to bootup")
Signed-off-by: Jason Yan 
---
 drivers/soc/fsl/qbman/bman.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c
index f4fb527d8301..c5dd026fe889 100644
--- a/drivers/soc/fsl/qbman/bman.c
+++ b/drivers/soc/fsl/qbman/bman.c
@@ -660,7 +660,7 @@ int bm_shutdown_pool(u32 bpid)
}
 done:
put_affine_portal();
-   return 0;
+   return err;
 }
 
 struct gen_pool *bm_bpalloc;
-- 
2.17.2



Re: ##freemail## Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-08-21 Thread Wanpeng Li
On Wed, 21 Aug 2019 at 13:41, Naoya Horiguchi  wrote:
>
> On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> > Cc Mel Gorman, Kirill, Dave Hansen,
> > On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi  
> > wrote:
> > >
> > > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > > > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > > > Cc Paolo,
> > > > > Hi all,
> > > > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  
> > > > > wrote:
> > > > >>
> > > > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > > > >>> Andrew Morton  writes:
> > > > >>>
> > > >  On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal 
> > > >   wrote:
> > > > 
> > > > >>
> > > > >> So I don't think that the above test result means that errors 
> > > > >> are properly
> > > > >> handled, and the proposed patch should help for arm64.
> > > > >
> > > > > Although, the deviation of pud_huge() avoids a kernel crash the 
> > > > > code
> > > > > would be easier to maintain and reason about if arm64 helpers are
> > > > > consistent with expectations by core code.
> > > > >
> > > > > I'll look to update the arm64 helpers once this patch gets 
> > > > > merged. But
> > > > > it would be helpful if there was a clear expression of semantics 
> > > > > for
> > > > > pud_huge() for various cases. Is there any version that can be 
> > > > > used as
> > > > > reference?
> > > > 
> > > >  Is that an ack or tested-by?
> > > > 
> > > >  Mike keeps plaintively asking the powerpc developers to take a 
> > > >  look,
> > > >  but they remain steadfastly in hiding.
> > > > >>>
> > > > >>> Cc'ing linuxppc-dev is always a good idea :)
> > > > >>>
> > > > >>
> > > > >> Thanks Michael,
> > > > >>
> > > > >> I was mostly concerned about use cases for soft/hard offline of huge 
> > > > >> pages
> > > > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports 
> > > > >> PGD_SIZE
> > > > >> huge pages, and soft/hard offline support was specifically added for 
> > > > >> this.
> > > > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB 
> > > > >> pages
> > > > >> at PGD level"
> > > > >>
> > > > >> This patch will disable that functionality.  So, at a minimum this 
> > > > >> is a
> > > > >> 'heads up'.  If there are actual use cases that depend on this, then 
> > > > >> more
> > > > >> work/discussions will need to happen.  From the e-mail thread on 
> > > > >> PGD_SIZE
> > > > >> support, I can not tell if there is a real use case or this is just a
> > > > >> 'nice to have'.
> > > > >
> > > > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > > > encounter gup_pud_range() panic several times in product environment.
> > > > > Is there any plan to reenable and fix arch codes?
> > > >
> > > > I too am aware of slightly more interest in 1G huge pages.  Suspect 
> > > > that as
> > > > Intel MMU capacity increases to handle more TLB entries there will be 
> > > > more
> > > > and more interest.
> > > >
> > > > Personally, I am not looking at this issue.  Perhaps Naoya will comment 
> > > > as
> > > > he know most about this code.
> > >
> > > Thanks for forwarding this to me, I'm feeling that memory error handling
> > > on 1GB hugepage is demanded as real use case.
> > >
> > > >
> > > > > In addition, 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > > > hugetlbfs page. We have a plan to split this 1GB hugetlbfs page by 2MB
> > > > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > > > > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > > > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > > > > will be delivered to VM at page fault next time for the offensive
> > > > > SPTE. Is this proposal acceptable?
> > > >
> > > > I am not sure of the error handling design, but this does sound 
> > > > reasonable.
> > >
> > > I agree that that's better.
> > >
> > > > That block of code which potentially dissolves a huge page on memory 
> > > > error
> > > > is hard to understand and I'm not sure if that is even the 'normal'
> > > > functionality.  Certainly, we would hate to waste/poison an entire 1G 
> > > > page
> > > > for an error on a small subsection.
> > >
> > > Yes, that's not practical, so we need at first establish the code base for
> > > 2GB hugetlb splitting and then extending it to 1GB next.
> >
> > I found it is not easy to split. There is a unique hugetlb page size
> > that is 

[PATCH 3/3] powerpc/pcidn: Warn when sriov pci_dn management is used incorrectly

2019-08-21 Thread Oliver O'Halloran
These functions can only be used on an SR-IOV-capable physical function and
they're only called in pcibios_sriov_enable / disable. Make them emit a
warning if they're used incorrectly in the future, and remove the dead
code that checks if the device is a VF.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/kernel/pci_dn.c | 17 +++--
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 24da1d8..69dafc3 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -158,8 +158,8 @@ struct pci_dn *add_sriov_vf_pdns(struct pci_dev *pdev)
int i;
 
/* Only support IOV for now */
-   if (!pdev->is_physfn)
-   return pci_get_pdn(pdev);
+   if (WARN_ON(!pdev->is_physfn))
+   return NULL;
 
/* Check if VFs have been populated */
pdn = pci_get_pdn(pdev);
@@ -199,19 +199,8 @@ void remove_sriov_vf_pdns(struct pci_dev *pdev)
struct pci_dn *pdn, *tmp;
int i;
 
-   /*
-* VF and VF PE are created/released dynamically, so we need to
-* bind/unbind them.  Otherwise the VF and VF PE would be mismatched
-* when re-enabling SR-IOV.
-*/
-   if (pdev->is_virtfn) {
-   pdn = pci_get_pdn(pdev);
-   pdn->pe_number = IODA_INVALID_PE;
-   return;
-   }
-
/* Only support IOV PF for now */
-   if (!pdev->is_physfn)
+   if (WARN_ON(!pdev->is_physfn))
return;
 
/* Check if VFs have been populated */
-- 
2.9.5



[PATCH 2/3] powerpc/pcidn: Make VF pci_dn management CONFIG_PCI_IOV specific

2019-08-21 Thread Oliver O'Halloran
The powerpc PCI code requires that a pci_dn structure exists for all
devices in the system. This is fine for real devices since at boot a pci_dn
is created for each PCI device in the DT and it's fine for hotplugged devices
since the hotplug slot driver will manage the pci_dn's for devices in hotplug
slots. For SR-IOV, we need the platform / pcibios to manage the pci_dn for
virtual functions since firmware is unaware of VFs, and they aren't
"hot plugged" in the traditional sense.

Management of the pci_dn is handled by the, poorly named, functions:
add_dev_pci_data() and remove_dev_pci_data(). The entire body of these
functions is #ifdef'ed around CONFIG_PCI_IOV and they cannot be used
in any other context, so make them only available when CONFIG_PCI_IOV
is selected, and rename them to reflect their actual usage rather than
having them masquerade as generic code.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/pci-bridge.h |  7 +--
 arch/powerpc/kernel/pci_dn.c  | 15 +--
 arch/powerpc/platforms/powernv/pci-ioda.c |  4 ++--
 arch/powerpc/platforms/pseries/pci.c  |  4 ++--
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index ea6ec65..69f4cb3 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -223,12 +223,15 @@ struct pci_dn {
 extern struct pci_dn *pci_get_pdn_by_devfn(struct pci_bus *bus,
   int devfn);
 extern struct pci_dn *pci_get_pdn(struct pci_dev *pdev);
-extern struct pci_dn *add_dev_pci_data(struct pci_dev *pdev);
-extern void remove_dev_pci_data(struct pci_dev *pdev);
 extern struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
   struct device_node *dn);
 extern void pci_remove_device_node_info(struct device_node *dn);
 
+#ifdef CONFIG_PCI_IOV
+struct pci_dn *add_sriov_vf_pdns(struct pci_dev *pdev);
+void remove_sriov_vf_pdns(struct pci_dev *pdev);
+#endif
+
 static inline int pci_device_from_OF_node(struct device_node *np,
  u8 *bus, u8 *devfn)
 {
diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 795c4e3..24da1d8 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -125,7 +125,7 @@ struct pci_dn *pci_get_pdn(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_PCI_IOV
-static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
+static struct pci_dn *add_one_sriov_vf_pdn(struct pci_dn *parent,
   int vf_index,
   int busno, int devfn)
 {
@@ -151,11 +151,9 @@ static struct pci_dn *add_one_dev_pci_data(struct pci_dn *parent,
 
return pdn;
 }
-#endif
 
-struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
+struct pci_dn *add_sriov_vf_pdns(struct pci_dev *pdev)
 {
-#ifdef CONFIG_PCI_IOV
struct pci_dn *parent, *pdn;
int i;
 
@@ -176,7 +174,7 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
struct eeh_dev *edev __maybe_unused;
 
-   pdn = add_one_dev_pci_data(parent, i,
+   pdn = add_one_sriov_vf_pdn(parent, i,
   pci_iov_virtfn_bus(pdev, i),
   pci_iov_virtfn_devfn(pdev, i));
if (!pdn) {
@@ -192,14 +190,11 @@ struct pci_dn *add_dev_pci_data(struct pci_dev *pdev)
edev->physfn = pdev;
 #endif /* CONFIG_EEH */
}
-#endif /* CONFIG_PCI_IOV */
-
return pci_get_pdn(pdev);
 }
 
-void remove_dev_pci_data(struct pci_dev *pdev)
+void remove_sriov_vf_pdns(struct pci_dev *pdev)
 {
-#ifdef CONFIG_PCI_IOV
struct pci_dn *parent;
struct pci_dn *pdn, *tmp;
int i;
@@ -271,8 +266,8 @@ void remove_dev_pci_data(struct pci_dev *pdev)
kfree(pdn);
}
}
-#endif /* CONFIG_PCI_IOV */
 }
+#endif /* CONFIG_PCI_IOV */
 
 struct pci_dn *pci_add_device_node_info(struct pci_controller *hose,
struct device_node *dn)
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index d8080558d0..f1fa489 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1719,14 +1719,14 @@ int pnv_pcibios_sriov_disable(struct pci_dev *pdev)
pnv_pci_sriov_disable(pdev);
 
/* Release PCI data */
-   remove_dev_pci_data(pdev);
+   remove_sriov_vf_pdns(pdev);
return 0;
 }
 
 int pnv_pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 {
/* Allocate PCI data */
-   add_dev_pci_data(pdev);
+   add_sriov_vf_pdns(pdev);
 
return pnv_pci_sriov_enable(pdev, num_vfs);
 }
diff --git 

[PATCH 1/3] powerpc/sriov: Remove VF eeh_dev state when disabling SR-IOV

2019-08-21 Thread Oliver O'Halloran
When disabling virtual functions on an SR-IOV adapter we currently do not
correctly remove the EEH state for the now-dead virtual functions. When
removing the pci_dn that was created for the VF when SR-IOV was enabled
we free the corresponding eeh_dev without removing it from the child device
list of the eeh_pe that contained it. This can result in crashes due to the
use-after-free.

Signed-off-by: Oliver O'Halloran 
---
No Fixes: here since I'm not sure if the commit that added this actually
introduced the bug. EEH is amazing.

I suspect backporting this would cause more problems than it solves since
reliably replicating the crash required enabling memory poisoning and
hacking a device driver to remove the PCI error handling callbacks so
the EEH fallback path (which removes and re-probes PCI devices)
would be used.
---
 arch/powerpc/kernel/pci_dn.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
index 6556b57..795c4e3 100644
--- a/arch/powerpc/kernel/pci_dn.c
+++ b/arch/powerpc/kernel/pci_dn.c
@@ -244,9 +244,22 @@ void remove_dev_pci_data(struct pci_dev *pdev)
continue;
 
 #ifdef CONFIG_EEH
-   /* Release EEH device for the VF */
+   /*
+* Release EEH state for this VF. The PCI core
+* has already torn down the pci_dev for this VF, but
+* we're responsible for removing the eeh_dev since it
+* has the same lifetime as the pci_dn that spawned it.
+*/
edev = pdn_to_eeh_dev(pdn);
if (edev) {
+   /*
+* We allocate pci_dn's for the totalvfs count,
+* but only the vfs that were activated
+* have a configured PE.
+*/
+   if (edev->pe)
+   eeh_rmv_from_parent_pe(edev);
+
pdn->edev = NULL;
kfree(edev);
}
-- 
2.9.5



Re: [PATCH] platform/powernv: Avoid re-registration of imc debugfs directory

2019-08-21 Thread Oliver O'Halloran
On Wed, Aug 21, 2019 at 3:37 PM Anju T Sudhakar  wrote:
>
> Hi,
>
> On 8/21/19 10:16 AM, Oliver O'Halloran wrote:
> > Is there a reason why we create the debugfs directory here and not in
> > opal_imc_counters_probe()? There's logic to remove the debugfs
> > directory in _probe() already so it seems like a more natural place for
> > it.
> >
> Good point. But we can only create the parent directory,
> i.e. the 'imc' directory, in the `_probe()` function; the entries can
> only be created here.

I know the entries can't be added in the probe function and I'm not
suggesting they should be.

> The reason is, these debugfs entries are only for
> IMC nest units. So, to get the imc mode and command values from
> the nest memory region, we need the relevant offsets from the control
> block structure.
>
> Since the imc_get_mem_addr_nest() function reads this address
> for each chip, we invoke the function to create the debugfs
> entries after these values are populated (i.e. export_imc_mode_and_cmd()
> is invoked by imc_get_mem_addr_nest()).
>
> Also, if we create the parent directory in the `_probe()` function,
> we need to track whether the entries (i.e. imc_cmd and imc_mode) are
> created or not.

I think you should be tracking this anyway so you can do proper
cleanup if a debugfs entry can't be added. The current approach of
nuking the entire debugfs directory is pretty questionable.
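
For what it's worth, here's a rough sketch of the kind of tracking I
mean (illustrative only, not the existing opal-imc.c code; the names
imc_debugfs_parent, imc_debugfs_chip[] and MAX_CHIPS are assumptions).
Keeping a per-chip dentry around lets a failed chip be unwound without
removing the whole "imc" directory:

	#include <linux/debugfs.h>
	#include <linux/err.h>
	#include <linux/kernel.h>

	static struct dentry *imc_debugfs_parent;           /* created in _probe() */
	static struct dentry *imc_debugfs_chip[MAX_CHIPS];  /* one dir per chip */

	static int imc_debugfs_add_chip(int chip, u64 *imc_mode, u64 *imc_cmd)
	{
		char name[16];
		struct dentry *dir;

		snprintf(name, sizeof(name), "chip%d", chip);
		dir = debugfs_create_dir(name, imc_debugfs_parent);
		if (IS_ERR_OR_NULL(dir))
			return -ENOMEM;

		/* Control block offsets are already known by this point. */
		debugfs_create_x64("imc_mode", 0600, dir, imc_mode);
		debugfs_create_x64("imc_cmd", 0600, dir, imc_cmd);

		imc_debugfs_chip[chip] = dir;
		return 0;
	}

	static void imc_debugfs_remove_chip(int chip)
	{
		/* Only this chip's entries go away; the parent stays put. */
		debugfs_remove_recursive(imc_debugfs_chip[chip]);
		imc_debugfs_chip[chip] = NULL;
	}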

> Regards,
>
> Anju
>