Re: [PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-06-04 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> Persistent memory devices like NVDIMMs can lose cached writes if
> something prevents a flush on power failure. Such an event is termed a
> dirty shutdown and is exposed to applications as a
> last-shutdown-state (LSS) flag and a dirty-shutdown-counter (DSC), as
> described at [1]. The latter is useful when multiple applications want
> to detect a dirty shutdown event without racing with one another.
>
>
> PAPR-NVDIMMs have so far only exposed LSS-style flags to indicate a
> dirty-shutdown state. This patch further adds support for DSC via the
> "ibm,persistence-failed-count" device tree property of an NVDIMM. This
> property is a monotonically increasing 64-bit counter that indicates
> the number of times an NVDIMM has encountered a dirty-shutdown event
> causing persistence loss.
>
> Since this value is not expected to change after system boot,
> papr_scm reads and caches it during NVDIMM probe and exposes it as a
> PAPR sysfs attribute named 'dirty_shutdown', matching the similarly
> named NFIT sysfs attribute. The value is also available to libnvdimm
> via the PAPR_PDSM_HEALTH payload: 'struct nd_papr_pdsm_health' has
> been extended with a new member, 'dimm_dsc', whose presence is
> indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.
>
> References:
> [1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v2:
> * Rebased the patch on latest ppc-next tree
> * s/psdm/pdsm/g   [ Santosh ]
> ---
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
>  arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
>  2 files changed, 36 insertions(+)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
> b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> index 50ef95e2f5b1..82488b1e7276 100644
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> @@ -77,6 +77,9 @@
>  /* Indicate that the 'dimm_fuel_gauge' field is valid */
>  #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
>  
> +/* Indicate that the 'dimm_dsc' field is valid */
> +#define PDSM_DIMM_DSC_VALID 2
> +
>  /*
>   * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
>   * Various flags indicate the health status of the dimm.
> @@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
>  
>   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
>   __u16 dimm_fuel_gauge;
> +
> + /* Extension flag PDSM_DIMM_DSC_VALID */
> + __u64 dimm_dsc;
>   };
>   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
>   };
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 11e7b90a3360..eb8be0eb6119 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -114,6 +114,9 @@ struct papr_scm_priv {
>   /* Health information for the dimm */
>   u64 health_bitmap;
>  
> + /* Holds the last known dirty shutdown counter value */
> + u64 dirty_shutdown_counter;
> +
>   /* length of the stat buffer as expected by phyp */
>   size_t stat_buffer_len;
>  };
> @@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
>   return rc;
>  }
>  
> +/* Add the dirty-shutdown-counter value to the pdsm */
> +static int papr_pdsm_dsc(struct papr_scm_priv *p,
> +  union nd_pdsm_payload *payload)
> +{
> + payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
> + payload->health.dimm_dsc = p->dirty_shutdown_counter;
> +
> + return sizeof(struct nd_papr_pdsm_health);
> +}
> +
>  /* Fetch the DIMM health info and populate it in provided package. */
>  static int papr_pdsm_health(struct papr_scm_priv *p,
>   union nd_pdsm_payload *payload)
> @@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
>  
>   /* Populate the fuel gauge meter in the payload */
>   papr_pdsm_fuel_gauge(p, payload);
> + /* Populate the dirty-shutdown-counter field */
> + papr_pdsm_dsc(p, payload);
>  
>   rc = sizeof(struct nd_papr_pdsm_health);
>  
> @@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
>  }
>  DEVICE_ATTR_RO(flags);
>  
> +static ssize_t dirty_shutdown_show(struct device *dev,
> +   struct device_attribute *attr, char *buf)
> +{
> + struct nvdimm *dimm = to_nvdimm(dev);
> + struct papr_scm_priv *p = nvdimm_provider_data(dimm);
> +
> + return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
> +}
> +DEVICE_ATTR_RO(dirty_shutdown);
> +
>  static umode_t papr_nd_attribute_visible(struct kobject *kobj,
>struct attribute *attr, int n)
>  {
> @@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject 
> *kobj,
>  static str

[PATCH] powerpc: Fix kernel-jump address for ppc64 wrapper boot

2021-06-04 Thread He Ying
From "64-bit PowerPC ELF Application Binary Interface Supplement 1.9",
we know that the value of a function pointer in a language like C is
the address of the function descriptor and the first doubleword
of the function descriptor contains the address of the entry point
of the function.

So, when we want to jump to an address (e.g. addr) to execute under
PPC-elf64abi, we should assign the address of addr, *NOT* addr itself,
to the function pointer, or the system will jump to the wrong address.

Link: https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi.html#FUNC-DES
Signed-off-by: He Ying 
---
 arch/powerpc/boot/main.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/powerpc/boot/main.c b/arch/powerpc/boot/main.c
index cae31a6e8f02..50fd7f11b642 100644
--- a/arch/powerpc/boot/main.c
+++ b/arch/powerpc/boot/main.c
@@ -268,7 +268,16 @@ void start(void)
if (console_ops.close)
console_ops.close();
 
+#ifdef CONFIG_PPC64_BOOT_WRAPPER
+   /*
+* For PPC-elf64abi, the value of a function pointer is the address
+* of the function descriptor. And the first doubleword of a function
+* descriptor contains the address of the entry point of the function.
+*/
+   kentry = (kernel_entry_t) &vmlinux.addr;
+#else
kentry = (kernel_entry_t) vmlinux.addr;
+#endif
if (ft_addr) {
if(platform_ops.kentry)
platform_ops.kentry(ft_addr, vmlinux.addr);
-- 
2.17.1



[PATCH] powerpc: Remove klimit

2021-06-04 Thread Christophe Leroy
klimit is a global variable initialised at build time with the
value of _end.

This variable is never modified, so the _end symbol can be used directly.

Remove klimit.

Signed-off-by: Christophe Leroy 
Cc: Kefeng Wang 
---
 arch/powerpc/include/asm/setup.h | 1 -
 arch/powerpc/kernel/head_book3s_32.S | 6 ++
 arch/powerpc/kernel/prom.c   | 2 +-
 arch/powerpc/kernel/setup-common.c   | 4 +---
 arch/powerpc/platforms/powermac/bootx_init.c | 2 +-
 5 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index e89bfebd4e00..6c1a7d217d1a 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -10,7 +10,6 @@ extern void ppc_printk_progress(char *s, unsigned short hex);
 extern unsigned int rtas_data;
 extern unsigned long long memory_limit;
 extern bool init_mem_is_free;
-extern unsigned long klimit;
 extern void *zalloc_maybe_bootmem(size_t size, gfp_t mask);
 
 struct device_node;
diff --git a/arch/powerpc/kernel/head_book3s_32.S 
b/arch/powerpc/kernel/head_book3s_32.S
index 326262030279..b724e88bcdaf 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -766,9 +766,6 @@ PerformanceMonitor:
  * the kernel image to physical address PHYSICAL_START.
  */
 relocate_kernel:
-   addis   r9,r26,klimit@ha/* fetch klimit */
-   lwz r25,klimit@l(r9)
-   addis   r25,r25,-KERNELBASE@h
lis r3,PHYSICAL_START@h /* Destination base address */
li  r6,0/* Destination offset */
li  r5,0x4000   /* # bytes of memory to copy */
@@ -776,7 +773,8 @@ relocate_kernel:
addir0,r3,4f@l  /* jump to the address of 4f */
mtctr   r0  /* in copy and do the rest. */
bctr/* jump to the copy */
-4: mr  r5,r25
+4: lis r5,_end-KERNELBASE@h
+   ori r5,r5,_end-KERNELBASE@l
bl  copy_and_flush  /* copy the rest */
b   turn_on_mmu
 
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index fbe9deebc8e1..f620e04dc9bf 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -758,7 +758,7 @@ void __init early_init_devtree(void *params)
first_memblock_size = min_t(u64, first_memblock_size, 
memory_limit);
setup_initial_memory_limit(memstart_addr, first_memblock_size);
/* Reserve MEMBLOCK regions used by kernel, initrd, dt, etc... */
-   memblock_reserve(PHYSICAL_START, __pa(klimit) - PHYSICAL_START);
+   memblock_reserve(PHYSICAL_START, __pa(_end) - PHYSICAL_START);
/* If relocatable, reserve first 32k for interrupt vectors etc. */
if (PHYSICAL_START > MEMORY_START)
memblock_reserve(MEMORY_START, 0x8000);
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 74a98fff2c2f..138bb7f49ef9 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -91,8 +91,6 @@ EXPORT_SYMBOL_GPL(boot_cpuid);
 int dcache_bsize;
 int icache_bsize;
 
-unsigned long klimit = (unsigned long) _end;
-
 /*
  * This still seems to be needed... -- paulus
  */ 
@@ -930,7 +928,7 @@ void __init setup_arch(char **cmdline_p)
init_mm.start_code = (unsigned long)_stext;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
-   init_mm.brk = klimit;
+   init_mm.brk = (unsigned long)_end;
 
mm_iommu_init(&init_mm);
irqstack_early_init();
diff --git a/arch/powerpc/platforms/powermac/bootx_init.c 
b/arch/powerpc/platforms/powermac/bootx_init.c
index 9d4ecd292255..d20ef35e6d9d 100644
--- a/arch/powerpc/platforms/powermac/bootx_init.c
+++ b/arch/powerpc/platforms/powermac/bootx_init.c
@@ -433,7 +433,7 @@ static void __init btext_welcome(boot_infos_t *bi)
bootx_printf("\nframe buffer at  : 0x%x", bi->dispDeviceBase);
bootx_printf(" (phys), 0x%x", bi->logicalDisplayBase);
bootx_printf(" (log)");
-   bootx_printf("\nklimit   : 0x%x",(unsigned long)klimit);
+   bootx_printf("\nklimit   : 0x%x",(unsigned long)_end);
bootx_printf("\nboot_info at : 0x%x", bi);
__asm__ __volatile__ ("mfmsr %0" : "=r" (flags));
bootx_printf("\nMSR  : 0x%x", flags);
-- 
2.25.0



Re: [PATCH v4 06/16] powerpc/vas: Define and use common vas_window struct

2021-06-04 Thread Michael Ellerman
Haren Myneni  writes:
> On Thu, 2021-06-03 at 14:38 +1000, Nicholas Piggin wrote:
>> Excerpts from Haren Myneni's message of May 21, 2021 7:33 pm:
>> > The same vas_window struct is used on powerNV and pseries. So this
>> > patch changes struct vas_window to support both platforms and also
>> > makes the corresponding modifications in the powerNV VAS code.
>> > 
>> > On powerNV, vas_window is used for both TX and RX windows, whereas
>> > only for TX windows on powerVM. So some elements are specific to
>> > these platforms.
>> > 
>> > Signed-off-by: Haren Myneni 
>> > ---
>> >  arch/powerpc/include/asm/vas.h  |  50 +++-
>> >  arch/powerpc/platforms/powernv/vas-debug.c  |  12 +-
>> >  arch/powerpc/platforms/powernv/vas-fault.c  |   4 +-
>> >  arch/powerpc/platforms/powernv/vas-trace.h  |   6 +-
>> >  arch/powerpc/platforms/powernv/vas-window.c | 129 +++-
>> > 
>> >  arch/powerpc/platforms/powernv/vas.h|  38 +-
>> >  6 files changed, 135 insertions(+), 104 deletions(-)
>> > 
>> > diff --git a/arch/powerpc/include/asm/vas.h
>> > b/arch/powerpc/include/asm/vas.h
>> > index 2c1040f399d9..49bfb5be896d 100644
>> > --- a/arch/powerpc/include/asm/vas.h
>> > +++ b/arch/powerpc/include/asm/vas.h
>> > @@ -10,8 +10,6 @@
>> >  #include 
>> >  #include 
>> >  
>> > -struct vas_window;
>> > -
>> >  /*
>> >   * Min and max FIFO sizes are based on Version 1.05 Section
>> > 3.1.4.25
>> >   * (Local FIFO Size Register) of the VAS workbook.
>> > @@ -63,6 +61,54 @@ struct vas_user_win_ref {
>> >struct mm_struct *mm;   /* Linux process mm_struct */
>> >  };
>> >  
>> > +/*
>> > + * In-kernel state a VAS window. One per window.
>> > + * powerVM: Used only for Tx windows.
>> > + * powerNV: Used for both Tx and Rx windows.
>> > + */
>> > +struct vas_window {
>> > +  u32 winid;
>> > +  u32 wcreds_max; /* Window credits */
>> > +  enum vas_cop_type cop;
>> > +  struct vas_user_win_ref task_ref;
>> > +  char *dbgname;
>> > +  struct dentry *dbgdir;
>> > +  union {
>> > +  /* powerNV specific data */
>> > +  struct {
>> > +  void *vinst;/* points to VAS instance
>> > */
>> > +  bool tx_win;/* True if send window */
>> > +  bool nx_win;/* True if NX window */
>> > +  bool user_win;  /* True if user space
>> > window */
>> > +  void *hvwc_map; /* HV window context */
>> > +  void *uwc_map;  /* OS/User window context
>> > */
>> > +
>> > +  /* Fields applicable only to send windows */
>> > +  void *paste_kaddr;
>> > +  char *paste_addr_name;
>> > +  struct vas_window *rxwin;
>> > +
>> > +  atomic_t num_txwins;/* Only for receive
>> > windows */
>> > +  } pnv;
>> > +  struct {
>> > +  u64 win_addr;   /* Physical paste address
>> > */
>> > +  u8 win_type;/* QoS or Default window */
>> > +  u8 status;
>> > +  u32 complete_irq;   /* Completion interrupt */
>> > +  u32 fault_irq;  /* Fault interrupt */
>> > +  u64 domain[6];  /* Associativity domain Ids
>> > */
>> > +  /* this window is allocated */
>> > +  u64 util;
>> > +
>> > +  /* List of windows opened which is used for LPM
>> > */
>> > +  struct list_head win_list;
>> > +  u64 flags;
>> > +  char *name;
>> > +  int fault_virq;
>> > +  } lpar;
>> > +  };
>> > +};
>> 
>> The more typical way to do code like this is have the common bit as
>> its own struct, and then have the users embed it into their own structs.
>> 
>> 
>> struct vas_window {
>>  u32 winid;
>>  u32 wcreds_max; /* Window credits */
>>  enum vas_cop_type cop;
>>  struct vas_user_win_ref task_ref;
>>  char *dbgname;
>>  struct dentry *dbgdir;
>> };
>> 
>> struct pnv_vas_window {
>>  struct vas_window vas_window;
>> 
>>  void *vinst;/* points to VAS instance */
>>  bool tx_win;/* True if send window */
>>  bool nx_win;/* True if NX window */
>>  bool user_win;  /* True if user space window */
>>  void *hvwc_map; /* HV window context */
>>  void *uwc_map;  /* OS/User window context */
>> 
>>  /* Fields applicable only to send windows */
>>  void *paste_kaddr;
>>  char *paste_addr_name;
>>  struct vas_window *rxwin;
>> 
>>  atomic_t num_txwins;/* Only for receive windows */
>> };
>> 
>> Which helps reusability / avoids churn (don't have to update the
>> same 
>> structure to add new functionality), slightly helps naming and union 
>> size mismatches, helps with type checking, etc.
>> 
>> Maybe not a great benefit for your code as is which may not grow more
>> users, but unless there are some good reasons for the unions I would 
>> really consider c

Re: [PATCH v4 16/16] crypto/nx: Add sysfs interface to export NX capabilities

2021-06-04 Thread Michael Ellerman
Haren Myneni  writes:
> On Thu, 2021-06-03 at 14:57 +1000, Nicholas Piggin wrote:
>> Excerpts from Haren Myneni's message of May 21, 2021 7:42 pm:
>> > Changes to export the following NXGZIP capabilities through sysfs:
>> > 
>> > /sys/devices/vio/ibm,compression-v1/NxGzCaps:
>> 
>> Where's the horrible camel case name coming from? PowerVM?
>
> Yes, pHyp provides the capabilities string.
>
> Capability Description    Descriptor Value    Descriptor ASCII Value
> Overall NX Capabilities   0x4E78204361707320  "Nx Caps "
> NX GZIP Capabilities      0x4E78477A43617073  "NxGzCaps"

That doesn't mean we have to use that name in sysfs though. In fact we
couldn't use the "Nx Caps " name, because it contains spaces.

And we don't have to squeeze our name into 8 bytes, so it can be less
ugly.

Like "nx_gzip_capabilities"?

cheers


Re: [PATCH v4 09/16] powerpc/pseries/vas: Add HCALL wrappers for VAS handling

2021-06-04 Thread Michael Ellerman
Haren Myneni  writes:
> This patch adds the following HCALL wrapper functions to allocate,

Normal spelling is "hcall".

> modify and deallocate VAS windows, and retrieve VAS capabilities.
>
> H_ALLOCATE_VAS_WINDOW: Allocate VAS window
> H_DEALLOCATE_VAS_WINDOW: Close VAS window
> H_MODIFY_VAS_WINDOW: Setup window before using
> H_QUERY_VAS_CAPABILITIES: Get VAS capabilities

Please tell us which version of PAPR, and in which section etc., these
are described in.

> Signed-off-by: Haren Myneni 
> Reviewed-by: Nicholas Piggin 
> ---
>  arch/powerpc/platforms/pseries/vas.c | 217 +++
>  1 file changed, 217 insertions(+)
>  create mode 100644 arch/powerpc/platforms/pseries/vas.c
>
> diff --git a/arch/powerpc/platforms/pseries/vas.c 
> b/arch/powerpc/platforms/pseries/vas.c
> new file mode 100644
> index ..06960151477c
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -0,0 +1,217 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright 2020-21 IBM Corp.
> + */
> +
> +#define pr_fmt(fmt) "vas: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Do we need all those headers?

> +#include 
> +#include 
> +#include "vas.h"
> +
> +#define  VAS_INVALID_WIN_ADDRESS 0xul
> +#define  VAS_DEFAULT_DOMAIN_ID   0xul

Some blank lines for formatting please.

> +/* Authority Mask Register (AMR) value is not supported in */
> +/* linux implementation. So pass '0' to modify window HCALL */

Please fix the comment formatting.

> +#define  VAS_AMR_VALUE   0

This is only used in one place. It'd be simpler to just pass 0 and move
the comment there.

> +/* phyp allows one credit per window right now */
> +#define DEF_WIN_CREDS1
> +
> +static int64_t hcall_return_busy_check(int64_t rc)
> +{

Please use normal kernel types, ie. s64, or just long.

Same comment throughout.

> + /* Check if we are stalled for some time */
> + if (H_IS_LONG_BUSY(rc)) {
> + msleep(get_longbusy_msecs(rc));
> + rc = H_BUSY;
> + } else if (rc == H_BUSY) {
> + cond_resched();
> + }
> +
> + return rc;
> +}
> +
> +/*
> + * Allocate VAS window HCALL
> + */
> +static int plpar_vas_allocate_window(struct vas_window *win, u64 *domain,
> +  u8 wintype, u16 credits)

You don't have to use the "plpar" prefix for these sort of wrappers.

Just naming them after the hcall would probably be clearer, so:

 h_allocate_vas_window(... )

> +{
> + long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
> + int64_t rc;
> +
> + do {
> + rc = plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, wintype,
> +   credits, domain[0], domain[1], domain[2],
> +   domain[3], domain[4], domain[5]);
> +
> + rc = hcall_return_busy_check(rc);
> + } while (rc == H_BUSY);
> +
> + switch (rc) {
> + case H_SUCCESS:
> + win->winid = retbuf[0];
> + win->lpar.win_addr = retbuf[1];
> + win->lpar.complete_irq = retbuf[2];
> + win->lpar.fault_irq = retbuf[3];

You shouldn't mutate win until you know there is no error.

> + if (win->lpar.win_addr == VAS_INVALID_WIN_ADDRESS) {
> + pr_err("HCALL(%x): COPY/PASTE is not supported\n",
> + H_ALLOCATE_VAS_WINDOW);
> + return -ENOTSUPP;
> + }
> + return 0;
> + case H_PARAMETER:
> + pr_err("HCALL(%x): Invalid window type (%u)\n",
> + H_ALLOCATE_VAS_WINDOW, wintype);
> + return -EINVAL;
> + case H_P2:
> + pr_err("HCALL(%x): Credits(%u) exceed maximum window credits\n",
> + H_ALLOCATE_VAS_WINDOW, credits);
> + return -EINVAL;
> + case H_COP_HW:
> + pr_err("HCALL(%x): User-mode COPY/PASTE is not supported\n",
> + H_ALLOCATE_VAS_WINDOW);
> + return -ENOTSUPP;
> + case H_RESOURCE:
> + pr_err("HCALL(%x): LPAR credit limit exceeds window limit\n",
> + H_ALLOCATE_VAS_WINDOW);
> + return -EPERM;
> + case H_CONSTRAINED:
> + pr_err("HCALL(%x): Credits (%u) are not available\n",
> + H_ALLOCATE_VAS_WINDOW, credits);
> + return -EPERM;
> + default:
> + pr_err("HCALL(%x): Unexpected error %lld\n",
> + H_ALLOCATE_VAS_WINDOW, rc);
> + return -EIO;
> + }

Do we really need all these error prints? It's very verbose, and
presumably in normal operation none of these are meant to happen anyway.

Can't we just have a single case that prints the error value?

Same comment for the other hcalls.

> +}
> +
> +/*
> + * Deallocate VAS wi

[PATCH] powerpc/selftests: Use gettid() instead of getppid() for null_syscall

2021-06-04 Thread Christophe Leroy
gettid() is 10% lighter than getppid(); use it for the null_syscall selftest.

Signed-off-by: Christophe Leroy 
---
 tools/testing/selftests/powerpc/benchmarks/null_syscall.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/benchmarks/null_syscall.c 
b/tools/testing/selftests/powerpc/benchmarks/null_syscall.c
index 579f0215c6e7..9836838a529f 100644
--- a/tools/testing/selftests/powerpc/benchmarks/null_syscall.c
+++ b/tools/testing/selftests/powerpc/benchmarks/null_syscall.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static volatile int soak_done;
@@ -121,7 +122,7 @@ static void do_null_syscall(unsigned long nr)
unsigned long i;
 
for (i = 0; i < nr; i++)
-   getppid();
+   syscall(__NR_gettid);
 }
 
 #define TIME(A, STR) \
-- 
2.25.0



Re: [PATCH v2 3/9] arc: remove support for DISCONTIGMEM

2021-06-04 Thread Mike Rapoport
On Fri, Jun 04, 2021 at 02:07:39PM +, Vineet Gupta wrote:
> On 6/3/21 11:49 PM, Mike Rapoport wrote:
> > From: Mike Rapoport 
> >
> > DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
> > in v5.11.
> >
> > Remove the support for DISCONTIGMEM entirely.
> >
> > Signed-off-by: Mike Rapoport 
> 
> Looks non intrusive, but I'd still like to give this a spin on hardware 
> - considering highmem on ARC has tendency to go sideways ;-)
> Can you please share a branch !

Sure:

https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=memory-models/rm-discontig/v2
 
> Acked-by: Vineet Gupta 

Thanks!
 
> Thx,
> -Vineet
> 
> > ---
> >   arch/arc/Kconfig  | 13 
> >   arch/arc/include/asm/mmzone.h | 40 ---
> >   arch/arc/mm/init.c|  8 ---
> >   3 files changed, 61 deletions(-)
> >   delete mode 100644 arch/arc/include/asm/mmzone.h
> >
> > diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
> > index 2d98501c0897..d8f51eb8963b 100644
> > --- a/arch/arc/Kconfig
> > +++ b/arch/arc/Kconfig
> > @@ -62,10 +62,6 @@ config SCHED_OMIT_FRAME_POINTER
> >   config GENERIC_CSUM
> > def_bool y
> >   
> > -config ARCH_DISCONTIGMEM_ENABLE
> > -   def_bool n
> > -   depends on BROKEN
> > -
> >   config ARCH_FLATMEM_ENABLE
> > def_bool y
> >   
> > @@ -344,15 +340,6 @@ config ARC_HUGEPAGE_16M
> >   
> >   endchoice
> >   
> > -config NODES_SHIFT
> > -   int "Maximum NUMA Nodes (as a power of 2)"
> > -   default "0" if !DISCONTIGMEM
> > -   default "1" if DISCONTIGMEM
> > -   depends on NEED_MULTIPLE_NODES
> > -   help
> > - Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory
> > - zones.
> > -
> >   config ARC_COMPACT_IRQ_LEVELS
> > depends on ISA_ARCOMPACT
> > bool "Setup Timer IRQ as high Priority"
> > diff --git a/arch/arc/include/asm/mmzone.h b/arch/arc/include/asm/mmzone.h
> > deleted file mode 100644
> > index b86b9d1e54dc..
> > --- a/arch/arc/include/asm/mmzone.h
> > +++ /dev/null
> > @@ -1,40 +0,0 @@
> > -/* SPDX-License-Identifier: GPL-2.0-only */
> > -/*
> > - * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com)
> > - */
> > -
> > -#ifndef _ASM_ARC_MMZONE_H
> > -#define _ASM_ARC_MMZONE_H
> > -
> > -#ifdef CONFIG_DISCONTIGMEM
> > -
> > -extern struct pglist_data node_data[];
> > -#define NODE_DATA(nid) (&node_data[nid])
> > -
> > -static inline int pfn_to_nid(unsigned long pfn)
> > -{
> > -   int is_end_low = 1;
> > -
> > -   if (IS_ENABLED(CONFIG_ARC_HAS_PAE40))
> > -   is_end_low = pfn <= virt_to_pfn(0xUL);
> > -
> > -   /*
> > -* node 0: lowmem: 0x8000_   to 0x_
> > -* node 1: HIGHMEM w/o  PAE40: 0x0   to 0x7FFF_
> > -* HIGHMEM with PAE40: 0x1__ to ...
> > -*/
> > -   if (pfn >= ARCH_PFN_OFFSET && is_end_low)
> > -   return 0;
> > -
> > -   return 1;
> > -}
> > -
> > -static inline int pfn_valid(unsigned long pfn)
> > -{
> > -   int nid = pfn_to_nid(pfn);
> > -
> > -   return (pfn <= node_end_pfn(nid));
> > -}
> > -#endif /* CONFIG_DISCONTIGMEM  */
> > -
> > -#endif
> > diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> > index 397a201adfe3..abfeef7bf6f8 100644
> > --- a/arch/arc/mm/init.c
> > +++ b/arch/arc/mm/init.c
> > @@ -32,11 +32,6 @@ unsigned long arch_pfn_offset;
> >   EXPORT_SYMBOL(arch_pfn_offset);
> >   #endif
> >   
> > -#ifdef CONFIG_DISCONTIGMEM
> > -struct pglist_data node_data[MAX_NUMNODES] __read_mostly;
> > -EXPORT_SYMBOL(node_data);
> > -#endif
> > -
> >   long __init arc_get_mem_sz(void)
> >   {
> > return low_mem_sz;
> > @@ -147,9 +142,6 @@ void __init setup_arch_memory(void)
> >  * to the hole is freed and ARC specific version of pfn_valid()
> >  * handles the hole in the memory map.
> >  */
> > -#ifdef CONFIG_DISCONTIGMEM
> > -   node_set_online(1);
> > -#endif
> >   
> > min_high_pfn = PFN_DOWN(high_mem_start);
> > max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);
> 

-- 
Sincerely yours,
Mike.


[PATCH 1/4] powerpc/32: Interchange r10 and r12 in SYSCALL_ENTRY on non booke

2021-06-04 Thread Christophe Leroy
To better match the booke version of the SYSCALL_ENTRY macro,
interchange r10 and r12 in the non-booke version.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.h | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index a8221ddcbd66..1e55bc054659 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -142,42 +142,42 @@ _ASM_NOKPROBE_SYMBOL(\name\()_virt)
 
 .macro SYSCALL_ENTRY trapno
mfspr   r9, SPRN_SRR1
-   mfspr   r10, SPRN_SRR0
+   mfspr   r12, SPRN_SRR0
LOAD_REG_IMMEDIATE(r11, MSR_KERNEL) /* can take exceptions 
*/
-   lis r12, 1f@h
-   ori r12, r12, 1f@l
+   lis r10, 1f@h
+   ori r10, r10, 1f@l
mtspr   SPRN_SRR1, r11
-   mtspr   SPRN_SRR0, r12
-   mfspr   r12,SPRN_SPRG_THREAD
+   mtspr   SPRN_SRR0, r10
+   mfspr   r10,SPRN_SPRG_THREAD
mr  r11, r1
-   lwz r1,TASK_STACK-THREAD(r12)
-   tovirt(r12, r12)
+   lwz r1,TASK_STACK-THREAD(r10)
+   tovirt(r10, r10)
addir1, r1, THREAD_SIZE - INT_FRAME_SIZE
rfi
 1:
stw r11,GPR1(r1)
stw r11,0(r1)
mr  r11, r1
-   stw r10,_NIP(r11)
-   mflrr10
-   stw r10, _LINK(r11)
-   mfcrr10
-   rlwinm  r10,r10,0,4,2   /* Clear SO bit in CR */
-   stw r10,_CCR(r11)   /* save registers */
+   stw r12,_NIP(r11)
+   mflrr12
+   stw r12, _LINK(r11)
+   mfcrr12
+   rlwinm  r12,r12,0,4,2   /* Clear SO bit in CR */
+   stw r12,_CCR(r11)   /* save registers */
 #ifdef CONFIG_40x
rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
 #endif
-   lis r10,STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
+   lis r12,STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
stw r2,GPR2(r11)
-   addir10,r10,STACK_FRAME_REGS_MARKER@l
+   addir12,r12,STACK_FRAME_REGS_MARKER@l
stw r9,_MSR(r11)
li  r2, \trapno
-   stw r10,8(r11)
+   stw r12,8(r11)
stw r2,_TRAP(r11)
SAVE_GPR(0, r11)
SAVE_4GPRS(3, r11)
SAVE_2GPRS(7, r11)
-   addir2,r12,-THREAD
+   addir2,r10,-THREAD
b   transfer_to_syscall /* jump to handler */
 .endm
 
-- 
2.25.0



[PATCH 2/4] powerpc/32: Interchange r1 and r11 in SYSCALL_ENTRY on booke

2021-06-04 Thread Christophe Leroy
To better match the non-booke version of the SYSCALL_ENTRY macro,
interchange r1 and r11 in the booke version.

While at it, in both versions use r1 instead of r11 to save
_NIP and _CCR.

All other uses of r11 will go away in the next patch, so don't
bother changing them for now.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.h|  4 ++--
 arch/powerpc/kernel/head_booke.h | 17 +
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 1e55bc054659..7ca25eb9bc75 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -158,12 +158,12 @@ _ASM_NOKPROBE_SYMBOL(\name\()_virt)
stw r11,GPR1(r1)
stw r11,0(r1)
mr  r11, r1
-   stw r12,_NIP(r11)
+   stw r12,_NIP(r1)
mflrr12
stw r12, _LINK(r11)
mfcrr12
rlwinm  r12,r12,0,4,2   /* Clear SO bit in CR */
-   stw r12,_CCR(r11)   /* save registers */
+   stw r12,_CCR(r1)
 #ifdef CONFIG_40x
rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
 #endif
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index f82470091697..4a2fad9f225e 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -135,17 +135,18 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
 #endif
mfspr   r9, SPRN_SRR1
BOOKE_CLEAR_BTB(r11)
-   lwz r11, TASK_STACK - THREAD(r10)
+   mr  r11, r1
+   lwz r1, TASK_STACK - THREAD(r10)
rlwinm  r12,r12,0,4,2   /* Clear SO bit in CR */
-   ALLOC_STACK_FRAME(r11, THREAD_SIZE - INT_FRAME_SIZE)
-   stw r12, _CCR(r11)  /* save various registers */
+   ALLOC_STACK_FRAME(r1, THREAD_SIZE - INT_FRAME_SIZE)
+   stw r12, _CCR(r1)
mflrr12
-   stw r12,_LINK(r11)
+   stw r12,_LINK(r1)
mfspr   r12,SPRN_SRR0
-   stw r1, GPR1(r11)
-   stw r1, 0(r11)
-   mr  r1, r11
-   stw r12,_NIP(r11)
+   stw r11, GPR1(r1)
+   stw r11, 0(r1)
+   mr  r11, r1
+   stw r12,_NIP(r1)
rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?)   */
lis r12, STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
stw r2,GPR2(r11)
-- 
2.25.0



[PATCH 3/4] powerpc/32: Reduce code duplication of system call entry

2021-06-04 Thread Christophe Leroy
booke and non-booke do pretty similar things in the SYSCALL_ENTRY macro
just before jumping to transfer_to_syscall().

Do them in transfer_to_syscall() instead.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/entry_32.S   | 19 +++
 arch/powerpc/kernel/head_32.h| 19 ---
 arch/powerpc/kernel/head_booke.h | 18 --
 3 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 9160285cb2f4..0f53f6d11865 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "head_32.h"
 
@@ -74,6 +75,24 @@ _ASM_NOKPROBE_SYMBOL(prepare_transfer_to_handler)
 
.globl  transfer_to_syscall
 transfer_to_syscall:
+   stw r11, GPR1(r1)
+   stw r11, 0(r1)
+   mflrr12
+   stw r12, _LINK(r1)
+#if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
+   rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
+#endif
+   lis r12,STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
+   SAVE_GPR(2, r1)
+   addir12,r12,STACK_FRAME_REGS_MARKER@l
+   stw r9,_MSR(r1)
+   li  r2, INTERRUPT_SYSCALL
+   stw r12,8(r1)
+   stw r2,_TRAP(r1)
+   SAVE_GPR(0, r1)
+   SAVE_4GPRS(3, r1)
+   SAVE_2GPRS(7, r1)
+   addir2,r10,-THREAD
SAVE_NVGPRS(r1)
 
/* Calling convention has r9 = orig r0, r10 = regs */
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 7ca25eb9bc75..6b1ec9e3541b 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -155,29 +155,10 @@ _ASM_NOKPROBE_SYMBOL(\name\()_virt)
addir1, r1, THREAD_SIZE - INT_FRAME_SIZE
rfi
 1:
-   stw r11,GPR1(r1)
-   stw r11,0(r1)
-   mr  r11, r1
stw r12,_NIP(r1)
-   mflrr12
-   stw r12, _LINK(r11)
mfcrr12
rlwinm  r12,r12,0,4,2   /* Clear SO bit in CR */
stw r12,_CCR(r1)
-#ifdef CONFIG_40x
-   rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?) */
-#endif
-   lis r12,STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
-   stw r2,GPR2(r11)
-   addir12,r12,STACK_FRAME_REGS_MARKER@l
-   stw r9,_MSR(r11)
-   li  r2, \trapno
-   stw r12,8(r11)
-   stw r2,_TRAP(r11)
-   SAVE_GPR(0, r11)
-   SAVE_4GPRS(3, r11)
-   SAVE_2GPRS(7, r11)
-   addir2,r10,-THREAD
b   transfer_to_syscall /* jump to handler */
 .endm
 
diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index 4a2fad9f225e..10f31146b472 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -140,26 +140,8 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
rlwinm  r12,r12,0,4,2   /* Clear SO bit in CR */
ALLOC_STACK_FRAME(r1, THREAD_SIZE - INT_FRAME_SIZE)
stw r12, _CCR(r1)
-   mflrr12
-   stw r12,_LINK(r1)
mfspr   r12,SPRN_SRR0
-   stw r11, GPR1(r1)
-   stw r11, 0(r1)
-   mr  r11, r1
stw r12,_NIP(r1)
-   rlwinm  r9,r9,0,14,12   /* clear MSR_WE (necessary?)   */
-   lis r12, STACK_FRAME_REGS_MARKER@ha /* exception frame marker */
-   stw r2,GPR2(r11)
-   addir12, r12, STACK_FRAME_REGS_MARKER@l
-   stw r9,_MSR(r11)
-   li  r2, \trapno
-   stw r12, 8(r11)
-   stw r2,_TRAP(r11)
-   SAVE_GPR(0, r11)
-   SAVE_4GPRS(3, r11)
-   SAVE_2GPRS(7, r11)
-
-   addir2,r10,-THREAD
b   transfer_to_syscall /* jump to handler */
 .endm
 
-- 
2.25.0



[PATCH 4/4] powerpc/32: Avoid #ifdef nested with FTR_SECTION on booke syscall entry

2021-06-04 Thread Christophe Leroy
On booke, SYSCALL_ENTRY macro nests an FTR_SECTION with a

Duplicate the single instruction alternative to avoid nesting.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_booke.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/head_booke.h b/arch/powerpc/kernel/head_booke.h
index 10f31146b472..87b806e8eded 100644
--- a/arch/powerpc/kernel/head_booke.h
+++ b/arch/powerpc/kernel/head_booke.h
@@ -128,10 +128,10 @@ BEGIN_FTR_SECTION
mr  r12, r13
lwz r13, THREAD_NORMSAVE(2)(r10)
 FTR_SECTION_ELSE
-#endif
mfcrr12
-#ifdef CONFIG_KVM_BOOKE_HV
 ALT_FTR_SECTION_END_IFSET(CPU_FTR_EMB_HV)
+#else
+   mfcrr12
 #endif
mfspr   r9, SPRN_SRR1
BOOKE_CLEAR_BTB(r11)
-- 
2.25.0



[PATCH v2 1/4] powerpc/interrupt: Interchange prep_irq_for_{kernel_enabled/user}_exit()

2021-06-04 Thread Christophe Leroy
prep_irq_for_user_exit() is a superset of
prep_irq_for_kernel_enabled_exit(). In order to allow refactoring in the
following patch, interchange the two, as prep_irq_for_user_exit() will
call prep_irq_for_kernel_enabled_exit().

Signed-off-by: Christophe Leroy 
---
This series applies on top of Nic's series to speed up interrupt return on 64s

 arch/powerpc/kernel/interrupt.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index 74c995a42399..539455c62c5b 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -40,33 +40,27 @@ static inline bool exit_must_hard_disable(void)
 #endif
 
 /*
- * local irqs must be disabled. Returns false if the caller must re-enable
- * them, check for new work, and try again.
- *
- * This should be called with local irqs disabled, but if they were previously
- * enabled when the interrupt handler returns (indicating a process-context /
- * synchronous interrupt) then irqs_enabled should be true.
+ * restartable is true then EE/RI can be left on because interrupts are handled
+ * with a restart sequence.
  */
-static notrace __always_inline bool prep_irq_for_user_exit(void)
+static notrace __always_inline bool prep_irq_for_kernel_enabled_exit(bool restartable)
 {
-   user_enter_irqoff();
/* This must be done with RI=1 because tracing may touch vmaps */
trace_hardirqs_on();
 
 #ifdef CONFIG_PPC32
__hard_EE_RI_disable();
 #else
-   if (exit_must_hard_disable())
+   if (exit_must_hard_disable() || !restartable)
__hard_EE_RI_disable();
 
/* This pattern matches prep_irq_for_idle */
if (unlikely(lazy_irq_pending_nocheck())) {
-   if (exit_must_hard_disable()) {
+   if (exit_must_hard_disable() || !restartable) {
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
__hard_RI_enable();
}
trace_hardirqs_off();
-   user_exit_irqoff();
 
return false;
}
@@ -75,27 +69,33 @@ static notrace __always_inline bool prep_irq_for_user_exit(void)
 }
 
 /*
- * restartable is true then EE/RI can be left on because interrupts are handled
- * with a restart sequence.
+ * local irqs must be disabled. Returns false if the caller must re-enable
+ * them, check for new work, and try again.
+ *
+ * This should be called with local irqs disabled, but if they were previously
+ * enabled when the interrupt handler returns (indicating a process-context /
+ * synchronous interrupt) then irqs_enabled should be true.
  */
-static notrace __always_inline bool prep_irq_for_kernel_enabled_exit(bool restartable)
+static notrace __always_inline bool prep_irq_for_user_exit(void)
 {
+   user_enter_irqoff();
/* This must be done with RI=1 because tracing may touch vmaps */
trace_hardirqs_on();
 
 #ifdef CONFIG_PPC32
__hard_EE_RI_disable();
 #else
-   if (exit_must_hard_disable() || !restartable)
+   if (exit_must_hard_disable())
__hard_EE_RI_disable();
 
/* This pattern matches prep_irq_for_idle */
if (unlikely(lazy_irq_pending_nocheck())) {
-   if (exit_must_hard_disable() || !restartable) {
+   if (exit_must_hard_disable()) {
local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
__hard_RI_enable();
}
trace_hardirqs_off();
+   user_exit_irqoff();
 
return false;
}
-- 
2.25.0



[PATCH v2 2/4] powerpc/interrupt: Refactor prep_irq_for_user_exit()

2021-06-04 Thread Christophe Leroy
prep_irq_for_user_exit() is a superset of
prep_irq_for_kernel_enabled_exit().

Refactor it.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/interrupt.c | 25 +
 1 file changed, 5 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index 539455c62c5b..b6aa80930733 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -78,29 +78,14 @@ static notrace __always_inline bool prep_irq_for_kernel_enabled_exit(bool restar
  */
 static notrace __always_inline bool prep_irq_for_user_exit(void)
 {
-   user_enter_irqoff();
-   /* This must be done with RI=1 because tracing may touch vmaps */
-   trace_hardirqs_on();
-
-#ifdef CONFIG_PPC32
-   __hard_EE_RI_disable();
-#else
-   if (exit_must_hard_disable())
-   __hard_EE_RI_disable();
+   bool ret;
 
-   /* This pattern matches prep_irq_for_idle */
-   if (unlikely(lazy_irq_pending_nocheck())) {
-   if (exit_must_hard_disable()) {
-   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
-   __hard_RI_enable();
-   }
-   trace_hardirqs_off();
+   user_enter_irqoff();
+   ret = prep_irq_for_kernel_enabled_exit(true);
+   if (!ret)
user_exit_irqoff();
 
-   return false;
-   }
-#endif
-   return true;
+   return ret;
 }
 
 /* Has to run notrace because it is entered not completely "reconciled" */
-- 
2.25.0



[PATCH v2 3/4] powerpc/interrupt: Rename and lightly change syscall_exit_prepare_main()

2021-06-04 Thread Christophe Leroy
Rename syscall_exit_prepare_main() into interrupt_exit_user_prepare_main()

Make it static as it is not used anywhere else.

Pass it the 'ret' so that it can 'or' it directly instead of
oring twice, once inside the function and once outside.

And remove 'r3' parameter which is not used.

Also fix a typo where CONFIG_PPC_BOOK3S should be CONFIG_PPC_BOOK3S_64.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/interrupt.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index b6aa80930733..bc3c1892ed80 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -228,11 +228,10 @@ static notrace void booke_load_dbcr0(void)
 #endif
 }
 
-notrace unsigned long syscall_exit_prepare_main(unsigned long r3,
-   struct pt_regs *regs)
+static notrace unsigned long
+interrupt_exit_user_prepare_main(struct pt_regs *regs, unsigned long ret)
 {
unsigned long ti_flags;
-   unsigned long ret = 0;
 
 again:
ti_flags = READ_ONCE(current_thread_info()->flags);
@@ -254,7 +253,7 @@ notrace unsigned long syscall_exit_prepare_main(unsigned long r3,
ti_flags = READ_ONCE(current_thread_info()->flags);
}
 
-   if (IS_ENABLED(CONFIG_PPC_BOOK3S) && IS_ENABLED(CONFIG_PPC_FPU)) {
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && IS_ENABLED(CONFIG_PPC_FPU)) {
if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
unlikely((ti_flags & _TIF_RESTORE_TM))) {
restore_tm_state(regs);
@@ -350,7 +349,7 @@ notrace unsigned long syscall_exit_prepare(unsigned long r3,
}
 
local_irq_disable();
-   ret |= syscall_exit_prepare_main(r3, regs);
+   ret = interrupt_exit_user_prepare_main(regs, ret);
 
 #ifdef CONFIG_PPC64
regs->exit_result = ret;
@@ -378,7 +377,7 @@ notrace unsigned long syscall_exit_restart(unsigned long r3, struct pt_regs *reg
 
BUG_ON(!user_mode(regs));
 
-   regs->exit_result |= syscall_exit_prepare_main(r3, regs);
+   regs->exit_result = interrupt_exit_user_prepare_main(regs, regs->exit_result);
 
return regs->exit_result;
 }
-- 
2.25.0



[PATCH v2 4/4] powerpc/interrupt: Refactor interrupt_exit_user_prepare()

2021-06-04 Thread Christophe Leroy
interrupt_exit_user_prepare() is a superset of
interrupt_exit_user_prepare_main().

Refactor to avoid code duplication.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/interrupt.c | 57 ++---
 1 file changed, 3 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index bc3c1892ed80..f5d30323028e 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -385,9 +385,7 @@ notrace unsigned long syscall_exit_restart(unsigned long r3, struct pt_regs *reg
 
 notrace unsigned long interrupt_exit_user_prepare(struct pt_regs *regs)
 {
-   unsigned long ti_flags;
-   unsigned long flags;
-   unsigned long ret = 0;
+   unsigned long ret;
 
if (!IS_ENABLED(CONFIG_BOOKE) && !IS_ENABLED(CONFIG_40x))
BUG_ON(!(regs->msr & MSR_RI));
@@ -401,63 +399,14 @@ notrace unsigned long interrupt_exit_user_prepare(struct pt_regs *regs)
 */
kuap_assert_locked();
 
-   local_irq_save(flags);
-
-again:
-   ti_flags = READ_ONCE(current_thread_info()->flags);
-   while (unlikely(ti_flags & (_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM))) {
-   local_irq_enable(); /* returning to user: may enable */
-   if (ti_flags & _TIF_NEED_RESCHED) {
-   schedule();
-   } else {
-   if (ti_flags & _TIF_SIGPENDING)
-   ret |= _TIF_RESTOREALL;
-   do_notify_resume(regs, ti_flags);
-   }
-   local_irq_disable();
-   ti_flags = READ_ONCE(current_thread_info()->flags);
-   }
-
-   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && IS_ENABLED(CONFIG_PPC_FPU)) {
-   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM) &&
-   unlikely((ti_flags & _TIF_RESTORE_TM))) {
-   restore_tm_state(regs);
-   } else {
-   unsigned long mathflags = MSR_FP;
-
-   if (cpu_has_feature(CPU_FTR_VSX))
-   mathflags |= MSR_VEC | MSR_VSX;
-   else if (cpu_has_feature(CPU_FTR_ALTIVEC))
-   mathflags |= MSR_VEC;
-
-   /* See above restore_math comment */
-   if ((regs->msr & mathflags) != mathflags)
-   restore_math(regs);
-   }
-   }
-
-   if (!prep_irq_for_user_exit()) {
-   local_irq_enable();
-   local_irq_disable();
-   goto again;
-   }
-
-   booke_load_dbcr0();
-
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   local_paca->tm_scratch = regs->msr;
-#endif
+   local_irq_disable();
 
-   account_cpu_user_exit();
+   ret = interrupt_exit_user_prepare_main(regs, 0);
 
 #ifdef CONFIG_PPC64
regs->exit_result = ret;
 #endif
 
-   /* Restore user access locks last */
-   kuap_user_restore(regs);
-   kuep_unlock();
-
return ret;
 }
 
-- 
2.25.0



[PATCH v2 2/2] powerpc/ps3: Re-align DTB in image

2021-06-04 Thread Geoff Levand
Change the PS3 linker script to align the DTB at 8 bytes,
the same alignment as that of the 'generic' powerpc
linker script.

Signed-off-by: Geoff Levand 
---
 arch/powerpc/boot/zImage.ps3.lds.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/zImage.ps3.lds.S b/arch/powerpc/boot/zImage.ps3.lds.S
index 7b2ff2eaa73a..d0ffb493614d 100644
--- a/arch/powerpc/boot/zImage.ps3.lds.S
+++ b/arch/powerpc/boot/zImage.ps3.lds.S
@@ -8,7 +8,7 @@ SECTIONS
   .kernel:vmlinux.bin : { *(.kernel:vmlinux.bin) }
   _vmlinux_end =  .;
 
-  . = ALIGN(4096);
+  . = ALIGN(8);
   _dtb_start = .;
   .kernel:dtb : { *(.kernel:dtb) }
   _dtb_end = .;
-- 
2.25.1



[PATCH v2 0/2] PS3 Updates

2021-06-04 Thread Geoff Levand
Hi Michael,

I've rebased the V1 patches to v5.13-rc4, and moved the firmware version export
from procfs to sysfs/firmware.

Please consider.

-Geoff

The following changes since commit 8124c8a6b35386f73523d27eacb71b5364a68c4c:

  Linux 5.13-rc4 (2021-05-30 11:58:25 -1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/ps3-linux.git 
for-merge-powerpc

for you to fetch changes up to 245897ed65e402686a4b114ba618e935cb5c6506:

  powerpc/ps3: Re-align DTB in image (2021-06-04 08:35:45 -0700)


Geoff Levand (2):
  powerpc/ps3: Add firmware version to sysfs
  powerpc/ps3: Re-align DTB in image

 arch/powerpc/boot/zImage.ps3.lds.S |  2 +-
 arch/powerpc/platforms/ps3/setup.c | 43 +++---
 2 files changed, 41 insertions(+), 4 deletions(-)

-- 
2.25.1



[PATCH v2 1/2] powerpc/ps3: Add firmware version to sysfs

2021-06-04 Thread Geoff Levand
Add a new sysfs entry /sys/firmware/ps3/fw-version that exports
the PS3's firmware version.

The firmware version is available through an LV1 hypercall, and we've
been printing it to the boot log, but haven't provided an easy way for
user utilities to get it.

Signed-off-by: Geoff Levand 
---
 arch/powerpc/platforms/ps3/setup.c | 43 +++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/setup.c b/arch/powerpc/platforms/ps3/setup.c
index e9ae5dd03593..3de9145c20bc 100644
--- a/arch/powerpc/platforms/ps3/setup.c
+++ b/arch/powerpc/platforms/ps3/setup.c
@@ -36,6 +36,7 @@ DEFINE_MUTEX(ps3_gpu_mutex);
 EXPORT_SYMBOL_GPL(ps3_gpu_mutex);
 
 static union ps3_firmware_version ps3_firmware_version;
+static char ps3_firmware_version_str[16];
 
 void ps3_get_firmware_version(union ps3_firmware_version *v)
 {
@@ -182,6 +183,40 @@ static int ps3_set_dabr(unsigned long dabr, unsigned long dabrx)
return lv1_set_dabr(dabr, dabrx) ? -1 : 0;
 }
 
+static ssize_t ps3_fw_version_show(struct kobject *kobj,
+   struct kobj_attribute *attr, char *buf)
+{
+   return sprintf(buf, "%s", ps3_firmware_version_str);
+}
+
+static int __init ps3_setup_sysfs(void)
+{
+   static struct kobj_attribute attr = __ATTR(fw-version, S_IRUGO,
+   ps3_fw_version_show, NULL);
+   static struct kobject *kobj;
+   int result;
+
+   kobj = kobject_create_and_add("ps3", firmware_kobj);
+
+   if (!kobj) {
+   pr_warn("%s:%d: kobject_create_and_add failed.\n", __func__,
+   __LINE__);
+   return -ENOMEM;
+   }
+
+   result = sysfs_create_file(kobj, &attr.attr);
+
+   if (result) {
+   pr_warn("%s:%d: sysfs_create_file failed.\n", __func__,
+   __LINE__);
+   kobject_put(kobj);
+   return -ENOMEM;
+   }
+
+   return 0;
+}
+core_initcall(ps3_setup_sysfs);
+
 static void __init ps3_setup_arch(void)
 {
u64 tmp;
@@ -190,9 +225,11 @@ static void __init ps3_setup_arch(void)
 
lv1_get_version_info(&ps3_firmware_version.raw, &tmp);
 
-   printk(KERN_INFO "PS3 firmware version %u.%u.%u\n",
-  ps3_firmware_version.major, ps3_firmware_version.minor,
-  ps3_firmware_version.rev);
+   snprintf(ps3_firmware_version_str, sizeof(ps3_firmware_version_str),
+   "%u.%u.%u", ps3_firmware_version.major,
+   ps3_firmware_version.minor, ps3_firmware_version.rev);
+
+   printk(KERN_INFO "PS3 firmware version %s\n", ps3_firmware_version_str);
 
ps3_spu_set_platform();
 
-- 
2.25.1




Re: simplify gendisk and request_queue allocation for blk-mq based drivers

2021-06-04 Thread Konrad Rzeszutek Wilk
On Wed, Jun 02, 2021 at 09:53:15AM +0300, Christoph Hellwig wrote:
> Hi all,

Hi!

You wouldn't have a nice git repo to pull so one can test it easily?

Thank you!

Cc-ing Boris/Juergen - pls see below xen.
> 
> this series is the scond part of cleaning up lifetimes and allocation of
> the gendisk and request_queue structure.  It adds a new interface to
> allocate the disk and queue together for blk based drivers, and uses that
> in all drivers that do not have any caveats in their gendisk and
> request_queue lifetime rules.
> 
> Diffstat:
>  block/blk-mq.c  |   91 +++---
>  block/blk.h |1 
>  block/elevator.c|2 
>  drivers/block/amiflop.c |   16 +-
>  drivers/block/aoe/aoeblk.c  |   33 
>  drivers/block/aoe/aoedev.c  |3 -
>  drivers/block/ataflop.c |   16 +-
>  drivers/block/floppy.c  |   20 +--
>  drivers/block/loop.c|   19 ++-
>  drivers/block/nbd.c |   53 +++
>  drivers/block/null_blk/main.c   |   11 +---
>  drivers/block/paride/pcd.c  |   19 +++
>  drivers/block/paride/pd.c   |   30 ---
>  drivers/block/paride/pf.c   |   18 ++
>  drivers/block/ps3disk.c |   36 +
>  drivers/block/rbd.c |   52 ++-
>  drivers/block/rnbd/rnbd-clt.c   |   35 +++--
>  drivers/block/sunvdc.c  |   47 -
>  drivers/block/swim.c|   34 +---
>  drivers/block/swim3.c   |   33 +---
>  drivers/block/sx8.c |   23 ++--
>  drivers/block/virtio_blk.c  |   26 ++---
>  drivers/block/xen-blkfront.c|   96 ++--
>  drivers/block/z2ram.c   |   15 +
>  drivers/cdrom/gdrom.c   |   45 +++-
>  drivers/md/dm-rq.c  |9 +--
>  drivers/memstick/core/ms_block.c|   25 +++--
>  drivers/memstick/core/mspro_block.c |   26 -
>  drivers/mtd/mtd_blkdevs.c   |   48 --
>  drivers/mtd/ubi/block.c |   68 ++---
>  drivers/s390/block/scm_blk.c|   21 ++-
>  include/linux/blk-mq.h  |   24 ++---
>  include/linux/elevator.h|1 
>  33 files changed, 386 insertions(+), 610 deletions(-)


[RFC] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-06-04 Thread Pratik R. Sampat
Adds a generic interface to represent the energy and frequency related
PAPR attributes on the system using the new H_CALL
"H_GET_ENERGY_SCALE_INFO".

H_GET_EM_PARMS H_CALL was previously responsible for exporting this
information in the lparcfg; however, the H_GET_EM_PARMS H_CALL
will be deprecated from P10 onwards.

The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
hcall(
  uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
  uint64 flags,   // Per the flag request
  uint64 firstAttributeId,// The attribute id
  uint64 bufferAddress,   // The logical address of the output buffer
  uint64 bufferSize   // The size in bytes of the output buffer
);

This H_CALL can query either all the attributes at once, with
firstAttributeId = 0 and flags = 0, or a single attribute at a
time, with firstAttributeId = id.

The output buffer consists of the following
1. number of attributes  - 8 bytes
2. array offset to the data location - 8 bytes
3. version info  - 1 byte
4. A data array of size num attributes, which contains the following:
  a. attribute ID  - 8 bytes
  b. attribute value in number - 8 bytes
  c. attribute name in string  - 64 bytes
  d. attribute value in string - 64 bytes

The new H_CALL exports information in direct string value format, hence
a new interface has been introduced in /sys/firmware/papr to export
this information to userspace in an extensible pass-through format.
The H_CALL returns the name, numeric value and string value. As string
values are in human-readable format, the string value, if it exists,
is given precedence over the numeric value.

The format of exposing the sysfs information is as follows:
/sys/firmware/papr/
  |-- attr_0_name
  |-- attr_0_val
  |-- attr_1_name
  |-- attr_1_val
...

The energy information that is exported is useful for userspace tools
such as powerpc-utils. Currently these tools infer the
"power_mode_data" value in the lparcfg, which in turn is obtained from
the to be deprecated H_GET_EM_PARMS H_CALL.
On future platforms, such userspace utilities will have to look at the
data returned from the new H_CALL being populated in this new sysfs
interface and report this information directly without the need of
interpretation.

Signed-off-by: Pratik R. Sampat 
---
 Documentation/ABI/testing/sysfs-firmware-papr |  24 +++
 arch/powerpc/include/asm/hvcall.h |  21 +-
 arch/powerpc/kvm/trace_hv.h   |   1 +
 arch/powerpc/platforms/pseries/Makefile   |   3 +-
 .../pseries/papr_platform_attributes.c| 203 ++
 5 files changed, 250 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-papr
 create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-papr b/Documentation/ABI/testing/sysfs-firmware-papr
new file mode 100644
index ..1c040b44ac3b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-papr
@@ -0,0 +1,24 @@
+What:  /sys/firmware/papr
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   Directory hosting a set of platform attributes on Linux
+   running as a PAPR guest.
+
+   Each file in a directory contains a platform
+   attribute pertaining to performance/energy-savings
+   mode and processor frequency.
+
+What:  /sys/firmware/papr/attr_X_name
+   /sys/firmware/papr/attr_X_val
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   PAPR attributes directory for POWERVM servers
+
+   This directory provides PAPR information. It
+   contains below sysfs attributes:
+
+   - attr_X_name: File contains the name of
+   attribute X
+
+   - attr_X_val: Numeric/string value of
+   attribute X
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index e3b29eda8074..19a2a8c77a49 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -316,7 +316,8 @@
 #define H_SCM_PERFORMANCE_STATS 0x418
 #define H_RPT_INVALIDATE   0x448
 #define H_SCM_FLUSH0x44C
-#define MAX_HCALL_OPCODE   H_SCM_FLUSH
+#define H_GET_ENERGY_SCALE_INFO0x450
+#define MAX_HCALL_OPCODE   H_GET_ENERGY_SCALE_INFO
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -631,6 +632,24 @@ struct hv_gpci_request_buffer {
uint8_t bytes[HGPCI_MAX_DATA_BYTES];
 } __packed;
 
+#define MAX_EM_ATTRS   10
+#define MAX_EM_DATA_BYTES \
+   (sizeof(struct energy_scale_attributes) * MAX_EM_ATTRS)
+struct energy_scale_attributes {
+   __be64 attr_id;
+   __be64 attr_value;
+   unsigned char attr_desc[64];
+   unsigned char attr_value_desc[64];
+} __packed;
+
+struct hv_energy_scale_buffer {
+  

Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Andy Lutomirski
On 5/31/21 11:22 PM, Nicholas Piggin wrote:
> There haven't been objections to the series since last posting, this
> is just a rebase and tidies up a few comments minor patch rearranging.
> 

I continue to object to having too many modes.  I like my more generic
improvements better.  Let me try to find some time to email again.


Re: [PATCH 4/4] powerpc/32: Avoid #ifdef nested with FTR_SECTION on booke syscall entry

2021-06-04 Thread Andreas Schwab
On Jun 04 2021, Christophe Leroy wrote:

> On booke, SYSCALL_ENTRY macro nests an FTR_SECTION with a

That sentence lacks an

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Andy Lutomirski
On 6/4/21 9:54 AM, Andy Lutomirski wrote:
> On 5/31/21 11:22 PM, Nicholas Piggin wrote:
>> There haven't been objections to the series since last posting, this
>> is just a rebase and tidies up a few comments minor patch rearranging.
>>
> 
> I continue to object to having too many modes.  I like my more generic
> improvements better.  Let me try to find some time to email again.
> 

Specifically, this:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm

I, or someone, needs to dust off my membarrier series before any of
these kinds of changes get made.  The barrier situation in the scheduler
is too confusing otherwise.


Re: [PATCH v4 16/16] crypto/nx: Add sysfs interface to export NX capabilities

2021-06-04 Thread Haren Myneni
On Fri, 2021-06-04 at 21:52 +1000, Michael Ellerman wrote:
> Haren Myneni  writes:
> > On Thu, 2021-06-03 at 14:57 +1000, Nicholas Piggin wrote:
> > > Excerpts from Haren Myneni's message of May 21, 2021 7:42 pm:
> > > > Changes to export the following NXGZIP capabilities through
> > > > sysfs:
> > > > 
> > > > /sys/devices/vio/ibm,compression-v1/NxGzCaps:
> > > 
> > > Where's the horrible camel case name coming from? PowerVM?
> > 
> > Yes, pHyp provides the capabilities string.
> > 
> > Capability Description Descriptor Value Descriptor ascii Value
> > Overall NX Capabilities 0x4E78204361707320 “Nx Caps ”
> > NX GZIP Capabilities 0x4E78477A43617073 “NxGzCaps”
> 
> That doesn't mean we have to use that name in sysfs though. In fact we
> couldn't use the "Nx Caps " name, because it contains spaces.
> 
> And we don't have to squeeze our name into 8 bytes, so it can be less
> ugly.
> 
> Like "nx_gzip_capabilities"?

Thanks for your comments. 

"NX Caps " provides all the available features in the hypervisor (only
NX GZIP caps is supported right now) and we export information for the
specific feature via sysfs.

I will change it to "nx_gzip_caps" 

Thanks
Haren

> 
> cheers



Re: [PATCH v8 00/15] Restricted DMA

2021-06-04 Thread Will Deacon
Hi Claire,

On Thu, May 27, 2021 at 08:58:30PM +0800, Claire Chang wrote:
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
> 
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
> 
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
> 
> [1a] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> [1b] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> [2] https://blade.tencent.com/en/advisories/qualpwn/
> [3] https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> [4] https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> 
> v8:
> - Fix reserved-memory.txt and add the reg property in example.
> - Fix sizeof for of_property_count_elems_of_size in
>   drivers/of/address.c#of_dma_set_restricted_buffer.
> - Apply Will's suggestion to try the OF node having DMA configuration in
>   drivers/of/address.c#of_dma_set_restricted_buffer.
> - Fix typo in the comment of drivers/of/address.c#of_dma_set_restricted_buffer.
> - Add error message for PageHighMem in
>   kernel/dma/swiotlb.c#rmem_swiotlb_device_init and move it to
>   rmem_swiotlb_setup.
> - Fix the message string in rmem_swiotlb_setup.

Thanks for the v8. It works for me out of the box on arm64 under KVM, so:

Tested-by: Will Deacon 

Note that something seems to have gone wrong with the mail threading, so
the last 5 patches ended up as a separate thread for me. Probably worth
posting again with all the patches in one place, if you can.

Cheers,

Will


Re: [PATCH v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks

2021-06-04 Thread Daniel Borkmann

On 6/4/21 6:50 AM, Paul Moore wrote:

On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann  wrote:

[...]

I did run this entire discussion by both of the other BPF co-maintainers
(Alexei, Andrii, CC'ed) and together we did further brainstorming on the
matter on how we could solve this, but couldn't find a sensible & clean
solution so far.


Before I jump into the patch below I just want to say that I
appreciate you looking into solutions on the BPF side of things.
However, I voted "no" on this patch previously and since you haven't
really changed it, my "no"/NACK vote remains, at least until we
exhaust a few more options.


Just to set the record straight, you previously did neither ACK nor NACK it. And
again, as summarized earlier, this patch is _fixing_ the majority of the damage
caused by 59438b46471a for at least the BPF side of things where users run into 
this,
Ondrej the rest. Everything else can be discussed on top, and so far it seems 
there
is no clean solution in front of us either, not even speaking of one that is 
small
and suitable for _stable_ trees. Let me reiterate where we are: it's not that 
the
original implementation in 9d1f8be5cf42 ("bpf: Restrict bpf when kernel 
lockdown is
in confidentiality mode") was broken, it's that the later added _SELinux_ 
locked_down
hook implementation that is broken, and not other LSMs. Now you're trying to 
retrofittingly
ask us for hacks at all costs just because of /a/ broken LSM implementation. 
Maybe
another avenue is to just swallow the pill and revert 59438b46471a since it made
assumptions that don't work /generally/. And the use case for an application 
runtime
policy change in particular in case of lockdown where users only have 3 choices 
is
extremely tiny/rare, if it's not then something is very wrong in your 
deployment.
Let me ask you this ... are you also planning to address *all* the other cases 
inside
the kernel? If your answer is no, then this entire discussion is pointless.


[PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks

Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:

1) The audit events that are triggered due to calls to security_locked_down()
   can OOM kill a machine, see below details [0].

2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
   when trying to wake up kauditd, for example, when using trace_sched_switch()
   tracepoint, see details in [1]. Triggering this was not via some hypothetical
   corner case, but with existing tools like runqlat & runqslower from bcc, for
   example, which make use of this tracepoint. Rough call sequence goes like:

   rq_lock(rq) -> -+
 trace_sched_switch() ->   |
   bpf_prog_xyz() ->   +-> deadlock
 selinux_lockdown() -> |
   audit_log_end() ->  |
 wake_up_interruptible() ->|
   try_to_wake_up() -> |
 rq_lock(rq) --+


Since BPF is a bit of a chaotic nightmare in the sense that it is
basically out-of-tree kernel code that can be called from anywhere and do pretty
much anything; it presents quite the challenge for those of us worried
about LSM access controls.


There is no need to generalize ... for those worried, BPF subsystem has LSM 
access
controls for the syscall since 2017 via afdb09c720b6 ("security: bpf: Add LSM 
hooks
for bpf object related syscall").

[...]

So let's look at this from a different angle.  Let's look at the two
problems you mention above.

If we start with the runqueue deadlock we see the main problem is that
audit_log_end() pokes the kauditd_wait waitqueue to ensure the
kauditd_thread thread wakes up and processes the audit queue.  The
audit_log_start() function does something similar, but it is
conditional on a number of factors and isn't as likely to be hit.  If
we relocate these kauditd wakeup calls we can remove the deadlock in
trace_sched_switch().  In the case of CONFIG_AUDITSYSCALL=y we can
probably just move the wakeup to __audit_syscall_exit() and in the
case of CONFIG_AUDITSYSCALL=n we can likely just change the
wait_event_freezable() call in kauditd_thread to a
wait_event_freezable_timeout() call with a HZ timeout (the audit
stream will be much less on these systems anyway so a queue overflow
is much less likely).  I'm building a kernel with these changes now, I
should have something to test when I wake up tomorrow morning.  It
might even provide a bit of a performance boost as we would only be
calling a wa

Re: [PATCH v2 3/9] arc: remove support for DISCONTIGMEM

2021-06-04 Thread Vineet Gupta
On 6/3/21 11:49 PM, Mike Rapoport wrote:
> From: Mike Rapoport 
>
> DISCONTIGMEM was replaced by FLATMEM with freeing of the unused memory map
> in v5.11.
>
> Remove the support for DISCONTIGMEM entirely.
>
> Signed-off-by: Mike Rapoport 

Looks non-intrusive, but I'd still like to give this a spin on hardware
- considering highmem on ARC has a tendency to go sideways ;-)
Can you please share a branch!

Acked-by: Vineet Gupta 

Thx,
-Vineet

> ---
>   arch/arc/Kconfig  | 13 
>   arch/arc/include/asm/mmzone.h | 40 ---
>   arch/arc/mm/init.c|  8 ---
>   3 files changed, 61 deletions(-)
>   delete mode 100644 arch/arc/include/asm/mmzone.h
>
> diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
> index 2d98501c0897..d8f51eb8963b 100644
> --- a/arch/arc/Kconfig
> +++ b/arch/arc/Kconfig
> @@ -62,10 +62,6 @@ config SCHED_OMIT_FRAME_POINTER
>   config GENERIC_CSUM
>   def_bool y
>   
> -config ARCH_DISCONTIGMEM_ENABLE
> - def_bool n
> - depends on BROKEN
> -
>   config ARCH_FLATMEM_ENABLE
>   def_bool y
>   
> @@ -344,15 +340,6 @@ config ARC_HUGEPAGE_16M
>   
>   endchoice
>   
> -config NODES_SHIFT
> - int "Maximum NUMA Nodes (as a power of 2)"
> - default "0" if !DISCONTIGMEM
> - default "1" if DISCONTIGMEM
> - depends on NEED_MULTIPLE_NODES
> - help
> -   Accessing memory beyond 1GB (with or w/o PAE) requires 2 memory
> -   zones.
> -
>   config ARC_COMPACT_IRQ_LEVELS
>   depends on ISA_ARCOMPACT
>   bool "Setup Timer IRQ as high Priority"
> diff --git a/arch/arc/include/asm/mmzone.h b/arch/arc/include/asm/mmzone.h
> deleted file mode 100644
> index b86b9d1e54dc..000000000000
> --- a/arch/arc/include/asm/mmzone.h
> +++ /dev/null
> @@ -1,40 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0-only */
> -/*
> - * Copyright (C) 2016 Synopsys, Inc. (www.synopsys.com)
> - */
> -
> -#ifndef _ASM_ARC_MMZONE_H
> -#define _ASM_ARC_MMZONE_H
> -
> -#ifdef CONFIG_DISCONTIGMEM
> -
> -extern struct pglist_data node_data[];
> -#define NODE_DATA(nid) (&node_data[nid])
> -
> -static inline int pfn_to_nid(unsigned long pfn)
> -{
> - int is_end_low = 1;
> -
> - if (IS_ENABLED(CONFIG_ARC_HAS_PAE40))
> - is_end_low = pfn <= virt_to_pfn(0xFFFFFFFFUL);
> -
> - /*
> -  * node 0: lowmem: 0x8000_0000   to 0xFFFF_FFFF
> -  * node 1: HIGHMEM w/o  PAE40: 0x0   to 0x7FFF_FFFF
> -  * HIGHMEM with PAE40: 0x1_0000_0000 to ...
> -  */
> - if (pfn >= ARCH_PFN_OFFSET && is_end_low)
> - return 0;
> -
> - return 1;
> -}
> -
> -static inline int pfn_valid(unsigned long pfn)
> -{
> - int nid = pfn_to_nid(pfn);
> -
> - return (pfn <= node_end_pfn(nid));
> -}
> -#endif /* CONFIG_DISCONTIGMEM  */
> -
> -#endif
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index 397a201adfe3..abfeef7bf6f8 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -32,11 +32,6 @@ unsigned long arch_pfn_offset;
>   EXPORT_SYMBOL(arch_pfn_offset);
>   #endif
>   
> -#ifdef CONFIG_DISCONTIGMEM
> -struct pglist_data node_data[MAX_NUMNODES] __read_mostly;
> -EXPORT_SYMBOL(node_data);
> -#endif
> -
>   long __init arc_get_mem_sz(void)
>   {
>   return low_mem_sz;
> @@ -147,9 +142,6 @@ void __init setup_arch_memory(void)
>* to the hole is freed and ARC specific version of pfn_valid()
>* handles the hole in the memory map.
>*/
> -#ifdef CONFIG_DISCONTIGMEM
> - node_set_online(1);
> -#endif
>   
>   min_high_pfn = PFN_DOWN(high_mem_start);
>   max_high_pfn = PFN_DOWN(high_mem_start + high_mem_sz);



Re: [PATCH v2 2/9] arc: update comment about HIGHMEM implementation

2021-06-04 Thread Vineet Gupta
On 6/3/21 11:49 PM, Mike Rapoport wrote:
> From: Mike Rapoport 
>
> Arc does not use DISCONTIGMEM to implement high memory, update the comment
> describing how high memory works to reflect this.
>
> Signed-off-by: Mike Rapoport 

Acked-by: Vineet Gupta 

Thx,
-Vineet

> ---
>   arch/arc/mm/init.c | 13 +
>   1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arc/mm/init.c b/arch/arc/mm/init.c
> index e2ed355438c9..397a201adfe3 100644
> --- a/arch/arc/mm/init.c
> +++ b/arch/arc/mm/init.c
> @@ -139,16 +139,13 @@ void __init setup_arch_memory(void)
>   
>   #ifdef CONFIG_HIGHMEM
>   /*
> -  * Populate a new node with highmem
> -  *
>* On ARC (w/o PAE) HIGHMEM addresses are actually smaller (0 based)
> -  * than addresses in normal ala low memory (0x8000_ based).
> +  * than addresses in normal aka low memory (0x8000_ based).
>* Even with PAE, the huge peripheral space hole would waste a lot of
> -  * mem with single mem_map[]. This warrants a mem_map per region design.
> -  * Thus HIGHMEM on ARC is imlemented with DISCONTIGMEM.
> -  *
> -  * DISCONTIGMEM in turns requires multiple nodes. node 0 above is
> -  * populated with normal memory zone while node 1 only has highmem
> +  * mem with single contiguous mem_map[].
> +  * Thus when HIGHMEM on ARC is enabled the memory map corresponding
> +  * to the hole is freed and ARC specific version of pfn_valid()
> +  * handles the hole in the memory map.
>*/
>   #ifdef CONFIG_DISCONTIGMEM
>   node_set_online(1);



Re: [PATCH 4/4] powerpc/32: Avoid #ifdef nested with FTR_SECTION on booke syscall entry

2021-06-04 Thread Christophe Leroy




Le 04/06/2021 à 19:02, Andreas Schwab a écrit :

On Jun 04 2021, Christophe Leroy wrote:


On booke, SYSCALL_ENTRY macro nests an FTR_SECTION with a


That sentence lacks an




Argh !

It was . FTR_SECTION with a
#ifdef CONFIG_KVM_BOOKE_HV.

And git discarded the line starting with a #


Christophe


Re: [PATCH v4 06/16] powerpc/vas: Define and use common vas_window struct

2021-06-04 Thread Haren Myneni
On Fri, 2021-06-04 at 21:52 +1000, Michael Ellerman wrote:
> Haren Myneni  writes:
> > On Thu, 2021-06-03 at 14:38 +1000, Nicholas Piggin wrote:
> > > Excerpts from Haren Myneni's message of May 21, 2021 7:33 pm:
> > > > The same vas_window struct is used on powerNV and pseries. So this
> > > > patch changes struct vas_window to support both platforms and also
> > > > makes the corresponding modifications in the powerNV vas code.
> > > > 
> > > > On powerNV, vas_window is used for both TX and RX windows, whereas
> > > > it is used only for TX windows on powerVM. So some elements are
> > > > specific to these platforms.
> > > > 
> > > > Signed-off-by: Haren Myneni 
> > > > ---
> > > >  arch/powerpc/include/asm/vas.h  |  50 +++-
> > > >  arch/powerpc/platforms/powernv/vas-debug.c  |  12 +-
> > > >  arch/powerpc/platforms/powernv/vas-fault.c  |   4 +-
> > > >  arch/powerpc/platforms/powernv/vas-trace.h  |   6 +-
> > > >  arch/powerpc/platforms/powernv/vas-window.c | 129 +++-
> > > > 
> > > > 
> > > >  arch/powerpc/platforms/powernv/vas.h|  38 +-
> > > >  6 files changed, 135 insertions(+), 104 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/include/asm/vas.h
> > > > b/arch/powerpc/include/asm/vas.h
> > > > index 2c1040f399d9..49bfb5be896d 100644
> > > > --- a/arch/powerpc/include/asm/vas.h
> > > > +++ b/arch/powerpc/include/asm/vas.h
> > > > @@ -10,8 +10,6 @@
> > > >  #include 
> > > >  #include 
> > > >  
> > > > -struct vas_window;
> > > > -
> > > >  /*
> > > >   * Min and max FIFO sizes are based on Version 1.05 Section
> > > > 3.1.4.25
> > > >   * (Local FIFO Size Register) of the VAS workbook.
> > > > @@ -63,6 +61,54 @@ struct vas_user_win_ref {
> > > > struct mm_struct *mm;   /* Linux process mm_struct */
> > > >  };
> > > >  
> > > > +/*
> > > > + * In-kernel state of a VAS window. One per window.
> > > > + * powerVM: Used only for Tx windows.
> > > > + * powerNV: Used for both Tx and Rx windows.
> > > > + */
> > > > +struct vas_window {
> > > > +   u32 winid;
> > > > +   u32 wcreds_max; /* Window credits */
> > > > +   enum vas_cop_type cop;
> > > > +   struct vas_user_win_ref task_ref;
> > > > +   char *dbgname;
> > > > +   struct dentry *dbgdir;
> > > > +   union {
> > > > +   /* powerNV specific data */
> > > > +   struct {
> > > > +   void *vinst;    /* points to VAS instance */
> > > > +   bool tx_win;    /* True if send window */
> > > > +   bool nx_win;    /* True if NX window */
> > > > +   bool user_win;  /* True if user space window */
> > > > +   void *hvwc_map; /* HV window context */
> > > > +   void *uwc_map;  /* OS/User window context */
> > > > +
> > > > +   /* Fields applicable only to send windows */
> > > > +   void *paste_kaddr;
> > > > +   char *paste_addr_name;
> > > > +   struct vas_window *rxwin;
> > > > +
> > > > +   atomic_t num_txwins;    /* Only for receive windows */
> > > > +   } pnv;
> > > > +   struct {
> > > > +   u64 win_addr;   /* Physical paste address */
> > > > +   u8 win_type;    /* QoS or Default window */
> > > > +   u8 status;
> > > > +   u32 complete_irq;       /* Completion interrupt */
> > > > +   u32 fault_irq;  /* Fault interrupt */
> > > > +   u64 domain[6];  /* Associativity domain Ids */
> > > > +                   /* this window is allocated */
> > > > +   u64 util;
> > > > +
> > > > +   /* List of windows opened which is used for LPM */
> > > > +   struct list_head win_list;
> > > > +   u64 flags;
> > > > +   char *name;
> > > > +   int fault_virq;
> > > > +   } lpar;
> > > > +   };
> > > > +};
> > > 
> > > The more typical way to do code like this is have the common bit
> > > as
> > > its own struct, and then have the users embed it into their own
> > > structs.
> > > 
> > > 
> > > struct vas_window {
> > >   u32 winid;
> > >   u32 wcreds_max; /* Window credits */
> > >   enum vas_cop_type cop;
> > >   struct vas_user_win_ref task_ref;
> > >   char *dbgname;
> > >   struct dentry *dbgdir;
> > > };
> > > 
> > > struct pnv_vas_window {
> > >   struct vas_window vas_window;
> > > 
> > >   void *vinst;/* points to VAS instance */
> > >   bool tx_win;/* True if send window */
> > >   bool nx_win;/* True if NX window */
> > >   bool user_win;  /* True if user space window */

Re: [PATCH v4 09/16] powerpc/pseries/vas: Add HCALL wrappers for VAS handling

2021-06-04 Thread Haren Myneni
On Fri, 2021-06-04 at 21:52 +1000, Michael Ellerman wrote:
> Haren Myneni  writes:
> > This patch adds the following HCALL wrapper functions to allocate,
> 
> Normal spelling is "hcall".
> 
> > modify and deallocate VAS windows, and retrieve VAS capabilities.
> > 
> > H_ALLOCATE_VAS_WINDOW: Allocate VAS window
> > H_DEALLOCATE_VAS_WINDOW: Close VAS window
> > H_MODIFY_VAS_WINDOW: Setup window before using
> > H_QUERY_VAS_CAPABILITIES: Get VAS capabilities
> 
> Please tell us which version of PAPR, and in which section etc.,
> these
> are described in.
> 
> > Signed-off-by: Haren Myneni 
> > Reviewed-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/platforms/pseries/vas.c | 217 +++
> >  1 file changed, 217 insertions(+)
> >  create mode 100644 arch/powerpc/platforms/pseries/vas.c
> > 
> > diff --git a/arch/powerpc/platforms/pseries/vas.c
> > b/arch/powerpc/platforms/pseries/vas.c
> > new file mode 100644
> > index 000000000000..06960151477c
> > --- /dev/null
> > +++ b/arch/powerpc/platforms/pseries/vas.c
> > @@ -0,0 +1,217 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright 2020-21 IBM Corp.
> > + */
> > +
> > +#define pr_fmt(fmt) "vas: " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> 
> Do we need all those headers?
> 
> > +#include 
> > +#include 
> > +#include "vas.h"
> > +
> > +#define VAS_INVALID_WIN_ADDRESS 0xFFFFFFFFFFFFFFFFul
> > +#define VAS_DEFAULT_DOMAIN_ID   0xFFFFFFFFFFFFFFFFul
> 
> Some blank lines for formatting please.
> 
> > +/* Authority Mask Register (AMR) value is not supported in */
> > +/* linux implementation. So pass '0' to modify window HCALL */
> 
> Please fix the comment formatting.
> 
> > +#define VAS_AMR_VALUE   0
> 
> This is only used in one place. It'd be simpler to just pass 0 and
> move
> the comment there.
> 
> > +/* phyp allows one credit per window right now */
> > +#define DEF_WIN_CREDS  1
> > +
> > +static int64_t hcall_return_busy_check(int64_t rc)
> > +{
> 
> Please use normal kernel types, ie. s64, or just long.
> 
> Same comment throughout.
> 
> > +   /* Check if we are stalled for some time */
> > +   if (H_IS_LONG_BUSY(rc)) {
> > +   msleep(get_longbusy_msecs(rc));
> > +   rc = H_BUSY;
> > +   } else if (rc == H_BUSY) {
> > +   cond_resched();
> > +   }
> > +
> > +   return rc;
> > +}
> > +
> > +/*
> > + * Allocate VAS window HCALL
> > + */
> > +static int plpar_vas_allocate_window(struct vas_window *win, u64 *domain,
> > +u8 wintype, u16 credits)
> 
> You don't have to use the "plpar" prefix for these sort of wrappers.
> 
> Just naming them after the hcall would probably be clearer, so:
> 
>  h_allocate_vas_window(... )
> 
> > +{
> > +   long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
> > +   int64_t rc;
> > +
> > +   do {
> > +   rc = plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, wintype,
> > + credits, domain[0], domain[1], domain[2],
> > + domain[3], domain[4], domain[5]);
> > +
> > +   rc = hcall_return_busy_check(rc);
> > +   } while (rc == H_BUSY);
> > +
> > +   switch (rc) {
> > +   case H_SUCCESS:
> > +   win->winid = retbuf[0];
> > +   win->lpar.win_addr = retbuf[1];
> > +   win->lpar.complete_irq = retbuf[2];
> > +   win->lpar.fault_irq = retbuf[3];
> 
> You shouldn't mutate win until you know there is no error.
> 
> > +   if (win->lpar.win_addr == VAS_INVALID_WIN_ADDRESS) {
> > +   pr_err("HCALL(%x): COPY/PASTE is not supported\n",
> > +   H_ALLOCATE_VAS_WINDOW);
> > +   return -ENOTSUPP;
> > +   }
> > +   return 0;
> > +   case H_PARAMETER:
> > +   pr_err("HCALL(%x): Invalid window type (%u)\n",
> > +   H_ALLOCATE_VAS_WINDOW, wintype);
> > +   return -EINVAL;
> > +   case H_P2:
> > +   pr_err("HCALL(%x): Credits(%u) exceed maximum window credits\n",
> > +   H_ALLOCATE_VAS_WINDOW, credits);
> > +   return -EINVAL;
> > +   case H_COP_HW:
> > +   pr_err("HCALL(%x): User-mode COPY/PASTE is not supported\n",
> > +   H_ALLOCATE_VAS_WINDOW);
> > +   return -ENOTSUPP;
> > +   case H_RESOURCE:
> > +   pr_err("HCALL(%x): LPAR credit limit exceeds window limit\n",
> > +   H_ALLOCATE_VAS_WINDOW);
> > +   return -EPERM;
> > +   case H_CONSTRAINED:
> > +   pr_err("HCALL(%x): Credits (%u) are not available\n",
> > +   H_ALLOCATE_VAS_WINDOW, credits);
> > +   return -EPERM;
> > +   default:
> > +   pr_err("HCALL(%x): Unexpected error %lld\n",
> > +   H_ALLOCATE_VAS_WINDOW, rc);
> > +   return -

Re: [PATCH v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks

2021-06-04 Thread Paul Moore
On Fri, Jun 4, 2021 at 2:02 PM Daniel Borkmann  wrote:
>
> On 6/4/21 6:50 AM, Paul Moore wrote:
> > On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann  wrote:
> [...]
> >> I did run this entire discussion by both of the other BPF co-maintainers
> >> (Alexei, Andrii, CC'ed) and together we did further brainstorming on the
> >> matter on how we could solve this, but couldn't find a sensible & clean
> >> solution so far.
> >
> > Before I jump into the patch below I just want to say that I
> > appreciate you looking into solutions on the BPF side of things.
> > However, I voted "no" on this patch previously and since you haven't
> > really changed it, my "no"/NACK vote remains, at least until we
> > exhaust a few more options.
>
> Just to set the record straight, you previously did neither ACK nor NACK it.

I think I said it wasn't a great solution.  Perhaps I should have been
more clear, but I don't like NACK'ing things while we are still
hashing out possible solutions on the lists.

From my perspective NACK'ing is pretty harsh and I try to leave that
as a last resort.

> And again, as summarized earlier, this patch is _fixing_ the majority of the
> damage caused by 59438b46471a for at least the BPF side of things where users
> run into this, Ondrej the rest. Everything else can be discussed on top, and
> so far it seems there is no clean solution in front of us either, not even
> speaking of one that is small and suitable for _stable_ trees. Let me
> reiterate where we are: it's not that the original implementation in
> 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is in confidentiality
> mode") was broken, it's that the later added _SELinux_ locked_down hook
> implementation that is broken, and not other LSMs.

I think there are still options for how to solve this moving forward,
more on that at the end of the email, and I would like to see if we
can chase down some of those ideas first.  Maybe the ideas aren't
practical, or maybe they are practical but not from a -stable
perspective; we/I don't know until we talk about it.  Based on
previous experience I'm afraid to ACK a quick-fix without some
discussion about the proper long-term fix.  If the long-term fix isn't
suitable for -stable, then we can take a smaller fix just for the
stable trees.

> Now you're trying to retrofittingly
> ask us for hacks at all costs just because of /a/ broken LSM implementation.

This is starting to get a little off the rails now isn't it?  Daniel
you've always come across as a reasonable person to me, but phrasing
things like that is inflammatory at best.

Could the SELinux lockdown hooks have been done better, of course, I
don't think we are debating that anymore.  New functionality, checks,
etc. are added to the kernel all the time, and because we're all human
we screw that up on occasion.  The important part is that we come
together to fix it, which is what I think we're trying to do here
(it's what I'm trying to do here).

> Maybe another avenue is to just swallow the pill and revert 59438b46471a
> since it made assumptions that don't work /generally/. And the use case for
> an application runtime policy change in particular in case of lockdown where
> users only have 3 choices is extremely tiny/rare, if it's not then something
> is very wrong in your deployment. Let me ask you this ... are you also
> planning to address *all* the other cases inside the kernel? If your answer
> is no, then this entire discussion is pointless.

Um, yes?  We were talking about that earlier in the thread before the
BPF portions of the fix started to dominate the discussion.  Just
because the BPF portion of the fix has dominated the discussion
recently doesn't mean the other issues aren't going to be addressed.

When stuff is busted, I work to fix it.  I think everyone here does.

> >> [PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
> >>
> >> Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
> >> added an implementation of the locked_down LSM hook to SELinux, with the aim
> >> to restrict which domains are allowed to perform operations that would breach
> >> lockdown. This is indirectly also getting audit subsystem involved to report
> >> events. The latter is problematic, as reported by Ondrej and Serhei, since it
> >> can bring down the whole system via audit:
> >>
> >> 1) The audit events that are triggered due to calls to security_locked_down()
> >>can OOM kill a machine, see below details [0].
> >>
> >> 2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
> >>when trying to wake up kauditd, for example, when using trace_sched_switch()
> >>tracepoint, see details in [1]. Triggering this was not via some hypothetical
> >>corner case, but with existing tools like runqlat & runqslower from bcc, for
> >>example, which

Re: [PATCH v2] lockdown,selinux: avoid bogus SELinux lockdown permission checks

2021-06-04 Thread Alexei Starovoitov
On Fri, Jun 4, 2021 at 4:34 PM Paul Moore  wrote:
>
> > Again, the problem is not limited to BPF at all. kprobes is doing register-
> > time hooks which are equivalent to the one of BPF. Anything in run-time
> > trying to prevent probe_read_kernel by kprobes or BPF is broken by design.
>
> Not being an expert on kprobes I can't really comment on that, but
> right now I'm focused on trying to make things work for the BPF
> helpers.  I suspect that if we can get the SELinux lockdown
> implementation working properly for BPF the solution for kprobes won't
> be far off.

Paul,

Both kprobe and bpf can call probe_read_kernel==copy_from_kernel_nofault
from all contexts.
Including NMI. Most of audit_log_* is not acceptable.
Just removing a wakeup is not solving anything.
Audit hooks don't belong in NMI.
Audit design needs memory allocation. Hence it's not suitable
for NMI and hardirq. But kprobes and bpf progs do run just fine there.
BPF, for example, only uses pre-allocated memory.


Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Nicholas Piggin
Excerpts from Andy Lutomirski's message of June 5, 2021 3:05 am:
> On 6/4/21 9:54 AM, Andy Lutomirski wrote:
>> On 5/31/21 11:22 PM, Nicholas Piggin wrote:
>>> There haven't been objections to the series since last posting, this
>>> is just a rebase and tidies up a few comments minor patch rearranging.
>>>
>> 
>> I continue to object to having too many modes.  I like my more generic
>> improvements better.  Let me try to find some time to email again.
>> 
> 
> Specifically, this:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm

That's worse than what powerpc does with the shoot lazies code so 
we wouldn't use it anyway.

The fact is mm-cpumask and lazy mm is very architecture specific, so I 
don't really see that another "mode" is such a problem, it's for the 
most part "this is what powerpc does" -> "this is what powerpc does".
The only mode in the context switch is just "take a ref on the lazy mm"
or "don't take a ref". Surely that's not too onerous to add!?

Actually the bigger part of it is actually the no-lazy mmu mode which
is not yet used, I thought it was a neat little demonstrator of how code
works with/without lazy but I will get rid of that for submission.


> I, or someone, needs to dust off my membarrier series before any of
> these kinds of changes get made.  The barrier situation in the scheduler
> is too confusing otherwise.
> 

I disagree, I've disentangled the changes from membarrier stuff now, 
they can be done concurrently.

Thanks,
Nick


Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of June 5, 2021 10:17 am:
> Excerpts from Andy Lutomirski's message of June 5, 2021 3:05 am:
>> On 6/4/21 9:54 AM, Andy Lutomirski wrote:
>>> On 5/31/21 11:22 PM, Nicholas Piggin wrote:
 There haven't been objections to the series since last posting, this
 is just a rebase and tidies up a few comments minor patch rearranging.

>>> 
>>> I continue to object to having too many modes.  I like my more generic
>>> improvements better.  Let me try to find some time to email again.
>>> 
>> 
>> Specifically, this:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm
> 
> That's worse than what powerpc does with the shoot lazies code so 
> we wouldn't use it anyway.
> 
> The fact is mm-cpumask and lazy mm is very architecture specific, so I 
> don't really see that another "mode" is such a problem, it's for the 
> most part "this is what powerpc does" -> "this is what powerpc does".
> The only mode in the context switch is just "take a ref on the lazy mm"
> or "don't take a ref". Surely that's not too onerous to add!?
> 
> Actually the bigger part of it is actually the no-lazy mmu mode which
> is not yet used, I thought it was a neat little demonstrator of how code
> works with/without lazy but I will get rid of that for submission.

I admit that does add a bit more churn than necessary; maybe that was
the main objection.

Here is the entire kernel/sched/core.c change after that is removed.
Pretty simple now. I'll resubmit.

Thanks,
Nick


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e359c76ea2e2..1be0b97e12ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4171,7 +4171,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
__releases(rq->lock)
 {
struct rq *rq = this_rq();
-   struct mm_struct *mm = rq->prev_mm;
+   struct mm_struct *mm = NULL;
long prev_state;
 
/*
@@ -4190,7 +4190,10 @@ static struct rq *finish_task_switch(struct task_struct *prev)
  current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);
 
-   rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   mm = rq->prev_lazy_mm;
+   rq->prev_lazy_mm = NULL;
+#endif
 
/*
 * A task struct has one reference for the use as "current".
@@ -4326,9 +4329,21 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
if (!prev->mm) {// from kernel
-   /* will mmdrop_lazy_tlb() in finish_task_switch(). */
-   rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   /* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+   rq->prev_lazy_mm = prev->active_mm;
prev->active_mm = NULL;
+#else
+   /*
+* Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+* tracking (because no rq->prev_lazy_mm) in
+* finish_task_switch, so no mmdrop_lazy_tlb(),
+* so no memory barrier for membarrier (see the
+* membarrier comment in finish_task_switch()).
+* Do it here.
+*/
+   smp_mb();
+#endif
}
}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..0729cf19a987 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -961,7 +961,9 @@ struct rq {
struct task_struct  *idle;
struct task_struct  *stop;
unsigned long   next_balance;
-   struct mm_struct*prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   struct mm_struct*prev_lazy_mm;
+#endif
 
unsigned intclock_update_flags;
u64 clock;



Re: [PATCH v4 04/16] powerpc/vas: Create take/drop pid and mm references

2021-06-04 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of June 4, 2021 2:08 pm:
> On Thu, 2021-06-03 at 14:21 +1000, Nicholas Piggin wrote:
>> Excerpts from Haren Myneni's message of May 21, 2021 7:31 pm:
>> > Take pid and mm references when each window opens and drop them during
>> > close. This functionality is needed for powerNV and pseries. So
>> > this patch defines the existing code as functions in the common book3s
>> > platform vas-api.c
>> > 
>> > Signed-off-by: Haren Myneni 
>> 
>> Seems like a good idea to put these into their own helper functions.
>> 
>> > ---
>> >  arch/powerpc/include/asm/vas.h  | 25 +
>> >  arch/powerpc/platforms/book3s/vas-api.c | 51
>> > ++
>> >  arch/powerpc/platforms/powernv/vas-fault.c  | 10 ++--
>> >  arch/powerpc/platforms/powernv/vas-window.c | 57 ++---
>> > 
>> >  arch/powerpc/platforms/powernv/vas.h|  6 +--
>> >  5 files changed, 88 insertions(+), 61 deletions(-)
>> > 
>> > diff --git a/arch/powerpc/include/asm/vas.h
>> > b/arch/powerpc/include/asm/vas.h
>> > index 668303198772..3f2b02461a76 100644
>> > --- a/arch/powerpc/include/asm/vas.h
>> > +++ b/arch/powerpc/include/asm/vas.h
>> > @@ -5,6 +5,9 @@
>> >  
>> >  #ifndef _ASM_POWERPC_VAS_H
>> >  #define _ASM_POWERPC_VAS_H
>> > +#include 
>> > +#include 
>> > +#include 
>> >  #include 
>> >  
>> >  struct vas_window;
>> > @@ -49,6 +52,17 @@ enum vas_cop_type {
>> >VAS_COP_TYPE_MAX,
>> >  };
>> >  
>> > +/*
>> > + * User space VAS windows are opened by tasks and take references
>> > + * to pid and mm until windows are closed.
>> > + * Stores pid, mm, and tgid for each window.
>> > + */
>> > +struct vas_user_win_ref {
>> > +  struct pid *pid;/* PID of owner */
>> > +  struct pid *tgid;   /* Thread group ID of owner */
>> > +  struct mm_struct *mm;   /* Linux process mm_struct */
>> > +};
>> > +
>> >  /*
>> >   * User space window operations used for powernv and powerVM
>> >   */
>> > @@ -59,6 +73,16 @@ struct vas_user_win_ops {
>> >int (*close_win)(void *);
>> >  };
>> >  
>> > +static inline void vas_drop_reference_pid_mm(struct
>> > vas_user_win_ref *ref)
>> > +{
>> > +  /* Drop references to pid and mm */
>> > +  put_pid(ref->pid);
>> > +  if (ref->mm) {
>> > +  mm_context_remove_vas_window(ref->mm);
>> > +  mmdrop(ref->mm);
>> > +  }
>> > +}
>> 
>> You don't have to make up a new name for such a thing because you 
>> already have one
>> 
>> put_vas_user_win_ref(struct vas_user_win_ref *ref)
>> 
>> 
>> > +
>> >  /*
>> >   * Receive window attributes specified by the (in-kernel) owner of
>> > window.
>> >   */
>> > @@ -192,4 +216,5 @@ int vas_register_coproc_api(struct module *mod,
>> > enum vas_cop_type cop_type,
>> >struct vas_user_win_ops *vops);
>> >  void vas_unregister_coproc_api(void);
>> >  
>> > +int vas_reference_pid_mm(struct vas_user_win_ref *task_ref);
>> >  #endif /* __ASM_POWERPC_VAS_H */
>> > diff --git a/arch/powerpc/platforms/book3s/vas-api.c
>> > b/arch/powerpc/platforms/book3s/vas-api.c
>> > index 6c39320bfb9b..a0141bfb2e4b 100644
>> > --- a/arch/powerpc/platforms/book3s/vas-api.c
>> > +++ b/arch/powerpc/platforms/book3s/vas-api.c
>> > @@ -55,6 +55,57 @@ static char *coproc_devnode(struct device *dev,
>> > umode_t *mode)
>> >return kasprintf(GFP_KERNEL, "crypto/%s", dev_name(dev));
>> >  }
>> >  
>> > +/*
>> > + * Take reference to pid and mm
>> > + */
>> > +int vas_reference_pid_mm(struct vas_user_win_ref *task_ref)
>> > +{
>> 
>> So this is quite different from a typical refcount object in that it's
>> opening it for access as well. I would split it in two functions, one
>> matching put_vas_user_win_ref() and appearing in the same place in code,
>> which is up to about mmput, and another function that adds the window
>> and does the CP_ABORT etc... hmm, where do you release tgid?
> 
> I basically copied the existing code into these functions
> (vas_reference_pid_mm/vas_drop_reference_pid_mm) so that it's useful for
> both platforms. 
> 
> mm_context_add/remove_vas_window() is also like taking a reference. So
> instead of adding 2 separate functions, how about naming them
> get/put_vas_user_win_ref()? 

It's actually different though. What I'm asking is that the parts where you
interact with core kernel data structure refcounts go into their own
get/put functions.

Someone who understands that refcounting and looks at the code will care 
about those bits, so having them all together I think is helpful. They 
don't know about adding vas windows or CP_ABORT.

> Regarding tgid, the reference is taken only on pid, not tgid.
> pid reuse can happen only in the case of multithreaded applications when
> the child that opened the VAS window exits. But these windows will be
> closed when the tgid exits. So we do not need a tgid reference.

I don't understand you.  The code you added does take a reference to 
tgid...

>> > +  /*
>> > +   * Window opened by a child thread may not be closed when
>> > +   * it exits. So ta

Re: [PATCH v4 07/16] powerpc/pseries/vas: Define VAS/NXGZIP HCALLs and structs

2021-06-04 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of June 4, 2021 11:30 am:
> On Thu, 2021-06-03 at 14:47 +1000, Nicholas Piggin wrote:
>> Excerpts from Haren Myneni's message of May 21, 2021 7:34 pm:
>> > This patch adds HCALLs and other definitions. Also defines structs
>> > that are used in the VAS implementation on powerVM.
>> > 
>> > Signed-off-by: Haren Myneni 
>> > ---
>> >  arch/powerpc/include/asm/hvcall.h|   7 ++
>> >  arch/powerpc/include/asm/vas.h   |  32 
>> >  arch/powerpc/platforms/pseries/vas.h | 110 +++
>> >  3 files changed, 149 insertions(+)
>> >  create mode 100644 arch/powerpc/platforms/pseries/vas.h
>> > 
>> > diff --git a/arch/powerpc/include/asm/hvcall.h
>> > b/arch/powerpc/include/asm/hvcall.h
>> > index e3b29eda8074..7c3418d1b5e9 100644
>> > --- a/arch/powerpc/include/asm/hvcall.h
>> > +++ b/arch/powerpc/include/asm/hvcall.h
>> > @@ -294,6 +294,13 @@
>> >  #define H_RESIZE_HPT_COMMIT   0x370
>> >  #define H_REGISTER_PROC_TBL   0x37C
>> >  #define H_SIGNAL_SYS_RESET0x380
>> > +#define H_ALLOCATE_VAS_WINDOW 0x388
>> > +#define H_MODIFY_VAS_WINDOW   0x38C
>> > +#define H_DEALLOCATE_VAS_WINDOW   0x390
>> > +#define H_QUERY_VAS_WINDOW0x394
>> > +#define H_QUERY_VAS_CAPABILITIES  0x398
>> > +#define H_QUERY_NX_CAPABILITIES   0x39C
>> > +#define H_GET_NX_FAULT0x3A0
>> >  #define H_INT_GET_SOURCE_INFO   0x3A8
>> >  #define H_INT_SET_SOURCE_CONFIG 0x3AC
>> >  #define H_INT_GET_SOURCE_CONFIG 0x3B0
>> > diff --git a/arch/powerpc/include/asm/vas.h
>> > b/arch/powerpc/include/asm/vas.h
>> > index 49bfb5be896d..371f62d99174 100644
>> > --- a/arch/powerpc/include/asm/vas.h
>> > +++ b/arch/powerpc/include/asm/vas.h
>> > @@ -181,6 +181,7 @@ struct vas_tx_win_attr {
>> >bool rx_win_ord_mode;
>> >  };
>> >  
>> > +#ifdef CONFIG_PPC_POWERNV
>> >  /*
>> >   * Helper to map a chip id to VAS id.
>> >   * For POWER9, this is a 1:1 mapping. In the future this maybe a
>> > 1:N
>> > @@ -248,6 +249,37 @@ void vas_win_paste_addr(struct vas_window
>> > *window, u64 *addr,
>> >  int vas_register_api_powernv(struct module *mod, enum vas_cop_type
>> > cop_type,
>> > const char *name);
>> >  void vas_unregister_api_powernv(void);
>> > +#endif
>> > +
>> > +#ifdef CONFIG_PPC_PSERIES
>> > +
>> > +/* VAS Capabilities */
>> > +#define VAS_GZIP_QOS_FEAT 0x1
>> > +#define VAS_GZIP_DEF_FEAT 0x2
>> > +#define VAS_GZIP_QOS_FEAT_BIT PPC_BIT(VAS_GZIP_QOS_FEAT) /*
>> > Bit 1 */
>> > +#define VAS_GZIP_DEF_FEAT_BIT PPC_BIT(VAS_GZIP_DEF_FEAT) /*
>> > Bit 2 */
>> > +
>> > +/* NX Capabilities */
>> > +#define VAS_NX_GZIP_FEAT  0x1
>> > +#define VAS_NX_GZIP_FEAT_BIT  PPC_BIT(VAS_NX_GZIP_FEAT) /*
>> > Bit 1 */
>> > +#define VAS_DESCR_LEN 8
>> > +
>> > +/*
>> > + * These structs are used to retrieve overall VAS capabilities
>> > that
>> > + * the hypervisor provides.
>> > + */
>> > +struct hv_vas_all_caps {
>> 
>> ...
>> 
>> > +
>> > +/*
>> > + * Use to get feature specific capabilities from the
>> > + * hypervisor.
>> > + */
>> > +struct hv_vas_ct_caps {
>> 
>> ...
>> 
>> > +/*
>> > + * To get window information from the hypervisor.
>> > + */
>> > +struct hv_vas_win_lpar {
>> 
>> Are any of these names coming from PAPR? If not, then typically we
>> don't 
>> use hv_ kind of prefixes for something we got from the hypervisor,
>> but
>> rather something that's hypervisor privileged or specific information
>> about the hypervisor perhaps (which is not the same as what the 
>> hypervisor may or may not advertise to the guest).
>> 
>> So if these are all capabilities and features the hypervisor
>> advertises 
>> to the LPAR, I think the hv_ should be dropped.
> 
> The hypervisor advertises the available capabilities and the LPAR
> passes a buffer to HCALLs to retrieve these capabilities / features
> (in BE). 
> 
> I was using *_be in the previous version. So can I use something like
> 'struct vas_ct_caps_be'?  
> 
>> 
>> Otherwise seems okay. I would be careful of the "lpar" name. I think 
>> it's okay in this situation where the hypervisor advertises features
>> to 
>> the partition, but in other parts of the vas code you call it
>> pseries_
>> but you also have some lpar_ bits. So aside from interacting with
>> PAPR
>> APIs, it would be safe to consistently use pseries for your driver
>> and
>> type names.
> 
> I can use 'struct hv_vas_win_pseries' or 'struct vas_win_pseries_be'

I'm actually wrong about hv_ prefix now I look a bit further, there are
other structures that do use it for similar hcalls, sorry about that.

I'm not sure I like it too much but it is what it is. I would say don't
worry about churning your hv_ names for now. It's probably better / more 
used than _be postfix. Changing conventions could be a later 
powerpc-wide series if it's that important.

Thanks,
Nick


Re: [PATCH v4 12/16] powerpc/pseries/vas: Setup IRQ and fault handling

2021-06-04 Thread Nicholas Piggin
Excerpts from Haren Myneni's message of June 4, 2021 11:19 am:
> On Thu, 2021-06-03 at 15:48 +1000, Nicholas Piggin wrote:
>> Excerpts from Haren Myneni's message of May 21, 2021 7:39 pm:
>> > NX generates an interrupt when it sees a fault on the user space
>> > buffer and the hypervisor forwards that interrupt to the OS. Then
>> > the kernel handles the interrupt by issuing the H_GET_NX_FAULT hcall
>> > to retrieve the fault CRB information.
>> > 
>> > This patch also adds changes to set up and free an IRQ for each
>> > window and also handles the fault by updating the CSB.
>> > 
>> > Signed-off-by: Haren Myneni 
>> > ---
>> >  arch/powerpc/platforms/pseries/vas.c | 111
>> > +++
>> >  1 file changed, 111 insertions(+)
>> > 
>> > diff --git a/arch/powerpc/platforms/pseries/vas.c
>> > b/arch/powerpc/platforms/pseries/vas.c
>> > index ef0c455f6e93..31dc17573f50 100644
>> > --- a/arch/powerpc/platforms/pseries/vas.c
>> > +++ b/arch/powerpc/platforms/pseries/vas.c
>> > @@ -224,6 +224,62 @@ int plpar_vas_query_capabilities(const u64
>> > hcall, u8 query_type,
>> >  }
>> >  EXPORT_SYMBOL_GPL(plpar_vas_query_capabilities);
>> >  
>> > +/*
>> > + * HCALL to get fault CRB from pHyp.
>> > + */
>> > +static int plpar_get_nx_fault(u32 winid, u64 buffer)
>> > +{
>> > +  int64_t rc;
>> > +
>> > +  rc = plpar_hcall_norets(H_GET_NX_FAULT, winid, buffer);
>> > +
>> > +  switch (rc) {
>> > +  case H_SUCCESS:
>> > +  return 0;
>> > +  case H_PARAMETER:
>> > +  pr_err("HCALL(%x): Invalid window ID %u\n",
>> > H_GET_NX_FAULT,
>> > + winid);
>> > +  return -EINVAL;
>> > +  case H_STATE:
>> > +  pr_err("HCALL(%x): No outstanding faults for window ID
>> > %u\n",
>> > + H_GET_NX_FAULT, winid);
>> > +  return -EINVAL;
>> > +  case H_PRIVILEGE:
>> > +  pr_err("HCALL(%x): Window(%u): Invalid fault buffer
>> > 0x%llx\n",
>> > + H_GET_NX_FAULT, winid, buffer);
>> > +  return -EACCES;
>> > +  default:
>> > +  pr_err("HCALL(%x): Unexpected error %lld for
>> > window(%u)\n",
>> > + H_GET_NX_FAULT, rc, winid);
>> > +  return -EIO;
>> > +  }
>> > +}
>> 
>> Out of curiosity, you get one of these errors and it just drops the
>> interrupt on the floor. Then what happens, I assume everything
>> stops. Should it put some error in the csb, or signal the process or
>> something? Or is there nothing very sane that can be done?
> 
> The user space polls on the CSB with a timeout interval. If the kernel
> or NX does not return, the request will time out.

Okay, if there is no sane way it can respond to the different error 
cases that's not necessarily unreasonable to just print something in the 
kernel log. Hopefully the kernel log would be useful to the 
administrator / developer / etc, but that's pretty rarely the case for 
Linux errors as it is.

> The hypervisor returns the credit after the H_GET_NX_FAULT HCALL is
> successful. Also, one credit is assigned for each window. So in this
> case, the error is coming from the hypervisor and the application
> cannot issue another request on the same window. 
> 
>> 
>> > +
>> > +/*
>> > + * Handle the fault interrupt.
>> > + * When the fault interrupt is received for each window, query
>> > pHyp to get
>> > + * the fault CRB on the specific fault. Then process the CRB by
>> > updating
>> > + * CSB or send signal if the user space CSB is invalid.
>> > + * Note: pHyp forwards an interrupt for each fault request. So one
>> > fault
>> > + *CRB to process for each H_GET_NX_FAULT HCALL.
>> > + */
>> > +irqreturn_t pseries_vas_fault_thread_fn(int irq, void *data)
>> > +{
>> > +  struct vas_window *txwin = data;
>> > +  struct coprocessor_request_block crb;
>> > +  struct vas_user_win_ref *tsk_ref;
>> > +  int rc;
>> > +
>> > +  rc = plpar_get_nx_fault(txwin->winid, (u64)virt_to_phys(&crb));
>> > +  if (!rc) {
>> > +  tsk_ref = &txwin->task_ref;
>> > +  vas_dump_crb(&crb);
>> 
>> This (and existing powernv vas code) is printk()ing a lot of lines
>> per 
>> fault. This should be pretty normal operation I think? It should
>> avoid
>> filling the kernel logs, if so. Particularly if it can be triggered
>> by 
>> userspace.
>> 
>> I know it's existing code, so could be fixed separately from the
>> series.
> 
> printk messages appear only if the HCALL returns a failure or there is
> a kernel issue (ex: an invalid window, etc. on powerNV). These errors
> should not depend on the user space requests. So generally we should
> not get these errors.  

Ah I was looking at dump_crb but that's using pr_devel so that's 
probably okay.

Thanks,
Nick


[PATCH v4 0/4] shoot lazy tlbs

2021-06-04 Thread Nicholas Piggin
The additional unused config option was a valid criticism, so this now
purely just toggles refcounting of the lazy tlb mm.

Thanks,
Nick

Since v3:
- Removed the extra config option, MMU_LAZY_TLB=n. This can be
  resurrected if an arch wants it.

Nicholas Piggin (4):
  lazy tlb: introduce lazy mm refcount helper functions
  lazy tlb: allow lazy tlb mm refcounting to be configurable
  lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

 arch/Kconfig | 17 ++
 arch/arm/mach-rpc/ecard.c|  2 +-
 arch/powerpc/Kconfig |  1 +
 arch/powerpc/kernel/smp.c|  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 +--
 fs/exec.c|  4 +--
 include/linux/sched/mm.h | 20 +++
 kernel/cpu.c |  2 +-
 kernel/exit.c|  2 +-
 kernel/fork.c| 51 
 kernel/kthread.c | 11 +++---
 kernel/sched/core.c  | 35 +--
 kernel/sched/sched.h |  4 ++-
 13 files changed, 132 insertions(+), 23 deletions(-)

-- 
2.23.0



[PATCH v4 1/4] lazy tlb: introduce lazy mm refcount helper functions

2021-06-04 Thread Nicholas Piggin
Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes lazy mm references more obvious, and allows explicit
refcounting to be removed if it is not used.

Signed-off-by: Nicholas Piggin 
---
 arch/arm/mach-rpc/ecard.c|  2 +-
 arch/powerpc/kernel/smp.c|  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c|  4 ++--
 include/linux/sched/mm.h | 11 +++
 kernel/cpu.c |  2 +-
 kernel/exit.c|  2 +-
 kernel/kthread.c | 11 +++
 kernel/sched/core.c  | 15 ---
 9 files changed, 34 insertions(+), 19 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index 827b50f1c73e..1b4a41aad793 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
current->mm = mm;
current->active_mm = mm;
activate_mm(active_mm, mm);
-   mmdrop(active_mm);
+   mmdrop_lazy_tlb(active_mm);
ecard_init_pgtables(mm);
return 0;
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 2e05c783440a..fb0bdfc67366 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1541,7 +1541,7 @@ void start_secondary(void *unused)
 {
unsigned int cpu = raw_smp_processor_id();
 
-   mmgrab(&init_mm);
+   mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
 
smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 409e61210789..2962082787c0 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -663,10 +663,10 @@ void exit_lazy_flush_tlb(struct mm_struct *mm, bool 
always_flush)
if (current->active_mm == mm) {
WARN_ON_ONCE(current->mm != NULL);
/* Is a kernel thread and is using mm as the lazy tlb */
-   mmgrab(&init_mm);
+   mmgrab_lazy_tlb(&init_mm);
current->active_mm = &init_mm;
switch_mm_irqs_off(mm, &init_mm, current);
-   mmdrop(mm);
+   mmdrop_lazy_tlb(mm);
}
 
/*
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..ca0f8b1af23a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1029,9 +1029,9 @@ static int exec_mmap(struct mm_struct *mm)
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
mm_update_next_owner(old_mm);
mmput(old_mm);
-   return 0;
+   } else {
+   mmdrop_lazy_tlb(active_mm);
}
-   mmdrop(active_mm);
return 0;
 }
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..bfd1baca5266 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,17 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
 }
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+   mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+   mmdrop(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
diff --git a/kernel/cpu.c b/kernel/cpu.c
index e538518556f4..e87a89824e6c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -602,7 +602,7 @@ static int finish_cpu(unsigned int cpu)
 */
if (mm != &init_mm)
idle->active_mm = &init_mm;
-   mmdrop(mm);
+   mmdrop_lazy_tlb(mm);
return 0;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..8e87ec5f6be2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -476,7 +476,7 @@ static void exit_mm(void)
__set_current_state(TASK_RUNNING);
mmap_read_lock(mm);
}
-   mmgrab(mm);
+   mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fe3f2a40d61e..b70e28431a01 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1314,14 +1314,14 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);
 
+   mmgrab(mm);
+
task_lock(tsk);
/* Hold off tlb flush IPIs while switching mm's */
local_irq_disable();
active_mm = tsk->active_mm;
-   if (active_mm != mm) {
-   mmgrab(mm);
+   if (active_mm != mm)
tsk->active_mm = mm;
-   }
tsk->mm = mm;
membarrier_update_current_mm(mm);
switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1341,7 +1341,7 @@ void kthread_use_mm(struct mm_struct *mm)
 * mmdrop(), or explicitly with smp_mb().
 */
if (active_mm != mm)

[PATCH v4 2/4] lazy tlb: allow lazy tlb mm refcounting to be configurable

2021-06-04 Thread Nicholas Piggin
Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
when it is context switched. This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the
last refcount is dropped. Currently this is always enabled, which is
what existing code does, so the patch is effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig |  4 
 include/linux/sched/mm.h | 13 +++--
 kernel/sched/core.c  | 22 ++
 kernel/sched/sched.h |  4 +++-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index c45b770d3579..1cff045cdde6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,10 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  irqs disabled over activate_mm. Architectures that do IPI based TLB
  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+config MMU_LAZY_TLB_REFCOUNT
+   def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index bfd1baca5266..29e4638ad124 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -52,12 +52,21 @@ static inline void mmdrop(struct mm_struct *mm)
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-   mmgrab(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+   mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-   mmdrop(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+   mmdrop(mm);
+   } else {
+   /*
+* mmdrop_lazy_tlb must provide a full memory barrier, see the
+* membarrier comment finish_task_switch which relies on this.
+*/
+   smp_mb();
+   }
 }
 
 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e359c76ea2e2..5e10cb712be3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4171,7 +4171,7 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
__releases(rq->lock)
 {
struct rq *rq = this_rq();
-   struct mm_struct *mm = rq->prev_mm;
+   struct mm_struct *mm = NULL;
long prev_state;
 
/*
@@ -4190,7 +4190,10 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
  current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);
 
-   rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   mm = rq->prev_lazy_mm;
+   rq->prev_lazy_mm = NULL;
+#endif
 
/*
 * A task struct has one reference for the use as "current".
@@ -4326,9 +4329,20 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
if (!prev->mm) {// from kernel
-   /* will mmdrop_lazy_tlb() in finish_task_switch(). */
-   rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   /* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+   rq->prev_lazy_mm = prev->active_mm;
prev->active_mm = NULL;
+#else
+   /*
+* Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+* tracking (because no rq->prev_lazy_mm) in
+* finish_task_switch, so no mmdrop_lazy_tlb(), so no
+* memory barrier for membarrier (see the membarrier
+* comment in finish_task_switch()).  Do it here.
+*/
+   smp_mb();
+#endif
}
}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a189bec13729..0729cf19a987 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -961,7 +961,9 @@ struct rq {
struct task_struct  *idle;
struct task_struct  *stop;
unsigned long   next_balance;
-   struct mm_struct*prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+   struct mm_struct*prev_lazy_mm;
+#endif
 
unsigned intclock_update_flags;
u64 clock;
-- 
2.23.0



[PATCH v4 3/4] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2021-06-04 Thread Nicholas Piggin
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

Shootdown IPIs are some concern, but they have not been observed to be
a big problem with this scheme (the powerpc implementation generated
314 additional interrupts on a 144 CPU system during a kernel compile).
There are a number of strategies that could be employed to reduce IPIs
if they turn out to be a problem for some workload.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig  | 13 +
 kernel/fork.c | 51 +++
 2 files changed, 64 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 1cff045cdde6..f8136c893991 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -421,6 +421,19 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
 config MMU_LAZY_TLB_REFCOUNT
def_bool y
+   depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# Instead of refcounting the lazy mm struct for kernel thread references
+# (which can cause contention with multi-threaded apps on large multiprocessor
+# systems), this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
+# switch to init_mm if they were using the to-be-freed mm as the lazy tlb. To
+# implement this, architectures must use _lazy_tlb variants of mm refcounting
+# when releasing kernel thread mm references, and mm_cpumask must include at
+# least all possible CPUs in which the mm might be lazy, at the time of the
+# final mmdrop. mmgrab/mmdrop in arch/ code must be switched to _lazy_tlb
+# postfix as necessary.
+config MMU_LAZY_TLB_SHOOTDOWN
+   bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..8085ff33c7f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -674,6 +674,53 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()  (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   if (current->active_mm == mm) {
+   WARN_ON_ONCE(current->mm);
+   current->active_mm = &init_mm;
+   switch_mm(mm, &init_mm, current);
+   }
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+   /*
+* IPI overheads have not been found to be expensive, but they could
+* be reduced in a number of possible ways, for example (in
+* roughly increasing order of complexity):
+* - A batch of mms requiring IPIs could be gathered and freed
+*   at once.
+* - CPUs could store their active mm somewhere that can be
+*   remotely checked without a lock, to filter out
+*   false-positives in the cpumask.
+* - After mm_users or mm_count reaches zero, switching away
+*   from the mm could clear mm_cpumask to reduce some IPIs
+*   (some batching or delaying would help).
+* - A delayed freeing and RCU-like quiescing sequence based on
+*   mm switching to avoid IPIs completely.
+*/
+   on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 
1);
+   if (IS_ENABLED(CONFIG_DEBUG_VM))
+   on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+   } else {
+   /*
+* In this case, lazy tlb mms are refcounted and would not reach
+* __mmdrop until all CPUs have switched away and mmdrop()ed.
+*/
+   }
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -683,6 +730,10 @@ void __mmdrop(struct mm_struct *mm)
 {
BUG_ON(mm == &init_mm);
WARN_ON_ONCE(mm == current->mm);
+
+   /* Ensure no CPUs are using this as their lazy tlb mm */
+   shoot_lazy_tlbs(mm);
+
WARN_ON_ONCE(mm == current->active_mm);
mm_free_pgd(mm);
destroy_context(mm);
-- 
2.23.0



[PATCH v4 4/4] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

2021-06-04 Thread Nicholas Piggin
On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 088dd2afcfe4..8a092eedc692 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -252,6 +252,7 @@ config PPC
select IRQ_FORCED_THREADING
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
+   select MMU_LAZY_TLB_SHOOTDOWN   if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE   if PPC64 || NOT_COHERENT_CACHE
select NEED_SG_DMA_LENGTH
-- 
2.23.0



Re: [PATCH] powerpc: Remove klimit

2021-06-04 Thread Kefeng Wang



On 2021/6/4 17:57, Christophe Leroy wrote:

klimit is a global variable initialised at build time with the
value of _end.

This variable is never modified, so _end symbol can be used directly.

Remove klimit.

Reviewed-by: Kefeng Wang 


Signed-off-by: Christophe Leroy 
Cc: Kefeng Wang 
---
  arch/powerpc/include/asm/setup.h | 1 -
  arch/powerpc/kernel/head_book3s_32.S | 6 ++
  arch/powerpc/kernel/prom.c   | 2 +-
  arch/powerpc/kernel/setup-common.c   | 4 +---
  arch/powerpc/platforms/powermac/bootx_init.c | 2 +-
  5 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/setup.h b/arch/powerpc/include/asm/setup.h
index e89bfebd4e00..6c1a7d217d1a 100644
--- a/arch/powerpc/include/asm/setup.h
+++ b/arch/powerpc/include/asm/setup.h
@@ -10,7 +10,6 @@ extern void ppc_printk_progress(char *s, unsigned short hex);
  extern unsigned int rtas_data;
  extern unsigned long long memory_limit;
  extern bool init_mem_is_free;
-extern unsigned long klimit;
  extern void *zalloc_maybe_bootmem(size_t size, gfp_t mask);
  
  struct device_node;

diff --git a/arch/powerpc/kernel/head_book3s_32.S 
b/arch/powerpc/kernel/head_book3s_32.S
index 326262030279..b724e88bcdaf 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -766,9 +766,6 @@ PerformanceMonitor:
   * the kernel image to physical address PHYSICAL_START.
   */
  relocate_kernel:
-   addis   r9,r26,klimit@ha/* fetch klimit */
-   lwz r25,klimit@l(r9)
-   addis   r25,r25,-KERNELBASE@h
lis r3,PHYSICAL_START@h /* Destination base address */
li  r6,0/* Destination offset */
li  r5,0x4000   /* # bytes of memory to copy */
@@ -776,7 +773,8 @@ relocate_kernel:
addir0,r3,4f@l  /* jump to the address of 4f */
mtctr   r0  /* in copy and do the rest. */
bctr/* jump to the copy */
-4: mr  r5,r25
+4: lis r5,_end-KERNELBASE@h
+   ori r5,r5,_end-KERNELBASE@l
bl  copy_and_flush  /* copy the rest */
b   turn_on_mmu
  
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c

index fbe9deebc8e1..f620e04dc9bf 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -758,7 +758,7 @@ void __init early_init_devtree(void *params)
first_memblock_size = min_t(u64, first_memblock_size, 
memory_limit);
setup_initial_memory_limit(memstart_addr, first_memblock_size);
/* Reserve MEMBLOCK regions used by kernel, initrd, dt, etc... */
-   memblock_reserve(PHYSICAL_START, __pa(klimit) - PHYSICAL_START);
+   memblock_reserve(PHYSICAL_START, __pa(_end) - PHYSICAL_START);
/* If relocatable, reserve first 32k for interrupt vectors etc. */
if (PHYSICAL_START > MEMORY_START)
memblock_reserve(MEMORY_START, 0x8000);
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 74a98fff2c2f..138bb7f49ef9 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -91,8 +91,6 @@ EXPORT_SYMBOL_GPL(boot_cpuid);
  int dcache_bsize;
  int icache_bsize;
  
-unsigned long klimit = (unsigned long) _end;

-
  /*
   * This still seems to be needed... -- paulus
   */
@@ -930,7 +928,7 @@ void __init setup_arch(char **cmdline_p)
init_mm.start_code = (unsigned long)_stext;
init_mm.end_code = (unsigned long) _etext;
init_mm.end_data = (unsigned long) _edata;
-   init_mm.brk = klimit;
+   init_mm.brk = (unsigned long)_end;
  
  	mm_iommu_init(&init_mm);

irqstack_early_init();
diff --git a/arch/powerpc/platforms/powermac/bootx_init.c 
b/arch/powerpc/platforms/powermac/bootx_init.c
index 9d4ecd292255..d20ef35e6d9d 100644
--- a/arch/powerpc/platforms/powermac/bootx_init.c
+++ b/arch/powerpc/platforms/powermac/bootx_init.c
@@ -433,7 +433,7 @@ static void __init btext_welcome(boot_infos_t *bi)
bootx_printf("\nframe buffer at  : 0x%x", bi->dispDeviceBase);
bootx_printf(" (phys), 0x%x", bi->logicalDisplayBase);
bootx_printf(" (log)");
-   bootx_printf("\nklimit   : 0x%x",(unsigned long)klimit);
+   bootx_printf("\nklimit   : 0x%x",(unsigned long)_end);
bootx_printf("\nboot_info at : 0x%x", bi);
__asm__ __volatile__ ("mfmsr %0" : "=r" (flags));
bootx_printf("\nMSR  : 0x%x", flags);


Re: [PATCH v3 0/4] shoot lazy tlbs

2021-06-04 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of June 5, 2021 10:26 am:
> Excerpts from Nicholas Piggin's message of June 5, 2021 10:17 am:
>> Excerpts from Andy Lutomirski's message of June 5, 2021 3:05 am:
>>> On 6/4/21 9:54 AM, Andy Lutomirski wrote:
 On 5/31/21 11:22 PM, Nicholas Piggin wrote:
> There haven't been objections to the series since last posting; this
> is just a rebase that tidies up a few comments, with minor patch rearranging.
>
 
 I continue to object to having too many modes.  I like my more generic
 improvements better.  Let me try to find some time to email again.
 
>>> 
>>> Specifically, this:
>>> 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/mm
>> 
>> That's worse than what powerpc does with the shoot lazies code so 
>> we wouldn't use it anyway.
>> 
>> The fact is mm-cpumask and lazy mm is very architecture specific, so I 
>> don't really see that another "mode" is such a problem, it's for the 
>> most part "this is what powerpc does" -> "this is what powerpc does".
>> The only mode in the context switch is just "take a ref on the lazy mm"
>> or "don't take a ref". Surely that's not too onerous to add!?
>> 
>> Actually the bigger part of it is actually the no-lazy mmu mode which
>> is not yet used, I thought it was a neat little demonstrator of how code
>> works with/without lazy but I will get rid of that for submission.
> 
> I admit that does add a bit more churn than necessary; maybe that was
> the main objection.
> 
> Here is the entire kernel/sched/core.c change after that is removed.
> Pretty simple now. I'll resubmit.

If it gives you some concerns about a great complex new mode, I'll
put it a different way -- all this allows is the arch to say that it
does not use lazy tlb mms beyond their refcounted lifetime, so there
is no need to refcount the lazy tlb reference.

That's all it is. One implementation of that is shoot lazies, and that
could be done entirely in arch/powerpc via destroy_context (I just put 
it in mm/ in case it is useful to others, but that's no real 
difference).

So you see it's really just about management of lazies, the refcounting
is just a bit on the side. And lazy management is highly arch specific,
x86 being one of the really different complex ones there including
very complex and unique interactions with membarrier ordering, so that
can't be a fair objection.

Thanks,
Nick



Re: [PATCH v4 04/16] powerpc/vas: Create take/drop pid and mm references

2021-06-04 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of June 5, 2021 10:31 am:
> Excerpts from Haren Myneni's message of June 4, 2021 2:08 pm:
>> On Thu, 2021-06-03 at 14:21 +1000, Nicholas Piggin wrote:
>>> Excerpts from Haren Myneni's message of May 21, 2021 7:31 pm:
>>> > Take pid and mm references when each window opens and drops during
>>> > close. This functionality is needed for powerNV and pseries. So
>>> > this patch defines the existing code as functions in common book3s
>>> > platform vas-api.c
>>> > 
>>> > Signed-off-by: Haren Myneni 
>>> 
>>> Seems like a good idea to put these into their own helper functions.
>>> 
>>> > ---
>>> >  arch/powerpc/include/asm/vas.h  | 25 +
>>> >  arch/powerpc/platforms/book3s/vas-api.c | 51
>>> > ++
>>> >  arch/powerpc/platforms/powernv/vas-fault.c  | 10 ++--
>>> >  arch/powerpc/platforms/powernv/vas-window.c | 57 ++---
>>> > 
>>> >  arch/powerpc/platforms/powernv/vas.h|  6 +--
>>> >  5 files changed, 88 insertions(+), 61 deletions(-)
>>> > 
>>> > diff --git a/arch/powerpc/include/asm/vas.h
>>> > b/arch/powerpc/include/asm/vas.h
>>> > index 668303198772..3f2b02461a76 100644
>>> > --- a/arch/powerpc/include/asm/vas.h
>>> > +++ b/arch/powerpc/include/asm/vas.h
>>> > @@ -5,6 +5,9 @@
>>> >  
>>> >  #ifndef _ASM_POWERPC_VAS_H
>>> >  #define _ASM_POWERPC_VAS_H
>>> > +#include 
>>> > +#include 
>>> > +#include 
>>> >  #include 
>>> >  
>>> >  struct vas_window;
>>> > @@ -49,6 +52,17 @@ enum vas_cop_type {
>>> >   VAS_COP_TYPE_MAX,
>>> >  };
>>> >  
>>> > +/*
>>> > + * User space VAS windows are opened by tasks and take references
>>> > + * to pid and mm until windows are closed.
>>> > + * Stores pid, mm, and tgid for each window.
>>> > + */
>>> > +struct vas_user_win_ref {
>>> > + struct pid *pid;/* PID of owner */
>>> > + struct pid *tgid;   /* Thread group ID of owner */
>>> > + struct mm_struct *mm;   /* Linux process mm_struct */
>>> > +};
>>> > +
>>> >  /*
>>> >   * User space window operations used for powernv and powerVM
>>> >   */
>>> > @@ -59,6 +73,16 @@ struct vas_user_win_ops {
>>> >   int (*close_win)(void *);
>>> >  };
>>> >  
>>> > +static inline void vas_drop_reference_pid_mm(struct
>>> > vas_user_win_ref *ref)
>>> > +{
>>> > + /* Drop references to pid and mm */
>>> > + put_pid(ref->pid);
>>> > + if (ref->mm) {
>>> > + mm_context_remove_vas_window(ref->mm);
>>> > + mmdrop(ref->mm);
>>> > + }
>>> > +}
>>> 
>>> You don't have to make up a new name for such a thing because you 
>>> already have one
>>> 
>>> put_vas_user_win_ref(struct vas_user_win_ref *ref)
>>> 
>>> 
>>> > +
>>> >  /*
>>> >   * Receive window attributes specified by the (in-kernel) owner of
>>> > window.
>>> >   */
>>> > @@ -192,4 +216,5 @@ int vas_register_coproc_api(struct module *mod,
>>> > enum vas_cop_type cop_type,
>>> >   struct vas_user_win_ops *vops);
>>> >  void vas_unregister_coproc_api(void);
>>> >  
>>> > +int vas_reference_pid_mm(struct vas_user_win_ref *task_ref);
>>> >  #endif /* __ASM_POWERPC_VAS_H */
>>> > diff --git a/arch/powerpc/platforms/book3s/vas-api.c
>>> > b/arch/powerpc/platforms/book3s/vas-api.c
>>> > index 6c39320bfb9b..a0141bfb2e4b 100644
>>> > --- a/arch/powerpc/platforms/book3s/vas-api.c
>>> > +++ b/arch/powerpc/platforms/book3s/vas-api.c
>>> > @@ -55,6 +55,57 @@ static char *coproc_devnode(struct device *dev,
>>> > umode_t *mode)
>>> >   return kasprintf(GFP_KERNEL, "crypto/%s", dev_name(dev));
>>> >  }
>>> >  
>>> > +/*
>>> > + * Take reference to pid and mm
>>> > + */
>>> > +int vas_reference_pid_mm(struct vas_user_win_ref *task_ref)
>>> > +{
>>> 
>>> So this is quite different from a typical refcount object, in that
>>> it's opening it for access as well. I would split it into two
>>> functions: one matching put_vas_user_win_ref(), appearing in the same
>>> place in the code, which covers everything up to about mmput; and
>>> another function that adds the window and does the CP_ABORT
>>> etc... hmm, where do you release tgid?
>> 
>> Basically I copied the existing code into these functions
>> (vas_reference_pid_mm/vas_drop_reference_pid_mm) so that they are
>> useful for both platforms.
>> 
>> mm_context_add/remove_vas_window() is also like taking a reference. So
>> instead of adding 2 separate functions, how about naming them
>> get/put_vas_user_win_ref()?
> 
> It's actually different though. What I'm asking is the parts where you 
> interact with core kernel data structure refcounts go into their own 
> get/put functions.
> 
> Someone who understands that refcounting and looks at the code will care 
> about those bits, so having them all together I think is helpful. They 
> don't know about adding vas windows or CP_ABORT.
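A rough sketch of the split being suggested, for concreteness. This is not from the patch: the get_ helper name, the error handling, and the mm_context_add_vas_window() counterpart are assumptions; only the struct fields and the put-side calls come from the diff above. The put half also drops tgid, since the current vas_drop_reference_pid_mm() does not.

```
/* Pure refcount half: touches only core kernel refcounts (pid, tgid, mm). */
static int get_vas_user_win_ref(struct vas_user_win_ref *task_ref)
{
	task_ref->pid = get_task_pid(current, PIDTYPE_PID);
	task_ref->tgid = get_pid(task_tgid(current));
	task_ref->mm = get_task_mm(current);
	if (!task_ref->mm) {
		put_pid(task_ref->pid);
		put_pid(task_ref->tgid);
		return -EPERM;
	}
	mmgrab(task_ref->mm);	/* pin mm_struct beyond the mm_users ref */
	mmput(task_ref->mm);	/* drop the mm_users ref from get_task_mm() */
	return 0;
}

static void put_vas_user_win_ref(struct vas_user_win_ref *task_ref)
{
	put_pid(task_ref->pid);
	put_pid(task_ref->tgid);	/* released here, unlike the patch */
	if (task_ref->mm)
		mmdrop(task_ref->mm);
}

/*
 * VAS-specific half: window bookkeeping (and CP_ABORT on the powernv
 * side) stays out of the generic get/put pair.  Callers would pair
 * mm_context_add_vas_window() here with mm_context_remove_vas_window()
 * before put_vas_user_win_ref() on teardown.
 */
static void vas_user_win_add(struct vas_user_win_ref *task_ref)
{
	mm_context_add_vas_window(task_ref->mm);
}
```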
> 
>> Regarding tgid, the reference is taken only on pid, not on tgid.
>> pid reuse can happen only in the case of multithreaded applications when
>> the child that opened the VAS window exits. But these windows will be
>> closed when the tgid exits. So do not need tgi